THE DATA STORAGE EVOLUTION
Has disk capacity outgrown its usefulness?
by Ron Yellin
DISK STORAGE CAPACITIES DOUBLE every 12 to 18 months as density rapidly increases, and the total amount of available storage grows exponentially with each increase in disk capacity (figure 1). Users looking to decrease their storage costs typically view higher-capacity drives as a positive industry trend; however, it is critical to understand the performance implications of rapidly increasing disk capacity.
Like capacity, disk performance is increasing,
but at a much slower rate. While capacity has almost doubled every
12 months, disk-seek times (time to position the disk head on
the required track) only decreased by approximately 12% during
that time. Disk transfer rates (the speed at which data is moved
from or to the disk) have increased by approximately 40% over
the same period. As disk capacity evolved from 2GB to 73GB (a
36-fold increase), average I/O access times (seek time, latency
while the disk revolves, and transfer time) have decreased from
13.25 milliseconds to 6.5 milliseconds, just a two-fold improvement
(figure 2).
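The gap between the two growth rates can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the figures quoted above; the function and variable names are illustrative and do not come from any Teradata tool.

```python
# Figures quoted in the text above; figure 2 itself is not reproduced here.
OLD_CAPACITY_GB, NEW_CAPACITY_GB = 2, 73
OLD_ACCESS_MS, NEW_ACCESS_MS = 13.25, 6.5


def improvement(old, new, smaller_is_better=False):
    """Return how many times better `new` is than `old`."""
    return old / new if smaller_is_better else new / old


capacity_gain = improvement(OLD_CAPACITY_GB, NEW_CAPACITY_GB)
access_gain = improvement(OLD_ACCESS_MS, NEW_ACCESS_MS, smaller_is_better=True)

print(f"capacity gain:    {capacity_gain:.1f}x")
print(f"access-time gain: {access_gain:.2f}x")
# Ratio of the two gains: how far capacity has pulled ahead of access speed.
print(f"capacity gain / access-time gain: {capacity_gain / access_gain:.1f}")
```

The capacity gain works out to 36.5, which the text rounds to 36-fold, while the access-time gain stays close to two.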
In the past six years, Teradata has seen improvements
in disk capacity, disk array bandwidth (throughput in MB/sec.),
bandwidth per disk and bandwidth per GB of storage capacity.
Disk capacities in the Teradata environment
have increased approximately eight-fold. The disk array bandwidth
has seen greater than a five-fold improvement during this same
period.
With capacity growing more quickly than disk
bandwidth, the bandwidth per GB of storage capacity has actually
decreased by 50% (figure 3).
Given this trend, it is challenging to take
advantage of abundant storage capacity while maintaining required
performance levels.
Managing the workload
As capacity and performance increase at different rates, the actual
impact on performance depends on the workload and the nature of
the data being transferred. In a data warehousing environment,
I/O bandwidth is the critical performance metric, whereas I/Os per second
may be more relevant in an operational system or OLTP environment.
In the Teradata environment, a typical workload
generates an extremely random disk access pattern. I/O sizes vary,
with an average of approximately 48 kilobytes. Cache in a disk
array is not effective since any data likely to be reused is already
cached within the memory of a Teradata compute node, which eliminates
the need to issue that I/O in the first place.
Given this workload, disk performance is
the limiting factor in the Teradata environment. Here, a 15,000-rpm
disk can generate approximately 5.5 MB/sec. of bandwidth. Based
on the I/O bandwidth requirement of each generation of Teradata
compute node, the number of disks required can be easily calculated.
As disk capacities increase, businesses could
reduce the number of disks they use, but with a finite amount
of I/O bandwidth available per disk, they often find it necessary
to configure more disks for performance than are required for
capacity.
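The sizing logic described above can be sketched as follows. The 5.5 MB/sec. per-disk figure comes from the text; the node bandwidth requirement, data volume and disk size in the example are hypothetical values chosen only to illustrate the calculation, and the sketch ignores RAID overhead.

```python
import math

DISK_BANDWIDTH_MB_S = 5.5  # per 15,000-rpm disk, as quoted in the article


def disks_for_bandwidth(required_mb_s, per_disk_mb_s=DISK_BANDWIDTH_MB_S):
    """Disks needed to deliver the required I/O bandwidth."""
    return math.ceil(required_mb_s / per_disk_mb_s)


def disks_for_capacity(data_gb, disk_gb):
    """Disks needed merely to hold the data (RAID overhead ignored)."""
    return math.ceil(data_gb / disk_gb)


def disks_required(required_mb_s, data_gb, disk_gb):
    """The configuration must satisfy both constraints, so take the larger."""
    return max(disks_for_bandwidth(required_mb_s),
               disks_for_capacity(data_gb, disk_gb))


# Hypothetical node: needs 220 MB/sec. and stores 1,000 GB on 73GB disks.
bandwidth_disks = disks_for_bandwidth(220)
capacity_disks = disks_for_capacity(1000, 73)
total = disks_required(220, 1000, 73)
excess_gb = total * 73 - 1000

print(f"disks for bandwidth: {bandwidth_disks}")
print(f"disks for capacity:  {capacity_disks}")
print(f"disks configured:    {total}, leaving {excess_gb} GB unused")
```

In this assumed case the bandwidth requirement, not the data volume, dictates the disk count, which is the situation the article describes.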
The mismatch between performance and capacity trends
is a problem not only in data warehousing environments but
also in transactional system environments (e.g., OLTP),
which are more sensitive to the number of I/Os per second that
can be performed. Regardless of capacity, a 15,000-rpm disk is
capable of performing approximately 154 small-block (512-byte)
random I/Os per second (based on a 6.5 millisecond access time
as shown in figure 2).
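The 154 figure follows directly from the access time: if each random I/O occupies the disk for the full access time, the disk completes 1,000 divided by that time (in milliseconds) each second. A minimal sketch, with an illustrative function name:

```python
def random_ios_per_second(access_time_ms):
    """I/Os per second when each random I/O takes one full access time."""
    return 1000.0 / access_time_ms


current = random_ios_per_second(6.5)     # access time quoted for today's disk
earlier = random_ios_per_second(13.25)   # access time quoted for the 2GB era

print(f"6.5 ms access time:   about {round(current)} I/Os per second")
print(f"13.25 ms access time: about {round(earlier)} I/Os per second")
```

This simple model assumes one outstanding request at a time and no help from caching or command queuing.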
As disk capacity doubles, disk performance
(I/Os per second) essentially remains static. As a result, the
same number of I/Os per second must now service twice as much
data. This is leading a number of OLTP vendors
to recommend that some of the capacity on the larger disks not
be used.
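One way to see why vendors make that recommendation is to express performance as I/Os per second per GB. The sketch below holds the disk at the 154 I/Os per second quoted earlier; the 36GB and 146GB disk sizes are hypothetical comparison points added for illustration, as are the function names.

```python
IOS_PER_SECOND = 154  # per disk, as quoted in the article


def ios_per_second_per_gb(capacity_gb, ios=IOS_PER_SECOND):
    """Access density: I/Os per second available to each GB stored."""
    return ios / capacity_gb


def usable_gb_for_density(target_density, ios=IOS_PER_SECOND):
    """Capacity that can be filled while still meeting a target density."""
    return ios / target_density


for capacity_gb in (36, 73, 146):
    density = ios_per_second_per_gb(capacity_gb)
    print(f"{capacity_gb:>3}GB disk: {density:.2f} I/Os per second per GB")

# To keep the access density of a full 73GB disk on a 146GB disk,
# only part of the larger disk can be used.
target = ios_per_second_per_gb(73)
usable = usable_gb_for_density(target)
print(f"usable portion of a 146GB disk: {usable:.0f} GB ({usable / 146:.0%})")
```

Each doubling of capacity halves the access density, so preserving the old density means leaving half of the new disk idle.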
Balancing the technology
Three methods address these disk trends. First, you can do everything
possible to extract maximum performance from a disk. Second, you
can use advanced database technologies to reduce I/O demand. Third,
and inevitably, you can identify good uses for the onslaught of
disk capacity. Let’s examine each of these methods.
1) How does Teradata extract the maximum performance possible from a disk?
Teradata uses high-performance disk array controllers and storage
interconnects to deliver the disk’s performance capabilities
without intervening bottlenecks.
Teradata recommends using RAID-1 instead of
RAID-5 redundancy. RAID-5 has a high I/O overhead for updates
and writes of intermediate tables. Furthermore, RAID-5 exacerbates
the capacity problem.
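The RAID trade-off can be illustrated with the common textbook model of small random writes, which is an assumption here rather than a Teradata measurement: a mirrored write touches both copies, while a RAID-5 write reads the old data and parity and then writes the new data and parity. The four-disk RAID-5 group is a hypothetical size.

```python
def backend_ios_per_small_write(raid_level):
    """Disk I/Os generated by one small host write (textbook model)."""
    if raid_level == 1:
        return 2  # write the data to both mirrors
    if raid_level == 5:
        return 4  # read old data, read old parity, write data, write parity
    raise ValueError(f"unsupported RAID level: {raid_level}")


def usable_fraction(raid_level, disks_in_group):
    """Share of raw capacity left for data after redundancy."""
    if raid_level == 1:
        return 0.5
    if raid_level == 5:
        return (disks_in_group - 1) / disks_in_group
    raise ValueError(f"unsupported RAID level: {raid_level}")


for level, group in ((1, 2), (5, 4)):
    print(f"RAID-{level}: {backend_ios_per_small_write(level)} disk I/Os per write, "
          f"{usable_fraction(level, group):.0%} of raw capacity usable")
```

Under this model RAID-5 spends more I/Os on each update while exposing more usable capacity per spindle, which is one reading of why it worsens the imbalance between capacity and performance.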
Teradata uses high-performance enterprise class
disks, with high rpm, low seek times and fast transfer rates.
Avoid performance-wasting, high-capacity, low-rpm, non-enterprise
PC-class SATA disks that require extra revolutions to acquire
data in the face of normal vibration.
Teradata tracks external and internal form
factor reductions in enterprise-class disks so that the maximum
number of spindles and disk actuators serves a given amount of
disk capacity.
Where appropriate, Teradata uses larger transfer
sizes, via large block sizes or Teradata Database Cylinder Read
to access maximum data with minimum overhead.
2) How does Teradata use advanced database technologies to reduce I/O demand?
The partitioned primary index (PPI) is a Teradata Database V2R5
feature that organizes data such that examining only a small portion
of data can satisfy many queries. When applied to range query
workloads, Teradata customers have achieved a 10-fold to 100-fold
reduction in I/O per query.
Secondary indexes in general are an excellent
way to trade ample capacity for reduction in I/O to base tables.
The Teradata Index Wizard can identify opportunities for the use
of indexing.
The Teradata Database V2R5 sparse join index
feature is not only highly efficient on capacity, but it can also
be very effective in reducing I/O.
Teradata Tools and Utilities, such as Teradata
Dynamic Query Manager, Teradata Priority Scheduler and Teradata
Manager, manage the workload and, by extension, the system resources.
These capabilities can identify waste and inefficiency, tuning
opportunities, and workloads that can be shifted to low-demand
periods.
3) How can you mitigate the inevitable onslaught of capacity?
Despite our best efforts to extract maximum disk performance and
reduce I/O demand, the disk capacity wave is unstoppable. Shall
we throw in the towel like OLTP vendors and suggest that customers
use only a fraction of each disk drive? Absolutely not! There
is one more solution: Find good, low-access uses for the excess
capacity.
Tempering the trend
Temperature is used to represent frequency of data access—the
more frequently accessed the data, the hotter its temperature.
Likewise, as access frequency decreases, so does the temperature.
The Teradata Multi-Temperature Warehouse describes an implementation
where the Teradata Database manages both the active (hot) data
and the inactive (cool) data, which traditionally has not been
kept in the data warehouse. By sharing and managing the storage
across temperature ranges, the active data can maintain its required
performance levels while rarely accessed data becomes available
to users.
To combat the capacity-versus-performance issue,
the Teradata Multi-Temperature Warehouse allows you to direct
the performance of the disk at the hot data while using the remaining
capacity to store and access cool data. (See figure 4.)
Industry-specific requirements, government
regulations (e.g., Sarbanes-Oxley Act) and unique business needs
generate an endless supply of cool data that can be introduced
into the data warehouse.
As data ages, its temperature typically cools.
Some companies already maintain some historic data within the
data warehouse, but many find a strategic value in increasing
the amount of history they maintain. Where six to 18 months’
worth of history might have been typical for some customers, many
now find value in keeping several years’ worth of historical
data online.
Another possible use for excess capacity is
to trade it for higher system and data availability. The Teradata
Fallback feature duplicates Teradata objects on an independent
hardware domain within the system to maximize single-system availability.
The Fallback copy receives only updates, whereas the primary copy
receives both updates and reads. So, by definition, the Fallback
copy is cooler in nature. Fallback systems can tolerate a large
set of unexpected catastrophic failure scenarios. To learn more
about Fallback, download the white paper “Single System
Availability Features” from the Library section of teradata.com.
Large objects supported in Teradata Database
V2R5.1 offer yet another potential source of cool data because
Teradata can support up to 2GB character or binary objects for
applications requiring data types such as video, picture and audio.
Using excess capacity
High-capacity disks, Teradata Priority Scheduler, Teradata Dynamic
Query Manager, PPI and multi-value compression are a few of the
technologies that enable you to manage and access multi-temperature
data.
High-capacity disks provide a cost-effective
means of storing a mix of hot and cool data. Since higher-capacity
disks offer a lower overall storage cost per MB, their higher
initial cost is negligible.
Teradata Dynamic Query Manager can control
query issuance. For example, users accessing cool data may be
limited to a small number of queries executing at one
time (known as low query concurrency), thus preserving resources
for users accessing hot data. Other controls can be implemented
based on user role, time of day or query cost.
Teradata Priority Scheduler allocates system
resources among the various workload constituents once queries
are issued. Low-priority, full-table scans of all cool data may
occur in the background, allowing resources to focus on the active
data warehouse queries against hot data.
PPI allows a hash-distributed table to be physically
partitioned for storage on each virtual AMP according to the partitioning
columns. In practice, customers will use time as the partitioning
key and store both lightly accessed historic data and more heavily
accessed recent data together on high capacity disks. PPI allows
the resources to be focused on more recent data. When a customer
query is issued with a date-range specification, the Teradata
Database Optimizer eliminates the need to scan partitions known
not to contain the date range desired by the query.
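The effect of partition elimination can be shown with a toy model. This is a simplified sketch of the idea, not the Teradata Database implementation: rows are grouped by month, and a date-range query skips every partition that cannot contain the requested dates. The row layout and function names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def partition_key(d):
    """Partition by calendar month, standing in for a time-based key."""
    return (d.year, d.month)


def build_partitions(rows):
    """Group rows into partitions according to their sale date."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row["sale_date"])].append(row)
    return partitions


def query_date_range(partitions, start, end):
    """Return matching rows and the number of partitions actually scanned."""
    first, last = partition_key(start), partition_key(end)
    scanned, matches = 0, []
    for key, rows in partitions.items():
        if key < first or key > last:
            continue  # eliminated: cannot contain the requested dates
        scanned += 1
        matches.extend(r for r in rows if start <= r["sale_date"] <= end)
    return matches, scanned


# Two years of history, one row per month (hypothetical data).
rows = [{"sale_date": date(2003 + m // 12, m % 12 + 1, 15), "amount": 10 * (m + 1)}
        for m in range(24)]
partitions = build_partitions(rows)
matches, scanned = query_date_range(partitions, date(2004, 11, 1), date(2004, 12, 31))
print(f"scanned {scanned} of {len(partitions)} partitions, found {len(matches)} rows")
```

A query against the most recent two months reads only two of the 24 partitions, leaving the older, cooler history untouched on disk.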
Multi-value compression, like higher capacity
disks, is a cost-effective way to add more data into the warehouse
by liberating capacity while improving performance. Liberated
capacity can be used for other purposes, such as deeper history
or performance enhancing indexes. Compression can result in smaller
rows and, hence, more rows per data block. Increased rows-per-data-block
causes fewer disk I/Os for scan-oriented query workloads.
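The link between row size and scan I/O can be sketched with simple arithmetic. The block size, row sizes and row count below are hypothetical values chosen for illustration; real savings depend on the data and on which values are compressed.

```python
import math


def rows_per_block(block_bytes, row_bytes):
    """Whole rows that fit in one data block (block overhead ignored)."""
    return block_bytes // row_bytes


def blocks_to_scan(row_count, block_bytes, row_bytes):
    """Data blocks a full scan must read, one disk I/O per block assumed."""
    return math.ceil(row_count / rows_per_block(block_bytes, row_bytes))


BLOCK_BYTES = 64 * 1024   # hypothetical data block size
ROW_COUNT = 1_000_000     # hypothetical table size

uncompressed = blocks_to_scan(ROW_COUNT, BLOCK_BYTES, row_bytes=200)
compressed = blocks_to_scan(ROW_COUNT, BLOCK_BYTES, row_bytes=150)

print(f"blocks scanned without compression: {uncompressed}")
print(f"blocks scanned with compression:    {compressed}")
print(f"reduction in scan I/O: {1 - compressed / uncompressed:.0%}")
```

Shrinking the assumed row from 200 to 150 bytes cuts the number of blocks a full scan must read by roughly a quarter, and the freed blocks become capacity for other uses.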
With disk capacity growing more rapidly than disk performance, it becomes
challenging to configure high-performing data warehousing systems
without requiring excess storage capacity.
The Teradata Multi-Temperature Warehouse addresses
this industry-wide trend. Through this implementation, customers
can configure their systems for high performance while utilizing
the excess capacity to increase business value.
Ron Yellin, director of Storage Product Management
at Teradata, has focused on storage for the past six years, primarily
managing Teradata's external disk storage solutions and planning
future offers. E-mail him at ron.yellin@teradata-ncr.com.
PHOTO BY CHARLIE SWICK