Multi-temperature data warehousing lets businesses utilize capacity from larger disk drives without sacrificing performance.
by Paul A. Barsch
In the 1986 movie Ferris Bueller's Day Off, one of the main characters, Cameron Frye, has a father who keeps his Ferrari 250 GT in the garage
but never uses it. The Ferrari, of course, is a top-of-the-line automobile with enormous performance potential. However, Cameron's father isn't
taking advantage of what he has already paid for. Paying for something and not using it is a losing value proposition.
In a similar vein, with the advent of larger-capacity disk drives, many CIOs are paying for extra data warehouse storage; however, they are
unable to utilize the storage due to potential performance issues. Teradata has a solution to this conundrum that can enable your company to
utilize disk space you have already paid for, without sacrificing performance—multi-temperature data warehousing.
The performance/value conundrum
Many business and technology trends are driving changes in the way data is stored and accessed.
From a technology perspective, for the past several years, enterprise-class drives have doubled in capacity every 12 to 18 months. With future
drive capacities expected to continue growing, data warehousing systems will have far more usable storage than ever before. However, disk I/O
performance is increasing at a much slower rate, making it difficult for servers to adequately utilize this additional capacity without
incurring a performance penalty.
Figure 1: Multi-temperature is a metaphor used to classify data based on frequency of access. The more frequently the data is
accessed, the hotter the data.
On the business side, companies are allowing more users than ever to access the data warehouse so that front-line employees can serve
customers better and make faster decisions. In many instances, front-line employees are accessing current data (for example, the last 90
days of a customer's history) while, simultaneously, back-office users such as marketing, sales and finance managers are accessing longer-term
historical data. Meeting the needs of both operational and strategic business users is driving the shift to active data warehousing.
Businesses are also finding value in storing and accessing historical data in the warehouse to improve decision support and customer service.
In addition, regulatory and legal requirements such as Sarbanes-Oxley, Basel II and the European Union Data Retention Directive are causing
companies to keep years of historical data immediately accessible.
CIOs and IT executives are always looking to do "more with less." Return on investment (ROI) has never been a higher priority, and IT
executives are continually demanding improved price/performance from technology.
Strategies to improve data warehouse price/performance
Because ROI is so important, it might be startling to discover that many data warehouse vendors configure systems to populate only a small portion
of available disk space in order to keep data warehouse performance at optimum levels. In fact, as disk capacities have grown over the years
(from 36GB to 73GB to 146GB drives and beyond), some data warehouse vendors allow just a small percentage of a disk drive (less than 10%) to
be populated for fear of performance degradation.
The need to leave a majority of the disk space available to maintain performance is a data warehouse industry problem. Regardless of the
performance rationale, this unused capacity has been paid for by the customer—thereby creating a losing value proposition.
With the demand to keep more historical data in the data warehouse for better decision making and compliance purposes, a logical choice is to
load up the unused disk space with more data. While logical, this would not be a wise strategy. Why? It's plainly a matter of physics.
Achieving optimum performance from a data warehouse requires a careful balance of processing power (CPU), disk capacity, I/O and memory. When
one of these variables is out of balance, the performance of the entire data warehouse can suffer. Simply adding more historical data to the
warehouse without taking the other variables into account will cause performance degradation.
One strategy to improve data warehouse price/performance might be to add more CPU. However, if the system is "I/O bound," adding more CPU will
result in only a small uptick in performance, as there will not be enough disk drives in a given system to keep the CPUs busy.
Another strategy could be to overconfigure for capacity, thinking that adding more disk capacity and/or using larger but fewer physical disk
drives will save money and provide more value for the limited IT dollar. However, this reduction in I/O is counterproductive when the objective
is to keep a balance between CPU, I/O, memory and capacity.
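The effect of using fewer, larger drives can be illustrated with a rough calculation. The sketch below is not a Teradata sizing method. It assumes, purely for illustration, that every drive delivers about 150 random I/O operations per second regardless of its capacity, and that the warehouse needs 10,000GB of storage. The drive sizes are the ones mentioned earlier in this article; the function names are hypothetical.

```python
import math

# Assumption for illustration only: each drive delivers roughly the same
# number of random I/O operations per second, whatever its capacity.
ASSUMED_IOPS_PER_DRIVE = 150
TARGET_CAPACITY_GB = 10_000  # hypothetical capacity requirement


def drives_needed(drive_gb, target_gb=TARGET_CAPACITY_GB):
    """Smallest number of drives whose combined capacity meets the target."""
    return math.ceil(target_gb / drive_gb)


def aggregate_iops(drive_gb, target_gb=TARGET_CAPACITY_GB):
    """Total I/O rate available from the drives needed to meet the target."""
    return drives_needed(drive_gb, target_gb) * ASSUMED_IOPS_PER_DRIVE


# Compare the drive capacities mentioned in the article.
for size in (36, 73, 146):
    count = drives_needed(size)
    print(f"{size}GB drives: {count} drives, {aggregate_iops(size)} IOPS in total")
```

Under these assumptions, the same capacity built from 146GB drives has roughly a quarter of the aggregate I/O available from 36GB drives, which is why capacity cannot be considered separately from CPU, I/O and memory.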
The wisest approach is a "balanced configuration," which offers the best data warehouse performance based on the efficient utilization of
processing power, I/O, memory and storage capacity. A balanced configuration provides these resources in the right proportions, so that no
single one becomes the bottleneck for the rest of the system.
But even with a balanced configuration approach, there isn't a way to utilize unused disk capacity. Balancing storage and performance
requirements is critical to gaining more value from your data warehouse and is accomplished through workload management and a concept
called multi-temperature data warehousing.
Figure 2: Teradata's unique workload management capabilities help maintain a lower total cost of ownership by dynamically managing
the workload on the system as opposed to managing the data.
Multi-temperature data warehousing
As indicated earlier, with both strategic and operational users accessing data, not all data in the warehouse will have the same access
patterns, nor will all queries have the same level of priority.
Multi-temperature data warehousing is an implementation concept where data is classified by frequency of access, query access rates, frequency
of updates and data maintenance.
The word "temperature" is a metaphor used to represent the frequency of data access—the more frequently data is accessed, the hotter its
temperature. (See figure 1, above.) Operational queries will typically confine themselves to a subset of hot data. Strategic queries
access more data, and the temperature of that data is usually hot or warm. Strategic analysis on deep history requires data that may span cold
and/or dormant data, warm data and hot data.
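The classification idea can be sketched in a few lines of code. The article defines temperature by frequency of access but gives no numeric cut-offs, so the thresholds, the function name and the sample partitions below are all hypothetical and chosen only to make the example concrete.

```python
# Hypothetical thresholds (accesses per month); the article gives no numbers.
HOT_MIN_ACCESSES = 100
WARM_MIN_ACCESSES = 10
COLD_MIN_ACCESSES = 1


def classify_temperature(accesses_per_month):
    """Map an access frequency to a temperature label."""
    if accesses_per_month >= HOT_MIN_ACCESSES:
        return "hot"
    if accesses_per_month >= WARM_MIN_ACCESSES:
        return "warm"
    if accesses_per_month >= COLD_MIN_ACCESSES:
        return "cold"
    return "dormant"


# Hypothetical partitions of customer history and their monthly access counts.
partitions = {
    "last_90_days": 500,
    "last_12_months": 25,
    "older_than_2_years": 2,
    "older_than_7_years": 0,
}

for name, accesses in partitions.items():
    print(f"{name}: {classify_temperature(accesses)}")
```

In practice the classification would also weigh query access rates, update frequency and data maintenance, as described above; this sketch considers access frequency alone.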
The multi-temperature data warehouse has the ability to prioritize the use of system resources based on business rules while maximizing the
utilization of storage capacity. This prioritization is accomplished through workload management.
Teradata's workload management tools, techniques and utilities help control and constrain the workloads so that service level agreements across
user communities are met, with the added benefit of allowing utilization of larger disk capacities. This is clearly the best strategy for CIOs
who want the ultimate price/performance from their IT investment.
Teradata workload management in a multi-temperature environment
Workload management provides users with the ability to perform long-running strategic queries and short operational queries in the same data
warehouse environment. Gartner Dataquest analysts Mark Beyer and Donald Feinberg highlight the benefits of workload management in the recent
research report "Magic Quadrant for Data Warehouse Database Management Systems 2006" (September 2006). According to the report, "During the
next three years, mixed workload performance will become the single most important performance issue in data warehousing."
The Teradata Database is well-known for its dynamic workload management capabilities. Workload management is configured based upon
customer-specific requirements for accessing data, which is especially crucial in a multi-temperature environment. To illustrate, let's
walk through three different departments within a company (the call center, marketing and legal) to see how each accesses data and entrusts its
workload management to Teradata. (See figure 2, above.)
In the call center, agents are given priority access to hot data because customers expect their questions to be answered quickly.
Access to warmer historical data is also necessary: a call center agent needs both hot and warm data
when a customer asks, "Which items has my family ordered over the past 12 months?"
While the call center query is being processed, a marketing manager is running a sales trend analysis for the past 12
months, thus requiring access to the same warm historical data. However, due to pre-defined business rules, the marketing strategic query
would not have the same priority designation as the operational query from the call center agent.
Workload management, represented in figure 2, places a "throttle" on the marketing queries, delaying them, in this example, until there are
10 or fewer concurrent marketing queries running. This enables the call center query, which requires access to warmer data, to be prioritized
accordingly.
Concurrent with the queries of the call center agent and the marketing manager, a compliance query arrives that needs to access years'
worth of truly cold data. Such a request probably doesn't need to be performed in the same timeframe as the hot call center or warm
marketing queries. Thus, in this example, the throttle could be set so any compliance queries beyond the first two are placed in a
lower-priority delay queue.
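The throttling behavior in this example can be sketched as a simple concurrency limit with a delay queue. This is an illustration of the concept only, not Teradata's implementation or API; the class and method names are hypothetical, and the limits of 10 and 2 are taken from the example above.

```python
from collections import deque


class Throttle:
    """Limit how many queries of one workload run at once; delay the rest."""

    def __init__(self, limit):
        self.limit = limit
        self.running = set()
        self.delayed = deque()

    def submit(self, query_id):
        """Start the query if a slot is free, otherwise put it in the delay queue."""
        if len(self.running) < self.limit:
            self.running.add(query_id)
            return "running"
        self.delayed.append(query_id)
        return "delayed"

    def finish(self, query_id):
        """Mark a running query as complete and release the oldest delayed query."""
        self.running.remove(query_id)
        if self.delayed:
            released = self.delayed.popleft()
            self.running.add(released)
            return released
        return None


# Limits from the article's example; call center queries are not throttled.
throttles = {"marketing": Throttle(limit=10), "compliance": Throttle(limit=2)}

for n in range(1, 4):
    status = throttles["compliance"].submit(f"compliance-{n}")
    print(f"compliance-{n}: {status}")

# When the first compliance query completes, the delayed one is released.
print("released:", throttles["compliance"].finish("compliance-1"))
```

A fixed limit per workload is the simplest possible rule. The dynamic workload management described in this article goes further by shifting resources while queries are in flight.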
Teradata's unique capabilities offer a reduction in the total cost of ownership by dynamically managing the workload on the system as opposed
to managing the data. Dynamic workload management allows the system to shift resources while queries are in flight as opposed to assigning a
priority weighting only when the query is issued. By sharing and managing storage across temperature ranges, applications accessing hot data
can maintain required performance levels while rarely accessed data is still available to users, but with reduced performance priority.
Better data accessibility with workload management
Workload balancing is addressed through a combination of Teradata technologies, such as:
- Partitioned primary index
- Multi-value compression
- Teradata Priority Scheduler
- Join indexing, including pre-joined, aggregate and sparse
- Teradata Dynamic Query Manager
- Teradata Active System Management
With these tools, multiple employees within various departments of an organization can simultaneously access
the data they need when they need it.
Gaining more value from the data warehouse
Data, like any other asset, has a life cycle of usefulness. In traditional data warehousing, frequently accessed data is often stored on
faster but more expensive storage media, while cooler data can be stored on near-line (optical disks) or offline storage devices (tapes, CDs).
(See left side of figure 3, below.)
Storing data offline makes it harder to access, potentially delaying important business or compliance queries. Keeping data offline has
other repercussions as well: over time, schema changes to tables sometimes occur, and reloading offline data into the
warehouse can then mean massive rework for DBAs.
With multi-temperature data warehousing and the advent of larger disk capacities, keeping historical data in the data warehouse becomes much
more cost-effective because all data can now be kept online and accessible. (See right side of figure 3.) With multi-temperature
data warehousing, near-line media is no longer necessary, and with the exception of truly dormant data, all data can now be integrated into
the warehouse. Multi-temperature data warehousing enables the next leap forward: the ability to keep all business data in the warehouse—forever.
A winning value proposition
In addition to the ability to keep more data in the warehouse for decision-making purposes in virtually all industries, multi-temperature data
warehousing can enable business processes that previously did not have access to long-term historical data.
For example, health insurers with processes such as claims management can benefit from the analysis of historical data to reduce fraud,
prevent abuse, examine cost trends and improve care management.
Manufacturers can improve warranty management processes by keeping more historical data in the warehouse for product quality analytics or
failure correlations.
Financial institutions also use detailed customer transaction data dating back more than two years to scan for patterns of fraud, more than
five years for auditing, and seven or more years for a predictive model of consumer banking habits. Enterprise risk management regulations and
audit requirements need multi-temperature data to run analytic models for credit risk, probability of loan default, economic
scenarios and risk management.
Figure 3: These two pyramids show the differences between traditional life cycle management and
multi-temperature. Left: In traditional, to keep storage costs low and data warehouse
access speeds high, currently accessed data is stored online while cooler data is stored
in large-capacity media such as tapes or CDs. Right: Multi-temperature is information life
cycle management made easy because data no longer needs to be moved or placed in different
storage media; it can all be kept in the data warehouse for better business decision
making.
Telecommunication providers are leveraging near real-time incoming call data to identify the best cost for the routing of "roaming traffic."
In addition, the most recent 90 days of call usage data are often analyzed to ensure that service quality commitments are satisfied. This
frequently accessed hot data coexists in the warehouse with cooler data, such as 12 months to 24 months of customer records, which can help to
resolve lawsuits or arbitration over inter-carrier claims.
In the European Union, companies are also adopting multi-temperature data warehousing to comply with the European Union Data Retention Directive, which
mandates retention of call detail records for a minimum of six months.
Realizing potential
In Ferris Bueller's Day Off, the Ferrari 250 GT sitting in the garage was a costly machine with enormous unrealized performance potential.
The same holds true for your data warehouse—as mentioned earlier, paying for storage space that you won't use is a losing value proposition.
Teradata multi-temperature data warehousing turns a losing value proposition into a winning value proposition. Larger disk storage capacities
can now be used without sacrificing system performance. Moreover, large volumes of infrequently accessed yet valuable data, most of
which may previously have been stored offline, can be kept in the data warehouse.
Adding historical depth to the data already in the warehouse can give your organization a competitive edge in the
marketplace.
Paul A. Barsch is a marketing manager for Teradata Professional Services.
Teradata Magazine-June 2007