Parallel processing combined with database tuning can optimize your data warehouse.
by Keith Rimmereid
The widespread adoption of parallel processing has been a boon for data warehousing. It has allowed flexible analysis of increasingly more
granular data so businesses can optimize the organization's strategy and enhance its operational execution.
The demand on a data warehouse will continue to expand and evolve, requiring support for a diverse body of analytics and applications that
go beyond what has been traditionally thought of as data warehousing. These include operational applications to monitor and provide snapshot
reporting on key business processes, sales and service applications to facilitate direct support for customer-facing employees, and
self-service Web portal applications to directly support customers and suppliers.
These applications demand short, consistent response times. The addition of simple and effective database tuning capabilities to the massive
throughput power of parallel processing provides tremendous performance and efficiency gains and expands the types of applications, business
processes and number of users that the data warehouse can support. The ultimate results for businesses are improved competitiveness in the
market and increased return on investment (ROI) for the data warehouse.
Perfect pitch
Parallel processing and data warehousing work well together. When implemented on suitable hardware architecture, a parallel approach can be
an order of magnitude faster than a serial approach. To illustrate this concept, consider that analyzing 100 million transactions is quicker
if 100 tasks each analyze 1 million transactions in parallel than if one task analyzes all 100 million.
This tremendous capacity for analytical throughput empowered by parallel processing can minimize the need for database tuning, or at least
change the focus of tuning techniques for many queries in a data warehouse workload. However, large performance and efficiency gains can be
achieved with a few select tuning techniques.
Parallel processing enables flexible analysis of the transactions within a reasonable response time. But what if many business queries always
ask for financial totals by month? Does it make sense to compute those subtotals for every query as each is run? Or what if a service agent
in the inbound call center needs to research a particular customer's transactions while on the phone with the individual? Can parallel
processing alone provide the level of performance needed? And what might happen as more history is retained or the business expands and the
transaction count jumps to 200 million? Will doubling the runtimes be acceptable?
Predictable scalability is also enabled through parallel processing. To continue with the example above, increasing the hardware resources
to support 200 parallel process tasks can double the throughput and cut the runtime in half. This is a great attribute, but it comes with a
significant price tag for the additional hardware. And it would not meet the response time requirements of the customer service agent. Nor is
it a solution if the number of queries scanning this table dramatically increases.
Four techniques
The alternative to expanding the hardware platform is to apply tuning techniques that reduce throughput requirements and increase the
performance of the parallel system. The following techniques increase the efficiency of the parallel system by reducing the amount of data
that needs to be processed for many queries. System and/or people resources are necessary to gain the full benefit of each tuning technique.
Also, workload management facilities can be used to prioritize system resources to the appropriate queries in order to better meet user
expectations. (See table.)
1. TABLE PARTITIONING A table partition is a small portion or a subset of a table. Parallel databases physically partition tables
across the hardware CPU and storage components to support parallel processing and increase throughput by allowing the partitions to be
processed concurrently, greatly increasing data scan throughput. This is often referred to as horizontal partitioning.
Additional levels of partitioning within the horizontal partitions can often be defined to reduce the amount of data that needs to be
processed. This will reduce query runtime. As an example, a sales transaction table contains a row for each sales transaction that occurred
over the last 24 months. The table includes the item sold, its price and quantity, the sales date, the sales region, customer information
and other relevant data. If two additional levels of partitioning are defined, the first could be for the month when the transaction occurred
and the second could be for the sales region to which the customer is assigned. This additional partitioning greatly reduces the amount of
data required to be processed for queries seeking information for a given month or two.
When comparing a particular month's sales results with the same month's results from the previous year, partitioning by month will increase
the efficiency of the query by about 12 times—the database will need to process transactions for only two months out of 24. For a query
scoping the transactions for a sales region in a given month, the efficiency gain is even greater, reaching a 48-fold reduction if four sales
districts are defined. (See figure, below.)
2. SECONDARY INDEXES Secondary indexes are additional database-maintained structures that offer greater data selectivity than the
partitioning strategy. These indexes are useful in supporting response-time-sensitive queries. For the sales transaction table described, a
secondary index could be defined for Customer. Then, if a service agent needs to review a customer's transactions, the request to the
database goes directly to the Customer index and accesses only the sales transactions for that customer. Broadly employing parallelism in
place of a selective index will fail to meet the response time expectations and limit the ability to support high levels of concurrency.
Secondary indexes not only provide very quick response times, but also enable high levels of concurrency, because the resource requirement
for each query is small.
3. DATABASE-MAINTAINED SUMMARY TABLES Many business queries have two common requirements: summary totals of usual business metrics
(such as total sales by region) and joining the same tables (such as customer with transactions). When common database-maintained summaries
and joins have highly repetitive patterns, they provide performance and efficiency gains. Once defined, the database builds and automatically
maintains a structure that contains the specified columns from the joined tables and the counts and totals that comprise the business metric.
Individual queries are then satisfied from the database-maintained summary table, thus eliminating repetitive joining and aggregation.
4. WORKLOAD MANAGEMENT AND SCHEDULING Workload management may not be thought of as a tuning facility, but its goals are the same: to
meet user expectations of runtimes and ensure efficient system utilization. Queries are submitted to the data warehouse in bursts and
irregular patterns, which create many resource peaks and valleys. The mix of queries will include simple lookups, repetitive reporting and
resource-hungry data mining processes. The roles of workload management facilities are to smooth the utilization peaks and valleys and to
ensure the system resources are assigned to queries based on appropriate priorities so response time requirements of the user community are
sufficiently met.
Dynamic composition
As a data warehouse evolves and grows, the workload composition typically changes. Ad hoc exploration and experimentation of a particular
subject leads to standard repetitive report suites and dashboards. Each new subject area deployed drives the delivery of additional
repetitive analysis.
The repetitive and predictable characteristics of the workload offer new opportunities to take advantage of tuning features while efficiently
supporting the continued ad hoc exploration. It is a natural evolution to begin with minimal tuning and then, as it makes sense, to
selectively use the tuning techniques that provide the best balance of effort and reward.
Intelligently using a good balance of various database tuning techniques—parallelism, partitioning and indexing—will enable the data warehouse
to support a broad range of analytic and operational applications at the lowest possible cost. T
|
| Why the Teradata system for an easily tuned environment?
|
| As the only popular data
warehouse database built on a natively parallel architecture, the Teradata system brings great balance between the
throughput needed for massive complex queries and the selectivity needed for response-time-sensitive queries.
Straightforward tuning options focus the massive parallelism on narrow slices of the data or selectively use units of
parallelism for single- or few-row lookups:
| > |
Fine-grained parallelism is always on and automatically activated, and it does not
need to be configured, tuned or managed.
|
| > |
Cost-based optimizer is sophisticated and responds on its own to changes in the
physical database design to create the best-performing query execution plans.
|
| > |
Multi-level partitioning is easy to implement, greatly improves performance and
minimizes resource consumption by reducing the amount of data that needs to be considered for a wide
variety of queries.
|
| > |
Indexing options are easy to set up and maintenance-free, and they are automatically
used to improve query performance.
|
| > |
Join indexes and aggregate join indexes eliminate the need for repetitive on-the-fly
joins and aggregations, speeding query performance and reducing resource consumption.
|
| > |
Teradata Active System Management dynamically allocates system resources according
to priority rules, and it smooths the peaks and valleys of resource demand.
|
| > |
Teradata Professional Services Performance Center of Excellence provides customized
performance management assessments along with best-practices training to enable organizations to
maintain optimum database and application execution.
|
—K.R.
|
|
Keith Rimmereid is a senior consultant with Teradata Data Warehousing Sales Support in San Diego.
Photography by Inmagine
Teradata Magazine-December 2008
|