Register | Log in


Subscribe Now>>
Home News Tech2Tech Features Viewpoints Facts & Fun Teradata.com
Why Teradata
Download PDF|Send to Colleague

Turn up the volume

Parallel processing combined with database tuning can optimize your data warehouse.

by Keith Rimmereid

The widespread adoption of parallel processing has been a boon for data warehousing. It has allowed flexible analysis of increasingly more granular data so businesses can optimize the organization's strategy and enhance its operational execution.

Guitar

The demand on a data warehouse will continue to expand and evolve, requiring support for a diverse body of analytics and applications that go beyond what has been traditionally thought of as data warehousing. These include operational applications to monitor and provide snapshot reporting on key business processes, sales and service applications to facilitate direct support for customer-facing employees, and self-service Web portal applications to directly support customers and suppliers.

These applications demand short, consistent response times. The addition of simple and effective database tuning capabilities to the massive throughput power of parallel processing provides tremendous performance and efficiency gains and expands the types of applications, business processes and number of users that the data warehouse can support. The ultimate results for businesses are improved competitiveness in the market and increased return on investment (ROI) for the data warehouse.

Perfect pitch
Parallel processing and data warehousing work well together. When implemented on suitable hardware architecture, a parallel approach can be an order of magnitude faster than a serial approach. To illustrate this concept, consider that analyzing 100 million transactions is quicker if 100 tasks each analyze 1 million transactions in parallel than if one task analyzes all 100 million.

This tremendous capacity for analytical throughput empowered by parallel processing can minimize the need for database tuning, or at least change the focus of tuning techniques for many queries in a data warehouse workload. However, large performance and efficiency gains can be achieved with a few select tuning techniques.

Parallel processing enables flexible analysis of the transactions within a reasonable response time. But what if many business queries always ask for financial totals by month? Does it make sense to compute those subtotals for every query as each is run? Or what if a service agent in the inbound call center needs to research a particular customer's transactions while on the phone with the individual? Can parallel processing alone provide the level of performance needed? And what might happen as more history is retained or the business expands and the transaction count jumps to 200 million? Will doubling the runtimes be acceptable?

Predictable scalability is also enabled through parallel processing. To continue with the example above, increasing the hardware resources to support 200 parallel process tasks can double the throughput and cut the runtime in half. This is a great attribute, but it comes with a significant price tag for the additional hardware. And it would not meet the response time requirements of the customer service agent. Nor is it a solution if the number of queries scanning this table dramatically increases.

Four techniques
The alternative to expanding the hardware platform is to apply tuning techniques that reduce throughput requirements and increase the performance of the parallel system. The following techniques increase the efficiency of the parallel system by reducing the amount of data that needs to be processed for many queries. System and/or people resources are necessary to gain the full benefit of each tuning technique. Also, workload management facilities can be used to prioritize system resources to the appropriate queries in order to better meet user expectations. (See table.)

1. TABLE PARTITIONING
A table partition is a small portion or a subset of a table. Parallel databases physically partition tables across the hardware CPU and storage components to support parallel processing and increase throughput by allowing the partitions to be processed concurrently, greatly increasing data scan throughput. This is often referred to as horizontal partitioning.

Additional levels of partitioning within the horizontal partitions can often be defined to reduce the amount of data that needs to be processed. This will reduce query runtime. As an example, a sales transaction table contains a row for each sales transaction that occurred over the last 24 months. The table includes the item sold, its price and quantity, the sales date, the sales region, customer information and other relevant data. If two additional levels of partitioning are defined, the first could be for the month when the transaction occurred and the second could be for the sales region to which the customer is assigned. This additional partitioning greatly reduces the amount of data required to be processed for queries seeking information for a given month or two.

When comparing a particular month's sales results with the same month's results from the previous year, partitioning by month will increase the efficiency of the query by about 12 times—the database will need to process transactions for only two months out of 24. For a query scoping the transactions for a sales region in a given month, the efficiency gain is even greater, reaching a 48-fold reduction if four sales districts are defined. (See figure, below.)

2. SECONDARY INDEXES
Secondary indexes are additional database-maintained structures that offer greater data selectivity than the partitioning strategy. These indexes are useful in supporting response-time-sensitive queries. For the sales transaction table described, a secondary index could be defined for Customer. Then, if a service agent needs to review a customer's transactions, the request to the database goes directly to the Customer index and accesses only the sales transactions for that customer. Broadly employing parallelism in place of a selective index will fail to meet the response time expectations and limit the ability to support high levels of concurrency. Secondary indexes not only provide very quick response times, but also enable high levels of concurrency, because the resource requirement for each query is small.

3. DATABASE-MAINTAINED SUMMARY TABLES
Many business queries have two common requirements: summary totals of usual business metrics (such as total sales by region) and joining the same tables (such as customer with transactions). When common database-maintained summaries and joins have highly repetitive patterns, they provide performance and efficiency gains. Once defined, the database builds and automatically maintains a structure that contains the specified columns from the joined tables and the counts and totals that comprise the business metric. Individual queries are then satisfied from the database-maintained summary table, thus eliminating repetitive joining and aggregation.

4. WORKLOAD MANAGEMENT AND SCHEDULING
Workload management may not be thought of as a tuning facility, but its goals are the same: to meet user expectations of runtimes and ensure efficient system utilization. Queries are submitted to the data warehouse in bursts and irregular patterns, which create many resource peaks and valleys. The mix of queries will include simple lookups, repetitive reporting and resource-hungry data mining processes. The roles of workload management facilities are to smooth the utilization peaks and valleys and to ensure the system resources are assigned to queries based on appropriate priorities so response time requirements of the user community are sufficiently met.

Dynamic composition
As a data warehouse evolves and grows, the workload composition typically changes. Ad hoc exploration and experimentation of a particular subject leads to standard repetitive report suites and dashboards. Each new subject area deployed drives the delivery of additional repetitive analysis.

The repetitive and predictable characteristics of the workload offer new opportunities to take advantage of tuning features while efficiently supporting the continued ad hoc exploration. It is a natural evolution to begin with minimal tuning and then, as it makes sense, to selectively use the tuning techniques that provide the best balance of effort and reward.

Intelligently using a good balance of various database tuning techniques—parallelism, partitioning and indexing—will enable the data warehouse to support a broad range of analytic and operational applications at the lowest possible cost. T

Why the Teradata system for an easily tuned environment?
As the only popular data warehouse database built on a natively parallel architecture, the Teradata system brings great balance between the throughput needed for massive complex queries and the selectivity needed for response-time-sensitive queries. Straightforward tuning options focus the massive parallelism on narrow slices of the data or selectively use units of parallelism for single- or few-row lookups:

> Fine-grained parallelism is always on and automatically activated, and it does not need to be configured, tuned or managed.
> Cost-based optimizer is sophisticated and responds on its own to changes in the physical database design to create the best-performing query execution plans.
> Multi-level partitioning is easy to implement, greatly improves performance and minimizes resource consumption by reducing the amount of data that needs to be considered for a wide variety of queries.
> Indexing options are easy to set up and maintenance-free, and they are automatically used to improve query performance.
> Join indexes and aggregate join indexes eliminate the need for repetitive on-the-fly joins and aggregations, speeding query performance and reducing resource consumption.
> Teradata Active System Management dynamically allocates system resources according to priority rules, and it smooths the peaks and valleys of resource demand.
> Teradata Professional Services Performance Center of Excellence provides customized performance management assessments along with best-practices training to enable organizations to maintain optimum database and application execution.

—K.R.

Keith Rimmereid is a senior consultant with Teradata Data Warehousing Sales Support in San Diego.

Photography by Inmagine

Teradata Magazine-December 2008

Related Links

Reference Library

Get complete access to Teradata articles and white papers specific to your area of interest by selecting a category below. Reference Library
Search our library:


Protegrity

Teradata.com | About Us | Contact Us | Media Kit | Subscribe | Privacy/Legal | RSS
Copyright © 2008 Teradata Corporation. All rights reserved.