Teradata Magazine Online
WHY TERADATA

The Law of "More"

To support increasing performance demands on all fronts, data warehouses must be truly scalable.

In 1965, Gordon Moore of Intel observed that the number of transistors per square inch of microchip doubled yearly, a trend he predicted would continue. The semiconductor industry has since interpreted “Moore’s Law” to suggest that processor performance should double every 18 months.

For data warehousing, Rob Armstrong, our director of Data Warehousing Sales Support, has another law: the Law of More. Armstrong’s Law of More states that a healthy data warehouse will constantly drive more users to want more applications that offer more complex and varied analysis, against more data with more frequent updates in a more timely manner. In fact, many successful data warehouses more than double in capacity/processing power every 18 months. Successful operation requires a data warehouse architecture and platform that can scale with these expanding business needs.
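To make the "double every 18 months" figure concrete, the following minimal sketch projects capacity under a steady doubling period. The function names and the 1TB starting size are illustrative assumptions, not figures from a real sizing exercise. A doubling period of 18 months works out to an annual growth factor of about 1.59.

```python
def projected_capacity(initial_tb: float, months: float,
                       doubling_months: float = 18.0) -> float:
    """Capacity after `months`, assuming it doubles every `doubling_months`."""
    return initial_tb * 2 ** (months / doubling_months)


def annual_growth_factor(doubling_months: float = 18.0) -> float:
    """Year-over-year multiplier implied by the doubling period."""
    return 2 ** (12.0 / doubling_months)


if __name__ == "__main__":
    start_tb = 1.0  # hypothetical starting size
    for years in (1, 3, 5):
        size = projected_capacity(start_tb, years * 12)
        print(f"after {years} year(s): {size:.2f} TB")
    print(f"annual growth factor: {annual_growth_factor():.3f}")
```

Compounding is the point here: a warehouse that keeps to this pace is roughly ten times its original size after five years, so the platform has to be sized for the trajectory rather than for the current load.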

Figure 1
With a scalable data warehouse environment, instead of compromising on business needs (red, blue), you can size the system to accommodate all dimensions of growth independently (orange).

In the past, discussions of data warehouse scalability focused on data volume. That is a limited view, however. Storing data is not the same as exploiting data. Business value is not derived from data storage alone; rather, it results from the use of that data to answer business questions, support operations and perform analysis. To better understand data warehouse scalability, let’s consider all of the dimensions contributing to the business value of the data warehouse (see figure 1).

Grow to fit
There are five dimensions of data warehouse scalability:
Schema sophistication and integration
Query complexity and freedom
Concurrency level
Workload types
Data volume

A sophisticated data schema is flexible, extensible and adequately represents the data relationships, complexity and dynamics of the business. Data brought together from across the enterprise must be integrated in a manner that exposes and retains the complex relationships between data entities. As end users gain experience with the data warehouse, their needs will evolve. This will require new subject areas in the model, and existing subject areas will need to be completed or enhanced. An underlying normalized data model provides the greatest opportunity for cross-functional extension and integration of new information in the data warehouse. To be truly scalable, a data warehouse must be able to accommodate such an evolving data model.

In a growing data warehouse, end users will develop increasingly sophisticated and complex information needs. The first reports or queries against a new data warehouse typically consist of rewrites of previously existing reports. This important validation point provides end users with a sense of comfort and trust in the quality of the technology. But as more data is sourced, it will be exploited in increasingly complex queries and analyses. Database views may be structured to hide the complexity, yet that complexity still exists. It is not uncommon for organizations with a Teradata Warehouse to find users asking challenging business questions that generate queries involving dozens of joins and subqueries. A scalable data warehousing environment must give users the freedom to run increasingly complicated queries and analyses.

As this process evolves, the number and variety of end users will grow. You can’t increase business value while denying your end users access to the data warehouse, but more end users mean more concurrent queries. Furthermore, most data warehouses experience cyclical surges in query workload, such as at the beginning of the week or the end of the month. An effective data warehouse must service hundreds and even thousands of concurrent queries with predictable and consistent results. It must also be able to support concurrent complex, analytical or even untuned queries. A genuinely scalable data warehousing environment will be able to respond effectively and efficiently.

Companies have increasingly begun moving their data warehouses into operational roles. Such data warehouses receive data loads multiple times per day, or even in real or near real time. The systems support both analytical workloads and highly tuned tactical workloads with quantified service-level agreements (SLAs).

These varied workload types share the same data and the same computing resources. A scalable data warehousing environment must be able to accommodate increasingly complex, varied and demanding workloads with appropriate tools for managing system resources.

As we have established, data volume provides only one measure of data warehouse scalability. Nonetheless, it remains an important factor because it can amplify the challenges associated with both managing and exploiting the data warehouse. True enterprise analytics require detailed data and a significant amount of history. Relying on aggregated data limits the output to reports of what happened rather than analyses of why it happened. Limited history precludes seasonal, year-over-year and trend analysis.

A scalable data warehousing environment must accommodate all of the dimensions discussed above, along with the database size required at any point in the evolution.

Technical growth inhibitors
A growing data warehouse requires an architecture and database management platform that can predictably scale to support concurrent growth along the five dimensions. If the architecture or platform cannot do this, you are forced to compromise business needs. With a truly linearly scalable architecture, the system can be predictably sized to whatever capacity is necessary to accommodate all dimensions of growth.

If the technology platform or architecture does not provide that scalability, then you really have only two choices: refuse to provide what the business needs or attempt to “divide and conquer” through the use of data marts, hub-and-spoke or even grid architectures.

Divide-and-conquer all too quickly becomes divide-and-complicate, however. A data mart approach may appear to provide more data volume, but in fact most of this will be redundant data. It seems to support greater numbers of users, but will limit or preclude their access to the detailed data. As a result, information silos will begin to develop. Costs will increase as a result of data redundancy and increased platform administration staff. Worst of all, despite the financial and time investment, users will find that the approach only addresses a portion of their decision support needs.

Only shared-nothing architectures can truly approach linear scalability. Shared-memory and shared-disk architectures must address the inherent overhead of coordinating access to these shared resources, as shown by symmetric multiprocessing (SMP) servers. Clustered-disk architectures enable multiple servers to access shared data, but the access must be managed. That overhead requirement prevents linear scalability.

In shared-disk architectures, users attempt to work around these barriers by designing applications and configuring utilities to localize data access and minimize the negative scalability of these shared architectures. That approach places the burden on system administrators and application architects, however, and compromises the environment’s agility when workloads change.
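The overhead argument above can be illustrated with a toy model. The sketch below compares ideal linear scaling with an Amdahl-style contention model, in which a fixed fraction of each node's work is serialized on a shared resource. The function names, the per-node throughput of 100 and the 5% contention fraction are assumptions chosen for illustration. They are not measurements of any real system.

```python
def shared_nothing_throughput(nodes: int, per_node: float) -> float:
    """Ideal linear scaling: every node contributes its full throughput."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return nodes * per_node


def shared_resource_throughput(nodes: int, per_node: float,
                               contention: float) -> float:
    """Amdahl-style model: `contention` is the fraction of work that is
    serialized on a shared resource (an illustrative assumption)."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return nodes * per_node / (1.0 + contention * (nodes - 1))


if __name__ == "__main__":
    per_node = 100.0   # hypothetical queries per minute per node
    contention = 0.05  # hypothetical 5% serialized on the shared resource
    for n in (1, 2, 4, 8, 16, 32):
        ideal = shared_nothing_throughput(n, per_node)
        shared = shared_resource_throughput(n, per_node, contention)
        print(f"{n:>2} nodes: linear {ideal:7.0f}   shared {shared:7.1f}")
```

Even a small serialized fraction bends the curve: in this model the shared configuration can never exceed 1/contention times the single-node throughput, no matter how many nodes are added.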

To get the most out of your data warehouse, you have to think ahead. Is your data warehouse capable of growing at a rate deemed necessary by the business? Do you have a plan for increasing system value by adding user groups, adding subject areas and increasing the variety of queries and workloads? Are you providing for deeper analysis of more detailed data with more sophisticated business questions? Are you increasing the business impact through operational use of the data warehouse?

While the scalability of the technology platform or architecture is important, the real question is whether or not the business value justifies the expense. Ultimately, expanding your data warehouse should be a business decision, not a technology decision.

Without limits

Not every organization needs a large data warehouse. In fact, most Teradata customers began with less than 1TB of user data and many have not expanded beyond 1TB or 2TB. If and when you need to grow, however, you should know that Teradata has experience helping clients scale far beyond that size.

Consider a telecommunications company that has extended its Teradata data warehouse for several years to keep pace with its growth as an organization. Here are some facts from its single, integrated, cross-functional data warehouse:

  • 50TB to 60TB of raw user data (80TB of Max Perm Space)
  • 20,000 users across a number of business units
  • More than 1,000 concurrent queries
  • 500,000 to 750,000 queries per day, on average
  • Subsecond completion time on 60% of queries; less than one-minute completion time on 95% of queries
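As a rough sanity check on these figures, the sketch below converts the daily query counts into an average per-second rate and splits them into the stated completion-time buckets. The helper names are hypothetical. The calculation assumes the daily totals are spread evenly over 24 hours, which real workloads are not, so peak rates will be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def queries_per_second(queries_per_day: int) -> float:
    """Average arrival rate, assuming an even spread over the day."""
    return queries_per_day / SECONDS_PER_DAY


def latency_buckets(queries_per_day: int) -> dict:
    """Split a daily total using the percentages quoted in the list above:
    60% finish in under a second, 95% in under a minute."""
    subsecond = queries_per_day * 60 // 100
    under_minute = queries_per_day * 95 // 100
    return {
        "subsecond": subsecond,
        "under_one_minute": under_minute,
        "one_minute_or_more": queries_per_day - under_minute,
    }


if __name__ == "__main__":
    for daily in (500_000, 750_000):
        rate = queries_per_second(daily)
        print(f"{daily} queries/day is about {rate:.1f} queries/second")
        print(latency_buckets(daily))
```

On these assumptions the system averages roughly 5.8 to 8.7 queries per second around the clock, and between 25,000 and 37,500 queries per day still run for a minute or longer.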

Along with end-user access, these queries support more than 150 different applications sharing the same physical data model. That data model is third normal form (3NF or normalized) and includes 15,000 tables. As for query freedom, their motto is “any query, any time, any data” (always assuming appropriate security and privacy constraints).

Now that’s multi-dimensional scalability in action. —D.H.


Teradata scalability

How have so many Teradata customers been able to expand their Teradata Warehouse along the dimensions mentioned in this article? The answer lies in a myriad of features and capabilities designed into a matched stack of hardware and software optimized for enterprise-class decision support. Here are some examples:

  • “Slope of 1” linear scalability, based on a shared-nothing architecture that eliminates overhead generated by sharing computing resources
  • Hash-based data distribution that balances system load and automates data management
  • Parallelism, the first in a commercially available relational database management system (DBMS), that ensures extreme performance and maximum utilization of computing resources
  • Patented BYNET technology that scales interconnect bandwidth as nodes are added to the configuration
  • Cost-based optimizer tool that balances extreme query and workload complexity
  • Workload management capabilities that control mixed workloads and allocate computing resources in accordance with user priorities; for example, guaranteeing resources for tactical workloads while providing concurrent access by lower-priority analytical workloads
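The idea behind hash-based data distribution in the list above can be sketched in a few lines. This is a generic illustration, not Teradata's actual hashing algorithm, which this article does not describe. The SHA-256 hash, the `unit_for` and `distribute` names, the four parallel units and the `cust-N` keys are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def unit_for(key: str, units: int) -> int:
    """Map a row key to one of `units` parallel units via a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % units


def distribute(keys, units: int) -> list:
    """Count how many rows land on each unit."""
    counts = Counter(unit_for(k, units) for k in keys)
    return [counts.get(u, 0) for u in range(units)]


if __name__ == "__main__":
    keys = [f"cust-{i}" for i in range(10_000)]  # hypothetical row keys
    for unit, rows in enumerate(distribute(keys, 4)):
        print(f"unit {unit}: {rows} rows")
```

Because placement depends only on the key's hash, rows spread roughly evenly without an administrator choosing partitions by hand, and any unit can compute where a given row lives. That is the sense in which hashing both balances load and automates data management.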

D.H.


Roadmap to success

Data warehouse growth should occur in a planned, incremental and disciplined manner that is tied to business needs. A good tool for planning a value-oriented data warehouse evolution is the Teradata series of industry-specific Enterprise Data Warehouse Roadmaps (EDWr; see Ultimate Toolkit). These roadmaps can assist you in mapping strategic objectives to business improvement opportunities and, in turn, mapping those to the associated business questions and required data elements.

The roadmaps leverage Teradata’s industry logical data models, yet they can be customized to your particular needs. Using such tools, you can determine which data elements must be sourced or derived and provided in the data warehouse. This will allow new queries, analyses and applications that align with your strategic objectives. —D.H.

© Teradata Magazine-June 2006

Copyright by Teradata Corporation 2001-2007.