When and why to put what data where and how.
by Rob Armstrong
Users now demand not only more complex and flexible analysis but also more timely information. Data must be available 24x7, and many business
users demand that data supporting decision making be accessible within hours—in some cases, minutes or even seconds—of when an event occurs.
Organizations also realize that the same data needs to be utilized by many different processes and thus many different workload profiles.
Ideally, organizations should keep detail data at the lowest form possible in a functionally neutral data model. This enables the business
community to ask any number of questions from a wide range of perspectives and processes. The basic premise is that you can always aggregate
detail data, but you can never decompose summary data. This does not imply that you should never use summary tables; rather, it means that you
should not replace detail data with only summary data.
If the desired analysis requires detail data, using only summary tables will fail no matter the technical justification. Implementers and
users must work together to understand the business requirements and what drives them; then they need to use the least intrusive process
possible to meet those requirements.
The primary reason for building summary tables, adding indexes, enforcing priority and denormalizing data is to increase performance. If you
could use any amount of data to ask any question and get an instant response, you would never need to build summaries or indexes. These
structures function as a workaround to other limitations. Their very existence requires more space, data management and time between the
occurrence of an event and the ability to take action. The question is how to balance providing higher performance with minimizing data
replication and management.
Query frequency and performance consistency are also important considerations. For example, 70% of one enterprise's 1,000 weekly queries asked
for summary-level questions against detail data. With a summary table, queries executed in approximately six seconds instead of four minutes,
a difference of 2,730 minutes of processing time. Even factoring in the 105 minutes required each week to maintain the summary table, the
resulting savings of 2,625 minutes per week clearly justified making a summary table. Over time, this company found that usage shifted from
resolving many queries with summary data to matching a majority of queries against detail data. With fewer queries benefiting from a summary
table, it was simply dropped without affecting other processes.
Optimizing performance
The process of optimizing the data management environment should not be undertaken without understanding the consequences. It is important to
remember that if the technology currently in place cannot support data capture, storage and usage at the lowest level, it would be wise to
seriously examine that technology's ability to support ongoing business needs.
When optimizing performance, the starting point is having a clear grasp of two aspects of the system. First, what is the unaided performance?
The database may be powerful enough that executing the query against the tables without additional indexes or denormalization gives the
required response time. This raises the second aspect.
An organization may indicate a major problem with the data warehouse is that the response time on reports is not fast enough. However, when
asked what the user would do if the reports were faster, the answer is: "I don't know."
If you are going to escalate up the ladder of performance optimization, then you must understand the value of the performance being requested.
What business outcome is changed because the response time goes from one minute to 10 seconds? If that question cannot be answered, then stop
all your efforts until you can justify the expense of added optimizations. If it can be answered, the IT group should follow a step-by-step
approach to balance cost of performance with benefit of analytic output.
The seven steps of optimization
-
Start with a neutral data model that is normalized and not specific to any function or group
|
Provides the lowest level of detail needed to support business objectives
|
|
Addresses future needs via the level of detail defined/established
|
|
Uses views that define detail tables for security and concurrency
|
|
Enables access views that allow selects (read without intent or dirty reads) against tables that are being updated to avoid
locking issues
|
-
Implement views to ease navigation
|
Helps tools create better structured query language (SQL) using third-party tools and manipulating views
|
|
Provides the structure that front-end tools may need to create optimized SQL and eliminate the dependence on the front-end
tools understanding each and every database on the market
|
|
Incorporates optimized join techniques in the views
|
|
Integrates a layer of security to the tables
|
|
Offers each business unit its own logical functionalization—presenting data in the way users want to see it
|
|
Allows data to be stored once and then converted through views, reducing data redundancy, ensuring consistency and
simplifying data management
|
-
Add indexes, which can range from simple techniques such as secondary indexes to complex structures such as covered, join or
aggregate indexes
|
Provides a major advantage in that the system maintains them concurrently with base data table
|
|
Attaches some overhead to the indexes, so understand the impact and trade-offs and identify them in a cost versus benefit
analysis
|
-
Enforce prioritization using the database prioritization framework to secure critical workload with resources when necessary
|
Optimizes index and view structures once the SQL has been optimized
|
|
Provides better use of available capacity but not more capacity
|
|
Offers better understanding of the workload
|
|
Becomes most noticeable when system is congested; it is for these congested times that you are prioritizing work
|
|
Allows business community to provide priority direction
|
-
Proceed to rational summaries and denormalization only when the need for higher performance outweighs analysis flexibility
|
Follows a strategy of coexistence rather than replacement, retaining detail tables for extended analysis while adding
denormalized structures
|
|
Encompasses multiple needs and can serve a variety of users and departments, justifying cost of management
|
|
Satisfies the greater good of the user community rather than a sole functional purpose
|
|
Results in data volume and volatility playing an important part in cost-benefit analysis
|
-
Consider "irrational" summaries and denormalization to customize data models for specific business processes and/or reports
|
Facilitates when desired performance outweighs the demand of flexible analysis and a structure is needed to address a
single function
|
|
Includes the higher cost of data management and resources when considering the cost-benefit analysis
|
|
Minimizes impact by maintaining the underlying detail data; when conditions change in the future, this irrational
structure can be eliminated with little impact
|
If you still have not achieved the desired performance then there is one more step. However, it has a relatively higher cost. So before
proceeding, make sure the anticipated business value will outweigh the cost of the additional data movement and management required to keep
the extracted data in agreement with detail data.
-
Explore, expand and export
|
Explore the business usage of the answer sets and validate that the change in business outcome will drive the expected
revenue increase or cost decrease, depending on the application. Are the business users willing to stand behind the
expectations, and does the enhanced performance justify the cost?
|
|
Expand the current platform. Is the performance requirement so critical that new capacity is warranted?
|
|
Export the data from the main data warehouse to an application-specific platform. In this situation, a dedicated
environment tuned specifically for its application will provide much more control over the individual application. Keep in
mind all of the cost of the duplicated data, added time lag to action, and the cost of a new platform and software
environment that will need to be managed and supported.
|
Justifying the steps
Taking these seven steps requires understanding the cost involved with each step and the benefits derived from doing so. It also requires
making decisions that support long- and short-term needs. In some cases you may create summary tables or add denormalized data models that you
will drop eventually as the functions evolve over time. This is acceptable as long as eliminating the tables does not cause interruptions or
massive application changes. One way to ensure this is to refrain from using the summary or denormalized tables as the input to more
downstream applications whenever possible.
When applying the seven steps, perform cost-benefit analysis for each proposed step, and include physical aspects such as disk space,
resources to manage the structure, and lost opportunities due to time delays to maintain the process. Improvements may be seen in:
|
Query performance and opportunity gained from faster response
|
|
User concurrency rates
|
|
Throughput
|
|
User satisfaction and productivity
|
|
Utilization of third-party tools
|
|
Resource consumption used to execute a query
|
Maximizing usability
Although the aforementioned benefits provide a good reality check, they are not the only factors to consider when deciding whether to take
the next step in optimization. Other considerations include business priorities and end-user requirements. Users must understand the technical
impact of asking for a response to all queries within seconds. The best way to gain this understanding is by outlining the costs associated
with creating and managing the tables.
Most data warehouse implementations focus on getting something up quickly. The longer-term goals of analysis and flexibility are overlooked in
the rush to provide initial value. To reach the goal of having phase one meet immediate needs, many quickly jump to summary tables and
functional models. But taking a relatively short amount of time to consider the data warehouse's real purpose, and building a correct
foundation for it, the IT group can assure a better future for the data warehouse, and thus the entire user community.
The bottom line is that the implementers, not the user community, should decide whether to use a summary table or functional model. On the
other hand, the users, not the IT group, must determine which business capability needs must be delivered to drive business opportunity.
The more you move or manipulate the data into answer sets and push data to process, the more inflexible the environment becomes to meeting
future needs. The better approach is to push process to data and work on simplifying the overall architecture by minimizing data movement and
maximizing data reusability. T
Rob Armstrong is director of the Teradata data warehousing group and has more than 20 years of experience justifying, designing and
implementing data warehouse systems. Contact him at rob.armstrong@teradata.com.
Teradata Magazine-June 2008
|