Teradata Magazine Cover Teradata Magazine Online  
Register Help Password
Password:
Quick Links
Current Issue
Archives
Teradata.com
Teradata Magazine Rss Feed
ARCHIVES Search Teradata Magazine Online:  
 




















Contrary to popular belief, aging data does not have to be moved from the ODS to the EDW.








































































The focus is on minimizing rather than maximizing the number of units of parallelism that are asked to do work.





































































Teradata has no limits on the number of concurrent users and a very high limit on the number of concurrent active queries.

 



ODS ORIGAMI

by Todd Walter

ODS, or operational data store, is the buzzword of data warehouse conferences. Changing business requirements call for an integrated information source to answer operational questions for the enterprise. The industry has responded by creating this new architectural component. Teradata sees the same set of business information requirements but responds by folding the operational questions into the enterprise data warehouse (EDW) component, eliminating the need to create new copies of data and purchase, implement or operate new systems. The creation of separate ODS components is primarily a response to technological limitations of database products rather than an architectural necessity.

What is an ODS?
According to industry experts, “An ODS is a collection of detailed data that satisfies the collective, integrated, operational needs of the corporation. Generally, these needs arise in the following situations: as strategic decisions are made using the data warehouse and/or data mart and action is required; as integrated operational reporting is needed across multiple but related operational systems (e.g., distribution systems, customer systems, etc.). The ODS is:

Subject-oriented
Integrated
Volatile
Current-valued
Detail

“The ODS looks very much like a data warehouse when it comes to its first two characteristics: subject orientation and integration. However, the remaining characteristics of an ODS are quite different from a data warehouse. Because of the very nature of these fundamental differences in types of data and processing, it is never acceptable to combine an ODS and a data warehouse into the same physical environment.”

(Source: Corporate Information Factory, Second Edition, W.H. Inmon, Claudia Imhoff and Ryan Sousa.)


How does Teradata fold the ODS into the EDW?
Using Teradata, it is both possible and practical “to combine an ODS and a data warehouse into the same physical environment.” The desire to separate the ODS from the EDW, as previously expressed, is driven by the volatility, current value (freshness) and level of detail of the data. The type of questions asked of the ODS also differ; typically, they are focused, tactical questions requiring short response times and conti-nuous availability of the system versus the EDW’s strategic wide-scope questions, such as questions about a single customer or product versus questions about all customers or product categories. Teradata’s technology allows the execution of tactical questions simultaneously with strategic questions on a single copy of the data while maintaining the freshness required for the ODS workload.


Does current-value data conflict with the EDW model?
There is no logical conflict between current-value and historical data. It is actually much easier to manage the changing boundary between them if there is only one copy of the data. And contrary to popular belief, aging data does not have to be moved from the ODS to the EDW. Questions that require both the current state and the history are much easier to answer because data does not have to be integrated from two locations. Two of the reasons current data is often separated from historical data are successfully addressed by Teradata technology: data management and volatility.


How is Teradata’s data management different?
Mixing volatile data with static historical data can present significant management overhead and challenges. The Teradata file system fully automates data management on disk, eliminating these issues. Space is dynamically allocated and reclaimed as the data grows and shrinks. Fragmentation is handled automatically in the background. Table structures are not capable of developing the overflow pages, chaining and partially full pages that degrade performance and require reorganization. Indexing structures do not degrade as tables are updated and require no rebuilding or reorganization. Updates and inserts can be applied as easily to a large table as to a small one. Teradata’s data management technology makes it feasible to store both volatile and static data in the same store.


How does Teradata handle data volatility?
Traditional data warehousing assumes a read-only environment with data refreshed periodically—once per day at most. Teradata, on the other hand, provides TPump for continuous load and update of the EDW from transactional systems. TPump is capable of keeping data as fresh as is required while providing manage-ment functions to balance between the resources required by the updates and those required to answer questions. The number of connections (sessions) used is a primary control function that can be adjusted dynamically to manage the flow. TPump connections honor the Priority Scheduler resource allocation mechanism to ensure resources are consumed at an acceptable rate.


What do the terms “detailed” and “summary” mean in the ODS definition?
Both the ODS and EDW contain detailed atomic data. An EDW often has summarized data as well to provide fast access to information at a different level. And a data mart implementation is primarily focused around summarized data. In an ODS, summaries of up-to-the-minute volatile data take on a different characteristic. They represent the dynamically changing business rather than a known point in time, such as a nightly close of business.

In some cases, dynamic summaries could lead to faulty business decisions, e.g. when a large deposit is recorded but a large withdrawal in a subsequent transaction is not. However, there may well be dynamic sum-mary information that is valuable to an organization, such as total revenue or up-to-the-minute margins from a particular customer’s transactions.

Storing detailed data for either the EDW or ODS is a primary reason for utilizing Teradata’s scalability and parallel performance. Adding today’s detailed data to years of historical data actually requires relatively little additional scaling of the storage or the server.

Summaries of detailed data exist only to accelerate the performance of questions against that level of information in the enterprise. EDW implementations on Teradata typically use far fewer summaries than equivalent implementations on other technologies due to the scalability and performance of the aggregation, join and sort algorithms. Many of the answers typically retrieved from summaries can be answered directly from the detailed data.


How can summaries be maintained on Teradata when updates are continuous?
Recent Teradata releases have added support for automating the summaries that are still required. Aggregate Join Index technology keeps the summaries current as the underlying data is updated. This makes it possible to have a continuous flow of updates, keeping summary tables current to answer questions about the state of revenue, profitability, inventory or customer value.

Some summaries better serve the business if they represent a point in time. Daily updates or refreshes can be accomplished easily using INSERT SELECT or Multiload. INSERT SELECT is fully parallel and scalable, including when the source SELECT statement is an aggregate query. Multiload can be used to simultaneously update the detail data and the summaries in a single batch process.


If updates are continuous, how can users get reproducible results from their queries?
There are users who require the most current information possible and users who require a specific format, e.g. end of business day, end of week, end of quarter. In a continuous update environment, users requiring current information are satisfied, but others see a continuously changing state.

Views address this issue by limiting the scope of queries to exclude the current changing data. Transactions and summaries on time boundaries are particularly easy to manage since each row of data will already contain the date. A simple view to exclude the current timeframes gives the users their consistent answers. And the absence of overhead with Teradata’s view implementation makes using views for all the queries a popular convention.


What if tables do not have dates?
Tables such as Customer and Inventory require additional effort to manage point-in-time questions. In a traditional data warehouse, these tables contain only the current values, not what the values might have been before the last update. If the ODS or the EDW is used to answer questions about a prior state of the business, additional data modeling is required. Effective “begin” and “end” dates are required to identify when a value is relevant. And multiple copies of the record are required to capture the values as time passes and change occurs. Again, views are key to delivering data to users. In this case, the view is used to hide the complexity of returning the current row or the row that was current as of some specific date.

How are tactical queries different from strategic queries?
Strategic queries are wide in scope. They require analysis of a large amount of data from many areas of the database. They typically have a wide range of response time requirements, from minutes to days. Strategic requests are submitted by a small to moderate number of users with sophisticated query tools and even by writing SQL directly.

Tactical queries are typically narrow in scope. They require analysis of a relatively small number of rows from the database that might be drawn from a number of different business areas. They typically have response time requirements measured in seconds, even sub-second. Tactical requests are submitted by a very large number of end users in very high volume. The users are rarely, if ever, writing SQL statements to the database; instead they use an application that makes the database requests behind the scenes. Some people call this OLTP because it has some attributes similar to the workloads of the transactional systems. However, tactical requests differ from OLTP in several important ways (see Table 1).


How are tactical queries handled differently by Teradata?
Strategic queries put many units of parallelism to work simultaneously to deliver performance and scalability over very large volumes of data. Tactical queries accessing just a few rows are not efficient if they use many units of parallelism that don’t have applicable rows.

Teradata optimizes and executes tactical queries differently than strategic queries. The focus is on minimizing rather than maximizing the number of units of parallelism that are asked to do work. The Teradata optimizer recognizes tactical queries automati-cally, identifies which units of parallelism have the relevant data and creates a plan that most efficiently uses the parallelism of the system. The result can be seen using Explain and looking for plans that say “single AMP” or “group AMP” as the target of a plan step. Teradata requires few resources to execute the short queries, making the cost of adding this workload much lower than most expect.


What can help deliver the best performance and scalability for the tactical queries?
A Unique Secondary Index can be used when a unique alternate path to the data exists. It allows the optimizer to use two rather than all units of parallelism to retrieve a row. When a tactical query needs to join several tables together to get an answer, a Join Index (JI) might be an appropriate choice. A JI allows the join to be done during the load/update process so that the tactical queries can simply retrieve the pre-joined result. An Aggregate Join Index (AJI) can be used to pre-calculate aggregations. A tactical query then can retrieve a single aggregate result rather than reading and aggregating many rows.

Defining indexes always involves trade-offs that must be carefully considered. An index must be stored on disk, which requires space. It must be defined and built. There are costs to maintaining the index as each update is performed. These costs must be balanced against the improvement in cost and response time for the tactical queries. High-volume queries with constrained service level requirements are the places to focus effort to identify indexing opportunities.


How can all this work be run simultaneously on one system?
The Priority Scheduler Facility (PSF) is important for managing a large volume of strategic queries at different service levels. It is even more critical when adding the tactical queries with stringent response time requirements. The tactical query work can be assigned a high weight and a high priority, giving it first access to CPU and I/O resources. The very long-running analysis work can be constrained to a portion of the total CPU and I/O resources to keep it from interfering with work that has more constrained service levels. Policy decisions on the relative importance of the freshness of the data versus service levels for answers to questions can be implemented directly. For example, during peak periods it might be acceptable for the data to be somewhat less fresh in order for call center employees to get immediate responses to their inquiries.

Teradata has no limits on the number of concurrent users and a very high limit on the number of concurrent active queries. Memory, connections, queuing and executing in priority order are all managed automati-cally by Teradata. Users are managed via connections to the DBMS; no management of OS user IDs is required. Teradata specializes in running all aspects of concurrent work and making it easy to manage.


Summary
The ODS is a response to requirements to execute a high volume of tactical queries against the current state of the enterprise. ODSs are described in the industry as separate components in the data warehouse architecture—separate servers, copies of the data, users, tools and workloads. Teradata offers technology that allows the ODS to be just another workload on the EDW. Folding the workloads together onto one copy of the data delivers advantages in cost, complexity, time to market and flexibility, thus creating the Active Data Warehouse. T

E-MAIL ME
Looking for answers to life’s mysteries? Or would you just like to know more about the Teradata Warehouse and related applications? Ask the Expert! E-mail questions and comments to Todd at: todd.walter@ncr.com

Photo by Alex Hayden




Copyright by Teradata Corporation 2001-2007.