The ODS looks very much like a data warehouse when
it comes to its first two characteristics: subject orientation
and integration. However, the remaining characteristics
of an ODS are quite different from a data warehouse. Because
of the very nature of these fundamental differences in types
of data and processing, it is never acceptable to combine
an ODS and a data warehouse into the same physical environment.
(Source: Corporate Information Factory, Second Edition,
W.H. Inmon, Claudia Imhoff and Ryan Sousa.)
How does Teradata fold the ODS
into the EDW?
Using Teradata, it is both possible and practical
to combine an ODS and a data warehouse into the same
physical environment. The desire to separate the ODS
from the EDW, as previously expressed, is driven by the
volatility, current value (freshness) and level of detail
of the data. The type of questions asked of the ODS also
differ; typically, they are focused, tactical questions
requiring short response times and conti-nuous availability
of the system versus the EDWs strategic wide-scope
questions, such as questions about a single customer or
product versus questions about all customers or product
categories. Teradatas technology allows the execution
of tactical questions simultaneously with strategic questions
on a single copy of the data while maintaining the freshness
required for the ODS workload.
Does current-value data conflict
with the EDW model?
There is no logical conflict between current-value
and historical data. It is actually much easier to manage
the changing boundary between them if there is only one
copy of the data. And contrary to popular belief, aging
data does not have to be moved from the ODS to the EDW.
Questions that require both the current state and the history
are much easier to answer because data does not have to
be integrated from two locations. Two of the reasons current
data is often separated from historical data are successfully
addressed by Teradata technology: data management and volatility.
How is Teradatas data management
different?
Mixing volatile data with static historical data can present
significant management overhead and challenges. The Teradata
file system fully automates data management on disk, eliminating
these issues. Space is dynamically allocated and reclaimed
as the data grows and shrinks. Fragmentation is handled
automatically in the background. Table structures are not
capable of developing the overflow pages, chaining and partially
full pages that degrade performance and require reorganization.
Indexing structures do not degrade as tables are updated
and require no rebuilding or reorganization. Updates and
inserts can be applied as easily to a large table as to
a small one. Teradatas data management technology
makes it feasible to store both volatile and static data
in the same store.
How does Teradata handle data
volatility?
Traditional data warehousing assumes a read-only
environment with data refreshed periodicallyonce per
day at most. Teradata, on the other hand, provides TPump
for continuous load and update of the EDW from transactional
systems. TPump is capable of keeping data as fresh as is
required while providing manage-ment functions to balance
between the resources required by the updates and those
required to answer questions. The number of connections
(sessions) used is a primary control function that can be
adjusted dynamically to manage the flow. TPump connections
honor the Priority Scheduler resource allocation mechanism
to ensure resources are consumed at an acceptable rate.
What do the terms detailed
and summary mean in the ODS definition?
Both the ODS and EDW contain detailed atomic
data. An EDW often has summarized data as well to provide
fast access to information at a different level. And a data
mart implementation is primarily focused around summarized
data. In an ODS, summaries of up-to-the-minute volatile
data take on a different characteristic. They represent
the dynamically changing business rather than a known point
in time, such as a nightly close of business.
In some cases, dynamic summaries could lead to faulty business
decisions, e.g. when a large deposit is recorded but a large
withdrawal in a subsequent transaction is not. However,
there may well be dynamic sum-mary information that is valuable
to an organization, such as total revenue or up-to-the-minute
margins from a particular customers transactions.
Storing detailed data for either the EDW or ODS is a primary
reason for utilizing Teradatas scalability and parallel
performance. Adding todays detailed data to years
of historical data actually requires relatively little additional
scaling of the storage or the server.
Summaries of detailed data exist only to accelerate the
performance of questions against that level of information
in the enterprise. EDW implementations on Teradata typically
use far fewer summaries than equivalent implementations
on other technologies due to the scalability and performance
of the aggregation, join and sort algorithms. Many of the
answers typically retrieved from summaries can be answered
directly from the detailed data.
How can summaries be maintained
on Teradata when updates are continuous?
Recent Teradata releases have added support
for automating the summaries that are still required. Aggregate
Join Index technology keeps the summaries current as the
underlying data is updated. This makes it possible to have
a continuous flow of updates, keeping summary tables current
to answer questions about the state of revenue, profitability,
inventory or customer value.
Some summaries better serve the business if they represent
a point in time. Daily updates or refreshes can be accomplished
easily using INSERT SELECT or Multiload. INSERT SELECT is
fully parallel and scalable, including when the source SELECT
statement is an aggregate query. Multiload can be used to
simultaneously update the detail data and the summaries
in a single batch process.
If updates are continuous, how
can users get reproducible results from their queries?
There are users who require the most current information
possible and users who require a specific format, e.g. end
of business day, end of week, end of quarter. In a continuous
update environment, users requiring current information
are satisfied, but others see a continuously changing state.
Views address this issue by limiting the scope of queries
to exclude the current changing data. Transactions and summaries
on time boundaries are particularly easy to manage since
each row of data will already contain the date. A simple
view to exclude the current timeframes gives the users their
consistent answers. And the absence of overhead with Teradatas
view implementation makes using views for all the queries
a popular convention.
What if tables do not have dates?
Tables such as Customer and Inventory require
additional effort to manage point-in-time questions. In
a traditional data warehouse, these tables contain only
the current values, not what the values might have been
before the last update. If the ODS or the EDW is used to
answer questions about a prior state of the business, additional
data modeling is required. Effective begin and
end dates are required to identify when a value
is relevant. And multiple copies of the record are required
to capture the values as time passes and change occurs.
Again, views are key to delivering data to users. In this
case, the view is used to hide the complexity of returning
the current row or the row that was current as of some specific
date.
How are tactical queries different
from strategic queries?
Strategic queries are wide in scope. They require
analysis of a large amount of data from many areas of the
database. They typically have a wide range of response time
requirements, from minutes to days. Strategic requests are
submitted by a small to moderate number of users with sophisticated
query tools and even by writing SQL directly.
Tactical queries are typically narrow in scope. They require
analysis of a relatively small number of rows from the database
that might be drawn from a number of different business
areas. They typically have response time requirements measured
in seconds, even sub-second. Tactical requests are submitted
by a very large number of end users in very high volume.
The users are rarely, if ever, writing SQL statements to
the database; instead they use an application that makes
the database requests behind the scenes. Some people call
this OLTP because it has some attributes similar to the
workloads of the transactional systems. However, tactical
requests differ from OLTP in several important ways (see
Table 1).
 |
How are tactical queries handled
differently by Teradata?
Strategic queries put many units of parallelism
to work simultaneously to deliver performance and scalability
over very large volumes of data. Tactical queries accessing
just a few rows are not efficient if they use many units
of parallelism that dont have applicable rows.
Teradata optimizes and executes tactical queries differently
than strategic queries. The focus is on minimizing rather
than maximizing the number of units of parallelism that
are asked to do work. The Teradata optimizer recognizes
tactical queries automati-cally, identifies which units
of parallelism have the relevant data and creates a plan
that most efficiently uses the parallelism of the system.
The result can be seen using Explain and looking for plans
that say single AMP or group AMP
as the target of a plan step. Teradata requires few resources
to execute the short queries, making the cost of adding
this workload much lower than most expect.
What can help deliver the best
performance and scalability for the tactical queries?
A Unique Secondary Index can be used when a
unique alternate path to the data exists. It allows the
optimizer to use two rather than all units of parallelism
to retrieve a row. When a tactical query needs to join several
tables together to get an answer, a Join Index (JI) might
be an appropriate choice. A JI allows the join to be done
during the load/update process so that the tactical queries
can simply retrieve the pre-joined result. An Aggregate
Join Index (AJI) can be used to pre-calculate aggregations.
A tactical query then can retrieve a single aggregate result
rather than reading and aggregating many rows.
Defining indexes always involves trade-offs that must be
carefully considered. An index must be stored on disk, which
requires space. It must be defined and built. There are
costs to maintaining the index as each update is performed.
These costs must be balanced against the improvement in
cost and response time for the tactical queries. High-volume
queries with constrained service level requirements are
the places to focus effort to identify indexing opportunities.
How can all this work be run simultaneously
on one system?
The Priority Scheduler Facility (PSF) is important
for managing a large volume of strategic queries at different
service levels. It is even more critical when adding the
tactical queries with stringent response time requirements.
The tactical query work can be assigned a high weight and
a high priority, giving it first access to CPU and I/O resources.
The very long-running analysis work can be constrained to
a portion of the total CPU and I/O resources to keep it
from interfering with work that has more constrained service
levels. Policy decisions on the relative importance of the
freshness of the data versus service levels for answers
to questions can be implemented directly. For example, during
peak periods it might be acceptable for the data to be somewhat
less fresh in order for call center employees to get immediate
responses to their inquiries.
Teradata has no limits on the number of concurrent users
and a very high limit on the number of concurrent active
queries. Memory, connections, queuing and executing in priority
order are all managed automati-cally by Teradata. Users
are managed via connections to the DBMS; no management of
OS user IDs is required. Teradata specializes in running
all aspects of concurrent work and making it easy to manage.
Summary
The ODS is a response to requirements to execute
a high volume of tactical queries against the current state
of the enterprise. ODSs are described in the industry as
separate components in the data warehouse architectureseparate
servers, copies of the data, users, tools and workloads.
Teradata offers technology that allows the ODS to be just
another workload on the EDW. Folding the workloads together
onto one copy of the data delivers advantages in cost, complexity,
time to market and flexibility, thus creating the Active
Data Warehouse. T
E-MAIL ME
Looking for answers to lifes mysteries?
Or would you just like to know more about the Teradata Warehouse
and related applications? Ask the Expert! E-mail
questions and comments to Todd at: todd.walter@ncr.com
Photo by Alex Hayden