by Justin Langseth
Given the success of data warehousing over the last three
decades, and given the fast pace of todays economy,
many data warehousing teams have been pressured to make
near real-time warehouses a reality. First this meant
weekly loads instead of monthly; now it can mean loading
new data every few hours, every 15 minutes or even continuously.
This new desire for real-time data is breaking the long-standing
rule that data in a data warehouse is static except during
nightly downtime for data loading. This rule has been
the core tenant that allowed warehouses to function the
way they do.
ETL tools could assume that they were operating during
a batch window and that they can do as they please and
not worry about disrupting active user sessions.
Query tools could assume that data would not change halfway
through a complex, multi-pass SQL statement, and that
query results could be cached and re-used between data
loads.
Real-time warehousing breaks the rule and challenges
these assumptions. As a result, there are five major challenges
that are likely to be encountered by any data warehousing
team moving in the direction of real-time processing.
Fortunately, Teradata offers solutions.
Grab-it-and-go data
A data warehouse can only be considered real-time, or
near real-time, when all or part of the data is updated,
loaded or refreshed on an intra-day basis, without interrupting
user access to the system. However, most ETL tools, whether
based on off-the-shelf or custom-coded products, operate
in a batch mode. They assume that they have free reign
to drop tables, re-load tables and conduct other major
database operations without disturbing simultaneous end-user
queries.
To work around this challenge, data that changes throughout
the day can be loaded into a parallel set of tables, either
through a batch process that runs every several minutes
or through a continuous trickle feed. In a Teradata-based
system, a batch load is best conducted using either the
FastLoad or MultiLoad utility, and a trickle feed is best
performed using TPump. Once the load interval, say five
minutes, is up, the freshly loaded tables are simply swapped
into production and the tables with the now-stale data
are swapped out of production. This can be accomplished
through the dynamic updating of views or by simple table
renaming.
The downside of this type of n-minute cycle-loading process
is that the data in the warehouse is not truly real-time.
This type of system is more accurately described as near
real-time. For many data warehousing applications, near
real-time is good enough. Also, using a near real-time
approach helps relieve some of the other challenges of
real-time warehousing.
For applications where true real-time data is required,
the best approach is to trickle-feed the changing data
from the source system directly into the data warehouse.
With TPump, Teradata handles this type of real-time updating
much better than many other database platforms, and this
architecture can be used even in systems with large data
volumes and high numbers of concurrent users.
Model of the future
The introduction of real-time data into an existing data
warehouse or the modeling of real-time data for a new
data warehouse presents some interesting data modeling
issues. For instance, a warehouse that has all of its
data aggregated at various levels based on a time dimension
needs to consider the possibility that the aggregated
information may be out of sync with the real-time data.
Also, some metrics, such as month-to-date and week-to-date,
might behave strangely with partial-day data that continuously
changes.
Some tools handle these cases automatically; a calendar
table can be used to ensure that comparisons are always
made to the last full day of transaction data. The main
issues regarding modeling, however, revolve around where
the real-time data is stored and how best to link it into
the rest of the data model.
When using the real-time or near real-time ETL approach
described above, and when the real-time data is stored
along with the historical data in the same fact tables,
no special data modeling approaches are required. From
the query tools perspective, there is nothing fundamentally
different about a real-time data warehouse modeled in
this manner than from the way data would be modeled in
a non-real-time warehouse.
An alternate approach is to store the real-time data
in separate warehouse fact tables. Depending on the type
of fact table, many query tools that support table partitioning
will be smart enough to automatically retrieve the real-time
data when required. For tables where this will not workfor
example, if not all columns or calculations are available
in the real-time data setalternate facts and attributes
can be set up to point to the real-time data tables, and
the query tool can be set up to drill across to real-time
data templates from reports containing the historical
data.
Another approach is to store the real-time data in different
tables from historical data, but in the same table structure.
Then, by using database views, the historical and real-time
data tables are combined together so they look like one
logical table from the query tools perspective and
provide seamless access to historical and real-time data
from the users perspective. This helps alleviate
many of the problems associated with the separate partition
approach, as the query tool or end users do not need to
join two tables on their own.
Simplicity in a complex world
Todays OLAP and query tools were designed
to operate on top of unchanging, static historical data.
Based on this assumption, they dont take any precautions
to ensure that the results they produce are not negatively
influenced by data changes concurrent to query execution.
In some cases, this can lead to inconsistent and confusing
query results. Also, care needs to be taken to either
disable query-tool result caching or to automatically
purge the cache of reports that are based on real-time
data upon each intra-day data load.
Relational OLAP tools are particularly sensitive to this
problem because they perform all but the simplest data
analysis operations by issuing multi-pass SQL. A multi-pass
SQL statement is made up of many smaller SQL statements
that sequentially operate on a set of temporary tables.
For example, a simple request might show sales for a retailer
at the category level along with a percent-to-total calculation.
An OLAP tool would convert this request into as many
as seven SQL statements, first computing the total for
all categories, then computing the individual category
totals and percent-to-totals. What if the data in the
warehouse changes between these steps? Perhaps some new
transactions were added that will be included in the results
of the second query but not the first. In this example,
its quite possible that the total will be incorrect,
and that the percent-to-total values will add up to more
than 100%.
This example actually represents the best case for most
real-world relational OLAP systems. Many users experience
query response times that run from tens of seconds to
multiple hours. For a well-tuned application, acceptable
response times vary from 15 seconds to as much as five
minutes. Also, typical OLAP queries can contain many pages
of SQL code.
The good news is that you can avoid such inconsistencies.
Consistency is only a problem when the data is changing
quickly enough so that the data will be different at the
end of a multi-pass query execution cycle than it was
at the beginning. Using the near real-time ETL approach
with a relatively long cycle time will alleviate this
problem if the OLAP server is instructed not to send new
jobs to the data warehouse during the load or flip.
Since a near real-time approach is not sufficient for
some applications, there are ways to mitigate the affects
of report data inconsistency. The simplest is to not allow
users to perform the most complex multi-pass queries on
real-time data, instead limiting real-time reporting to
analyses that can be performed in a single SQL statement.
An alternative is to have a less-frequently updated snapshot
of the real-time data in a separate partition that can
be used for complex analytical queries. This, however,
requires a lot of setup, maintenance and user education.
If an application has to live with some internal report
inconsistency, then it is important to educate users that
this is a possibility. Uneducated users who view data
that doesnt properly add up are likely to assume
that the system is malfunctioning and cant be trusted.
Scale this, query that
The issue of query contention and scalability
is the most difficult issue facing organizations deploying
real-time data warehouse solutions. Data warehouses were
separated from transactional systems in the first place
because the type of complex, analytical queries run against
warehouses dont play well with lots
of simultaneous inserts, updates or deletes.
There are ways to get around this problem, including
the near real-time approaches already described, but when
true real-time is a hard-and-fast requirement, you need
a different solution.
Many real-time warehousing applications are relatively
simple. If reports based on real-time data can be limited
to simple and quick single-pass queries, many relational
database systems will be able to handle the contention
that is introduced. Frequently, the most complex queries
in a data warehouse will access data across a large period
of time. If these queries can be based only on the non-changing
historical data, contention with the real-time load is
eliminated.
Also, many users who are interested in real-time data
might be better served by an alert notification application
that sends them an e-mail or wireless message alerting
them to real-time data conditions that meet their pre-defined
thresholds. For example, a store manager might want to
be alerted right away when inventory of any top-selling
product falls to 10% of the usual amount. Designed properly,
these types of systems can be scaled to 100 or 1,000 times
more users than could possibly run their own concurrent,
real-time warehouse queries.
There is always the option of adding more hardware to
meet the scalability needs of large numbers of users performing
sophisticated analyses on constantly changing, real-time
data. Fortunately Teradata makes this easy, as more nodes
can always be added to an existing system. While this
approach might overcome any short-term capacity concerns,
real-time query contention often has more to do with the
way a project and data model are constructed than with
the available database system resources.
Alert on demand
Most alerting applications currently associated
with data warehouses have been mainly used to distribute
e-mail versions of reports after the nightly data warehouse
load. The availability of real-time data in a data warehouse
makes alerting applications much more appealing, as users
can be alerted to real-time conditions as they occur,
not just on a nightly basis.
At this time, all data warehouse alerting technology,
such as MicroStategys Narrowcaster or similar products
from Business Objects and Cognos, work on a schedule basis
or on an event basis. This means that in order to perform
true real-time alerting, a process must continuously monitor
incoming data and trigger events when appropriate.
One way to approximate real-time alerting without the
added complexity of a real-time data-stream monitoring
solution is to utilize a data warehouse alerting package
on a scheduled basis, with the schedule typically set
for one-, five-, 15- or 30-minute intervals. This approach
works reasonably well and provides near real-time alerting.
For a near real-time warehouse, you only need to set the
alerting schedule to trigger right after the data is refreshed.
For a true real-time warehouse, the use of an n-minute
cycle schedule will introduce a certain amount of latency,
as an alert cant be triggered until the next cycle
window comes around after the threshold condition is met.
If the cycle time is low enough and the alert is to be
sent via e-mail anyway, a one- or five-minute latency
might be acceptable for many applications.
In all real-time alerting systems, it is critically important
that the users alert thresholds are properly managed.
The problem occurs when static threshold definitions,
which are fine for systems that load on a nightly or weekly
basis, are applied to systems that update more frequently.
It is unlikely that the user desires to be reminded every
five minutes. For example, he or she might want to be
alerted once inventory levels reach 5%, again at 2% and
a final time when stock runs out.
The best systems let the users decide what types of alerts
they will receive, when they will receive them and how.
This means that any new data element might trigger alerts
for some users but not others.
This type of management, with thresholds that automatically
adjust and reset, is necessary for users to accept real-time
alert systems. Unfortunately, it is not natively supported
by current warehouse alerting tools. Until this support
is added, the best approach is to use the tools
post-service plug-in abilities to run custom SQL or procedures
to directly update the users thresholds based on the current
data conditions. It is this post-execution job that makes
it critical that cycles dont overlap, because if
the next batch begins before the thresholds are updated,
duplicate alerts are likely.
Real-time, any time
Real-time data warehousing and OLAP are possible
using todays technology, but obstacles exist. For
the determined team armed with the right knowledge and
experience, it is possible to make real-time reporting,
analysis and alerting systems work. The challenge is making
the right tradeoffs along the way so the systems meet
the needs of the user base while ensuring that they dont
collapse under their own weight or cause existing production
warehouses to malfunction.
The benefits of data warehousing in real-time are becoming
clearer every day. With the right tools, designs, advice,
approaches and, in some cases, tricks, real-time data
warehousing is possible and will only become easier in
the future. In any case, the time to begin planning and
prototyping is now. T
Justin Langseth (langseth@claraview.com)
is CTO of Claraview LLC, which provides system architecture,
project management and consulting services to organizations
implementing real-time data warehouses and alerting systems.
Langseth also founded and served as CTO of Strategy.com,
a real-time data analysis and alerting subsidiary of MicroStrategy.