Teradata Magazine Cover Teradata Magazine Online  
Register Help Password
Password:
Quick Links
Current Issue
Archives
Teradata.com
Teradata Magazine Rss Feed
ARCHIVES Search Teradata Magazine Online:  
 





























Alerting applications notify users about real-time data conditions as they occur.”








































































“Use near real-time ETL with long cycles to avoid data inconsistency when performing multi-pass SQL.”





 

REAL-TIME REALITY

Join the here and now of data warehousing.

by Justin Langseth

Given the success of data warehousing over the last three decades, and given the fast pace of today’s economy, many data warehousing teams have been pressured to make near real-time warehouses a reality. First this meant weekly loads instead of monthly; now it can mean loading new data every few hours, every 15 minutes or even continuously.

This new desire for real-time data is breaking the long-standing rule that data in a data warehouse is static except during nightly downtime for data loading. This rule has been the core tenant that allowed warehouses to function the way they do.

ETL tools could assume that they were operating during a batch window and that they can do as they please and not worry about disrupting active user sessions.

Query tools could assume that data would not change halfway through a complex, multi-pass SQL statement, and that query results could be cached and re-used between data loads.

Real-time warehousing breaks the rule and challenges these assumptions. As a result, there are five major challenges that are likely to be encountered by any data warehousing team moving in the direction of real-time processing. Fortunately, Teradata offers solutions.

Grab-it-and-go data
A data warehouse can only be considered real-time, or near real-time, when all or part of the data is updated, loaded or refreshed on an intra-day basis, without interrupting user access to the system. However, most ETL tools, whether based on off-the-shelf or custom-coded products, operate in a batch mode. They assume that they have free reign to drop tables, re-load tables and conduct other major database operations without disturbing simultaneous end-user queries.

To work around this challenge, data that changes throughout the day can be loaded into a parallel set of tables, either through a batch process that runs every several minutes or through a continuous trickle feed. In a Teradata-based system, a batch load is best conducted using either the FastLoad or MultiLoad utility, and a trickle feed is best performed using TPump. Once the load interval, say five minutes, is up, the freshly loaded tables are simply swapped into production and the tables with the now-stale data are swapped out of production. This can be accomplished through the dynamic updating of views or by simple table renaming.

The downside of this type of n-minute cycle-loading process is that the data in the warehouse is not truly real-time. This type of system is more accurately described as near real-time. For many data warehousing applications, near real-time is good enough. Also, using a near real-time approach helps relieve some of the other challenges of real-time warehousing.

For applications where true real-time data is required, the best approach is to trickle-feed the changing data from the source system directly into the data warehouse. With TPump, Teradata handles this type of real-time updating much better than many other database platforms, and this architecture can be used even in systems with large data volumes and high numbers of concurrent users.

Model of the future
The introduction of real-time data into an existing data warehouse or the modeling of real-time data for a new data warehouse presents some interesting data modeling issues. For instance, a warehouse that has all of its data aggregated at various levels based on a time dimension needs to consider the possibility that the aggregated information may be out of sync with the real-time data. Also, some metrics, such as month-to-date and week-to-date, might behave strangely with partial-day data that continuously changes.

Some tools handle these cases automatically; a calendar table can be used to ensure that comparisons are always made to the last full day of transaction data. The main issues regarding modeling, however, revolve around where the real-time data is stored and how best to link it into the rest of the data model.

When using the real-time or near real-time ETL approach described above, and when the real-time data is stored along with the historical data in the same fact tables, no special data modeling approaches are required. From the query tool’s perspective, there is nothing fundamentally different about a real-time data warehouse modeled in this manner than from the way data would be modeled in a non-real-time warehouse.

An alternate approach is to store the real-time data in separate warehouse fact tables. Depending on the type of fact table, many query tools that support table partitioning will be smart enough to automatically retrieve the real-time data when required. For tables where this will not work—for example, if not all columns or calculations are available in the real-time data set—alternate facts and attributes can be set up to point to the real-time data tables, and the query tool can be set up to drill across to real-time data templates from reports containing the historical data.

Another approach is to store the real-time data in different tables from historical data, but in the same table structure. Then, by using database views, the historical and real-time data tables are combined together so they look like one logical table from the query tool’s perspective and provide seamless access to historical and real-time data from the user’s perspective. This helps alleviate many of the problems associated with the separate partition approach, as the query tool or end users do not need to join two tables on their own.

Simplicity in a complex world
Today’s OLAP and query tools were designed to operate on top of unchanging, static historical data. Based on this assumption, they don’t take any precautions to ensure that the results they produce are not negatively influenced by data changes concurrent to query execution. In some cases, this can lead to inconsistent and confusing query results. Also, care needs to be taken to either disable query-tool result caching or to automatically purge the cache of reports that are based on real-time data upon each intra-day data load.

Relational OLAP tools are particularly sensitive to this problem because they perform all but the simplest data analysis operations by issuing multi-pass SQL. A multi-pass SQL statement is made up of many smaller SQL statements that sequentially operate on a set of temporary tables. For example, a simple request might show sales for a retailer at the category level along with a percent-to-total calculation.

An OLAP tool would convert this request into as many as seven SQL statements, first computing the total for all categories, then computing the individual category totals and percent-to-totals. What if the data in the warehouse changes between these steps? Perhaps some new transactions were added that will be included in the results of the second query but not the first. In this example, it’s quite possible that the total will be incorrect, and that the percent-to-total values will add up to more than 100%.

This example actually represents the best case for most real-world relational OLAP systems. Many users experience query response times that run from tens of seconds to multiple hours. For a well-tuned application, acceptable response times vary from 15 seconds to as much as five minutes. Also, typical OLAP queries can contain many pages of SQL code.

The good news is that you can avoid such inconsistencies. Consistency is only a problem when the data is changing quickly enough so that the data will be different at the end of a multi-pass query execution cycle than it was at the beginning. Using the near real-time ETL approach with a relatively long cycle time will alleviate this problem if the OLAP server is instructed not to send new jobs to the data warehouse during the load or flip.

Since a near real-time approach is not sufficient for some applications, there are ways to mitigate the affects of report data inconsistency. The simplest is to not allow users to perform the most complex multi-pass queries on real-time data, instead limiting real-time reporting to analyses that can be performed in a single SQL statement. An alternative is to have a less-frequently updated snapshot of the real-time data in a separate partition that can be used for complex analytical queries. This, however, requires a lot of setup, maintenance and user education.

If an application has to live with some internal report inconsistency, then it is important to educate users that this is a possibility. Uneducated users who view data that doesn’t properly add up are likely to assume that the system is malfunctioning and can’t be trusted.

Scale this, query that
The issue of query contention and scalability is the most difficult issue facing organizations deploying real-time data warehouse solutions. Data warehouses were separated from transactional systems in the first place because the type of complex, analytical queries run against warehouses don’t “play well” with lots of simultaneous inserts, updates or deletes.

There are ways to get around this problem, including the near real-time approaches already described, but when true real-time is a hard-and-fast requirement, you need a different solution.

Many real-time warehousing applications are relatively simple. If reports based on real-time data can be limited to simple and quick single-pass queries, many relational database systems will be able to handle the contention that is introduced. Frequently, the most complex queries in a data warehouse will access data across a large period of time. If these queries can be based only on the non-changing historical data, contention with the real-time load is eliminated.

Also, many users who are interested in real-time data might be better served by an alert notification application that sends them an e-mail or wireless message alerting them to real-time data conditions that meet their pre-defined thresholds. For example, a store manager might want to be alerted right away when inventory of any top-selling product falls to 10% of the usual amount. Designed properly, these types of systems can be scaled to 100 or 1,000 times more users than could possibly run their own concurrent, real-time warehouse queries.

There is always the option of adding more hardware to meet the scalability needs of large numbers of users performing sophisticated analyses on constantly changing, real-time data. Fortunately Teradata makes this easy, as more nodes can always be added to an existing system. While this approach might overcome any short-term capacity concerns, real-time query contention often has more to do with the way a project and data model are constructed than with the available database system resources.

Alert on demand
Most alerting applications currently associated with data warehouses have been mainly used to distribute e-mail versions of reports after the nightly data warehouse load. The availability of real-time data in a data warehouse makes alerting applications much more appealing, as users can be alerted to real-time conditions as they occur, not just on a nightly basis.

At this time, all data warehouse alerting technology, such as MicroStategy’s Narrowcaster or similar products from Business Objects and Cognos, work on a schedule basis or on an event basis. This means that in order to perform true real-time alerting, a process must continuously monitor incoming data and trigger events when appropriate.

One way to approximate real-time alerting without the added complexity of a real-time data-stream monitoring solution is to utilize a data warehouse alerting package on a scheduled basis, with the schedule typically set for one-, five-, 15- or 30-minute intervals. This approach works reasonably well and provides near real-time alerting. For a near real-time warehouse, you only need to set the alerting schedule to trigger right after the data is refreshed.

For a true real-time warehouse, the use of an n-minute cycle schedule will introduce a certain amount of latency, as an alert can’t be triggered until the next cycle window comes around after the threshold condition is met. If the cycle time is low enough and the alert is to be sent via e-mail anyway, a one- or five-minute latency might be acceptable for many applications.

In all real-time alerting systems, it is critically important that the users’ alert thresholds are properly managed. The problem occurs when static threshold definitions, which are fine for systems that load on a nightly or weekly basis, are applied to systems that update more frequently. It is unlikely that the user desires to be reminded every five minutes. For example, he or she might want to be alerted once inventory levels reach 5%, again at 2% and a final time when stock runs out.

The best systems let the users decide what types of alerts they will receive, when they will receive them and how. This means that any new data element might trigger alerts for some users but not others.

This type of management, with thresholds that automatically adjust and reset, is necessary for users to accept real-time alert systems. Unfortunately, it is not natively supported by current warehouse alerting tools. Until this support is added, the best approach is to use the tools’ post-service plug-in abilities to run custom SQL or procedures to directly update the users thresholds based on the current data conditions. It is this post-execution job that makes it critical that cycles don’t overlap, because if the next batch begins before the thresholds are updated, duplicate alerts are likely.

Real-time, any time
Real-time data warehousing and OLAP are possible using today’s technology, but obstacles exist. For the determined team armed with the right knowledge and experience, it is possible to make real-time reporting, analysis and alerting systems work. The challenge is making the right tradeoffs along the way so the systems meet the needs of the user base while ensuring that they don’t collapse under their own weight or cause existing production warehouses to malfunction.

The benefits of data warehousing in real-time are becoming clearer every day. With the right tools, designs, advice, approaches and, in some cases, tricks, real-time data warehousing is possible and will only become easier in the future. In any case, the time to begin planning and prototyping is now. T

Justin Langseth (langseth@claraview.com) is CTO of Claraview LLC, which provides system architecture, project management and consulting services to organizations implementing real-time data warehouses and alerting systems. Langseth also founded and served as CTO of Strategy.com, a real-time data analysis and alerting subsidiary of MicroStrategy.




Copyright by Teradata Corporation 2001-2007.