Front-running companies make use of active load architectures.
by Dan Graham
Most Teradata clients who want to activate their enterprise data warehouse (EDW) start with active load. Fortunately, active load is getting
easier as Teradata TPump matures and more best-practices guides become available. Some sites start out simple; some leap straight into
continuous data loading. Respondents to a recent survey by The Data Warehousing Institute cited numerous techniques as being important to
their operational business intelligence (BI) strategy, including accelerating the frequency of batch loads (38%) and trickle feeding data into
the EDW as events occur (24%).
But how do these front-runners who have already blazed the path to active load do it? What architectures are they using? Here we examine four
of the many ways data gets from the front line into the EDW and back to the operational front-line users.
Simple extract, load and transform (ELT)
Active loading can be implemented using very simple ELT methods. One telecommunications company wanted to give its 4,000-plus call center
customer service representatives (CSRs) more accurate and timely information to solve customer problems faster. When the company realized the
EDW had 80% of the data needed, the goal became to load the other 20% into the EDW.
Every day as CSRs handle calls, rows are written into the call center database. Similarly, self-service interactive voice response records
are captured showing customer menu choices. Since only 1,000 of these records are loaded every five minutes, the company decided to use BTEQ
mini-batches into staging tables. iWay software is used for change-data capture to extract data from the source database. A BTEQ job then
inserts these rows into a Teradata staging table every five minutes to begin the transformation process, which includes:
|
Account and phone number matching
|
|
Call event categorization
|
|
Accommodating data from new customer channels
|
|
Moving the data into the final target base tables
|
|
Loading several hundred thousand records daily
|
Using the fresh data, a BEA WebLogic portal delivers recent contact history to the CSRs via Web services calls to the iWay integration server,
which issues SQL requests to the Teradata Database. The application has a two-second response time service level agreement (SLA) for the
active accesses. Averaging 50,000 inquiries a day with peak loads of 10,000 tactical queries an hour, the Teradata system easily supports
concurrent call center BTEQ loading, MultiLoad jobs and hundreds of concurrent tactical queries.
TPump straight-through loading
In a second example, a major retailer wanted up-to-the-minute reports for supply chain managers and store managers. Operational managers use
the reports to detect stores selling items faster or slower than expected. They wanted to duplicate the tactics of the stores that sell beyond
expectations in the stores that are lagging behind. So, the retailer's goal was to produce near real-time reports and scorecards in fewer than
five minutes—from printing the consumer's register receipt to operational reporting. With point-of-sale data already loading into the mainframe,
they decided to pass the purchase receipts data to TPump via IBM WebSphere MQ. (See figure 1 above.)
Since a transaction receipt has many complex line items—stock keeping units (SKUs), payment type, price, coupon discounts, taxes, etc.—the
programmers use an access module for transformations. The access module is a program written in C that is linked to TPump. The programmers
need TPump to insert records into as many as 40 tables from one purchase transaction. For each transaction, the access module calls one of 40
subroutines to properly reformat and transform the incoming record types.
Because of the company's database design, all data is loaded via inserts, switching to an upsert only if a "record exists" error occurs. This
approach doubles the loading throughput versus using only upserts. The access module routine also includes the "Notify Exit," which captures
errors and sends them back to the access module to be processed in real time. The application then fixes the most common errors, thus
increasing data accuracy while reducing downtime.
During the busy seasons, the system loads up to 4 million transactions per day with an average of 10 to 12 records per transaction (40 million
to 50 million records per day). The CPU overhead for the entire process on the Teradata server is only 1% to 2%. Unfortunately, on peak sales
days, the overloaded mainframe becomes a performance bottleneck. But the designers were smart and built the TPump processes and scripts to run
on either UNIX or mainframe servers.
After the design was put into production in 2005, two obvious benefits developed:
|
The data is loaded continuously and only once into the Teradata system.
|
|
All of the maintenance is contained in a single access module so new subroutines can be added.
|
TPump into staging tables
Because of the unpleasant reality that their competitor is only two clicks away, dot-com companies have a tendency to think of everything in
real time. Consequently, one travel company had so many real-time load requests that it had to prioritize projects by return on investment
(ROI). To cope with the flood of data, the company designed an architecture using TPump that could be replicated to handle different data
streams with simple scripting changes. The variety of source data—click-streams, supplier prices, Web page graphics, offers and deals, sales
bookings—and service level goals demanded a robust design.
The design uses two TPump jobs, each servicing one WebSphere MQ end point. One of the two TPumps is in sleep mode while the other actively
loads staging tables. The mini-batch "end-of-file" condition is forced every two minutes or after a specified number of rows have been loaded.
Regardless of which happens first, the TPump job stops taking in new rows and starts transformations on its mini-batch via stored procedures.
Meanwhile, the sleeping TPump job wakes up and starts accepting rows from WebSphere MQ into a second staging table. After the first TPump job
finishes emptying the staging table into the base tables, it goes to sleep and waits for its next turn. (See figure 2.)
In 2005, the company started running eight TPumps, loading more than 1 million rows a day in near real time. The TPump jobs were also designed
to do catch-up after a planned outage. In this case, the TPump job will read large flat files of saved updates and blast them in once the
Teradata system is operational again. Today, the number of TPump load streams has expanded for data when latency must be short. When the
latency does not have to be measured in minutes, the programmers use MultiLoad and FastLoad mini-batches intra-day followed by extract,
transform and load (ETL) late at night.
This is the quintessential mixed workload application for a data warehouse. Consequently, Teradata Active System Management plays a vital
role by managing service levels for the widely varied user communities and tasks. Using priority management techniques, the consumer Web site
typically achieves one- to two-second performance, reports run normally and the load jobs run in spurts as the data arrives.
Ultimately, the company was able to feed its Web site with fresh data on price changes, offers, advertising graphics and consumer campaigns in
real time using its Teradata system. While loading up to a million records a day with TPump, the company also concurrently serves 400,000
decision support queries daily and more than 1 million Web site requests to its Teradata system.
Replication and ELT
Another dot-com company—this one an e-tailer—also needed multiple load strategies running at the same time. One of the biggest concerns—after
order bookings, of course—is marketing campaigns to online users. Like an air traffic control tower, this company has dozens of marketing
people cross-selling to customers to close added business, adjust ongoing deals and reach out to customers every day. Campaigns as well as
deals could start one day and end the next. Marketing staff must invent campaigns, assess profitability, recruit customers to visit the Web
site and sell products—all within 48 hours. With the help of the Teradata Relationship Manager and Teradata Value Analyzer, this stress was
alleviated. (See figure 3.)
Because of SLAs on data latencies, more than 70% of the e-tailer's data has to be loaded intra-day, leaving mostly ERP data for nightly batch
ETL. The company turned to three products for fast deployment of their load strategies: GoldenGate TDM (Transactional Data Management) for
replication, Business Objects (an SAP company) Data Integrator for nightly batch ETL and Oracle's Sunopsis data integration tools for ELT
transformations. With GoldenGate TDM, the company can grab Oracle database "redo logs" from the Web site and load them into Teradata staging
tables at the rate of 5 million to 20 million a day.
More than 650 Oracle tables are monitored for updates that are replicated to the Teradata system each day. Because GoldenGate TDM supports
many data sources, it is also used to capture "redo logs" from Microsoft SQL Server applications. To feed the Business Objects virtual Rapid
Marts inside the EDW, the company relies heavily on Sunopsis to run 180,000 mini-batch loads per day. Once the batches of data are in the
staging tables, Sunopsis, coupled with a job-scheduling product, executes around 180 transformation jobs per hour throughout the day. A few
BTEQ jobs are also run for lower-priority transformations.
This company's compelling results provide its management with full Business Objects dashboards that have hourly profit-and-loss summaries.
Along with the expected merchandising reports, campaign results and return customer analysis, the company has push-button consumer metrics
showing new customers, retention rates, repeat buys, average revenue by customer segment and product line and, of course, profit measures.
Common themes
Each of these companies shares at least three outcomes from its use of active load: First, the results dispel the myth that near real-time
loading has to be expensive. One company said it cost 2% of its Teradata server capacity to do active loading. While each company spent the
typical project costs in labor, programming and operational tasks, nearly all said no hardware upgrade was required.
Second, each company relies on Teradata Active System Management to deliver on front-line user service levels while maintaining good load speed.
Last, each of these sites was continuously loading all day, every day—way back in 2005. These front-runners have done the debugging and
established best practices for other companies to follow. That's good news for the EDW sites that will be joining the other front-runners who
do intra-day, active loading already. T
Dan Graham has more than 35 years in IT and leads Active Data Warehouse Technical Marketing at Teradata.
Teradata Magazine-June 2008
|