Real-time data loading tips and techniques.
by John Mayrack
During a recent flight, I had some time to think about what I've learned this last year while implementing near real-time data loading at
customer sites. Within our team of Teradata Professional Services practitioners performing similar tasks, we discover both commonality and
uniqueness in our projects. Such collective knowledge may even lead to a eureka moment when things snap into place to reveal a best practice.
Out of that learning come a few tips and techniques for you to consider.
| A matter of time |
|
The Data Warehousing Institute confirmed the importance of loading data in real time or near real time in its 2007
Best Practices in Operational BI: Converging Analytical and Operational Processes survey. According to the results,
38% of respondents indicated that accelerating the frequency of batch loads was important to their operational
BI strategy, and 24% indicated that trickle feeding data into the enterprise data warehouse as events occur was
important.
|
|
Companies load data in near real time for several reasons. Business users' demand for fresh data is at the top of the list. While many
organizations use operational data stores (ODSs) to load and hold fresh data, the popularity of the ODS is waning as more enterprise data
warehouses (EDWs) load data in near real time. Other motivations include delivering fresh and historical data to the call center, real-time
fraud application and Web site offer personalization. Many uses exist for fresh data coupled with historical data.
Another motivator is simply helping IT break out of the nightly batch load cycle since, for many companies, doing business across multiple
time zones has erased any distinction between night and day.
Still, several misconceptions exist about real-time data loading. Foremost is that "nightly batch thinking" applies to intra-day loading. This
produces fears that the volume of data is too large and the arrival rates are too fast for the EDW, which could cause performance problems.
Actually, the amount of data arriving throughout the business day tends to be trivial for scalable data warehouse solutions like those from
Teradata. Today, big retailers, dot-coms and telcos load millions of rows throughout the day, often with little impact on system performance
by using Teradata Active Systems Management to control the entire workload on the platform.
The second most common misconception is that near real-time loading is expensive. Typically, the additional workload of active loading is
between 1% and 5% of the total server capacity. Unless your site runs the Teradata server at 100% capacity all day, the chances are slim that
you will need to buy additional hardware. Occasionally, companies must expand their network capacity to handle thousands of additional messages
per minute, but this is not common. The biggest costs are the same as in batch extract, transform and load (ETL)—the labor to clean up the
data. What has changed is the need for a service level agreement (SLA) defining the transformations and load to run in seconds or minutes
instead of overnight.
Real challenges in real time
Data transformations and data quality are the biggest parts of an active load implementation. The challenge is balancing the service level
targets, data transformations, data cleansing, server capacity and architectures into the optimum solution. The designer must choose the best
place to do the transformations: at the source, in the middle tier, inside the Teradata Database or a combination. (See figure 1, left.)
It is always desirable to clean the data as close to the source as possible because it improves all downstream processes. Yet with the full
parallelism of the Teradata Database, an extract, load and transform (ELT) approach is often the best choice because the transformations are
dramatically faster inside the EDW. Note that some ETL vendors are now offering "push down" transformations that can run at the source, run in
the middle tier or leverage the parallelism of the Teradata system. Advanced data integration engines can optimize placement of transformation
routines on servers, and some can provide the transforms as Web services.
Another balancing decision involves table lookups to augment or correct the incoming data stream: too many I/O requests and the loading slows
down; too few and the data is poor quality. In general, lookups are best done on a database server where the table lookup is to a local disk
and network hops can be avoided. Regardless, to ensure good performance, the designer must know how many table lookups are done for each
record type, since each lookup costs an I/O to disk.
Ideally, the designer wants every part of the active load design to be simple and flexible. This is when the decision to use mini-batch or
continuous streams comes in. When a lot of transformations and table lookups must be done, the best design is often mini-batch loading into
staging tables inside the EDW. Through a combination of SQL, user-defined functions (UDFs), stored procedures and joins, a large number of
transformations, such as parsing and co-located table lookups, can be quickly done in parallel using the ELT design. The ELT approach can also
eliminate some EDW extracts to the middle-tier server, further reducing complexity and elapsed times.
Still, some transformations are best done before data is loaded into the EDW. An example is assigning surrogate keys for better row
distribution purposes. It can take multiple SQL statements on one row to generate a surrogate key, especially when natural business
keys are reused. This process can sometimes be done faster and easier outside the EDW.
Similarly, some processing can be done only in the middle-tier ETL server, such as extensive name and address matching and householding using
third-party data cleansing tools. Generally, the more complex the transformations, the more likely they will be done in a data integration server
with multi-step, in-memory processing capability.
Planning an active load project
Active load projects follow the same steps as any ETL project. It is best to begin with business goals, especially when a dozen business
managers are demanding their data gets loaded in real time. Even a cursory estimate of return on investment (ROI) for each data stream will
help settle debates over who gets fresh data first, second and third. And if the business users permit it, starting with a relaxed SLA in a
"crawl, walk, run" skills-building mode is a good approach. At this point you should involve your physical data-modeling expert to support
both your reporting needs and your near real-time loading objectives.
Next, focus on source-system analysis, especially data arrival rates and data cleanliness. Watch out for source systems that cannot send the
data as fast as you want it. Sometimes the source system sends a data feed only once or twice a day, and funding is not available to get that
changed. In some operational systems, you can add database triggers or use change-data-capture tools to send the data sooner.
Similarly, agents and adapters that send data out through message queues are often already in the operational systems. Look for these first.
The active load process can often "subscribe" to existing "published" data on the queues. In many cases, these agents can be modified to
include additional data and transformations.
Allowable latency and transformation complexity
Latency is the time between when the original transaction takes place and when the business user can see the data from that transaction in the
data warehouse. Latencies of two to 10 minutes for point-of-sale devices or e-commerce Web sites are not uncommon. However, latencies between
a half-hour and two hours are much more common. Transformation complexity can be high (data merged from two to three sources, referential
integrity, surrogate keys, etc.), medium (SQL, UDF, stored procedure transforms) or low (simple string parsing).
The active load design requires a balance between the SLA-defined latency and complexity of transformations. (See figure 2, below.) This
often drives the selection of the data loading design, choosing between message streams, replication streams, ELT mini-batch or ETL
mini-batch. Low latency requires continuous inserts with simple transformations to satisfy the business user. In contrast, longer latency
permits designs where more complex data transformations can be done either via SQL inside the Teradata Database or in a middle-tier ETL
server processing mini-batches.
A best practice during implementation is to perform a 10X surge test. This means pushing 10 times the normal amount of data through the active
load subsystem to see whether anything breaks. Technically nothing should break, but you should be able to push testing to the capacity limit.
To do that, keep increasing the workload until a server resource hits its limit. Then you can see where you are at risk and need to take action.
To be realistic, these tests should include workload contention. Your near real-time load process should have high-priority scheduler settings.
Although each load task uses a small amount of CPU resource, it is time-critical and needs high-priority CPU access.
| Tips to implement an active load project |
|
| 1. |
Start with concrete business goals.
|
| 2. |
Develop a service level agreement with a "crawl, walk, run" approach.
|
| 3. |
Involve your physical data-modeling expert.
|
| 4. |
Focus on source-system analysis.
|
| 5. |
Look for data you need that is already published in message queues.
|
|
|
If the data comes in through message queue streams, the testers will have to find ways to stuff the queue with varying data volumes to create
a severe backlog of records. You can do this by writing a simple program to get and put records from a flat file into the messaging queue
(Java Messaging Service [JMS], WebSphere MQ, MSMQ, etc.). You should also run the same flat file directly into Teradata TPump to isolate the
performance characteristics of the JMS server from the Teradata load processes. This information will be invaluable should a production
performance problem surface months later. It is also a good idea to run the surge testing periodically—when something changes, before a peak
season, etc.
In addition, a "catch-up" alternative design may be needed to compensate for Teradata system outages, such as system upgrades. In these cases,
the active load design might include the capability to switch from message queue loads to one of several bulk load techniques for a short
time. Consequently, the active load SLA should include latency definitions for both surge and normal workloads, and some provision for planned
and unplanned outages.
Some organizations build latency measures into the load process by adding a timestamp identity column to the row. Comparing this load
timestamp with a timestamp from the source system will help you monitor your active load SLA via simple reports, which will come in handy with
your funding sponsors.
Eliminate barriers
Contrary to popular rumors, performance and expense are not serious obstacles during active loading—unless you are already out of server
capacity. Instead, the Teradata Professional Services team found that the biggest barriers to active loading are:
|
Applying batch ETL thinking to streams and mini-batch is contradictory, as the data volumes and system designs are different for
the two approaches.
|
|
Data transformation and cleanup are the most labor-intensive parts of the project.
|
|
Surge testing is mandatory to ensure the system's performance and stability during unusual business events.
|
As the Teradata Professional Services team continues working on projects at customer sites, I am certain more tips and techniques will be
revealed and eureka moments shared. T
John Mayrack, principal consultant at Teradata, has more than 30 years of experience in IT, 27 of which are in the retail industry. He has
multiple Teradata certifications and is experienced in software design and development, project management and Teradata system implementations.
Teradata Magazine-September 2008
|