Teradata Magazine Online

SYMPHONY OF TWO

Near real time and the next-generation ETL tool

by Dan Linstedt

Recently I've been challenged by enterprise application integration (EAI) and extract, transform and load (ETL) vendors to predict what the user population might want to see next year. What's around the corner, they ask? Let's take a realistic look at integrating data—the task at hand in a near real time (NRT) functional environment. I've also been asked whether EAI will take over the ETL world or vice versa. I'm here to answer ... neither. They play in concert in tomorrow's vision. Each has its role, and each plays it well.

Clearing the data pipeline
EAI has taken the responsibility of guaranteed delivery. Rain or shine, sleet or snow, data gets to its destination as fast as possible. Even if the network goes down, and comes up again next year, it will get there. If there are alternate routes, EAI tools reroute the information and make sure it gets delivered.

These tools also ensure distribution. If there is more than one destination, they make sure that all destinations hear the message. They play a huge role in a transaction-by-transaction analysis, quite possibly reformatting the transaction on a one-by-one basis. But without extensive programming, EAI tools do not do well at taking care of data integration, aggregation and consolidation, especially at an enterprise level. This is the job of the ETL tool.
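The two EAI duties described above, guaranteed delivery and distribution to every destination, can be illustrated with a minimal store-and-forward sketch. It is not taken from any EAI product: `FlakyDestination`, `distribute` and the `max_rounds` parameter are hypothetical names invented for illustration, and an in-memory queue stands in for the durable storage a real tool would use.

```python
from collections import deque


class FlakyDestination:
    """Hypothetical destination that is unreachable for its first few attempts."""

    def __init__(self, name, failures_before_up):
        self.name = name
        self.failures_left = failures_before_up
        self.received = []

    def deliver(self, message):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise ConnectionError(f"{self.name} is unreachable")
        self.received.append(message)


def distribute(messages, destinations, max_rounds=10):
    """Fan every message out to every destination, retrying after failures.

    Each destination has its own queue, and a message leaves the queue only
    after a confirmed delivery, so an outage delays data instead of losing it.
    Returns the messages still undelivered, keyed by destination name.
    """
    queues = {dest: deque(messages) for dest in destinations}
    for _ in range(max_rounds):
        for dest, pending in queues.items():
            while pending:
                try:
                    dest.deliver(pending[0])
                except ConnectionError:
                    break  # destination is down; try again next round
                pending.popleft()  # remove only after a confirmed delivery
        if not any(queues.values()):
            break  # every destination has heard every message
    return {dest.name: list(pending) for dest, pending in queues.items() if pending}


warehouse = FlakyDestination("warehouse", failures_before_up=0)
billing = FlakyDestination("billing", failures_before_up=3)
undelivered = distribute(["txn-1", "txn-2"], [warehouse, billing])
```

Here the billing destination rejects the first three attempts, yet it still ends up with both transactions in their original order. Note what the sketch does not do: it moves each transaction as is and makes no attempt to integrate, aggregate or consolidate across streams.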

EAI is important but it has yet to master the cross-stream integration layers that ETL already offers, not to mention the graphical user interface (GUI) development environment for data flow validation and control. This is where the ETL tools play, not where EAI vendors should go. They should continue focusing on data delivery, making it faster, better, cheaper and, of course, more reliable. Consistency is the backbone of EAI; that will never go away. It is also EAI's role to ensure transactions are delivered on time. In this case, speed is an issue and will continue to be as the data volumes grow along with the source systems.

For EAI, latency is an issue, as is integration. It will no longer suffice to simply have coded efforts developed in an EAI solution, nor is it good enough to "initialize" the world of ETL. There will not be enough time in the day to download ever-changing data, and the complex world of cross-integrated solutions will require that the data be kept in sync across never-ending streams of processes.

So what does the future hold for EAI? I don't know; I haven't made that analysis of the market yet. Hopefully, by this time next year we'll be able to write intelligently about what is currently a young market (by software standards). At least for tomorrow, ETL and EAI won't necessarily merge functionality, although there will be some bleed-over in the technology. Best-of-breed ETL won't be able to compete with best-of-breed EAI, simply because the technology band is too wide.

Before choosing to install an EAI system, be sure to have the business justification in place. Keep an open mind, and remember that EAI shouldn't be put in place of ETL, and ETL shouldn't be put in place of EAI.

Transforming transformation
ETL can be expected to change a lot, especially when it comes to supporting NRT. Its integration with EAI is one of the cornerstones the industry needs in order to survive. My prediction is that ETL will not survive on batch alone in the new industry. Of course, that's no secret to anyone. Almost everyone these days is investing in some EAI solution and has had several ETL tools from which to choose for quite some time.

But before you try NRT with your ETL, be warned. There are serious implications dictated by complex business rules that will stop you dead in your tracks. It's a game of "push the synchronization point around until it lands in the right hole." In other words, today it's a series of workarounds and undulations (a nightmare for the systems architect) to ensure success.

The point is that ETL must expand beyond individual processes to encompass workflows and dynamic data sharing between workflows. It must also allow messaging between workflows. The more data-driven the ETL and its processes become, the more likely it is to be able to handle NRT feeds. See the ETL wish list below for features vendors need to address.

Putting the squeeze on ETL
There is an additional side to these issues: the relational database (RDBMS) engine. I liken it to a squeeze play on ETL, with EAI on the left and RDBMS on the right, both pushing to the middle ground where ETL resides and ETL is pushing back (Figure 1). In a sense, the lines will be blurred between which tools do what. This is good because it means all the tools and vendors are migrating forward and sharing the responsibility. After all, balance is one of the major keys to success in best-of-breed tool sets.

This idea is coming to light as database and ETL vendors rush to provide support for very large database (VLDB) systems. Today, most database engines struggle to handle 10+ terabytes of information, as well as massive integration efforts of new information. In other words, they are attempting to handle integration, volume and latency issues on their own.

Getting the data into the database is typically a huge issue, one where ETL breaks down unless there is a strong partnership with the RDBMS vendor. This leads us to a best-of-breed discussion on database features. The RDBMS vendors know best how to load massive data sets through their bulk-loading systems. However, when it comes to massive integration efforts, that might be a different story.

The RDBMS wish list (below) provides a partial listing of new features that RDBMS engines will have to take on to support the new world of ETL and EAI. Teradata already has many of these features built in, while Oracle, DB2 UDB, Sybase and SQL Server appear not to.

As these tools come closer together, so will the many roles of the DBA. He or she will have to learn about ETL and possibly EAI. In a perfect world, there would be one person designing the EAI, ETL and database solution, one person implementing and maintaining it all. We want to cut the costs of managing and maintaining the physical layers of each area: EAI, ETL and RDBMS.

Adapt today, survive tomorrow
This is definitely tomorrow's market for ETL. Will it come right away? No. There's a lot of engineering work involved in just these suggestions alone, not to mention the other 100 that I didn't address here. However, smart ETL vendors will begin to adapt their tools to handle these and other critical features in the NRT world in order to handle additional on-the-fly information.

I think the industry will begin to see a merging of RDBMS, EAI and ETL functionality. There will be a lot of changes, possibly mergers or at least alliances between RDBMS vendors and ETL vendors. EAI vendors will continue to link to ETL and integrate the best they can. Just don't forget: the focus of successful warehousing will be a paradigm shift, from ETL compensating for deficiencies in the RDBMS engines to an RDBMS designed for data warehousing, with ETL utilizing best-of-breed functionality.


RDBMS wish list

RDBMS vendors should consider the following when upgrading their tool sets:

* Larger, more open application programming interfaces (APIs) for native bulk-loading systems. They will allow integration of the bulk-loader directly into the ETL and EAI tools.

* Full parallelism. Instead of relying on the developer to "code" or architect a multi-path parallel load system, this should be built into the load mechanism.

* Full partitioning support. Again, hiding the complexity from the programmers and from the ETL tool, this should be dynamically set up. RDBMS engines should help hide the partitioning scheme from ETL and EAI. We as designers/architects shouldn't be forced to deal with the physical partitioning layer of the database just for load reasons.

* Stored procedure/trigger support. These should be callable through an ETL tool API and passed one or more blocks of data to execute at one time, rather than calling a stored procedure/trigger in a transactional sense (one at a time). This will help ETL tools with the concept of ELT as well.

* Auto-balanced usage of the CPUs, RAM and disk. These complex routines will need to measure, adjust and report the metrics. Due to volume loads, we (architects/designers) don't have time to perform these complex tasks. However, without these tasks being performed by the RDBMS engine underneath, our ETL processes quickly become bound to I/O, RAM and/or CPU. Again, volume is driving this issue. ETL tools should be able to leverage these auto-balance routines by communicating through the complex API.

* Threaded index updates. Indexes are too often "leftovers," handled after the load process is done. This won't work in volume situations. Indexes must be maintained during load on their own threads, although certain levels of latency are acceptable. They should not impact the load processes nor cause RAM or CPU binding.

* Bulk-mode updates and deletes. In the future RDBMS engine, part of the API will provide "look-ahead" calls or even artificially intelligent learning algorithms. One way or another, it will be able to read blocks of information into RAM for RAM-based bulk updates or bulk deletes. Today, given volume situations, we still face these problems on a row-by-row basis from an ETL perspective.

* Auto defragmentation. Given the volumes of information being worked on/within ETL and EAI, it is extremely important to keep the database defragmented. This is a difficult proposition requiring a DBA's time in most RDBMS engines.
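The difference between row-by-row and bulk-mode updates and deletes can be sketched as follows. This is only an illustration under stated assumptions: SQLite (through Python's standard `sqlite3` module) stands in for a warehouse engine, and the `customer`, `stage_update` and `stage_delete` tables are hypothetical. A real system would likely fill the staging tables with the vendor's bulk loader rather than `executemany`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO customer (id, status) VALUES (?, ?)",
    [(i, "active") for i in range(1, 1001)],
)

# A block of changes arriving from an upstream feed.
updates = [(i, "suspended") for i in range(1, 501)]
deletes = [(i,) for i in range(901, 1001)]

# Stage the whole block once, then apply it with set-based statements
# instead of issuing one UPDATE or DELETE per incoming row.
conn.execute("CREATE TEMP TABLE stage_update (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE TEMP TABLE stage_delete (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO stage_update VALUES (?, ?)", updates)
conn.executemany("INSERT INTO stage_delete VALUES (?)", deletes)

with conn:  # commit the whole block as one transaction
    conn.execute(
        """
        UPDATE customer
        SET status = (SELECT s.status FROM stage_update AS s WHERE s.id = customer.id)
        WHERE id IN (SELECT id FROM stage_update)
        """
    )
    conn.execute("DELETE FROM customer WHERE id IN (SELECT id FROM stage_delete)")

suspended = conn.execute(
    "SELECT COUNT(*) FROM customer WHERE status = 'suspended'"
).fetchone()[0]
remaining = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

The engine sees two statements for the block rather than 600 individual calls. Passing a block of data to the database in one call is the same shape the stored procedure item in the list asks for.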

ETL wish list

There are serious complications to utilizing an ETL tool in NRT. Below is a list of items that need to be addressed from an ETL individual-process standpoint:

* Each process must be capable of passing messages to other running processes.

* Processes must be able to be defined as "never-ending," that is, until they receive a message to stop.

* Queuing mechanisms must be threaded and parallel so messages between processes are not lost due to CPU time-out.

* Look-ups against tables and source files must be responsive to messaging systems and able to dynamically add, update and delete rows on demand. In this case, if multiple streams are loading a single target table, the integration must be such that the lookups across these multiple streams can be synchronized without deadlock contention.

* Aggregations must be equally shared and synchronized across multiple running streams.

* Input information must be able to be dynamically reassigned to other processes based on meta-data controlled rules designed by the business.

* Multiple streams must be able to be dynamically consumed by a single process based on business rules.

* Once initialized, continuous build and append must be available.

* Checkpoints and failure recovery must be built in (or allowed to be designated) at certain points in the process. Recovery for a process must be a matter of seconds, not minutes or hours.

* Recovery should consist of returning to the most recent checkpoint.

* ETL processes should be distributable across all registered resources so the workload is shared.

* Information arriving at particular points in time should have the ability to reconcile and synchronize across process flows.

* The ETL should be able to construct data hierarchies and dependencies that span process flows, producing queued environments and timetables for latency processing of particular information.

* It should allow the developer of the processes to construct message flow diagrams based on conditions and execute different messages across the processes.

* The ETL should manage (internally) and possibly eliminate target contention or deadlock, allowing the developers of the process to focus on the complex task of getting the NRT data in and integrated.
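Three of the items above (message passing between running processes, "never-ending" processes that stop only on a message, and recovery from the most recent checkpoint) can be sketched with Python's standard `queue` and `threading` modules. Everything here is an assumption made for illustration: the `transform`, `load` and `run` functions, the `STOP` sentinel, and the `checkpoint` dictionary, which stands in for a durable checkpoint store.

```python
import queue
import threading

STOP = object()  # sentinel message: tells a process to finish


def transform(inbox, outbox):
    """Never-ending process: runs until it receives STOP, then passes it on."""
    while True:
        message = inbox.get()  # blocks while waiting for the next message
        if message is STOP:
            outbox.put(STOP)
            return
        offset, value = message
        outbox.put((offset, value.strip().upper()))


def load(inbox, target, checkpoint):
    """Append rows to the target, skipping rows at or before the checkpoint."""
    while True:
        message = inbox.get()
        if message is STOP:
            return
        offset, value = message
        if offset <= checkpoint["offset"]:
            continue  # already loaded before the last checkpoint
        target.append(value)
        checkpoint["offset"] = offset  # stand-in for a durable checkpoint write


def run(rows, target, checkpoint):
    """Wire the two processes together with queues and feed them rows."""
    raw, clean = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=transform, args=(raw, clean)),
        threading.Thread(target=load, args=(clean, target, checkpoint)),
    ]
    for worker in workers:
        worker.start()
    for message in rows:
        raw.put(message)
    raw.put(STOP)  # the processes end only because they are told to
    for worker in workers:
        worker.join()


source = [(1, " a "), (2, "b"), (3, " c")]
target, checkpoint = [], {"offset": 0}
run(source[:2], target, checkpoint)  # first run is cut short after two rows
run(source, target, checkpoint)      # restart replays the feed from the start
```

On the restart, rows 1 and 2 are skipped because the checkpoint already covers them, so only row 3 is appended. The sketch leaves out the harder items in the list, such as lookups and aggregations synchronized across several streams loading one target.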

Dan Linstedt, of Myers-Holum, Inc., can be reached at daniel.linstedt@MyersHolum.com.

ILLUSTRATION BY JOYCE HESSELBERTH




Copyright by Teradata Corporation 2001-2007.