Near real time and the next-generation ETL tool
by Dan Linstedt
Recently I've been challenged by enterprise application
integration (EAI) and extract, transform and load (ETL)
vendors to predict what the user population might want
to see next year. "What's around the corner?" they
ask. Let's take a realistic look at integrating data, the
task at hand in a near real time (NRT) functional environment.
I've also been asked whether EAI will take over the
ETL world or vice versa. I'm here to answer ... neither.
They play in concert in tomorrow's vision. Each has
its role, and each plays it well.
Clearing the data pipeline
EAI has taken the responsibility of guaranteed
delivery. Rain or shine, sleet or snow, data gets to its
destination as fast as possible. Even if the network goes
down, and comes up again next year, it will get there.
If there are alternate routes, EAI tools reroute the information
and make sure it gets delivered.
These tools also ensure distribution. If there is more
than one destination, they make sure that all destinations
hear the message. They play a huge role in a transaction-by-transaction
analysis, quite possibly reformatting the transaction
on a one-by-one basis. But without extensive programming,
EAI tools do not do well at taking care of data integration,
aggregation and consolidation, especially at an enterprise
level. This is the job of the ETL tool.
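The division of labor described above can be sketched in a few lines. This is only an illustration, not any vendor's API: the transaction fields, the `deliver` and `consolidate` functions and the two destinations are all hypothetical. The first function handles one transaction at a time and fans it out to every destination, as an EAI tool would; the second integrates and aggregates across sources, which is the ETL job.

```python
from collections import defaultdict

# Hypothetical transactions arriving from two source systems.
transactions = [
    {"source": "orders", "customer": "C1", "amount": 120.0},
    {"source": "billing", "customer": "C1", "amount": 30.0},
    {"source": "orders", "customer": "C2", "amount": 75.5},
]


def deliver(transaction, destinations):
    """EAI-style step: reformat one transaction and send it to every destination."""
    message = {
        "customer_id": transaction["customer"],
        "amount_cents": round(transaction["amount"] * 100),
    }
    for inbox in destinations.values():
        inbox.append(message)
    return message


def consolidate(batch):
    """ETL-style step: integrate across sources and aggregate per customer."""
    totals = defaultdict(float)
    for transaction in batch:
        totals[transaction["customer"]] += transaction["amount"]
    return dict(totals)


# Hypothetical destinations, modeled as simple in-memory inboxes.
destinations = {"warehouse": [], "crm": []}
for transaction in transactions:
    deliver(transaction, destinations)

customer_totals = consolidate(transactions)
print(len(destinations["crm"]), customer_totals)
```

Note that `deliver` never looks at more than one transaction, while `consolidate` needs the whole set. That difference is the reason the two tool classes complement rather than replace each other.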
EAI is important but it has yet to master the cross-stream
integration layers that ETL already offers, not to mention
the graphical user interface (GUI) development environment
for data flow validation and control. This is where the
ETL tools play, not where EAI vendors should go. They
should continue focusing on data delivery, making it faster,
better, cheaper and, of course, more reliable. Consistency
is the backbone of EAI; that will never go away. It is
also EAI's role to ensure transactions are delivered
on time. In this case, speed is an issue and will continue
to be as the data volumes grow along with the source systems.
For EAI, latency is an issue, as is integration. It will
no longer suffice to simply have coded efforts developed
in an EAI solution, nor is it good enough to "initialize"
the world of ETL. There will not be enough time in the
day to download ever-changing data, and the complex world
of cross-integrated solutions will require that the data
be kept in sync across never-ending streams of processes.
So what does EAI hold in its future? I don't
know—I haven't made that analysis of the market
yet. Hopefully, by this time next year we'll be able
to write intelligently about what is currently a young
market (by software standards). At least for tomorrow,
ETL and EAI won't necessarily merge functionality,
although there will be some bleed-over in the technology
sector. Best-of-breed ETL won't be able to compete
with best-of-breed EAI simply because the technology band
is too wide.
Before choosing to install an EAI system, be sure to
have the business justification in place. Keep an open
mind, and remember that EAI shouldn't be put in place
of ETL, and ETL shouldn't be put in place of EAI.
Transforming transformation
ETL can be expected to change a lot, especially
when it comes to supporting NRT. As for its integration
with EAI, this is one of the cornerstones the industry
needs in order to survive. It is my prediction that
ETL will not survive on batch alone in the new industry.
Of course that's no secret to anyone. Almost everyone
these days is investing in some EAI solution and has had
several ETLs from which to choose for quite some time.
But before you try NRT with your ETL, be warned. There
are serious implications dictated by complex business
rules that will stop you dead in your tracks. It's
a game of "push the synchronization point around
until it lands in the right hole." In other words,
today it's a series of workarounds and undulations
(a nightmare for the systems architect) to ensure success.
What we're discussing here is that ETL must expand
beyond individual processes to encompass workflows and
dynamic data sharing between workflows. It must also allow
messaging between workflows. The more data-driven the
ETL and its processes become, the more likely it is that
they will be able to handle NRT feeds. See the ETL wish list
(below) for features vendors need to address.
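To make the idea of messaging between workflows concrete, here is a minimal sketch using Python's standard `queue` and `threading` modules. It is an assumption-laden toy, not a description of any ETL product: the `loader` and `auditor` workflows, the `STOP` sentinel and the row layout are all invented for illustration. Each workflow runs as a "never-ending" process until it receives a stop message, and the first passes messages to the second as rows arrive.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel meaning "stop the never-ending process"


def loader(inbox, outbox, loaded):
    """Workflow 1: runs until told to stop, loading rows as they arrive."""
    while True:
        row = inbox.get()              # blocks until a row or STOP arrives
        if row is STOP:
            outbox.put(STOP)           # pass the stop message downstream
            break
        loaded.append(row)
        outbox.put(("loaded", row["id"]))  # message to the other workflow


def auditor(inbox, audit_log):
    """Workflow 2: reacts to the messages sent by the loader."""
    while True:
        message = inbox.get()
        if message is STOP:
            break
        audit_log.append(message)


feed, notices = queue.Queue(), queue.Queue()
loaded, audit_log = [], []
workers = [
    threading.Thread(target=loader, args=(feed, notices, loaded)),
    threading.Thread(target=auditor, args=(notices, audit_log)),
]
for worker in workers:
    worker.start()

for row_id in range(3):                # simulated NRT feed
    feed.put({"id": row_id})
feed.put(STOP)

for worker in workers:
    worker.join()
print(audit_log)
```

The queues here live in one process. A real NRT deployment would presumably need durable, distributed queues so that messages survive a failure, which is where EAI-style guaranteed delivery comes in.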
Putting the squeeze on ETL
There is an additional side to these issues:
the relational database (RDBMS) engine. I liken it to
a squeeze play on ETL, with EAI on the left and RDBMS
on the right, both pushing to the middle ground where
ETL resides and ETL is pushing back (Figure 1). In a sense,
the lines will be blurred between which tools do what.
This is good because it means all the tools and vendors
are migrating forward and sharing the responsibility.
After all, balance is one of the major keys to success
in best-of-breed tool sets.
This idea is coming to light as database and ETL vendors
rush to provide support for very large database (VLDB)
systems. Today, most database engines struggle to handle
10+ terabytes of information, as well as massive integration
efforts of new information. In other words, they are attempting
to handle integration, volume and latency issues on their
own.
Getting the data into the database is typically a huge
issue, one where ETL breaks down unless there is a strong
partnership with the RDBMS vendor. This leads us to a
best-of-breed discussion on database features. The RDBMS
vendors know best how to load massive data sets through
their bulk-loading systems. However, when it comes to massive integration
efforts, that might be a different story.
The RDBMS Wish List (below) provides a partial listing
of new features that RDBMS engines will have to undertake
to support the new world of ETL and EAI. Currently Teradata
has many of these features already built in, while Oracle,
DB2 UDB, Sybase and SQL Server appear not to.
As these tools come closer together, so will the many
roles of the DBA. He or she will have to learn about ETL
and possibly EAI. In a perfect world, there would be one
person designing the EAI, ETL and database solution, one
person implementing and maintaining it all. We want to
cut the costs of managing and maintaining the physical
layers of each area: EAI, ETL and RDBMS.
Adapt today, survive tomorrow
This is definitely tomorrow's market
for ETL. Will it come right away? No. There's a lot
of engineering work involved in just these suggestions
alone, not to mention the other 100 that I didn't
address here. However, smart ETL vendors will begin to
adapt their tools to handle these and other critical features
in the NRT world in order to handle additional on-the-fly
information.
I think the industry will begin to see a merging of RDBMS,
EAI and ETL functionality. There will be a lot of changes,
possibly mergers or at least alliances between RDBMS vendors
and ETL vendors. EAI vendors will continue to link to
ETL and integrate the best they can. Just don't forget,
the focus of successful warehousing will be a paradigm
shift—from ETL compensating for deficiencies in the
RDBMS engines, to the RDBMS designed for data warehousing
and ETL utilizing best-of-breed functionality.
RDBMS wish list
RDBMS vendors should consider
the following when upgrading their tool sets:
* Larger, more open application programming interfaces
  (APIs) for native bulk-loading systems. They will
  allow integration of the bulk-loader directly into
  the ETL and EAI tools.
* Full parallelism. Instead of relying on the developer
  to "code" or architect a multi-path parallel
  load system, this should be built in to the load
  mechanism.
* Full partitioning support. Again, hiding the complexity
  from the programmers and from the ETL tool, this
  should be dynamically set up. RDBMS engines should
  help hide the partitioning scheme from ETL and EAI.
  We as designers/architects shouldn't be forced
  to deal with the physical partitioning layer of
  the database just for load reasons.
* Stored procedure/trigger support. These should be
  callable through an ETL tool API and passed one
  or more blocks of data to execute at one time, rather
  than calling a stored procedure/trigger in a transactional
  sense (one at a time). This will help ETL tools
  with the concept of ELT as well.
* Auto-balanced usage of the CPUs, RAM and disk. These
  complex routines will need to measure, adjust and
  report the metrics. Due to volume loads, we (architects/designers)
  don't have time to perform these complex tasks.
  However, without these tasks being performed by
  the RDBMS engine underneath, our ETL processes quickly
  become bound to I/O, RAM and/or CPU. Again, volume
  is driving this issue. ETL tools should be able
  to leverage these auto-balance routines by communicating
  through the complex API.
* Threaded index updates. Indexes are too often "leftovers,"
  handled after the load process is done. This won't
  work in volume situations. Indexes must be maintained
  during load on their own threads, although certain
  levels of latency are acceptable. They should not
  impact the load processes nor cause RAM or CPU binding.
* Bulk-mode updates and deletes. In the future RDBMS
  engine, part of the API will provide "look-ahead"
  calls or even artificially intelligent learning
  algorithms. One way or another, it will be able
  to read blocks of information into RAM for RAM-based
  bulk updates or bulk deletes. Today, given volume
  situations, we still face these problems on a row-by-row
  basis from an ETL perspective.
* Auto defragmentation. Given the volumes of information
  being worked on/within ETL and EAI, it is extremely
  important to keep the database defragmented. This
  is a difficult proposition requiring a DBA's
  time in most RDBMS engines.
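The difference between row-by-row and block-at-a-time work, which runs through several of the items above, can be illustrated with Python's built-in `sqlite3` module. SQLite is only a stand-in here and says nothing about how any of the engines named in this article behave; the `customer` table, the change sets and the block size are hypothetical. The first function issues one statement and one commit per row, while the second hands the engine a block of rows per call.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, status TEXT)")
connection.executemany(
    "INSERT INTO customer (id, status) VALUES (?, ?)",
    [(i, "new") for i in range(1, 1001)],
)
connection.commit()

# Hypothetical change sets produced by an ETL stream.
changes = [("active", i) for i in range(1, 501)]   # (new_status, id) pairs
expired_ids = [(i,) for i in range(901, 1001)]     # ids to delete


def apply_row_by_row(conn, updates):
    """Transactional style: one statement and one commit per row."""
    for status, row_id in updates:
        conn.execute("UPDATE customer SET status = ? WHERE id = ?", (status, row_id))
        conn.commit()


def apply_in_blocks(conn, updates, deletes, block_size=200):
    """Block style: pass a block of rows per call and commit once per block."""
    for start in range(0, len(updates), block_size):
        conn.executemany(
            "UPDATE customer SET status = ? WHERE id = ?",
            updates[start:start + block_size],
        )
        conn.commit()
    conn.executemany("DELETE FROM customer WHERE id = ?", deletes)
    conn.commit()


apply_in_blocks(connection, changes, expired_ids)
active = connection.execute(
    "SELECT COUNT(*) FROM customer WHERE status = 'active'"
).fetchone()[0]
remaining = connection.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(active, remaining)
```

`apply_row_by_row` is included only for contrast. Even in block style the statements above still touch one row each inside the engine; the wish list asks for more than that, namely an API through which the engine itself works on whole blocks in RAM.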
ETL wish list
There are serious complications
to utilizing an ETL tool in NRT. Below is a list
of items that need to be addressed from an ETL individual-process
standpoint:
* Each process must be capable of passing messages
  to other running processes.
* Processes must be able to be defined as "never-ending,"
  that is, until they receive a message to stop.
* Queuing mechanisms must be threaded and parallel
  so messages between processes are not lost due to
  CPU time-out.
* Look-ups against tables and source files must be
  responsive to messaging systems and able to dynamically
  add, update and delete rows on demand. In this case,
  if multiple streams are loading a single target
  table, the integration must be such that the lookups
  across these multiple streams can be synchronized
  without deadlock contention.
* Aggregations must be equally shared and synchronized
  across multiple running streams.
* Input information must be able to be dynamically
  reassigned to other processes based on meta-data
  controlled rules designed by the business.
* Multiple streams must be able to be dynamically
  consumed by a single process based on business rules.
* Once initialized, continuous build and append must
  be available.
* Checkpoints and failure recovery must be built in
  (or allowed to be designated) at certain points
  in the process. Recovery for a process must be a
  matter of seconds, not minutes or hours.
* Recovery should consist of returning to the most
  recent checkpoint.
* ETL processes should be distributable across all
  registered resources so the workload is shared.
* Information arriving at particular points in time
  should have the ability to reconcile and synchronize
  across process flows.
* The ETL should be able to construct data hierarchies
  and dependencies that span process flows, producing
  queued environments and timetables for latency processing
  of particular information.
* It should allow the developer of the processes to
  construct message flow diagrams based on conditions
  and execute different messages across the processes.
* The ETL should manage (internally) and possibly
  eliminate target contention or deadlock, allowing
  the developers of the process to focus on the complex
  task of getting the NRT data in and integrated.
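The checkpoint and recovery items above can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of how any ETL tool implements recovery: the checkpoint file format, the `run` function, its `every` and `fail_at` parameters and the ten-row feed are all hypothetical. State is saved every few rows, a failure is simulated partway through, and the restart resumes from the most recent checkpoint rather than from the beginning.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"position": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the checkpoint atomically so a crash cannot leave a partial file."""
    temp_path = path + ".tmp"
    with open(temp_path, "w") as handle:
        json.dump(state, handle)
    os.replace(temp_path, path)


def run(stream, path, every=3, fail_at=None):
    """Consume the stream from the last checkpoint, saving state every `every` rows."""
    state = load_checkpoint(path)
    for position in range(state["position"], len(stream)):
        if position == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += stream[position]
        state["position"] = position + 1
        if state["position"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


stream = list(range(1, 11))  # hypothetical feed: the values 1 through 10
checkpoint_path = os.path.join(tempfile.mkdtemp(), "etl.checkpoint")

try:
    run(stream, checkpoint_path, fail_at=7)   # fails after two checkpoints
except RuntimeError:
    pass

recovered = load_checkpoint(checkpoint_path)  # state as of the last checkpoint
final = run(stream, checkpoint_path)          # resumes from that checkpoint
print(recovered, final)
```

One row processed after the last checkpoint is replayed on restart. That is harmless here because the running total is part of the checkpointed state, but a process writing to a real target table would need idempotent writes or transactional checkpoints to avoid loading that row twice.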
Dan Linstedt of Myers-Holum, Inc., can be reached at daniel.linstedt@MyersHolum.com.
ILLUSTRATION BY JOYCE HESSELBERTH