Tomorrow's real-time ETL imperative
Vendors-and businesses-must adopt real-time strategies
in order to survive.
by Dan Linstedt
ETL (extract-transform-load) has become a commodity in the marketplace.
Nearly every data warehousing vendor offers it, and most DBAs understand
what it takes to prepare data for loading into a data warehouse.
ELT (extract-load-transform) is not as well-known and requires a
bit more skill to architect and utilize. However, it too is becoming
mainstream. In fact, there are some tools that offer GUI-driven
ELT capabilities with metadata management, making this method very
accessible.
But change is in the air. ETL must shift to survive, and while ELT
is part of the solution, it's not enough. I believe that advances
in real-time will change ETL and ELT into something I call ELR (extract-load
in real-time). ETL and ELT vendors that don't adapt might not make
the cut.
What's happening?
Dealing with data in transformational layers in real-time instead
of in batches changes the architectural data flow. It requires that
we change the way we think about data. For example, transformation
elements such as data mining and data cleansing will be introduced
downstream instead of upstream.
Performance throughput will also change. ELR, with its highly focused
transactional sessions and heavy transformation logic, will slow
down throughput significantly, but because these tools will be dealing
with much smaller volumes of data with greater frequency, latency
may become a non-issue.
Industry consultants are seeing the following trends developing
in this space:
> Data volumes have more than doubled, and business needs are driving
that growth. With the advent of RFID technology, volume will only
continue to increase, presenting new challenges to all vendors.
> Data quality has turned into the idea of "knowledge management
through data mining." In-database data mining is the wave of the
future; it's becoming nearly impossible to export large data sets,
reformat them and push them through algorithms outside the database.
> Thanks to active data warehousing, data is bypassing ETL completely
and being deposited by enterprise application integration (EAI)
and other mechanisms directly into the enterprise data warehouse,
requiring that transformation be embedded in the DBMS systems. (This
may not be true for data warehouses that still operate in batch
mode.)
> Batch modes of the future will be small and fast, except in the
cases where data can sit until a batch window can be opened and
the data moved en masse.
> Government and international regulations suggest that auditors
can request to view data as it appeared in every step of the ETL
process. This places a tremendous load on the data warehouse, and
it puts the onus on the business itself to maintain scrupulous data
records.
As a result, many businesses are changing the way they view, record,
collect, manage, interpret and transform data. Therefore, data warehouses
need to become better at tracking "before" and "after" pictures
of these data modifications. Transformations that take place within
the database are a perfect way to track the process-to produce an
audit trail, if you will.
Data warehouses are fast becoming "active" data warehouses, and
they are beginning to look, act and respond as independent operational
systems.
Additionally, businesses are required to make faster, better decisions
than ever before. As a result, the ETL industry is being pushed
towards moving less data more frequently and, of course, at a faster
rate. These changes also give rise to on-demand requests for information
that previously was unavailable.
What does that mean to the transformation process? We no longer
have the time to transform the data in-stream. Extract and load
routines have to bring the data in as is, and then, depending on
the business's needs, do the transformation either on the way into
the data mart or warehouse (for OLAP and strategic analysis) or
on a transaction-by-transaction basis inside of a data-mining algorithm.
So what's the bottom line?
Volume, latency and functionality have hit a convergence point.
Businesses no longer have the time to perform massive sets of transformation
on massive data volumes before making tactical decisions. Traditional
ETL methods create bottlenecks because in some cases they offer
a single-point solution for either batch or near-real time, but
they don't offer a complete view of the data.
In order to survive going forward, ETL engines will need to shift
their focus to either ETLT (extract-transform-load-transform) or
ELRD (extract-load in real-time with dynamic restructuring capabilities).
ETL tools have evolved to handle metadata, scheduling and volume
without batting an eye.
While transformation mid-stream will never go away entirely, there
is a growing need to tie transformation to in-database data mining,
in-database data quality, in-database ranking and scoring, in-database
integration, in-database parallelism, and in-database scalability,
fault tolerance and fail-over. Moreover, ETL vendors must offer
in-database management of the structures, combined with business
rules and transformation logic. Impact analysis should be a single
stream of findings, and changes to table structure should be automatically
adapted by transformation logic, data mining, profiling, quality
and cleansing routines.
Please don't discount the ETL engines just yet-they are mature and
exist for very good reasons. ETL vendors still have a few tricks
up their sleeves, as do all companies that have good direction.
But they will have to change or they will not survive. ETL and ELT
will have a place in the new market, but only if they incorporate
real-time capabilities, not to mention the ability to handle metadata
and business rules and to adapt to and integrate best-of-breed solutions
in each of these areas.
Let's look at the issue another way: ETL wins when a business has
huge data volumes to transform quickly and consistently from a metadata
standpoint. It wins when the database machine does not have enough
power or is widely distributed, geographically speaking. It also
works if multiple targets exist on disparate platforms, but only
if there is a high-speed gateway between the RDBMS engines.
On the other hand, ELT wins when the business has a high-powered,
highly scalable database engine like Teradata that is capable of
running inserts, updates and deletes in parallel-not to mention
data mining, profiling, cleansing and segmentation-all from a SQL
command line.
ELT might struggle if the transformation logic must be built and
then distributed across physically separated data warehouses. In
this case, in-stream transformation would work better.
However, if the data warehouse is based on a single integrated version
of the data, then having the transformation occur within the database
might prove more effective because it can take advantage of all
the parallelism and power within an RDBMS engine such as Teradata.
ETLT gives the database architect and designers a choice between
ELT and ETL, allowing them to use the best solution for the current
need. It is my opinion that the ETLT engine of the future will be
ELR and will include visual leverage points from both ETL and ELT.
In other words, we will have the tools necessary to build source-
and target-related SQL and transformations, and we will be able
to include them in the database as throwaway or persistent SQL elements.
Some transformations will include data mining, cleansing and/or
massively parallel transformation ability. But all will include
real-time capacities, even attaching triggers to existing database
systems to capture data automatically.
Looking ahead
ETL won't go away, but it will morph into something else in the
future. Today's vendors must focus on "connecting" to everything
and providing more near real-time integration features and database-specific
generation specialties to survive. They must leverage their utilization
and expertise in metadata and begin applying business rules and
not just technical rules to the information systems.
I believe that the glory days of being "just ETL" or "just ELT"
are over. The days of ETLT and ELRD are just beginning, particularly
in terms of visual business rule interfaces.
The new ETLT tool of the future will contain a Web-based business
workflow diagramming solution, much like EAI, that end-users can
design on three levels: business rules, encapsulation routines and
low-level code. (See the "Future shock" sidebar for more features.)
The end result will be high-speed, dynamic data, along with structure
triggering and manipulation.
Current ETL vendors that I think have an opportunity to change the
future of this space are (in alphabetical order): Ab Initio, Ascential,
Hummingbird and Informatica. And the RDBMS vendor that is best suited
to handle this change is Teradata.
Without a doubt, they face a tremendous challenge. But the future
is in their hands, and that future looks bright indeed.
T