Additional thoughts: ETL and ELT, RDBMS
Part 2
by Dan Linstedt
What about accountability?
The accountability of our data warehouse systems is changing. Accountability
may mean different things to different people (as indicated by a recent
straw poll I've conducted). For too long we've been blindly
altering, merging, mixing and matching data going into the data warehouse-in
essence, massaging it on the way in. We have not been considering the
ramifications of our choices.
Lest we forget, the business owns the data, not IT. IT has been tasked
with altering and massaging the data to "get it into the data warehouse."
The problem with this (in my opinion) is that it comes at a heavy price.
When someone modifies the data, they modify the meaning of the data. Thus,
they take full responsibility/ownership for the results of those
data modifications.
You may be wondering, "Did he just say that IT started taking ownership
of the data when it's transformed mid-stream?" Yes, I did
say that. At least this is the way the auditors look at it when they arrive
on the scene.
Enter the notions of auditability and accountability of the system. The
business needs to be able to trust that the data warehouse has the data
as it stood on the source systems, as of a particular date. For
auditing reasons, IT should not be taking ownership of the data or what
was done to the data; that's the responsibility of the business.
We should, however, be able to step in and provide the data in a quality-driven,
consistent format to the end-users coming from the data warehouse.
That way, when the auditor asks, "What did the data look like last
week (or last month or last year)?" we can answer the question.
When the business changes its definition of "today's truth,"
we can rebuild the data marts and transform the data as desired without
losing any of the history or facts of that history.
Tracking of the data set, therefore, comes to the forefront when answering
compliance questions. How can we do this with current ETL tools and batch
cycles? It becomes difficult to untangle just when an element of data
was transformed and how, unless we are tracking both the before and after
or recording the transformation at the time it occurs. Transformation
in the database, however, proves to be more robust and capable of actually
recording the actions of the transformation (as a function of database
audit trails).
Dynamic restructuring defined
I'm suggesting that a starting structure (starting point) will be
provided to the processing engine. But from there, the structure will
be flexible and change dynamically according to the data provider. Imagine
this scenario:
You sign up for a Web services daily "customer prospect feed"
from a vendor. When the vendor signs the agreement, it agrees to provide
address, city, state and ZIP code information. Soon your vendor provides
phone number information and an integrated customer number. The next week
they include it on your feed without telling you. Do you:
a) Stop the whole process when you see the new elements and alert someone?
b) Define a place to put the new element (field) and dynamically add
it to the feed?
c) Use a data mining engine to determine the impact to the structure
and where it should fit to carry it through?
d) Discard the new data element until re-programming can take place?
This may be called "dynamic data warehousing" in the future,
but whatever it ends up being named, it's coming. The ability to
weigh the impact, determine the placement and dynamically adapt structure
will be "where it's at" in the future. Those vendors
that can handle real-time feeds and integrated meta data today may be
ahead of the game, especially if they decide to add artificial intelligence
(AI) to their processing engines (not just for data mining, but also for
structural changes, particularly so we can segment the type of data element
that arrives on the feed).
The high cost of RDBMS
There are certain vendors in this space that are extremely strong. They
provide excellent transformation capabilities from right within the SQL
layers themselves. If you have a data quality or data mining engine plugged
in, the SQL can transparently activate or utilize those functions (for
example, using Teradata, First Logic and TeraMiner).
The RDBMS components of the world are getting stronger and vendors will
be offering these pieces as highly scalable, highly parallel activities,
straight from SQL. The new tools of tomorrow (ELT, if you will) will offer
all their visual delights, but they will also be able to generate specific
RDBMS SQL code. Vendors are poised to make strong entries in this field
within the coming year. (One of them, Teradata, makes the above-mentioned
abilities available today.)
In-database transformation offers the following benefits:
- It can utilize information already in the RDBMS stores to make decisions.
- It has the full scalability and parallelism of the RDBMS engine underneath.
- It has the capacity to compare data in RAM within the RDBMS engine
utilization. Transformations can be single transactions or multiple
batches of data (transformation in the RDBMS doesn't care).
It is also apparent that the current pricing model must change. There
are a whole host of new entries from RDBMS vendors that will become freely
available (or already are). These new engines (if they are to remain best-of-breed)
must come down in price. Yes, they must offer more features and more power
at lower prices.
The entire model must be competitively priced based on a modular pricing
format. There will be customers who need AI, and those who don't.
Likewise, there will be customers that desire five-minute latency with
massive sets of data, and others who need a one-hour refresh cycle with
not much data. Still others will require six-hour cycles and will have
loads of batch-driven data. The pricing models must evolve as well.
The future of ETL
ELT and best-of-breed extraction systems (without modification) will only
be around for a while longer in their current form. If they are to survive,
they need to change the way they do business. There will always be a need
to transform batch-level data, particularly because many businesses can't
justify converting 100% of their business to the relatively high costs
of truly active data warehousing. But what will change is the way we do
ETL, and the notion of ETLT itself will come into being (see "Alphabet
soup" in my article in Vol.
4 #3 of Teradata Magazine).
This raises a number of questions: Why has ETL shifted toward ELR? Why
is in-database transformation becoming best of breed? Why has EAI stayed
where it is in this game, despite the fact that its ability to handle
business rules and transaction distribution has expanded? And why do we
need map-generation and in-database transformations?
Here are some of the business drivers influencing the future of ETL:
- Accountability of the data warehouse
- The ability of businesses to change their definition of the "truth"
without affecting underlying history (ties to accountability)
- Compliance with national and international accords
- Auditability of the data as a statement of fact
- The merging notions of what's operational and what's a
data warehouse
- The need for business to incorporate strategic results into answering
tactical questions about trends and forecasts (i.e. data mining)
- Finding source system and source process errors that have existed
for years, but which can't be spotted if the data warehouse loading
systems are "cleaning up the data" before loading it
- Business decision time is continually shrinking. It's nearly
impossible anymore to be on a pro-active decision curve. Most of the
time we analyze information, and end up making a reactive decision.
There's simply too much data for us to digest it all.
It's no longer enough to provide transformation in-stream. It is
now necessary to also offer the ability to push all or part of the transformations
into the RDBMS engines which they feed.
Where does real-time fit in?
Real-time is the function of the data being triggered from source systems,
and as soon as the change is ready, it is passed along the chain to the
target (hopefully with one single version of the facts).
Real-time (or more accurately, near real-time) is the ability to respond
to transactions as they happen. ETL functions best in a batch mode. ELT
provides some real-time opportunities that cannot be found in the traditional
ETL space.
The new tool of the future will contain a Web-based business workflow
diagramming solution (much like EAI) that end-users can design on three
levels: business rules, encapsulation routines and low-level code. However,
the end-result will be high-speed, dynamic data and structure triggering
and manipulation.
ELR shifts the very foundation of "batch" functionality into
a more congruent state of mind. In very simple terms, we can take what
was a traditional batch tool, set a job to run in real-time and wait for
requests on a Web-services queue.
Conclusion
Between a data volume of 1TB and about 80TB, ETL and in-database transformation
seem to be a wonderful fit. Beyond 80TB, it no longer matters which tool
does the transformation; it simply has to happen in RAM and in parallel.
In some cases, it must happen as data arrives in the integrated data store
(data warehouse); in other instances, the accountability will be drawn
into question, and the transformation must occur on the way out of the
data warehouse.
In an after-effect, scoring the data and constantly cleaning out the
haystacks of "bad" or indifferent information will be an ongoing
task. Deleting the uninteresting data after it's scored and deemed
irrelevant will be important to managing size. However, don't forget
the Federal and international requirements to keep all data auditable.
This will quickly become an impossible task, until a petabyte is available
within the nanotech world.
Finally, for a majority of the market, in-database transformation is
a trend that shows absolutely no sign of slowing down. And let's
not forget the notions that brought us to this point of massive data sets:
accountability and wavering definitions of "system of record."
T
Dan Linstedt, of Myers-Holum, Inc. can be reached at daniel.linstedt@MyersHolum.com.
© Teradata Magazine-September 2004
back to top |