 |
Additional thoughts: ETL and ELT, RDBMS
Part 1
by Dan Linstedt
As discussed in my article in Volume
4 #3 of Teradata Magazine, there seems to be a range of data
today that indicates performing transformation in-database is the way
to go, especially if the DBMS engine is highly capable of running massively
parallel operations. With that, DBMS vendors are getting the message;
most are beginning to build (or have already built) a majority of the
transformation capabilities necessary directly into the engine.
Coupling transformation ability with
the SQL engines and the structure
of information within the database
is proving to be a much more powerful
solution. In fact, it is such a provocative
space that most RDBMS vendors are
taking it very seriously. We are now
seeing the emergence of powerful connection
software, transformation and data
warehouse automation built and distributed
freely with the next new releases
of DBMS engines. SQL Server (Yukon),
Oracle 10g and Warehouse Builder,
Teradata Warehouse and DB2 UDB Web
sphere, to name but a few, are becoming
very active in these areas.
EAI, RDBMS and ETL: Worlds are
colliding
EAI tools are encroaching on the feature
sets of ETL (ELR), and RDBMS vendors
are building in transformation features.
In some cases, this results in an
opportunity to choose where the transformation
is implemented. ETL is feeling the
pinch, as the niche market it has
occupied is now seen as a value-add
to both EAI and RDBMS. The "squeeze
play," as it was introduced, is a
feature squeeze that has occurred
over the past two years. Industry
consultants have recently observed
the following trends:
a) The emergence of EII as a PULL
technology to pull transactions
when requested, and perform transformation
on the transactions being pulled
across sources. For example, let's
suppose you want to vote. You walk
into your local polling place and
the county wants to check (in near
real-time) your place of residence,
criminal record, voter registration
activity, current DMV record and
other items of interest. They would
then fire a query over the system
saying, "Go wherever you need to
go to get data on this individual.
Bring it back (current info only),
transform it, and integrate it.
Then present it to me on a single
screen." This would be a PULL of
your records from disparate systems
(such as the DMV, county court offices,
voter registration systems, etc.).
b) The encroachment of ETL and EAI
on the EII space-as they take
over real-time PULL technology through
existing ETL and ELT engines, such
as Web-service engines and real-time
offerings. These pieces allow ETL
and ELT engines to transform things
in a transactional request/reply
format.
c) Bigger, stronger, transformation
support within RDBMS engines and
d) Modifications to the meaning
of ETL-T, and inclusion of mapping
generators, transform logic generators,
truly real-time request engines
(driven by a Web-service approach),
to name a few.
But wait...there's more!
Of course there's more! There's always more-more data, more processing,
more complexity, more business rules, more machines and, if we're lucky,
more machine power (made available to the project). So what's considered
"big data" today? Ten, 50 or 100 terabytes?
If you're looking at 100 terabytes, then there are probably many different
problems that are being addressed. In some specific instances (non-commercial),
there are systems generating two petabytes (PB) every other week. Take
the CERN laboratory in Switzerland, for example. Every time the lab's
scientists smash an atom, 2PB of data are generated.
Then there's the national weather search, carried out by Florida's hurricane
forecasters. Each of the three storm tracks forecasted requires a significant
amount of data, as does each successive prediction about the next segment
in the track. Only 15 generations of the track are kept, while the others
are thrown away. The tracks themselves are kept in memory for the computation
of the next track and next segment.
Wait a minute. Did I just say that transformation at this size is done
in memory? You bet! It seems that the larger the data set, the
more likely that all transformation rules will be carried out in memory.
It's a race to the finish for the vendors. The RDBMS, EAI and ETL vendors
that can get there first will win the race, and those that don't will
lose big.
What it will take to get there is an integrated data set in motion-with
real-time feeds, more and more data piled onto the system-followed by
in-stream or in-memory transformation. Finally, it will take putting data
down in the right place in the RDBMS engines for further analysis.
We will have to recognize the information on the way in and figure out
where to place it within the schema as the data arrives. Sometimes that
necessitates changing the data set to enable us to see what we need to
see. The larger or faster the data set gets, the more likely it becomes
that we will have to transform it in memory. Data mining algorithms that
never sleep (24x7x365) will become the predicate to all transformation
logic. To win the race, these vendors will have to come to grips with
100TB of information, delivered at a rate of 20TB a week.
There are federal and international regulations to observe, not to mention
a company's serious interest in the details of its own information. Proving
business value on an ongoing basis will become exceedingly important within
the integrated data-store area of the future. Strategic decisions will
be made faster and will project for shorter periods of time as the machines
get quicker and more nimble.
These factors will create an environment that will make today's cutthroat
business competition look like playing in the sandbox with toy cars. (Tomorrow's
competition will seem more like playing catch...with razor blades. One
mistake and wham! You're out.) The fortunes and futures of companies
will change hands nearly as routinely as the notations on a stock ticker.
All of this raises additional questions about reliability, traceability
and the definition of the term "system of record" (SOR). Sometimes the
SOR is defined to be a single place where the original data exists and
is maintained. (But if that's the case, what does that make the backups
of the SOR?)
If we think back, it used to be the responsibility of the source system
to be the SOR, but that old way of doing things becomes a physical impossibility
when today's auditor walks through the door. Too many systems have changed,
data models have changed and integrity rules have changed, all of which
make the everlasting backups totally unusable and even unstable.
In some cases, the tapes on which these backups have been made are sitting
somewhere in a physical storage house, slowly rotting away.
This raises a question: If the source system can no longer be the SOR
for the business, then where does that responsibility fall? Fortunately
or unfortunately, because of the nature of the integrated historical data
store (data warehouse or operational data store), the auditor is beginning
to look at the data warehouse as the SOR.
Say, for instance, that we've compiled a customer master list, and the
only place it exists is within the data warehouse. Does that then make
the data warehouse an SOR? I would say so. Granted, we would much prefer
to have the data warehouse actually be a "system of fact" or "statement
of fact," rather than an SOR. However, that's not up to us anymore. T
Dan Linstedt, of Myers-Holum, Inc. can be reached at daniel.linstedt@MyersHolum.com.
© Teradata Magazine-September
2004
back to top |
 |