Teradata Magazine Cover Teradata Magazine Online  
Register Help Password
Password:
Quick Links
Current Issue
Archives
Teradata.com
Teradata Magazine Rss Feed
ARCHIVES Search Teradata Magazine Online:  
WEB-ONLY CONTENT

Printable versionPrintable version Send to a colleagueSend to a colleague

Additional thoughts: ETL and ELT, RDBMS

Part 1

As discussed in my article in Volume 4 #3 of Teradata Magazine, there seems to be a range of data today that indicates performing transformation in-database is the way to go, especially if the DBMS engine is highly capable of running massively parallel operations. With that, DBMS vendors are getting the message; most are beginning to build (or have already built) a majority of the transformation capabilities necessary directly into the engine.

Coupling transformation ability with the SQL engines and the structure of information within the database is proving to be a much more powerful solution. In fact, it is such a provocative space that most RDBMS vendors are taking it very seriously. We are now seeing the emergence of powerful connection software, transformation and data warehouse automation built and distributed freely with the next new releases of DBMS engines. SQL Server (Yukon), Oracle 10g and Warehouse Builder, Teradata Warehouse and DB2 UDB Web sphere, to name but a few, are becoming very active in these areas.

EAI, RDBMS and ETL: Worlds are colliding
EAI tools are encroaching on the feature sets of ETL (ELR), and RDBMS vendors are building in transformation features. In some cases, this results in an opportunity to choose where the transformation is implemented. ETL is feeling the pinch, as the niche market it has occupied is now seen as a value-add to both EAI and RDBMS. The "squeeze play," as it was introduced, is a feature squeeze that has occurred over the past two years. Industry consultants have recently observed the following trends:

a) The emergence of EII as a PULL technology to pull transactions when requested, and perform transformation on the transactions being pulled across sources. For example, let's suppose you want to vote. You walk into your local polling place and the county wants to check (in near real-time) your place of residence, criminal record, voter registration activity, current DMV record and other items of interest. They would then fire a query over the system saying, "Go wherever you need to go to get data on this individual. Bring it back (current info only), transform it, and integrate it. Then present it to me on a single screen." This would be a PULL of your records from disparate systems (such as the DMV, county court offices, voter registration systems, etc.).
b) The encroachment of ETL and EAI on the EII space-as they take over real-time PULL technology through existing ETL and ELT engines, such as Web-service engines and real-time offerings. These pieces allow ETL and ELT engines to transform things in a transactional request/reply format.
c) Bigger, stronger, transformation support within RDBMS engines and
d) Modifications to the meaning of ETL-T, and inclusion of mapping generators, transform logic generators, truly real-time request engines (driven by a Web-service approach), to name a few.

But wait...there's more!
Of course there's more! There's always more-more data, more processing, more complexity, more business rules, more machines and, if we're lucky, more machine power (made available to the project). So what's considered "big data" today? Ten, 50 or 100 terabytes?

If you're looking at 100 terabytes, then there are probably many different problems that are being addressed. In some specific instances (non-commercial), there are systems generating two petabytes (PB) every other week. Take the CERN laboratory in Switzerland, for example. Every time the lab's scientists smash an atom, 2PB of data are generated.

Then there's the national weather search, carried out by Florida's hurricane forecasters. Each of the three storm tracks forecasted requires a significant amount of data, as does each successive prediction about the next segment in the track. Only 15 generations of the track are kept, while the others are thrown away. The tracks themselves are kept in memory for the computation of the next track and next segment.

Wait a minute. Did I just say that transformation at this size is done in memory? You bet! It seems that the larger the data set, the more likely that all transformation rules will be carried out in memory. It's a race to the finish for the vendors. The RDBMS, EAI and ETL vendors that can get there first will win the race, and those that don't will lose big.

What it will take to get there is an integrated data set in motion-with real-time feeds, more and more data piled onto the system-followed by in-stream or in-memory transformation. Finally, it will take putting data down in the right place in the RDBMS engines for further analysis.

We will have to recognize the information on the way in and figure out where to place it within the schema as the data arrives. Sometimes that necessitates changing the data set to enable us to see what we need to see. The larger or faster the data set gets, the more likely it becomes that we will have to transform it in memory. Data mining algorithms that never sleep (24x7x365) will become the predicate to all transformation logic. To win the race, these vendors will have to come to grips with 100TB of information, delivered at a rate of 20TB a week.

There are federal and international regulations to observe, not to mention a company's serious interest in the details of its own information. Proving business value on an ongoing basis will become exceedingly important within the integrated data-store area of the future. Strategic decisions will be made faster and will project for shorter periods of time as the machines get quicker and more nimble.

These factors will create an environment that will make today's cutthroat business competition look like playing in the sandbox with toy cars. (Tomorrow's competition will seem more like playing catch...with razor blades. One mistake and wham! You're out.) The fortunes and futures of companies will change hands nearly as routinely as the notations on a stock ticker.

All of this raises additional questions about reliability, traceability and the definition of the term "system of record" (SOR). Sometimes the SOR is defined to be a single place where the original data exists and is maintained. (But if that's the case, what does that make the backups of the SOR?)

If we think back, it used to be the responsibility of the source system to be the SOR, but that old way of doing things becomes a physical impossibility when today's auditor walks through the door. Too many systems have changed, data models have changed and integrity rules have changed, all of which make the everlasting backups totally unusable and even unstable. In some cases, the tapes on which these backups have been made are sitting somewhere in a physical storage house, slowly rotting away.

This raises a question: If the source system can no longer be the SOR for the business, then where does that responsibility fall? Fortunately or unfortunately, because of the nature of the integrated historical data store (data warehouse or operational data store), the auditor is beginning to look at the data warehouse as the SOR.

Say, for instance, that we've compiled a customer master list, and the only place it exists is within the data warehouse. Does that then make the data warehouse an SOR? I would say so. Granted, we would much prefer to have the data warehouse actually be a "system of fact" or "statement of fact," rather than an SOR. However, that's not up to us anymore. T



RELATED LINK:

Dan Linstedt's online column on the ETL/ELT debate Part II

© Teradata Magazine-September 2004


back to top



Copyright by Teradata Corporation 2001-2007.