Teradata Magazine Cover Teradata Magazine Online  
Register Help Password
Password:
Quick Links
Current Issue
Archives
Teradata.com
Teradata Magazine Rss Feed
ARCHIVES Search Teradata Magazine Online:  
WEB-ONLY CONTENT

Printable versionPrintable version Send to a colleagueSend to a colleague

Additional thoughts: ETL and ELT, RDBMS

Part 2

What about accountability?
The accountability of our data warehouse systems is changing. Accountability may mean different things to different people (as indicated by a recent straw poll I've conducted). For too long we've been blindly altering, merging, mixing and matching data going into the data warehouse-in essence, massaging it on the way in. We have not been considering the ramifications of our choices.

Lest we forget, the business owns the data, not IT. IT has been tasked with altering and massaging the data to "get it into the data warehouse." The problem with this (in my opinion) is that it comes at a heavy price. When someone modifies the data, they modify the meaning of the data. Thus, they take full responsibility/ownership for the results of those data modifications.

You may be wondering, "Did he just say that IT started taking ownership of the data when it's transformed mid-stream?" Yes, I did say that. At least this is the way the auditors look at it when they arrive on the scene.

Enter the notions of auditability and accountability of the system. The business needs to be able to trust that the data warehouse has the data as it stood on the source systems, as of a particular date. For auditing reasons, IT should not be taking ownership of the data or what was done to the data; that's the responsibility of the business. We should, however, be able to step in and provide the data in a quality-driven, consistent format to the end-users coming from the data warehouse. That way, when the auditor asks, "What did the data look like last week (or last month or last year)?" we can answer the question. When the business changes its definition of "today's truth," we can rebuild the data marts and transform the data as desired without losing any of the history or facts of that history.

Tracking of the data set, therefore, comes to the forefront when answering compliance questions. How can we do this with current ETL tools and batch cycles? It becomes difficult to untangle just when an element of data was transformed and how, unless we are tracking both the before and after or recording the transformation at the time it occurs. Transformation in the database, however, proves to be more robust and capable of actually recording the actions of the transformation (as a function of database audit trails).

Dynamic restructuring defined
I'm suggesting that a starting structure (starting point) will be provided to the processing engine. But from there, the structure will be flexible and change dynamically according to the data provider. Imagine this scenario:

You sign up for a Web services daily "customer prospect feed" from a vendor. When the vendor signs the agreement, it agrees to provide address, city, state and ZIP code information. Soon your vendor provides phone number information and an integrated customer number. The next week they include it on your feed without telling you. Do you:

a) Stop the whole process when you see the new elements and alert someone?
b) Define a place to put the new element (field) and dynamically add it to the feed?
c) Use a data mining engine to determine the impact to the structure and where it should fit to carry it through?
d) Discard the new data element until re-programming can take place?

This may be called "dynamic data warehousing" in the future, but whatever it ends up being named, it's coming. The ability to weigh the impact, determine the placement and dynamically adapt structure will be "where it's at" in the future. Those vendors that can handle real-time feeds and integrated meta data today may be ahead of the game, especially if they decide to add artificial intelligence (AI) to their processing engines (not just for data mining, but also for structural changes, particularly so we can segment the type of data element that arrives on the feed).

The high cost of RDBMS
There are certain vendors in this space that are extremely strong. They provide excellent transformation capabilities from right within the SQL layers themselves. If you have a data quality or data mining engine plugged in, the SQL can transparently activate or utilize those functions (for example, using Teradata, First Logic and TeraMiner).

The RDBMS components of the world are getting stronger and vendors will be offering these pieces as highly scalable, highly parallel activities, straight from SQL. The new tools of tomorrow (ELT, if you will) will offer all their visual delights, but they will also be able to generate specific RDBMS SQL code. Vendors are poised to make strong entries in this field within the coming year. (One of them, Teradata, makes the above-mentioned abilities available today.)

In-database transformation offers the following benefits:

  1. It can utilize information already in the RDBMS stores to make decisions.
  2. It has the full scalability and parallelism of the RDBMS engine underneath.
  3. It has the capacity to compare data in RAM within the RDBMS engine utilization. Transformations can be single transactions or multiple batches of data (transformation in the RDBMS doesn't care).

It is also apparent that the current pricing model must change. There are a whole host of new entries from RDBMS vendors that will become freely available (or already are). These new engines (if they are to remain best-of-breed) must come down in price. Yes, they must offer more features and more power at lower prices.

The entire model must be competitively priced based on a modular pricing format. There will be customers who need AI, and those who don't. Likewise, there will be customers that desire five-minute latency with massive sets of data, and others who need a one-hour refresh cycle with not much data. Still others will require six-hour cycles and will have loads of batch-driven data. The pricing models must evolve as well.

The future of ETL
ELT and best-of-breed extraction systems (without modification) will only be around for a while longer in their current form. If they are to survive, they need to change the way they do business. There will always be a need to transform batch-level data, particularly because many businesses can't justify converting 100% of their business to the relatively high costs of truly active data warehousing. But what will change is the way we do ETL, and the notion of ETLT itself will come into being (see "Alphabet soup" in my article in Vol. 4 #3 of Teradata Magazine).

This raises a number of questions: Why has ETL shifted toward ELR? Why is in-database transformation becoming best of breed? Why has EAI stayed where it is in this game, despite the fact that its ability to handle business rules and transaction distribution has expanded? And why do we need map-generation and in-database transformations?

Here are some of the business drivers influencing the future of ETL:

  • Accountability of the data warehouse
  • The ability of businesses to change their definition of the "truth" without affecting underlying history (ties to accountability)
  • Compliance with national and international accords
  • Auditability of the data as a statement of fact
  • The merging notions of what's operational and what's a data warehouse
  • The need for business to incorporate strategic results into answering tactical questions about trends and forecasts (i.e. data mining)
  • Finding source system and source process errors that have existed for years, but which can't be spotted if the data warehouse loading systems are "cleaning up the data" before loading it
  • Business decision time is continually shrinking. It's nearly impossible anymore to be on a pro-active decision curve. Most of the time we analyze information, and end up making a reactive decision. There's simply too much data for us to digest it all.

It's no longer enough to provide transformation in-stream. It is now necessary to also offer the ability to push all or part of the transformations into the RDBMS engines which they feed.

Where does real-time fit in?
Real-time is the function of the data being triggered from source systems, and as soon as the change is ready, it is passed along the chain to the target (hopefully with one single version of the facts).

Real-time (or more accurately, near real-time) is the ability to respond to transactions as they happen. ETL functions best in a batch mode. ELT provides some real-time opportunities that cannot be found in the traditional ETL space.

The new tool of the future will contain a Web-based business workflow diagramming solution (much like EAI) that end-users can design on three levels: business rules, encapsulation routines and low-level code. However, the end-result will be high-speed, dynamic data and structure triggering and manipulation.

ELR shifts the very foundation of "batch" functionality into a more congruent state of mind. In very simple terms, we can take what was a traditional batch tool, set a job to run in real-time and wait for requests on a Web-services queue.

Conclusion
Between a data volume of 1TB and about 80TB, ETL and in-database transformation seem to be a wonderful fit. Beyond 80TB, it no longer matters which tool does the transformation; it simply has to happen in RAM and in parallel. In some cases, it must happen as data arrives in the integrated data store (data warehouse); in other instances, the accountability will be drawn into question, and the transformation must occur on the way out of the data warehouse.

In an after-effect, scoring the data and constantly cleaning out the haystacks of "bad" or indifferent information will be an ongoing task. Deleting the uninteresting data after it's scored and deemed irrelevant will be important to managing size. However, don't forget the Federal and international requirements to keep all data auditable. This will quickly become an impossible task, until a petabyte is available within the nanotech world.

Finally, for a majority of the market, in-database transformation is a trend that shows absolutely no sign of slowing down. And let's not forget the notions that brought us to this point of massive data sets: accountability and wavering definitions of "system of record." T

RELATED LINK:

Dan Linstedt's online column on the ETL/ELT debate Part I

© Teradata Magazine-September 2004


back to top



Copyright by Teradata Corporation 2001-2007.