Teradata Magazine Cover Teradata Magazine Online  
Register Help Password
Password:
Quick Links
Current Issue
Archives
Teradata.com
Teradata Magazine Rss Feed
ARCHIVES Search Teradata Magazine Online:  
Printable versionPrintable version     Send to a colleagueSend to a colleague

Drawing a line

Lineage collects details along data's perfect path

The quest to find "Holy Grail" can be considered a journey that leads to perfection. To data warehousing professionals, having a record of where data originates, what applications employ it and what processes shape and clean it could be considered the "Holy Grail" of data governance. But does this help the data warehouse users?

Lineage, the collection of details about data's journey from source to target, is the audit trail of data warehouse processes. These processes acquire the data into transient staging areas called data dumps, cleanse and transform the data before and/or after it lands in the dumps, load the data into the data warehouse in a normalized format, push the data to logical and/or physical data marts, and tweak the data inside business intelligence (BI) tools.

Data governance of the data warehouse process
Figure 1 Click to enlarge
Figure 1 shows a simplified representation of the lineage for an enterprise data warehouse. Data dumps and marts might reside on the same platform as the data warehouse, but not necessarily.

With today's tight budgets, IT is unlikely to secure new funding for better data governance. The benefit that appears intuitive to data warehouse technicians becomes difficult to quantify in a cost justification analysis. If an investment fails to help the bottom line, then you might as well not ask. All is not lost, however. The political key to success in this area is to understand, promote and cost-justify data lineage as a savings on the business side, as data guidance. The technical key is to have a modern, object-oriented, extensible metadata engine.

Marketing the warehouse
The data warehouse, unlike operational systems, requires voluntary usage with minimal training on the business side in order to be successful. One way to market the data warehouse is to use metadata to provide data guidance. This allows the business users of the data warehouse to solve their ad hoc questions without technical intervention or heroics. To see data guidance's value, consider the plight of the user. Data warehouse users know that specific data exists somewhere in the organization, but it might have a different name or meaning depending on where it resides. An optimally utilized data warehouse provides the business side with answers to three basic guidance questions: What is the meaning of the data (glossary)? Where does this data come from (first sources)? Where can I see this data (final targets)?

Ideally, the subterranean activity depicted in figure 1—the acquiring, cleansing, loading and utilization of the data warehouse—should never be exposed to the business user. Lineage is a required but hidden component of the metadata. With a glossary of terms and aliases — related to summaries of the lineage — as the central entry point, a repository can enable business guidance for the data warehouse by allowing the user to get answers to those three basic data guidance questions.

Teradata markets a state-of-the-art metadata repository engine called Teradata Meta Data Services (MDS). It was founded on the standards for repositories that emanate from the ANSI committee of the late 1980s. Modern repositories are object-oriented and extensible. Object-oriented repositories require less writing and less code maintenance than with a SQL model; extensibility means that, whenever you add new object models, classes, relationships or properties to the repository, the underlying system automatically recognizes and processes these modifications.

Teradata Meta Data Services (MDS) seeds and feeds
Figure 2 Click to enlarge
These two key value-added features lessen administrative support to the repository. Figure 2 illustrates some likely metadata seeds that exist to support data warehouse guidance.

Meeting the requirements
There are several requirements for lineage support:
>A glossary of terms along with other useful classes that may be populated from a data model development tool such as ERwin.
>A set of classes that fully describe the journey from source to target.
>Maintenance routines, typically one for each tool type.
>Summarization routines that hide the voluminous lineage details from the business user while providing a synopsis of the journey in a user-friendly way.
>A Web tool for displaying specific details or summaries to interested users.
>Processes and procedures that execute lineage acquisition and summarization.

The basic Teradata objects are organized into the DatabaseModel (formerly known as the DIM), which is included with every Teradata MDS installation. However, Teradata MDS can easily support any model variety that has classes of objects. Since lineage components and glossary requirements vary from shop to shop, this extensible feature of Teradata MDS is critical. Figure 3 shows sample extension models that hold the glossary and lineage classes. For organizational simplicity, we prefer to separate the glossary's model (CentralModel) and the LineageModel from the DatabaseModel.

Describing the lineage
The value of Teradata MDS's object orientation and extensibility now comes into focus. A simple XML file is used to add a couple of models—CentralModel and LineageModel.

Models of Teradata MDS
Figure 3 Click to enlarge
As shown in figure 3, a model, CentralModel, with a class, Glossary, has been added. Each Teradata MDS customer can have his or her own variation of this CentralModel. Glossary has all enterprise terms and their multiple definitions. For searches, one would select a term and then be able to "fan" alternatives, related terms, aliases, sub-terms and any usage of the term held in the customer's model.

A superclass, CentralClass, has the ability to hold many classes of interest to the business, and any such class will inherit characteristics that have been placed on the CentralClass. Entity and Element could be two such classes loaded from a modeling tool such as ERwin. Any term in the Glossary can be related to any object in the CentralClass. The loading and management of the Glossary and the related classes can be accomplished via many tools and techniques, and the details of these processes are left for another day.

The LineageModel documents the data's journey from source to target. All data receptacles are defined to DataDepot. This can include databases, reports, fields, screens, business intelligence cubes and any other resting or transfer point of data as it journeys from legacy systems to business tools. Transform is the class that holds rules that, when applied to sources in DataDepot, produce targets in DataDepot. Some examples of transform objects are steps in Teradata's load utilities FastLoad,MultiLoad, TPump or FastExport; specifications of modifications made to an extract, transform and load (ETL) tool; and CreateView statements. Mapping is a class that holds what are often referred to as job streams or scripts. The relationships between these Lineage classes should have names that reflect their purpose.

Populating the lineage
To generate a data lineage, each mapping job's details of how it relates sources to targets should be stored. This will provide a complete picture of how data flows across the organization from legacy through staging, to tables, then to views and out to BI tools. One can then see backward to the feeder systems and forward to the usage systems. This implies the need for as many loading techniques as warehouse-related data transform tools.
Here are a few examples.

> ETL—One useful technique is to use a view facility over, for example, the Informatica metadata, saving the answer set to Excel via Teradata Queryman. An Excel–MDS bridge can then load the Informatica mappings and transforms into the Lineage structure.
> xLoad—This Transform type utilizes a parser. Several companies have parsers that can populate Teradata MDS. Teradata has a model and facility CLM (ClientLoadModel) that reports on Teradata Utilities such as FastLoad and MultiLoad. This tool fixes sources and targets within Teradata objects but is not easily deployable for other databases' objects, ETL techniques and BI tools.
> BTEQ—Most craftsmen have more than one tool in their toolbox.While architecturally attractive, ETL tools often fall short with large or complex processes. Moreover, most ETL tools are seen as "pre-Teradata" in that they are utilized to load staging tables but further cleansing and transformation is most efficiently handled within Teradata. Batch Teradata Query (BTEQ) is uniquely suited to execute such a strategy.When one ponders the complexity of some BTEQ transforms (subqueries nested to the nth level), the ability to determine lineage or impact analysis manually becomes a superhuman effort. Parsers strip this complexity away to populate Lineage.
> View Data Definition Language (DDL)—Often metadata is considered solely to handle ETL, where ETL is seen as terminating once the data is on Teradata tables. The utilization of data warehouse data begins with the DDL of Teradata View creation. Base views, which reflect the conceptual schema, and derived views, which reflect the external schemas, need to have their DDL parsed and populated within the LineageModel. View creation is often a recursive process in that views are defined against views to an indeterminate level. But consider this: Any parser that handles "Insert … Select …" within BTEQs Data Manipulation Language (DML) should handle the DDL syntax of "CreateView." The DML creates a physical object while the DDL creates a logical object, but the parsing is abstractly the same.
> BI Tool Lineage—Every BI tool differs on how to extract and load lineage. Object-oriented tools, like BusinessObjects, have an attractive access to pull and push the metadata. Tools that have a proprietary metastore, like MicroStrategy, require that an Application Programming Interface program be written to populate the Lineage.


Using lineage
Users want to know what data means, where it comes from and where it "lives," but they don't need or want to know about the technical effort required to convert sources into reporting objects. The details would confuse, not help, users. Even technicians sometimes find it challenging to comprehend the complexity of the data warehouse processes that are outside their expertise.

While everything that has been discussed to this point is exposable via outof-the-box tools, navigation through this complexity is overpowering to technical users and impossible for most data warehouse users. Details within a data warehouse require summarized data (marts); so too does the lineage require simplification. One way to effectively accomplish this is to extend the models further and to populate them via recursive routines that perform the summarization.

So we add a class, SummaryTransform, holding Impact Analysis reports (in both directions) for technical usage, and we add two classes that perform a similar but even simpler summarization for the business user, named SummarySource and SummaryTarget. Both of these classes are related to a superclass of Entity and Element referred to here as DataItem. This simplifies the picture, and more significantly, it simplifies navigation. While placement is arbitrary, figure 4 shows the technical summary class (SummaryTransform) placed into the LineageModel and two user summary classes (SummaryTarget and SummarySource) placed into the CentralModel.

Additional lineage summary
Figure 4 Click to enlarge
The population of these summary classes can be tricky since lineage operations have an arbitrary number of steps and paths. If a vendor provides for the loading of lineage from multiple vendors and tools, then make sure they have compatible summarization tools. Most often, the summarization is based on SubjectArea.

As a modern repository, Teradata MDS is extensible, so the act of adding new models, classes and properties for objects is trivial; the tools automatically recognize and utilize the extensions. The challenge comes not in the specification or utilization of the metadata but in its acquisition and management. The extensions and management technique described here are operational at accounts deploying Teradata MDS, which can be easily tailored to provide the lineage support for your data warehouse.

The establishment and support of lineage takes data warehousing professionals further in their journey to find the "Holy Grail" because it provides a cost justification at the business level based on business user guidance. It answers users' critical questions without the need to contact or involve technical support personnel. And the fact that it provides data governance (cost reduction) to the technical staff is a massive by-product.

© Teradata Magazine-June 2004



Copyright by Teradata Corporation 2001-2007.