Teradata Magazine Cover Teradata Magazine Online  
Register Help Password
Password:
Quick Links
Current Issue
Archives
Teradata.com
Teradata Magazine Rss Feed
ARCHIVES Search Teradata Magazine Online:  
WEB-ONLY CONTENT

Printable versionPrintable version Send to a colleagueSend to a colleague

Data-modeling process for Enterprise Data Warehouse

Editor's note: The concepts in this piece are outlined in a diagram of the data-modeling process. It may be helpful to print this diagram and have it handy as a reference while reviewing this piece.

This data-modeling process is designed to produce an enterprise data warehouse (EDW), which supports a large number of users across many or all functional areas of the business. An EDW is too broad in scope to be built in a single development effort. Instead, it must be constructed in a coordinated, iterative fashion. This usually involves several development teams, each responsible for a different portion or subject area of the EDW. Some teams may work in parallel if their subject areas are not dependent on one another, but others must work in a series, with one laying the foundation for the next.

A successful, long-lasting EDW must be flexible, extensible and integrated. Flexibility concerns the impact new or modified business data requirements have on the data warehouse design. The less a design must be altered when a requirement is added or changed, the more flexible it is. Extensibility deals with the way increasing the scope of the data warehouse impacts its design. The less a design must be altered when equivalent data from other business lines are added, the more extensible it is. Integration reflects how well a design avoids data redundancy, i.e. storing the same data value or the same kind of data in more than one place. The fewer redundant tables and columns a design has, the more integrated it is.

While an EDW must accept data from various source databases and provide data to various analytic applications, it must have a neutral data structure—not one that is heavily influenced by either the source databases or the analytic applications. To accomplish this objective, this modeling process needs data requirements that are based on the business processes that the EDW will support.

Walking through the data-modeling process

Step 1
The EDW data-modeling process consists of four major steps, one for each of the four models. In the first step (see A1 in the accompanying diagram), the modelers work with subject matter experts (SMEs) who have broad expertise in how the enterprise operates. Together they identify the fundamental business objects and business rules that should be represented in the framework model. The modelers also examine various reference models, looking for model objects they can adopt or adapt to simplify the task. The framework model is fairly small (i.e. it has relatively few entities), so it can be completed before the next modeling step begins.

When the framework model is finished, it is used to plan the rest of the EDW modeling and development efforts. This is a program-management activity, not a modeling activity, so it is not part of the modeling process. However, it is described here to provide context for the subsequent modeling steps.

Program management, with the help of data modelers and referring to the framework model, determines the data scope for each development team. In many cases this involves assigning responsibility for particular entities in the framework model to a team. However, the data scope for some teams may involve business data that is too detailed to appear in the framework model. For example, one team may be given responsibility for all reference data. Other teams may focus on particular detailed aspects of framework entities.

Each of the remaining steps in the modeling process produces a single data model, but not all at once. Instead, each model is built up in a series of layers, each contributed by a different development team. Each layer consists of a number of subject areas containing all the model objects (entities, attributes and relationships) that are within that team's data scope.

The first team creates the initial layer, i.e. the first set of subject areas and model objects. Each subsequent team expands the model by adding a new layer, i.e. new subject areas and model objects. Many teams have overlapping data scopes; each needs to share some model objects with other teams. Since the goal is an integrated, nonredundant EDW, no team recreates model objects that a prior team has already provided. Instead, each team reuses any existing model objects that are relevant to its own data scope by including them in the subject areas that it builds. A team may modify or augment an existing model object (e.g. clarify an existing entity definition, add a new attribute to an existing entity, etc.) but only after discussing the changes with all other concerned teams.

Note: To simplify the terminology, the rest of this modeling-process description uses "model" rather than "model layer". However, "layer" is always implied. For example, in the first sentence of the next paragraph the phrase "construct a data model of the business data requirements" actually refers to building a layer of the requirements model, not the entire model.

Step 2
In the second step of the process (A2), the modelers on each development team construct a data model of the business data requirements that were gathered by that team. These requirements come from two sources: SMEs who are familiar with particular business areas and the computer applications that support those business areas (i.e. reverse engineering). The goal at this point in the process is to verify that the team found, documented and now understands all the relevant data requirements. (It is important to remember that this model is not the initial design for the EDW.)

The modelers produce a literal, neutral data model that is recognizable and intuitive to the business community. They then review that model with SMEs to validate its accuracy. The model is " "literal" because the modelers avoid using abstract model objects that would be more difficult for SMEs to understand and validate. It is "neutral" because the modelers do not mimic data structures from existing databases, even if they derive most of the requirements by reverse engineering from database designs. In addition, the modelers document the mappings between the requirements and the equivalent model objects.

Step 3
In the third step (A3), the modelers develop a logical design that will satisfy the business data requirements shown in the requirements model. At this point they are only concerned with producing a flexible, extensible design. The modelers look for literal model objects in the requirements model that could be replaced with abstract model objects in the logical data model. Abstract objects can make a model more flexible and extensible by enabling it to accommodate new or altered requirements without having to be changed. For example, a person can be related to an application in any of several ways: primary applicant, secondary applicant, guarantor, etc. (see figure 1).

These relationships can be thought of as roles that a person can have on an application. Each is a separate, literal relationship in the requirements model, but they all could be combined into a single abstract associative entity in the logical model (see figure 2). Each instance of that abstract entity would represent a particular person on a particular application and would include a code to identify which role that person was in: "P" for primary, "S" for applicant, "G" for guarantor, etc. This abstract entity would allow new roles to be established by simply assigning additional code values rather than having to modify the model itself.

So the only differences between the requirements model and the logical model are where the modelers have replaced literal model objects with abstract model objects. The modelers document these differences as mappings between the corresponding objects in the two models.

While the modelers are developing the logical model, other members of the development team (data mappers) compare the requirements model to source databases, looking for equivalent objects. This is a data-mapping activity, not a modeling activity, so it is not part of the data-modeling process. However, it is described here because the modelers need the source mappings when they begin the next step.

The data mappers try to match the attributes in the requirements model to equivalent columns in source databases. An attribute and a column are equivalent if they have the same meaning. The data mappers also compare the model entities and the database tables to determine which of those are equivalent.

The data mappers record the source mappings for all attributes, documenting one of the following types for each:

  1. Equivalent column
  2. Substring of a compound column
  3. Concatenation of various columns
  4. Calculation from one or more columns
  5. No equivalent column

Step 4
In the final step (A4), the modelers convert the logical design into a physical design. This involves five substeps.

In the first substep (A4.1) the modelers generate a physical model from the logical model. At this point the two models are identical, except that:

  • the table and column names have been shortened to conform to the maximum length for Teradata names,
  • logical datatypes have been replaced with their physical equivalents, and
  • supertype/subtype relationships have been converted to associative relationships with 1:0-1 cardinality.

In the second substep (A4.2), the modelers profile the data in the source databases and assign the appropriate datatype and nullability to each column. They also examine the source mappings and adjust the data model where necessary.

The first type of mapping (equivalent column), which applies to most attributes, requires no adjustments. The remaining types are fairly rare. The substring, concatenation and calculation mapping types do not require any changes to the physical model, but the modelers (in conjunction with the development team leader, project sponsor and relevant business users) may decide that the source columns are additional requirements.

If so, the modelers return to the A2 step to add attributes to the requirements model that are directly equivalent to the source columns. Then the modelers proceed to the A3 step to add those new attributes to the logical model, and the data mappers augment the source mappings to account for the new attributes.

For the last type of mapping (no equivalent) the modelers decide whether to remove the column from the physical model or to leave it, knowing that it will be sourced from a future database. If they remove the column, they also mark the attribute as logical-only in the requirements and logical models.

In the third substep (A4.3) the modelers turn their attention to the applications that will use the EDW. They examine the SQL Select statements that each application is expected to use, identifying the columns that, for example:

  • will appear in the output (Select clause).
  • will be used to join tables (From clause).
  • will be used to select rows (Where clause).

Based on this analysis, the modelers establish the appropriate primary index for each table. They may also create secondary indexes and join indexes, apply compression, define views and even denormalize tables, all with the intent of improving query performance.

In the next substep (A4.4) the modelers add any tables and columns that are needed to help control the operation of the EDW. These will support activities such as data loading, auditing, archiving, etc.

In the final substep (A4.5) the modelers document the mappings between the logical model objects (entities and attributes) and the corresponding physical model objects (tables and columns), accounting for any unsourceable columns that were removed in A4.2 or any table denormalizations that were established in A4.3.

Summary
This data-modeling process builds a series of four data models, each of which serves a different purpose. The framework data model is used in planning and coordinating the efforts of the individual development projects that build various subject areas of the enterprise data warehouse. The requirements data model ensures that the development teams have complete, accurate, well understood business data requirements before they begin designing the warehouse. The logical data model provides the abstract data structures that make the warehouse flexible and extensible. The physical data model augments the warehouse design, where necessary, to meet usage, performance, and control requirements.

The framework data model is fairly small and is completed before the individual development projects begin. Each of the other three data models is built incrementally as each development team contributes its assigned subject area. The teams coordinate with each other so that they produce a single, integrated data model of each type rather than a collection of diverse, stovepiped models. In addition, the teams evolve each data model from the previous one (logical from requirements, for example) rather than starting from scratch, so all the data models are consistent with one another. T

Data models

This enterprise data warehouse (EDW) data-modeling process produces four data models:

  • Framework data model—a high-level model containing only major entities and relationships. It is meant to guide the EDW modeling and development efforts by documenting the enterprise's fundamental business objects and business rules. It omits reference data and other non-key attributes.
  • Requirements data model—a detailed model that documents the business requirements. It contains all the entities, attributes and relationships needed to describe the requirements. It is a literal model that presents the requirements in concrete terms; it does not introduce abstractions that may be needed to achieve flexibility and extensibility in the EDW, such as Party-Party Relationship, which appears in Teradata's Financial Services Logical Data Model (FS-LDM).
  • Logical data model—another detailed model of the business requirements, but one in which abstract model objects (entities, attributes or relationships) replace some of the literal objects from the requirements model in order to provide the flexibility and extensibility that is expected in the EDW.
  • Physical data model—a detailed model of the actual tables and columns that will be implemented in the EDW. It is based on the logical data model, but it may be denormalized somewhat to meet particular usage and performance requirements, and it may differ from the logical model in areas where the source database designs do not conform to the business rules and requirements.

© Teradata Magazine-June 2005


back to top




Copyright by Teradata Corporation 2001-2007.