|
Project needs and the enterprise view—using EDW design layers to bridge the divide
by Tom Russell
Enterprise data warehousing (EDW) design layers are the comprehensive set of data models and generated databases developed through progressive design steps. They address the varied needs of true enterprise intelligence.
There are four basic layers all of which contain sub-layers for physical data models (PDM), data definition language (DDL) and the generated database.
 | The Base layer can be perceived as the foundational enterprise data. It is composed of high normal form (HNF) PDMs, DDL and generated databases. HNF is used in the Base layer to ensure data independence across the enterprise. This layer models the natural business relationships rather than avoiding them to build artificial structures that meet immediate project requirements only. The Base layer is protected from direct user access. |
 | The Data Access layer, also known as the Semantic layer, is composed of the PDMs, DDL, and generated databases that are used to meet immediate project needs. This layer includes database views, materialized views and propagated tables. The Data Access layer is where denormalization and special-purpose data structures are implemented, but they are always built upon the HNF Base layer. The Data Access layer enables the data to be easily presented through the use of customized views to anyone who needs to use it. The difference between this and the Application layer is that all objects are maintained within the EDW database, as opposed to dependent data marts. |
 | The Application layer is composed of PDMs, DDL, and generated databases used to meet immediate project needs, but which are managed outside the EDW database, as dependent data marts. These may include application specific file structures.
|
 | The Staging layer is the complete set of PDMs, DDL and the generated database used in the extraction, transformation and loading of data into the EDW. |
The concepts of Application and Staging layers are relatively easy to grasp, but the concept of the Base layer and the Data Access layer is more complex and has seldom been plainly expressed. Therefore, this article will focus on these two layers, providing insight into their purpose, structure, benefits and application.
A tale of two layers
By formulating Base layer and Data Access design layers (Figure 1), we can begin to methodically determine how to sort through the often controversial issues of how to model and build EDW data structures. These layers enable us to embrace, without bias, all of the data modeling techniques that have been developed for decision support, including high normal form (HNF), denormalized methods (e.g. star schema and snowflake), and special purpose file structures.
We can use these techniques in the appropriate places to achieve the dual purposes of addressing immediate project requirements and future, as yet unknown, enterprise needs. This is what separates true enterprise data warehousing from other forms of decision support, which focus only on immediate project requirements.
To grasp the importance of this distinction, we must first define what we mean by the word enterprise. A narrow definition of this word refers to the enterprise as the firm, its organization, and its processes. A broader definition, which we advocate here, is that the enterprise includes the entire view of all the value chains for each market in which a firm competes. This outlook is not centric to the firm; rather, it allows the firm to embrace the perspective of any participant in the value chain, as need and data availability permit.
If we use this definition of enterprise, then enterprise intelligence must have the potential to support it—not through a series of disconnected snapshots, but through an integrated view of the flow of all discrete value-creating activities, even as they span organizations and companies.
It is the Base layer that enables value-chain centric enterprise intelligence. All data from across the value chain flows into this layer and, from there, out into the Data Access layer. It is largely through the Base layer that we gain the ability to perform analysis that cuts across the entire value chain and to address future unknown needs.
Such a capability can only be achieved through the recognition that the EDW is built incrementally, project by project. It is formed like an enterprise jigsaw puzzle, with each puzzle piece being a project that returns some immediate value through its implementation. The holes and knobs of the puzzle pieces are the natural relationships that exist in the business; the well-instituted capture of these relationships will enable us to produce our value-chain views of the business.
The result over time will be the emergence of the complete view of the enterprise puzzle. In it, every strategically and tactically relevant activity and event that takes place in the value chain (upstream or downstream from the firm) will be linked to show how marketplace value is created. Likewise, every product produced within the value chain and every material used can be potentially traced from conception to termination.
Every organization or individual (both internal and external) can be integrated, identifying the multiple roles they play in owning, performing, purchasing, selling, controlling, etc. Performance driver measures can be linked to outcome-oriented measures (even to models of shareholder value creation), thus creating predictive models to drive performance improvement. And on we could go.
The Base layer is the key to achieving this vision. The approach to the Base layer is deeply aligned with the fundamental objective of a relational database management system (RDBMS): to keep the data independent of the applications and analysis that use it. Data independence is achieved through HNF data modeling1, where data is stored in tables that eliminate redundancy.
HNF ensures that every column of a table is rigorously related to that table and no others. It eliminates the repetition of groups within tables, provides for only one fact per column and eliminates derivable data from Base layer base tables. This rigor ensures that information is consistently beneficial to all applications and analyses that use it (now and in the future), and it provides for reliability across applications because there is a single source for each particular data element.
Prior to the development of the RDBMS, applications were data dependent, meaning that the physical representation of the data and the methods of accessing it were built into the application code. With the advent of the RDBMS, which has come to dominate the database industry, databases employed to run businesses (often called on-line transaction processing, or OLTP, systems) followed the principle of data independence dutifully. This worked well for OLTP systems because applications generally accessed a single table or perhaps joined together a few, and the queries were relatively simple.
Then along came data warehousing, and suddenly huge demands were placed on the relational systems in the form of large table scans, large result sets, many table joins, complex derivations, and concurrent, complex transformation and load requirements. Databases that had been designed for OLTP could not handle these decision support performance demands. They could not perform the queries in a reasonable time or, in some cases, at all.
What happened as a result was rather ironic. Many of the leaders in RDBMS technology sacrificed the concept of data independence (the foundation of the relational model) and started to build data models tied closely to the type of analysis performed. The consequence was a proliferation of independent data marts linked directly to the needs of particular applications or analysis. As new needs were identified, new data marts were built—along with a proliferation of data mart servers and desktop data marts.
Instead of working together to develop one puzzle, a series of puzzle-piece clusters started materializing throughout the enterprise. Unfortunately, they could not be connected together because many of the pieces were redundant or, perhaps worse, were thought to be identical when, in fact, they were not.
The negative consequence was threefold: (1) skyrocketing costs of maintaining the spider web of analytical data marts; (2) an analytical credibility crisis when numbers did not match up across data marts, and, most importantly, (3) the inability to do cross-functional analysis, let alone cross value-chain analysis. Clearly, the cost of moving from data independence was significant.
There were, however, positive consequences. The new schemas appeared much less complex to non-technical users and provided value for certain types of analysis (e.g. dimensional modeling). Moreover, as these denormalized schemas minimized complexity they actually performed better. A third significant reality was that certain analytical application vendors tied their offerings to proprietary data structures, making it necessary for their customers to embrace these structures. EDW design layers help us determine how to take advantage of all the innovations that have occurred in enterprise data warehousing during the past couple decades.
Logical data modeling and the Base layer
The EDW LDM is largely the representation of the Base layer. In order to address cross value-chain analysis and future needs, it should be constructed as an HNF model with a high degree of abstraction.
HNF modeling provides the means to connect the knobs and holes of our enterprise puzzle pieces. None of the enterprise's value-creating activities exists as a stovepipe in the value chain. Instead, by definition, the value chain is a continuous flow of activities in the creation of marketplace value. All are connected to some or many other activities.
By identifying the natural objects and relationships that exist across the enterprise, HNF models enable this view of the entire value chain to emerge over time as new projects are embraced. There is a synergistic discovery process that takes place as analysts gain new insights into how their particular projects and subject areas are linked to others upstream and downstream. In short, this is the stuff of discovery and innovation.
HNF is the most important technique in building the cross-enterprise view, but there is another technique that contributes significantly to the Base layer's extensibility, manageability, and the ability to construct shared analytical structures that span the value chain. This technique is known as abstraction.
Abstraction is the process of forming generalized entities from specific entities to gain insights into the business. Abstraction can improve the stability of the model by more effectively managing its extension, and it can simplify the model by reducing the number of entities in it. However, the real enterprise value of abstraction is realized through an expanded view of the business. For example:
 | As a company looks across its supply chain, it likely will find that there are common inbound and outbound logistical activities that can provide scheduling and backhaul benefits if the pattern of transportation logistics were integrated. Moreover, these inbound and outbound activities link directly with production and warehousing activities. By generalizing activities, the company can apply common cost structures and institute operational measures that provide new insights into the origin of costs and their impact on overall marketplace value. |
 | Abstraction facilitates the tracing of materials and products (generalized as items) from inception to termination (an impending opportunity given the adoption of RFID technology). |
 | Organizations and individuals play many roles in a business. For instance, organizations can refer to both customers and suppliers. Abstraction enables the distinction of roles among organizations. |
 | Generalization provides for collaboration by providing a common structure whereby suppliers, manufacturers, distributors, and customers can work together in a shared forecasting and planning environment. |
 | Typically, the majority of an enterprise data warehouse's workload involves analytics. Considerable value is gained in managing the common measures that support those analytics to avoid redundancy and build a common layer of metadata for end users. |
When people build puzzles, they often start by gathering similar pieces together to aid in creating larger segments that can then be pieced together. Abstraction also enables us to form clusters of similar business objects. In fact, as abstraction is embraced, we begin to see the formation of a core business metamodel, a kind of backbone for enterprise intelligence.
Such a backbone includes generalizations like Party (organizations, individuals, and households); Items (materials, products, and equipment); Activities and Events (actual, standard, and scheduled); Geographic Areas, Locations and Addresses; Documents; Time Periods; States; and Measures.
Subject area templates within the Base layer
After abstraction, the resulting structures should be developed as "Foundation Subject Areas" (e.g. Party or Item) within the EDW LDM. These Foundation Subject Areas can then be used to build Value Chain Template subject areas.
Value Chain Templates address common business functions that can be implemented anywhere across the value chain. These include templates for manufacturing, transportation logistics, inventory management, sales, marketing, customer service, procurement, financial management, and other value-chain processes. The key criterion for these templates is that they support the general functionality of the process and are extensible.
By having predefined Value Chain Templates readily available, one can rapidly begin implementing specific projects. It is as simple as creating a copy of the appropriate Value Chain Subject Areas and then extending it to meet the needs of the project (Figure 2).
Physical data modeling within the Base layer
Once the HNF project LDM has been developed, it can be transformed into a physical design. The benefits of data independence—cost, consistency, and cross value-chain analysis—can only be achieved if the HNF structures are carried into the physical model at the Base layer. Note that as the project LDM is transformed to a project PDM, only a fraction of the EDW is implemented in a "just in time" fashion. Accordingly, any unused components of the LDM can be modified over time.
With that said, however, the key to the Base layer is extensibility, which comes in two forms:
 | First, as new projects are undertaken, many of their requirements will be supported by the existing (already physicalized) Base layer tables, with no changes or extensions required. Only new types and rows are added to existing tables. |
 | Second, there may be new objects required—new tables or new columns on existing tables. This is not a problem and, in fact, is what the Base layer is intended to be—extensible. |
Base layer failure happens when new requirements demand massive restructuring of existing Base layer base tables, as this will impact the downstream Data Access layer and all applications that are indirectly using the structures. This can be avoided only as a result of HNF modeling and appropriate abstraction in the Base layer PDM. Accordingly, only slight denormalization takes place in the transformation from LDM to Base layer PDM.
DDL and database generation are two additional sub-layers in the Base layer. The Base layer will only contain base tables, as all views and materialized views will be managed in the Data Access layer.
The Data Access layer (Semantic layer)
The Data Access layer (also called the Semantic layer) includes views, materialized views, and propagated tables, all constructed from base tables in the Base layer database. Data Access objects may reside within the EDW database or outside the EDW as dependent data marts, but the requirement for membership in the Data Access layer is that all data, except derived data, must flow out of the Base layer.
Some think of the Data Access layer as a series of database views sitting atop the HNF Base layer base tables, but physical modelers called upon to build these structures will quickly tell you that views alone are impractical for meeting all performance needs. Database views should, however, be the first consideration in addressing reporting, application, and analysis requirements as they provide the highest level of consistency.
If database views do not meet performance needs, the next consideration should be materialized views, as these are always maintained to be up to date with changes to the tables they are based upon. If either view type does not meet performance requirements, the Data Access layer may also contain denormalized tables propagated from the Base layer base tables.
These are simple rules; there are many other methods that can be introduced to optimize tradeoffs within the Data Access layer.
Universal structures within the Data Access layer
Although the Data Access layer is focused on meeting immediate project requirements, the objects created in it should be built to span multiple projects. This results in the emergence of sub-layers within the Data Access layer, as shown in Figure 3.
The foundation of the Data Access layer is the set of views that mirror the Base layer base tables. It is these views that will be used to build other Data Access structures.
Although one might think of the Data Access layer simply as a group of application-specific objects, its primary purpose is to manage derived data and commonly accessed structures. Most calculated measures (e.g., net sales, gross profit, return on assets, and asset utilization) and aggregations should be built once at the universal layer, rather than calculating the same measures again and again in separate places.
There are three main reasons for this. The obvious one is data consistency. Measures that are calculated separately—even in dependent data marts or views within the same database—will invariably yield different results as different methods of calculation, filtering, and transformations are introduced.
The second reason for managing shared measures is time to solution. The universal code for managing measures can be created once, with the application code accessing it to apply whatever enhancements are needed at that level.
The third reason, which often is cited against the alternative of moving calculations to special-purpose databases, is performance. The parallelism of the EDW will generally outperform calculations done on serialized special-purpose platforms.
The same benefits can be realized by building universal structures that address other forms of business rules. For example, hierarchies are highly reusable across projects, yet in the realm of independent data marts they are typically replicated again and again with different methods of filtering and transformation, such that, like redundant measures, they introduce inconsistency across different forms of analysis. Hierarchies can be extremely complicated, especially those that are recursive, unbalanced, multi-parent, and mixed (supporting multiple entity types in the same hierarchy). As with calculated measures, building complex hierarchies once yields time-to-solution and cost benefits.
Time-dependent structures can also be sorted out at the universal level in the Data Access layer to support many projects. Database views, materialized views, and propagated tables can be set up for snapshots of the current view, actual historical profiles, forecasted views, current and future-oriented reference data with transactional data recast, and lifecycle structures reflecting changes over time.
In addition, the universal level can support building shared structures for: conditional logic; reassembling generalizations that exist within the Base layer; and selecting, filtering, and grouping information.
Both the universal views and the views of Base layer base tables in the Data Access layer can be used to create structures that relate to particular applications and analysis. Here we can build views, materialized views, and tables that support star schema applications; flat tables for reporting; and schemas that meet the needs of packaged software, reporting tools, and analysis tools.
The benefits of Base and Data Access design layers
When you consider what has been described here, you might ask, "Why invest the time to create two layers—for the Base layer and for Data Access layer?" First, history has shown that embracing only the immediate project with no consideration for the enterprise results in the proliferation of analytical data marts patterned much like the extraction-processing spider webs that led data warehouse design expert Bill Inmon to define the need for data warehousing solutions in the first place.2 The cost to the enterprise is a reemergence of a "credibility crisis" whereby departments are delivering the same measures with different results due to variable timing, calculations, filtering, cleansing, transformation, or data sources.
The second very real advantage is time to solution and cost. As the EDW matures, many of the objects needed to build the structures that support an application, report, or analysis will already exist as universal Data Access objects. The cost advantage of having a single Base layer is evidenced by the data mart consolidation trend that has emerged in the past few years.
The third and most important reason for building two layers is the creation of a true enterprise intelligence environment—one that recognizes that the firm is more than its organization and processes. In embracing a wider view by creating a Base layer, we gain the strategic potential to create an integrated view of all the value chains for all markets in which a firm competes. Business insights that would otherwise be impossible are now visible because the natural relationships that exist in the business have been captured in the Base layer.
This is particularly true for business analysts charged with explorative analysis (data mining, expert systems, and ad hoc analysis), as opposed to monitoring analysis (reporting and dimensional modeling). Explorative analysis moves beyond predefined business analysis to search for patterns of behavior within the natural relationships.
The goal is a "best of all worlds" scenario, in which we can meet immediate project needs and provide a foundation for future requirements, all with less cost and faster time to solution. However, this requires a disciplined approach that will not compromise enterprise requirements for short-term gains. The organization must be committed to forming the Base layer early on and maintaining it in order to realize future benefits of cost and time savings, consistency, and a value-chain centric model.
T
1I use the term high normal form modeling, versus third normal form modeling, though in practice many use them synonymously. In reality high normal form includes fourth and fifth normal form adherence, and Boyce-Codd normal form as well. Most third normal form models are also in fourth and fifth normal form, but this does not negate the need to understand and adhere to the higher normal form principles.
2Inmon, W.H., Building the Data Warehouse, 1996, John Wiley and Sons, Inc.
Tom Russell is a Principal Consultant and Practice Partner with Teradata. He has designed and developed enterprise data warehousing and decision support systems for large corporations for more than 20 years.
© Teradata Magazine-March 2006
back to top
|