Intelligent by design
Building a physical data model that works.
by Steve Hoberman
The Data Warehousing Institute defines business intelligence (BI) as "the processes, technologies and tools needed to turn data into information, information into knowledge and knowledge into plans that drive profitable business action. BI encompasses data warehousing, business analytic tools and content/knowledge management."
Data is the critical starting point of the transformation chain in this BI definition. To turn data into information and eventually into profitable business action, we need to consider the environment or context. Because BI is broader than data warehousing, all of our staging, data warehousing, operational data store, integration hub and data mart data models fit within the realm of BI.
A model is used to represent something in the real world. One of the beauties of a model is that you can take the same content and display it at different levels of granularity depending on the audience and purpose. A blueprint for a home, for example, will be the communication medium at one level of detail for the homeowners and at another level of detail for the electrician.
Taking this analogy to the data model, the logical data model is the medium for understanding the overall data requirement while the physical data model is the medium for understanding the data requirements within a given context or environment. There are a number of different techniques that can be deployed to create a physical data model, including denormalization, surrogate keys, indexing, partitioning, views and dimensionality.
In order to describe each technique, let's use the following "big picture" example as a starting point. Figure 1 shows a fully normalized logical data model for a large restaurant chain's line of desserts. All of the desserts (banana split, brownie sundae, etc.) are described and priced in the Dessert entity. A dessert can belong to one or more categories, and a category can contain many different desserts. Likewise, a dessert can be offered in many restaurants, while a single restaurant can offer many different desserts. In addition, daily orders are recorded for each dessert by day and by restaurant. A region can contain many restaurants (e.g. Northeast contains 50 restaurants), but a restaurant can reside in only a single region.
With the logical data model providing the big picture, let's look at how each of the following six physical data modeling techniques can help users understand the data in context.
Technique 1: Denormalization
Denormalization is the process of selectively violating normalization rules and reintroducing redundancy into the model. This extra redundancy can reduce data retrieval time, which is the primary reason for denormalizing. We also can denormalize to create a more user-friendly model. For example, if all of the time data elements existed in a single table instead of 10 normalized tables, users might be able to retrieve the data faster and understand it better.
Figure 2 shows two examples of denormalization. Dessert Category and Dessert Category Assignment are no longer separate tables, as in figure 1; instead, they exist as three sets of data elements in Dessert. We call this type of denormalization an array. An array is a fairly severe form of denormalization, as it is a repeating group and therefore violates first normal form (1NF). The two structures labeled Dessert and Restaurant displayed in Exhibits A and B, respectively, are representative of the denormalized tables noted in figure 2.
Note that these denormalized structures can make it faster to retrieve certain kinds of information; for example, the business user could view all of the desserts priced over $4 along with each of their categories. Denormalization also can make other types of reporting more difficult, such as viewing all frozen desserts. Looking for frozen desserts in the denormalized structure Dessert requires searching through each of the three category data elements.
Region, from figure 1, has been denormalized into Restaurant in figure 2. This is the most popular denormalization technique. The parent entity goes away and the parent data elements and relationships are now a part of the child. There are several other ways to denormalize, but they all share the characteristics shown in these two examples. Introducing redundancy allows for faster retrieval and more user-friendly structures; however, we need to be very selective where we introduce denormalization, as this redundancy comes with a price. For instance, it can:
- Cause update, delete and insert performance to suffer.
When we repeat a data element in two or more tables, we can usually retrieve the values within this data element much faster. However, if we have to edit the value in this data element, then we also need to edit it in every other table where it resides.
- Take up more space.
Repeating the region name for each restaurant requires more space. In a table with a small number of records, this extra storage space will not be substantial. However, in tables with millions of rows, such propagation could require megabytes of additional space.
- Introduce data quality problems.
By including the same data element multiple times, we substantially increase the risk of data quality issues when value changes occur. Also, denormalization reduces the number of relationships and, therefore, reduces referential integrity.
- Stunt growth of the application.
When we denormalize, it can become harder to enhance structures to meet new requirements. This is because we need to understand all of the hidden rules on the physical data model that were shown on the logical data model—before we add data elements or relationships.
Technique 2: Surrogate keys
A surrogate key is a unique identifier for a table. It is an integer that has no meaning based on its value. That is, you can't look at a Month Identifier of '1' and assume that it is January. Instead, a surrogate key is usually a system-generated counter. In almost all cases, surrogate keys are not visible to the business; instead, they operate "behind the scenes" to help maintain uniqueness and facilitate integration in the data warehouse.
Surrogate keys provide for a number of benefits, the two most important being integration and efficiency. BI solutions will more than likely source from more than one application, a process that requires a fair amount of integration. Surrogate keys can be defined as unique across source systems, so that a Robert Jones from system XYZ and a Renee Jane from system ABC might both be identified in their respective systems as RJ. However, once they are loaded into the data warehouse, they are each assigned a unique, non-overlapping value. Also, surrogate keys are integers that take up less space and provide faster joins than other formats or composite keys.
In figure 2, Restaurant Identifier is an example of a surrogate key.
Technique 3: Indexing
An index points to the object you are seeking. An often-used analogy is a library's card catalog. The card catalog points you to where the book you want is on the shelf, which is much faster than looking at every shelf in the library until you find the right book. In the same way, a data index points to the actual place on the disk where the data is stored, making retrieval much faster.
In BI applications where there is a lot of retrieval in the form of reporting and queries, indexing is essential. Indexes work best on data elements that are regularly retrieved but rarely updated. Indexes also are valuable for tables like Customer and Order that contain many rows. Indexing is very specific to the underlying database; you might find that what works well on one database platform falls short on another.
Technique 4: Partitioning
Partitioning divides tables by row, by column or both. If a table is broken up by column, it is vertically partitioned. If a table is broken up by row, it is horizontally partitioned. Partitioning, as with indexes, is very specific to the database; be sure to consult your database documentation to learn the guidelines on when to partition.
In the example in figure 2, if Daily Order gets too many rows for it to provide the desired query performance, we might want to use horizontal partitioning to break out the rows by year. This would be very efficient if most of the queries did not cross years, but would be very inefficient for questions such as "On which days over the last five years did we sell more than 10 banana splits in a given restaurant?"
An example of vertical partitioning would be storing Dessert Price Amount in its own table because it might change much more frequently than the rest of the Dessert data elements.
Technique 5: Views
In the 2003 Intelligent Enterprise article "Teradata ... a Data Warehouse with a View," Deborah Smith wrote, "Views are the transparent layers that sit on top of the underlying data model or tables. They provide a way of looking at or regarding the data. This transparent presentation layer is what the end user sees and uses. Teradata views don't require space ... and they can transition the underlying data schema into nearly any presentation with such efficiency that the cost of using views is mitigated to simply the management of them. The benefit provided by the view far exceeds the costs."
The key point is that, at times, views can offer all of the same benefits as denormalization without the drawbacks associated with data redundancy. For example, a view can provide user-friendly structures without sacrificing referential integrity. A view will keep the underlying logical data model intact while presenting a denormalized or summarized view of the world to the business.
There are different types of views, some of which might not match the performance of an actual denormalized table, but this depends on database, number of rows, etc. However, there also are views, such as materialized views or snapshots, that can match and sometimes beat retrieval speed from actual tables because they are generated at some predetermined time and stored in a system cache.
Figure 3 shows two views. The view built on Restaurant and Region, Restaurant_User_View, allows for the region and restaurant information to be presented in the same structure, saving the user and the report tool the time and space required to locate and perform the join across both tables. In Category_Sales_ By_Day_View, we have built a summary on top of the underlying tables, without increasing space requirements.
Technique 6: Dimensionality
BI models are either relational or dimensional. Mastering Data Warehouse Design (Imhoff, Galemmo and Geiger; 2003) has a great definition of a relational model: "The relational model is a form of data model in which data is packaged according to business rules and data relationships, regardless of how the data will be used in processes, in as non-redundant a fashion as possible. Normalization rules are used to create this form of model."
From the same book: "A dimensional model is a form of data modeling that packages data according to specific business queries and processes. The goals are business user understandability and multidimensional query performance."
Note there is nothing in this dimensional definition about business rules and normalization. How the data is going to perform and be used are contextual concerns; therefore, dimensionality is a physical data modeling technique.
A dimensional design is a seemingly simple structure where all of the elements viewed on reports or in queries appear in a central table called a meter (data warehousing guru Ralph Kimball refers to it as a fact table). In this type of model, all of the ways of reporting on the data appear in dimensional structures by subject area. Each subject area is called a dimension. There are two major rules in a dimensional structure that make its appearance completely different from the relational:
- You can never have relationships between subject areas. That is, all relationships between dimensions must appear only through the meter.
- Within a dimension, tables are defined based on levels and not based on normalization. For example, in the Time dimension there could be separate tables for Year and Month, even though there are normalization violations within each of these tables.
Going back to the dessert example, let's say we need to answer the business question, "How many desserts do we sell by restaurant, month and category?" The initial dimensional model might look like what appears in figure 4. This is just the first step in a physical design. To gain greater understanding and clearer answers, you need to change this model further into one of three designs: star schema, snowflake or starflake.
Star schema
A star schema folds each set of tablesthat make up a dimension into a single table. Notice the word "fold" instead of "denormalize." This is because a dimensional model is never normalized to begin with, but instead is separated into different tables by level.
The meter is in the center of the model, and each of the dimensions relate to the meter at the lowest level of detail. In figure 5, the meter Dessert_Measurement relates to the Month dimension at the Month level. If there are 15 desserts ordered in January 2003, the user or reporting tool can then query as to how many desserts were ordered in all of 2003.
A star schema is relatively easy to create and implement, and visually it appears elegant and simplistic to both IT and business users. For these reasons it is often used even when it is not the optimal design choice. Star schemas are most useful when the dimensions have relatively few rows, the meter contains only metric facts, and there is a specific short-term purpose for the model.
Snowflake
A snowflake exists when the initial dimensional model is implemented without folding up any tables. In most cases, the snowflake design looks exactly like the example in figure 4. Snowflakes are often used when the rate at which data values change varies across hierarchy levels.
Starflake
A starflake, as the name suggests, is a combination of a star schema and a snowflake. Certain dimensions have all of their levels folded into a single table, while other dimensions keep their hierarchy levels in separate tables. Applying guidelines at a subject-area level instead of at a model level can easily lead to a starflake, as some dimensions will experience varying rates of change while others remain relatively static.
Working your plan
Remember that a physical data model represents the data requirements from the logical data model and is specific to a given environment—such as a database or an application. We need to be very careful how we deploy the various techniques discussed, as undoing the work of normalization can exact a huge price. By carefully understanding the factors at work in our environment and defining the guidelines, we can make intelligent data model design decisions. T
Steve Hoberman of Steve Hoberman & Associates, Inc., is an author and a data modeling consultant. Steve specializes in data modeling training, design strategy and development. Visit his Web site at stevehoberman.com.