Teradata Magazine Cover Teradata Magazine Online  
Register Help Password
Password:
Quick Links
Current Issue
Archives
Teradata.com
Teradata Magazine Rss Feed
ARCHIVES Search Teradata Magazine Online:  

Sam Sterling

Essential Modeling Options

Despite the sometimes religious wars of dimensional modeling vs. third normal form, the truth is that you need both


When recently asked where I stood in the battle between the star schema and third normal form (3NF), I emphatically answered, "squarely in the middle!" Some believe it's an either-or debate, but there's actually a place for both. I've read the arguments, articles, and discussions and implemented a variety of models, and I don't think that any narrowly focused choice fits all situations.


How to Query Outside a Star

Suppose we have a dimensional model for Ordered Items, one for Shipped Items, and one for Received Items. I can traverse my dimensional hierarchies to get any level of Item, Store, and Date. Simple, right?

I can find out how many televisions I've got coming in during the next few days. Even though this is a grouping, I've built my category and department into my Fact table, drilled down on the hierarchy, subsetted the rows -- and I've gotten a pretty quick response. There are no joins, the rows are probably grouped together, and there's minimal aggregation. (But if this is something I need to know at a summary level, I may want to build a table for it.) Probably 80 percent or more of queries look like this.

But what if I'm in charge of the loading dock and my strongest worker is out for the next two days? I may need to know what shipments are coming in that are too bulky or too heavy for the rest of the crew to handle. What's been shipped that hasn't been received (see a NOT IN coming?) and what's been ordered but not shipped (a UNION, perhaps) that is more than 80 pounds, six cubic feet, six feet long in any one dimension, or seven feet long in any two dimensions? Whew!

I don't have any of that packaging information in my star. It's on my Item table though -- and I can join to that. But I have to leave the star to do it, and I can't use my typical OLAP tools. What's more, I've got three different stars to consider.

A tool such as Hummingbird's BI/Query lets you look at the model (and hence the problem) relationally. You can easily build all the packaging constraints into the Item table, join it to Ships, join to and exclude Received, and join to and add Orders with a ship date within the next two days.

You've addressed your immediate problem without having to build and populate a new schema.
 

Third normal form is a model of entities, relationships, and attributes. It usually represents our world in a logical, mathematical sort of way, like the "who, what, when, and where" approach to journalism. Any departure from this "normal" model is called denormalization.

The dimensional model better represents our world in terms of less abstract concepts, such as how much, how many, and how often: the "facts" at the intersection of the dimensions in which we're interested. It is "denormalized" because many facts can occur at a given intersection and can be placed in a single "row" of a table.

Fortunately for us (for both IT and the business world), Teradata gives you choices by letting you physically implement either logical model. I'll discuss how this works and why the choice is important.

3NF: The Pros

The normalized view of the universe is a mathematical model based on entities and their relationships. The attributes of each entity are described only once -- where they are relevant to "the key, the whole key, and nothing but the key," in modeling jargon. Normalization allows creative questions that extend beyond the preconceived dimensions of a star schema. For example, with a normalized model you can look at how the weather relates to the sales of certain items or the credit worthiness of a given customer. With a dimensional model or an online analytical processing (OLAP) tool, you can't just add a new table or data source. You either have to denormalize it into a fact table (imagine the extra work that you'd have to go through for that one) or create a new dimension for it. Either way you have to add rows, which can cause the table to expand many times beyond its original size. Remember that you will need to adjust the partitioning and space allocations in many RDBMSs. (See sidebar, "How to Query Outside a Star.")

But there are other reasons to advocate 3NF. It maximizes our flexibility. In the warehouse race, the competitive edge goes to the company that can respond most quickly and effectively to new situations, and 3NF lets you do this by modeling the data in the data model and leaving the business rules (such as dimensional relationships) behind.

Any deviation from the relational logical model, which represents the business mathematically, creates a limitation that you may have to live with for a long time. Sometimes these trade-offs make good business sense, but often we come to regret them in short order. They make us less efficient and responsive because we have to modify the basic model that we've implemented, instead of just looking at data in another (possibly new) table. Sticking to the rigor of a 3NF model lets us see things in their most discrete (atomic or detailed) aspects. It's always easier to build something when you have the basic building blocks instead of trying to cut away useless parts and squeeze in the new things.

3NF: The Cons

Third normal form places heavy demands on a database. Because it's normalized, the model requires more joins than a basic star schema or denormalized model. Remember, joins take computing resources that you may not have available.

Third normal form is also more difficult to comprehend. Many of the 3NF warehouses that I've implemented have more than 50 tables, as opposed to 10 in a typical star schema. When you set up the tables, you have more of them (50 tables generally take five times as many commands as 10 tables, unless the 10 are heavily denormalized), and that means more work, if everything else (such as the database itself) is equal. Also, you must tell query tools what to do with the tables regardless of the model, and the more tables there are, the more work it is to set up the query tool.

Dimensional Models: The Pros

The human brain thinks in dimensions (normally two, three, or four of them) because that's the way we relate to everything in the world -- as real objects in space, spread out over time. That's also the way we view data; many business questions lend themselves naturally to a dimensional look and feel. A sales transaction occurs at a certain place, at a certain time, and involves both an item and a customer. A phone call involves a caller, a called party, a start time, an end time, and the equipment and circuitry to complete the call.

The dimensions (geography, time, thing, and person) represent the entities (where, when, what, and who) involved. In short, dimensions are how we represent reality in thought, word, and deed.

Because I used dimensions and entities in the same sentence, you can see that they are not exclusive of each other. You can use them to complement each other and make life easier to understand.

Most OLAP products and many other front-end user tools require a star or snowflake schema to help them generate SQL. A physical implementation of a dimensional model can make management (the M in RDBMS) a lot easier for the database engine. Managing a complex join can create an immense problem for some databases. A large, denormalized fact table helps a DBMS avoid taxing operations such as joins, scans, aggregates, and sorts by creating composite data and placing it in the order most likely to be requested. The fact table itself contains the hierarchy.

Dimensional Models: The Cons

To create a dimensional model, you have to design the combination of the composite data, and all the DDL, extraction programs, and load scripts have to work around the model and feed it.

Dimensional modeling takes the workload off of the DBMS and places it squarely on the shoulders of the database designer. Obviously, this isn't ideal; the database can do the work faster and more accurately than the people who have the already challenging task of implementing a sophisticated data warehouse. There is a cost associated with doing the work yourself.

An additional cost of dimensional modeling is the data redundancy inherent in denormalization (through outer joins). Beyond that, making large denormalized fact tables accessible often means adding many indexes to avoid scans and multiple levels of summary tables to avoid both scans and aggregations. The DBMS must populate these during the update cycle, which sometimes takes longer than we would like and always consumes resources we would rather spend on queries.

Dimensional models, by definition (and even nomenclature), are restricted to the dimensions you build. For the majority of queries and business questions, these dimensions are adequate. But the known questions, the precanned, thought-out, expected questions tend to have relatively low value by themselves. They can return useful information, but you can achieve the insight and wisdom that transforms a company only by thinking "outside the star" (a multipointed box, if you will).

The Best of Both Worlds

If you can implement a dimensional model using views (and indexes and summary tables) on top of a 3NF model, you can get the speed and usability of "canned" queries and many of the tools that insulate users from the underlying tables and the drudgery (and errors) of SQL. You can also get the flexibility to extend the model in almost any direction to include purchased or other new data.

With a 3NF model, the physical model is the logical model; the transformation is very simple and straightforward. You can create views that let tools display a dimensional model to the user, making it understandable and easy to use. Finally, you can provide direct access to the normalized model to relational tools such as Hummingbird Communication Ltd.'s BI/

Query (formerly Andyne's GQL), or sophisticated users and IT professionals can write their own SQL.

There is every reason to take advantage of other modeling approaches when they offer some value. For example, pseudo-keys (an idea borrowed from dimensional modeling) can, when used correctly, simplify user access, improve query performance, and reduce overall system workload demands.

Teradata lends itself well to this hybrid of a 3NF physical model and dimensional views because it handles joins so well, and its optimizer is mature and robust. Views in Teradata are not materialized until execution time, making it possible for the query join plan to reflect the needs of each individual query, minimizing the amount of data that needs to be redistributed or duplicated. Hashing on a primary index simplifies partitioning, and space allocation happens only at the database level, as opposed to table by partition and index by partition. Teradata's optimizer handles index intersection, giving most of the performance advantage of bitmap indexing without all the maintenance overhead. This means that Teradata needs to create fewer indexes, further saving space and maintenance time. And with Teradata's sync scan feature, you can use the time to do scans of larger tables over several (or even hundreds or thousands) of concurrent queries doing the same work, improving throughput by several orders of magnitude. These are just a few of the reasons, along with parallel performance, why Teradata minimizes the dependence on summary tables.

The real advantage of Teradata, however, comes from the virtual elimination of redundant dimensional models and the likely reduction of excessive hardware for multiple data marts.

Because of its simple hash data placement and setup, Teradata makes the implementation of almost any combination of 3NF and star schema approaches feasible. And you can switch from star schema to 3NF as your business demands expansion and renormalization.

Small table and large table join optimization. When Teradata first started building very large databases in 1988, some surprising challenges appeared. Even with Teradata's shared-nothing approach, large tables took a long time to operate on and scan. As a result, the engineers (myself included) developed techniques to make the optimizer go after the small dimensional tables first, joining them to create the primary index for the large fact table and reducing response time for many queries. This technique was included in a major release of the Teradata optimizer in 1990.

As Figure 1 shows, the big table (Sales) has a composite primary index composed of the foreign keys from the dimension tables of Stores, Items, and Weeks. In a star schema, all four tables are logically (and sometimes physically) combined. In the figure, a physical 3NF model, the query in Listing 1 allows for selections from Stores, Items, and Weeks, with the intersection giving data from the Sales table as well. Say, for example, we wanted total sales of all televisions in a particular group of states (let's say Colorado and Minnesota) during the two weeks leading up to the Super Bowl, and we wanted to see that by size of television and ordered by store. Given a view to emulate a star schema, the query doesn't take the underlying physical model into account, and the query is pretty simple. (Most of the model is hidden in the view.)

Query:
SELECT store, SUM(sales$), SUM(salesQty), Substr(itemdesc,
  1,3) Named screensize, itemdesc 
FROM SalesStar
WHERE weeknbr BETWEEN 9805 AND 9806 AND state IN ('CO',
  'MN') AND subdept = 'Television'
ORDER BY Sales$ Desc, store
GROUP BY screensize;

View:
CREATE VIEW SalesStar AS
SELECT (=) FROM
SALES B, STORES S, ITEMS I, WEEKS W
WHERE
   B.STORE_NBR = S.STORE_NBR AND
   B.ITEM_NBR = I.ITEM_NBR AND
   B.WEEK = W.WEEK;

Listing 1. Query against the Sales table.


 

Figure 1. Star schema Sales table with composite primary index.


 

The Teradata optimizer knows that the Weeks table has the fewest rows, so it selects only those few weeks we're interested in. It then copies those two rows to each virtual AMP (Teradata's parallel units), caching them. Because it knows the Stores table doesn't have many rows and there is an index on state, it selects the 30 or so rows from those states, joins them to the two rows already in memory, and copies the result to each vAMP. It doesn't know that we're only looking for a handful of items from the two million-row item table until it traverses the subdepartment index and selects the televisions.

Now it's got, say, 2433032 (1,440) rows. It redistributes those rows by hash (the same hashing as the Sales table) to the correct vAMPs, sorting them as it does so.

Finally, the optimizer matches hash codes to the Sales tables -- performing a merge join -- and returns the results in sorted order, aggregating them by store. Listing 2 shows a paraphrased explanation of the join plan.

1. All-AMPs Retrieve from Weeks with residual condition 
   <Week selection criteria>. Duplicate Spool on all-AMPs.
2. All-AMPs join from with residual condition <Store 
   selection criteria>. Stores is joined to Spool with 
   <dummy condition>. Duplicate Spool on all-AMPs.
3. All-AMPs join from Items with residual condition <Item 
   selection criteria>. Items is joined to Spool with 
   <dummy condition>. Redistribute to all-AMPs and SORT 
   Spool.
4. All-AMPs join from Sales and Spool using MERGE JOIN 
   (row hash match scan).

Listing 2. Paraphrased plan for large table-to-small table join.


 

User views. When Teradata's developers came up with techniques to influence the join plans, they formalized and implemented them using views. Even though the optimizer is smarter than it was eight years ago, views are still useful for giving tools and humans a representation unconcerned with all the underlying tables.

You don't want to force users who think in terms of fixed dimensions to reconstruct those dimensions for every new request. User views simplify the users' lives while forcing consistency in the business rules applied.

The new join index. The join index available in Teradata V2R2.1 can combine detail data (Fact) tables and related tables in one index structure without having to join the tables at execution time. You could consider the join index a combination of Fact table and some number of dimension tables if you carried enough meaningful data in the index. Teradata automatically propagates changes in the base tables to the join index, so the additional support costs are minimal. The optimizer automatically detects and uses the join index, including subsets and supersets.

Summary tables. You can't satisfy all performance needs with the normalized model. For some applications it makes sense to create separate summary structures within the data warehouse environment. To make supporting the environment easier, start by summarizing the tables on the physical machine where the data warehouse resides. Sometimes this summarization still won't provide the consistency of performance required for, say, a call center application. Such extreme cases may warrant hosting the application on a physically separate platform.

Teradata allows and encourages the use of summary tables when they make sense -- if they provide a significant response improvement for a large number of queries with low maintenance overhead. They just aren't required very often with Teradata because of its optimizer and parallel processing capabilities. So don't plan on them until you know what queries the users will run and how often they'll run them. Conduct a cost-benefit study before you spend a lot of effort building and maintaining summary tables.

What's Really at Stake

Teradata gives you the option to create the model most appropriate for the task at hand, with minimal concessions to technology limits. Many companies wind up confusing the logical and physical models because they think in terms of specific products. Never let a technology run your business. Don't get enamored with a glitzy product or a Wall Street darling. Don't allow a standard to run and ruin your company. The "one size fits all" approach -- just as with clothing -- is a marketing myth. With data warehousing, no size fits all. The decision of whether or not to create a physical dimensional model should not be made by your DBMS but by your business needs.
 
Sam Sterling has built and consulted on dozens of warehouses, using various databases, over the last 11 years. He formerly led NCR's Competitive Task Force and is an instructor of data warehousing workshops. He has written several articles and spoken at data warehousing seminars including DCI and NCR Partners. Sam has recently become an independent consultant. You can reach him at sam.sterling@broad-reach.com.




Copyright by Teradata Corporation 2001-2007.