Sam Sterling
Essential Modeling
Options
|
Despite the sometimes religious wars of dimensional modeling vs. third normal form, the truth is that you need both
|
When recently asked where I stood in the battle between the star schema
and third normal form (3NF), I emphatically answered, "squarely in the middle!"
Some believe it's an either-or debate, but there's actually a place for both.
I've read the arguments, articles, and discussions and implemented a variety of
models, and I don't think that any narrowly focused choice fits all situations.
How to Query Outside a
Star
Suppose we have a dimensional model for Ordered Items, one for Shipped
Items, and one for Received Items. I can traverse my dimensional
hierarchies to get any level of Item, Store, and Date. Simple, right?
I can find out how many televisions I've got coming in during the next
few days. Even though this is a grouping, I've built my category and
department into my Fact table, drilled down on the hierarchy, subsetted
the rows -- and I've gotten a pretty quick response. There are no joins,
the rows are probably grouped together, and there's minimal aggregation.
(But if this is something I need to know at a summary level, I may want to
build a table for it.) Probably 80 percent or more of queries look like
this.
But what if I'm in charge of the loading dock and my strongest worker
is out for the next two days? I may need to know what shipments are coming
in that are too bulky or too heavy for the rest of the crew to handle.
What's been shipped that hasn't been received (see a NOT IN coming?) and
what's been ordered but not shipped (a UNION, perhaps) that is more than
80 pounds, six cubic feet, six feet long in any one dimension, or seven
feet long in any two dimensions? Whew!
I don't have any of that packaging information in my star. It's on my
Item table though -- and I can join to that. But I have to leave the star
to do it, and I can't use my typical OLAP tools. What's more, I've got
three different stars to consider.
A tool such as Hummingbird's BI/Query lets you look at the model (and
hence the problem) relationally. You can easily build all the packaging
constraints into the Item table, join it to Ships, join to and exclude
Received, and join to and add Orders with a ship date within the next two
days.
You've addressed your immediate problem without having to build and
populate a new schema.
|
Third normal form is a model of entities, relationships, and attributes. It
usually represents our world in a logical, mathematical sort of way, like the
"who, what, when, and where" approach to journalism. Any departure from this
"normal" model is called denormalization.
The dimensional model better represents our world in terms of less abstract
concepts, such as how much, how many, and how often: the "facts" at the
intersection of the dimensions in which we're interested. It is "denormalized"
because many facts can occur at a given intersection and can be placed in a
single "row" of a table.
Fortunately for us (for both IT and the business world), Teradata gives you
choices by letting you physically implement either logical model. I'll discuss
how this works and why the choice is important.
3NF: The Pros
The normalized view of the universe is a mathematical model based on entities
and their relationships. The attributes of each entity are described only once
-- where they are relevant to "the key, the whole key, and nothing but the key,"
in modeling jargon. Normalization allows creative questions that extend beyond
the preconceived dimensions of a star schema. For example, with a normalized
model you can look at how the weather relates to the sales of certain items or
the credit worthiness of a given customer. With a dimensional model or an online
analytical processing (OLAP) tool, you can't just add a new table or data
source. You either have to denormalize it into a fact table (imagine the extra
work that you'd have to go through for that one) or create a new dimension for
it. Either way you have to add rows, which can cause the table to expand many
times beyond its original size. Remember that you will need to adjust the
partitioning and space allocations in many RDBMSs. (See sidebar, "How to Query
Outside a Star.")
But there are other reasons to advocate 3NF. It maximizes our flexibility. In
the warehouse race, the competitive edge goes to the company that can respond
most quickly and effectively to new situations, and 3NF lets you do this by
modeling the data in the data model and leaving the business rules (such as
dimensional relationships) behind.
Any deviation from the relational logical model, which represents the
business mathematically, creates a limitation that you may have to live with for
a long time. Sometimes these trade-offs make good business sense, but often we
come to regret them in short order. They make us less efficient and responsive
because we have to modify the basic model that we've implemented, instead of
just looking at data in another (possibly new) table. Sticking to the rigor of a
3NF model lets us see things in their most discrete (atomic or detailed)
aspects. It's always easier to build something when you have the basic building
blocks instead of trying to cut away useless parts and squeeze in the new
things.
3NF: The Cons
Third normal form places heavy demands on a database. Because it's
normalized, the model requires more joins than a basic star schema or
denormalized model. Remember, joins take computing resources that you may not
have available.
Third normal form is also more difficult to comprehend. Many of the 3NF
warehouses that I've implemented have more than 50 tables, as opposed to 10 in a
typical star schema. When you set up the tables, you have more of them (50
tables generally take five times as many commands as 10 tables, unless the 10
are heavily denormalized), and that means more work, if everything else (such as
the database itself) is equal. Also, you must tell query tools what to do with
the tables regardless of the model, and the more tables there are, the more work
it is to set up the query tool.
Dimensional Models: The Pros
The human brain thinks in dimensions (normally two, three, or four of them)
because that's the way we relate to everything in the world -- as real objects
in space, spread out over time. That's also the way we view data; many business
questions lend themselves naturally to a dimensional look and feel. A sales
transaction occurs at a certain place, at a certain time, and involves both an
item and a customer. A phone call involves a caller, a called party, a start
time, an end time, and the equipment and circuitry to complete the call.
The dimensions (geography, time, thing, and person) represent the entities
(where, when, what, and who) involved. In short, dimensions are how we represent
reality in thought, word, and deed.
Because I used dimensions and entities in the same sentence, you can see that
they are not exclusive of each other. You can use them to complement each other
and make life easier to understand.
Most OLAP products and many other front-end user tools require a star or
snowflake schema to help them generate SQL. A physical implementation of a
dimensional model can make management (the M in RDBMS) a lot easier for the
database engine. Managing a complex join can create an immense problem for some
databases. A large, denormalized fact table helps a DBMS avoid taxing operations
such as joins, scans, aggregates, and sorts by creating composite data and
placing it in the order most likely to be requested. The fact table itself
contains the hierarchy.
Dimensional Models: The Cons
To create a dimensional model, you have to design the combination of the
composite data, and all the DDL, extraction programs, and load scripts have to
work around the model and feed it.
Dimensional modeling takes the workload off of the DBMS and places it
squarely on the shoulders of the database designer. Obviously, this isn't ideal;
the database can do the work faster and more accurately than the people who have
the already challenging task of implementing a sophisticated data warehouse.
There is a cost associated with doing the work yourself.
An additional cost of dimensional modeling is the data redundancy inherent in
denormalization (through outer joins). Beyond that, making large denormalized
fact tables accessible often means adding many indexes to avoid scans and
multiple levels of summary tables to avoid both scans and aggregations. The DBMS
must populate these during the update cycle, which sometimes takes longer than
we would like and always consumes resources we would rather spend on queries.
Dimensional models, by definition (and even nomenclature), are restricted to
the dimensions you build. For the majority of queries and business questions,
these dimensions are adequate. But the known questions, the precanned,
thought-out, expected questions tend to have relatively low value by themselves.
They can return useful information, but you can achieve the insight and wisdom
that transforms a company only by thinking "outside the star" (a multipointed
box, if you will).
The Best of Both Worlds
If you can implement a dimensional model using views (and indexes and summary
tables) on top of a 3NF model, you can get the speed and usability of "canned"
queries and many of the tools that insulate users from the underlying tables and
the drudgery (and errors) of SQL. You can also get the flexibility to extend the
model in almost any direction to include purchased or other new data.
With a 3NF model, the physical model is the logical model; the transformation
is very simple and straightforward. You can create views that let tools display
a dimensional model to the user, making it understandable and easy to use.
Finally, you can provide direct access to the normalized model to relational
tools such as Hummingbird Communication Ltd.'s BI/
Query (formerly Andyne's GQL), or sophisticated users and IT professionals
can write their own SQL.
There is every reason to take advantage of other modeling approaches when
they offer some value. For example, pseudo-keys (an idea borrowed from
dimensional modeling) can, when used correctly, simplify user access, improve
query performance, and reduce overall system workload demands.
Teradata lends itself well to this hybrid of a 3NF physical model and
dimensional views because it handles joins so well, and its optimizer is mature
and robust. Views in Teradata are not materialized until execution time, making
it possible for the query join plan to reflect the needs of each individual
query, minimizing the amount of data that needs to be redistributed or
duplicated. Hashing on a primary index simplifies partitioning, and space
allocation happens only at the database level, as opposed to table by partition
and index by partition. Teradata's optimizer handles index intersection, giving
most of the performance advantage of bitmap indexing without all the maintenance
overhead. This means that Teradata needs to create fewer indexes, further saving
space and maintenance time. And with Teradata's sync scan feature, you can use
the time to do scans of larger tables over several (or even hundreds or
thousands) of concurrent queries doing the same work, improving throughput by
several orders of magnitude. These are just a few of the reasons, along with
parallel performance, why Teradata minimizes the dependence on summary tables.
The real advantage of Teradata, however, comes from the virtual elimination
of redundant dimensional models and the likely reduction of excessive hardware
for multiple data marts.
Because of its simple hash data placement and setup, Teradata makes the
implementation of almost any combination of 3NF and star schema approaches
feasible. And you can switch from star schema to 3NF as your business demands
expansion and renormalization.
Small table and large table join optimization. When Teradata first
started building very large databases in 1988, some surprising challenges
appeared. Even with Teradata's shared-nothing approach, large tables took a long
time to operate on and scan. As a result, the engineers (myself included)
developed techniques to make the optimizer go after the small dimensional tables
first, joining them to create the primary index for the large fact table and
reducing response time for many queries. This technique was included in a major
release of the Teradata optimizer in 1990.
As Figure 1 shows, the big table (Sales) has a composite primary index
composed of the foreign keys from the dimension tables of Stores, Items, and
Weeks. In a star schema, all four tables are logically (and sometimes
physically) combined. In the figure, a physical 3NF model, the query in Listing
1 allows for selections from Stores, Items, and Weeks, with the intersection
giving data from the Sales table as well. Say, for example, we wanted total
sales of all televisions in a particular group of states (let's say Colorado and
Minnesota) during the two weeks leading up to the Super Bowl, and we wanted to
see that by size of television and ordered by store. Given a view to emulate a
star schema, the query doesn't take the underlying physical model into account,
and the query is pretty simple. (Most of the model is hidden in the view.)
Query:
SELECT store, SUM(sales$), SUM(salesQty), Substr(itemdesc,
1,3) Named screensize, itemdesc
FROM SalesStar
WHERE weeknbr BETWEEN 9805 AND 9806 AND state IN ('CO',
'MN') AND subdept = 'Television'
ORDER BY Sales$ Desc, store
GROUP BY screensize;
View:
CREATE VIEW SalesStar AS
SELECT (=) FROM
SALES B, STORES S, ITEMS I, WEEKS W
WHERE
B.STORE_NBR = S.STORE_NBR AND
B.ITEM_NBR = I.ITEM_NBR AND
B.WEEK = W.WEEK;
Listing 1. Query against the Sales table.

Figure 1. Star schema Sales table with composite primary
index.
The Teradata optimizer knows that the Weeks table has the fewest rows, so it
selects only those few weeks we're interested in. It then copies those two rows
to each virtual AMP (Teradata's parallel units), caching them. Because it knows
the Stores table doesn't have many rows and there is an index on state, it
selects the 30 or so rows from those states, joins them to the two rows already
in memory, and copies the result to each vAMP. It doesn't know that we're only
looking for a handful of items from the two million-row item table until it
traverses the subdepartment index and selects the televisions.
Now it's got, say, 2433032 (1,440) rows. It redistributes those rows by hash
(the same hashing as the Sales table) to the correct vAMPs, sorting them as it
does so.
Finally, the optimizer matches hash codes to the Sales tables -- performing a
merge join -- and returns the results in sorted order, aggregating them by
store. Listing 2 shows a paraphrased explanation of the join plan.
1. All-AMPs Retrieve from Weeks with residual condition
<Week selection criteria>. Duplicate Spool on all-AMPs.
2. All-AMPs join from with residual condition <Store
selection criteria>. Stores is joined to Spool with
<dummy condition>. Duplicate Spool on all-AMPs.
3. All-AMPs join from Items with residual condition <Item
selection criteria>. Items is joined to Spool with
<dummy condition>. Redistribute to all-AMPs and SORT
Spool.
4. All-AMPs join from Sales and Spool using MERGE JOIN
(row hash match scan).
Listing 2. Paraphrased plan for large table-to-small table
join.
User views. When Teradata's developers came up with techniques to
influence the join plans, they formalized and implemented them using views. Even
though the optimizer is smarter than it was eight years ago, views are still
useful for giving tools and humans a representation unconcerned with all the
underlying tables.
You don't want to force users who think in terms of fixed dimensions to
reconstruct those dimensions for every new request. User views simplify the
users' lives while forcing consistency in the business rules applied.
The new join index. The join index available in Teradata V2R2.1 can
combine detail data (Fact) tables and related tables in one index structure
without having to join the tables at execution time. You could consider the join
index a combination of Fact table and some number of dimension tables if you
carried enough meaningful data in the index. Teradata automatically propagates
changes in the base tables to the join index, so the additional support costs
are minimal. The optimizer automatically detects and uses the join index,
including subsets and supersets.
Summary tables. You can't satisfy all performance needs with the
normalized model. For some applications it makes sense to create separate
summary structures within the data warehouse environment. To make supporting the
environment easier, start by summarizing the tables on the physical machine
where the data warehouse resides. Sometimes this summarization still won't
provide the consistency of performance required for, say, a call center
application. Such extreme cases may warrant hosting the application on a
physically separate platform.
Teradata allows and encourages the use of summary tables when they make sense
-- if they provide a significant response improvement for a large number of
queries with low maintenance overhead. They just aren't required very often with
Teradata because of its optimizer and parallel processing capabilities. So don't
plan on them until you know what queries the users will run and how often
they'll run them. Conduct a cost-benefit study before you spend a lot of effort
building and maintaining summary tables.
What's Really at Stake
Teradata gives you the option to create the model most appropriate for the
task at hand, with minimal concessions to technology limits. Many companies wind
up confusing the logical and physical models because they think in terms of
specific products. Never let a technology run your business. Don't get enamored
with a glitzy product or a Wall Street darling. Don't allow a standard to run
and ruin your company. The "one size fits all" approach -- just as with clothing
-- is a marketing myth. With data warehousing, no size fits all. The decision of
whether or not to create a physical dimensional model should not be made by your
DBMS but by your business needs.
Sam Sterling has built and
consulted on dozens of warehouses, using various databases, over the last 11
years. He formerly led NCR's Competitive Task Force and is an instructor of data
warehousing workshops. He has written several articles and spoken at data
warehousing seminars including DCI and NCR Partners. Sam has recently become an
independent consultant. You can reach him at sam.sterling@broad-reach.com.