Register | Log in


Subscribe Now>>
ARCHIVE: Vol. 6, No. 4
Home News Tech2Tech Features Viewpoints Facts & Fun Teradata.com
Features
Send to Colleague

Normalizing the database

A simplified explanation of normal forms.

by Dale R. Rollins

Normal forms are rules that are applied during the development of relational databases. The process of applying them is called normalization or normalizing the database. The intent of normalization is to minimize data redundancy, thereby improving data consistency and reducing update problems.

Data redundancy is present when the same fact (i.e., data value) is recorded more than once in a database. This increases the cost of storing and updating the database. Each redundant copy of a fact takes extra space on the media (disk/tape/etc.) where the database is stored. Depending on the amount of data redundancy, the database may need additional media units. The cost of updating the database is higher because when facts are added, changed, or removed, all redundant copies must be updated to keep them synchronized. That takes extra processing time. In addition, accessing the database for a query or report may take longer because there are more facts (data) to look through to find those that are needed.

Data consistency is improved by eliminating data redundancy. When queries and reports are produced from the same set of nonredundant facts they present a consistent, single view of the business. Business people can rely on what they see in the queries and reports. They don’t have to waste time tracking down the real facts like they do when queries and reports are produced from redundant, inconsistent data.

Update problems arise when the same fact must be recorded more than once (due to data redundancy) or when existing facts must be reentered even though they have not changed. This increases the amount and complexity of the update software, which raises the development cost and lengthens the development time. This added complexity increases the risk of software errors, which, if not found and corrected, will cause incomplete updating of the database, leading to data inconsistencies in queries and reports.

An ideal is the elimination of all data redundancy, i.e., record every fact only once in the database. This achieves the goal of a single view of the business. However, other factors must also be considered when designing databases—factors that encourage some amount of data redundancy (e.g., query response time). These factors are beyond the scope of this paper, but must be considered when deciding which normal forms to apply when designing a database.

This paper describes five normal forms, which are simply identified with ordinal numbers: First Normal Form, Second Normal Form, etc. Each of these builds on the prior. A table must be in First Normal Form before it can be in Second; it must be in Second before it can be in Third; and so on. Third Normal Form is generally considered the minimum acceptable level of normalization for a database.

A brief review of relational terminology
A relational database is a set of two-dimensional tables. Each table consists of rows and columns (see the example below). Each fact is recorded in a particular row and column. A row contains different facts about the same real-world thing. A column contains the same kind of fact about different real-world things.

Each table has at least one unique key—a set of columns (one or more) whose values uniquely identify each row. One of the unique keys is chosen to be the table’s primary key, and any others are designated as alternate keys. A nonkey column is one that is not part of the table’s primary key, but may be part of an alternate key. In the example below, the primary key columns are to the left of the double vertical line, and the nonkey columns are to the right.

A foreign key is a set of columns in one table that match the primary key columns in another table. It represents a relationship between the two tables. The table containing the foreign key is called the child table; the table containing the matching primary key is called the parent. The foreign key value in each child row is the same as the primary key value in the related parent row. If a foreign key in a child table is also part of the child’s primary key, then the parent for that foreign key is an identifying parent; the identity of the parent row must be known in order to uniquely identify the child row. Otherwise, the parent is nonidentifying. In the example below, Dept Nbr in the Employee table is a foreign key to the Department table where Dept Nbr is the primary key.

Department
Dept Nbr
Dept Name
10
Accounting
12
Sales

Employee
Emp Nbr
Emp Name
Dept Nbr
32954
Jason Jones
10
18433
Susan Smith
12
52853
Diane Downs
14
26448
Mary Moore
10
19628
Ted Thomas
12

First Normal Form (1NF)
1NF prohibits more than one fact from being recorded in any column in any row of a table. (See the Appendix, which discusses a broader version of 1NF.)

In the following example each pastime is a separate fact about an employee. Therefore, the Employee table is not in 1NF because some of its rows contain more than one fact in the Pastimes column.

Employee
Emp Nbr
Emp Name
Pastime
32954
Jason Jones Fishing, Sailing, Hiking
18433
Susan Smith Hiking
52853
Diane Downs Cooking, Fishing

This presents certain update problems. Recording a new pastime for Susan Smith requires reentering her existing pastime, Hiking. Likewise, removing an existing pastime from Jason Jones requires reentering his remaining pastimes. So, both cases require reentering facts that have not changed. They entail extra cost, but do not provide any added value. In addition, if the facts are reentered incorrectly, errors are introduced into the database.

Putting the table into 1NF involves moving the Pastime column into a new table, Employee Pastime, as shown below. This new table has a separate row for each of an employee’s pastimes. The Emp Nbr column in the new table is a foreign key, which links each Employee Pastime row to its related Employee row.

Employee
Emp Nbr
Emp Name
32954
Jason Jones
18433
Susan Smith
52853
Diane Downs

Employee Pastime
Emp Nbr
Pastime
32954
Fishing
32954
Sailing
32954
Hiking
18433
Hiking
52853
Cooking
52853
Fishing

Adding or removing a pastime is now just a matter of inserting or deleting a row for that one fact. The rows for the employee’s other pastimes are not affected, so the extra cost and risk of errors are avoided.

Second Normal Form (2NF)
2NF is concerned with data redundancy—recording the same fact more than once. It requires that facts recorded in a row be about the thing represented by that row rather than some other related thing.

The following example shows a table that is not in 2NF. The columns to the left of the double vertical lines are the primary key. The other columns are nonkeys.

Project Task
Project Nbr
Task Nbr
Project Name
Task Name
100
1
ABC System
Gather Requirements
100
2
ABC System
Design System
100
3
ABC System
Build System
200
1
DEF System
Requirements Definition
200
2
DEF System
System Design
200
3
DEF System
System Construction

Notice that all the rows with the same value for Project Nbr also have the same value for Project Name. Thus, the same fact (e.g., the name of project 100 is “ABC System”) is recorded more than once, which is a case of data redundancy.

The technical way of describing this situation is:
(1) Project Nbr determines Project Name or
(2) Project Name depends on Project Nbr.

Either of these means: for any value of Project Nbr there is only one value for Project Name (e.g., every row with Project Nbr = 100 has Project Name = “ABC System”). 2NF prohibits any nonkey column in a table from depending on a partial primary key, i.e., on less than the entire primary key. Thus, the Project Task table is not in 2NF because Project Name depends on Project Nbr, which is only one of the columns in the primary key. The same rule applies to alternate keys. No column may depend on a partial alternate key.

NOTE: Because 2NF pertains to dependencies on partial primary and alternate keys, it applies only to tables with at least two columns in their keys. A single-column key cannot be partial, so any table with only single-column keys is automatically in 2NF.

The Project Task table presents several update problems. If a new task is added to a project, the project name must be reentered even though it is already recorded in other rows. If a project name changes, all the task rows for the project must be updated. A project’s name cannot be recorded unless the project has at least one task, which may not be the case for a project that is just being established.

The table can be put into 2NF by moving the Project Name column into a new table, Project, as shown below. This new table has one row for each project, so each project name is recorded only once.

Project
Project Nbr
Project Name
100
ABC System
200
DEF System

Project Task
Project Nbr
Task Nbr
Task Name
100
1
Gather Requirements
100
2
Design System
100
3
Build System
200
1
Requirements Definition
200
2
System Design
200
3
System Construction

Now a project name can be recorded before any of the project’s tasks are identified, and it can be changed by updating a single Project row instead of several Project Task rows.

Third Normal Form (3NF)
3NF is similar to 2NF because it too is concerned with data redundancy and requires that facts recorded in a row be about the thing represented by that row rather than some other related thing. It differs from 2NF in that it deals with dependencies on nonkey columns rather than on primary and alternate keys. 3NF prohibits any column in a table from depending on a set of nonkey columns (one or more).

The following example shows a table that is in 2NF, but not in 3NF.

Employee
Emp Nbr
Emp Name
Dept Nbr
Dept Name
32954
Jason Jones
10
Accounting
18433
Susan Smith
12
Sales
52853
Diane Downs
12
Sales
26448
Mary Moore
10
Accounting
19628
Ted Thomas
12
Sales

All the rows with the same value for Dept Nbr also have the same value for Dept Name, so Dept Name depends on Dept Nbr, which is a nonkey column. This is another case where the same fact (e.g., the name of department 10 is “Accounting”) is recorded more than once, and it presents the same types of update problems as the Project Task table in the 2NF example.

When a new employee is hired, the department name must be reentered even though it is already recorded in other rows. If a department name changes, all the rows for employees who are assigned to that department must be updated. The name for a new department cannot be recorded until an employee is assigned to the department.

Putting the table into 3NF involves the same sort of redesign as for 2NF, i.e., moving the Dept Name column into a new table, Department, as shown below. This new table has one row for each department, so each department name is recorded only once.

Department
Dept Nbr
Dept Name
10
Accounting
12
Sales

Employee
Emp Nbr
Emp Name
Dept Nbr
32954
Jason Jones
10
18433
Susan Smith
12
52853
Diane Downs
14
26448
Mary Moore
10
19628
Ted Thomas
12

Now a department name can be recorded before any employees are assigned to the department, and it can be changed by updating a single Department row instead of several Employee rows.

2NF and 3NF can be summarized as:
Every column depends on the key, the whole key, and nothing but the key.

Many-to-many relationships
In a many-to-many relationship any row of one table may be related to many rows of another and any row of the second may be related to many rows of the first. In the following example an employee can fill many roles, and many employees can fill the same role.

Role
Role Nbr
Role Name
1
Analyst
2
Researcher
3
Accountant

Employee
Emp Nbr
Emp Name
Roles
72956
Sally Smith 1, 2
39479
Bob Burns 1, 3

The Roles column in the Employee table is a foreign key to the Role table. It shows which roles each employee can fill. Because it contains more than one fact per row, the Employee table violates 1NF.

So, many-to-many relationships must always be recorded in a third table in order to avoid multi-valued foreign keys, which would violate 1NF. In the following example the Employee Role table is that third table. Each of its rows records the fact that a particular employee can fill a particular role.

Employee
Emp Nbr
Emp Name
72956
Sally Smith
39479
Bob Burns

Role
Role Nbr
Role Name
1
Analyst
2
Researcher
3
Accountant

Employee Role
Emp Nbr
Role Nbr
72956
1
72956
2
39479
1
39479
3


Normalizing the database continued>>

Teradata Magazine-December 2006

More Features

Related Links

Reference Library

Get complete access to Teradata articles and white papers specific to your area of interest by selecting a category below. Reference Library
Search our library:

Teradata.com | About Us | Contact Us | Media Kit | Subscribe | Privacy/Legal | RSS
Copyright © 2008 Teradata Corporation. All rights reserved.