A simplified explanation of normal forms.
by Dale R. Rollins
Normal forms are rules that are applied during the development of relational databases. The process of applying them is called normalization or normalizing the database. The intent of normalization is to minimize data redundancy, thereby improving data consistency and reducing update problems.
Data redundancy is present when the same fact (i.e., data value) is recorded more than once in a database. This increases the cost of storing and updating the database. Each redundant copy of a fact takes extra space on the media (disk/tape/etc.) where the database is stored. Depending on the amount of data redundancy, the database may need additional media units. The cost of updating the database is higher because when facts are added, changed, or removed, all redundant copies must be updated to keep them synchronized. That takes extra processing time. In addition, accessing the database for a query or report may take longer because there are more facts (data) to look through to find those that are needed.
Data consistency is improved by eliminating data redundancy. When queries and reports are produced from the same set of nonredundant facts they present a consistent, single view of the business. Business people can rely on what they see in the queries and reports. They don’t have to waste time tracking down the real facts like they do when queries and reports are produced from redundant, inconsistent data.
Update problems arise when the same fact must be recorded more than once (due to data redundancy) or when existing facts must be reentered even though they have not changed. This increases the amount and complexity of the update software, which raises the development cost and lengthens the development time. This added complexity increases the risk of software errors, which, if not found and corrected, will cause incomplete updating of the database, leading to data inconsistencies in queries and reports.
An ideal is the elimination of all data redundancy, i.e., record every fact only once in the database. This achieves the goal of a single view of the business. However, other factors must also be considered when designing databases—factors that encourage some amount of data redundancy (e.g., query response time). These factors are beyond the scope of this paper, but must be considered when deciding which normal forms to apply when designing a database.
This paper describes five normal forms, which are simply identified with ordinal numbers: First Normal Form, Second Normal Form, etc. Each of these builds on the prior. A table must be in First Normal Form before it can be in Second; it must be in Second before it can be in Third; and so on. Third Normal Form is generally considered the minimum acceptable level of normalization for a database.
A brief review of relational terminology
A relational database is a set of two-dimensional tables. Each table consists of rows and columns (see the example below). Each fact is recorded in a particular row and column. A row contains different facts about the same real-world thing. A column contains the same kind of fact about different real-world things.
Each table has at least one unique key—a set of columns (one or more) whose values uniquely identify each row. One of the unique keys is chosen to be the table’s primary key, and any others are designated as alternate keys. A nonkey column is one that is not part of the table’s primary key, but may be part of an alternate key. In the example below, the primary key columns are to the left of the double vertical line, and the nonkey columns are to the right.
A foreign key is a set of columns in one table that match the primary key columns in another table. It represents a relationship between the two tables. The table containing the foreign key is called the child table; the table containing the matching primary key is called the parent. The foreign key value in each child row is the same as the primary key value in the related parent row. If a foreign key in a child table is also part of the child’s primary key, then the parent for that foreign key is an identifying parent; the identity of the parent row must be known in order to uniquely identify the child row. Otherwise, the parent is nonidentifying. In the example below, Dept Nbr in the Employee table is a foreign key to the Department table where Dept Nbr is the primary key.
Department
Dept Nbr |
Dept Name |
10 |
Accounting |
12 |
Sales |
|
Employee
Emp Nbr |
Emp Name |
Dept Nbr |
32954 |
Jason Jones |
10 |
18433 |
Susan Smith |
12 |
52853 |
Diane Downs |
14 |
26448 |
Mary Moore |
10 |
19628 |
Ted Thomas |
12 |
|
First Normal Form (1NF)
1NF prohibits more than one fact from being recorded in any column in any row of a table. (See the Appendix, which discusses a broader version of 1NF.)
In the following example each pastime is a separate fact about an employee. Therefore, the Employee table is not in 1NF because some of its rows contain more than one fact in the Pastimes column.
Employee
Emp Nbr |
Emp Name |
Pastime |
32954 |
Jason Jones |
Fishing, Sailing, Hiking |
18433 |
Susan Smith |
Hiking |
52853 |
Diane Downs |
Cooking, Fishing |
|
This presents certain update problems. Recording a new pastime for Susan Smith requires reentering her existing pastime, Hiking. Likewise, removing an existing pastime from Jason Jones requires reentering his remaining pastimes. So, both cases require reentering facts that have not changed. They entail extra cost, but do not provide any added value. In addition, if the facts are reentered incorrectly, errors are introduced into the database.
Putting the table into 1NF involves moving the Pastime column into a new table, Employee Pastime, as shown below. This new table has a separate row for each of an employee’s pastimes. The Emp Nbr column in the new table is a foreign key, which links each Employee Pastime row to its related Employee row.
Employee
Emp Nbr |
Emp Name |
32954 |
Jason Jones |
18433 |
Susan Smith |
52853 |
Diane Downs |
|
Employee Pastime
Emp Nbr |
Pastime |
32954 |
Fishing |
32954 |
Sailing |
32954 |
Hiking |
18433 |
Hiking |
52853 |
Cooking |
52853 |
Fishing |
|
Adding or removing a pastime is now just a matter of inserting or deleting a row for that one fact. The rows for the employee’s other pastimes are not affected, so the extra cost and risk of errors are avoided.
Second Normal Form (2NF)
2NF is concerned with data redundancy—recording the same fact more than once. It requires that facts recorded in a row be about the thing represented by that row rather than some other related thing.
The following example shows a table that is not in 2NF. The columns to the left of the double vertical lines are the primary key. The other columns are nonkeys.
Project Task
Project Nbr |
Task Nbr |
Project Name |
Task Name |
100 |
1 |
ABC System |
Gather Requirements |
100 |
2 |
ABC System |
Design System |
100 |
3 |
ABC System |
Build System |
200 |
1 |
DEF System |
Requirements Definition |
200 |
2 |
DEF System |
System Design |
200 |
3 |
DEF System |
System Construction |
|
Notice that all the rows with the same value for Project Nbr also have the same value for Project Name. Thus, the same fact (e.g., the name of project 100 is “ABC System”) is recorded more than once, which is a case of data redundancy.
The technical way of describing this situation is:
| (1) |
Project Nbr determines Project Name or |
| (2) |
Project Name depends on Project Nbr. |
Either of these means: for any value of Project Nbr there is only one value for Project Name (e.g., every row with Project Nbr = 100 has Project Name = “ABC System”). 2NF prohibits any nonkey column in a table from depending on a partial primary key, i.e., on less than the entire primary key. Thus, the Project Task table is not in 2NF because Project Name depends on Project Nbr, which is only one of the columns in the primary key. The same rule applies to alternate keys. No column may depend on a partial alternate key.
NOTE: Because 2NF pertains to dependencies on partial primary and alternate keys, it applies only to tables with at least two columns in their keys. A single-column key cannot be partial, so any table with only single-column keys is automatically in 2NF.
The Project Task table presents several update problems. If a new task is added to a project, the project name must be reentered even though it is already recorded in other rows. If a project name changes, all the task rows for the project must be updated. A project’s name cannot be recorded unless the project has at least one task, which may not be the case for a project that is just being established.
The table can be put into 2NF by moving the Project Name column into a new table, Project, as shown below. This new table has one row for each project, so each project name is recorded only once.
Project
Project Nbr |
Project Name |
100 |
ABC System |
200 |
DEF System |
|
Project Task
Project Nbr |
Task Nbr |
Task Name |
100 |
1 |
Gather Requirements |
100 |
2 |
Design System |
100 |
3 |
Build System |
200 |
1 |
Requirements Definition |
200 |
2 |
System Design |
200 |
3 |
System Construction |
|
Now a project name can be recorded before any of the project’s tasks are identified, and it can be changed by updating a single Project row instead of several Project Task rows.
Third Normal Form (3NF)
3NF is similar to 2NF because it too is concerned with data redundancy and requires that facts recorded in a row be about the thing represented by that row rather than some other related thing. It differs from 2NF in that it deals with dependencies on nonkey columns rather than on primary and alternate keys. 3NF prohibits any column in a table from depending on a set of nonkey columns (one or more).
The following example shows a table that is in 2NF, but not in 3NF.
Employee
Emp Nbr |
Emp Name |
Dept Nbr |
Dept Name |
32954 |
Jason Jones |
10 |
Accounting |
18433 |
Susan Smith |
12 |
Sales |
52853 |
Diane Downs |
12 |
Sales |
26448 |
Mary Moore |
10 |
Accounting |
19628 |
Ted Thomas |
12 |
Sales |
|
All the rows with the same value for Dept Nbr also have the same value for Dept Name, so Dept Name depends on Dept Nbr, which is a nonkey column. This is another case where the same fact (e.g., the name of department 10 is “Accounting”) is recorded more than once, and it presents the same types of update problems as the Project Task table in the 2NF example.
When a new employee is hired, the department name must be reentered even though it is already recorded in other rows. If a department name changes, all the rows for employees who are assigned to that department must be updated. The name for a new department cannot be recorded until an employee is assigned to the department.
Putting the table into 3NF involves the same sort of redesign as for 2NF, i.e., moving the Dept Name column into a new table, Department, as shown below. This new table has one row for each department, so each department name is recorded only once.
Department
Dept Nbr |
Dept Name |
10 |
Accounting |
12 |
Sales |
|
Employee
Emp Nbr |
Emp Name |
Dept Nbr |
32954 |
Jason Jones |
10 |
18433 |
Susan Smith |
12 |
52853 |
Diane Downs |
14 |
26448 |
Mary Moore |
10 |
19628 |
Ted Thomas |
12 |
|
Now a department name can be recorded before any employees are assigned to the department, and it can be changed by updating a single Department row instead of several Employee rows.
2NF and 3NF can be summarized as:
Every column depends on the key, the whole key, and nothing but the key.
Many-to-many relationships
In a many-to-many relationship any row of one table may be related to many rows of another and any row of the second may be related to many rows of the first. In the following example an employee can fill many roles, and many employees can fill the same role.
Role
Role Nbr |
Role Name |
1 |
Analyst |
2 |
Researcher |
3 |
Accountant |
|
Employee
Emp Nbr |
Emp Name |
Roles |
72956 |
Sally Smith |
1, 2 |
39479 |
Bob Burns |
1, 3 |
|
The Roles column in the Employee table is a foreign key to the Role table. It shows which roles each employee can fill. Because it contains more than one fact per row, the Employee table violates 1NF.
So, many-to-many relationships must always be recorded in a third table in order to avoid multi-valued foreign keys, which would violate 1NF. In the following example the Employee Role table is that third table. Each of its rows records the fact that a particular employee can fill a particular role.
Employee
Emp Nbr |
Emp Name |
72956 |
Sally Smith |
39479 |
Bob Burns |
|
Role
Role Nbr |
Role Name |
1 |
Analyst |
2 |
Researcher |
3 |
Accountant |
|
Employee Role
Emp Nbr |
Role Nbr |
72956 |
1 |
72956 |
2 |
39479 |
1 |
39479 |
3 |
|
Normalizing the database continued>>
Teradata Magazine-December 2006
|