Recursive SQL can solve a variety of data warehousing problems.
by Paul Boal
Other recursive SQL examples
For those already familiar with recursive SQL, the discussion so far has probably been a refresher. To the detriment of database programmers, most discussions of recursive SQL tend to stop with this kind of example. The next four examples show some creative ways to employ recursive SQL in solving other types of problems.
Products and packages
Obviously, types of hierarchies other than organizational can be evaluated using recursive SQL: geographic regions, such as states and counties; family relationships; and categories and subcategories. So as not to jump in too quickly, we’ll stick with another classical hierarchy example: products and packages. In this example, suppose that a company can sell individual products or prepackaged combinations of those products. Several packages can even be bundled together into a bigger package. Clearly, there’s a hierarchy of individual products and packaged combinations of those. Consider these ways that a publisher might sell a set of products:
PACKAGE PRODUCT
|
PRODUCTS INCLUDED
|
DESCRIPTION
|
|
|
A
|
hardback |
|
|
B
|
online course |
|
|
C
|
electronic version |
|
|
D
|
CD |
| E
|
A+B
|
hardback + online course |
| F
|
A+C |
hardback + electronic version |
| G
|
B+C
|
electronic version + online course |
| H
|
A+D
|
hardback + CD |
| I
|
C+D
|
electronic version + CD |
| J
|
E+D
|
hardback + online course + CD |
| K
|
J+C
|
hardback + online course + CD + e-version |
Suppose that this packaging information is stored in a simple two-column table that stores the package product ID (PKG_ID) and row for each of the products in that package (PRD_ID). This may, for instance, be the easiest way to assemble the packages on the warehouse floor, so it might make sense to store every package as a combination of at most two products or other packages:
| PKG_ID
|
PRD_ID
|
PRD_TYP
|
| A |
A |
I |
| B |
B |
I |
| C |
C |
I |
| D |
D |
I |
| E |
A |
I |
| E |
B |
I |
| F |
A |
I |
| F |
C |
I |
| G |
B |
I |
| G |
C |
I |
| H |
A |
I |
| H |
D |
I |
| I |
C |
I |
| I |
D |
I |
| J |
E |
P |
| J |
D |
P |
| K |
J |
P |
| K |
C |
I |
In its sales system, the company records each sale by PKG_ID; but in their inventory management system, everything is recorded at the individual PRD_ID. If not for the recursive package definitions (J and K), the resolution of packages to products would be a simple join to this cross-reference table. To resolve the recursive definition, though, it requires more sophisticated programming logic.
Recursive SQL can be used against the sales information to unravel the package hierarchy to determine how many of each individual product has been sold.
WITH RECURSIVE pkg_hier (pkg_id, prd_id) AS
(
SELECT p.pkg_id, p.prd_id
FROM packages p
UNION ALL
SELECT p.pkg_id, p.prd_id
FROM pkg_hier h, packages p
WHERE h.pkg_id <> p.pkg_id
AND h.prd_id = p.pkg_id
)
SELECT p.prd_id, count(*)
FROM sales_txn s, pkg_hier p
WHERE s.pkg_id = p.pkg_id
GROUP BY p.prd_id;
In this situation, the recursive portion of the SQL is executed three times: once for all of the first-level packages (E, F, G, H, I); a second time for the second-level package (J); and a third time for the top-level package of everything together (K). Joined to the sales transaction table, the recursive SQL takes each transaction and breaks it down into the most elementary components that comprise it, allowing the query to return the amount of each individual product sold.
String concatenation
While recursion is often shown as a way to resolve hierarchical relationships, which are typically many-to-one, a special type of hierarchical relationship exists in which each parent-child relationship is distinct: a simple list. Though this is a degenerate example of a hierarchy, the solution that recursive SQL provides for processing lists of arbitrary length can come in very handy.
Suppose a healthcare Web site allows individual family members to create their own login, each with an associated e-mail address. We know that everyone in the family is related based on their COVERAGE_ID.
| COVERAGE_ID
|
MEMBER_ID
|
E-MAIL
|
| 187 |
7764 |
mom@ourhouse.net |
| 187 |
6645 |
dad@ourhouse.net |
| 187 |
9983 |
college_kid@ourhouse.net |
Suppose that you wanted to send an e-mail communication customized to the family—something like a college health awareness drive. You want to send this communication not only to the college-age member, but also to her parents, so that they might encourage her to go. It might be useful to gather all of the e-mail addresses together into a single comma-separated list of values.
To make the internal logic of the recursive SQL more straightforward, first we’ll add a numeric sequence number to each group of members in the database. This can be accomplished in a view using the ROW_NUMBER() function. The resulting records would look this:
| COVERAGE_ID
|
MEMBER_ID
|
E-MAIL
|
SEQ
|
| 187 |
7764 |
mom@ourhouse.net |
1 |
| 187 |
6645 |
dad@ourhouse.net |
2 |
| 187 |
9983 |
college_kid@ourhouse.net |
3 |
The recursive SQL to generate that comma-separated list of e-mail addresses might look like this:
WITH RECURSIVE email_list (coverage_id, email, seq) AS
(
SELECT m.coverage_id, m.email, m.seq
FROM members m
WHERE m.seq = 1
UNION ALL
SELECT e.email ||�, �|| m.email, m.seq
FROM email_list e, members m
WHERE m.seq = e.seq + 1
AND m.coverage_id = e.coverage_id
)
SELECT r.coverage_id, r.email_list
FROM email_list r
WHERE r.seq = (SELECT MAX(h.seq) FROM email_list h
WHERE h.coverage_id = r.coverage_id
GROUP BY h.coverage_id)
;
The output of this SQL is a single row for each COVERAGE_ID containing the comma-separated e-mail list:
�mom@ourhouse.net, dad@ourhouse.net, college_kid@ourhouse.net�
Delimited string normalization
As you might expect, the inverse of the last example can also be accomplished using recursive SQL. Parsing a delimited list of values into separate records (i.e., normalizing them) can be a powerful technique.
This example includes a Web form that allows the user to select one of several options from a list, or “other” to enter freeform text. It seems likely that a user might choose to enter data manually into an “other” box, for instance, if they want to enter more than one response. Parsing that list into separate response records would make it easier to understand trends in the data. The records in this case contain merely a key value and the response text. Here are two separate responses:
| RESPONSE_KEY
|
RESPONSE_TXT
|
| 643 |
eggs, milk, toast |
| 872 |
cereal, juice, bagel, cream cheese |
Parsing the response text using recursive SQL is very similar to the last case, but uses the POSITION and SUBSTR functions to identify each element of the list rather than finding each element in a separate record.
WITH RECURSIVE parse_list
(response_key, delim_pos, item_num, element, remainder) AS
(
SELECT response_key, 0, 0, CAST('' AS VARCHAR(100)), response_txt
FROM response
UNION ALL
SELECT response_key,
CASE WHEN POSITION(',' IN remainder) > 0
THEN POSITION(',' IN remainder)
ELSE CHARACTER_LENGTH(remainder) END dpos,
item_num + 1,
TRIM(BOTH �,� FROM SUBSTR(remainder, 0, dpos+1)),
TRIM(SUBSTR(remainder, dpos+1))
FROM parse_list
WHERE dpos > 0
)
SELECT response_key, element
FROM parse_list p
WHERE item_num > 0
ORDER BY response_key, item_num;
The output from this query includes a record for each response_key and element of the delimited list:
| RESPONSE_KEY
|
ELEMENT
|
| 643 |
eggs |
| 643 |
milk |
| 643 |
toast |
| 872 |
cereal |
| 872 |
juice |
| 872 |
bagel |
| 872 |
cream cheese |
Each time through the recursion, the element at the head of the list is popped off the front and into the <element> field, leaving only <remainder> to go through the next recursion. In the end, the list items are in separate records.
| enlarge
|
|
Sample inputs and output of splitting overlapping time segments that have been defined with a priority, start date time and end date time, and collapse into a single timeline in which higher priority segments override the overlapping lower priority segments.
|
|
Food for thought
These examples may not appear very complicated initially, but then again, neither do recursive functions in a procedural language. Recursion is just when a function calls itself, right? So, recursive SQL is just when the select statement references rows it’s already selected. This powerful concept can be used to solve problems that might have only seemed solvable in an external program or stored procedure. Keeping the execution with the relational engine, though, can greatly improve the performance of those solutions.
Now that it’s clear how powerful recursive SQL can be, consider how it might be used to answer these questions, too:
Splitting overlapping time segments
Consider what it might take to split overlapping time segments that have been defined with a priority, start date time and end date time, and collapse those into a single timeline in which higher priority segments override lower priority segments that it overlaps with. Refer to figure 3 for the inputs and output of challenge. Could that be accomplished by recursively splitting time segments in one priority level with a summary of all of the time segments with a higher priority level?
Database dependencies
Consider the dependencies between database objects. Views are built off tables or other views. If some table changes, which of the views that reference that table are impacted? And are there any views that reference those views? How about those? Surely recursive SQL could resolve that hierarchy of objects.
Database privileges
Consider the database hierarchy represented within the Teradata DBC catalog tables. Many systems are configured with four, five, six or more levels of nested databases. Each of those layers may play a part in what data a given user can access. Add onto that users who might be members of multiple roles, and views that refer to views of other views, and the ability to easily tell who exactly has read access to table X can require several iterations of SQL research. Could that be simplified with a recursive view? T
Paul Boal is the St. Louis Application Development Manager for the Data Warehouse and Reporting team at Express Scripts, Inc, a major pharmaceutical benefit management company. He has worked in the data warehousing field as a consultant, architect and manager for the past seven years across various industries.
� Teradata Magazine-September 2006
|