Recursive SQL can solve a variety of data warehousing problems.
by Paul Boal
A condensed version of this article is printed in Teradata Magazine, Vol. 6, No. 3. If you've already read the print version and are interested in Paul's additional recursive SQL examples, click here.
Most data warehouse engineers know some great structured query language (SQL) tricks. They can write inner joins three different ways, build nested derived tables, play optimizer tricks using WITH tables and maybe even throw in some fancy online analytical processing (OLAP) functions like ROLLUP. Another great trick, recursive SQL, tends to be pigeon-holed as merely a mechanism for resolving product hierarchies, organizational hierarchies or other kinds of hierarchies. It turns out, though, that recursive SQL—much like recursive functions in a traditional programming language—can solve a whole variety of other problems. To see how, let's go back to the CompSci 101 version of recursion.
Recursion revisited
Someone once suggested that the shortest definition for recursion is "see recursion." (I apologize to purists who will rush to chastise the author of this definition. Later, you'll see why the purists are right.) However, that author's point is still worthwhile: the power of recursion comes from being able to break down a problem into pieces that look the same as the original problem and in being able to combine the incremental results together into the final answer. A trivial example of recursion in a procedure programming language might be one that sums a list of numbers by recursively adding the current value to the sum of the remaining values. In the C++-like pseudo-code below, notice how the sum of a list is implemented as some other operation combined with the sum of a (smaller) list.
/* Return the additive sum of the integers in the list */
int sum (list l)
{
/* TERMINAL CONDITION #1 */
/* If there are no items in the list, the sum is 0 */
If (list.length() = 0)
{
return 0;
}
/* TERMINAL CONDITION #2 */
/* If there is only one value in the list,
the sum is just that value */
else if (length(list) = 1)
{
return (list[0]);
}
/* RECURSIVE CALL */
/* The sum of the list can be defined as the first
value plus the sum of the rest of the list */
else
{
return (list[0] + sum(list.next_item()));
}
}
A more descriptive definition for a recursive function is one that calls itself one or more times to calculate partial results, ideally avoids infinite recursion using one or more terminal conditions and combines those partial results together into the final result. Obviously, procedural recursive techniques taught in first-year computer science have widespread and powerful implications: manipulating lists, traversing tree structures and navigating graphs. Anything that involves manipulating complex data structures is potentially well-suited for recursive programming. Despite being a powerful programming tool, procedural recursion is used in surprisingly few applications. More often than not, recursive constructs are replaced with looping syntaxes (i.e., for, for/each, do/while).
Recursion in SQL
If recursion is underused in procedural programming, it's outright neglected in the database world of SQL programming and set arithmetic. In fact, most relational database vendors didn't fully support recursive SQL constructs until the past decade, and many vendors started their support with custom SQL extensions. In fact, the ANSI standard for SQL didn't include recursive constructs until the early 1990s. Despite the fact that most relational database systems now support ANSI recursive SQL, it's still sorely underutilized. Few business intelligence (BI) or reporting tools ever consider generating recursive SQL. Many SQL developers have heard of it, but never actually considered using it in an application. The variety of examples in this article, however, show how recursive SQL can be used to solve an assortment of problems in SQL, rather than having a developer resort to cursor processing in stored procedures or using external programming to achieve recursion.
Recursive SQL syntax
Before we get into the examples, it's valuable to have some familiarity with what a recursive SQL statement looks like and how it executes. One form of recursive SQL uses the WITH temporary table construct to handle recursion. The Teradata syntax looks like this:
WITH RECURSIVE <temp table> <column list> AS
(
<seed statement>
UNION ALL
<recursive statement>
)
<statement>;
The <temp table> and <column list> are standard components of the WITH clause. Two or more SELECT statements appear within the body of the WITH clause: one or more <seed statements> are standard SELECT statements that reference existing database objects; the <recursive statements> are special SELECT statements that reference the <temp table> temporary table. The final <statement> is a standard SELECT statement that references the <temp table> temporary table and possibly other objects in the database.
For the sake of comfort, a simple example might involve an employee hierarchy table where each row had the employee ID and associated manager ID. A standard query could retrieve a list of direct reports for a given employee ID: SELECT * FROM employee WHERE mgr_id = 123. A recursive query, however, could return the list of all employees anywhere under the hierarchy below that same manager. The distinction is shown in figure 1.
An example of that might be:
WITH RECURSIVE emp_hier (emp_id, mgr_id, level) AS
(
SELECT a.emp_id, a.mgr_id, 0
FROM employee a
WHERE a.emp_id = 123
UNION ALL
SELECT b.emp_id, b.mgr_id, c.level+1
FROM employee b,
emp_hier c
WHERE b.mgr_id = c.emp_id
)
SELECT e.emp_title, e.emp_id, e.mgr_id, h.level
FROM employee e,
emp_hier h
WHERE e.emp_id = h.emp_id
AND e.emp_id <> 123;
In this query, the <seed statement> is a simple select that pulls the manager record for which we want all of the descendants. The <recursive statement> selects all records from the employee table where the employee is managed by someone already in the emp_hier temporary table—hence the join on employee.mgr_id and emp_hier.emp_id.
| enlarge
|
|
SELECT * FROM employee
WHERE mgr_id = 123
As a "standard" query, a list of direct reports is retrieved, indicated in the purple area. As a "recursive" query, the list of all employees anywhere under the hierarchy below that same manager is returned.
|
|
| enlarge
|
|
Using our employee hierarchy example, the "recursive" nature of how Teradata executes the effective SQL in each successive call to the <recursive statement>.
|
|
How recursive SQL executes
Teradata executes recursive SQL just as you might expect: recursively. On the first pass through the query, the <seed statement> is executed and the results are written to a temporary table, <table name>. Then the first instance of the <recursive statement> is executed. If the <recursive statement> returns results, they are written to the temporary table. The <recursive statement> is executed again, each time executing against the results from the previous execution. Figure 2 shows the effective SQL being executed in each successive call to the <recursive statement> in our employee hierarchy example.
The <seed statement> returns a single row, the employee record for the CIO. The first call to the <recursive statement> returns all of the employees whose manager ID matches the employee ID in rows returned from the <seed statement>. Effectively:
WHERE mgr_id = 123
This returns the VP of Business Intelligence and the VP of Applications, who both report to the CIO.
The second call to the <recursive statement> uses the previous call's results to continue the recursion: returning employees whose manager ID matches the employee ID in rows returned from the first call to the <recursive statement>. (Read that twice to make sure you're following.) Reduced to the most straightforward condition, this is effectively:
WHERE mbr_id IN (735, 846)
This returns the Sr. Director of Data Warehousing as well as the Director of Server Applications and Director of Web Applications.
The third call to the <recursive statement> uses the results from the second call to continue recursing just as before. The effective condition of this execution becomes:
WHERE mbr_id IN (6574, 6354, 5536)
This returns one row with the Director of Reporting.
The fourth call to the <recursive statement> continues to perform the same recursive evaluation, basing the next query on the previous result set. So, the effective query condition becomes:
WHERE mbr_id = 6574
No one reports to the Director of Reporting, so there are no results to this query.
(Clearly, the recursion will continue through the bottom of the employee hierarchy. For the sake of brevity, this sample hierarchy assumes the company is a bank where nearly everyone is a vice president or above.)
The termination condition of a recursive SQL query is reached when an execution of the <recursive statement> returns no rows. In our case this happens when we've reached all of the employees from the hierarchy.
The results have been merging together into the emp_hier temporary table through execution of the <seed statement> and <recursive statement>. Once the termination condition is reached and the temporary table has been built to completion, the main SQL <statement> is executed. In this case, the main SQL is a join back to the employee table to gather name and title information and return all of the results except for the CIO herself.
The EXPLAIN plan
Teradata's EXPLAIN plan shows the recursive nature of the execution as well. The following table shows steps 3 through 7 of the execution plan indicating how the Teradata Database SQL Optimizer works in a specific example.
| 3) We do a single-AMP RETRIEVE step from EMPLOYEE by way of the unique primary index "EMPLOYEE.emp_id = 123" with no residual conditions into Spool 3 (all_amps), which is built locally on that AMP.
|
<seed statement>
Step three of the explain plan shows the execution of the seed condition into spool 3.
|
| 4) We do an all-AMPs RETRIEVE step from Spool 3 by way of an all-rows scan.
|
<recursive statement>
Steps 4 through 7 show the recursive part of the query execution.
|
| 5) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of an all-rows scan into Spool 4 (all_amps), which is duplicated on all AMPs.
|
| 6) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of an all-rows scan, which is joined to PEMPLOYEE by way of an all-rows scan with no residual conditions. Spool 4 and PEMPLOYEE are joined using a product join, with a join condition of ("PEMPLOYEE.mgr_id = EMP_ID"). The result goes into Spool 5 (all_amps).
|
The current temporary spool table is joined with the employee table using the conditions of the <recursive statement>.
|
| 7) We do an all-AMPs RETRIEVE step from Spool 5 (Last Use) by way of an all-rows scan into Spool 3 (all_amps) ... If one or more rows are inserted into spool 3, then go to step 4.
|
The execution engine loops back around the until no more rows are returned.
|
As with any other type of query, the physical definition of the underlying tables can dramatically impact the execution performance. With recursive queries this is all the more true, though, because tables can be joined and rejoined to themselves dozens or hundreds or thousands of times depending on the depth of recursion needed to resolve the complete query. Using a primary index—especially a unique primary index—in the parent-child join can have dramatic positive impacts on recursive query execution. T
Paul Boal is the St. Louis Application Development Manager for the Data Warehouse and Reporting team at Express Scripts, Inc, a major pharmaceutical benefit management company. He has worked in the data warehousing field as a consultant, architect and manager for the past seven years across various industries.
� Teradata Magazine-September 2006
|