New SAS and Teradata integrated solution will move analytical data inside the data warehouse for faster performance.
by David Shamlin and Mike Rote
Businesses today want answers fast, with automated processes and complete accuracy. Organizations that use a data warehouse are getting these
results, but they can soon be getting them faster and with even less effort.
In October 2007, SAS™ and Teradata announced a partnership designed to create breakthrough value by strengthening the integration between
the SAS System and the Teradata enterprise data warehouse (EDW). This strengthened integration will move the analytic processing power of SAS
closer to the data managed by Teradata. The end result will be an environment that looks and behaves like SAS from the analyst's perspective
and like Teradata from the system administrator's perspective.
Figure 1 displays the current and future process architectures. The future environment is based on the SAS function under development. To
perform an analysis today, large volumes of data are often transferred from the Teradata EDW to SAS. In the future environment, key SAS
functions are packaged to run inside the Teradata EDW, significantly reducing data movement and duplication. This new method will minimize
the effort and shorten the time it takes to implement mission-critical applications, will reduce the time needed to answer business questions,
and will simplify overall system administration. The net effect will be organizations with more agile business processes and more resources
freed to focus on tasks with the greatest business value.
| enlarge |
|
Currently, data analytics are done outside of the Teradata enterprise data warehouse (EDW), as shown on the left in the figure. With
the new SAS and Teradata solution (right), SAS functions are contained within the Teradata EDW, reducing data movement
and enabling faster analysis.
|
|
Proven results
The new method was tested to ensure the functionality of the SAS and Teradata solution. A series of proof-of-concept (POC) projects was
executed to test a prototype integration of the two companies' software products. Four key SAS components were selected for the tests because
they embody basic SAS constructs that recur in many SAS applications. The following are descriptions of how these SAS components worked within
the Teradata system:
|
SAS formats. Enabled repackaging proprietary SAS software as user-defined functions (UDFs), which can be executed inside
the Teradata system
|
|
Analytic model scoring algorithms. Deployed SAS-generated custom models as UDFs inside the Teradata database
|
|
SAS FREQ. Showed that SAS components can directly generate and "push down" SQL to Teradata, significantly reducing data
movement between the two systems
|
|
SAS Stored Processes. Proved SAS programs can be accessed by Teradata as if they are Teradata stored procedures
|
The FREQ procedure
The SAS System comprises hundreds of modules of prepackaged functionality referred to as procedures. One common functionality is the FREQ
procedure (aka PROC FREQ). The following is a simplified example showing how the frequencies and percentages of customer credit scores by
state are calculated:
proc freq data=customer.credit_data;
table state * credit_score;
The DATA= option tells the PROC FREQ to use the data found in customer.credit_data for its calculation. "Customer" refers to the location of
the data (in this case, a Teradata Database) and "Credit_data" is the table in that database. The TABLE statement instructs PROC FREQ to
gather statistics on the credit_score column and group the results by the unique values of the state column.
To produce the report shown in figure 2, the credit_score values must be counted by state. Traditionally, these counts are done by SAS as
illustrated in figure 3 below: SAS executes the PROC step and connects to Teradata via a SAS engine, the PROC directs the engine to fetch
the rows in the input table, and the engine turns this into a basic SQL select statement. This results in a transfer of all the rows in the
Teradata credit_data table to SAS.
Counting column values by group is also a standard SQL operation. In this part of the POC, PROC FREQ was modified to pass a more substantial
SQL query to the SAS engine that could be pushed to Teradata. This was accomplished by enabling PROC FREQ to dynamically generate a SAS SQL
view. Below is the SQL query generated by the PROC FREQ example:
select count(*), state, min(state),
credit_score, min(credit_score),
max(credit_score)
customer.credit_data
group by state,credit_score
The view is given to the access engine, where it is re-textualized into Teradata SQL. Regardless of the number of existing rows in credit_data,
51 rows (one for each state plus Washington, D.C.) passed from Teradata to SAS.
Running SAS scoring code inside Teradata
Many businesses use analytic modeling to provide insight into questions such as: "Is a customer a good candidate for a marketing campaign?" and
"Does a credit card transaction appear fraudulent?" However, as data volumes required to answer these questions continue to grow, in some
cases doubling in size, companies are challenged to create and deploy scoring models in a timely manner. These businesses are turning to
vendors to accelerate the analytic cycle to meet their growing data problem.
| enlarge |
|
This report is an example of what is produced when the frequency procedure, also known as PROC FREQ, computes basic
descriptive statistics and produces n-way tabular reports.
|
|
SAS software has a rich set of data mining tools for developing models and scoring functions. The SAS DATA step, typically used to implement
scoring functions, is inherently table-driven and lends itself neatly to many scoring operations. The input table contains customer transaction
data to be scored; the output table stores the results.
Because the DATA step scoring functions are done in a row-independent manner, they can be executed in Teradata's shared-nothing environment.
To test the performance of the system's scoring functions, the POC used:
|
An analytic model that scores a hypothetical bank customer's loan risk
|
|
A UDF approach to develop a prototype for deploying a scoring function created with SAS data mining tools
|
|
A six-node Teradata 5500 Server
|
Since the DATA step employed a "just in time" compiler with dependencies on other SAS subsystems, challenges arose. Therefore, in place of the
DATA step, a new language processor called the Table Server Programming Language (TSPL) was implemented. The TSPL supports a syntax similar to
that of the DATA step but is built on a more encapsulated framework.
Next, a tool was created to translate the subset of DATA step syntax generated by SAS data mining tools into TSPL statements. These statements
were then deployed and executed as a Teradata UDF.
When the system was tested using DATA step (external to Teradata), about 16,400 rows per second were scored, and execution bottlenecks were
observed on the I/O channel between SAS and Teradata. Also, processor activity on the Teradata nodes was intermittent and appeared randomly
distributed. By comparison, executing TSPL as a UDF inside Teradata scored up to 750,000 rows per second when all six nodes were utilized
(125,000 per node/second). Furthermore, the processors appeared more uniformly active as work was distributed across the systems.
This prototype demonstrated that a complex SAS program can be executed inside Teradata, offering enhanced performance and additional business
value; that is, by moving the scoring function to the data, any security "envelope" implemented using the Teradata system is honored. This
keeps IT governance concerns focused on the Teradata EDW and reduces costs incurred whenever data from the EDW is replicated in analytic data
marts implemented with SAS data sets.
SAS Stored Process integration
To further test the system, another POC project was executed: the binding of SAS Stored Processes to Teradata stored procedures. A SAS Stored
Process is any SAS program packaged for execution with the SAS Stored Process Server, one of the SAS9 platform cornerstones on which SAS
solutions are built. Input and output parameters can be defined on a stored process to allow the end user to control the behavior of an
individual invocation. In short, the SAS Stored Process Server allows a programmer to publish SAS programs that can subsequently be run on
demand.
| enlarge |
|
In a much more efficient and faster method than the traditional technique, SQL pushdown processes the data directly in the
Teradata enterprise data warehouse (EDW).
|
|
SAS solutions use stored processes to implement parts of their functionality. The SAS Stored Process Server can be deployed inside Teradata,
giving organizations the option to implement their Teradata EDW and SAS solutions in a single hardware environment, thereby lowering their IT
infrastructure complexity and cost.
To test this option, a new set of SAS Stored Process Server interfaces was created inside Teradata to allow SAS Stored Processes to be invoked
as Teradata stored procedures. A SAS Stored Process Server was also deployed on a node without database software. This procedure enabled a
Teradata SQL statement to reference an arbitrary SAS program; the full breadth of SAS functionality became available for embedding in a
business process driven by a Teradata query. It also allowed data movement to stay on the Teradata environment.
Integrated technology
The POC tests demonstrate the ability to deploy proprietary and user-defined SAS processes inside a Teradata EDW thus serving as a basis for a
tightly integrated Teradata/SAS architecture. The potential performance gains of such an approach are significant.
At the time of this printing, the POC code is being reviewed and plans are being put in place to deliver this initial functionality in
mid-2008, offering organizations immediate benefits to the SAS In-Database implementations. Also being investigated are ways to leverage the
strategy of dynamical SQL generation with SAS statistical algorithms repackaged as Teradata UDFs in data mining procedures.
The partnership between SAS and Teradata is positioned to deliver truly scalable analytic solutions, improve time-to-value and reduce
technology infrastructure costs. Businesses will benefit from the integrated technology platform with improved performance, enhanced
productivity and a lower total cost of ownership. T
David Shamlin, R&D storage director at SAS, has been with the company for 20 years and, as a member of the joint engineering team, helped
develop the technology integration POC.
Mike Rote, director of the Teradata/SAS Center of Excellence, has been with Teradata for more than 15 years, most recently as director of the
advance analytics team.
Teradata Magazine-March 2008
|