Flex your database muscles with User Defined Functions
Now it's easier to create the right function for the
job at hand.
by Jim Calvert
Teradata database V2R5.1 introduced user defined functions (UDFs), which
offer a huge leap in the user's ability to extend and customize traditional
SQL language commands. A UDF is a user-designed and -written algorithm
that executes as part of the Teradata Database and provides a great deal
of flexibility in its operation.
Since UDFs execute as part of the AMP VPROC (virtual processor), multiple
copies of a given UDF will operate simultaneously, enjoying all the benefits
of Teradata's parallelism. They are written in C and are automatically
compiled into shared objects (SOs) in UNIX or into dynamic link libraries
(DLLs) in Windows. Teradata UDFs are based on the SQL/PSM SQL-99 syntax.
Thanks to their user-friendly properties, UDFs are an excellent way to
customize a database environment. Here's a rundown of what they can do
for you.
UDF capabilities
UDFs can be created to serve many uses and requirements, including:
> Data transformations-Most companies move data from one system or database
to another. This invariably requires some datatype transformations and
value calculations. UDFs are flexible, programmable tools for building
custom transformation algorithms.
> Business rules-UDFs are ideal for standardizing a company's business
rules or proprietary formulae. Users can store a UDF in one place, allowing
public access and giving everyone a centralized, easy, dependable method
of rule calculation.
> Multimedia-What if we could search a database of thousands of digital
images and find just those that are predominantly yellow or search audio
files for matching voice patterns? UDFs, coupled with large object (CLOB
and BLOB) datatypes, can answer those kinds of questions.
> Execute anywhere-UDFs can be executed almost anywhere normal SQL can
execute. They can be part of a SELECT statement, used in UPDATEs and INSERTs
and used in major Teradata utilities including FastLoad,MultiLoad and
FastExport.
UDF types
A UDF can be written as a scalar function that operates upon the values
in a single row or as an aggregate function that operates on a group of
rows, much like a MAX or SUM function. The scalar form is easier to write
but lacks the power of the aggregate form.
A scalar UDF, as the mathematical name implies, relates to a single outcome
from a single input. In this case, the input is a single row. A scalar
UDF will accept any values from a single row, manipulate those values
in one pass of the UDF logic and then produce the calculated result. This
is straightforward and easy to construct.
| Head to the library |
| Teradata Professional Services has created a comprehensive
UDF library, available for download at the Tech Center on
teradata.com. Here's
a sampling of what's available. |
| ORACLE FUNCTIONS |
DESCRIPTION |
| decode(A,B,C,D,E[,F]) |
If A = B then C; if A = D then E; otherwise F, if specified,
or null |
| translate(str, From-Mask, To-Mask) |
Translates str so that each occurrence of a character in
From-Mask is translated to the corresponding character in
To-Mask |
| DB2 FUNCTIONS |
DESCRIPTION |
| REPLACE |
Replaces occurrences of strings |
| CONCAT |
Concatenations of strings |
| STRING PROCESSING FUNCTIONS |
DESCRIPTION |
| ngram(S1, S2, Length, Position ) |
The n-gram matching function returns the number of n-gram
matches between S1 and S2. |
| editdistance(S1, S2, ci,cd,cs,ct) |
The EDIT distance function returns the minimum number of
edit operations; insertions, deletions, substitutions and
transpositions required to transform S1 into S2 |
| BYTE OPERATION FUNCTIONS |
DESCRIPTION |
| shiftRight( b (var)byte, i int ) returns (var)byte |
Shifts bits right by i places. |
| shiftLeft( b (var)byte, i int ) returns (var)byte |
Shifts bits left by i places. |
| CONVERSION FUNCTIONS |
DESCRIPTION |
| isTime( c (var)char ) returns integer |
Checks if a string can be converted to time: 1 if so, 0
if not, null for null input. |
| toChar( b (var)byte ) returns (var)char |
Returns a byte value as Hexstring |
| BYTE CASTING FUNCTIONS |
DESCRIPTION |
| BytetoInt |
Returns an Integer from a BYTE(4) field, error or NULL if
invalid or really NULL |
| BytetoSmallInt |
Returns a Smallint from a BYTE(2) field, error or NULL if
invalid or really NULL |
| SECURITY FUNCTIONS |
DESCRIPTION |
| hash_md5 |
Generates MD5 Message-Digest |
|
The scalar UDF is used for one-to-one data transformations and common
arithmetic calculations. For a given row, it will execute completely within
a single AMP.
An aggregate UDF takes its input from all of the rows in an SQL group
or the entire result set from a query if no group is present. An aggregate
UDF is used for creating complex, customized statistical functions, processing
simulations and calculating household scoring. The aggregate form is commonly
used to write complex statistical functions not available in standard
Teradata Database software.
There are five phases in the operation of an aggregate UDF. These phases
operate on every AMP in parallel to handle each row in the group for that
AMP. Take a look.
Phase 1: AGR_INIT
The UDF begins the aggregation by allocating context memory, initializing
variables and receiving the first data row.
Phase 2: AGR_DETAIL
The UDF enters this phase once for every data row and stores results in
the context area.
Phase 3: AGR_COMBINE
The UDF enters this phase once per AMP per group.
Phase 4: AGR_FINAL
The UDF enters this phase once per group, combines the values from all
AMPs and returns the final result. Allocated memory is automatically freed.
Phase 5: AGR_NODATA
This final phase is called only if the aggregate set is empty.
UDF management
Within the Teradata SQL framework, UDFs are treated just like any other
Teradata object. SQL verbs can CREATE/REPLACE, DROP, ALTER, SHOW or HELP,
and users are granted permission to use a UDF just like any other Teradata
object.
CREATE/REPLACE constructs the UDF by locating the user's C source code,
compiling it into a UNIX SO or Windows DLL, placing it into the user's
choice of database, and then adding references to DBC.TVM and DBC.UDFINFO
dictionary tables.
A new database, called SYSLIB, is available to act as a "public" container
for UDFs.When Teradata begins searching for an unqualified UDF to execute,
it looks first in the user's default database and, failing that, in SYSLIB.
DROP deletes the function from Teradata's dictionary and unlinks it from
the container library (SO or DLL).
ALTER allows the DBA to change the operating mode of a given UDF from
PROTECTED (for development) to NOT PROTECTED (for production). Running
in PROTECTED mode in development helps avoid AMP crashes due to errors
like stack overflow, division by zero and illegal memory access.
SHOW and HELP are common Teradata commands that help users inspect the
UDF's code or parameters.
UDF creation
In order to create a UDF, two scripts must be written: the C source and
a CREATE or REPLACE FUNCTION statement. The CREATE statement has a number
of clauses, a few of which we will address.
First, input parameters are declared with their datatypes. Parameter datatypes
are required for Teradata to properly distinguish between like-named (or
nameoverloaded) UDFs.
Second, the output parameter and its datatype are declared. Other important
clauses include the declaration of NULL handling and of parameter style-either
SQL style, where the UDF code handles NULLs, or TD style, where the Teradata
systems simply returns NULL with any NULL parameter. A further clause
tells Teradata whether this is an aggregate function; otherwise the function
is assumed to be scalar.
Finally, a declaration called EXTERNAL NAME is made, indicating where
the source code can be located for compilation. It can either be found
on the user's client workstation/server or on the Teradata server. If
the source code is on the client's workstation/server, it must be created
by a CLI-based utility such as BTEQ 7.1 or better.
A sample scalar function CREATE statement follows:
CREATE FUNCTION Your DB.VARCHARTOVARBYTE(
InVarbyte VARCHAR(1024))
RETURNS VARBYTE(1024)
SPECIFIC VarcharToVarbyte
LANGUAGE C
NO SQL
PARAMETER STYLE TD_GENERAL
NOT DETERMINISTIC
RETURNS NULL ON NULL INPUT
EXTERNAL NAME
'SS!VarcharToVarbyte!/Yourhome/YourDirectory/ VarcharToVarbyte.c!F!VarcharToVarbyte'
;
Note that LANGUAGE C and NO SQL are default, mandatory clauses. Calls
to the operating system and calls for external I/O of any kind are not
permitted. For instance, because the UDF is operating as part of the AMP,
it would wreak havoc on the AMP if the UDF attempted to print to an offline
printer or to call a disconnected modem.
UDF development
UDF development can be more complex than a comparable C routine. Here
are a few things to consider:
> As mentioned, UDFs can execute in one of two modes: NOT PROTECTED and
PROTECTED. UDF development takes place in the PROTECTED mode where the
UDF runs in its own address space apart from but in communication with
the AMP. This permits Teradata to trap certain errors such as stack overflow,
division by zero and illegal memory access. In most cases, it prevents
the UDF from causing the AMP to crash. However, only two UDFs can be tested
in PROTECTED mode at a given time in any system. Therefore, once a UDF
is rigorously tested, it must be designated NOT PROTECTED by using the
ALTER statement.
> UDFs are available that can read, write and interrogate Teradata large
objects. UDFs have the functionality to write rows of data to a non-hashed
global temporary trace table using any kind of data except BLOBs. This
data can then be used in development as a log for debugging purposes.
The user can externally control how this log is turned on or operated
through a new SET SESSION FUNCTION TRACE command. This log can subsequently
be copied to normal (i.e. hashed table) for review. If employed with care,
the TRACE feature can provide a basic path for multi-row external data
output.
> For large or complex UDFs, Teradata recommends that you construct a
workstation-based, C testbench program to assist in debugging and testing.
The testbench program can execute calls to the actual UDF using parameters
arranged to simulate execution within the Teradata system. The developer
can then "walk" through most of the logic, a feature that is unavailable
when in actual Teradata execution. This may not be necessary for simple
routines, but it is definitely worthwhile once the code becomes complex.
As testing moves to the Teradata system, you can ascertain where something
has failed by writing rows to the UDF log with status codes and loop counters.
Another method is to issue error messages at the point of failure just
before exiting. The contents of the standard error message in SQL Parameter
mode can be generated by your C code.
UDF distribution
By permitting object code to be used, third-party developers can create
and safely distribute their UDF products without having to divulge their
source codes. Earlier we mentioned the EXTERNAL NAME clause of the UDF
CREATE statement. This clause is quite important ina number of ways because:
> The clause informs the compiler whether the source code is being compiled,
pre-compiled object modules are being used or a combination of both is
involved. Teradata will complete the compile, and link and place the resultant
UDF in a DLL or SO.
> The clause tells Teradata and the C compiler the external name of the
UDF source code or objects and from what directory or member they should
be drawn. The source code or objects can be placed on the Teradata distribution
node or on the user's server or workstation. If located on the user's
server or workstation, the UDF must be created by using a CLI-based utility
such as BTEQ in order to have the code shipped to Teradata.
UDF tips and traps
In my experience with writing and using UDFs, I noticed a number of unexpected
problems or issues. Here are some tips to help you avoid the traps:
> When developers begin to deploy a UDF-for example, in a query or MultiLoad
script-the system will often return a message stating that "the UDF cannot
be found." The problem is that Teradata cannot determine the exact UDF
if the parameters you use as input to the UDF differ in any way from those
defined in the dictionary. In such cases, always check your parameters
first.
> Always keep a copy of the SHOW FUNCTION at hand for each function you
are testing. Knowing correct parameter types and lengths is vital for
accurate coding and use.
> Generic UDFs that use decimal datatypes require more thought and thorough
testing than those that don't. Decimal datatypes are always stored within
the UDF as two 4-byte integers. Externally, you always must declare your
decimal parameters with the precision and scale that you are expecting.
> Handling NULLS is much easier if you code the UDF with PARAMETER STYLE
TD GENERAL and RETURNS NULL ON NULL INPUT. However, if you really need
to manipulate NULL fields, then PARAMETER STYLE SQL is the requisite method.With
this style, you must code parameter fields to contain the equivalent of
indicator bits for each field.
> When handling external data directly from another platform in BYTE format,
remember that Teradata runs in an Intel environment that considers integers
to be organized with the least significant bits first. Other platforms,
such as those using IBM's PowerPC CPUs, use the opposite order (most significant
bits first). You might never see this issue, but I did and it was a novel
puzzle at the time.
> The Teradata Dictionary has two tables that describe a UDF. DBC.TVM
contains the usual information about Teradata objects, but it also points
at a UDF with its "specific" name. Remember, that's where you look for
name-overloaded UDFs. DBC.UDFINFO has interesting and useful data such
as parameter types, source location, parameter style and mode of execution
(PROTECTED or NOT PROTECTED).
Take control
UDFs enhance Teradata's SQL with new flexibility and greater power to
create customized functions for your database environment. They exploit
the massively parallel architecture of Teradata and add to the overall
system performance.When combined with Teradata's new BLOB datatypes, UDFs
become a very potent weapon in your data warehouse arsenal.
T
Jim Calvert, a Teradata senior data warehouse consultant,
can be reached via e-mail at jim.calvert@teradata-ncr.com.