Teradata Magazine Cover Teradata Magazine Online  
Register Help Password
Password:
Quick Links
Current Issue
Archives
Teradata.com
Teradata Magazine Rss Feed
ARCHIVES Search Teradata Magazine Online:  
 




E-mail questions and comments to Todd at:
teradata.query@ncr.com


Teradata's forte is moving large 
volumes of data quickly, easily and efficiently, providing the right information at the right time.


Improvements to TeraBuilder will expand the limits of the suite's load performance.

The ins and outs of the 
Teradata Active Warehouse

by Todd Walter

When people think of a Teradata Data Warehouse, they usually think of queries-queries that can't be done with any other technology, queries that run amazingly fast, query volumes not even imagined when the warehouse was implemented, queries that deliver business value every day.

But what is a query without data? A successful data warehouse thrives on the constant inflow of new data- data against which to ask today's questions, data to discover business trends, data on recent transactions that allow analysis of and insights into customer behavior.

Teradata's tools for moving data into and out of the data warehouse are second to none, ensuring that the right data is available at the right time. Here are the ins and outs on the in and out of Teradata Active Warehouse.

What tools are available for data extract and load?

There are four key tools for managing data movement in Teradata: Fastload, Multiload, TPump and FastExport. Fastload quickly populates empty tables. There are many data warehouse processes that require the capture of new data as quickly as possible. For instance, new tables must be populated, often with data from historical sources or other data stores outside the warehouse. Other processes require new data to be loaded into working tables, where analysis, transformation and data integrity are completed on the way to warehouse tables. And when source systems do not supply incremental data to the warehouse, you have to reload tables from full extracts of the source system. For all of these applications, Fastload is the tool of choice.

Multiload is a batch data-editing tool. Unlike Fastload, Multiload operates on tables that already contain data. It can insert new records or update records that already exist, and it can delete records either individually or in groups.

TPump is the newest addition to the tool set. In today's active warehouse, it's not enough for data to load overnight; it must be acquired continuously to enable immediate business decisions. TPump has the capabilities of Multiload, but it's designed to run continuously, making data available immediately rather than operating in a batch mode.

Finally, in many warehouse environments, it is necessary to extract volumes of data, whether to supply an external tool or process, to populate another system, to distribute to an outside organization or to generate a large report. In these cases, FastExport is ideal because it executes any SQL query of any complexity and quickly extracts data.

What makes these tools "second to none"?

From Teradata's beginning, it was clear that large volumes of data needed to move from outside sources into the data warehouse. Our developers spent a lot of time making this process fast, efficient and easy to use. Developers working on non-Teradata database systems counted on OLTP applications placing the data into the database rather than focusing on the importation of high data volumes from other sources. But our focus led us to develop tools that are clearly different in the data warehouse marketplace.

Speed is critical. When operational systems generate tens or hundreds of millions of records each day, data loading must be quick. Teradata tools take full advantage of the parallelism and scalability built into both the platform and the database by using all available data paths and system resources to acquire and place the data in target tables as quickly as possible.

The process must also be efficient. If it requires all system resources to do one data load, then other processes in the warehouse environment will not operate and users won't achieve their goals. The batch tools operate at a very low level within the database management system to ensure the cost for operating on each row is minimal and the platform and database resources are used efficiently.

A successful data warehouse has many tables, each with its own data source(s); therefore, building and changing load processes as the warehouse evolves must be quick and easy. Parallel data loads to DBMSs from other vendors require you to move data to the database platform and deal directly with table partitioning in a parallel environment.

Teradata's tools eliminate that complexity by loading data directly from the source into the database table with no intermediate files or data transfers. They also automatically perform the partitioning as part of the load process. In Teradata, most data loads are as simple as describing the source record, identifying the target table and giving the start command. Although other DBMSs require you to change the load job when a partitioning change occurs to the table, Teradata eliminates manual changes, making it completely invisible and automatic.

How do the data load tools achieve their speed and efficiency?

To fully leverage the physical and database resources of the parallel Teradata system, the load tools use a "parallel data flow architecture" (fig. 1). This means data moves directly from the source into the database without any intermediate stops. The data load tool executes on the source platform, reading the source data from wherever it resides. The data is then packaged into blocks for shipment to the database.

At the start of the load process, the tool creates as many connections to the database as there are units of parallelism (AMPs) defined in the Teradata configuration. It spreads the connections evenly over all the available physical connections to ensure all the available bandwidth is used. Data packages travel on the connections one after the other. Because the tool only has to package data and keep all the connections busy, it requires relatively few resources on the source platform.

Within the database, the AMPs are the receiving end of the tool's connections. When an AMP receives a block from the tool, it pulls each record individually and transforms it from the source platform data type to the database data type. Then it runs the primary index fields through the hashing algorithm and sends the transformed record to a destination AMP.

When the destination AMP receives a row, it places it in a work area on disk space managed by the tool. When all rows are received, transformed and redistributed, the load process sorts the working table local to each AMP, then applies the changes to the data blocks. Fastload places the sorted rows into the file system. Multiload goes through each affected block, taking action according to the next incoming row and its associated operation. All changes to one data block are made at the same time so each block is only read and written once. Logging the changes isn't necessary.

Added together, these architectural features minimize the cost of loading or updating large volumes of data. Every AMP is working on the load at all times, guaranteeing that the full power of Teradata parallelism is applied to the problem to maximize the load's speed. When the system expands, new AMPs are automatically included in the load process so it scales linearly with the table configuration and size.

What makes the tools so easy to use?

When loading data on other parallel database systems, you have to deal with many issues that are otherwise handled automatically by Teradata tools. Data partitioning is a big issue, especially if the load process has to understand how the table is partitioned and then pre-partition incoming data accordingly. Partitioning is also a maintenance issue, because every time the partitioning changes, load jobs change as well.

In addition, you have to consider data location. If you have to move data to the database platform or correct processing node for its partition, then you must perform an FTP or other type of file transfer, reserve and manage landing areas, and execute and monitor many jobs, often one for each partition.

Teradata eliminates partitioning and data placement issues by moving the data directly from the source to the database and performing the hashing and data distribution automatically. Even if you change the Teradata configuration, you don't have to change the load jobs.

What if a failure occurs during the load?

The tools have built-in recovery and restart mechanisms. If a system failure occurs during a load, then simply restart the same job. The tool and Teradata work together to pick up from the most recent checkpoint, reposition the source data and complete the load from the point of failure without manual steps or cleanup. Users can specify the checkpoint interval, by number of rows or elapsed time, at the desired level of recovery granularity.

How is TPump different?

TPump has many similarities to the batch load utilities, but it works a little differently inside. Instead of connecting directly to the AMP and sending data to a work area, TPump sends packages of SQL statements, which are immediately executed in parallel. While this costs more per record to update, the new data is immediately available to all users of the system. This allows tactical decisions and ODS-type workloads to be executed against up-to-the-minute enterprise data.

Do we have to load all the rows from the source data?

The tools have a restriction mechanism that allows conditional row loading, which checks each row as it is packaged for user-specified conditions. Rows that fail the check are not sent to the database.

What if we do not know whether a row is to be inserted or updated?

Multiload and TPump both implement UPSERT. If the row already exists in the database, it is updated from the contents of the source row. If the row does not exist, the source row becomes an insert into the database. Multiload is very efficient at performing this operation on large volumes of data.

Is there a way to do more complex data processing prior to loading?

The Teradata tool set has always had exits for performing other operations on the source data. If such an exit is used, the exit code is called to supply a record as the source. Any desired processing can be written into that code. More recently, TeraBuilder Operators, which were introduced as part of the TeraBuilder release, opens and simplifies the interface to any processing or piping desired before the load tool handles a source row. This new mechanism allows easy access to various other sources including databases, pipes, application middleware and ETL tools from other vendors.

The source data affects several tables. Do we need a separate job for each table?

Multiload allows up to five, and TPump allows up to 100 tables to be updated from a single pass through the source data. For example, a transaction log could, in one single-pass job, update debit, credit and log-history records.

Is it possible for other applications to run while data is being loaded?

Teradata designs everything to run in a concurrent workload environment-no other tables or databases are affected while loading data. The target table is locked to other updates but can be read using an access lock that reads through the utility. Data access is not the only issue, however. It is also important to be able to manage the tools' resource consumption versus other work running concurrently. Teradata tools honor the Priority Scheduler facility settings that allow system administrators to control the service levels provided to users during the load process.

What limits Teradata's load performance?

Assuming that appropriate bandwidth has been supplied, the only load bottleneck is the speed of the source data supply. Tape drives, disk drives and other databases all have limited bandwidth to supply data. The TeraBuilder suite, now in its first release, is working to resolve those limits by paralleling the client tool execution in addition to the database operations. For instance, several concurrent pipes can feed several Fastload tool instances that are loading concurrently to the same table. Even faster data consumption will be the result of this exciting new tool release.

How do we get data and results back out of Teradata?

Every SQL query submitted to Teradata returns a result, and typical methods for returning those results work well. The standard interfaces, such as ODBC and JDBC, buffer bulk fetch mechanisms beneath the standard row processing functions, drastically lowering the overhead of retrieving rows. The BTEQ tool uses the same technology and can write results to an output target (file, tape, pipe, etc.) without having to write a program.

These methods use a single connection to the database system, which uses the BYNET interconnect technology to assemble the result set from the AMPs that computed it and merge it after the database has sorted the portion of the result in parallel.

What if the result is very large or we need it very quickly?

Teradata provides a tool called FastExport to quickly extract large result sets. The tool's architecture is similar to load utilities, but the arrows go the opposite direction (fig. 1). FastExport uses many database connections to simultaneously pull data from all AMPs and send it to the utility program on the application system, which then writes it to the output target. The tool uses all available network or channel connections to eliminate the bottleneck of physical connections.

FastExport starts with a SQL query. It supports all query functions so that any query of any level of complexity, including all joins, aggregations and sorting of the result, can be performed, and it optimizes and executes each query as if it had been submitted through a standard query interface. Once the result is created, FastExport formats it for the target, packages it into blocks for efficient transport and then organizes the blocks among the AMPs by their sort order. The utility retrieves from all AMPs simultaneously and creates the ordered output result without retrieving more than one block from any one AMP.

The process is fully scalable because it uses all connections evenly while all AMPs work at the same pace to deliver results. The utility is very efficient because it only has to receive the blocks and write them in the correct order. All the hard work of data packaging and sorting is quickly finished inside the database through parallel, scalable algorithms.

Summary

Teradata tools optimize the process of getting data into the Teradata Active Warehouse in every way. Parallelism and scalability optimize performance. Deeply embedded algorithms optimize resource utilization and efficiency. The tool hides all of this wonderful complexity from the user, making it automatic, easy to use, minimizing required changes and simplifying management. In benchmark situations, Teradata is typically done with the entire benchmark before the competitor has loaded the data, in large part due to the strength of these tools. A strong road map will deliver even more capability and performance. The Teradata tools provide a clear differentiator in the data warehouse market in an area that is crucial to the success of data warehouses but which is often overlooked amid the excitement over business queries. T

 

E-MAIL ME
Would you like to know more about the Teradata active warehouse and related applications? Ask the Expert! E-mail questions and comments to Todd at:
todd.walter@ncr.com




Copyright by Teradata Corporation 2001-2007.