|

The ins and outs of the
Teradata Active Warehouse
by Todd Walter
When people think of a Teradata Data Warehouse, they
usually think of queries-queries that can't be done with any other
technology, queries that run amazingly fast, query volumes not even
imagined when the warehouse was implemented, queries that deliver
business value every day.
But what is a query without data? A successful data
warehouse thrives on the constant inflow of new data- data against
which to ask today's questions, data to discover business trends, data
on recent transactions that allow analysis of and insights into customer
behavior.
Teradata's tools for moving data into and out of
the data warehouse are second to none, ensuring that the right data is
available at the right time. Here are the ins and outs on the in and out
of Teradata Active Warehouse.
What tools are available for data extract and load?
There are four key tools for managing data movement
in Teradata: Fastload, Multiload, TPump and FastExport. Fastload quickly
populates empty tables. There are many data warehouse processes that
require the capture of new data as quickly as possible. For instance,
new tables must be populated, often with data from historical sources or
other data stores outside the warehouse. Other processes require new
data to be loaded into working tables, where analysis, transformation
and data integrity are completed on the way to warehouse tables. And
when source systems do not supply incremental data to the warehouse, you
have to reload tables from full extracts of the source system. For all
of these applications, Fastload is the tool of choice.
Multiload is a batch data-editing tool. Unlike
Fastload, Multiload operates on tables that already contain data. It can
insert new records or update records that already exist, and it can
delete records either individually or in groups.
TPump is the newest addition to the tool set. In
today's active warehouse, it's not enough for data to load
overnight; it must be acquired continuously to enable immediate business
decisions. TPump has the capabilities of Multiload, but it's designed
to run continuously, making data available immediately rather than
operating in a batch mode.
Finally, in many warehouse environments, it is
necessary to extract volumes of data, whether to supply an external tool
or process, to populate another system, to distribute to an outside
organization or to generate a large report. In these cases, FastExport
is ideal because it executes any SQL query of any complexity and quickly
extracts data.
What makes these tools "second to none"?
From Teradata's beginning, it was clear that large
volumes of data needed to move from outside sources into the data
warehouse. Our developers spent a lot of time making this process fast,
efficient and easy to use. Developers working on non-Teradata database
systems counted on OLTP applications placing the data into the database
rather than focusing on the importation of high data volumes from other
sources. But our focus led us to develop tools that are clearly
different in the data warehouse marketplace.
Speed is critical. When operational systems generate
tens or hundreds of millions of records each day, data loading must be
quick. Teradata tools take full advantage of the parallelism and
scalability built into both the platform and the database by using all
available data paths and system resources to acquire and place the data
in target tables as quickly as possible.
The process must also be efficient. If it requires
all system resources to do one data load, then other processes in the
warehouse environment will not operate and users won't achieve their
goals. The batch tools operate at a very low level within the database
management system to ensure the cost for operating on each row is
minimal and the platform and database resources are used efficiently.
A successful data warehouse has many tables, each
with its own data source(s); therefore, building and changing load
processes as the warehouse evolves must be quick and easy. Parallel data
loads to DBMSs from other vendors require you to move data to the
database platform and deal directly with table partitioning in a
parallel environment.
Teradata's tools eliminate that complexity by
loading data directly from the source into the database table with no
intermediate files or data transfers. They also automatically perform
the partitioning as part of the load process. In Teradata, most data
loads are as simple as describing the source record, identifying the
target table and giving the start command. Although other DBMSs require
you to change the load job when a partitioning change occurs to the
table, Teradata eliminates manual changes, making it completely
invisible and automatic.
How do the data load tools achieve their speed and efficiency?
To fully leverage the physical and database resources
of the parallel Teradata system, the load tools use a "parallel data
flow architecture" (fig. 1). This means data moves directly from the
source into the database without any intermediate stops. The data load
tool executes on the source platform, reading the source data from
wherever it resides. The data is then packaged into blocks for shipment
to the database.
At the start of the load process, the tool creates as
many connections to the database as there are units of parallelism (AMPs)
defined in the Teradata configuration. It spreads the connections evenly
over all the available physical connections to ensure all the available
bandwidth is used. Data packages travel on the connections one after the
other. Because the tool only has to package data and keep all the connections busy, it requires relatively few
resources on the source platform.
Within the database, the AMPs are the receiving end
of the tool's connections. When an AMP receives a block from the tool,
it pulls each record individually and transforms it from the source
platform data type to the database data type. Then it runs the primary index fields through the hashing algorithm and sends
the transformed record to a destination AMP.
When the destination AMP receives a row, it places it
in a work area on disk space managed by the tool. When all rows are received, transformed and
redistributed, the load process sorts the working table local to each
AMP, then applies the changes to the data blocks. Fastload places the
sorted rows into the file system. Multiload goes through each affected
block, taking action according to the next incoming row and its
associated operation. All changes to one data block are made at the same
time so each block is only read and written once. Logging the changes
isn't necessary.
Added together, these architectural features minimize
the cost of loading or updating large volumes of data. Every AMP is
working on the load at all times, guaranteeing that the full power of
Teradata parallelism is applied to the problem to maximize the load's
speed. When the system expands, new AMPs are automatically included in
the load process so it scales linearly with the table configuration and
size.
What makes the tools so easy to use?
When loading data on other parallel database systems,
you have to deal with many issues that are otherwise handled
automatically by Teradata tools. Data partitioning is a big issue,
especially if the load process has to understand how the table is
partitioned and then pre-partition incoming data accordingly.
Partitioning is also a maintenance issue, because every time the
partitioning changes, load jobs change as well.
In addition, you have to consider data location. If
you have to move data to the database platform or correct processing
node for its partition, then you must perform an FTP or other type of
file transfer, reserve and manage landing areas, and execute and monitor
many jobs, often one for each partition.
Teradata eliminates partitioning and data placement
issues by moving the data directly from the source to the database and
performing the hashing and data distribution automatically. Even if you
change the Teradata configuration, you don't have to change the load
jobs.
What if a failure occurs during the load?
The tools have built-in recovery and restart
mechanisms. If a system failure occurs during a load, then simply
restart the same job. The tool and Teradata work together to pick up
from the most recent checkpoint, reposition the source data and complete
the load from the point of failure without manual steps or cleanup.
Users can specify the checkpoint interval, by number of rows or elapsed
time, at the desired level of recovery granularity.
How is TPump different?
TPump has many similarities to the batch load
utilities, but it works a little differently inside. Instead of
connecting directly to the AMP and sending data to a work area, TPump
sends packages of SQL statements, which are immediately executed in parallel. While this costs more per record to update, the new data is
immediately available to all users of the system. This allows tactical
decisions and ODS-type workloads to be executed against up-to-the-minute
enterprise data.
Do we have to load all the rows from the source data?
The tools have a restriction mechanism that allows
conditional row loading, which checks each row as it is packaged for user-specified conditions. Rows
that fail the check are not sent to the database.
What if we do not know whether a row is to be inserted or updated?
Multiload and TPump both implement UPSERT. If the row
already exists in the database, it is updated from the contents of the
source row. If the row does not exist, the source row becomes an insert
into the database. Multiload is very efficient at performing this
operation on large volumes of data.
Is there a way to do more complex data processing prior to loading?
The Teradata tool set has always had exits for
performing other operations on the source data. If such an exit is used,
the exit code is called to supply a record as the source. Any desired
processing can be written into that code. More recently, TeraBuilder
Operators, which were introduced as part of the TeraBuilder release,
opens and simplifies the interface to any processing or piping desired
before the load tool handles a source row. This new mechanism allows
easy access to various other sources including databases, pipes,
application middleware and ETL tools from other vendors.
The source data affects several tables. Do we need a separate job for
each table?
Multiload allows up to five, and TPump allows up to
100 tables to be updated from a single pass through the source data. For
example, a transaction log could, in one single-pass job, update debit,
credit and log-history records.
Is it possible for other applications to run while data is being
loaded?
Teradata designs everything to run in a concurrent
workload environment-no other tables or databases are affected while
loading data. The target table is locked to other updates but can be
read using an access lock that reads through the utility. Data access is
not the only issue, however. It is also important to be able to manage
the tools' resource consumption versus other work running
concurrently. Teradata tools honor the Priority Scheduler facility
settings that allow system administrators to control the service levels
provided to users during the load process.
What limits Teradata's load performance?
Assuming that appropriate bandwidth has been
supplied, the only load bottleneck is the speed of the source data
supply. Tape drives, disk drives and other databases all have limited
bandwidth to supply data. The TeraBuilder suite, now in its first
release, is working to resolve those limits by paralleling the client
tool execution in addition to the database operations. For instance,
several concurrent pipes can feed several Fastload tool instances that
are loading concurrently to the same table. Even faster data consumption
will be the result of this exciting new tool release.
How do we get data and results back out of Teradata?
Every SQL query submitted to Teradata returns a
result, and typical methods for returning those results work well. The
standard interfaces, such as ODBC and JDBC, buffer bulk fetch mechanisms
beneath the standard row processing functions, drastically lowering the
overhead of retrieving rows. The BTEQ tool uses the same technology and
can write results to an output target (file, tape, pipe, etc.) without
having to write a program.
These methods use a single connection to the database
system, which uses the BYNET interconnect technology to assemble the
result set from the AMPs that computed it and merge it after the
database has sorted the portion of the result in parallel.
What if the result is very large or we need it very quickly?
Teradata provides a tool called FastExport to quickly
extract large result sets. The tool's architecture is similar to load
utilities, but the arrows go the opposite direction (fig. 1). FastExport
uses many database connections to simultaneously pull data from all AMPs
and send it to the utility program on the application system, which then
writes it to the output target. The tool uses all available network or
channel connections to eliminate the bottleneck of physical connections.
FastExport starts with a SQL query. It supports all
query functions so that any query of any level of complexity, including
all joins, aggregations and sorting of the result, can be performed, and
it optimizes and executes each query as if it had been submitted through
a standard query interface. Once the result is created, FastExport
formats it for the target, packages it into blocks for efficient
transport and then organizes the blocks among the AMPs by their sort
order. The utility retrieves from all AMPs simultaneously and creates
the ordered output result without retrieving more than one block from
any one AMP.
The process is fully scalable because it uses all
connections evenly while all AMPs work at the same pace to deliver
results. The utility is very efficient because it only has to receive
the blocks and write them in the correct order. All the hard work of
data packaging and sorting is quickly finished inside the database
through parallel, scalable algorithms.
Summary
Teradata tools optimize the process of getting data
into the Teradata Active Warehouse in every way. Parallelism and
scalability optimize performance. Deeply embedded algorithms optimize
resource utilization and efficiency. The tool hides all of this
wonderful complexity from the user, making it automatic, easy to use,
minimizing required changes and simplifying management. In benchmark
situations, Teradata is typically done with the entire benchmark before
the competitor has loaded the data, in large part due to the strength of
these tools. A strong road map will deliver even more capability and
performance. The Teradata tools provide a clear differentiator in the
data warehouse market in an area that is crucial to the success of data
warehouses but which is often overlooked amid the excitement over
business queries. T
|
E-MAIL ME
Would you like to know more about the
Teradata active warehouse and related applications? Ask the Expert!
E-mail questions and comments to Todd at:
todd.walter@ncr.com |
|