A Teradata Certified Master answers readers' technical queries.
by Carrie Ballinger, Senior Technical Consultant, Teradata Certified Master V2R5
Since the beginning of time, curiosity has proved to be a good thing. In the third century B.C., curiosity induced Archimedes to discover the
principle of buoyancy in his bathtub. Curiosity led Chinese papermaker Cai Lun through numerous efforts and materials, including bark, hemp and
silk, before he invented paper and the papermaking process in 105 A.D. And, driven by curiosity, French microbiologist Louis Pasteur labored
from experiment to experiment until he came up with the rabies vaccine in 1885.
While pursuing the answer to a technical question may not be as important as discovering a life-saving drug, curiosity-driven endeavors always
get my respect. Wanting to demonstrate that respect is one reason I enjoy putting together this column of questions and subsequent answers.
So, if you happen to be curious, read on.
BYNET merge of final answer set
Dear Carrie: I'm an old-time Teradata user, and I remember back in the Version 1 days that the Ynet
interconnect performed a hardware merge when returning the final rows in an answer set, avoiding the requirements of building up the entire
answer set on a "coordinator" node. Now that the BYNET has replaced the Ynet, how is the sort/merge of a final answer set handled?
—Merging on My Mind
Dear Merging: The final sort/merge that occurred in the Ynet happens in a similar way with the BYNET today.
(See figure 1, below.) With Version 2 MPP configurations, three levels participate in returning a sorted answer set. This process was a little
simpler in Version 1 when each AMP was a node, but in Version 2 we have multiple AMPs per node:
| Level 1. Each AMP performs a local sort and spools its part of the answer set in parallel with other AMPs. |
| Level 2. Each node merges one buffer's worth of data from all of its contributing AMPs. |
| Level 3. The parsing engine (PE) receives and merges node-level buffers, building one buffer's worth of sorted data to
return to the client. |
| enlarge |
|
BYNET merge processing eliminates the need to bring the data to one node for a large final sort.
|
|
With small (4KB) sort/merge buffers at the node level, the mini sort/merges are done very quickly in memory. The highest values are sent up
through the three tiers first, while the part of the answer set that contains the lower values remains in the spool files at the AMP level
until the higher values have been returned to the client.
During this process, minimal physical I/O is required. Rather, only a single write into the AMP's spool table is performed for a given row,
and only a single read is needed to retrieve the data in the correct order.
A benefit of this approach is that the final answer set never has to be brought together for a possibly large final sort. All three levels
proceed in parallel, and data is transferred using point-to-point messages. This depth of parallel activity can drive linear scalability
as the size of the answer set or number of nodes increases. The "big sort" penalty has been eliminated—actually, it never existed.
Index maintenance during mini-batch
Dear Carrie: I've got hundreds of millions of rows of clickstream data I'd like to get into my Teradata
Database pretty quickly. The data doesn't have to be available immediately, so I'm considering a mini-batch approach that runs once an hour.
Queries accessing the clickstream data will need one, maybe two indexes, but I'm considering join indexes as well. So I can evaluate
trade-offs, could you tell me how mini-batch processing maintains non-unique secondary indexes (NUSIs) compared with single-table join
indexes?
—Seeking Solutions
Dear Seeking: First, if you need to view the data soon but not immediately, a mini-batch approach is
always a more efficient choice than continuous load. Mini-batch also streamlines the maintenance of any secondary structures because it can
batch up those changes and get more done with less effort. (See figure 2, below.) A common mini-batch approach is multi-phased:
- FastLoad is used to load the new data quickly into a staging table that has no secondary structures.
- INSERT/SELECT moves data from the staging table into the indexed base table.
- The rows-just-inserted are deleted from the staging table.
| enlarge |
|
Slightly less work is done to maintain a non-unique secondary index (NUSI) compared with a single-table join index.
|
|
Data transformations can take place before, during or after the INSERT/SELECT.
With a NUSI on the target table, mini-batch index maintenance is done locally and in parallel across all AMPs during the INSERT/SELECT step.
NUSI change rows are collected as the data rows are processed. These changes are sorted and applied to the NUSI subtable after all inserts to
the base table have been completed. Below is an explain of an INSERT/SELECT when one NUSI is present.
If the target table has a single-table join index (a structure that may offer your queries more efficient single-AMP access), some step-to-step
data passing will be involved. This can be seen in the explain text. A spool file with join index updates is written in the step that performs
the base table inserts, and it is then read in a subsequent step. Since, in most cases, the join index primary index is different from the base
table primary index, a redistribution of that spool is required to get the join index maintenance rows to the correct AMP. Redistribution is
never required with NUSI maintenance.
The overhead you experience will depend on the demographics of the indexed column and other system characteristics. With mini-batch, however,
both approaches result in smaller maintenance overhead than would row-at-a-time inserts.
Below are three explains that illustrate:
|
A target table of an INSERT/SELECT that has no secondary structures
|
|
A target table that has one NUSI
|
|
A target table that has one single-table join index with a different primary index from the target table
|
1) No NUSIs or join indexes on the Orders:
EXPLAIN
INSERT INTO Orders
SELECT * FROM Staging_Orders
Explanation
1) First, we lock a distinct CAB."pseudo table" for write on a
RowHash to prevent global deadlock for CAB.Orders.
2) Next, we lock a distinct CAB."pseudo table" for read on a
RowHash to prevent global deadlock for CAB.Staging_Orders.
3) We lock CAB.Orders for write, and we lock
CAB.Staging_Orders for read.
4) We do an all-AMPs MERGE into CAB.Orders from
CAB.Staging_Orders.
5) We spoil the parser's dictionary cache for the table.
6) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
2) With one NUSI on the Orders:
create index(o_orderdate) on Orders;
EXPLAIN
INSERT INTO Orders
SELECT * FROM Staging_Orders
Explanation
1) First, we lock a distinct CAB."pseudo table" for write on a
RowHash to prevent global deadlock for CAB.Orders.
2) Next, we lock a distinct CAB."pseudo table" for read on a
RowHash to prevent global deadlock for CAB.Staging_Orders.
3) We lock CAB.Orders for write, and we lock
CAB.Staging_Orders for read.
4) We do an all-AMPs MERGE into CAB.Orders from
CAB.Staging_Orders.
5) We spoil the parser's dictionary cache for the table.
6) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
3) With one single-table join index on the Orders:
CREATE JOIN INDEX globalJI AS
SELECT (o_orderdate),(rowID)
FROM Orders
PRIMARY INDEX(o_orderdate)
EXPLAIN
INSERT INTO Orders
SELECT * FROM Staging_Orders
Explanation
1) First, we lock a distinct CAB."pseudo table" for write on a
RowHash to prevent global deadlock for CAB.GLOBALJI.
2) Next, we lock a distinct CAB."pseudo table" for write on a
RowHash to prevent global deadlock for CAB.Orders.
3) We lock a distinct CAB."pseudo table" for read on a RowHash to
prevent global deadlock for CAB.Staging_Orders.
4) We lock CAB.GLOBALJI for write, we lock CAB.Orders
for write, and we lock CAB.Staging_Orders for read.
5) We do an all-AMPs MERGE into CAB.Orders from
CAB.Staging_Orders followed by an insert in Spool 2.
6) We do an all-AMPs RETRIEVE step from Spool 2 (Last Use) by way of
an all-rows scan into Spool 3 (all_amps), which is redistributed
by hash code to all AMPs. Then we do a SORT to order Spool 3 by
join index. The result spool file will not be cached in memory.
The size of Spool 3 is estimated with high confidence to be
74,958,940 rows. The estimated time for this step is 1 minute and
39 seconds.
7) We do an all-AMPs MERGE into CAB.GLOBALJI from Spool 3 (Last
Use).
8) We spoil the parser's dictionary cache for the table.
9) We spoil the parser's dictionary cache for the table.
10) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
Delay utility jobs with Teradata Active System Management
Dear Carrie: I'm looking for a way to gracefully delay load jobs so no more than a certain number can be
active at any point in time. I've considered Utility Throttles, but frankly, the fact that load jobs that would exceed the throttle limit
are rejected is a non-starter. Since Teradata Active System Management can classify by load utility type, I wonder whether I could set up a
Workload Definition (WD) just for loads and include a WD throttle to delay them. What do you think?
—Tired of Rejection
Dear Tired: You can define a WD that supports utility jobs and, if you add a throttle, such a WD throttle
will delay. However, WD throttles were designed to manage SQL requests and are not well-suited for use with utilities. Let's take a closer
look at why that is.
If you selected "load utility type" when you defined your WD, the classification to the WD happens based on a special load utility SQL
statement being recognized. In the case of FastLoad, this would be the BEGIN LOADING statement. This is the first point in the load job (at
the time of classification) where the job could be delayed.
However, if a delay does happen at the BEGIN LOADING statement, all of the sessions logged on before that statement will still be logged on
and sitting idle through the delay time. Consequently, several load jobs with statements in the delay queue could cause a shortage of
available sessions at the PE level.
In addition, delays could occur in multiple places during the processing of a load job. A delay can happen as a result of the INSERT statement,
and also by any checkpoint statement. A moderate load job might have 10 or 20 different places where a delay could occur and multiple delays
could be experienced.
Once the number of load jobs has exceeded the throttle limit, there is cause for additional concern. As each active job attempts to execute a
new statement, it will be intermittently interrupted by a delay in order for the next-out statement in the delay queue to run. In these
instances, you will see an interleaving effect among all active jobs: A short amount of work is done by one job, and it is delayed when a
checkpoint command is encountered. When the checkpoint statement makes it to the top of the delay queue, the job will run again, but only
until the next checkpoint when another delay takes place. The more active jobs in the queue, the more time each will take to complete—holding
locks and AMP worker tasks longer than you may like.
My advice to you is not to use throttles on WDs that support load jobs. One alternative is to use Utility Throttles with the reject option, but
include a tenacity parameter in the load script. The job will be periodically retried until it is able to run. In Teradata Database Release
12.0, Utility Throttles has a delay option that you can use to perform the delay before the AMP sessions are logged on. That's really the
solution you are looking for.
ResUsage when running at 100%
Dear Carrie: We just brought a new application onto our platform and ever since have been running at 100%
CPU utilization for several hours mid-day, almost every day. My ResUsage reporting sometimes looks incomplete during those times, which makes
it difficult to get a good picture of what is happening. Am I losing logged information?
—Analyzing in Atlanta
Dear Analyzing: Teradata was designed as a throughput engine and has a number of techniques to make sure
work continues even when demand is at extremely high levels.
However, on a very busy system such as yours, logged ResUsage data may appear to be missing at times of peak demand. While this is considered
unusual, the internal data buffering can be overrun by new data before the old data is able to be written to the database. The exception will
be reported in the event logs and will show up as missing data rows in the ResUsage tables.
This situation has been addressed in Teradata Database V2R6.2 with an enhancement to the behavior of data logging on busy systems. With this
change, if the ResUsage rows cannot be written to the database faster than they are being generated, the logging periods will be doubled,
when possible, and the data will be effectively "summarized" rather than lost. When the utilization levels of the platform ease up, the rates
will be returned automatically to the previous values. Explanatory information will then be logged to the Event log so you will know exactly
what is happening and when.
While it is still possible to lose ResUsage data, it is less likely. Let me suggest several things for you to check to avoid adding unnecessary
overhead on your system and to reduce the likelihood of losing desired ResUsage data:
|
Enable only the ResUsage tables that you are going to use. Typically only the ResUsageSpma table is enabled for logging under
normal circumstances, as it provides a general summary of the system's performance.
|
|
If tables other than Spma are enabled, consider logging them in summary mode to reduce the number of rows that they generate,
unless you need the detailed reports.
|
|
Consider how often you need the rows logged. The default is once every 10 minutes. If you have switched to a shorter logging
period for debugging purposes, remember to return the logging periods to the normal value once the problem is solved.
|
This answer would not be complete without my pointing you to the Resource Usage Macros and Tables manual at the Information
Products Web site at www.Info.Teradata.com. The manual provides a wealth of detail
and background on ResUsage and how to interpret it. T
Teradata Magazine-September 2007
|