Register | Log in


Subscribe Now>>
Home News Tech2Tech Features Viewpoints Facts & Fun Teradata.com
Tech Support
Send to Colleague

Curiosity killed the cat ... or did it?

A Teradata Certified Master answers readers' technical queries.

by Carrie Ballinger, Senior Technical Consultant, Teradata Certified Master V2R5

Since the beginning of time, curiosity has proved to be a good thing. In the third century B.C., curiosity induced Archimedes to discover the principle of buoyancy in his bathtub. Curiosity led Chinese papermaker Cai Lun through numerous efforts and materials, including bark, hemp and silk, before he invented paper and the papermaking process in 105 A.D. And, driven by curiosity, French microbiologist Louis Pasteur labored from experiment to experiment until he came up with the rabies vaccine in 1885.

While pursuing the answer to a technical question may not be as important as discovering a life-saving drug, curiosity-driven endeavors always get my respect. Wanting to demonstrate that respect is one reason I enjoy putting together this column of questions and subsequent answers.

So, if you happen to be curious, read on.

BYNET merge of final answer set
Dear Carrie: I'm an old-time Teradata user, and I remember back in the Version 1 days that the Ynet interconnect performed a hardware merge when returning the final rows in an answer set, avoiding the requirements of building up the entire answer set on a "coordinator" node. Now that the BYNET has replaced the Ynet, how is the sort/merge of a final answer set handled?
—Merging on My Mind

Dear Merging: The final sort/merge that occurred in the Ynet happens in a similar way with the BYNET today. (See figure 1, below.) With Version 2 MPP configurations, three levels participate in returning a sorted answer set. This process was a little simpler in Version 1 when each AMP was a node, but in Version 2 we have multiple AMPs per node:
Level 1. Each AMP performs a local sort and spools its part of the answer set in parallel with other AMPs.
Level 2. Each node merges one buffer's worth of data from all of its contributing AMPs.
Level 3. The parsing engine (PE) receives and merges node-level buffers, building one buffer's worth of sorted data to return to the client.

Figure 1: BYNET performs the final sort/merge
enlarge
BYNET merge processing eliminates the need to bring the data to one node for a large final sort.

With small (4KB) sort/merge buffers at the node level, the mini sort/merges are done very quickly in memory. The highest values are sent up through the three tiers first, while the part of the answer set that contains the lower values remains in the spool files at the AMP level until the higher values have been returned to the client.

During this process, minimal physical I/O is required. Rather, only a single write into the AMP's spool table is performed for a given row, and only a single read is needed to retrieve the data in the correct order.

A benefit of this approach is that the final answer set never has to be brought together for a possibly large final sort. All three levels proceed in parallel, and data is transferred using point-to-point messages. This depth of parallel activity can drive linear scalability as the size of the answer set or number of nodes increases. The "big sort" penalty has been eliminated—actually, it never existed.

Index maintenance during mini-batch
Dear Carrie: I've got hundreds of millions of rows of clickstream data I'd like to get into my Teradata Database pretty quickly. The data doesn't have to be available immediately, so I'm considering a mini-batch approach that runs once an hour. Queries accessing the clickstream data will need one, maybe two indexes, but I'm considering join indexes as well. So I can evaluate trade-offs, could you tell me how mini-batch processing maintains non-unique secondary indexes (NUSIs) compared with single-table join indexes?
—Seeking Solutions

Dear Seeking: First, if you need to view the data soon but not immediately, a mini-batch approach is always a more efficient choice than continuous load. Mini-batch also streamlines the maintenance of any secondary structures because it can batch up those changes and get more done with less effort. (See figure 2, below.) A common mini-batch approach is multi-phased:

  1. FastLoad is used to load the new data quickly into a staging table that has no secondary structures.
  2. INSERT/SELECT moves data from the staging table into the indexed base table.
  3. The rows-just-inserted are deleted from the staging table.

Figure 2: Mini-batch 'batches up' index updates
enlarge
Slightly less work is done to maintain a non-unique secondary index (NUSI) compared with a single-table join index.

Data transformations can take place before, during or after the INSERT/SELECT.

With a NUSI on the target table, mini-batch index maintenance is done locally and in parallel across all AMPs during the INSERT/SELECT step. NUSI change rows are collected as the data rows are processed. These changes are sorted and applied to the NUSI subtable after all inserts to the base table have been completed. Below is an explain of an INSERT/SELECT when one NUSI is present.

If the target table has a single-table join index (a structure that may offer your queries more efficient single-AMP access), some step-to-step data passing will be involved. This can be seen in the explain text. A spool file with join index updates is written in the step that performs the base table inserts, and it is then read in a subsequent step. Since, in most cases, the join index primary index is different from the base table primary index, a redistribution of that spool is required to get the join index maintenance rows to the correct AMP. Redistribution is never required with NUSI maintenance.

The overhead you experience will depend on the demographics of the indexed column and other system characteristics. With mini-batch, however, both approaches result in smaller maintenance overhead than would row-at-a-time inserts.

Below are three explains that illustrate:
A target table of an INSERT/SELECT that has no secondary structures
A target table that has one NUSI
A target table that has one single-table join index with a different primary index from the target table

1) No NUSIs or join indexes on the Orders:

EXPLAIN
INSERT INTO Orders
SELECT * FROM Staging_Orders

Explanation
  1) First, we lock a distinct CAB."pseudo table" for write on a
     RowHash to prevent global deadlock for CAB.Orders. 
  2) Next, we lock a distinct CAB."pseudo table" for read on a
     RowHash to prevent global deadlock for CAB.Staging_Orders. 
  3) We lock CAB.Orders for write, and we lock
     CAB.Staging_Orders for read. 
  4) We do an all-AMPs MERGE into CAB.Orders from
     CAB.Staging_Orders. 
  5) We spoil the parser's dictionary cache for the table. 
  6) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
	

2) With one NUSI on the Orders:

create index(o_orderdate) on Orders;

EXPLAIN
INSERT INTO Orders
SELECT * FROM Staging_Orders

Explanation
  1) First, we lock a distinct CAB."pseudo table" for write on a
     RowHash to prevent global deadlock for CAB.Orders. 
  2) Next, we lock a distinct CAB."pseudo table" for read on a
     RowHash to prevent global deadlock for CAB.Staging_Orders. 
  3) We lock CAB.Orders for write, and we lock
     CAB.Staging_Orders for read. 
  4) We do an all-AMPs MERGE into CAB.Orders from
     CAB.Staging_Orders. 
  5) We spoil the parser's dictionary cache for the table. 
  6) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
	

3) With one single-table join index on the Orders:

CREATE JOIN INDEX globalJI AS
SELECT (o_orderdate),(rowID)
FROM Orders
PRIMARY INDEX(o_orderdate)

EXPLAIN
INSERT INTO Orders
SELECT * FROM Staging_Orders

Explanation
  1) First, we lock a distinct CAB."pseudo table" for write on a
     RowHash to prevent global deadlock for CAB.GLOBALJI. 
  2) Next, we lock a distinct CAB."pseudo table" for write on a
     RowHash to prevent global deadlock for CAB.Orders. 
  3) We lock a distinct CAB."pseudo table" for read on a RowHash to
     prevent global deadlock for CAB.Staging_Orders. 
  4) We lock CAB.GLOBALJI for write, we lock CAB.Orders
     for write, and we lock CAB.Staging_Orders for read. 
  5) We do an all-AMPs MERGE into CAB.Orders from
     CAB.Staging_Orders followed by an insert in Spool 2. 
  6) We do an all-AMPs RETRIEVE step from Spool 2 (Last Use) by way of
     an all-rows scan into Spool 3 (all_amps), which is redistributed
     by hash code to all AMPs.  Then we do a SORT to order Spool 3 by
     join index.  The result spool file will not be cached in memory. 
     The size of Spool 3 is estimated with high confidence to be
     74,958,940 rows.  The estimated time for this step is 1 minute and
     39 seconds. 
  7) We do an all-AMPs MERGE into CAB.GLOBALJI from Spool 3 (Last
     Use). 
  8) We spoil the parser's dictionary cache for the table. 
  9) We spoil the parser's dictionary cache for the table. 
 10) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
	

Delay utility jobs with Teradata Active System Management
Dear Carrie: I'm looking for a way to gracefully delay load jobs so no more than a certain number can be active at any point in time. I've considered Utility Throttles, but frankly, the fact that load jobs that would exceed the throttle limit are rejected is a non-starter. Since Teradata Active System Management can classify by load utility type, I wonder whether I could set up a Workload Definition (WD) just for loads and include a WD throttle to delay them. What do you think?
—Tired of Rejection

Dear Tired: You can define a WD that supports utility jobs and, if you add a throttle, such a WD throttle will delay. However, WD throttles were designed to manage SQL requests and are not well-suited for use with utilities. Let's take a closer look at why that is.

If you selected "load utility type" when you defined your WD, the classification to the WD happens based on a special load utility SQL statement being recognized. In the case of FastLoad, this would be the BEGIN LOADING statement. This is the first point in the load job (at the time of classification) where the job could be delayed.

However, if a delay does happen at the BEGIN LOADING statement, all of the sessions logged on before that statement will still be logged on and sitting idle through the delay time. Consequently, several load jobs with statements in the delay queue could cause a shortage of available sessions at the PE level.

In addition, delays could occur in multiple places during the processing of a load job. A delay can happen as a result of the INSERT statement, and also by any checkpoint statement. A moderate load job might have 10 or 20 different places where a delay could occur and multiple delays could be experienced.

Once the number of load jobs has exceeded the throttle limit, there is cause for additional concern. As each active job attempts to execute a new statement, it will be intermittently interrupted by a delay in order for the next-out statement in the delay queue to run. In these instances, you will see an interleaving effect among all active jobs: A short amount of work is done by one job, and it is delayed when a checkpoint command is encountered. When the checkpoint statement makes it to the top of the delay queue, the job will run again, but only until the next checkpoint when another delay takes place. The more active jobs in the queue, the more time each will take to complete—holding locks and AMP worker tasks longer than you may like.

My advice to you is not to use throttles on WDs that support load jobs. One alternative is to use Utility Throttles with the reject option, but include a tenacity parameter in the load script. The job will be periodically retried until it is able to run. In Teradata Database Release 12.0, Utility Throttles has a delay option that you can use to perform the delay before the AMP sessions are logged on. That's really the solution you are looking for.

ResUsage when running at 100%
Dear Carrie: We just brought a new application onto our platform and ever since have been running at 100% CPU utilization for several hours mid-day, almost every day. My ResUsage reporting sometimes looks incomplete during those times, which makes it difficult to get a good picture of what is happening. Am I losing logged information?
—Analyzing in Atlanta

Dear Analyzing: Teradata was designed as a throughput engine and has a number of techniques to make sure work continues even when demand is at extremely high levels.

However, on a very busy system such as yours, logged ResUsage data may appear to be missing at times of peak demand. While this is considered unusual, the internal data buffering can be overrun by new data before the old data is able to be written to the database. The exception will be reported in the event logs and will show up as missing data rows in the ResUsage tables.

This situation has been addressed in Teradata Database V2R6.2 with an enhancement to the behavior of data logging on busy systems. With this change, if the ResUsage rows cannot be written to the database faster than they are being generated, the logging periods will be doubled, when possible, and the data will be effectively "summarized" rather than lost. When the utilization levels of the platform ease up, the rates will be returned automatically to the previous values. Explanatory information will then be logged to the Event log so you will know exactly what is happening and when.

While it is still possible to lose ResUsage data, it is less likely. Let me suggest several things for you to check to avoid adding unnecessary overhead on your system and to reduce the likelihood of losing desired ResUsage data:
Enable only the ResUsage tables that you are going to use. Typically only the ResUsageSpma table is enabled for logging under normal circumstances, as it provides a general summary of the system's performance.
If tables other than Spma are enabled, consider logging them in summary mode to reduce the number of rows that they generate, unless you need the detailed reports.
Consider how often you need the rows logged. The default is once every 10 minutes. If you have switched to a shorter logging period for debugging purposes, remember to return the logging periods to the normal value once the problem is solved.

This answer would not be complete without my pointing you to the Resource Usage Macros and Tables manual at the Information Products Web site at www.Info.Teradata.com. The manual provides a wealth of detail and background on ResUsage and how to interpret it. T

Teradata Magazine-September 2007

Related Links

Reference Library

Get complete access to Teradata articles and white papers specific to your area of interest by selecting a category below. Reference Library
Search our library:


Protegrity

Teradata.com | About Us | Contact Us | Media Kit | Subscribe | Privacy/Legal | RSS
Copyright © 2008 Teradata Corporation. All rights reserved.