Register | Log in


Subscribe Now>>
Home News Tech2Tech Features Viewpoints Facts & Fun Teradata.com
Tech Support
Download PDF|Send to Colleague

Questions lead to knowledge

A Teradata Certified Master answers readers' technical queries.

by Carrie Ballinger, senior database analyst, Special Projects, Teradata Certified Master V2R5

Asking questions as a method of self-education has a long history. The first well-known thinker who systematically introduced questioning into Western civilization was the Greek philosopher Socrates. Although Socrates started out poor and illiterate, he established himself by traversing his native Athens, asking questions and absorbing knowledge. Part of Socrates' contribution was that he recognized the importance of constantly questioning and evaluating ourselves and the world around us.

When MultiLoad is in the acquisition phase, the following actions occur:

1. All sender AWTs are initially set to inactive status.
2. The client sends multiple rows in a message to a sender AWT.
3. The sender AWT that receives rows from the client performs the following steps:
 
> Sets its state to active
> Unpacks the rows in the message and converts them to internal format
> Sends converted rows to the correct AMP's receiver AWT, based on hash codes of the rows
> Sets its state to inactive
> Sends a response to the client and waits for the next message
4. Meanwhile, receiver AWTs process received rows independently.

This column is a compilation of questions I have received and the answers I have provided. And while these are small technical questions, as opposed to the grander questions Socrates may have asked, they carry on the age-old tradition of self-education through asking.

Inactive sessions with MultiLoad jobs
Dear Carrie: We use the default number of sessions (one per AMP) for MultiLoad jobs, which for us is more than 70 sessions per load. However, many of these sessions show up as inactive through Query Session at any point in time. Are we holding AMP worker tasks (AWTs) for all of these inactive sessions?
—Confused in Auckland

Dear Confused: Those inactive MultiLoad sessions are indeed holding AWTs. All AWTs required in the acquisition phase will be held until the end of the data transfer from client to database. "Inactive" means the session/AMP is waiting for data from the client at the moment; "active" means the session/AMP is actively processing data from the client.

A particular session may be changing between active and inactive many times during a load job. A client machine (and/or network) may not be powerful enough to keep all sessions/AMPs busy at the same time. When that happens, only a few sessions will be active simultaneously and most will be inactive. Because the client sends data round-robin to all sessions, a different set of sessions/AMPs is likely to be active at a given point.

On each AMP, one sender AWT and one receiver AWT are always allocated and held until the end of the data transfer, regardless of the number of sessions specified. If the number of sessions is high, a subset of sender tasks is likely to be inactive at any point in time. Yet all sender tasks are involved in processing—just not all at the same time.

On the other hand, if you specified max sessions = 5, only five sessions could be used for data transfer, and it is likely they would show as "active" all of the time. But even with a lower number of sessions specified, the sender AWTs of the other 65 sessions (assuming 70 AMPs) will still be allocated and held until the end of the data transfer. They would be used only during initialization and for spawning receiver tasks, but not during data transfer.

Estimated processing time
Dear Carrie: What is the Estimated Proc Time in the Database Query Log (DBQL) step table? Is it clock seconds or CPU seconds? And is it the same time as the estimated processing time that appears in steps in the Explain text?
—Classifying in California

Dear Classifying: Estimated Proc Time comes from the same source whether it's within DBQL or the Explain text. It represents the Optimizer's cost estimate for a given query step, or for the entire query. Estimated processing time takes into account CPU, I/O and network (BYNET) estimated costs combined, and self-adjusts for higher-powered hardware. So, technically, it is neither CPU nor clock seconds.

The original purpose of estimated processing time was to allow the Optimizer to compare different costs for different database operations when building a plan. Even today, it really has no absolute meaning except within the Optimizer's algorithms. However, if good statistics have been collected, estimated processing time can give you a reasonable sense of whether the query is likely to be very short, very long or somewhere in between. Because it usually correlates with the resulting resource requirements of the request, estimated processing time is a popular secondary classification criterion when using Teradata Active System Management.

Review the DBQL detail to determine where to set the estimated processing time so it can be used as a classification criterion for a Workload Definition (WD). What detail you will need to look at will depend on which software release is installed on your platform.

The purpose of analyzing the estimated processing time of previously run queries is to correlate the resource-usage levels of these queries with the estimates they produced. This correlation can increase confidence that when an estimated processing time value is specified for classification purposes, queries with a predictable level of resource-usage demand will classify to that WD. Analyzing DBQL data can help to take the guesswork out of the classification process.

Teradata Database V2R6.x
The following is an example of where to look for the estimated processing times of previously executed queries when you are using Teradata Database V2R6.1 or V2R6.2. The assumption is that Teradata Active System Management has not yet been enabled.

First, consider the actual Explain text shown below, then compare the estimates it produces with the DBQLStepTbl estimates that are illustrated later.

EXPLAIN
SELECT *
FROM customer, ordertbl
WHERE c_custkey = o_custkey;

Explanation
 1) First, we lock a distinct TPCD."pseudo table" for read on a
   RowHash to prevent global deadlock for TPCD.ordertbl. 
 2) Next, we lock a distinct TPCD."pseudo table" for read on a RowHash
   to prevent global deadlock for TPCD.customer. 
 3) We lock TPCD.ordertbl for read, and we lock TPCD.customer for read. 
 4) We do an all-AMPs RETRIEVE step from TPCD.ordertbl by way of an
   all-rows scan with no residual conditions into Spool 2 (all_amps),
   which is redistributed by hash code to all AMPs. Then we do a
   SORT to order Spool 2 by row hash. The result spool file will not
   be cached in memory. The size of Spool 2 is estimated with high
   confidence to be 5,974,999 rows. The estimated time for this step
   is 40.29 seconds.
 5) We do an all-AMPs JOIN step from TPCD.customer by way of a RowHash
   match scan with no residual conditions, which is joined to Spool 2
   (Last Use) by way of a RowHash match scan. TPCD.customer and
   Spool 2 are joined using a merge join, with a join condition of (
   "TPCD.customer.C_CUSTKEY = O_CUSTKEY"). The result goes into
   Spool 1 (group_amps), which is built locally on the AMPs. The
   result spool file will not be cached in memory. The size of Spool
   1 is estimated with low confidence to be 5,974,999 rows. The
   estimated time for this step is 18.77 seconds.
 6) Finally, we send out an END TRANSACTION step to all AMPs involved
   in processing the request.
 -> The contents of Spool 1 are sent back to the user as the result of
   statement 1. The total estimated time is 59.06 seconds.

Prior to Teradata Database 12.0, data from the DBQL step table (DBC.DBQLStepTbl) will be required to understand the estimated processing time for a query that has already run. The "EstProcTime" column values for each step will need to be summed to get the total estimated time. In this example, the sum of EstProcTime values for the retrieve and the join step is the same value as in the final step of the Explain above: 59.06 seconds. (See table 1, below.)

Table 1: Subset of columns in DBQL step table
enlarge

Teradata Database 12.0
The following illustrates a Teradata Database 12.0 example. This test was run on a similar but somewhat different hardware platform from the one illustrated in table 1. The plan and estimates slightly differ.

First, consider the actual Explain text shown below, then compare the estimates it produces with the DBQLStepTbl estimates that are illustrated later.

EXPLAIN
SELECT *
FROM customer, ordertbl
WHERE c_custkey = o_custkey;

Explanation
 1) First, we lock a distinct TPCD."pseudo table" for read on a
   RowHash to prevent global deadlock for TPCD.ordertbl. 
 2) Next, we lock a distinct TPCD."pseudo table" for read on a RowHash
   to prevent global deadlock for TPCD.customer. 
 3) We lock TPCD.ordertbl for read, and we lock TPCD.customer
   for read. 
 4) We do an all-AMPs RETRIEVE step from TPCD.customer by way of an
   all-rows scan with no residual conditions into Spool 2 (all_amps)
   fanned out into 18 hash join partitions, which is built locally on
   the AMPs. The size of Spool 2 is estimated with high confidence
   to be 600,000 rows (72,000,000 bytes). The estimated time for
   this step is 0.99 seconds. 
 5) We do an all-AMPs RETRIEVE step from TPCD.ordertbl by way of
   an all-rows scan with no residual conditions into Spool 3
   (all_amps) fanned out into 18 hash join partitions, which is
   redistributed by the hash code of (TPCD.ordertbl.O_CUSTKEY)
   to all AMPs. The input table will not be cached in memory, but it
   is eligible for synchronized scanning. The result spool file will
   not be cached in memory. The size of Spool 3 is estimated with
   low confidence to be 5,971,680 rows (573,281,280 bytes). The
   estimated time for this step is 17.91 seconds. 
 6) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an
   all-rows scan, which is joined to Spool 3 (Last Use) by way of an
   all-rows scan. Spool 2 and Spool 3 are joined using a hash join
   of 18 partitions, with a join condition of ("C_CUSTKEY = O_CUSTKEY"). 
   The result goes into Spool 1 (group_amps), which is built locally
   on the AMPs. The result spool file will not be cached in memory. 
   The size of Spool 1 is estimated with low confidence to be
   5,971,680 rows (1,260,024,480 bytes). The estimated time for this
   step is 14.79 seconds.
 7) Finally, we send out an END TRANSACTION step to all AMPs involved
   in processing the request.
 -> The contents of Spool 1 are sent back to the user as the result of
   statement 1. The total estimated time is 33.69 seconds.

With this newer release, the DBQLogTbl has a new column named "EstProcTime" that provides the same estimated processing time that can be seen in the bottom of the Explain. (See table 2.)

Table 2: Subset of DBQL

While there is no longer any need to access the DBQL step table to get the total estimated processing time, the contents of DBQLStepTbl are shown in table 3 for validation purposes. The values in the "EstProcTime" column match to steps in the Explain shown above.

Table 3: Subset of columns in DBQL step table
enlarge

Redistribution buffers
Dear Carrie: We have two nodes and increased from 10 AMPs per node to 14 AMPs per node on our MP-RAS platform. We set the RedistBufSize setting in the DBS Control record at the maximum of 63, but I am wondering if we shouldn't be at a smaller buffer size to get better query performance. The default value is 4KB. I've read that if you have many AMPs per node, a small redistribution buffer size is generally better.
—Super-sizing in Cincinnati

Dear Super-sizing: To avoid the overhead of sending lots of little messages across the BYNET, buffers are used to batch up individual rows during the row redistribution process. Both load utilities and queries involve such redistributions, but their approach to outbound buffering is different. (See figure 1.)

In the past, the RedistBufSize setting was used by all types of redistributions, but today it only influences the size of row redistribution buffers used by the load utilities. This is a setting in the DBS Control record that determines the number of kilobytes for the units that make up those buffers.

Some recommendations come with this setting, because if this setting is set high and a large number of AMPs are in the system, it is possible that memory demand could be excessive. When MultiLoad or FastLoad are running, each AMP must set up a separate buffer in memory for every AMP in the configuration, and it must allocate space equal to the RedistBufSize number of kilobytes for each of those AMPs.

Which Workload Definition controls pre-AMP processing?
enlarge
Redistribution buffers batch up rows that are going to be sent to other AMPs or other nodes, increasing the efficiency of the row redistribution process.

RedistBufSize works the same on MPRAS, Linux and Windows platforms, and the default value, 4KB, is the same across all platforms.

On your platform each AMP would require 4KB * 28 AMPs = 112KB just for its redistribution buffer when the default value of 4KB is being used. When you change this setting to 63, each AMP will require 63KB per AMP, totaling 1,764KB system-wide. That's approaching 2MB of memory per AMP to support a single load utility's redistribution buffering needs. On small or moderate configurations, this may not put pressure on memory resource. But if you were to double the number of AMPs, for example, the buffer size per AMP will double as well.

Node-level buffering is used for query row redistributions. Node buffering on MP-RAS requires that each AMP set up a 32KB buffer for every node in the system when a query is doing row redistribution. On your platform, each AMP would have a buffer with its size calculated as 32KB * 2 nodes = 64KB. As a side comment, this 32KB-per-node buffer doubles to 64KB on 64-bit platforms, to take advantage of the larger memory available on those platforms.

Bottom line: While you may be getting some improved throughput on your loads, increasing the default RedistBufSize to the maximum is not helping to deliver better query performance.

Priority prior to classification
Dear Carrie: I am a new user of Teradata Active System Management. I know that query priority is determined by the Workload Definition (WD) to which the query classifies, but can you explain to me what happens before the query gets classified to a WD? Specifically, since query classification to a WD doesn't happen until the final module in the parsing engine (PE) (namely, the dispatcher), and since classification determines priority, what priority controls query optimization and other processing that happen upstream of classification?
—Active in Atlanta

Which Workload Definition controls pre-AMP processing?
enlarge
The part of the query processing that takes place in the parsing engine (PE) before query classification occurs will be under the control of the Workload Definition (WD) into which the previous query in that session was classified.

Dear Active: Typically, a query is part of an ongoing session that has completed one or more queries at an earlier time. The WD used by the most recent query in a session is remembered. That WD will control the next query in the session until such time as the new query has been successfully classified. (See figure 2.) All PE processing that is required before query classification takes place will be under the control of the session's previous WD.

If the previous query was demoted to another WD before it completed, the original classification WD will be used for the next-up query's PE work, as a demotion is considered an exception. Whatever allocation group (AG) the previous query's WD maps to will be responsible for all processing prior to AMP execution. Under normal circumstances this should make very little difference, since PE CPU is generally a very small percentage of total query CPU time.

If the session has just logged on and this is the first query, the PE processing will be under the control of an AG already set up in Teradata Active System Management named "M." If a tactical application is composed of queries that are intended to have a very high priority, the initial query in the session will be running at a lower priority during its parsing and optimizing phases. However, once the initial query within that session is completed, both the PE and AMP processing will be controlled by the same high priority—assuming the session's queries consistently classify to the same, high-priority WD and the session remains logged on. T

Teradata Magazine-March 2008

Related Links

Reference Library

Get complete access to Teradata articles and white papers specific to your area of interest by selecting a category below. Reference Library
Search our library:
Manthan
Trillium
Protegrity
Teradata.com | About Us | Contact Us | Media Kit | Subscribe | Privacy/Legal | RSS
Copyright © 2008 Teradata Corporation. All rights reserved.