Register | Log in


Subscribe Now>>
Home News Tech2Tech Features Viewpoints Facts & Fun Teradata.com
Tech Support
Download PDF|Send to Colleague

Search high and low for an answer

A Teradata Certified Master answers readers' technical queries.

by Carrie Ballinger, Senior Technical Consultant, Teradata Certified Master V2R5

Recently at my Saturday morning tai chi class, my teacher shared with us this traditional Chinese adage: "When you are trying to solve a problem, look above it and below it." She was instructing us on how to keep our knees from dropping inward by looking above our knees and considering the posture of our hips.

While this ancient wisdom has improved my tai chi moves, the idea of looking above and below for a solution applies just as well to technical problem-solving.

The following is a collection of questions I have received and answers I have provided that will, I hope, increase your knowledge of the Teradata Database. So the next time you face a technical problem, you will be better equipped to look above it and below it, and solve the problem on your own.

Figure 1: Overlapping MultiLoad phases
enlarge
If MaxLoadAWT is less than double the MaxLoadTasks setting, then you will want to overlap the MultiLoad job phases to get the MaxLoadTask number of jobs active without reaching the MaxLoadAWT limit.

Setting the parameters to increase load concurrency
Dear Carrie: Now that I'm on Teradata Database V2R6.1 I'd like to run more MultiLoad jobs concurrently, above the previous limit of 15. But I'm a little confused about how to set the new MaxLoadAWT parameter. Should MaxLoadAWT be equal to MaxLoadTasks or should it double MaxLoadTasks? Are there any rules of thumb?
—Looking Before I Leap

Dear Looking: The purpose of MaxLoadAWT is to prevent an excessive number of AMP worker tasks (AWTs) from being used while supporting an increased number of load jobs. Because you have a limited number of AWTs per node, over-consumption of AWTs by load work could prevent other work from completing.

MaxLoadAWT must be set to a number greater than zero to enable higher load concurrencies. A new load job will be allowed to start only if both MaxLoadTasks and MaxLoadAWT limits have not yet been reached. You need to consider two things in setting MaxLoadAWTs:
The number of concurrent load jobs you would like to run
The number of available AWTs you are willing to burn up to support that work

Your MaxLoadAWT setting cannot be so high that it would exceed 60% of the AWTs defined per AMP. In other words, if your default is 80 AWTs per AMP, your maximum setting would be 48.

Because the first phase, the acquisition phase, of MultiLoad uses two AWTs per AMP, to start multiple loads at the same time you want a MaxLoadAWT setting that is double the MaxLoadTasks setting. For example, you could set MaxLoadTasks to 18 and MaxLoadAWT to 36.

The second phase of MultiLoad requires only one AWT, so if you're able to overlap the acquisition and application phases of the load jobs you can make MaxLoadAWT less than double the MaxLoadTasks. Figure 1, above, illustrates how overlapping jobs requires fewer AWTs.

I'd suggest starting out that you make MaxLoadAWT double the MaxLoadTasks setting and tune it from there.

Priority Scheduler changes going from MP-RAS to Linux
Dear Carrie: We've used Priority Scheduler extensively for the last several years on MP-RAS, but we have just moved to Linux. Do I need to make any changes to my workload management setup because the operating system (OS) has changed?
—Operating With Caution

Dear Cautious: While Priority Scheduler does use services from the OS, the user interface and the setup options are the same across the different OSs supported by the Teradata Database. All of the underlying prioritization algorithms are identical whether you are on MP-RAS, Windows or Linux. Monitoring will look and work the same as well.

While OS particularities have led to some minor differences in Priority Scheduler internals across platforms, I can think of no reason for you to re-assess how you are using priorities or their definitions.

Figure 2: Adjusting query milestones
enlarge
How much you want to reduce your query milestone limit will depend on the amount of increase in hardware power of the new node, as well as any change to the number of AMPs per node.

Of course, it's a good idea to periodically review the effectiveness of your workload management decisions. User growth, new applications, hardware upgrades or a change in OS are all good reasons to give your priority scheme, as well as any Object Throttles or Workload Definitions, a health check.

Finally, as a reminder, if you are moving to a Linux platform with faster CPUs, consider adjusting your query milestone thresholds to a lower speed, as faster CPUs accomplish more work in the same number of CPU seconds (see figure 2, left).

Collected statistics and case sensitivity
Dear Carrie: When columns in a table are defined as NOT CASESPECIFIC, does case sensitivity of values in those columns have any effect on the way Optimizer uses the statistics collected on that table? Let's say I define a column as NOT CASESPECIFIC and then insert values in both uppercase and lowercase. When I collect statistics on that column, will the number of unique values shown be different from what is returned when an SQL query is fired?
—A Sensitive Guy

Dear Sensitive: The short answer is yes. By default, the statistics collection process uses a case-specific approach, even if the column is defined as NOT CASESPECIFIC.

This is how it works: After statistics have been collected, values in the column or index are divided across multiple intervals (200 intervals as of Teradata Database 12.0) in order to build a histogram. Each interval contains details that reflect that interval's demographics. Uppercase and lowercase representations of the same value will map to different intervals.

When statistics with mixed-case values are collected, interval buckets incorporate both lowercase and uppercase values for that particular string. Even if the lowercase (or uppercase) values do not appear in the max or the mode value for the interval bucket, the Optimizer will assume they are there and the non-modal information from these intervals will be used to make estimates.

Lowercase and uppercase values will not be adjacent in the histogram because of how the sorting algorithm places them, which is why they will always appear in different intervals. However, if no intervals exist for lowercase values, only uppercase values are considered for estimates (and vice versa). If estimates are assumed to be present from different case representations, they will be added together. Estimates will still be accurate if the query-predicate comparison is case-specific. But there is a potential for over-estimation if non-case-specific comparisons are made.

Since the hashing algorithm is not case-specific, "CA" and "ca" will hash to the same AMP. However, the Optimizer assumes that every different representation is a different value and has a distinct hash. Based on that assumption, it computes an internal rows-per-value metric that is applied to join cardinalities and other estimates.

I strongly recommend, therefore, that you keep the data clean with every value having the same case representation—unless there is a business requirement for not doing so. For example, if the user transforms the character data to be always uppercase (or always lowercase), then the histogram will only reflect intervals that are all uppercase (or all lowercase) and only a single interval will be used for estimates. This consistent approach will ensure that the statistics collection operation will find only a single instance per value.

Two sets of column-level statistics, illustrated below, show the impact of mixed-case data in a column that is NOT CASESPECIFIC. The column used in this example, c_mktsegment, has five distinct values. In both sets, statistics were collected on the column prior to issuing the command that displays the column statistics detail.

The c_mktsegment column in the Customer table contains only uppercase values for the table. (I have added comments next to the output to explain the function of the different columns in the statistics histogram.)

Because there are only five distinct values in this column, high-biased intervals are used. Each high-biased interval can store statistical information for about two values. Values that appear in high-biased intervals are considered to be representative of skew and are labeled "loner values." Any column with 200 or less distinct values will be represented by such high-biased intervals—even if the data values are evenly distributed as they are in the example illustrated below. To find out more about intervals and how they are constructed, read the Query Optimization Interval Histogram chapter in the SQL Reference: Statement and Transaction Processing publication, found at www.Info.Teradata.com.

HELP STATISTICS customer COLUMN (c_mktsegment);
 
Date         7/6/2018         <— Start of Interval 0, summary data
Time         8:48:41                
Number of Rows         7500000                
Number of Nulls         0                
Number of Intervals         3         <— Indicates 3 intervals are used below
Number of Uniques         5         <— 5 distinct values were found
Numeric         N                
Sampled         0                
Sampled Percent         0                
Version         2                
FieldCount         1                
Min Value         AUTOMOBILE                
Mode Value         AUTOMOBILE                
Mode Frequency         1501166                
 
                                <— Start of Interval 1
Max Value         BUILDING         <— 2nd loner value in the interval
Mode Value         AUTOMOBILE         <— 1st loner value in the interval
Mode Frequency         1501166         <— # of rows with 1st loner value
Non-Modal Values         -2         <— Indicates 2 values are in this interval
Non-Modal Rows         1500036         <— # of rows with 2nd loner value
 
                                <— Start of Interval 2
Max Value         HOUSEHOLD                
Mode Value         FURNITURE                
Mode Frequency         1498469                
Non-Modal Values         -2         <— Indicates 2 values are in this interval
Non-Modal Rows         1500554                
 
                                <— Start of Interval 3
Max Value         MACHINERY                
Mode Value         MACHINERY                
Mode Frequency         1499775                
Non-Modal Values         -1         <— Indicates 1 value is in this interval
Non-Modal Rows         1499775                

The second histogram shows the results of modifying a subset of that same column's values so that one-third are now lowercase, instead of all of them being uppercase.

HELP STATISTICS customer COLUMN (c_mktsegment);
 
Date         7/6/2018                
Time         9:22:28                
Number of Rows         7500000                
Number of Nulls         0                
Number of Intervals         5         <— 5 intervals are now being used, instead of 3
Number of Uniques         10         <— Distinct values have doubled, from 5 to 10
Numeric         N                
Sampled         0                
Version         2                
FieldCount         1                
Min Value         AUTOMOBILE                
Mode Value         AUTOMOBILE                
Mode Frequency         900949                
 
Max Value         BUILDING                
Mode Value         AUTOMOBILE                
Mode Frequency         900949                
Non-Modal Values         -2                
Non-Modal Rows         899148                
 
Max Value         HOUSEHOLD                
Mode Value         FURNITURE                
Mode Frequency         900148                
Non-Modal Values         -2                
Non-Modal Rows         900690                
 
Max Value         automobile         <— Lower case representations begin here
Mode Value         MACHINERY                
Mode Frequency         899066                
Non-Modal Values         -2                
Non-Modal Rows         600217                
 
Max Value         furniture                
Mode Value         building                
Mode Frequency         600888                
Non-Modal Values         -2                
Non-Modal Rows         598321                
 
Max Value         machinery                
Mode Value         household                
Mode Frequency         599864                
Non-Modal Values         -2                
Non-Modal Rows         600709                
Figure 3: New work can start with a subset of AWTs
enlarge
There is no all-or-nothing coordination across AMPs involved in processing new work messages on behalf of a query. Whichever AMPs can get an AWT right away will begin their work immediately.

Uneven availability of AWTs across AMPs
Dear Carrie: I loved your column on AWTs in the June 2006 issue of Teradata Magazine. But there's one thing you didn't address that confuses me. What happens when a WorkNew message is sent from the parsing engine (PE) to all AMPs, and some of the AMPs do not have an AWT available to process it, while others do? Do all of the AMPs that were able to get an AWT give theirs up and put the message back on their queue until every single AMP has an available AWT? That seems like it could get messy.
—Trying to See the Big Picture

Dear Big Picture: I'm glad you enjoyed that column. First, there's no MPP coordination involved in processing WorkNew messages (messages that come from the PE and go to all AMPs). I agree with you—that level of coordination could get messy.

Instead, the AMPs with available AWTs will immediately begin processing the WorkNew messages. (See figure 3, right.) In this scenario, each AMP gets its AWTs, does its share of the processing, and releases its AWTs—all independently from what other AMPs are doing. AMPs that have to wait for an AWT are likely to take longer than other AMPs to complete each step. The synchronization point comes about when all AMPs, in combination, must complete the step they are currently working on before the next step can be sent by the dispatcher. The ones that obtained their AWTs sooner will finish that step sooner.

Where there is a bit of coordination is when there is spawned work, such as row redistribution. For spawned work, a WorkOne message appears on each AMP; this requires an additional AWT to act as a receiver. Row redistribution cannot begin until all AMPs have receiver tasks ready to receive the redistributed rows. If some AMPs cannot accommodate that, and have the WorkOne message queued up, then all AMPs will wait to begin row redistribution.

AMPs use special BYNET pathways called channels to signal to one another during query processing so they know when the last AMP has its spawned AWT. At this point, row redistribution can begin. However, let me add that it is less likely for WorkOne messages to get out of sync in their access to AWTs because WorkOne messages sort closer to the front of the message queue than do WorkNew messages. That means that if an AMP is out of AWTs, spawned work will get available AWTs before additional new work will be started. T

Teradata Magazine-December 2007

Related Links

Reference Library

Get complete access to Teradata articles and white papers specific to your area of interest by selecting a category below. Reference Library
Search our library:


Protegrity

Teradata.com | About Us | Contact Us | Media Kit | Subscribe | Privacy/Legal | RSS
Copyright © 2008 Teradata Corporation. All rights reserved.