Tech2Tech
Tech Support

Carrie Ballinger,
Senior database analyst,
Teradata Certified Master
Same, only different
Teradata Certified Master assists readers by answering their technical questions.
by Carrie Ballinger
When I drive home after my monthly mystery book club meeting, I’m always amazed at how a group of people all reading the same book could see and experience the content so differently. Some book club members make connections between characters that I miss, and I often find holes in the plot that others don’t catch.
When people put their heads together about anything—whether it’s a crime thriller or a technical problem—they usually end up with a broader outlook on and a deeper appreciation of the subject.
I enjoy researching and answering the technical questions that show up in my inbox in part because they often change my perspective on the topic. In this column, I focus on the statistics collection process, with the intention of taking some of the mystery out of this frequently performed activity.
Statistics extrapolation
Dear Carrie:
I read about a feature in Teradata 12 that extrapolates what the statistics should be when the user falls behind in re-collecting them. That’s an important feature for us, but I’m not sure it’s happening. Are there times it doesn’t work? How can I tell?
—Doubting DBA
Dear Doubting:
Statistics extrapolation is useful for large tables that are growing—for example, large partitioned primary index (PPI) tables. Often, these are the most time-consuming tables when re-collecting statistics. If stale statistics are detected when a query is submitted, an upward adjustment is made to the table row count that is carried in the histogram. (See figure 1.)

Click to enlarge
You will not experience extrapolation under some circumstances:
- Insignificant growth. The table must have increased its row count by the default thresholds of 10% or 10,000 rows.
- Small tables. They must have at least 25 rows per AMP.
- Tables with a skewed primary index (PI). If the PI histogram shows a high mode frequency greater than four times the average rows per value, extrapolation is not attempted.
You can force extrapolation of small tables or tables with a skewed PI by switching to an all-AMP random AMP sampling. Staleness is detected by comparing the estimated row count from random AMP sampling against what is in the histogram; therefore, using all AMPs to build up the sampling estimate makes it more trustworthy as a source.
To verify that extrapolation is occurring, follow these steps:
- Run an explain on the table for a query that does a full-table scan with no selection. Record the returned estimated row count.
- Load the table to meet one of the two extrapolation thresholds (row growth of 10% or 10,000).
- Run a new explain using the same query. If the estimated row count has changed compared with the original explain, extrapolation is taking place.
In the following example, a transaction table (txn) that is partitioned by date (T_TXNDATE) starts with 13.6 million rows. There is no skew in the txn table’s PI. Statistics are collected on the date column, and a query is run that asks for the number of distinct T_TXNDATE values. Both the HELP STATS output and the query result indicate 731 distinct values. The following demographics apply:
> Minimum T_TXNDATE = 92-01-01
> Maximum T_TXNDATE = 93-12-31
SELECT count(*) FROM txn;
Count(*)
13670282
HELP STATS TXN;
Date Time Unique values Column names
08/12/08 10:37:00 731 T_TXNDATE
08/12/08 14:30:36 13,670,282 T_TXNKEY
SELECT COUNT (DISTINCT t_txndate) FROM txn;
Count(DISTINCT(T_TXNDATE));
731
This next explain shows that a query that asks for the last six months of dates will receive a good estimate when compared to the actual rows returned from a similar count: estimated 3.44 million rows versus actual 3.43 million rows.
EXPLAIN SELECT * FROM txn
WHERE t_txndate BETWEEN 930701 AND 931231;
Explanation
1) First, we lock a distinct CAB.“pseudo table” for read on a RowHash to
prevent global deadlock for CAB.txn.
2) Next, we lock CAB.txn for read.
3) We do an all-AMPs RETRIEVE step from CAB.txn by way of an all-rows
scan with a condition of (“(CAB.txn.T_TXNDATE >= DATE ‘1993-07-01’)
AND (CAB.txn.T_TXNDATE <= DATE ‘1993-12-31’)”) into Spool 1
(group_amps), which is built locally on the AMPs. The size of Spool 1
is estimated with high confidence to be 3,424,736 rows (356,172,544
bytes). The estimated time for this step is 2.97 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved in
processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 2.97 seconds.
SELECT count(*) FROM txn
WHERE t_txndate BETWEEN 930701 AND 931231;
Count(*)
3438677
When a query changes its date range to include six months of past dates plus six months of future dates, the estimate remains about the same. No extrapolation has taken place because the row count in the base table is the same as in the histogram.
EXPLAIN SELECT * FROM txn
WHERE T_TXNDATE BETWEEN 930701 AND 940630;
Explanation
1) First, we lock a distinct CAB.“pseudo table” for read on a RowHash to
prevent global deadlock for CAB.txn.
2) Next, we lock CAB.txn for read.
3) We do an all-AMPs RETRIEVE step from CAB.txn by way of an all-rows
scan with a condition of (“(CAB.txn.T_TXNDATE >= DATE ‘1993-07-01’)
AND (CAB.txn.T_TXNDATE <= DATE ‘1994-06-30’)”) into Spool 1
(group_amps), which is built locally on the AMPs. The size of Spool 1
is estimated with high confidence to be 3,424,736 rows (356,172,544
bytes). The estimated time for this step is 2.25 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved in
processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 2.25 seconds.
Now, 50% more rows are inserted into the txn table. No statistics are re-collected. The row count in the table has grown from 13.6 million to 20.4 million after the load.
INSERT INTO txn
SELECT * FROM txn_stage
WHERE T_TXNDATE BETWEEN 940101 AND 941231;
SELECT COUNT(*) FROM txn;
Count(*)
20495830
However, when a HELP STATS is issued, the results are the same because the histogram has not changed. The statistics are now stale.
HELP STATS txn;
Date Time Unique values Column names
08/12/08 10:37:00 731 T_TXNDATE
08/12/08 14:30:36 13,670,282 T_TXNKEY
SELECT COUNT (DISTINCT T_TXNDATE) FROM txn;
Count(DISTINCT(T_TXNDATE));
1096
Extrapolation takes place when the explain is run for the query that requests rows spanning one year. The one-year span includes six months that the histogram sees and six recently added months that the histogram does not see. Note in the explain text below that the estimate is 6.8 million rows, when previously this same explain carried an estimate of 3.4 million rows before the load. It is this upward increase in the estimated row count that validates that statistics extrapolation has taken place on the t_txndate column. After the explain text, the count using the same selection criteria further validates that the 6.8 million-row estimate is on target.
EXPLAIN SELECT * FROM txn
WHERE t_txndate BETWEEN 930701 AND 940630;
Explanation
1) First, we lock a distinct CAB.“pseudo table” for read on a RowHash to
prevent global deadlock for CAB.txn.
2) Next, we lock CAB.txn for read.
3) We do an all-AMPs RETRIEVE step from CAB.txn by way of an all-rows
scan with a condition of (“(CAB.txn.T_TXNDATE >= DATE ‘1993-07-01’)
AND (CAB.txn.T_TXNDATE <= DATE ‘1994-06-30’)”) into Spool 1
(group_amps), which is built locally on the AMPs. The size of Spool 1
is estimated with high confidence to be 6,831,393 rows (710,464,872
bytes). The estimated time for this step is 3.81 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved in
processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 3.81 seconds.
SELECT count(*) FROM txn
WHERE T_TXNDATE BETWEEN 930701 AND 940630;
Count(*)
6824121
All-AMP random AMP sampling
Dear Carrie:
Why would I want to change from single-AMP random AMP sampling to all-AMP sampling? I have 375 AMPs in my production platform, and it strikes me as a poor use of resources to get them all working on random AMP samples. Shouldn’t I be minimizing the number of AMPs I access?
—Single and Happy

Click to enlarge
Dear Single:
You are correct that all-AMP sampling uses the same techniques as single-AMP random AMP sampling, the only difference being that more AMPs are participating. Where it can help is if the PI is skewed, then sampling more AMPs can improve the quality of the statistical information. Statistics extrapolation will not occur when the PI exhibits skew unless all-AMP random AMP sampling is in effect. So if you want to enjoy extrapolation on your large PPI tables, and the PI demonstrates skew, consider all-AMP sampling.
Furthermore, rather than relying on multiple messages and all-AMP aggregations, all-AMP sampling relies on BYNET-based “channels” to add up row count estimates. A channel is a software construct within the BYNET that simplifies coordination among all of the resources working on the same job. (See figure 2.) For example, when AMPs are working on the same query step, they use channels to communicate with one another.
In a recent engineering test performed on a relatively small configuration, all-AMP random AMP samples were found to add only 15% more time to the process compared with single-AMP sampling. This overhead may be somewhat higher for larger configurations that have a larger number of AMPs. But even though more work is required when a larger numbers of AMPs are involved, the sampling is performed in parallel across all AMPs.
Ultimately, the random AMP sample results will remain in the dictionary cache for some time, possibly several hours. Although the all-AMP type has slightly more overhead, the sampling activity is not expected to happen frequently, and its results are cached and shared among many plan-producing events.
Re-collecting statistics after an upgrade
Dear Carrie:
We will soon add four nodes to our 12-node system. One of the post-upgrade tasks our site tech recommended was to re-collect statistics on all of our tables.
Re-collecting statistics after an upgrade is a good practice. Even though statistics do not contain any configuration-dependent information, the re-collections validate that they are fresh and accurate.
As we see it, only the data distribution per disk drive is changing, not the data demographics. Statistics on the data should still be valid after a reconfiguration. Do we really need to crank up the stats collection machine after our upgrade?
—Pushing Back
Dear Pushing:
While it’s not absolutely necessary, re-collecting statistics after an upgrade is a good practice. Even though statistics do not contain any configuration-dependent information, the re-collections validate that they are fresh and accurate.
And while the statistics may have been fresh enough to get good plans under the previous configuration, the power of the new hardware is considered along with the statistics when the Optimizer factors the costs of different approaches. After a reconfiguration, the cost calculations may be different, and good plans might not be chosen when combined with somewhat out-of-date statistics. A reconfiguration is usually a good time to get everything as fresh and as accurate as possible.
Average AMP rows-per-value
Dear Carrie:
At a customer site I was reviewing the results returned when a command was issued to look at the collected statistics histogram:
HELP STATISTICS tablename COLUMN columnname;
I noticed that the average AMP rows-per-value (RPV) fields were almost always zero. In one case I knew enough about the data demographics of one column to suspect that zero was unlikely: 60,000 rows uniformly distributed on a unique primary index (UPI) with only 39 distinct values for the particular column. I ran a query to calculate average AMP RPV, and the result was almost 69. Have you heard of any problems with this histogram field?
—Averaging Avatar
Dear Averaging:
The average AMP RPV that appears in Teradata 12 histograms is computed only for non-unique secondary indexes (NUSIs), not for non-indexed columns.
The purpose of the average AMP RPV field in the histogram is to determine the cost of a nested join by performing a system-wide average of the average rows per value that each AMP calculates. This column is used to help the Optimizer decide whether it’s cheaper to scan the table or perform a nested join. Check out your NUSI histograms and you’ll see a value in that field.
In the table below, assume that the two non-PI columns have statistics collected on them:
> P_TYPE is a non-indexed column
> P_SIZE is a NUSI
CREATE MULTISET TABLE CAB.parttbl
(
P_PARTKEY INTEGER NOT NULL,
P_NAME VARCHAR(55) NOT NULL,
P_TYPE VARCHAR(25) NOT NULL,
P_SIZE INTEGER NOT NULL,
P_RETAILPRICE DECIMAL(15,2) NOT NULL)
UNIQUE PRIMARY INDEX ( P_PARTKEY )
INDEX ( P_SIZE );
HELP STATS parttbl;
Date Time Unique values Column names
08/12/04 14:11:00 6,000,000 P_PARTKEY
09/03/16 15:56:47 150 P_TYPE
09/03/16 15:53:15 50 P_SIZE
The table below shows subsets of the fields in interval zero contained within the histograms for the two columns listed above. Because P_SIZE is a NUSI, it has a histogram with the interval zero carrying an average AMP RPV value. The non-indexed column P_TYPE does not have a value in that field.
Locking contention with collect statistics
and continuous load
Dear Carrie:
When statistics are collected, I know that the table has a write lock placed on it at the end of the collection process so the table header can be updated. Is there a way to know how long this lock will be held?
I’m designing a process to collect statistics on tables that are being loaded in real time off a TIBCO queue. My process will be fired off periodically and will calculate the growth of the table from the last time statistics were collected. If the calculation is greater than X percent, I will generate a statistics collection routine on columns and indexes on that table. We currently have a similar process running four times a day on continuous load processes, but these are mini-batches, so the locking during statistics collection is not interfering, as far as I can tell. If the table is being updated continuously using row hash locks, what potential issues might I find?
—Conflict-Averse

Click to enlarge
Dear Conflict:
The table header will not be updated at the time that statistics are collected because none of the full statistics-collected information is kept there. Rather, it is stored in either the DBC.TVFields or the DBC.TVIndexes tables in the data dictionary.
While continuous loading places row hash write locks on individual user table rows, the process does not place the locks on the dictionary tables. In addition, statistics collection places an access lock on the base table, so I wouldn’t expect any added conflict if the collection process co-exists with continuous loading. If Teradata TPump is employed for continuous loads, its use of cached plans reduces demand for statistics information, which would have been required during optimization.
During the statistics collection process on a user table, a row hash write lock will be set on either the DBC.TVFields or DBC.TVIndexes tables. This, in turn, updates the dictionary row with the just-created histogram. After the dictionary updates, any statistics information from that user table, along with associated cached plans, will be flushed from the dictionary cache. The next TPump statement will have a little more parsing engine work to do in order to rebuild the plan and access the new statistics from the data dictionary.
The only time that lock contention could occur because of statistics collection is if another query is being parsed that requires the statistics for the same table. The second query will attempt to get statistical data from the dictionary cache. If the statistics have already been flushed but the row hash write lock on the dictionary table has not yet been released, that second query will experience a no-wait block as it tries to read the dictionary row. To avoid being delayed, the second query will then retrieve the dictionary row using an access lock. Dictionary data that is retrieved via an access lock is never cached.
In general, row hash locks do not contribute to major locking contention when collecting statistics and continuous loading occur simultaneously. While statistic collection locks the dictionary tables, continuous loading locks the user table.