|
|
|
Ask the expert
Add capabilities to the data warehouse dynamically with Teradata's extensibility features.
|
An active user community
Empowering users of business intelligence adds ROI for your enterprise and confidence for your team.
|
Enterprise BPM
Integration of technologies from Hyperion and Teradata result in an enterprise architecture for Business Performance Management.
|
Purpose-built platform
Enterprise data warehousing requirements drive the design and architecture decisions of the Teradata platform.
|
Need technical help? A Teradata Certified Master answers your queries.
|
|
|

|
|
Tech support: Processes and puzzles
Got a technical question?
A Teradata Certified Master is here to help.
by Carrie Ballinger
I love it when my inbox gets jammed with interesting technical "queries," especially the I-never-thought-of-that-before type. Here are just a few of my recent favorites:
Hash joins
Dear Carrie: I'm getting a hash join
with a single partition in my query plan. Can you tell me if that means I get just one hash partition across all AMPs, or does each AMP have its own single hash partition? I'm running out of spool on one AMP when this query is running, and I'm not sure if the hash join is
causing the problem or not.
—Debugging in Dallas
Dear Debugging: The hash join is
not the guilty party. When you get a query plan that says the step performs
a hash join with a single partition, this means that each AMP has a single partition for the purposes of the join activity. It couldn't be on just one AMP because of the way a hash join is performed:
1. Duplicate
the smaller table to all AMPs.
2. Place it
in a hash
lookup table in memory.
3. Join the larger table a row at a time to the small table rows in memory.
One advantage of the hash join is that it allows the larger table to remain local. If there were only a single partition on just one AMP supporting the hash join process, then that large table would have to be redistributed, with all of its rows going to that single AMP. That would
certainly blow your spool limits!
Cached query plans
Dear Carrie: I'm using macros with parameterized SQL so that I can get the performance benefit of cached query plans. I understand that the Teradata optimizer builds a generalized plan when the macro contains input parameters in place of hard-coded values and that in building this generalized plan, the optimizer cannot use statistics for the columns with parameterized values.
My questions is this: If this general-purpose plan is built at the time the macro is created, what happens if the data demographics change and I then
re-collect statistics on columns used in the macro? Won't the macro's plan get more and more outdated over time?
—Developing Apps in Australia
Dear Developing: You are correct that a generalized plan will be produced when a macro uses parameterized SQL, but that plan won't be built until the macro is executed. Such a macro will use a cached plan only if the request cache on that Parsing Engine (PE) contains the plan from a previous execution. If the plan for that macro is not found, the SQL within the macro will undergo optimization, and that plan will be cached on that PE for future executions of the same macro.
There are several reasons why the request cache may not contain the plan for a particular macro, causing the macro to be re-optimized:
1. The macro has not been run recently and the plan has been aged out.
2. Previous executions of the same macro originated from a different PE, the macro is new to this PE.
3. Either a COLLECT STATISTICS or a DDL statement that references objects contained within the macro has been issued since that macro's plan was placed in that PE's statement cache, causing the plan to be flushed.
In your case, as demographics change and you re-collect statistics, all the PEs will have their statement caches purged of this macro's plans. This purge process is actually part of the stats collection procedure. The next execution following the stats collection will cause a new query plan to be built, taking advantage of new stats for the non-parameterized columns.
Number of AMP worker tasks used by utilities
Dear Carrie: I just looked at some
guidelines for increasing load slots on Teradata in V2R6.1. The document states that FastLoad uses 3 AMP worker tasks during acquisition. I thought that
you stated in the Orange Book, "Understanding AMP Worker Tasks," that utilities only ever used 2 AMP worker tasks per AMP. Could you clarify?
—Florida Fact-finder
|
Figure 1
|
|
When FastLoad Phase 1 is loading data into the Teradata Database, each AMP will use three AMP worker tasks—one sender and two receivers.
|
Dear Fact-finder: I'd be happy to clarify this point. I was wrong: FastLoad does indeed use three AMP worker tasks (AWT) for FastLoad Phase 1. Phase 1, also known as the acquisition phase, is the phase in which the data is coming in from the client in large blocks. These huge blocks of data are essentially dropped down to the AMPs as quickly as possible, bypassing the parsing engine (PE).
When one of these blocks of data arrives on an AMP, one AWT (the sole "sender" task on each AMP) is used to read through the block. This sender either hands off the row to another AWT (if the row belongs on that AMP), or it sends off the row to a different AMP via the BYNET.
The two other AWTs that are active on each AMP are "receiver" tasks. The first receiver accepts the redistributed rows sent from sender tasks active on other AMPs (or rows handed over from the sender AWT on the same AMP) and then copies them into a pre-allocated buffer. When this buffer is full, a second receiver task pre-sorts the buffered data and then appends the data to the target table. Both receiver tasks are active during the entire Phase 1.
Thanks for the catch; I've fixed the Orange Book.
Sparse index
Dear Carrie: I'm designing a sparse join index in order to get single-AMP access for a tactical query that is currently an all-AMP query. The problem I am having is that even though I am passing values for the three columns in my sparse join index primary index, I still get an all-AMP plan. My plan, which follows, does show that the sparse join index is being used, but it's being scanned, not directly accessed using the primary index. Why?
—Beating My Head Against a Wall
CREATE JOIN INDEX testJI AS
SELECT txn_nbr, station_nbr, txn_dollars, org_nbr, txn_date, reg_nbr
FROM customer_txn
WHERE txn_date = '2005-06-24'
PRIMARY INDEX
(org_nbr, txn_date, reg_nbr);
EXPLAIN
SELECT txn_nbr, txn_dollars
FROM customer_txn
WHERE txn_date = '2005-06-24'
AND org_nbr = 4214030
AND reg_nbr = 78037;
Explanation
1) First, we lock a distinct DW35."pseudo table" for read on a RowHash to prevent global deadlock for DW35.TESTJI.
2) Next, we lock DW35.TESTJI for read.
3) We do an all-AMPs RETRIEVE step from DW35.TESTJI by way
of an all-rows scan with a condition of "(DW35.TESTJI.REG_NBR =
78037) AND (DW35.TESTJI.ORG_NBR = 4214030)" into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 3,120 rows. The estimated time for this step is 0.05 seconds.
Dear Beating: The reason this is happening is that you have the txn_date column as both your sparse-defining column and as part of your primary index definition. When the optimizer sees that the query has a txn_date value, it uses that value to qualify the query for the sparse join index usage, and, in so doing, it removes that value from being considered for other purposes in the query plan, such as matching to the primary index columns.
To get single-AMP access in this situation, simply remove the sparse-defining column from the primary index definition. You don't need txn_date as part of your sparse join index PI, because in your example all rows in the sparse join index will have the same value for txn_date. Here's the syntax that will give you what you want:
CREATE JOIN INDEX testJI AS
SELECT txn_nbr, station_nbr,
txn_dollars, org_nbr,
txn_date, reg_nbr
FROM customer_txn
WHERE txn_date =
DATE '2005-06-24'
PRIMARY INDEX
(org_nbr, reg_nbr);
If you want a range of txn_date values within your sparse index, then you could successfully include the txn_date column within the primary index definition of the sparse index. For example, this sparse join index definition will also provide you with single-AMP access:
CREATE JOIN INDEX testJI AS
SELECT txn_nbr, station_nbr,
txn_dollars, org_nbr,
txn_date, reg_nbr
FROM customer_txn
WHERE txn_date
BETWEEN DATE '2005-06-01' AND DATE '2005-06-30'
PRIMARY INDEX
(org_nbr, txn_date, reg_nbr);
It's only when the sparse-defining column appears in an equality condition that you will have a problem getting
primary index access if the same
column appears in the primary index definition.
Building a
secondary index on top of a
primary index
Dear Carrie: There's something that confuses me about the new capability in V2R6 that allows you to build a non-unique secondary index (NUSI) on a PPI table non-unique primary index (NUPI).
Since accessing through a NUSI is an all-AMP operation, how can this functionality be a benefit over the probing of each partition, as you would do when you do NUPI access of a PPI table and you don't have the partitioning column value?
I don't mind so much about the probing because it's local. But I'd really like to avoid another all-AMP plan, since I've got over a thousand AMPs in my configuration. This NUSI-on-NUPI feature has been presented to me as a single-AMP operation, but I don't get it.
—Single-AMP Advocate
|
Figure 2
|
|
Building a non-unique secondary index (NUSI) on a partitioned primary index (PPI) table non-unique primary index (NUPI).
|
Dear Advocate: As strange as it may sound, this feature will indeed provide your query with single-AMP access using a NUSI, which traditionally performs all-AMP access. Here's how that works.
Assume you have a PPI table PlacedOrders, with a NUPI of Orderkey partitioned on Orderdate. The primary index, Orderkey, is actually unique, but has to be defined as a NUPI because it does not include the partitioning key of the PPI table. Because it's defined as a NUPI, each partition on that AMP will be probed when you do PI access to account for the possibility of duplicate
PI rows existing in different partitions.
If you define a NUSI on the NUPI column(s), you get a hybrid approach to accessing the row. The PI value is hashed, and the hash bucket part of the resulting row hash value is used to identify the correct AMP where that row resides. After the AMP is identified, the NUSI subtable is read—but only on that AMP—to pull out the exact row ID for the base table row. Since the NUSI carries the row ID and partition number, no probing is required, and it's a single-AMP operation.
Here's how the explain text looks
in V2R6.1 on a PPI table where
orderkey is defined as both the NUPI
and the NUSI:
EXPLAIN
SELECT * from PlacedOrders
WHERE orderkey = 3757
Explanation
1) First, we do a single-AMP
RETRIEVE step from ADW.PlacedOrders by way
of index # 16 "ADW.PlacedOrders.ORDERKEY
= 3757." with no residual conditions into Spool 1 (group_amps), which is built locally on that AMP. The size of Spool 1 is estimated with high confidence to be 1 row. The estimated time for this step is 0.02 seconds.
Just a final note: This only works on PPI tables; you must collect stats in all the obvious places, and it only works if the query passes the primary index in an equality condition. T
Carrie Ballinger, a Teradata Certified Master V2R5, based in El Segundo,
Calif., is a Senior Technical Consultant in Teradata's Active Data Warehouse Center of Expertise. Carrie joined Teradata in 1988 and has focused on benchmarking, database design and performance, authoring several performance benchmarking suites and various Teradata Orange Books. In her current role, Carrie interfaces between Teradata users and engineering to support tactical query implementations and workload management.
© Teradata Magazine-March 2006
back to top
|