Ask the Expert
ADVICE FROM THE CTO
Todd Walter
Realizing the Active Warehouse
New support in V2R4 helps put active warehouse plans into action.
An active warehouse is a necessary foundation for doing business in the Internet age. From frenetic dot-coms that do everything at the speed of the Web to traditional companies changing the way they do business with their customers and suppliers, businesses are finding that traditional analysis of operations and processes is too impersonal, too generic, and too slow. The winners in the first decade of the new millennium will be those who use all the data they can collect to give their customers value. They will communicate personally with their customers rather than bombarding them with unwanted broadcasts. All channels of interaction will be integrated to ensure recognition, relationship, customization, relevance, and respect. Instead of independent links, the supply chain will become a pipeline with continuous two-way communication assuring proper supply delivered at the right time without waste, oversupply, or out-of-stock conditions. Quality issues will be recognized and corrected immediately regardless of the location of the product in the pipeline. And those who realize the promise behind the e-commerce hype will be the ones who reap real benefits through cost savings, waste reduction, customer acquisition and loyalty, and, of course, revenue and profitability.
NCR released Teradata Version 2 Release 4 in July 2000 with enhancements to support these new business requirements. V2R4, combined with new platform choices and a major upgrade of the tool and interface components in the Teradata Warehouse 6 solution, strengthens Teradata's ability to perform as an active warehouse.
How does V2R4 support analysis and reporting functions for the large volumes of data common in active warehousing?
A number of new operations have joined the scalable, parallel analytic functions already integrated into Teradata. These functions allow more analysis to take place inside the database, avoiding costly extractions of large volumes of data.
Group count and sum operations allow a query to use aggregation simultaneously with row-level operations. This combination allows calculations such as contribution percent or rank penetration in a single, straightforward SQL query. Before this feature was introduced, data extraction and programmatic computation was required (see Listing 1).
Although it is possible to write statistical functions as arithmetic expressions in SQL, many people do not. This hesitance is due, I think, to fear of the dusty statistics book on the shelf. However, new functions added to SQL let users simply rely on standard statistical functions, including standard deviation, skew, kurtosis, linear regression, correlation, and covariance. These functions leverage Teradata's aggregation performance and scalability to deliver fast statistical analysis of even the largest volumes of data.
But are these functions standard SQL?
Yes. Teradata was first to market with scalable integrated analytics. To enable these analytics, NCR had to invent the necessary SQL. The SQL committee has since incorporated this SQL with a fast-track addition to the standard. In V2R4, NCR implemented the functions as specified in this addition to the standard (while preserving compatibility for those customers already using Teradata extensions). The adoption of this standard makes it practical for tool and application vendors to integrate use of these analytics into their products.
Queries and applications across the enterprise keep getting more complex. Are there limits to what V2R4 can handle?
Every system has limits, but V2R4 extends Teradata's considerably. The nesting level of views, subqueries, and derived tables has been raised to 64 to match the number of joins per query. The number of intermediate spool files a query uses has been raised by a factor of four to 2,048. The size of a single execution step has been raised by a factor of 16, and the size of a query EXPLAIN is now unlimited. Teradata also raised the maximum number of authorized users to four billion (from 32,000) in order to accommodate active warehouses that had already reached the original limit.
It's harder to understand the queries at these complexity levels. Where can users find help?
V2R4 brings two radical improvements to the process of understanding a complex SQL statement. The new modifier SHOW DML is used like EXPLAIN, except it returns the definition of all schema objects touched by or referred to by a particular SQL statement. This feature is invaluable for understanding the way a complex statement navigates the schema.
A new visual explain tool makes quick work of understanding the most complex query execution plan. With different icons for each type of operation and details behind each icon, users can very quickly understand how Teradata will execute the query. Plans are stored in a database to feed to the visual explain tool. The database also saves the plan across new software versions. When multiple generations of query plans are stored, the visual explain tool can compare them side-by-side, pointing out the differences.
Are there any new features for enabling operational requests to the warehouse?
Teradata now enables stored procedures, which allow application designers and implementors to incorporate tactical and operational decision-making logic into the database. Moving application logic into the database makes it easier to develop and maintain applications. Executing that logic inside the parallel DBMS improves performance, scalability, and response time of operational applications. Teradata stored procedures are ANSI standard compliant.
Many of our requests require only summary information. Can we optimize these queries?
To accelerate queries, warehouse implementors have been using summary tables and data marts throughout the history of data warehousing. In V2R4, Teradata automates these tasks with the implementation of aggregate join indexes (AJIs). Also known as materialized views or aggregate summary tables, AJIs allow DBAs to define summaries that are then managed by the DBMS.
An AJI references one or more base tables, precalculates the result of the specified join or aggregation, and places it on disk. Queries that reference the base tables and that can be answered more efficiently from the precalculated results are rewritten automatically to do so. The rewrite is performed even when the AJI does not completely satisfy the query but must be joined to other tables or indexes to complete the query plan. Updates to the underlying base tables are automatically reflected in the AJI so that queries are always answered from the current data. This capability allows Teradata to minimize the work necessary for creating and operating high-performance applications.
Does V2R4 include improvements for existing applications?
The Teradata query optimizer and execution engine are always learning new tricks. V2R4 contains performance and function upgrades for nonunique secondary indices (NUSIs), improvements to lookup performance and costing algorithms, and enhancements to the LIKE function's use of these indexes. More types of covered indexes are allowed, and the optimizer uses them appropriately when planning a query. Aggregation performance was enhanced to speed up the largest aggregations.
In addition to query enhancements, SQL UPDATE performance was improved. Large, joined UPDATE requests benefit, as do INSERT SELECT statements. Improvements to journaling automatically boost the performance of all SQL updates.
Can a data warehouse deliver mission-critical availability levels?
Companies that rely on an active data warehouse to build their products, stock their stores, and provide ordering information on their B2B Web sites all require the highest levels of availability. Teradata delivers mission-critical availability to many such customers today. V2R4 contains many features that help meet businesses' availability requirements.
Software upgrades, long a sore point for open systems in general, have been reduced to a single reboot time. Going forward to a new release or reverting to an old one using the latest tools requires only a few quick steps, minimizing the outage time and the effort required to manage the system software.
Enhancements to the tools for executing hardware upgrades reduce outage time, too. The reconfig tool moves data to the new nodes faster and more efficiently. The use of block-level operations (rather than row-level) accelerates the cleanup phase on the old nodes. Large indexes no longer need to be dropped before and recreated after the upgrade operation, a process that required precious downtime. In V2R4, the reconfig tool has been upgraded to reconfigure the indexes faster than they can be rebuilt, thereby removing the time required to manage the indexes manually. Realizing how precious customers' data is, NCR recommends a full system integrity check before and after the upgrade. The tools to perform this check have undergone a revision as well; they are now much faster to operate and can be run with a full workload in flight. Altogether, the total time to perform a hardware upgrade is cut approximately in half.
Collecting statistics and creating an index now permit access to the table allowing the DBA to perform his tasks without affecting system users. And, a number of other enhancements make it easier for the DBA, the system administrator, and the operations staff to manage Teradata to the availability levels required by the business.
Will V2R4 be available on Windows?
Last year, Teradata was released on Windows NT for configurations up to four nodes (approximately a terabyte) running V2R3. In September, V2R4 will be released on Windows 2000 for configurations up to 64 nodes (more than 16 terabytes). V2R4 feature functions will work identically on both Windows 2000 and Unix, allowing a choice of operating environments.
What is the operational range of Teradata V2R4?
Teradata is installed on a number of Intel SMP systems from a variety of vendors, running production data warehouses. At the other extreme, the largest production system now has more than 100 terabytes of disk for a single copy of the database on 140 nodes. And this extreme range is not nearly the limit. Teradata is fully capable of scaling to 512 nodes, enabling high performance on nearly half a petabyte of data.
Redefining the Future
The Teradata Active Warehouse vision gained momentum in July 2000 when NCR released Teradata Release 4 as part of the Teradata Warehouse 7 solution. In that release, Teradata did what is expected of a fully scalable database: It grew in every dimension. From strategic queries to continuous update, tactical queries to system management, query complexity to high availability, Teradata redefines the expectations of a DBMS and the future of data warehousing. g
Todd Walteris chief technical officer of NCR's Teradata Development Division.
You can reach him at Todd.Walter@ncr.com.
|
Listing 1: A query to show the rank of sales of a particular product in each
store and the percent contribution to total store sales for the top two
products in each store.
|
SELECT storeid, prodid, sales,
RANK() OVER (PARTITION BY storeid ORDER BY Sales) SalesRank,
SUM(sales) OVER (PARTITION BY storeid) AS Store Sales,
sales = 100 / StoreSales AS Contribution
FROM salestbl
QUALIFY SalesRank <= 2
ORDER BY storeid, Contribution DESC;
|