Choosing wisely
As your Teradata Warehouse matures, what data acquisition and integration strategy best suits your enterprise?
by Randi Zeehandelaar & Margaret Mills
The acquisition, consolidation, transformation, movement and synchronization of data throughout the enterprise is somewhat like your network. You don't think about it much until it goes down. Yet it's indispensable to the way you work every day. In a data warehousing environment, data acquisition and integration takes the form of determining what, when, how and how much data will be moved between transactional systems and the data warehouse and vice versa. As your data warehouse matures, various data acquisition and integration strategies need to be considered throughout its life cycle.
In its early phases, the data warehouse is used primarily as a decision support enabler, and loading data periodically via a background bulk process is usually the accepted strategy. As more frequent updates are needed, the preferred strategy is to execute scheduled, incremental mini-batches throughout the day.
However, as the data warehouse becomes more mission-critical, requirements for availability and data freshness approach near real-time. More complex strategies involving data synchronization, integration and accessibility become a necessity to facilitate business-critical decision-making throughout the enterprise. This article discusses the various strategies and how Teradata accommodates a number of data integration methods for the Teradata Warehouse.
Data integration challenges
The following recent technology and business trends, coupled with an explosion in the number of data acquisition and integration options, are making integration decisions more complex:
- Diverse, increased amounts of data
A Knightsbridge white paper, "Top Ten Trends in Data Warehousing (2004)," referenced research that indicates the top 10% of enterprises expect to have 5-15 TB of raw data by 2006. This rise requires new tools to manage the complexity of integrating disparate data sources without expanding the IT resources needed to support them.
- The shift to real-time updating
Data warehousing, like business intelligence, is moving toward real-time updating. According to "Data Warehousing: Update Frequency is Accelerating," in the TDWI-Forrester 2004 Quarterly Technology Survey, the number of data warehouse customers updating their data warehouses multiple times per day increased by 7% in the last 18 months and will continue to rise.
- Better external ETL tool performance
As noted by Gartner in its 2003 research report "Slow Growth Ahead for ETL Tools Market," the third-party ETL tools market is experiencing growth as more and more companies replace internally developed extract, transform and load programs with commercially available ETL tools to enhance ROI. Using ETL workflows to acquire data warehouse data allows the IT professional to develop complex data transformations prior to loading and accommodates heterogeneous data sources.
- Accessibility 24/7
More and more companies support customers and business partners throughout the world. Global accessibility to businesses enabled by the Internet and Web-based technologies places high availability requirements on companies' underlying IT infrastructure. High availability of decision support systems has become crucial to successfully conducting business, resulting in the virtual elimination of batch windows and giving rise to high-availability data warehouse solutions that guarantee data availability and performance.
- The real-time enterprise
Businesses are experiencing a paradigm shift with their data warehouses. The need to leverage the historical data available in the decision support system to enable "intelligent" decision-making in real-time throughout the enterprise is placing more stringent data quality and accessibility requirements on the data warehouse. The data warehouse will need to contain data more closely synchronized with transactional systems and be accessible to the right users at the right time.
Data integration considerations
The choice of data acquisition and integration solutions depends on a number of factors:
- The number and nature of the data sources
- Data volumes
- Data latency requirements
- Batch window availability
- Degree of customization, cleansing or transformation required
- Database and application integration requirements
- Checkpoint/restart requirements
- Budget and resource constraints
- Corporate standards
- Component reusability
- Time-to-market requirements
For each application, each factor should be evaluated thoroughly, since more than one solution may be appropriate. Fortunately, Teradata offers a set of products that allows you to implement the right strategy.
Batch load
Teradata's batch load utilities (FastLoad and MultiLoad) acquire large volumes of data in very short timeframes with minimal CPU overhead. FastLoad is used to load empty Teradata Database tables, and MultiLoad is used to load or update existing tables. Both utilities allow a customer to leverage existing business logic by invoking stored procedures, user-defined functions (UDFs), user-written programs (INMODs) or access modules in order to select, validate and preprocess input data.The batch load utilities load data in parallel blocks, making them ideal for high volume data loading. Depending upon the volume of the data load, data latency can be from minutes to hours, which can affect your batch window. Because FastLoad and MultiLoad use table locks and the tables are not available during data loading, load jobs must be scheduled to accommodate batch load windows. Therefore, they are not used when continuous updating or 24/7 database access is required.
Batch window limitations can be somewhat mitigated by the use of mini-batches, since FastLoad and MultiLoad can support latencies of one to two minutes. Also, they both have checkpoint/restart capabilities to ensure data integrity during the load process.
While FastLoad and MultiLoad routines are fairly simple to develop, INMODs and access modules may take longer. But once an access module is implemented, it can be used for all applications accessing that particular data source. FastLoad and MultiLoad satisfy a large number of data-loading scenarios with advantages in terms of resource consumption and flexibility.
Continuous load
Teradata TPump addresses the requirement of continuously loading data in near real-time, often referred to as streaming, while simultaneously leaving the tables available for access. Its most common use is for situations requiring continuous data availability, but TPump also can be used to update large Teradata Database tables with relatively low volumes of changes. For example, TPump could be used to update addresses in a customer table since the information changes infrequently in fairly small quantities.
TPump issues a row hash lock for the rows being updated, so users have continuous table access with the exception of locked rows. Like the Teradata batch load utilities, TPump operates in parallel across Teradata nodes. A TPump-based solution may be used in conjunction with a queuing mechanism, an ETL tool or a message broker designed to capture, cleanse and/or transform the data to be loaded into the Teradata Database.
Extract, Transform and Load (ETL)
Teradata is working with customers to maximize the investment they have made in commercially available ETL tools from vendors such as Ab Initio, Ascential and Informatica. All of these vendors can incorporate Teradata Database access into their workflows, thereby issuing SQL statements against a Teradata Database, executing custom programs or directly calling the Teradata load utilities.
An ETL solution should be considered for complex transformations, heterogeneous data access or high-volume data acquisition scenarios. ETL workflows can be reused across applications but can be somewhat resource-intensive to develop. An alternative to ETL is extract, load and transform (ELT). This strategy extracts and loads the data into a Teradata Database first, then uses the power and performance of the Teradata Warehouse to perform the transformation.
Enterprise Application Integration (EAI)
Unlike the above methods, which acquire data at the data level, the following data acquisition and integration method utilizes a middle tier, often referred to as the application level, and accesses the underlying data from there. For example, in order to propagate an address change within SAP to the data warehouse, an SAP business function (called a BAPI, RFC or iDOC, depending on the function) to change the address within SAP would be invoked. Then the following workflow would be executed: 1) transform the output parameters to a message; 2) pass the message to an enterprise message bus like TIBCO, BEA WebLogic, IBM WebSphere or SAP NetWeaver; and then 3) pass the message to the utility that actually updates the Teradata Database using an access module. Enterprise application integration (EAI) should be considered as a data acquisition and integration method when access to the Teradata Warehouse is part of a business process or event or when transactional integrity is required across a heterogeneous environment. The process of developing an EAI workflow requires special application knowledge and can be complex to develop. EAI workflows can be reused but may have to be modified if the application or underlying data structure changes. EAI is the most appropriate method, specifically when it involves exposing services and events in a service-oriented architecture (SOA). Teradata Database V2R6 enables the database to expose services and events via unique UDF interfaces or access modules.
Enterprise Information Integration (EII)
Enterprise information integration (EII) is primarily used as a data acquisition and integration method when supplementing existing data warehouse and analytics tools require read-only access to disparate data. Defined as the virtual federation of disparate data sources, EII leaves the data in place and creates a virtual integrated view of the data rather than creating new physical data structures. It also can be used to access a limited number of data elements existing outside the data warehouse in real-time, without having to import the data. Once a virtual view is created, it can be invoked by any tool issuing SQL statements. Teradata Database V2R6 enables direct access to external sources for loading or to pull volatile data into a query via UDFs and external stored procedures (ESPs). Because significant overhead is sometimes associated with this solution, EII is not suited for high volumes.
Replication
Replication is an enterprise software solution that enables you to capture and apply database change transactions uni-directionally or bi-directionally in near real-time across diverse systems. Uni-directional replication flows in one direction from one or more sources to one or more targets. Bi-directional replication implies data flowing in both directions. Most industry-leading replication vendors such as DataMirror, GoldenGate and Sybase support uni-directional replication from heterogeneous data sources into a Teradata Database. Additionally, many of the leading ETL vendors such as Informatica and Ascential have developed near real-time replication capabilities. To complete the picture, Teradata has introduced Teradata Replication Services with the Teradata Database V2R6 that accommodate uni-directional and bi-directional replication using the Teradata Database as a source.
Business rules are established at design time to indicate which data will be disseminated, where changes will be applied and what filtering or transformations need to be done. After business rules are established, they are invoked automatically every time the data of interest changes in the source tables. Replication can accommodate basic transformations, filtering and field mappings, and has basic checkpoint/restarting capabilities. Latency will generally run from sub-seconds to seconds depending on the complexity of the replication rules and the network topology.
Figure 1 provides a short, high-level architecture and walk-through of replication. Please note that we are referring to general elements of a replication solution; the nomenclature will differ from product to product.
Practical applications
Here is a scenario designed to follow the progress of a hypothetical company as it evolves toward a real-time enterprise: Our hypothetical company has a manufacturing system running on Oracle and wants to build a Teradata Warehouse for strategic analysis of parts information. The first data acquisition and integration challenge is initially loading the data warehouse. The data warehouse requires limited data from this operational system such as part numbers, descriptions and vendor names and failure information. The initial load will be done using FastLoad and an internally developed extract program since it's the most efficient way to load empty Teradata Database tables.
Once the Teradata Warehouse is populated, the next data acquisition and integration challenge becomes the periodic synchronization of updates to the manufacturing system with the Teradata Warehouse. Since the user community in our scenario is a small group and analysis will only be done once a month, incremental updates to the data warehouse are scheduled weekly. MultiLoad is selected as the load strategy to insert the changed data into the data warehouse rather than re-loading the entire database, with updates being triggered from the Oracle system.
The Teradata Warehouse is executing the analysis as planned and is considered to be a success. The data warehouse designers decide to integrate financial information into the data warehouse so vendor price and performance can be analyzed. The data warehouse begins to receive input from a number of data sources, and some of that data needs to go through complex transformations before loading. The company selects a Teradata ETL partner that is equipped to accommodate transformations and input from multiple data sources, then develops an ETL workflow.
As the Teradata Warehouse continues to succeed, accessibility numbers skyrocket to 3,000 users in 90 locations globally. The data warehouse now requires around-the-clock availability, and batch windows start to disappear. With challenging service-level requirements, moving the data from the manufacturing system's database to the Teradata Warehouse introduces a data acquisition and integration challenge that requires a new approach. In this circumstance, TPump is the best choice because it provides near real-time data synchronization requirements. The same Oracle database triggers developed earlier are directed to write the changed data to a queue, saving development costs and expediting implementation.
With a growing number of users and data, an increasing emphasis on the importance of high data deliverability requires the introduction of a new data acquisition and integration strategy. At this time, the company invokes a dual-active strategy.
This strategy enables a "dual load," moving data into both repositories using Dual Apply and Teradata Replication Services to keep the data synchronized between the two systems. Business rules are established where changed data can be directed to each Teradata Warehouse, helping to sustain the new near real-time demands on the data warehouse.
A final data acquisition and integration challenge arises when the company, realizing the value of the Teradata Warehouse, decides to integrate its supply chain management (SCM) system so that historical price performance metrics will be available for engineers choosing the best part for a particular assembly. The company uses a packaged SCM application that interchanges data through an XML format. It also implements an EAI solution to facilitate the flow of messages between the packaged application and the Teradata Warehouse.
Decisions, decisions
These solutions are not mutually exclusive. To meet the wide variety of business requirements, companies need to carefully analyze their needs and determine the optimal data integration approach accordingly. With the right tools and partner products to support the shift from simple data acquisition to complex and near real-time data integration, your Teradata Warehouse will mature gracefully. T
|
Data acquisition and integration in Teradata's Real-Time Enterprise Reference Architecture
|
|
Data acquisition and integration is an intricate part of Teradata's Real-Time Enterprise Reference Architecture. Data from transactional data repositories is collected and stored in the data warehouse. As discussed in this article, both batch and streaming data acquisition and integration solutions are available.Data is acquired in batch mode when data freshness requirements do not warrant real-time data access or when it is a more cost-effective alternative.
Data is acquired in a streaming fashion when real-time data access is required—as in continuous load, for example. The use of extract, transform and load (ETL) tools facilitates data transformations, cleansing and integration and moves the data into the data warehouse in both the batch and streaming solutions.
As data timeliness requirements increase, Teradata's TPump, enterprise application integration (EAI) and message bus tools play roles in streaming data from continuous sources to the data warehouse. Data warehouse availability, recoverability and performance are also impacted—thus solutions that provide synchronization, such as replication, make information available for accessibility throughout the enterprise. (For more on Teradata's Real-Time Enterprise Reference Architecture, please read "Ask the Expert.")
|
(click to enlarge)
|
Randi Zeehandelaar, Teradata enterprise integration marketing manager, supports Teradata tools and utilities, and Teradata Real-Time Enterprise Reference Architecture initiatives. Randi can be reached at randi.zeehandelaar@teradata-ncr.com.
Margaret Mills, Teradata senior product manager, has developed and implemented multi-platform, multi-vendor and multi-size business continuity plans, including Recovery Center solutions and environments. Margaret can be reached at margaret.mills@teradata-ncr.com.