Fallback Protection: Is it Necessary?

An active data warehouse with fallback enabled provides recovery from pathological failure scenarios. Learn more about how the Teradata Fallback offer achieves levels of single system availability unmatched by other RDBMS vendors.

Executive Overview

In the past, a data warehouse was used primarily in back-office business processes. By contrast, the active data warehouse (ADW) is an operational system used during interactions with customers, suppliers, employees, and business partners. As a result, requirements for higher levels of availability are emerging for the ADW.

A data warehouse built with Teradata® Servers and the Teradata Database offers levels of data availability expected in a world-class operational system. An ADW with fallback enabled provides recovery from pathological failure scenarios, such as the complete failure of RAID devices, disconnection of I/O cables, various software errors, and more. The Teradata Fallback offer achieves levels of single system availability unmatched by other RDBMS vendors. This paper provides a high-level discussion of the features of fallback that enhance the availability of a single ADW system.

Introduction

The active data warehouse (ADW) is rapidly becoming a vital part of today's enterprise. Whereas in the past, a data warehouse was used primarily for back-office report generation, today's ADW is a business-critical system used during interactions with customers, suppliers, employees, and business partners. As a result, requirements for higher levels of availability are emerging for data warehousing.

A data warehouse built on the solid foundation of the Teradata Database and the Teradata Server hardware platform offers levels of data availability expected in a world-class operational system. This paper will discuss fallback, a proprietary feature that enhances the availability of Teradata Database services in a single system environment.

Teradata Servers are designed to offer the highest level of reliability – levels expected from a business-critical system. Each hardware subsystem – such as the power, cooling, I/O, and disk subsystems – is constructed with redundant hardware components. Most single-point component failures are masked by the hardware subsystem without impact to the Teradata Database, online applications, or user sessions.

Critical components on the Teradata hardware platform are monitored by a sophisticated diagnostic subsystem. In the event of a component failure, the hardware subsystem(s) will identify and isolate the faulty component(s) and return the system to normal service with the surviving components. The diagnostic subsystem will send a report of the faulty component(s), as well as the status of the subsystem recovery and reconfiguration, to the Teradata Administrative Workstation. Usually, the only people who become aware of a failure of this kind are the DBA and the field engineers.

To extend the availability of database services beyond the limits offered by traditional hardware redundancy, Teradata engineers have developed novel software techniques to rapidly recover from extreme failure scenarios. Fallback is a feature unique to Teradata Database. It manages redundant copies of data objects on alternate storage subsystems within a single database instance. This secondary copy of the data is referred to as the fallback copy. Thus, Teradata Database can minimize the impact of major failure scenarios, such as the complete failure of a disk array cabinet, that would disable other RDBMS products.

Teradata Availability and Recoverability Assessments

The first step toward determining which solutions to pursue in improving the availability of your data warehouse is the Teradata Availability Assessment. This detailed assessment focuses on the technical issues and operational best practices that you should consider for managing the availability of your Teradata environment. It will help you understand which products, services, and features are best suited for your environment, as well as understand how your processes and procedures compare to Teradata best practices.

Teradata Corporation also offers a Business Impact Analysis to help you determine the financial impact of system downtime over various time periods, as well as a Disaster Recovery Planning service, which will document the best practices and processes to follow during a recovery from a disaster.

Fallback Overview

Fallback Clusters

Fallback is an optional feature that provides enhanced levels of data protection over and above that provided by redundant hardware components. When fallback is enabled, copies of data from a given AMP are replicated on one or more fallback AMPs. A small number of AMPs are logically grouped into a fallback cluster. A system with fallback enabled will contain many fallback clusters.

Typically, each AMP in a fallback cluster operates on an independent clique. Thus, the failure of a major hardware component, such as a disk array, will not disable the database. Instead, the data are available via the fallback copy until such time as the failed component is repaired. Replicated data from each AMP are evenly distributed across the other AMPs in the fallback cluster. This architecture ensures that when a given AMP is down, the workload will be evenly distributed across the other AMPs in the fallback cluster.
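The even spreading of fallback rows across a cluster can be illustrated with a minimal Python sketch. The function name `fallback_amp`, the four-AMP cluster, and the modulo rule are assumptions made for illustration; the paper does not describe the placement algorithm that Teradata Database actually uses.

```python
from collections import Counter

def fallback_amp(primary_amp: int, row_hash: int, cluster: list[int]) -> int:
    """Pick a fallback AMP in the same cluster, never the primary itself.

    The modulo rule is an illustrative assumption, not Teradata's algorithm.
    """
    others = [amp for amp in cluster if amp != primary_amp]
    return others[row_hash % len(others)]

cluster = [0, 1, 2, 3]        # hypothetical four-AMP fallback cluster
rows_on_amp_0 = range(1200)   # hypothetical row hashes whose primary is AMP 0

# Count how many of AMP 0's rows land on each of the other cluster members.
placement = Counter(fallback_amp(0, h, cluster) for h in rows_on_amp_0)
print(placement)
```

Because the fallback copies of AMP 0's rows are split evenly among AMPs 1, 2, and 3, each survivor picks up roughly one third of the extra work if AMP 0 goes down, rather than one AMP absorbing all of it.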

Failure Detection and Recovery

If an AMP can no longer access its data, or if its data are corrupted, it is flagged as a down AMP. The database system is not concerned with the cause of the failure – which may be a failure in the disk, the controller, or the I/O cables. The bottom line is that the AMP can no longer access its data and is deconfigured from the system. Thus, a downed AMP fails in a fail-fast mode, which is the most desirable way to fail. After all failed AMPs have been deconfigured, Teradata Database will perform a database restart and resume database operations using the surviving components. The surviving members of the cluster are aware that there is a down AMP and take responsibility for performing all database operations – reads and writes – against its fallback copy.

While the Teradata system is running in fallback mode and a primary AMP is down, each fallback AMP in the associated fallback cluster maintains a journal of all changes made to the fallback data. This architecture ensures that updates to the database can continue without impacting the ETL processes. When the primary AMP is restored to normal service, the primary AMP data will be updated from the fallback journals. Since all failed components have been deconfigured from the system, repairs to each physical component can be performed while the database system is operational.

This attribute contributes to the overall uptime of database services. Eventually the physical components will be repaired or replaced and will be ready to resume service to the database. But before they can be reintegrated into the system, their file systems must be brought back to a current and consistent state. This process is described in the following sections.
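The journaling behavior described in this section can be sketched as a toy model in Python. The classes `Amp` and `FallbackPair` and all of their methods are hypothetical; the model uses a single fallback AMP and an in-memory dictionary, whereas a real cluster spreads the fallback rows and journals over several AMPs.

```python
class Amp:
    """Toy AMP: holds a copy of the rows and an up/down flag."""
    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.up = True

class FallbackPair:
    """Toy model of one primary AMP and one fallback AMP (an assumption)."""
    def __init__(self):
        self.primary = Amp("primary")
        self.fallback = Amp("fallback")
        self.journal = []   # changes applied while the primary is down

    def write(self, key, value):
        # Every write goes to the fallback copy.
        self.fallback.rows[key] = value
        if self.primary.up:
            self.primary.rows[key] = value
        else:
            # Primary is down: remember the change for later replay.
            self.journal.append((key, value))

    def read(self, key):
        # Reads use the primary unless it is unavailable.
        source = self.primary if self.primary.up else self.fallback
        return source.rows[key]

    def recover_primary(self):
        # Replay journaled changes in order, then return to normal service.
        for key, value in self.journal:
            self.primary.rows[key] = value
        self.journal.clear()
        self.primary.up = True

pair = FallbackPair()
pair.write("order-1", "new")
pair.primary.up = False              # simulate a down AMP
pair.write("order-1", "shipped")     # updates continue against the fallback
pair.write("order-2", "new")
value_while_down = pair.read("order-1")
pair.recover_primary()
```

After `recover_primary()` runs, the primary copy matches the fallback copy again and the journal is empty, which corresponds to the "current and consistent state" required before a repaired component rejoins the system.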

Fallback Recovery Tools

Teradata Database includes a set of utilities that leverages the architecture of fallback to validate the integrity of data and/or reconstruct damaged data. These utilities minimize the work necessary to restore Teradata Database to normal service after a severe failure scenario.

Rebuild

The Rebuild tool is used to restore data for a failed AMP. The entire contents of the failed AMP are rebuilt from data on the fallback AMP. Rebuild can put the contents of an AMP back together in a few hours while the rest of the system is fully operational.

Without fallback and the Rebuild tool, such a disk failure would necessitate a restore from an external source while the RDBMS is offline. Rebuild can also recover a single table. If data corruption of some sort occurs in only one table on one AMP (usually found by CheckTable), Rebuild can recover the contents of that table from the fallback copy with the system online, rather than requiring a lengthy restore process and requiring the applications using that table to be offline.

Failure scenarios that would bring most other RDBMS engines to a complete halt – followed by a lengthy rebuild procedure – can be managed by a couple of short database restarts when using Teradata Database with fallback.

CheckTable

CheckTable is a Teradata Database utility that exploits the redundant copy of data present in the system. It compares primary and fallback copies of data and reports any discrepancies. If primary rows are missing, the table can be rebuilt from its fallback copy. If fallback rows are missing, they can be rebuilt from the primary.
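The kind of comparison CheckTable performs can be illustrated with a short Python sketch. This is not the CheckTable utility or its interface; the function `compare_copies` and the dictionary representation of a table are assumptions chosen only to show the idea of diffing a primary copy against its fallback copy.

```python
def compare_copies(primary: dict, fallback: dict) -> dict:
    """Report rows missing from either copy and rows whose contents differ."""
    shared = primary.keys() & fallback.keys()
    return {
        # Present only in the fallback copy: rebuild primary from fallback.
        "missing_in_primary": sorted(fallback.keys() - primary.keys()),
        # Present only in the primary copy: rebuild fallback from primary.
        "missing_in_fallback": sorted(primary.keys() - fallback.keys()),
        # Present in both copies but with different contents.
        "mismatched": sorted(k for k in shared if primary[k] != fallback[k]),
    }

# Hypothetical row IDs and values for one table.
primary_copy = {1: "alpha", 2: "beta", 4: "delta"}
fallback_copy = {1: "alpha", 2: "BETA", 3: "gamma"}

report = compare_copies(primary_copy, fallback_copy)
print(report)
```

An empty report for all three categories would indicate that the two copies agree; any non-empty list identifies which copy needs to be rebuilt from the other.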

Fallback Options

The implementation of fallback provides flexibility as to which database objects are replicated on fallback AMPs. Fallback can be declared at the table, database, and system level. The DBA can evaluate the business value of each database object to determine whether or not the object should be protected with fallback.

Some tables (e.g., temp tables for storing query results) should clearly not be fallback protected as their contents are easily recreated. At the other end of the spectrum, a table that is critical to operating the business and is costly or time-consuming to restore or recreate is a prime candidate for fallback protection.

Performance Impacts of Fallback

Fallback is not free – a system with fallback enabled consumes more system resources than one without fallback.

Fallback will have no effect on read performance. Read operations are performed by the primary AMP only unless it is unavailable, at which time the fallback AMP will deliver the data.

The performance of write operations, however, will be affected by fallback. The primary reason for this is that every write must be performed twice – once for the primary and a second time for the fallback copy. This does not mean that a fallback enabled system needs to be twice as large as the non-fallback counterpart. Consider a system that utilizes 20% of its capacity for updates, 60% of its capacity for queries, and has excess capacity on the order of 20%. There are two areas that require additional capacity. First, the 20% CPU utilization consumed by the update process will grow to 40%, effectively consuming the excess capacity. Next, the amount of disk capacity required will grow. But the computation regarding additional disk capacity should only be considered for those tables that are fallback enabled.

The bottom line is that to use the fallback option for every table in the system requires that the system be configured with 1.7 to 1.9 times the current disk capacity and additional processing power commensurate with the update load on the system. This additional cost can be ameliorated by choosing to fallback protect only some of the tables in the system. In this case, the additional disk and processing power needs to be calculated on a table-by-table basis.
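The arithmetic above can be worked through in a small Python sketch. The function name, the 10 TB starting disk size, and the `protected_share` parameter (the fraction of update work and stored data belonging to fallback-protected tables) are assumptions for illustration; real sizing should be done table by table as described above.

```python
def fallback_requirements(update_pct, query_pct, disk_tb,
                          protected_share=1.0, disk_multiplier=1.8):
    """Estimate CPU utilization and disk capacity after enabling fallback.

    protected_share is a hypothetical simplification: the fraction of update
    work and of stored data that belongs to fallback-protected tables.
    """
    # Each protected write is performed twice: primary plus fallback copy.
    cpu_pct = update_pct * (1 + protected_share) + query_pct
    # Only the protected share of the data needs additional disk space.
    disk_needed_tb = disk_tb * (1 + protected_share * (disk_multiplier - 1))
    return cpu_pct, disk_needed_tb

# The example from the text: 20% updates, 60% queries, every table protected.
cpu_all, disk_low = fallback_requirements(20, 60, disk_tb=10, disk_multiplier=1.7)
_, disk_high = fallback_requirements(20, 60, disk_tb=10, disk_multiplier=1.9)

# Protecting only half of the data and update load (hypothetical).
cpu_half, disk_half = fallback_requirements(
    20, 60, disk_tb=10, protected_share=0.5, disk_multiplier=1.7
)

print(f"All tables:  CPU {cpu_all:.0f}%, disk {disk_low:.1f} to {disk_high:.1f} TB")
print(f"Half tables: CPU {cpu_half:.0f}%, disk {disk_half:.2f} TB")
```

With every table protected, the update load doubles from 20% to 40% and total utilization reaches 100%, which is the "excess capacity consumed" case in the text. Protecting only part of the data lowers both figures proportionally under this simplified model.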

Deploying Fallback – A Business Decision

We've discussed how fallback, a proprietary Teradata Database feature, improves the availability of database services. So, how does one determine whether this feature should be enabled on their ADW? We can answer this question by analyzing the business – not the technology. The proper business decision is one where the cost of downtime is balanced by the cost of the additional system components necessary to achieve a desired level of availability.

Cost of Downtime

Data used during interactions with customers, suppliers, employees, and business partners are valuable assets. These end users invoke a variety of application services to perform their daily activities, and they expect that these services will always be available. By proxy, they expect the underlying database services will always be available.

When database services become unavailable, various application services become unavailable – which in turn, creates a financial impact to the business. Some of these impacts will be seen immediately – such as lost revenue due to a reduction in sales. In other cases, the impact is more insidious – such as the case when a disgruntled customer takes his business to a competitor.

Ultimately, the firm needs to derive a function that relates downtime to cost. Extremely short periods of data unavailability will have virtually no cost to the firm – they aren't noticed by the stakeholders. At the other extreme, long periods of data unavailability will result in a dramatic cost to the firm.

Service levels for database availability have two primary dimensions. The first dimension is availability at the macro level. For example, there may be a service level requirement regarding the aggregate hours of downtime per year. This is the metric that is commonly used to discuss availability.

The second dimension involves the time to recover from a given failure scenario. The target for this time is referred to as the Recovery Time Objective (RTO). RTO is very important because most critical business processes can tolerate a short outage – typically measured in minutes – but cannot tolerate an outage measured in hours to days. Teradata systems with fallback enabled can meet very short RTOs for single systems because major failure scenarios can be recovered with a system restart, which takes three to ten minutes to complete.
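The two dimensions can be related with a brief Python sketch. The count of four major failures per year and the 15-minute objective are hypothetical values chosen for illustration; only the three-to-ten-minute restart time comes from the text.

```python
HOURS_PER_YEAR = 365 * 24   # 8760, ignoring leap years

def availability(downtime_hours_per_year: float) -> float:
    """Fraction of the year the database service is available."""
    return 1.0 - downtime_hours_per_year / HOURS_PER_YEAR

def meets_rto(recovery_minutes: float, rto_minutes: float) -> bool:
    """True if a single recovery finishes within the objective."""
    return recovery_minutes <= rto_minutes

failures_per_year = 4   # hypothetical number of major failures per year
rto_minutes = 15        # hypothetical recovery time objective

# Each failure is recovered by a restart lasting three to ten minutes.
best_case = availability(failures_per_year * 3 / 60)
worst_case = availability(failures_per_year * 10 / 60)

print(f"Availability between {worst_case:.5%} and {best_case:.5%}")
print("Restart meets RTO:", meets_rto(10, rto_minutes))
print("Two-hour restore meets RTO:", meets_rto(120, rto_minutes))
```

The sketch shows why the two metrics are tracked separately: a handful of short restarts barely moves the annual availability figure, while a single multi-hour restore can miss the recovery objective even if yearly availability still looks acceptable.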

Cost of High Availability

To sustain a given workload, a fallback enabled system will require more components than a system without fallback. The system will require additional compute nodes and disk subsystems.

We have shown how to estimate the additional hardware resources required to deploy fallback on a system, based on current capacity metrics. In other words, we've shown how to estimate the extra cost to deploy fallback.

Having calculated the cost of downtime and the cost to deliver a highly available database solution, the organization can make a business decision regarding fallback – or any other alternative for high availability. The costs for deployment should balance the cost of downtime.

A key factor when evaluating the cost of downtime is the expected time to recover from a single failure scenario. In business-critical environments, we find that the cost grows exponentially for every minute that the system is unavailable. Rapid recovery from failure is therefore a valuable attribute in a database system.

Disaster Recovery

Fallback should not be considered a disaster protection solution. Disasters brought about by tornadoes, hurricanes, floods, fires, or malicious individuals, have the potential to destroy an entire data center. If that occurs, both the primary and fallback copy of data will most likely be destroyed.

Fallback is not guaranteed to be 100% effective even without a disaster. For example, it does not have a 100% separate path for the data. Most of the database code above the level of the file system is common, as are all of the components outside the database, such as the network and the application. Fallback does not protect from failure of these components of the total system. However, fallback can significantly reduce the probability of data loss.

If the enterprise places a high enough value on the analytic services of the ADW, then a disaster recovery plan that employs dual systems should be considered. The enterprise may choose to deploy a second ADW in a geographically distinct data center. Another alternative is to secure the services of a data recovery center that can be used in the event of a disaster at the primary data center. Again, the RTO metric needs to be analyzed and weighed against the time required to reestablish ADW services.

Conclusion

The ADW is an operational system used daily during interactions with customers, suppliers, employees, and business partners. It provides business-critical analytic services for tactical and strategic decision making. 

To optimize your return on investment, you should consider protection strategies that ensure the availability of database services in the face of severe component failures.

Teradata has raised the bar for data availability in the data warehousing sector. Teradata Database offers a wide range of availability features, including fallback. Fallback provides the highest data availability and the shortest recovery times possible for a single system, and it improves the perceived availability of the database by reducing overall downtime.

IT managers and system designers need to assess the impact to the enterprise when data suddenly becomes unavailable. The business impact due to system downtime needs to be balanced with the cost to deliver database services that meet a given service level objective.

When homeowners buy a house, they usually purchase insurance against disasters, such as floods, fires, and earthquakes. The odds of these extreme events occurring are low, but the costs involved in restoring one's home are extremely high should the event occur. Most people cannot tolerate the thought of losing everything should such a disaster occur – they're willing to pay insurance premiums to mitigate risk. As an IT manager, you need to make an honest evaluation regarding how much risk you want to take with your active data warehouse and how much you are willing to pay to protect it from major failures – both likely and unlikely.