A practical approach to availability management.
by Mary Pat Simmons and Mark Hancock
Downtime is a fundamental metric for measuring productivity in a data
warehouse, but this number does little to help you understand the basis of a
system's availability. Focusing too much on the end-of-month number can
perpetuate a bias toward a reactive view of availability. Root-cause analysis
is important for preventing specific past problems from recurring, but it
doesn't prevent new issues from causing future downtime.
Potentially more dangerous is the false sense of security encouraged by
historically high availability. Even perfect availability in the past provides
no assurance that you are prepared to handle the risks that may lie just ahead
or to keep pace with the changing needs of your system and users.
So how can you shift your perspective to a progressive view of providing for
availability needs on a continual basis? The answer is availability
management—a proactive approach to availability that applies risk management
concepts to minimize the chance of downtime and prolonged outages. Teradata
recommends four steps for successful availability management.
#1: Understand the risks
Effective availability management begins with understanding the nature of risk.
"There are a variety of occurrences that negatively impact the site, system or
data, which can reduce the availability experienced by end users. We refer to
these as risk events," explains Kevin Lewis, director of Teradata Customer
Services Offer Management.
The features of effective availability management:
- Improves system productivity and quality of support
- Encourages partnering to meet strategic and tactical availability needs
- Recognizes all sources and impacts of availability risk
- Applies a simple, holistic approach to risk mitigation
- Facilitates communication between operations and management
- Includes benchmarking using an objective, best-practice assessment
- Establishes a clear improvement roadmap to meet evolving needs
The more vulnerable a system is to risk events, the greater the potential for
extended outages or reduced availability and, consequently, lost business
productivity.
Data warehousing risk events can range from the barely detectable to the
inconvenient to the catastrophic. Risk events can be sorted into three familiar
categories of downtime based on their type of impact:
- Planned downtime is a scheduled system outage, usually during low-usage or non-critical periods (e.g., upgrades/updates, planned maintenance, testing).
- Unplanned downtime is an unanticipated loss of system, data or application access (e.g., utility outages, human error, planned downtime overruns).
- Degraded downtime is "low quality" availability in which the system is available, but performance is slow and inefficient (e.g., poor workload management, capacity exhaustion).
Although unplanned downtime is usually the most painful, companies have a
growing need to reduce degraded and planned downtime as well. Given the variety
of risk causes and impacts, follow the next step to reduce your system's
vulnerability to risk events.
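To make the three categories concrete, here is a minimal Python sketch that tallies impact hours by downtime type and derives an availability percentage for a reporting period. The `RiskEvent` record, the function names and the sample events are all hypothetical illustrations, not Teradata tooling or real incident data. It also assumes that only planned and unplanned hours count as hard outages, with degraded hours reported separately.

```python
from dataclasses import dataclass

# The three downtime categories described in the article.
CATEGORIES = ("planned", "unplanned", "degraded")


@dataclass
class RiskEvent:
    """Hypothetical record of one risk event (not a Teradata data structure)."""
    category: str     # one of CATEGORIES
    hours: float      # duration of the impact, in hours
    description: str


def downtime_by_category(events):
    """Sum the impacted hours for each downtime category."""
    totals = {category: 0.0 for category in CATEGORIES}
    for event in events:
        if event.category not in totals:
            raise ValueError(f"unknown category: {event.category}")
        totals[event.category] += event.hours
    return totals


def availability_pct(period_hours, outage_hours):
    """Percentage of the period during which the system was accessible."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    return 100.0 * (period_hours - outage_hours) / period_hours


# Made-up events for a 30-day month (720 hours), for illustration only.
events = [
    RiskEvent("planned", 4.0, "software upgrade"),
    RiskEvent("unplanned", 1.5, "utility outage"),
    RiskEvent("degraded", 6.0, "capacity exhaustion"),
]

totals = downtime_by_category(events)
# Assumption: degraded hours are tracked but not counted as a hard outage.
hard_outage_hours = totals["planned"] + totals["unplanned"]
print(totals)
print(round(availability_pct(720, hard_outage_hours), 2))
```

Keeping the per-category totals next to the single percentage is the point of the sketch: the end-of-month number alone hides how much of the impact was planned, unplanned or degraded.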
#2: Assess and strategize
Although risk events affecting the Teradata system are often beyond your
control, applying a sound availability management framework mitigates
their impact. To meet strategic and tactical availability objectives, Teradata
advocates a holistic system of seven attributes to address all areas that
affect system availability. These availability management attributes are the
tangible real-world IT assets, tools, people and processes that can be
budgeted, assigned, administered and supervised to support system availability.
They are:
- Environment. The equipment layout and physical conditions within the data center that houses the infrastructure, including temperature, airflow, power quality and data center cleanliness.
- Infrastructure. The IT assets, the network architecture and configuration connecting them, and their compatibility with one another. These assets include the production system; dual systems; backup, archive and restore (BAR) hardware and software; test and development systems; and disaster recovery systems.
- Technology. The design of each system, including hardware and software versions, enabled utilities and tools, and remote connectivity.
- Support level. Maintenance coverage hours, response times, proactive processes, support tools employed and the accompanying availability reports.
- Operations. Operational procedures and support personnel used in the daily administration of the system and database.
- Data protection. Processes and product features that minimize or eliminate data loss, corruption and theft; this includes system security, fallback, hot standby nodes, hot standby disks and large cliques.
- Recoverability. Strategies and processes to regularly back up and archive data and to restore data and functionality in case of data loss or disaster.
As evident in this list of attributes, supporting availability goes beyond
maintenance service level agreements and downtime reporting. These attributes
incorporate multiple technologies, service providers, support functions and
management areas. This span necessitates an active partnership between Teradata
and the customer to ensure all areas are adequately addressed. In addition to
being comprehensive, these attributes provide the benefit of a common language
for communicating, identifying and addressing availability management needs.
[Figure: Answer the sample best-practice questions for each attribute. A "no" response to any yes/no question represents an
availability management gap. Other questions will help you assess your system's overall availability management.]
Dan Odette, Teradata Availability Center of Expertise leader, explains:
"Discussing these attributes with customers makes it easier for them to
understand their system availability gaps and plan an improvement roadmap. This
approach helps customers who are unfamiliar with the technical details of the
Teradata system or IT support best practices such as the Information Technology
Infrastructure Library [ITIL]."
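The rule that a "no" response to a yes/no best-practice question represents an availability management gap can be sketched in a few lines of Python. The seven attribute names come from the article. The `find_gaps` function and the sample questions are invented placeholders for illustration and are not Teradata's actual assessment questions.

```python
# The seven availability management attributes named in the article.
ATTRIBUTES = (
    "Environment",
    "Infrastructure",
    "Technology",
    "Support level",
    "Operations",
    "Data protection",
    "Recoverability",
)


def find_gaps(answers):
    """Return, per attribute, the questions that were answered "no".

    `answers` maps an attribute name to a dict of {question: bool},
    where True means "yes" and False means "no".
    """
    gaps = {}
    for attribute in ATTRIBUTES:
        missed = [
            question
            for question, is_yes in answers.get(attribute, {}).items()
            if not is_yes
        ]
        if missed:
            gaps[attribute] = missed
    return gaps


# Placeholder questions and answers, invented for this sketch.
sample_answers = {
    "Environment": {
        "Is data center temperature monitored?": True,
        "Is airflow around the cabinets unobstructed?": False,
    },
    "Recoverability": {
        "Are backups taken on a regular schedule?": True,
        "Are restores tested periodically?": False,
    },
    "Operations": {
        "Are operational procedures documented?": True,
    },
}

for attribute, questions in find_gaps(sample_answers).items():
    print(attribute, "->", questions)
```

Grouping the gaps by attribute mirrors the common language the attributes provide: each shortfall lands in an area that can be budgeted, assigned and supervised.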
#3: Weigh the odds
To reduce the risk of downtime and/or prolonged outages, your availability
management capabilities must be sufficient to meet your usage needs. (See
figure 1, left.)
According to Chris Bowman, Teradata Technical Solutions architect, "Teradata
encourages customers to obtain a more holistic view of their system
availability and take appropriate action based on benchmarking across all of
the attributes." In order to help customers accomplish this, Teradata offers an
Availability Assessment service. "We apply Teradata technological and ITIL
service management best practices to examine the people, processes, tools and
architectural solutions across the seven attributes to identify system
availability risks," Bowman says.
A comprehensive availability management assessment should consist of three
phases (see figure 2, below):
Collect. Data is collected across all attributes, including
environmental measurements, current hardware/software configurations, historic
incident data and best-practice conformity by all personnel that support and
administer the Teradata system. This includes customer management and staff,
Teradata support services, and possibly other external service providers. Much
of this data can be collected remotely by Teradata, though an assigned liaison
within the customer organization is requested to facilitate access to the
system and coordinate any personnel interviews.
Analyze. Data is consolidated and analyzed by an availability management
expert who has a strong understanding of the technical details within each
attribute and their collective impact on availability. During this stage, the
goal is to uncover gaps that may not be apparent because of a lack of
best-practice knowledge or organizational "silos." Silos are characterized by a
lack of cross-functional coordination due to separate decision-making
hierarchies or competing organizational objectives.
Recommend. The key deliverable of an assessment is a clear list of practical
recommendations for availability management improvements. To have the maximum
positive impact, recommendations must include:
- An unbiased, expert perspective of the customer's specific availability management situation
- Mitigation suggestions to prevent the recurrence of historical outages
- Quantified benchmarking across all attributes to pinpoint the areas of greatest vulnerability to risk events
- Corrective actions provided for every best-practice shortfall
- Operations-level improvement actions with technical details to facilitate tactical implementation
- Management-level guidance in the form of a less technical, executive scorecard to facilitate decision making and budget prioritization
[Figure: Teradata collects data across all attributes and analyzes the current effectiveness of your availability management.
The result is quantified benchmarking and actionable recommendations.]
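One simple way to picture quantified benchmarking across attributes is to score each attribute by the share of best practices it meets and then rank the attributes from weakest to strongest. The article does not describe how Teradata computes its benchmarks, so the scoring rule, the `benchmark` function and the sample responses below are assumptions made purely for illustration.

```python
def benchmark(responses_by_attribute):
    """Rank attributes from most to least vulnerable.

    `responses_by_attribute` maps an attribute name to a list of booleans,
    one per best-practice check (True = practice is met).
    Assumed scoring rule: score = fraction of checks met (0.0 to 1.0).
    Attributes with no responses are skipped rather than scored.
    """
    scores = {}
    for attribute, responses in responses_by_attribute.items():
        if not responses:
            continue
        scores[attribute] = sum(responses) / len(responses)
    # Lowest score first: these are the areas of greatest vulnerability.
    return sorted(scores.items(), key=lambda item: item[1])


# Invented responses for three of the seven attributes.
sample_responses = {
    "Environment": [True, True, False, True],
    "Recoverability": [False, False, True],
    "Operations": [True, True],
}

for attribute, score in benchmark(sample_responses):
    print(f"{attribute}: {score:.0%} of best practices met")
```

A ranked list like this is the kind of less technical summary an executive scorecard could draw on, while the underlying per-check responses feed the operations-level actions.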
#4: Plan the next move
The recommendations from the assessment provide the basis for an availability management improvement roadmap.
"Cross-functional participation by both operations and management levels is crucial for maximizing the knowledge transfer of the assessment
findings and ensuring follow-through," Odette says.
Typically, not all of the recommendations can be implemented at once because of resource and budget constraints, so it's common to take a
phased approach. Priorities are based on the assessment benchmarks, the customer's business objectives, the planned evolution for use of the
Teradata system and cost-to-benefit considerations.
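As a rough illustration of a phased approach, the Python sketch below orders recommendations by benefit-to-cost ratio and packs them into phases under a fixed budget per phase. The `Recommendation` record, the `plan_phases` function, the budget and all cost and benefit figures are hypothetical. A real roadmap would also weigh the assessment benchmarks, business objectives and the planned evolution of the system, which this greedy sketch ignores.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    """Hypothetical assessment recommendation with estimated cost and benefit."""
    name: str
    cost: float      # estimated cost, in arbitrary budget units
    benefit: float   # estimated risk reduction, in arbitrary units


def plan_phases(recommendations, budget_per_phase):
    """Greedily assign recommendations to phases within a per-phase budget."""

    def priority(rec):
        # Cost-free improvements always come first.
        return float("inf") if rec.cost == 0 else rec.benefit / rec.cost

    remaining = sorted(recommendations, key=priority, reverse=True)
    phases = []
    while remaining:
        budget = budget_per_phase
        phase, deferred = [], []
        for rec in remaining:
            if rec.cost <= budget:
                phase.append(rec.name)
                budget -= rec.cost
            else:
                deferred.append(rec)
        if not phase:
            # Nothing fits: at least one item costs more than a whole phase.
            raise ValueError("a recommendation exceeds the per-phase budget")
        phases.append(phase)
        remaining = deferred
    return phases


# Invented figures; the recommendation names echo examples from the article.
recommendations = [
    Recommendation("Adjust equipment layout", cost=0, benefit=3),
    Recommendation("Use built-in tools fully", cost=0, benefit=4),
    Recommendation("Disaster recovery capability", cost=80, benefit=9),
    Recommendation("Dual active systems", cost=100, benefit=10),
]

for number, phase in enumerate(plan_phases(recommendations, 100), start=1):
    print(f"Phase {number}: {phase}")
```

The greedy ordering is a deliberate simplification: it is easy to explain to both operations and management, at the price of not guaranteeing the best possible packing of items into phases.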
Many improvements can be effectively cost-free to implement but still have a big impact. For example, adjusting equipment layout can improve
airflow, which in turn can reduce heat-related equipment failures. Or, having the system/database administrators leverage the full capabilities
of tools already built into Teradata can prevent or reduce outages. Lewis adds, "More significant improvements such as a disaster recovery
capability or dual active systems may require greater investment and effort, but incremental steps can be planned and enacted over time to
ensure availability management keeps pace with the customer's evolving needs."
An effective availability management strategy requires a partnership between you, as the customer, and Teradata. Together, we can apply a
comprehensive framework of best practices to proactively manage risk and provide for your ongoing availability needs.
Mary Pat Simmons has worked for NCR and Teradata for the last 20 years in a variety of sales and service marketing positions.
Mark Hancock is the offer manager for Teradata's Availability Management Services.
Teradata Magazine-June 2007