Availability best practices enhance system uptime and minimize risks.
by Yuri Pinzon and Mary Pat Simmons
Heat in a data center rises to a critical temperature. The systems stop working, and important information
is not delivered when and where it is needed.
A controller goes bad, and spare parts have to be ordered. While
waiting for the parts, the system support team begins blindly troubleshooting, possibly creating a new set
of problems. The outage lasts five days, and the disruption escalates.
The data warehouse environment is not immune to unforeseen risks and catastrophe. Whether planned or
unplanned, downtime can impede 24x7 accessibility to an organization's decision-support system and prevent
the right information from getting to decision makers at the right time.
Rather than put the organization's decision-making ability at risk, availability best practices should be
implemented. Starting with the development and use of robust software and proceeding with operator
certification training and well-documented procedures, availability best practices cover the gamut of a data
processing operation. Using proven techniques can mitigate risks from an outage by eliminating or reducing
the length of interruption.
The key to establishing and maintaining a high-availability system is to examine an organization's
infrastructure and effectively manage tangible attributes such as people, tools, processes and IT assets so
they can be supervised, selected, administered and budgeted. Adopting availability strategies and
implementing proper tools and features can enable organizations to minimize repair time.
Crisis support team
Each person who interacts with the system—operations, support staff, third-party vendors and users—has a
role and task that work in conjunction to help ensure system availability.
First and foremost, a crisis coordinator should be appointed by management. As the one in charge during a
crisis, the coordinator will orchestrate tasks to ensure each team member focuses as intently on data
integrity and system availability while the system is down as they do when it is operational. In addition,
the coordinator should be aware of any service level agreements (SLAs), including those documented for each
person, role or task.
7 attributes of effective availability management
1. Environment. The physical conditions surrounding the IT assets
2. Infrastructure. Which IT assets are deployed and how they work together
3. Technology. The feature/functionality of each IT asset
4. Support level. Maintenance services to keep all IT assets running
5. Operations. Administrative services to manage daily operations
6. Data protection. Prevention of data loss, corruption and intrusion
7. Recoverability. Ability to recover data and user access after an outage
When an outage occurs, any delays in contacting decision makers can prolong the outage, so contact
information for critical staff must be readily available to operations and the crisis coordinator. It's then
the role of operations and application support team members to get the system operational again.
A variety of tools, such as Teradata Manager and Teradata Viewpoint, are available for system and application
monitoring to help maintain system availability and ensure data accessibility. Teradata Manager serves
as a comprehensive point of control for the Teradata Database and allows an administrator to easily identify
performance inconsistencies that require attention. Teradata Viewpoint provides the administrator and
business user simple portal-based access to status information on their servers and queries. In a typical,
healthy environment, these tools work quietly and unobtrusively—the keys are the actions taken when an alert
is escalated, an application misbehaves or a system resource runs low.
Operational staff should also have access to a searchable database that maintains historical records of prior
failures with the resolution and, if available, the root cause. If the situation occurred previously in a
test or development environment, the reason it occurred in a production environment should be investigated to
lessen the chance that it will happen again.
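A searchable failure history of this kind can be sketched in a few lines. The example below is a minimal illustration, assuming a local SQLite file stands in for whatever store the operations team actually uses. The table name, column names and function names are hypothetical and are not part of any Teradata tool.

```python
import sqlite3


def open_history(path=":memory:"):
    """Open (or create) the incident history store. The schema is illustrative."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS incidents (
               id INTEGER PRIMARY KEY,
               occurred_on TEXT NOT NULL,   -- ISO date, e.g. 'YYYY-MM-DD'
               environment TEXT NOT NULL,   -- 'production', 'test' or 'development'
               symptom TEXT NOT NULL,
               resolution TEXT NOT NULL,
               root_cause TEXT              -- optional: may not be known
           )"""
    )
    return conn


def record_incident(conn, occurred_on, environment, symptom, resolution, root_cause=None):
    """Store one failure together with its resolution and, if known, its root cause."""
    conn.execute(
        "INSERT INTO incidents (occurred_on, environment, symptom, resolution, root_cause) "
        "VALUES (?, ?, ?, ?, ?)",
        (occurred_on, environment, symptom, resolution, root_cause),
    )
    conn.commit()


def search_incidents(conn, keyword):
    """Return prior incidents whose symptom, resolution or root cause mentions the keyword."""
    pattern = f"%{keyword}%"
    rows = conn.execute(
        "SELECT occurred_on, environment, symptom, resolution, root_cause "
        "FROM incidents "
        "WHERE symptom LIKE ? OR resolution LIKE ? OR root_cause LIKE ? "
        "ORDER BY occurred_on",
        (pattern, pattern, pattern),
    )
    return rows.fetchall()


def seen_outside_production(conn, keyword):
    """Return matching incidents that were first observed in test or development."""
    return [row for row in search_incidents(conn, keyword) if row[1] != "production"]
```

During an outage, operations staff would call search_incidents with a keyword from the alert. A non-empty result from seen_outside_production is the cue to investigate why a problem already seen in test or development reached production.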
Continuous availability processes
The organization's management and IT staff should combine appropriate technology with operational
processes and strategies to ensure ongoing system operation and availability. The following
recommendations help keep the system functioning at all times, even when it encounters unexpected
issues or when system upgrades are introduced:
Manage change. Documenting, reviewing, testing and monitoring any adjustments to the
environment that may affect system users are precautionary tasks that save time and
avoid aggravation. Before any modification to the system is implemented, the proposed change
should be reviewed or tested against established success criteria. Any change to the hardware,
software, environment, network or facilities that does not meet the criteria should not move
forward, and every change that does proceed should have a documented back-out plan.
Decrease software upgrade downtime. Once a change is tested and approved, expect some associated
downtime for its implementation. Tools and strategies offered in the Teradata environment help
upgrades run as quickly and smoothly as possible. The Parallel Upgrade Tool spools the necessary
packages to the node before the actual change window.
For packages that require a reboot or a kernel rebuild, version migration and fallback enable
upgrades to the alternate boot environment while the database is online. Once the maintenance
window is entered, the change is a reboot away from the new environment. After the upgrade is
completed, verification scripts are run to determine whether the change was successful.
Unsuccessful changes can be resolved during the maintenance window or the system can be switched
back to the original environment if fallback is enabled.
Multiple-system architectures can be used to altogether avoid planned outages during system
upgrades. When multiple systems are synchronized, business-critical applications and users can
be directed to an alternate system so the work is uninterrupted.
Minimize planned downtime. In a typical environment, multiple restarts may be required to
change a memory module or adapter. Retaining a supply of stock parts on-site or having them
readily available off-site is advisable so all work can be completed on a single trip. Node
issues can be resolved with little impact on system performance or availability when hot
standby nodes (HSNs) are used. Also, with remote virtual private network (VPN) connectivity in
place, the Teradata Support Center can respond expeditiously.
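The change-management and verify-then-fall-back recommendations above can be expressed as a simple gate. The sketch below is only an illustration of the process, assuming each change carries its test results and a written back-out plan. The class, function names and return values are hypothetical and do not represent the Parallel Upgrade Tool or any other Teradata utility.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A proposed change plus the evidence needed to approve it (illustrative)."""
    description: str
    criteria_results: dict = field(default_factory=dict)  # criterion name -> passed?
    back_out_plan: str = ""


def may_proceed(change):
    """Approve only changes that were tested, met every criterion and can be backed out."""
    tested = bool(change.criteria_results)
    all_passed = all(change.criteria_results.values())
    has_back_out = bool(change.back_out_plan.strip())
    return tested and all_passed and has_back_out


def apply_change(change, apply, verify, back_out):
    """Apply an approved change, verify it, and revert if verification fails.

    apply, verify and back_out are caller-supplied callables; verify returns
    True when the post-change checks succeed.
    Returns 'rejected', 'applied' or 'backed_out'.
    """
    if not may_proceed(change):
        return "rejected"
    apply()
    if verify():
        return "applied"
    back_out()
    return "backed_out"
```

The ordering is the point of the design: the gate runs before the maintenance window opens, so the only decisions left inside the window are whether verification passed and, if not, whether to fix the problem or switch back.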
High-availability technology
Technology and system architecture are by far the most critical components of availability best
practices. The goal, of course, is to have a system that is available to users without interruption.
Since some downtime is unavoidable, the recovery focus is to isolate and then eliminate any single point of
failure. For example, mirroring a disk keeps data accessible if that disk fails. Replicating an access module
processor (AMP) via fallback or duplication on another system helps expedite recovery from problems
encountered by a troublesome AMP. And using multiple power sources guards against a power failure.
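The benefit of removing a single point of failure can be estimated with standard availability arithmetic. The sketch below is a rough illustration, not a Teradata sizing method. It assumes component failures are independent, which real systems only approximate, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def parallel_availability(component_availability, copies):
    """Availability of redundant copies where any one surviving copy is enough.

    Assumes the copies fail independently of one another.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one copy is required")
    return 1.0 - (1.0 - component_availability) ** copies


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def expected_downtime_hours(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR
```

Under these assumptions, a component that is 99 percent available on its own is down about 87.6 hours a year, while a mirrored pair of the same component is down less than one hour. Chaining components without redundancy has the opposite effect, which is why each single point of failure matters.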
The following are some innovative tools and recommendations that will help minimize the effects of a system
failure:
Fault-tolerant hardware. Teradata system architecture is built with high fault tolerance
and availability standards. Mirrored internal disks, replicated AMPs, even replication on
alternate systems are precautionary elements of the architecture. As mentioned earlier, HSNs are
the most versatile hardware component available to minimize system downtime. Hot-swappable disks,
fans and controllers are also available. When hardware faults occur, it is critical that
hot-swappable items are identified so they can quickly be replaced. If there is any doubt, your
Teradata Customer Services support person can consult with the Teradata Support Center to help
quickly make this determination.
Environment. The system is set up to continuously monitor the temperature and humidity on
all nodes. If a cabinet reaches hot status, the system is designed to shut down in an orderly
fashion to protect the data. Redundant power and cooling sources should come from different
conduits in case of an outage, and backup power in the form of an uninterruptible power supply
should be available and up to specification at all times.
Disaster recovery site. Systems can become unavailable because of a hardware failure,
flooding, fire, theft, an extended power outage or countless other unexpected events. When
continuous availability is an SLA requirement, an off-site backup location, separate from the
primary site, is necessary to ensure the organization's data is protected and available.
In addition, a disaster recovery plan must be in place, updated annually and tested for the
system's recoverability. This is extremely critical and can be done either through resources
within the organization or contracted out to disaster recovery experts.
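The environmental monitoring described above amounts to classifying each reading and choosing an action. The sketch below illustrates that logic only. The threshold values, the intermediate warning level and all names are assumptions made for the example, not vendor specifications or actual system behavior.

```python
from enum import Enum


class Status(Enum):
    NORMAL = "normal"
    WARNING = "warning"
    HOT = "hot"


# Hypothetical thresholds in degrees Celsius; real limits come from the hardware vendor.
WARNING_C = 30.0
HOT_C = 40.0


def classify(temperature_c):
    """Map a cabinet temperature reading to a status level."""
    if temperature_c >= HOT_C:
        return Status.HOT
    if temperature_c >= WARNING_C:
        return Status.WARNING
    return Status.NORMAL


def plan_actions(readings):
    """Decide what to do for each cabinet.

    readings maps a cabinet name to its temperature in degrees Celsius.
    Cabinets in normal status need no action and are omitted from the result.
    """
    actions = {}
    for cabinet, temperature_c in readings.items():
        status = classify(temperature_c)
        if status is Status.HOT:
            actions[cabinet] = "orderly_shutdown"   # protect the data first
        elif status is Status.WARNING:
            actions[cabinet] = "alert_operations"   # give staff time to react
    return actions
```

The warning level is included because an alert raised before a cabinet reaches hot status gives operations a chance to restore cooling and avoid the shutdown altogether.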
Greater availability
Understanding and implementing availability best practices in the decision-support environment is crucial to
mitigating risk. Establishing a benchmark and then targeting consistent, best-in-class processes, proper
personnel alignment and innovative technology will contribute to greater availability and increased
efficiency of the Teradata system.
Methodology for mitigating availability risk
Teradata has developed a proven methodology for understanding and mitigating availability risk, based on the IT
Infrastructure Library (ITIL) framework. The methodology includes tools for identifying specific availability
management gaps and a portfolio of products and services to match availability needs. For example, Teradata's
Parallel Upgrade Tool can spool packages so a software upgrade can be reduced to a five-minute restart. If a
reboot is necessary, version migration and fallback can prepare the alternate boot environment.
—M.P.S.
Yuri Pinzon, a solutions architect, joined Teradata eight years ago and has been in the IT field for more
than 15 years.
Mary Pat Simmons, a customer service marketing manager, has worked for Teradata over the last 20 years.
Photography by Getty Images
Teradata Magazine-December 2008