Register | Log in


Subscribe Now>>
Home Tech2Tech Features Viewpoints Facts & Fun Teradata.com
Insider's Warehouse
Download PDF|Send to Colleague

Make your data shine

Protect your company's image by maintaining good quality data.

by Joyce Norris-Montanari

High-quality data drives good business decisions. With data that is integrated, consistent, accurate and timely, personnel can be confident in operational and strategic decision making. Analysis is improved and compliance reporting is made easier, which lowers the organization's risk and exposure. What's more, with less cost associated with data reconciliation and correction, the company's bottom line can grow.

Make your data shine

Most organizations understand these benefits, but in an effort to quickly integrate their data, some companies choose not to address data quality up front. Such inaction can create problems that are not easily, quickly or economically corrected. It is far more productive to develop a well-thought-out program that can be implemented from the start to ensure data quality for the long term.

A major step toward creating the best program for your organization is to clearly define your company's objectives. Small, phased-in implementations driven by an enterprise vision will help establish standard practices that then become part of an ongoing routine.

Your organization may have data quality policies already incorporated in company-wide processes that address master data management as well as customer and corporate data integration. To ensure the success of these processes, certain business rules must be adhered to, such as establishing when to cleanse and profile the data, determining how and when the data is integrated, and identifying where the data comes from and who is allowed to use it. (See "Data trail" figure.)

Furthermore, a good data quality program will ensure that the organization conforms to its business requirements of accurate, timely and certified data. The specifications of when and how the data is delivered to the users must be clearly understood and met, and the costs associated with collecting and maintaining integrated data must be agreed upon and not exceeded.

Data profiling
To fix quality issues, the data must first be understood, then examined for any problems. If problems exist, their origin must be pinpointed and properly corrected. Data profiling can help meet these challenges by identifying data in multiple data stores prior to its movement, conversion and/or integration. From this information, the relationships between data that comes from different sources can be recognized.

Data profiling allows you to:
Know data content. Gaining in-depth insight into the source systems and how they relate to the organization's business requirements will help speed up programming development.
Validate business rules and processes. Checking data consistency on an ongoing basis will maintain data quality throughout the organization, based on the regulations set by the data governance team.
Deliver projects faster. Identifying and correcting data quality issues at the source early in the design and development stage can accelerate project completion.
Ensure high-quality data. Improved data that benefits an individual project will move the entire organization toward better decision making.
Decrease project failure. Determining mitigation plans for data issues early in the design phase rather than in the development stage will help ensure a successful project.

Data quality is set at the project level where processes will develop over time and each project will have a holistic effect on the organization. The ongoing result is better data quality across the entire enterprise.

Tools of the trade
Most companies use a data quality tool or extract, transform and load (ETL) tool in a batch mode, meaning the jobs are run usually at night with sets of data, rather than one record at a time. Handling data quality issues in real time is possible with today's technologies; however, organizations must be realistic about the quantity of data that can be processed in such a tight schedule. No system can integrate 64 sources of data in real time; therefore, priorities must be established.

The sources that are easy to access, make good business sense and are based on pre-determined business requirements should be chosen first for data profiling. But keep in mind that since these elements are always changing, the process must allow for flexibility and revision. For that matter, data issues should be fixed at the source as suggested earlier, so data with lower-end issues can be escalated and rectified as the opportunity exists.

To assist in your data profiling endeavors, tools from DataFlux (a SAS company), Harte-Hanks' Trillium, QualityStage from IBM, Informatica and Teradata Profiler are available. Or you can create a set of functions that are specific to your enterprise. Some functions found in most data profiling tools are:
Structural analysis should be run on each field of the sources being analyzed. It checks the data's format for NULLs, spaces, carriage returns, primary and alternate key structure uniqueness, as well as the frequency in which numbers of distinct values are used. For instance, the structural analysis will identify patterns of numbers that are used as integers and defined by the database management system (DBMS). The analysis also verifies that dates are a date type, numbers are numeric, and alphanumeric fields are posted correctly.
Pattern and frequency analysis identifies defined formats and the number of times each pattern occurs. Examples include telephone numbers, city names, state or country codes and date fields. (Unique identifiers that are always integers will not need pattern analysis since they will be identified through the structural analysis defined earlier.) Table 1 shows some frequency analysis results for a department store's catalog categories. As this example shows, misspelled words will skew report results.
Range or threshold analysis verifies specific fields to ensure that they are within a certain range or threshold.
Analysis of counts validates the number and percentages of records with certain characteristics, including total number of records, unique count by value, NULL and NOT NULL counts, blank and frequency counts, and different data patterns.
Cross-table relationship identification compares columns across multiple tables. An example would be customer name, customer number, customer city, customer state, etc. This function also identifies relationships across tables or—in some cases—systems, and it helps identify primary and foreign key relationships.
Outlier or dangling relationships identifies child records that have no parent. An example would be a student's test record (child) that does not have a corresponding student record (parent). Table 2 shows the number of Web customers who have purchased products but who are not in the customer master.

Keep your data polished
High-quality data is the foundation of a successful business. With clean, consistent and accurate data, the opportunities to outperform the competition are much greater, and your organization won't struggle with sub-par customer service, over- or under-stocked shelves, or a disappointing bottom line. Why risk losing your edge? T

Where to find help

You don't need to tackle your data quality issues alone. Teradata has numerous products and services available to make these tasks easier. A benefit is that these products do not require an extraction of data into a separate database or file for processing. Rather, all functions can be executed in the Teradata Database.
> Teradata Profiler can perform profiling functions as well as sophisticated data mining algorithms (histograms, correlation matrix, scatter plots and adaptive histograms) with drill-down capabilities.
> Hands-on assistance is available through Teradata Professional Services, with offerings like Data Quality Profile Analysis and Data Quality Maturity Assessment and Workshop. Through these services, Teradata can help your organization assess its capabilities and processes that are considered critical to the goal of achieving and maintaining data quality.

—J.N.M.

7 steps to start your data quality initiative

Are you confused with where to start in your data quality initiative? For certain, the number of data sources and tables and the volume of data for a single record can be overwhelming. To make it easier, create a checklist that will guide you through the data analysis steps required for a complete implementation:
1. Determine the subject area to work on first. Usually, the priority subject areas are customer, product or financials.
2. Identify source systems or data stores that require data cleansing.
3. Gather table and file layouts for both the source and target databases. Include data volume and update frequency for each data store. Also, include feeds from each source to other systems. This is where you'll find the spider web of data feeds. For clarity, draw these interfaces in a diagram.
4. Establish the time range and other information for the data that will require extraction if you are creating a data warehouse.
5. Obtain data samples whenever possible, and pull the data required into a staging area or profiling database for analysis.
6. Complete the data profiling functions for each source, table and element of the data required, based on business requirements.
7. Document, maintain and share profiling information with the enterprise via data stewardship or data governance groups. This is when it becomes clear if the organization has data issues.

—J.N.M.

Joyce Norris-Montanari is the principal architect of CIBER Inc.'s Global Enterprise Integration Practice.

Teradata Magazine-September 2008

Related Links

Reference Library

Get complete access to Teradata articles and white papers specific to your area of interest by selecting a category below. Reference Library
Search our library:


Manthan

Trillium

Protegrity

Teradata.com | About Us | Contact Us | Media Kit | Subscribe | Privacy/Legal | RSS
Copyright © 2008 Teradata Corporation. All rights reserved.