Protect your company's image by maintaining good quality data.
by Joyce Norris-Montanari
High-quality data drives good business decisions. With data that is integrated, consistent, accurate and timely, personnel can be confident in
operational and strategic decision making. Analysis is improved and compliance reporting is made easier, which lowers the organization's risk
and exposure. What's more, with less cost associated with data reconciliation and correction, the company's bottom line can grow.
Most organizations understand these benefits, but in an effort to quickly integrate their data, some companies choose not to address data
quality up front. Such inaction can create problems that are not easily, quickly or economically corrected. It is far more productive to
develop a well-thought-out program that can be implemented from the start to ensure data quality for the long term.
A major step toward creating the best program for your organization is to clearly define your company's objectives. Small, phased-in
implementations driven by an enterprise vision will help establish standard practices that then become part of an ongoing routine.
Your organization may have data quality policies already incorporated in company-wide processes that address master data management as well as
customer and corporate data integration. To ensure the success of these processes, certain business rules must be adhered to, such as
establishing when to cleanse and profile the data, determining how and when the data is integrated, and identifying where the data comes from
and who is allowed to use it. (See "Data trail" figure.)
Furthermore, a good data quality program will ensure that the organization conforms to its business requirements of accurate, timely and
certified data. The specifications of when and how the data is delivered to the users must be clearly understood and met, and the costs
associated with collecting and maintaining integrated data must be agreed upon and not exceeded.
Data profiling
To fix quality issues, the data must first be understood, then examined for any problems. If problems exist, their origin must be pinpointed
and properly corrected. Data profiling can help meet these challenges by identifying data in multiple data stores prior to its movement,
conversion and/or integration. From this information, the relationships between data that comes from different sources can be recognized.
Data profiling allows you to:
|
Know data content. Gaining in-depth insight into the source systems and how they relate to the organization's business
requirements will help speed up programming development.
|
|
Validate business rules and processes. Checking data consistency on an ongoing basis will maintain data quality throughout
the organization, based on the regulations set by the data governance team.
|
|
Deliver projects faster. Identifying and correcting data quality issues at the source early in the design and development
stage can accelerate project completion.
|
|
Ensure high-quality data. Improved data that benefits an individual project will move the entire organization toward better
decision making.
|
|
Decrease project failure. Determining mitigation plans for data issues early in the design phase rather than in the
development stage will help ensure a successful project.
|
Data quality is set at the project level where processes will develop over time and each project will have a holistic effect on the
organization. The ongoing result is better data quality across the entire enterprise.
Tools of the trade
Most companies use a data quality tool or extract, transform and load (ETL) tool in a batch mode, meaning the jobs are run usually at night
with sets of data, rather than one record at a time. Handling data quality issues in real time is possible with today's technologies; however,
organizations must be realistic about the quantity of data that can be processed in such a tight schedule. No system can integrate 64 sources
of data in real time; therefore, priorities must be established.
The sources that are easy to access, make good business sense and are based on pre-determined business requirements should be chosen first
for data profiling. But keep in mind that since these elements are always changing, the process must allow for flexibility and revision. For
that matter, data issues should be fixed at the source as suggested earlier, so data with lower-end issues can be escalated and rectified as
the opportunity exists.
To assist in your data profiling endeavors, tools from DataFlux (a SAS company), Harte-Hanks' Trillium, QualityStage from IBM, Informatica and
Teradata Profiler are available. Or you can create a set of functions that are specific to your enterprise. Some functions found in most data
profiling tools are:
|
Structural analysis should be run on each field of the sources being analyzed. It checks the data's format for NULLs,
spaces, carriage returns, primary and alternate key structure uniqueness, as well as the frequency in which numbers of distinct
values are used. For instance, the structural analysis will identify patterns of numbers that are used as integers and defined by
the database management system (DBMS). The analysis also verifies that dates are a date type, numbers are numeric, and alphanumeric
fields are posted correctly.
|
|
Pattern and frequency analysis identifies defined formats and the number of times each pattern occurs. Examples include telephone
numbers, city names, state or country codes and date fields. (Unique identifiers that are always integers will not need pattern
analysis since they will be identified through the structural analysis defined earlier.) Table 1 shows some frequency analysis
results for a department store's catalog categories. As this example shows, misspelled words will skew report results.
|
|
Range or threshold analysis verifies specific fields to ensure that they are within a certain range or threshold.
|
|
Analysis of counts validates the number and percentages of records with certain characteristics, including total number of
records, unique count by value, NULL and NOT NULL counts, blank and frequency counts, and different data patterns.
|
|
Cross-table relationship identification compares columns across multiple tables. An example would be customer name,
customer number, customer city, customer state, etc. This function also identifies relationships across tables or—in some
cases—systems, and it helps identify primary and foreign key relationships.
|
|
Outlier or dangling relationships identifies child records that have no parent. An example would be a student's test record
(child) that does not have a corresponding student record (parent). Table 2 shows the number of Web customers who have purchased
products but who are not in the customer master.
|
Keep your data polished
High-quality data is the foundation of a successful business. With clean, consistent and accurate data, the opportunities to outperform the
competition are much greater, and your organization won't struggle with sub-par customer service, over- or under-stocked shelves, or a
disappointing bottom line. Why risk losing your edge? T
| Where to find help |
|
You don't need to tackle your data quality issues alone. Teradata has numerous products and services available to
make these tasks easier. A benefit is that these products do not require an extraction of data into a separate
database or file for processing. Rather, all functions can be executed in the Teradata Database.
| > |
Teradata Profiler can perform profiling functions as well as sophisticated data mining algorithms
(histograms, correlation matrix, scatter plots and adaptive histograms) with drill-down capabilities.
|
| > |
Hands-on assistance is available through Teradata Professional Services, with offerings like Data
Quality Profile Analysis and Data Quality Maturity Assessment and Workshop. Through these services, Teradata
can help your organization assess its capabilities and processes that are considered critical to the
goal of achieving and maintaining data quality.
|
—J.N.M.
|
|
| 7 steps to start your data quality initiative |
|
Are you confused with where to start in your data quality initiative? For certain, the number of data sources and
tables and the volume of data for a single record can be overwhelming. To make it easier, create a checklist that
will guide you through the data analysis steps required for a complete implementation:
| 1. |
Determine the subject area to work on first. Usually, the priority subject areas are customer, product
or financials.
|
| 2. |
Identify source systems or data stores that require data cleansing.
|
| 3. |
Gather table and file layouts for both the source and target databases. Include data volume and
update frequency for each data store. Also, include feeds from each source to other systems. This is
where you'll find the spider web of data feeds. For clarity, draw these interfaces in a diagram.
|
| 4. |
Establish the time range and other information for the data that will require extraction if you
are creating a data warehouse.
|
| 5. |
Obtain data samples whenever possible, and pull the data required into a staging area or
profiling database for analysis.
|
| 6. |
Complete the data profiling functions for each source, table and element of the data required,
based on business requirements.
|
| 7. |
Document, maintain and share profiling information with the enterprise via data stewardship or
data governance groups. This is when it becomes clear if the organization has data issues.
|
—J.N.M.
|
|
Joyce Norris-Montanari is the principal architect of CIBER Inc.'s Global Enterprise Integration Practice.
Teradata Magazine-September 2008
|