Text analytics helps organizations find the hidden value in unstructured data.
by Mark Shainman
Extracting and analyzing data about customers, partners and vendors gives businesses the upper hand over their competitors. Yet most companies use only a
fraction of their overall corporate data.
As Merrill Lynch noted in an article published in DM Review in February 2003, more than 85% of the data within most organizations falls outside of the traditional
"structured" data category. The bulk of this unstructured data—information that doesn't nicely fit into the typical data-processing format—remains stored,
but unanalyzed.
One key to tapping into this data, and consequently gaining corporate insight and knowledge, is known as text analytics.
More than 85% of an organization's information is in unstructured data. Text analytics can reformat this data and merge it with
traditional structured data for greater corporate insight.
When more is better
Searching for and discovering unstructured data is not a new process. Two existing search-and-discovery methods are data mining, which leverages statistical
algorithms to find patterns in the data, and enterprise search-and-retrieval, which locates specific words or sentences in the data. Data mining of unstructured
data can tell the user how often a term exists or if it is clustered with other terms, but it cannot identify relationships among terms. Enterprise
search-and-retrieval is good for finding documents about certain topics, but it is not helpful for understanding the who, what and where of those documents.
Text analytics solutions perform a more in-depth task than either of these methods. After extracting facts and context from unstructured text documents, text
analytics transforms unstructured data into a relational format and stores it in a data warehouse. These steps offer businesses an insight into the context, or
true meaning, of the unstructured text. (See figure 1 for samples of unstructured data.)
Two ways to perform text analytics are the statistical conceptual method and the linguistic fact-based method:
The statistical conceptual method reviews various components of unstructured data, such as the frequency of specific
words in a document.
As an example, this method can be used to identify the subject matter of documents based on the key words "4x4" and "truck." Let's say that in a multitude of
text documents "4x4" appears 1,000 times and "truck" appears 200 times. When narrowing this analysis to "4x4 trucks," the statistical conceptual method identifies
the phrase 30 times in five documents, leading to the assumption that the common theme of these articles is 4x4 trucks.
More specifically, the statistical conceptual method enables analysis based on statistical approaches (like frequency, average, median and clusters) of the words
in these documents. In most cases, this analysis of unstructured text is facilitated by extraction of particular features within the document that consist of key
words, phrases or combinations of them, assembled manually or by statistical and probabilistic algorithms.
While the statistical conceptual method is focused, the depth of its insight is limited. Inquiries must be word- or phrase-specific, meaning the user must know
exactly which words to search for.
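The frequency counting behind the "4x4 truck" example can be sketched in a few lines of Python. This is only an illustration of the general idea, not any vendor's implementation: the three sample documents and the helper names `count_term` and `term_statistics` are invented here, and real products add clustering and probabilistic feature extraction on top of simple counts.

```python
import re

# Hypothetical mini-corpus; these documents are invented for illustration.
documents = [
    "The 4x4 truck handled the trail well. Another 4x4 truck followed.",
    "Dealers report strong demand for every 4x4 truck on the lot.",
    "The sedan needed new tires after the road trip.",
]

def count_term(text, term):
    """Count case-insensitive occurrences of a word or phrase as a whole unit."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def term_statistics(docs, term):
    """Return (total occurrences, number of documents containing the term)."""
    per_doc = [count_term(doc, term) for doc in docs]
    return sum(per_doc), sum(1 for n in per_doc if n > 0)

for term in ("4x4", "truck", "4x4 truck"):
    total, doc_count = term_statistics(documents, term)
    print(f"{term!r}: {total} occurrences in {doc_count} documents")
```

Note that the sketch shares the limitation described above: it only finds the exact words or phrases it is asked to count.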
The linguistic fact-based approach analyzes the roles and relationships among specific words within a document to
determine the context of the information. By linguistically parsing a sentence into components, the approach identifies word relationships and creates
a relational table consisting of facts extracted from this document. This data can then be combined with existing relational data to give more insight
into, for instance, a business problem or an overview of customer trends.
Follow the path of the "voice of the customer" in the telecommunications industry. Text analytics is used in a call center
setting to transform unstructured data into usable, structured data which will enable better customer service.
Using the same scenario as before, written on one of the text documents (in this case a service note for a 4x4 truck) is "the ball joint on the top of the
steering knuckle cracked due to wear." A linguistic fact-based method extracts the key facts as:
1) Event type: cracked
2) Part affected: ball joint
3) Location: top of the steering knuckle
4) Cause: wear
From this information, the linguistic fact-based method identifies the text relationships and populates a relational table with these facts or dimensions,
which are then used for analytics.
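To make the idea of turning a sentence into a relational row concrete, here is a deliberately simplified Python sketch. It is not a linguistic parser and does not represent Attensity's technology: it assumes a single hard-coded sentence shape ("the <part> on the <location> <event> due to <cause>"), a hypothetical list of event verbs, and a hypothetical table named `service_facts` in an in-memory SQLite database standing in for a data warehouse.

```python
import re
import sqlite3

# Toy pattern for one sentence shape only; the verb list is an assumption.
FACT_PATTERN = re.compile(
    r"the (?P<part>.+?) on the (?P<location>.+?) "
    r"(?P<event>cracked|leaked|failed) due to (?P<cause>[^.]+)",
    re.IGNORECASE,
)

def extract_fact(note):
    """Return the four facts as a dict, or None if the note does not fit the pattern."""
    match = FACT_PATTERN.search(note)
    if match is None:
        return None
    return {
        "event_type": match.group("event").lower(),
        "part": match.group("part"),
        "location": match.group("location"),
        "cause": match.group("cause").strip(),
    }

# In-memory database used as a stand-in for a warehouse table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE service_facts (event_type TEXT, part TEXT, location TEXT, cause TEXT)"
)

note = "The ball joint on the top of the steering knuckle cracked due to wear."
fact = extract_fact(note)
if fact is not None:
    conn.execute(
        "INSERT INTO service_facts VALUES (:event_type, :part, :location, :cause)",
        fact,
    )

rows = conn.execute(
    "SELECT event_type, part, location, cause FROM service_facts"
).fetchall()
print(rows)
```

Once the facts sit in a table, they can be filtered, counted and joined like any other structured data.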
To allow for this type of analysis, Attensity, a text analytics software provider, has partnered with Teradata to create a solution that combines Attensity's
unique linguistic fact-based approach with the Teradata Warehouse. This solution offers a complete data warehouse made up of structured and unstructured data,
which can be analyzed using commercially available business intelligence (BI) and reporting tools. Companies, therefore, can use their existing investments and
skill sets while accessing a greatly expanded pool of data that supports deep analysis across most of the data available in their organizations. (See "Finding
structure in the unstructured world of data," below.)
Business value behind text analytics
The following examples illustrate how combining unstructured text-extracted data and existing structured data within a data warehouse can benefit various
industries:
In the travel and transportation industry, text from repair and operation notes can be combined with existing data
such as aircraft configuration, purchasing-supplier information and parts and maintenance history. With this newly combined data, department leaders
can determine whether a defective part should be returned to the manufacturer, or if a new repair strategy should be engineered and implemented.
Data extracted from the unstructured text in warranty claims at a manufacturing plant can be combined with existing
structured data, such as material invoices and test center data, to more accurately determine the scope of a product failure. Combining structured
and unstructured data can also help the company focus on improving product quality, as well as identify and avoid potential product recalls.
To better enable predictive modeling in an insurance company, a relational database of the company's unstructured
text in claims can be created. That information can then be combined with its existing customer- and claim-focused structured data. This combination
helps the company determine its risk factors and improve its underwriting processes and rules.
Retail chains can leverage the product return information gathered by their call centers and in-store associates to
determine whether a consistent problem exists with a given product or manufacturer. The chains can then reduce return costs by eliminating these
products or requesting that the manufacturer take corrective action.
By combining call center notes with existing customer and profitability data, a telecommunications company can
better determine how to reduce customer churn. (See figure 2, above.)
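The common thread in these examples is a join between text-extracted facts and existing structured tables. The Python sketch below illustrates that step for the telecommunications case, using an in-memory SQLite database as a stand-in for a data warehouse. Every table name, column name and row is hypothetical, and the choice of which events count as churn signals is an assumption made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Existing structured data (hypothetical schema).
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, monthly_revenue REAL);
    -- Facts extracted from call center notes (hypothetical schema).
    CREATE TABLE call_facts (customer_id INTEGER, event_type TEXT, subject TEXT);
""")

conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, 120.0), (2, 45.0), (3, 80.0)],
)
conn.executemany(
    "INSERT INTO call_facts VALUES (?, ?, ?)",
    [
        (1, "dropped", "call"),
        (1, "threatened", "cancellation"),
        (2, "asked", "billing date"),
        (3, "dropped", "call"),
    ],
)

# Rank customers who show churn signals by how much revenue is at stake.
at_risk = conn.execute("""
    SELECT c.customer_id, c.monthly_revenue, COUNT(*) AS risk_events
    FROM customers AS c
    JOIN call_facts AS f ON f.customer_id = c.customer_id
    WHERE f.event_type IN ('dropped', 'threatened')
    GROUP BY c.customer_id, c.monthly_revenue
    ORDER BY c.monthly_revenue DESC
""").fetchall()

for customer_id, revenue, risk_events in at_risk:
    print(customer_id, revenue, risk_events)
```

Because the extracted facts live in ordinary relational tables, the same query could be issued from any SQL-capable reporting tool.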
These examples show how text analytics is a key enabler for organizations that continually strive to increase the business value of analytic interpretation. The
combination of extracted textual data and traditional structured data stored in a data warehouse truly provides a "single view of the business" so that
organizations can be better informed and make more intelligent decisions.
Finding structure in the unstructured world of data
Business organizations and government agencies have spent years developing systems and processes for capturing, storing, maintaining
and mining data. While these systems have made great strides in gleaning business value from this data, they may have been missing as
much as 85% of the available information.
Various financial, procurement, customer service and human resource applications located across the business carry about 15% of an
organization's data. Compare that to the vast amount of unstructured text derived from call center and repair notes, claim
descriptions, employee reviews, e-mail content, reports, news feeds and more, and the opportunities for gathering and analyzing more
data are far-reaching.
The Teradata/Attensity Text Analytics solution extracts unstructured data from text documents. It then feeds the
data to the Teradata Warehouse where the information is transformed into a relational table for analysis.
Because unstructured data is in a format that cannot be readily processed, many organizations are unable to analyze all of their
collected data, even though such analysis can have a significant impact on their business performance or on the success of their
agency's imperatives. The combination of Attensity's patented Text Analytics solutions with the Teradata enterprise data warehouse
enables companies to transform unstructured text into table-formatted data. This new data is easily merged with the enterprise's other
information for a consistent and in-depth view of the organization. Immense business value is added to the organization, as business
solutions that would not have been apparent using only the 15% of data held in structured applications are now obvious.
Attensity uses linguistic principles to transform unstructured text fields into identifiable facts and events in a relational format.
The company's "fact recognition" software and patented Exhaustive Extraction approach are valuable because they connect the dots in
ways that enable customers to see and respond to critical issues or events that historically have been difficult to analyze or detect.
More than search software, Attensity's Text Analytics technology grasps the nuances, relationships and context of everyday language and
extracts who did what, when, where and under what conditions. It then creates output that is fused with existing structured data in the
Teradata Warehouse so that it can be analyzed using Attensity's applications (Attensity Discover, Attensity Analytics and Attensity
Text Search) or by using business intelligence (BI) applications already installed in the enterprise's data warehouse.
—Lisa Slutter
Mark Shainman, a senior program manager for Teradata, manages the deployment of Text Analytics across Teradata. He also works with the strategy, market analysis
and master data management marketing teams at Teradata.
Teradata Magazine-March 2007