Teradata Magazine Online
ENTERPRISE VIEW


Bridging the information gap

Together, structured and unstructured data provide a complete picture of your enterprise.

For years there has been a gap between structured and unstructured data and systems. A structured environment is the familiar world of records, databases, transactions and reports. Every bit of information has an assigned format and significance. Because the data fields are clearly defined, it is easy to run queries and formulas that extract meaningful information, not just raw data. An unstructured environment includes data found in e-mails, reports, PowerPoint presentations, voice mail, phone notes and photos. This data typically comprises about 85% of an organization's knowledge stores, but it is not always easy to find, access, analyze or put to use.

Today it seems there is an invisible barrier between the two worlds. Processing takes place in one world or the other, but never in both. Data in one environment is like an AC electrical current; data in the other is like a DC electrical current. The mismatch between them is profound, but bridging the gap is essential.

When you span the gap between the two worlds, you'll be able to build a variety of business applications, all of which fall into one of two basic categories. One depends on unstructured data being read and passed into the structured world, where it is mixed with structured data and developed into an application. The other takes data from both the unstructured world and the structured world, conditions it and blends the data into a usable format. Either way, the two data types are reconciled and accessible to the enterprise.

Here are three examples:

1. Communications-enhanced CRM
Customer relationship management (CRM) systems administrators are fond of saying they create a 360-degree view of the customer. Indeed, we have come a long way since the early days of master files.

But do we truly have a 360-degree view of the customer? Not by a long shot. We have customer demographic information:

  • name and address
  • occupation and income
  • marital status
  • spending and saving habits
  • preferences

What we don't know about is communications—interactions via "electronic interfaces" between the customer and the organization. For example: Did Mrs. Jones write a nasty e-mail last week? Did Mr. Smith call our service desk last month? How serious can we be about our relationships with our customers when we don't know the communications they've had with our organization?

Communications are clearly important to having a 1-to-1 relationship with a customer. Where do communications enter into the equation today? For the most part, they don't.

But if we could bridge the gap between structured data and unstructured data, we could include communications as part of the 360-degree view of our customers, and that would add a very important dimension to today's CRM applications.

2. Corporate communications facility
It is not just the sales and marketing people who need to see communications matched with customer data. There are a lot of people in every organization who should have access to the same information. Why not create a general-purpose corporate communications facility (CCF)?

A CCF is an application that captures, edits and sends unstructured communications (primarily phone conversations and e-mail) into a structured environment. Once there, the CCF matches communications to the appropriate records. The CCF is very similar to communications-enhanced CRM except that the audience for the CCF is much broader.
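
As a rough illustration of the matching step, the sketch below attaches incoming communications to customer records. It assumes each communication carries a sender e-mail address or phone number that also appears in the structured record; the `Customer` and `Communication` classes and the `match_communications` function are hypothetical names chosen for this example, not part of any product.

```python
from dataclasses import dataclass, field


@dataclass
class Customer:
    # A structured record (hypothetical layout for illustration).
    customer_id: int
    name: str
    email: str
    phone: str
    communications: list = field(default_factory=list)


@dataclass
class Communication:
    # An unstructured item captured by the facility.
    channel: str  # "email" or "phone"
    sender: str   # e-mail address or phone number
    text: str


def normalize(channel, sender):
    """Reduce a sender identifier to a comparable form."""
    if channel == "phone":
        # Keep digits only, so "555-0102" and "(555) 0102" compare equal.
        return "".join(ch for ch in sender if ch.isdigit())
    return sender.strip().lower()


def match_communications(customers, communications):
    """Attach each communication to its customer; return the unmatched ones."""
    index = {}
    for customer in customers:
        index[("email", normalize("email", customer.email))] = customer
        index[("phone", normalize("phone", customer.phone))] = customer

    unmatched = []
    for comm in communications:
        key = (comm.channel, normalize(comm.channel, comm.sender))
        customer = index.get(key)
        if customer is None:
            unmatched.append(comm)
        else:
            customer.communications.append(comm)
    return unmatched
```

Matching on an exact identifier is the simplest case. Communications that cannot be tied to a record this way are returned separately so they can be reviewed or matched by other means.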

Who can benefit from a CCF?

  • Executives who are entering sensitive contract negotiations
  • Human resources personnel who are evaluating employee performance
  • Engineers who are considering requirements for product modifications
  • Loan officers who must assign credit scores to individuals
  • Product marketing associates who are looking at acceptance rates

3. Compliance auditing tool
Compliance is on everyone's mind these days. Between Sarbanes-Oxley, HIPAA and Basel II, regulatory compliance affects nearly every company in every industry.

For most organizations, compliance means looking at financial transactions and ensuring the completeness and integrity of those transactions. Unfortunately, many enterprises become stuck in the quagmire of creating and implementing auditing routines. But there is another aspect of analyzing transactions that will quickly become the new frontier of compliance.

To understand this new frontier, consider that, in many ways, a transaction is merely the consummation of a long negotiation. Before the transaction was complete, there were numerous interchanges between client and provider. There were agreements, promises and commitments. Only after all of these things took place did a transaction occur.

Simply auditing completed transactions will not address the issues brought about by pre-transaction activities, which are governed by Sarbanes-Oxley. And where are such pre-transaction activities found? In unstructured data.

Businesses need a way to look at relevant unstructured data to see if improper communications took place, e.g., a customer service representative speaking inappropriately to a customer or a sales associate miscommunicating a store policy. Solving this problem requires delving into the unstructured world, sifting through all the communications, finding those that are relevant to compliance, determining whether they violate regulations and bringing any violations to management's attention. This must be done automatically, efficiently and reliably for millions of communications each day.

Navigating the challenges
Clearly, there are numerous applications awaiting the successful combination of structured and unstructured systems. However, there are some important technological challenges ahead as well.

Structured and unstructured data have only one common denominator: text. In most cases, textual data is fairly easy to understand within a structured environment. Context and meanings are clear.

Unstructured text is a different story. Conversations can be about any subject; they can be presented in any format, using formal or informal language; spelling might not be correct; the context could be ambiguous; words might be used incorrectly; and so forth.

Therefore, the first step in making sense of unstructured text is to edit the data. There are three important editing steps:

  • Separate the "blather" from the meaningful text. Blather is textual information that has no business context. Mary Jones' e-mail to Sue Smith about the weather does not have any business context and only clogs the communications pipeline.
  • Remove "stop" words. A stop word is so common that it has no business relevance. Common stop words include "a," "an," "the," "for," "to," "with" and their non-English equivalents. Removing stop words ensures that the remaining text is more likely to be business-relevant.
  • Recognize word stems. Word stems are root words that form the basis of many other words. For example, "moved," "moving" and "mover" all have the same word stem: "move." To create meaningful word clusters, you must organize them at the stem level.
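
The three editing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the stop-word list contains only the English words named above, the stemmer is a naive suffix stripper rather than a full stemming algorithm, and the blather test simply checks for the absence of any term from a caller-supplied business vocabulary. All function names are hypothetical.

```python
import re

# Only the stop words named in the text; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "for", "to", "with"}

# Suffixes removed by the naive stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "er", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def stem(word):
    """Strip one common suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def edit_text(text):
    """Remove stop words, then reduce the remaining words to stems."""
    return [stem(word) for word in tokenize(text) if word not in STOP_WORDS]


def is_blather(text, business_stems):
    """Treat text as blather when none of its stems is a business term."""
    return not any(s in business_stems for s in edit_text(text))


# "moved," "moving" and "mover" collapse to one stem. The naive stemmer
# yields the truncated form "mov" rather than the dictionary word "move".
assert stem("moved") == stem("moving") == stem("mover") == "mov"
```

Because stems are used only as grouping keys, it does not matter that a stem such as "mov" is not itself a word, as long as related words map to the same key.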

Other kinds of editing, such as spell checking, are important as well. Editing alone, however, does not lend meaning to the text that remains, and linking structured and unstructured environments depends on the ability to derive meaning from unstructured text.

To that end, there are three basic techniques for reading and deciphering the unstructured text that remains:

  • The linguistic approach requires that unstructured text be read and parsed. The linguistic approach is supported by much academic research, but it is complex and slow. With so many language nuances, understanding a random piece of unstructured text based purely on linguistics can be very difficult.
  • The themed approach operates on the theory that the words that appear most often in a document are central to the theme of that document. The document is edited and its words are stemmed. The stems are then organized by frequency of occurrence. The top 10% of word stems are considered the theme of the document. Other inferences may be made by the order of words and the proximity of certain words.
  • The ontological or "industrially recognized" approach requires that unstructured text be read and compared to structured text. Think of a dictionary of business-related words and phrases. The words and phrases become a filter through which the unstructured text passes, and the intent of the unstructured text is inferred from the "hits" that occur when it matches industrially recognized terms.
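
The themed and ontological approaches lend themselves to a short sketch. The code below assumes the text has already been edited into a list of word stems. The function names, the rounding rule for "top 10%" (rounded up, with at least one stem kept) and the shape of the ontology (a mapping from term to category) are assumptions made for illustration only.

```python
import math
from collections import Counter


def theme(stems, percent=10):
    """Themed approach: return the most frequent stems of a document.

    Keeps the top `percent` of distinct stems, rounded up, and at
    least one stem for any non-empty document.
    """
    counts = Counter(stems)
    if not counts:
        return []
    keep = max(1, math.ceil(len(counts) * percent / 100))
    return [s for s, _ in counts.most_common(keep)]


def ontology_hits(stems, ontology):
    """Ontological approach: count hits per category.

    `ontology` maps a recognized term to a category label.
    """
    hits = Counter()
    for s in stems:
        category = ontology.get(s)
        if category is not None:
            hits[category] += 1
    return hits


def infer_intent(stems, ontology):
    """Return the category with the most hits, or None if there are none."""
    hits = ontology_hits(stems, ontology)
    return hits.most_common(1)[0][0] if hits else None
```

For example, given the stems `["loan", "rate", "loan", "credit", "loan", "fee"]` and an ontology that maps "loan" and "credit" to "lending" and "rate" and "fee" to "pricing", `theme` returns `["loan"]` and `infer_intent` returns `"lending"`. The sketch matches single terms only; matching multiword phrases or using word order and proximity would require additional logic.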

The technologies of structured and unstructured data are as different as AC and DC electrical currents. In the past these worlds have rarely intersected. But today, they are coming together in meaningful ways. With proper editing and careful integration, structured and unstructured data can blend seamlessly to add a new dimension to business intelligence.

© Teradata Magazine-June 2005






Copyright by Teradata Corporation 2001-2007.