Teradata Magazine Online


Evan Levy

The Lowdown on Data Mining

Data mining is a high-yield but complex form of knowledge discovery. Before you even think about using this technology, make sure you know what it is, what it takes, and what you need.

In its 1997-1998 study of data mining market trends, META Group claimed that nearly 80 percent of companies interviewed expected data mining to be a critical success factor by 1999. More recently, Forrester Research weighed in on data mining, claiming that, while many companies were still evaluating the technology, most planned on using it by 2001. Other analysts and independent research firms polling companies to find out who’s doing what in the data mining space are finding that the common denominator is intention, not practice. Is this because companies are solidifying their infrastructures first? Are companies too intimidated to admit they have no intention of doing data mining at all? Or is there still a pervasive misunderstanding of what data mining really is–and isn’t?

I recently spoke at a database marketing conference on this point. The title of my presentation was "Data Mining in the Real World," and the room was brimming with both technicians and marketers. When I got to the part of the presentation that discussed the differences between data mining and OLAP, I noticed a guy a few rows from the front. He had stopped taking notes and had put down his pen. After the presentation, he buttonholed me, taking me to task for my definition of data mining.

At first, I figured he worked for an OLAP vendor, one of the many who had labeled its multidimensional analysis or query generation tool as a data mining product. But after listening to his harangue for a few minutes, I was able to piece together that he was a data analyst for a marketing organization and had been telling everyone that his company was doing data mining. I had burst his bubble by classifying his cherished "data mining" tool as a simple OLAP application, and I had clearly called into question his status as a knowledge worker.

My point is not that OLAP is less valuable than data mining, but that they are two separate breeds of analysis with entirely different objectives, not to mention tools, skill sets, and implementation methods.

UNDERSTANDING THE PLAYERS

Most people wouldn’t use a spreadsheet tool to write a book. Even a crack statistician wouldn’t use SAS to fill out an expense report. Different software tools exist to tackle different business functions, just as different decision-support tools exist because there are different classes of questions. The major classes of decision support are:

Canned reports. This is the most basic type of decision support, if not the most pervasive. Nearly every data warehouse starts out by generating reports. The delivery of timely, accurate reports containing business information is incredibly valuable–especially in places where this data never before existed. Such an application focuses on well-defined, well-understood business questions. It also allows users to gradually change their businesses to leverage this new information.

Ad hoc querying. Submitting free-form or ad hoc questions to the database is the next logical step in the evolution of a company’s data warehouse. After receiving hard copy reports, business users inevitably have additional questions. They can submit ad hoc queries with tools such as Hummingbird Communications’ BI/Query or Cognos Inc.’s Impromptu.

OLAP. Online analytical processing takes various forms (slicing and dicing the data by dimension, complex multi-statement queries, and so on), but the common denominator of these forms is that they all provide analysts with fast, consistent, interactive access to data from a variety of perspectives. OLAP not only enables analysts to ask many questions–each question relating to the answers and details of the previous question–but also to organize the results.

For example, consider a business analyst for a utility company who is reviewing electricity use for customers in a particular geographic region. An OLAP tool lets the analyst ask questions about customers in a particular region. If the analyst happens to identify a region with lower-than-expected usage, he or she could redirect the focus on that region’s usage to a more specific time frame to determine if the usage shortfall relates to a specific day of the week.

OLAP tools let users "drill down" into more detail, allowing them to examine the same data from multiple perspectives, limited only by the metrics available in the database and their own imagination.
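The utility example can be sketched in a few lines of Python. This is only an illustration of the drill-down idea, not how any particular OLAP product works; the readings, the `total_by` helper, and the region and day names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical meter readings: (region, day_of_week, kWh used).
readings = [
    ("North", "Mon", 120), ("North", "Tue", 118), ("North", "Sun", 40),
    ("South", "Mon", 125), ("South", "Tue", 130), ("South", "Sun", 122),
]

def total_by(rows, key_index, value_index=2):
    """Sum the usage measure for each distinct value of one dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[value_index]
    return dict(totals)

# Step 1: the analyst's first question, usage by region.
by_region = total_by(readings, key_index=0)

# Step 2: the analyst spots the low region and drills down by day of week.
low_region = min(by_region, key=by_region.get)
by_day = total_by([r for r in readings if r[0] == low_region], key_index=1)

print(by_region, low_region, by_day)
```

Note that the analyst chooses each step: the region view comes first, and the decision to look at days of the week is the analyst's own hypothesis.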

Data mining. With canned reports, ad hoc querying, and OLAP, the end user defines a hypothesis and determines which data to examine. With data mining, the tool identifies the hypothesis, and it actually tells the user where in the data to start the exploration process.

Rather than using SQL to filter out values and methodically reduce the data into a concise answer set, data mining uses algorithms that exhaustively review the relationships among data elements to determine if any patterns exist.

The whole purpose of data mining is to yield new business information that a business person can act on. The mining activity itself is, by necessity, "back-office" work. Forget the myth that data mining will eventually "mature" to become a desktop application. (The fact that many organizations are buying into this myth accounts for much of the current reluctance to adopt the technology and the preference, instead, to "wait and see.") The truth is, data mining inherently requires a certain amount of application-specific data manipulation in order to yield effective results. This means that the IT organization must deploy the data mining tools–just as they run other technical functions–to load new business intelligence into the data warehouse. This "closed-loop" process allows end users to query the results of data mining without directly operating the data mining tool.

The actual analysis techniques that current data mining tools use aren’t new; in fact, some of the algorithms have existed for more than 20 years. The innovation is in the recent commercialization of these algorithms into software products that address business-oriented problems. Data mining products have been reengineered from the traditional mainframe and supercomputer class systems to leverage the more popular (and considerably less expensive) SMP and MPP platforms.

Data mining tools are typically classified by the type of algorithm they use to identify hidden patterns. There are many different algorithms in use, but the four most popular are association, sequence, clustering (or segmentation), and predictive modeling.

ASSOCIATION

Association, also frequently referred to as "affinity analysis," reviews numerous sets of items and looks for common groupings. An example of association is market basket analysis, which involves reviewing the products that consumers purchase in a single trip to the grocery store. In this kind of analysis, the association tool looks at every cash register transaction to determine which products customers purchase in combination. Table 1 shows three association rules identified by analyzing all the purchases in a grocery store for an entire day. The rules listed in the table are three examples of product affinities.

Table 1. Grocery purchase associations.
Product               Product     Frequency %  Average $  Average $ Margin  Affinity  Inverse Affinity
Grape juice           Bologna     1.4          29.52      3.94              0.56      0.23
Grape juice           Corn Chips  0.7          28.79      3.76              0.27      0.18
Grape juice, Bologna  Corn Chips  0.40         30.75      4.64              0.87      0.05


The first rule identifies that grape juice is purchased with bologna. The frequency value associated with this rule indicates that these two products are purchased together in 1.4 percent of actual shopping trips. The average revenue is $29.52 for all the shopping trips containing this pair of products, and the average profit for all products purchased during those trips is $3.94. The affinity value indicates that for all shopping trips in which grape juice is purchased, 56 percent of the time bologna is also purchased. The inverse affinity value considers all of the shopping trips in which bologna is purchased and indicates the percentage of these trips (23 percent) in which customers also purchased grape juice.
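To make the three measures concrete, here is a minimal sketch that computes frequency, affinity, and inverse affinity for one product pair. The baskets are made up for illustration and are not the data behind Table 1; the function name `affinity_metrics` is likewise an assumption.

```python
def affinity_metrics(baskets, item_a, item_b):
    """Frequency, affinity, and inverse affinity for the rule item_a -> item_b.

    Assumes each item appears in at least one basket (no zero division guard).
    """
    total = len(baskets)
    with_a = sum(1 for b in baskets if item_a in b)
    with_b = sum(1 for b in baskets if item_b in b)
    with_both = sum(1 for b in baskets if item_a in b and item_b in b)
    return {
        "frequency": with_both / total,          # share of all trips with both
        "affinity": with_both / with_a,          # of trips with A, share with B
        "inverse_affinity": with_both / with_b,  # of trips with B, share with A
    }

# Hypothetical shopping trips, one set of products per trip.
baskets = [
    {"grape juice", "bologna", "bread"},
    {"grape juice", "bologna"},
    {"grape juice", "milk"},
    {"bologna", "mustard"},
    {"bologna", "bread"},
    {"milk", "bread"},
    {"bologna"},
    {"eggs"},
]

rule = affinity_metrics(baskets, "grape juice", "bologna")
print(rule)
```

In this toy data, grape juice appears in three trips and bologna in five, with two trips containing both, so affinity and inverse affinity differ even though they describe the same pair.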

From this rule information, merchandisers can easily understand the wider implications of placing grape juice on sale or placing it in a highly visible position in a store; after all, the more grape juice they sell, the more bologna they may sell as well.

Of course, association analysis isn’t just the identification of two-item pairs. The three-item affinity in the third rule shows the likelihood of three products being purchased together. Association analysis can also illustrate upwards of four-, five-, and six-item affinities.

Association algorithms function by running all the combinations of data within a data set. This operation grows increasingly complex with the number of items you compare. So although two-item affinities can be performed "manually" using SQL, a five-item affinity would be next to impossible to define accurately. The actual processing requirement, as well as the sheer number of queries, would be enormous–not to mention the time it would take to develop and run these queries!
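A quick calculation shows how fast the number of candidate combinations grows. The catalog size of 1,000 products is an assumption chosen for illustration; the count itself is the standard "n choose k" formula.

```python
from math import comb

def candidate_itemsets(num_products, set_size):
    """Number of distinct unordered product combinations of a given size."""
    return comb(num_products, set_size)

# Hypothetical store carrying 1,000 products.
for size in range(2, 7):
    print(f"{size}-item combinations: {candidate_itemsets(1000, size):,}")
```

Each additional item multiplies the number of candidates by a factor of a few hundred at this catalog size, which is why hand-written queries stop being practical after two or three items.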

The commercially available association engines have been specifically engineered to process data sets very quickly. The engine is usually fed a flat file containing the data sets with all the occurring item combinations. The engine then processes the data in a "brute force" manner: It compares every instance of an item to every other item in the set. This type of processing is more oriented toward record-at-a-time processing than the set-oriented processing common to relational databases. Some of the more sophisticated association engines have been engineered to divide the data set into multiple smaller pieces to process in parallel.
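The record-at-a-time style described above might look something like the following sketch. It is a naive illustration, not the internals of any commercial engine; `count_itemsets` and the sample transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations

def count_itemsets(transactions, set_size):
    """Read one transaction at a time and tally every item combination in it."""
    counts = Counter()
    for items in transactions:
        # Sort so that the same combination always produces the same key.
        for combo in combinations(sorted(set(items)), set_size):
            counts[combo] += 1
    return counts

# Hypothetical flat file contents: one list of items per register transaction.
transactions = [
    ["grape juice", "bologna", "corn chips"],
    ["grape juice", "bologna"],
    ["bologna", "corn chips"],
    ["grape juice", "corn chips", "bologna"],
]

pairs = count_itemsets(transactions, 2)
triples = count_itemsets(transactions, 3)

# Tallies from separate pieces of the file can simply be added together,
# which is one way the work could be divided for parallel processing.
split_pairs = count_itemsets(transactions[:2], 2) + count_itemsets(transactions[2:], 2)

print(pairs.most_common(), triples.most_common())
```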

With the advent of customer relationship management, association analysis is at the forefront of data mining because it crisply identifies the products customers purchase along with which products and services drive additional sales. Other decision-support tools don’t support the analysis of such product combinations.

SEQUENCE

Sequential analysis helps data miners identify a set of order-specific items or events. Association identifies the existence of patterns or groups of items; sequential analysis identifies the order of those patterns or groups of items.

At a phone company recently, product managers using OLAP and canned queries were monitoring new customer orders and cancellations. They calculated that in one out of every eight orders, a customer canceled Product A, while in one out of every 10 orders, a customer canceled Product B. Based on customer purchase history, these cancellation rates were five times greater than normal. We were asked to employ data mining to find out why.

We uncovered 12 order combinations that included disconnects of both Product A and Product B. Table 2 shows one such combination.

Table 2. Sequential analysis results.
(Percentage of orders displaying this pattern = 7.3)
Pattern Activity  Product
Disconnect        Product A
Disconnect        Product B
Connect           Product A
Connect           Product B
Connect           Product C


In this example, 7.3 percent of all the orders weren’t disconnecting either product, but simply purchasing a new product (Product C). Unfortunately, the legacy order system could not add a new product to a customer’s bill. Instead, it had to disconnect both existing products and then reconnect them together with the new product. The high rate of disconnects actually represented a high rate of service upgrades, not cancellations. The 11 other sequences reflected similar activity with other product additions. Thus, the 7.3 percent problem actually indicated that about 80 percent of the company’s disconnect orders were in fact new product orders.
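The distinction between an upgrade and a true cancellation depends on the order of activities within each order, which a short sketch can illustrate. The orders below are invented, and the `is_upgrade` rule (every disconnected product is reconnected later in the same order) is an assumption made for this example, not the logic of the tool used on the project.

```python
from collections import Counter

def sequence_pattern_counts(orders):
    """Tally how often each exact, order-specific sequence of activities occurs."""
    return Counter(tuple(order) for order in orders)

def is_upgrade(order):
    """Assumed rule: an order is an upgrade if it contains disconnects and
    every disconnected product is connected again later in the same order."""
    for i, (activity, product) in enumerate(order):
        if activity == "Disconnect" and ("Connect", product) not in order[i + 1:]:
            return False
    return any(activity == "Disconnect" for activity, _ in order)

# Hypothetical orders; the first mirrors the pattern in Table 2.
upgrade = [("Disconnect", "A"), ("Disconnect", "B"),
           ("Connect", "A"), ("Connect", "B"), ("Connect", "C")]
cancel = [("Disconnect", "A")]
new_only = [("Connect", "C")]
orders = [upgrade, cancel, upgrade, new_only]

patterns = sequence_pattern_counts(orders)
disconnect_orders = [o for o in orders if any(a == "Disconnect" for a, _ in o)]
upgrade_share = sum(is_upgrade(o) for o in disconnect_orders) / len(disconnect_orders)

print(patterns.most_common(1), upgrade_share)
```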

Customers weren’t disconnecting products out of dissatisfaction; they were upgrading their products! In reality, disconnect levels had not increased over historical norms, and product managers discovered that their disconnect costs were in fact insignificant.

The company could never have differentiated "disconnects" from "upgrades" without data mining. Applying OLAP technology to this problem would have meant running queries against connect and disconnect orders for millions of monthly orders, covering a portfolio of more than 200 products. In fact, that approach would have required 200 to the fifth power, or 320 billion, queries–each of them most likely a full-table scan. Extrapolation shows that if each of these queries took a second, the work would take roughly 10,000 years to perform!
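The arithmetic behind that estimate is easy to check. The sketch below assumes, as the figures above imply, one query per ordered combination of five activities across 200 products, one second per query, and a 365-day year.

```python
PRODUCTS = 200
SEQUENCE_LENGTH = 5          # five activities per pattern, as in Table 2
SECONDS_PER_QUERY = 1        # the article's assumption
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

queries = PRODUCTS ** SEQUENCE_LENGTH
years = queries * SECONDS_PER_QUERY / SECONDS_PER_YEAR

print(f"{queries:,} queries, about {years:,.0f} years")
```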

Sequential analysis is an ideal algorithm for uncovering event patterns that aren’t obvious. In environments that involve thousands or millions of events over time, sequential analysis can be invaluable in revealing interesting characteristics and behaviors.

CLUSTERING

Cluster analysis lets the data miner assemble data into unforeseen groups containing similar characteristics. Also known as "segmentation," this type of data mining is probably the most widely used.

In one case, a company I was consulting with wanted to find new data about customers owning a specific loan product. We performed cluster analysis on data relating to several hundred thousand customers and including more than 150 attributes for each customer. (Customer attributes included account name, home address, and outstanding balance.) We instructed the data mining tool to ignore any groups that contained less than 1 percent of the population. A four-cluster subset of the overall results is illustrated in Figure 1.



Figure 1. Subset of cluster analysis results.


These four clusters provide some rather interesting insight into the behavior of this group of account holders. The first cluster–comprising 3.5 percent of account holders–actually had the most credit cards and the highest money market balance, making it an attractive market for cross-selling new banking products. The same holds true for two other clusters–business checking account holders and the low debt group.

The most interesting cluster, however, is the cluster containing 6.8 percent of account holders. These account holders don’t represent a likely target market: They don’t use credit cards, and they have the least profitable account the bank offers. Obviously, future marketing campaigns should exclude this audience in order to save the bank money.

This example illustrates how data mining can provide a business user with a hypothesis from detailed data. Any experienced analyst could have constructed queries to peruse the data; however, it’s unlikely that he or she could have guessed the precise combination of attributes that would reveal the insights data mining is capable of.

While query tools are certainly useful for examining cluster results, they’re not capable of identifying clusters. Since OLAP and SQL are set-based operators, they don’t classify or calculate the significance of different attributes. That is precisely why clustering calls for data mining: It requires a completely different data manipulation technique from what a query language can provide. The true value of using data mining here was to save the bank’s analysts time by giving them a head start in their data analysis efforts.
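For readers who want a feel for what a clustering pass does, here is a deliberately tiny sketch. The article does not say which algorithm the tool used; this example swaps in basic k-means, starts from the first k records for repeatability, and uses made-up two-attribute customers. The `drop_small` helper mirrors the instruction to ignore groups below 1 percent of the population.

```python
def kmeans(points, k, iterations=10):
    """Very small k-means: group points around k centroids by squared distance."""
    centroids = [points[i] for i in range(k)]  # naive, deterministic start
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return groups

def drop_small(groups, population, min_share=0.01):
    """Ignore groups holding less than min_share of the population."""
    return [g for g in groups if len(g) / population >= min_share]

# Hypothetical customers: (number of credit cards, money market balance in $000s).
points = [(5, 40.0), (0, 1.0), (6, 45.0), (1, 2.0), (5, 50.0), (0, 1.5)]

groups = kmeans(points, k=2)
kept = drop_small(groups, population=len(points))
print(kept)
```

A real engagement would scale the attributes first and handle well over a hundred of them; the point here is only that the groups emerge from the data rather than from a query the analyst wrote.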

PREDICTIVE MODELING

As the name implies, predictive modeling involves developing a model from historical data for predicting a future event. The power of predictive modeling engines is that they can use a broad range of data attributes to identify future behavior.

Both cluster analysis and predictive modeling tools identify distinct groups of items with common attributes; the difference is that predictive modeling focuses on the likelihood of a particular outcome for a particular group.

Predictive modeling works best when the user specifies the dependent variable to be predicted. For example, you might want to predict the likelihood that a customer will purchase voice mail. The related information, or independent variables, could include other customer attributes, such as the quantity of phone lines, average monthly bill amount, or long-distance usage. The more information you have that relates to the dependent variable, the better. The predictive modeling tool will then indicate which attributes influence the predicted outcome. For example, even though there are 57 customer attributes, you may find that only two influence the purchase of voice mail.
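One crude way to see which attributes influence an outcome is to compare purchase rates across each attribute's values. The sketch below is a simple stand-in for what a predictive modeling tool does far more rigorously; the customers, attribute names, and the `purchase_rate_spread` measure are all hypothetical.

```python
from collections import defaultdict

def purchase_rate_spread(customers, attribute, target="bought_voice_mail"):
    """Gap between the highest and lowest purchase rate across an attribute's values.

    A spread near zero suggests the attribute has little influence on the outcome.
    """
    tallies = defaultdict(lambda: [0, 0])  # attribute value -> [buyers, customers]
    for customer in customers:
        tally = tallies[customer[attribute]]
        tally[0] += customer[target]
        tally[1] += 1
    rates = [buyers / total for buyers, total in tallies.values()]
    return max(rates) - min(rates)

# Hypothetical customer records with two candidate attributes.
customers = [
    {"lines": "multi", "region": "east", "bought_voice_mail": 1},
    {"lines": "multi", "region": "west", "bought_voice_mail": 1},
    {"lines": "multi", "region": "east", "bought_voice_mail": 1},
    {"lines": "single", "region": "west", "bought_voice_mail": 0},
    {"lines": "single", "region": "east", "bought_voice_mail": 0},
    {"lines": "single", "region": "west", "bought_voice_mail": 1},
]

ranking = sorted(
    ["lines", "region"],
    key=lambda attribute: purchase_rate_spread(customers, attribute),
    reverse=True,
)
print(ranking)
```

In this toy data the number of lines separates buyers from non-buyers while region does not, so only one of the two attributes would be reported as influential.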


Figure 2. Some promising customer segments identified through predictive modeling.

Figure 2 is a subset result of a predictive modeling activity on which I consulted that used the telephone bills of small businesses. The product marketing manager wanted to better understand which customers were purchasing an advanced custom-calling product. This predictive modeling activity used 130 attributes about small-business customer accounts.

The first rule identifies a segment of customers who 1) are in the retail food industry, 2) have flat-rate service without caller ID, and 3) are in three states where customers are highly likely to purchase the custom-calling product. The predictive tool tells us that three different attributes (location, other product usage, and industry) define this segment. It also tells us the attributes that aren’t influencing the outcome. Thus, the user can avoid trial-and-error query submission. The tool will identify the hypothesis.

Predictive modeling tools test themselves by checking their hypotheses against the actual data matching the criteria. The key to supporting predictive modeling, however, is the availability of data relating to the prediction. Predictive modeling can’t occur without information relating to the event or outcome that you want to analyze. In other words, if you want to predict the propensity to buy a particular product, you need sales history about that product and the customers who have purchased it.
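Checking a hypothesis against actual data can be pictured as follows: derive a rule from part of the sales history, then measure how often it holds on records the rule has not seen. This is a sketch under assumed data; the account records, the rule, and the split point are invented for illustration.

```python
def rule_hit_rate(records, rule, outcome):
    """Share of records matching the rule for which the outcome actually occurred."""
    matches = [r for r in records if rule(r)]
    if not matches:
        return None  # the rule cannot be tested without matching history
    return sum(1 for r in matches if outcome(r)) / len(matches)

# Hypothetical small-business accounts with known purchase history.
history = [
    {"industry": "retail food", "caller_id": False, "bought": True},
    {"industry": "retail food", "caller_id": False, "bought": True},
    {"industry": "legal", "caller_id": True, "bought": False},
    {"industry": "retail food", "caller_id": False, "bought": False},
    {"industry": "retail food", "caller_id": False, "bought": True},
    {"industry": "retail food", "caller_id": False, "bought": True},
    {"industry": "retail food", "caller_id": False, "bought": False},
    {"industry": "retail food", "caller_id": False, "bought": True},
    {"industry": "legal", "caller_id": True, "bought": False},
]

def segment_rule(record):
    """Assumed segment: retail food accounts without caller ID."""
    return record["industry"] == "retail food" and not record["caller_id"]

def bought(record):
    return record["bought"]

training, holdout = history[:4], history[4:]
training_rate = rule_hit_rate(training, segment_rule, bought)
holdout_rate = rule_hit_rate(holdout, segment_rule, bought)
print(training_rate, holdout_rate)
```

The `None` return reflects the point made above: without history for the outcome in question, there is nothing to test the rule against.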

Although difficult, it is possible to model events that haven’t yet occurred. To model brand-new events, you must have access to data for similar events. This type of analysis is sometimes used in the entertainment industry. For a new Mel Gibson action movie, a studio can mine box-office data from Gibson’s prior movies and also include information regarding the number of screens, costars, opening seasons, and other details. Such a predictive model could tell a studio the customer segments that are most likely–and least likely–to watch this kind of movie, as well as the probable revenues for opening weekend.

Once again, since query tools, including OLAP, can only answer questions that can be fully qualified, the studio could never uncover this information without data mining.

ARE YOU READY FOR DATA MINING?

Just because you have a data warehouse doesn’t mean you’re necessarily ready for data mining. Much of the work our company does in the data mining arena has more to do with data mining readiness assessment than with actually performing data mining.

Granted, it’s a lot easier for clients to commit to an assessment than it is to justify the purchase and integration of a new data mining tool suite. But many of our clients are still working on the post-assessment recommendations and haven’t yet talked to data mining vendors. So what does it take to be prepared for data mining?

Here are some metrics you can use to gauge your data mining readiness:

Do you have a staff of experienced knowledge workers? Everyone is excited about the opportunities that data mining affords, but few understand the implications of presenting this new type of business intelligence to knowledge workers. Ten years ago, retailers had their hands full when they delivered fresh data warehouse reports to their merchandisers because these reports frequently challenged traditional views of the popular and profitable products. The initial response was that the data warehouse was just plain wrong.

Many companies want to bypass traditional decision support and go directly to data mining, but it’s highly risky. If business users aren’t experienced in using data and they haven’t yet transformed their business processes to use metrics (instead of gut-level instinct) to drive their decisions, it’s unlikely that they’ll accept the results of data mining blindly.

Do you have the data? As funny as this question sounds, it often elicits blank stares. If you haven’t got the data, you can’t mine it. If your data mining business case is to establish customer buying trends, but you don’t have access to customer purchase data, you’ve picked the wrong business case. You have to have data relevant to the problem you’re targeting.

Can your data support data mining? Even with advanced data mining technologies, the old garbage in, garbage out adage still applies. Data mining focuses on the quantity and accuracy of the attribute detail. If your mining activity focuses on customer traits and habits, it’s important to provide as many data attributes about the customer as possible. No business has perfect data; what’s important is knowing the inherent limitations of the data before beginning a data mining effort.

Do you have marketing processes in place that can use this data? I once reviewed a data mining activity that analyzed the pricing of products in a Midwestern state. It proved to be a highly insightful activity; unfortunately, it was worthless. There was no way to reprice telecommunications products regulated by state and federal agencies.

It’s important to understand how the results of a mining activity will be used. In fact, this is something you should review during the requirements gathering step.

Is your problem a data mining problem? As I’ve discussed, data mining provides the ability to identify patterns and new hypotheses about data. Data analysts usually implement it after they have exhausted their ability to identify new business intelligence from the data warehouse.

Because of the hype and visibility of data mining within many different industries, many business users new to decision support are convinced they need data mining. In many instances, what they need is desktop ad hoc query support, not advanced data mining algorithms.

I recently attended a meeting with a specialty retailer. Business users were screaming that they wanted advanced analysis, and the technology group wanted data mining. However, after a short discussion with the marketing analysts, I established that their advanced analysis needs didn’t indicate data mining at all, but rather a way to drill down into their weekly report information. Their requirements in fact pointed to ad hoc and OLAP analysis, not data mining.

Do you have a business champion who can embrace the process and results? The tried and true principle with data warehousing is that without a business champion, your data warehouse won’t succeed. The same holds true for data mining. Without a champion who is interested in new business intelligence, there’s little likelihood that a new technology that challenges traditional thought and practices is going to be accepted.

Do you have the technology infrastructure to support advanced analysis? Data mining analysis is new and complex technology. Consequently, it requires additional hardware, software, and technical skills. Although this should be of no surprise to most IT professionals, it is almost always an issue.

There’s little question that in order to be successful with data mining, you need a lot of detailed data–not summary or aggregated detail, but baseline business detail. As discussed in "To Sample or Not to Sample," anything that filters or rolls up data has the potential of filtering out an important facet of new business intelligence. It’s also important to realize that, even with the relative newness of this technology to the commercial marketplace, data mining consumes a significant amount of processing horsepower and storage.

Data mining tool planning, implementation, and use mandate additional expertise, particularly in areas of data transformation and results analysis. These skills aren’t brain surgery; many vendors offer classes to help users adjust to these new methods of analysis. As with any new technology, data mining comes with a learning curve.

LIGHTS, CAMERA…

Unlike other types of decision support that can be deployed directly to end-user desktops, deployment of data mining relies on different metrics. It is only effective once a business question is identified that existing decision-support technologies cannot address.

Like the companies surveyed by META Group and other market research firms, everyone’s expecting to use data mining. But for many companies, data mining may be a tool in search of an unknown problem, the proverbial hammer looking for a nail. The fact is, planning for data mining means ensuring that your existing data warehouse infrastructure is solid–clean data, good transformation rules, and meaningful metadata. This increases the likelihood that your first foray into data mining will yield meaningful results.

Simply put, you can’t really know how to plan for data mining–or what tool you’ll need–until you’ve defined the problem it’s intended to solve.


 
Evan Levy is the president of Baseline Consulting Group, a worldwide consulting firm specializing in industry-specific business intelligence and database marketing solutions. You can contact him at evanlevy@baseline-consulting.com or through Baseline’s Web site at www.baseline-consulting.com.


Ten Questions to Ask Your Vendor
Before purchasing a data mining tool, be certain it fits your organization’s specific needs by asking prospective tool vendors these questions. Keep in mind, there are no right or wrong answers per se. Every user organization will have different scalability, process, formatting, and data sourcing requirements. Weigh the responses you get from vendors, and select the tool that most closely matches your core requirements.

  1. Which algorithms does your tool support?

  2. What do the results look like?

  3. What data formatting does it require?

  4. How does the tool acquire data?

  5. How does the user or business analyst interact with the tool? Is it a GUI? Is it command-line?

  6. What level of data or statistical analysis experience is necessary to use the tool?

  7. Does your data mining tool support continuous or range value analysis? Can the system identify relative groupings for a continuous set of values, or is the user required to identify those values? For example, does the tool subdivide age ranges, or must the user define the ranges?

  8. Does your tool scale? Can it break problems into multiple concurrent steps? If so, how?

  9. Is your data mining tool business or function focused? A "business-focused" data mining tool focuses on a specific business problem such as "churn." A function-focused tool is more aligned with the type of algorithm (such as clustering) and can usually apply to more than one business problem.

  10. Is it a learning or static model tool? (A static model tool requires the user to identify the specific attributes and their relative weightings. A learning model tool analyzes all available data attributes and determines the appropriate weightings and values itself.)
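Question 7 is easier to discuss with a vendor if you have a picture of what "the tool subdivides the ranges" could mean. One simple possibility is equal-frequency binning, sketched below; real tools may use quite different methods, and the ages and the `equal_frequency_bins` helper are hypothetical.

```python
def equal_frequency_bins(values, bin_count):
    """Split values into bins holding roughly the same number of records.

    Returns the (lowest, highest) value in each bin.
    Assumes there are at least as many values as bins.
    """
    ordered = sorted(values)
    size = len(ordered) / bin_count
    bins = []
    for i in range(bin_count):
        chunk = ordered[round(i * size): round((i + 1) * size)]
        bins.append((chunk[0], chunk[-1]))
    return bins

# Hypothetical customer ages.
ages = [22, 25, 31, 34, 38, 41, 47, 52, 63]
age_ranges = equal_frequency_bins(ages, 3)
print(age_ranges)
```

A tool that requires the user to define the ranges would instead take the boundaries as input, which puts the burden of choosing meaningful groupings on the analyst.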

To Sample or Not to Sample

One of the most heated debates in data mining circles is whether or not to employ data sampling. Sampling is a method by which the data mining engines use only a subset of data to perform the analysis activity. The benefit of this approach is obvious: The data mining involves fewer processing resources because it’s analyzing less data.

Although the concept is very straightforward, the impact to results isn’t as obvious. The whole premise behind data mining is to identify hidden patterns in data. Sampling introduces the risk of omitting hidden patterns.

Nonetheless, sampling has proven to be a very successful statistical analysis strategy. The benefit of sampling is that you can work with a fraction, say 10 percent, of the total data. Assuming your sample reflects the aspects of information contained in the full volume of the data, analyzing only 10 percent of your total data will give you the same insight at lower overhead. You should thus be able to analyze smaller data quantities more exhaustively.

In order to sample effectively, it is important to sample data that is consistent and homogeneous throughout. And that’s the problem. Sampling assumes that the data is homogeneous and can be sampled without losing vital detail. Mining assumes that there are hidden patterns in the data and that you need all the detail to find the hidden patterns. How can you take an accurate sample of data that preserves the informational content of the base data unless you know the content? Knowing the content, by the way, isn’t practical until you mine the data to identify the hidden patterns.
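The risk of omitting a hidden pattern can be put in rough numbers. The sketch below assumes each record is kept in the sample independently with the same probability, which is a simplification of how a real sample would be drawn; the pattern sizes are hypothetical.

```python
def miss_probability(occurrences, sample_rate):
    """Chance that none of the records carrying a pattern lands in the sample,
    assuming each record is sampled independently at sample_rate."""
    return (1 - sample_rate) ** occurrences

# A 10 percent sample, with patterns that appear in 5 and in 50 records.
rare = miss_probability(5, 0.10)
common = miss_probability(50, 0.10)
print(f"pattern in 5 records missed entirely: {rare:.1%}")
print(f"pattern in 50 records missed entirely: {common:.1%}")
```

Even when a few records survive the sample, there may be too few of them for a mining algorithm to treat the pattern as significant, so these figures understate the risk for rare patterns.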

In much the same way that data warehousing has established detailed data as the only sure way to capture reality, the only surefire means of mining data is to use as much detail as possible.

In simplistic terms, sampling’s biggest advocates are the statistical software tool vendors who rely on data sampling techniques in order to extrapolate findings. The fact that most of the data mining tools currently on the market cater to much smaller volumes of data than typically found in a data warehouse has lent a lot of backing to the sampling approach. Sampling’s biggest detractors are vendors and commercial companies who want the data mining tool to consider all their data, not just a sample.

Sampling’s proponents will insist, and rightly so, that sampling is a statistically valid way to apply analysis findings. Does that mean, however, that you should join the sampling bandwagon? It depends on several factors:

  • First, how much data do you have? The answer to this question is probably the greatest factor. If you haven’t got a platform that can support mining execution of an entire data set, sampling or subsetting may be the only reasonable solutions.

  • Can the data be subsetted based on the business problem? Although a utility may have 20 million customers in its five-state region, practical issues dictate that it can only market and manage customers along the state boundaries (for legal reasons). This is a perfect situation for subsetting the data into five smaller data sets. In situations in which a business is managed at a divisional or regional level, subsetting the data into multiple sets along the boundaries of operations is actually beneficial. Identifying customer traits and profiles unique to a region will add more value than identifying a more generalized profile across the entire company.

  • Do you need all the data in the first place? I can’t count the number of times that one of my initial mining activities used all the data in the warehouse. Be careful to focus on data that’s useful and valuable to the outcome. I once worked on a project for a prison system that initially included race and religion as part of an analysis; however, such analysis is against the law. New rules and business practices cannot be created simply based on the data you’ve got.

  • Will your analysis technique require lots of data? Predictive modeling can use every piece of data that you can throw at it. Association and sequence algorithms have practical limits. Additionally, mining focuses on specific data elements, not necessarily the entire warehouse. The golden rule of data mining should be "concentrate first on how you will use the data and what your business drivers are." Only then should you decide on the analysis technique, the specific tool, and whether or not to sample.




    Copyright by Teradata Corporation 2001-2007.