Keeping your customers
Advanced analytics makes it possible.
by Tim Miller & Arlene Zaima
What if you could predict who among your current customers will no longer be customers next month? The wireless telecommunications industry, for instance, experiences an average churn rate of 2.4% per month; for a carrier with 20 million customers, that translates into a loss of as many as 480,000 customers in just one month. With average monthly revenue of $53 per user, that means a large wireless company can lose more than $25.4 million every month.
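The arithmetic behind those figures can be checked in a few lines. The sketch below uses only the numbers quoted above; the variable names are ours, and the result covers the first month of lost revenue only.

```python
# Figures quoted in the article; variable names are illustrative.
customers = 20_000_000
monthly_churn_rate = 0.024      # 2.4% per month
revenue_per_user = 53           # average monthly revenue in dollars

# Customers lost in one month, and the monthly revenue they represented.
lost_customers = round(customers * monthly_churn_rate)
lost_revenue = lost_customers * revenue_per_user

print(f"{lost_customers:,} customers lost per month")   # 480,000
print(f"${lost_revenue:,} in monthly revenue lost")     # $25,440,000
```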
The problem is not limited to telecommunications providers. Businesses from a variety of industries have incorporated advanced analytic techniques to help mitigate risk and minimize loss, improving the bottom line.
Advanced analytics, also known as predictive or descriptive analytics, provides intelligence in the form of predictions, descriptions, scores and profiles that help businesses better understand customer behavior and business trends. Advanced analytical capabilities are commonly used to drive a wide range of applications, from right-time tactical applications such as fraud detection to strategic analysis such as market basket analyses and customer segmentation. Models supporting right-time or tactical applications typically process smaller data volumes; however, they are expected to return a response within seconds. In contrast, strategic models process large volumes of complex data and are usually run in batch mode, both because larger volumes take longer to process and because these models are not necessarily time critical.
Traditional analytic environments are based on analytic servers that are suitable for a large number of applications with small to medium data volume requirements. However, businesses in the retail, financial, insurance, telecommunications and transportation industries typically have such large volumes of data that model deployment becomes a problem. Data must be exported, merged, reconciled, transformed and manipulated on a server before the analytic model can begin processing. In many cases, this process proves too costly. Businesses are demanding that database vendors and analytic providers bridge the gap between data and analytics by providing a scalable environment that addresses business problems.
Emerging standards may go a long way in helping to bridge this gap. One of these standards, known as predictive model markup language (PMML), is an XML standard being developed by the Data Mining Group, a vendor-led consortium established in 1998 to develop data mining standards. NCR co-developed the initial PMML specification along with Angoss, Magnify, SPSS and The National Center for Data Mining at the University of Illinois at Chicago.
PMML enables the definition and subsequent sharing of predictive models between applications. It represents and describes data mining and statistical models, as well as some of the operations required for cleaning and transforming data prior to modeling. PMML aims to provide enough infrastructure for an application to be able to produce a model (the PMML producer) and another application to consume it (the PMML consumer) simply by reading the PMML data file. This means that a model developed in a desktop data mining tool can be deployed or scored against an entire data warehouse (see "Defining PMML" at bottom).
Although PMML is a great step forward, it is not without its flaws. According to Robert Grossman, director of the National Center for Data Mining and chair of data mining vendor Magnify, Inc., "The challenge now being tackled is to extend PMML to cover the process of cleaning, transforming and aggregating data, a task that is more complex and not so easily standardized."
Teradata recognized this limitation early on: if the PMML document could not represent the analytic variables that were input to the analytic tools, it would be nearly impossible to consume PMML for scoring predictive models. This is because the deployment (scoring phase) of a predictive model requires the existence of the same variables upon which the model was built.
Even prior to developing a PMML consumer, Teradata developed several components that help build the analytic data set (ADS) (see "An ADS Primer" at bottom). The Teradata ADS Generator has the complete capabilities to explore, clean, transform and aggregate highly normalized data within a data warehouse into a format that is acceptable to an analytic tool.
The form of the data in the ADS is usually one row per subject being modeled, typically a customer, household or account. For example, if that subject is "household" and there are 100 attributes the analytic tool must consider when creating a model, then the Teradata ADS Generator can transform transaction-level, product and/or account-level data down to the household level, producing a data structure with one row per household with 100 attributes or variables of that household.
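To make the "one row per household" shape concrete, here is a minimal sketch of that kind of roll-up. It uses an in-memory SQLite database as a stand-in for the warehouse; the table, columns and sample values are invented for illustration, and this is not the SQL that Teradata ADS Generator itself produces.

```python
import sqlite3

# SQLite stands in for the data warehouse; schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (household_id INTEGER, account_id INTEGER, amount REAL);
INSERT INTO transactions VALUES
    (1, 10, 25.0), (1, 10, 75.0), (1, 11, 50.0),
    (2, 20, 40.0), (2, 20, 60.0);
""")

# Roll transaction-level rows up to one row per household,
# with one column per derived variable.
ads_rows = conn.execute("""
    SELECT household_id,
           COUNT(DISTINCT account_id) AS num_accounts,
           COUNT(*)                   AS num_transactions,
           SUM(amount)                AS total_amount,
           AVG(amount)                AS avg_amount
    FROM transactions
    GROUP BY household_id
    ORDER BY household_id
""").fetchall()

for row in ads_rows:
    print(row)
```

Each output tuple is one household with four variables; a production ADS would carry many more columns built the same way.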
Through the use of the Teradata ADS Generator, users can explore and assemble their ADS using as much data as required and then sample that down to their company-standard analytic modeling tool.
Assuming that the modeling tool can produce PMML-compliant models, users can build and evaluate models in a familiar environment, represent those models in PMML and then pass the PMML back to the Teradata ADS Generator, which can consume and score the models. This creates an optimal analytic environment for model development and deployment in Teradata.
Let's walk through the optimal analytic environment diagram (Figure 1) step by step, working through the model development and deployment process and identifying the various elements that enable an optimal analytic environment.
Model development
Analytic models can take weeks or months to develop and are typically refreshed every six to 12 months, depending on the business requirements and changes in your data. Teradata ADS Generator can significantly reduce the time it takes to build an analytic model by leveraging the parallelism of the database engine and eliminating data movement. In addition, the optimal ADS and the intelligence derived from the models are stored directly in the Teradata Database, making them easily accessible to users throughout the company. The three phases of model development are:
Data exploration
Highly repetitive and data-intensive data exploration tasks are best performed directly in the database to eliminate the cost associated with data movement. During the data exploration phase, the analytic modeler searches for patterns and anomalies within the data, learning more with each iteration of analysis. Techniques commonly used during data exploration are descriptive statistics and visualization techniques.
Data preprocessing
The value of the analytic model is directly proportional to the relevance and computational consistency of the data provided. To create a better analytic model, analysts begin to build an ADS by creating predictive variables, replacing data that is missing or invalid, as well as transforming the data into a format that is suitable for analytic tools. For example, many analytic tools require non-numeric data to be converted into numeric representation or scaled within a particular range of values. The resulting ADS is typically a large flat table that can have hundreds of columns and millions of rows. This requires manipulating large volumes of data and can be performed most efficiently directly in the Teradata Database.
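The three preprocessing steps named above (replacing missing values, converting non-numeric data to numbers, and scaling to a range) can be sketched in plain Python. The records, the mean-fill rule and the integer coding scheme below are assumptions chosen for illustration; in practice the equivalent logic would run as SQL inside the database.

```python
from statistics import mean

# Hypothetical customer records; None marks a missing value.
raw = [
    {"customer_id": 1, "plan": "basic",   "minutes": 120.0},
    {"customer_id": 2, "plan": "premium", "minutes": None},
    {"customer_id": 3, "plan": "basic",   "minutes": 480.0},
    {"customer_id": 4, "plan": "family",  "minutes": 300.0},
]

# 1. Missing values: fill with the mean of the observed values (one of many options).
observed = [r["minutes"] for r in raw if r["minutes"] is not None]
fill_value = mean(observed)

# 2. Non-numeric data: map each category to an integer code (sorted for repeatability).
plan_codes = {plan: code for code, plan in enumerate(sorted({r["plan"] for r in raw}))}

# 3. Scaling: min-max scale minutes into the range [0, 1].
lo, hi = min(observed), max(observed)

def preprocess(row):
    """Return one ADS row with numeric, scaled, gap-free variables."""
    minutes = fill_value if row["minutes"] is None else row["minutes"]
    return {
        "customer_id": row["customer_id"],
        "plan_code": plan_codes[row["plan"]],
        "minutes_scaled": (minutes - lo) / (hi - lo),
    }

ads = [preprocess(r) for r in raw]
for row in ads:
    print(row)
```

Integer codes imply an ordering among categories, which some algorithms will misread; one indicator column per category is a common alternative.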
Model building, testing and validation
Traditionally, data mining tools have focused much of their attention on model development, where algorithms and advanced statistical techniques are applied to the ADS to search for interesting patterns that can be used to predict behavior and trends. This, like the previous steps, is iterative, with the models being refined with each iteration. However, this phase is usually CPU intensive and performed on three data samples: training, test and validation. Teradata, understanding the need to support multiple tools to address the business problem requirements and wide range of user skills, recommends an open data mining environment. An open environment enables users to have the flexibility to use their favorite data mining tool against the Teradata Database to leverage the efficiencies of in-database data mining for the data intensive tasks.
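As an illustration of the three samples mentioned above, the sketch below splits a list of subject identifiers into training, test and validation sets. The 60/20/20 proportions and the fixed random seed are assumptions; the article does not prescribe either.

```python
import random

def split_samples(ids, train=0.6, test=0.2, seed=42):
    """Shuffle ids and split them into training, test and validation samples.

    Whatever is left after the training and test shares goes to validation.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],
    )

training, testing, validation = split_samples(range(1000))
print(len(training), len(testing), len(validation))
```

Because the samples are far smaller than the full ADS, this is the one phase where moving data out to a desktop tool is usually affordable.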
Model deployment
Model deployment is the application of predictive models to current data to solve business problems. Deployment is quite different from development and is performed on a monthly, weekly, daily, hourly or near real-time basis. Some models, such as the churn model described earlier, may be run on a monthly basis against your entire customer base to reduce churn; other models, such as fraud detection, must be run on a near real-time basis so you can prevent fraudulent transactions before they occur.
Traditionally, models were built to execute on analytic servers, which means that select data elements must be duplicated from the data warehouse onto the server before scoring can begin. How long will it take to extract and process all 20 million customer records?
PMML helps resolve this issue by bringing the model to your data. In addition, if you've created the ADS with Teradata ADS Generator, you can easily refresh this data by executing the same process stored in the tool's metadata. For model deployment, this is a key point, as PMML assumes that the ADS has already been built.
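One way to picture "bringing the model to your data" is to translate a model's parameters into a SQL expression, so every row is scored where it lives. The sketch below does this for a simple linear model. The coefficients, table and column names are made up, SQLite stands in for the warehouse, and this is not a description of how the Teradata PMML consumer is actually implemented.

```python
import sqlite3

# Hypothetical model parameters, as a consumer might read them from a model file.
intercept = 0.10
coefficients = {"dropped_calls": 0.05, "months_on_contract": -0.01}

def build_scoring_sql(table, key, intercept, coefficients):
    """Translate a linear model into one SELECT that scores every row in the table.

    Identifiers are interpolated directly, so they must come from a trusted source.
    """
    terms = [repr(intercept)] + [f"({coef!r} * {col})" for col, coef in coefficients.items()]
    return f"SELECT {key}, {' + '.join(terms)} AS churn_score FROM {table} ORDER BY {key}"

# SQLite stands in for the data warehouse; the ADS below is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_ads (customer_id INTEGER, dropped_calls INTEGER, months_on_contract INTEGER);
INSERT INTO customer_ads VALUES (1, 4, 10), (2, 0, 24);
""")

scoring_sql = build_scoring_sql("customer_ads", "customer_id", intercept, coefficients)
scores = conn.execute(scoring_sql).fetchall()

print(scoring_sql)
for customer_id, churn_score in scores:
    print(customer_id, round(churn_score, 4))
```

No rows leave the database: only the short SQL statement travels to the data, and only the scores come back.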
Teradata has partnered with the leading data mining and analytic application providers to deliver on the vision of an open, scalable analytic environment. Teradata and SAS have developed a PMML interface that allows SAS Enterprise Miner to export models as PMML in order to score them directly in the Teradata Database.
Teradata has also partnered with Fair Isaac to deliver analytic applications that are optimized for Teradata. In addition, Teradata and Fair Isaac are developing a PMML interface that allows businesses to deploy enterprise decision management models such as Fair Isaac's Scorecard and other patented algorithms directly in the Teradata Database (see "Fair Isaac partnership enables enterprise decision management vision" at bottom).
Teradata ADS Generator with the PMML integration with Fair Isaac and SAS bridges the gap between the advanced analytic models and an enterprise data warehouse to deliver the optimal analytic environment for complex decisioning. With Teradata's analytic environment, analysts no longer have to choose between their preferred tools and cost efficiencies of an in-database solution. You can have the best of both worlds.
Defining PMML
PMML (Predictive Model Markup Language) is an XML-based specification that enables the definition and sharing of predictive models between applications. A predictive model is a statistical model that is designed to predict the likelihood of target occurrences given established variables or factors. Increasingly, predictive models are being used to forecast business-related phenomena, such as customer behavior. The PMML specifications establish a vendor-independent means of defining these models, so that problems with proprietary applications and compatibility issues can be circumvented. PMML-compliant XML documents consist of the following major constructs:
| Feature | Function |
| Data Dictionary | Defines the data to the model and specifies each data attribute's type and value range. |
| Mining Schema | Defines attribute information specific to a certain model. It specifies an attribute's usage type, whether it be active or independent (an input of the model), predicted or dependent (an output of the model) or supplementary (descriptive information that is ignored by the model). |
| Transformation Dictionary | Contains simple algorithm-specific data transformations such as normalization (map values to numbers), discretization (map continuous values to discrete values), value mapping (map discrete values to discrete values) and aggregation (simple averages and counts). |
| Models | Identifies model parameters for regression models, cluster models, decision tree models, neural networks, Bayesian models, association rules and sequence models. |
Each PMML construct supports a mechanism for extending the content of a model. Liberal use of such extensions requires that vendors who produce PMML-based models collaborate closely with vendors who wish to consume that PMML.
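To show how these constructs fit together, the sketch below embeds a small, hand-written PMML fragment and reads it the way a very simple consumer might. The field names and coefficients are invented, the fragment follows the general PMML 3.0 layout but has not been validated against the official schema, and the reader handles only a plain regression table.

```python
import xml.etree.ElementTree as ET

# Hand-written illustrative PMML; names and numbers are hypothetical.
PMML_DOC = """<PMML version="3.0" xmlns="http://www.dmg.org/PMML-3_0">
  <Header copyright="example" description="Illustrative churn score"/>
  <DataDictionary numberOfFields="3">
    <DataField name="dropped_calls" optype="continuous" dataType="double"/>
    <DataField name="months_on_contract" optype="continuous" dataType="double"/>
    <DataField name="churn_score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="dropped_calls" usageType="active"/>
      <MiningField name="months_on_contract" usageType="active"/>
      <MiningField name="churn_score" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="dropped_calls" coefficient="0.25"/>
      <NumericPredictor name="months_on_contract" coefficient="-0.125"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-3_0"}
root = ET.fromstring(PMML_DOC)

# Data Dictionary: every field known to the document, with its data type.
data_fields = {f.get("name"): f.get("dataType")
               for f in root.findall("pmml:DataDictionary/pmml:DataField", NS)}

# Mining Schema: how this particular model uses each field.
model = root.find("pmml:RegressionModel", NS)
usage = {f.get("name"): f.get("usageType")
         for f in model.findall("pmml:MiningSchema/pmml:MiningField", NS)}

# Model parameters: intercept plus one coefficient per input field.
table = model.find("pmml:RegressionTable", NS)
intercept = float(table.get("intercept"))
coefs = {p.get("name"): float(p.get("coefficient"))
         for p in table.findall("pmml:NumericPredictor", NS)}

def score(row):
    """Apply the regression table to one ADS row (a dict of input values)."""
    return intercept + sum(coef * row[name] for name, coef in coefs.items())

print(usage)
print(score({"dropped_calls": 4, "months_on_contract": 8}))
```

Note that the consumer needs the ADS row to contain exactly the fields the Mining Schema marks as active, which is why the same variables must exist at scoring time.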
An ADS Primer
Analytic modeling tools require that the data being modeled is presented in a particular format. This analytic data set (ADS) contains all the attributes of the subject being modeled (usually customer, household or account identifiers) in a single row of data.
Typically the foundation of a Teradata Warehouse is a highly normalized logical data model, very different in structure from an ADS. The physical implementation should resemble this logical model, but it may be slightly denormalized in practice. The ADS, however, is almost totally denormalized. The creation of attributes within an ADS, like business reports or KPIs, often involves massive amounts of aggregation. This is not core data, but it is absolutely required for advanced analytics.
An ADS has a focus, starting with what is known and identifiable. Data from past time periods is assembled in the ADS to identify subjects with a behavior of interest (for example, customers who have terminated their cellular phone service in the last six months or households that have purchased additional phones for their calling plan). Other behavioral attributes are added, as is demographic data, creating a wide, single row of information. This data is combined with the same attributes for those subjects who did not display the behavior.
Together, these data sets make up the ADS. An analytical modeling tool can then help determine how the characteristics in the ADS combine to best predict and/or describe that behavior.
Because mature data warehouses have many subject areas, each of which may contain important data elements, the SQL required to build the ADS can be very complex, particularly when joining many tables together. Teradata ADS Generator was developed to optimize the creation of the ADS in a Teradata environment.
Teradata ADS Generator is a set of components that allows a user to visually create the attributes or variables in an ADS, specifying all join paths required and dimensioning the variables in as optimal a way as possible. For example, a variable such as "Average Minutes of Usage" can be created via a drag-and-drop interface and dimensioned by the last four quarters, yielding four variables representing the average minutes of usage in Q1-Q4.
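Dimensioning a variable by quarter amounts to generating one conditional aggregate per quarter. The sketch below builds such SQL for "Average Minutes of Usage" and runs it against an in-memory SQLite table. The schema and data are invented, and the SQL that Teradata ADS Generator emits may look quite different.

```python
import sqlite3

# SQLite stands in for the warehouse; table and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE minutes_of_usage (customer_id INTEGER, quarter INTEGER, minutes REAL);
INSERT INTO minutes_of_usage VALUES
    (1, 1, 100.0), (1, 1, 200.0), (1, 2, 300.0), (1, 3, 50.0), (1, 4, 150.0),
    (2, 1, 400.0), (2, 2, 200.0), (2, 2, 600.0), (2, 4, 80.0);
""")

# One AVG(CASE ...) column per quarter turns a single variable into four.
quarters = [1, 2, 3, 4]
columns = ",\n       ".join(
    f"AVG(CASE WHEN quarter = {q} THEN minutes END) AS avg_minutes_q{q}"
    for q in quarters
)
sql = (
    f"SELECT customer_id,\n       {columns}\n"
    "FROM minutes_of_usage\nGROUP BY customer_id\nORDER BY customer_id"
)

rows = conn.execute(sql).fetchall()
print(sql)
for row in rows:
    print(row)
```

Customer 2 has no usage in the third quarter, so that variable comes back as NULL (None in Python), which is exactly the kind of gap the cleansing step has to fill.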
Further, Teradata ADS Generator enables the transformation and cleansing of data, which may be a requirement of the analytic algorithm being used. It can create the required SQL to transform, say, non-numeric data into a numeric equivalent as well as replace NULL values in a variety of ways.
The variable creation and transformation capabilities of Teradata ADS Generator allow those doing advanced analytics to build up their initial analytic data sets in a parallel, scalable manner. Just as important, though, is the ability to refresh the ADS to present the same variables to a model built outside of Teradata and represented in PMML. Together, Teradata ADS Generator and PMML provide an optimal analytic environment.
Fair Isaac partnership enables enterprise decision management vision
Enterprise decision management (EDM) ensures that every operational decision made within a company is optimal not only in the context of meeting business objectives, but also in the context of making consistent decisions across the enterprise. EDM also ensures that the correct decisions are made as quickly as possible. It provides significant competitive advantage to companies by equipping them with the means to not only execute better decisions than the competition, but to do so faster.
EDM's architectural framework consists of decision support, using Fair Isaac's Model Builder and Decision Optimizer products, and decision execution, using Fair Isaac Blaze Advisor software.
Model Builder is designed to support the modeling life cycle through data management, model development and continuous adjustment. It is also designed to automate modeling tasks and facilitate the reuse of models and modeling components across multiple projects. Model Builder provides an extensive library of data-mining and predictive analytics capabilities that enable the development of any type of predictive model. A key element of Model Builder is enabling models to be rapidly deployed into decision management applications, eliminating costly manual handoffs to IT and ensuring that the competitive advantage of new models is leveraged faster.
Decision Optimizer is a software environment that encapsulates Fair Isaac's mathematical optimization procedures. These procedures search for the best business decision strategy to meet business objectives from among all possible decision strategies given real-world operational complexities, resource constraints and market uncertainties. Decision Optimizer can be customized to meet the needs of a specific business problem with a user-friendly interface that allows decision makers to easily navigate to the optimal solution given the parameters of the business problem.
Blaze Advisor is a rules management technology for managing and automating operational business decision processes. Blaze Advisor is a comprehensive software solution that covers the entire process for developing, deploying and maintaining rules-based decision management applications. Blaze Advisor lets business users and technical staff work together to define and update rules processes, and it provides sophisticated methods for expressing business rule logic, including decision trees, decision tables and scorecards. Using Blaze Advisor, companies can rapidly deploy optimized decision strategies within business systems throughout the enterprise.
Teradata's Enterprise Data Warehouse (EDW) is an important enabler in EDM's promise of rapidly delivering the ideal decisions to frontline decision makers. For example, Teradata, through its Analytic Data Set (ADS) Generator, enables analytic users to rapidly determine which variables are the most important in creating predictive models.
Typically, analytic users start with hundreds of candidate variables that can be placed in a model, and then work to determine the 10 or so best variables that provide the most predictive value. This task has been made much easier using Teradata's graphical approach to programming variables within Teradata ADS Generator, turning a task that could take several worker-months to accomplish into one that can be done in worker-days or even worker-hours, providing a significant time advantage in getting high-value analytic models into production.
Another aspect of Fair Isaac's partnership with Teradata is the use of predictive model markup language (PMML) to run predictive models directly within the Teradata EDW, eliminating the need to go through complicated and time-consuming data extraction processes.
While the need for the right data to make decisions is evident, the means of achieving this goal is not trivial, particularly in the context of supporting the key elements of EDM to make better decisions faster. Teradata enhances EDM by providing the necessary database technology to capture enterprise data, bring it to the decision point, and not only support but also accelerate Fair Isaac's analytical capabilities.
-Rahul Asthana, director of product marketing, Fair Isaac
Tim Miller is the engineering manager for Teradata Profiler, Teradata ADS Generator and Teradata Warehouse Miner. He is responsible for the development of all the components of the Teradata Warehouse Miner suite of products, including the PMML consumers for SAS and Fair Isaac.
Arlene Zaima is Teradata's advanced analytics program manager. She is responsible for the development and execution of worldwide marketing initiatives and program development for Teradata advanced analytic solutions.