Objectives and accuracy in machine learning

Your machine learning objective must be to use insight to change the way you do business.

September 6, 2017 5 min read

We get to go to a lot of conferences. And we’re always amazed at how many vendors and commentators stand up at events and trade shows and say things like, “The objective of analytics is to discover new insight about the business”.

Let us be very clear. If the only thing that your analytic project delivers is insight, it has almost certainly failed. Your objective must not be merely to discover something that you didn’t know, or to quantify something that you thought you did — rather it must be to use that insight to change the way you do business. If your model never leaves the lab, there can never be any return on your investment in data and analytics.

“Analytics must aim to deliver insight to change the way you do business”

The goal of machine learning is often — though not always — to train a model on historical, labelled data (i.e., data for which the outcome is known) in order to predict the value of some quantity on the basis of a new data item for which the target value or classification is unknown. We might, for example, want to predict the lifetime value of customer XYZ, or to predict whether a transaction is fraudulent or not.

Before we can use an analytic model to change the way we do business, it has to pass two tests. Firstly, it must be sufficiently accurate. Secondly, we must be able to deploy it so that it can make recommendations and predictions on the basis of data that are available to us — and sufficiently quickly that we are able to do something about them.

Some obvious questions arise from all of this. How do we know if our model is “good enough” to base business decisions on? And since we could create many different models of the same reality — of arbitrary complexity — how do we know when to stop our modelling efforts? When do we have the most bang we are ever going to get, so that we should stop throwing more bucks at our model?

So far, so abstract. Let’s try to make this discussion a bit more concrete by looking at some accuracy metrics for a real-world model that one of us actually developed for a customer.

A working example of machine learning

The business objective in this particular case was to avoid delays and cancellations of rail services by predicting train failures up to 36 hours before they occurred. To do this, we trained a machine learning model on the millions of data points generated by the thousands of sensors that instrument the trains to identify the characteristic signatures that had preceded historical failure events.

We built our model using a training data set of historical observations — sensor data from trains that we labelled with outcomes extracted from engineers’ reports and operations logs. For the historical data, we know whether the train failed — or whether it did not.

In fact, we didn’t use all of our labelled historical data to train our model. Rather, we reserved some of that data and ring-fenced it in a so-called “holdout” data set. That gives us a set of data unseen by the model that we can use to test the accuracy of our predictions and to make sure that our model does not “over-fit” the training data.
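
To make the idea of a holdout set concrete, here is a minimal Python sketch. It assumes a simple random split; the function name, the 20 percent holdout fraction and the toy records are illustrative assumptions, not details of the rail project.

```python
import random


def split_holdout(records, holdout_fraction=0.2, seed=42):
    """Shuffle labelled records and reserve a fraction of them as a holdout set."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(records)  # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    holdout = shuffled[:n_holdout]   # never shown to the model during training
    training = shuffled[n_holdout:]  # used to fit the model
    return training, holdout


# Hypothetical labelled observations: (sensor_reading, failed_within_36_hours)
labelled = [(i * 0.1, i % 10 == 0) for i in range(100)]
training, holdout = split_holdout(labelled)
print(len(training), len(holdout))  # 80 20
```

For time-ordered sensor data, a split by date rather than at random may be the safer choice, because it stops the model from being tested on observations that precede its training data.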
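
Figure: Confusion matrix of predicted vs. actual results on the holdout data set (13,435 true negatives, 54 false positives, 82 false negatives and 443 true positives).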

The table shown above is a “confusion matrix” resulting from the application of the model built from the training data set to the holdout data set. It enables us to understand what we predicted would happen versus what actually did happen.

You can see that our model is 84 percent accurate in predicting failures: that is, we correctly predicted that a failure would occur where one subsequently did occur within the next 36 hours in 443 out of 525 (82+443) cases. Data scientists call this particular measure recall, or sensitivity. That’s a pretty good rate for this sort of model, and certainly accurate enough for the model to be useful for our customer.

Just as important as the overall accuracy, however, are the numbers of so-called type-one errors (false positives) and type-two errors (false negatives). In our case, we incorrectly predict 54 failures where none occur. These errors represent 54 situations where we might potentially have withdrawn a train from service for maintenance it did not need. Equally, there are 82 type-two errors. That means that for every 14,014 (13,435+54+82+443) trips made by our trains, we should anticipate that they will unexpectedly fail on 82 occasions, or 0.6 percent of the time.
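
The arithmetic behind these figures can be checked in a few lines of Python. The four counts are the ones quoted in the text; the variable names are our own.

```python
# Counts quoted in the article for the holdout data set.
true_negatives = 13_435  # no failure predicted, none occurred
false_positives = 54     # failure predicted, none occurred (type-one error)
false_negatives = 82     # no failure predicted, one occurred (type-two error)
true_positives = 443     # failure predicted, one occurred

total_trips = true_negatives + false_positives + false_negatives + true_positives
actual_failures = true_positives + false_negatives
predicted_failures = true_positives + false_positives

# Share of real failures that the model caught.
failure_detection_rate = true_positives / actual_failures
# Share of all trips that end in a failure the model did not predict.
missed_failure_rate = false_negatives / total_trips
# Share of predicted failures that turned out to be false alarms.
false_alarm_share = false_positives / predicted_failures

print(total_trips)                      # 14014
print(f"{failure_detection_rate:.1%}")  # 84.4%
print(f"{missed_failure_rate:.2%}")     # 0.59%
print(f"{false_alarm_share:.1%}")       # 10.9%
```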

Model inaccuracy costs money

Because both false positive and false negative errors incur costs, we have to be very clear about what the acceptable tolerance for these kinds of errors is. When reviewing the business case for deploying a new model, ensure that these costs have been properly accounted for.
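
As a rough sketch of how such costs might be tallied, the Python below multiplies each error count by a unit cost. The error counts come from the confusion matrix above; the two unit costs are invented placeholders, so substitute figures from your own business case.

```python
def error_cost(false_positives, false_negatives,
               cost_per_false_positive, cost_per_false_negative):
    """Total cost of a model's errors, given a unit cost for each error type."""
    return (false_positives * cost_per_false_positive
            + false_negatives * cost_per_false_negative)


# Hypothetical unit costs (placeholders, not figures from the rail project):
# an unnecessary withdrawal from service vs. an unexpected in-service failure.
UNNECESSARY_MAINTENANCE_COST = 5_000
UNEXPECTED_FAILURE_COST = 50_000

total_error_cost = error_cost(
    false_positives=54,
    false_negatives=82,
    cost_per_false_positive=UNNECESSARY_MAINTENANCE_COST,
    cost_per_false_negative=UNEXPECTED_FAILURE_COST,
)
print(total_error_cost)  # 4370000
```

With unit costs this lopsided, the 82 missed failures dominate the total, which is one reason why the tolerance for each kind of error has to be agreed separately.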

If you are a business leader who works with data scientists, you may encounter lots of different shorthand for these and related constructs: precision, recall, specificity, accuracy, odds ratio, receiver operating characteristic (ROC), area under the curve (AUC) and so on. All of these are measures of model quality. This is not the place to describe them all in detail (see the Provost and Fawcett book or Salfner, Lenk and Malek’s slightly more academic treatment in the context of predicting software system failures), but be aware that these different measures are associated with different trade-offs that are both a trap for the unwary and an opportunity for the unscrupulous. Caveat emptor!
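
One way to see the trap is to compare our model with a hypothetical “never-alarm” model that simply predicts no failure for every trip. The sketch below uses the standard definitions of four of these measures; the helper function and the never-alarm baseline are our own illustration.

```python
def quality_measures(tp, fp, fn, tn):
    """Return four common measures of model quality from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # By convention here, an empty denominator yields 0.0
        # (e.g. precision for a model that never predicts a failure).
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Counts from the article's confusion matrix.
article_model = quality_measures(tp=443, fp=54, fn=82, tn=13_435)
# Hypothetical baseline: every one of the 14,014 trips is predicted "no failure".
never_alarm = quality_measures(tp=0, fp=0, fn=525, tn=13_489)

for name in article_model:
    print(f"{name:<12} model={article_model[name]:.1%}  "
          f"never-alarm={never_alarm[name]:.1%}")
```

The never-alarm model scores about 96 percent on accuracy and 100 percent on specificity, yet it catches no failures at all (recall of zero). A vendor who quotes only the flattering measure is not lying, but is not telling you much either.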

When we have satisfied ourselves that our model is sufficiently accurate, we need to establish whether it can actually be deployed, and — crucially — whether it can be deployed so that the predictions that it makes are actionable. This is the second test that we referred to at the start of this discussion.

In the case of our preventative maintenance model, deployment is relatively simple: as soon as trains return to the depot, data from the train sensors are uploaded and scored by our model. If a failure is predicted, we can establish the probability of the failure and the affected components and schedule emergency preventative maintenance, as required. This particular model is able to predict the failure of a train up to 36 hours in advance, so waiting the three hours until the end of the journey to collect and score the data is no problem. But in other situations (an online application for credit, for example, where we might want to predict the likelihood of default and price the loan accordingly) we might need to be able to collect and score data continuously in order for our model to make predictions that are available sufficiently quickly for them to be actionable without disrupting the way that we do business.
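
The timing argument can be reduced to a simple check: is there enough time left to act once the data have been collected and scored? The sketch below is only an illustration. The 36-hour window and the three-hour delay come from the text; the function, the 12-hour maintenance response time and the one-minute credit decision window are assumptions of ours.

```python
def is_actionable(decision_window_hours, data_delay_hours, response_time_hours):
    """Return True if a prediction arrives early enough to be acted upon."""
    time_left = decision_window_hours - data_delay_hours
    return time_left >= response_time_hours


# Rail example: up to 36 hours of warning (best case), three hours to collect
# and score the data, and a hypothetical 12 hours to schedule maintenance.
train_ok = is_actionable(decision_window_hours=36,
                         data_delay_hours=3,
                         response_time_hours=12)

# Online credit example: a hypothetical one-minute decision window cannot
# absorb a three-hour batch delay, so scoring has to be near-real-time.
credit_ok = is_actionable(decision_window_hours=1 / 60,
                          data_delay_hours=3,
                          response_time_hours=0)

print(train_ok, credit_ok)  # True False
```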

As we explained in a previous episode of this blog, this may mean that we need to construct a very robust data pipeline to support near-real-time data acquisition and scoring — which is why good data engineering is such a necessary and important complement to good data science in getting analytics out of the lab and into the frontlines of your business.

Dr. Frank Säuberlich leads the Data Science & Data Innovation unit of Teradata Germany. His responsibilities include making the latest market and technology developments available to Teradata customers. Currently, his main focus is on topics such as predictive analytics, machine learning and artificial intelligence.
Following his studies of business mathematics, Frank Säuberlich worked as a research assistant at the Institute for Decision Theory and Corporate Research at the University of Karlsruhe (TH), where he was already dealing with data mining questions.

His professional career has included positions as a senior technical consultant at SAS Germany and as a regional manager for customer analytics at Urban Science International. Frank has been with Teradata since 2012. He began as an expert in advanced analytics and data science in the International Data Science team. Later on, he became Director Data Science (International).

Martin Willcox has over 27 years of experience in the IT industry and has twice been listed in dataIQ’s “Data 100” as one of the most influential people in data-driven business. Before joining Teradata, Martin held data leadership roles at a major UK retailer and a large conglomerate. Since joining Teradata, Martin has worked globally with over 250 organisations to help them realise increased business value from their data. He has helped organisations develop data and analytic strategies aligned with business objectives; designed and delivered complex technology benchmarks; pioneered the deployment of “big data” technologies; and led the development of Teradata’s AI/ML strategy. Originally a physicist, Martin has a postgraduate certificate in computing and continues to study statistics.
