Any machine learning algorithm aims to tell us something about the future on the basis of what it has learned from the data it has been ‘trained’ on. The aim is to generalise from known data to related but unknown data. What has been learned from last quarter’s sales, for example, can be used to make predictions about this quarter’s sales. This allows us to make predictions on data the model has never seen.

So, one obvious test of how well our models do is when they are used for real: have we correctly predicted the weather? Have our customers bought our targeted special offers? Did our smart motorway increase average traffic speed and reduce jams? The extent to which a model’s outputs systematically fail to match reality is known as bias. More precisely, it is the difference between the model’s average output, taken over the different training sets it could have been trained on, and the true value. It indicates how well the model has captured the underlying pattern in the training data: a model with high bias misses that pattern even on the data it was trained on.


Before we ever get to this real kind of evaluation, there is a prior assessment we perform on our models to measure how well they are performing. We do this by looking at what is known as variance. When we test our model on a fresh data set, how similar are the results to those we obtained on the training set we used to produce the model initially? Variance is the variability of the machine learning model’s output across different training sets. It represents the sensitivity of the results to the particular choice of data set (loosely, the estimation error, whereas bias corresponds to the approximation error). In short, it shows how robust the model is to deviations from the training set and thus indicates how well it can generalise.
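As a rough illustration of these two quantities, the sketch below estimates bias and variance at a single point by training two very simple models on many simulated training sets. Everything in it is an assumption made for the example: the "true" function, the noise level, the two toy models and all function names are hypothetical and are not taken from any particular library.

```python
import random
import statistics


def true_f(x):
    # Assumed "reality" for this sketch: a simple quadratic.
    return x * x


def make_training_set(rng, n=20, noise_sd=0.3):
    # Draw n noisy observations of the true function.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def predict_constant(xs, ys, x0):
    # Very rigid model: always predicts the mean of the training targets.
    return statistics.fmean(ys)


def predict_nearest(xs, ys, x0):
    # Very flexible model: copies the target of the closest training point.
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]


def bias_and_variance(predict, x0, trials=500, seed=0):
    # Train on many independent training sets and collect predictions at x0.
    rng = random.Random(seed)
    preds = []
    for _ in range(trials):
        xs, ys = make_training_set(rng)
        preds.append(predict(xs, ys, x0))
    bias = statistics.fmean(preds) - true_f(x0)  # average output minus reality
    variance = statistics.pvariance(preds)       # spread across training sets
    return bias, variance


bias_const, var_const = bias_and_variance(predict_constant, 0.9)
bias_near, var_near = bias_and_variance(predict_nearest, 0.9)
print(f"constant model: bias={bias_const:.3f}, variance={var_const:.3f}")
print(f"nearest point:  bias={bias_near:.3f}, variance={var_near:.3f}")
```

Under these assumptions the rigid model should show the larger bias and the flexible model the larger variance, which is the trade-off described above.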

Underfitting and overfitting

Linked to bias and variance are two other concepts: underfitting and overfitting. Getting the right ‘fit’ can be hard when selecting a machine learning model. A model can underfit, failing to find patterns in the training data that can be usefully generalised. An underfitted model performs poorly even on the data it was trained on, and is typically associated with high bias.

A model can also overfit, making generalisations which are too closely tied to one specific set of data and which don’t transfer to other data sets. Overfitting is typically associated with high variance. It is a very common problem because of the nature of machine learning: the algorithm learns from a specific set of data and therefore relies on there being a sufficiently large quantity of data from which to generalise effectively. This, of course, is by no means guaranteed. It will clearly be a struggle to generalise from one shopper’s purchases and behaviour, even though the model might be 100% correct in its predictions about that shopper. Having data on another 100,000 shoppers will generally be much more effective.
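The contrast between the two failure modes can be sketched by comparing error on the training data with error on fresh data. The example below is a minimal illustration under assumed conditions: the target function, the noise level and the nearest-neighbour averaging model are chosen only for the demonstration, and all names are hypothetical.

```python
import math
import random


def target(x):
    # Assumed underlying pattern for this sketch.
    return math.sin(3.0 * x)


def sample(rng, n, noise_sd=0.3):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [target(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def knn_predict(train_x, train_y, x0, k):
    # Average the targets of the k training points closest to x0.
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(train_x, train_y, xs, ys, k):
    # Mean squared error of the k-neighbour model on the points (xs, ys).
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


rng = random.Random(1)
train_x, train_y = sample(rng, 60)
test_x, test_y = sample(rng, 500)

results = {}
for k in (1, 7, 60):
    train_error = mse(train_x, train_y, train_x, train_y, k)
    test_error = mse(train_x, train_y, test_x, test_y, k)
    results[k] = (train_error, test_error)
    print(f"k={k:2d}  train MSE={train_error:.3f}  test MSE={test_error:.3f}")
```

With k=1 the model memorises the training set (no training error, but a worse result on fresh data), which is overfitting. With k=60 it predicts the same average everywhere and does badly on both sets, which is underfitting. An intermediate value such as k=7 is closer to ‘just right’ in this setup.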

The next section looks at ways in which statistical methods can be used in machine learning to reduce the potential for overfitting.

The diagram below shows underfitting and overfitting, as well as high bias, high variance and the desired ‘just right’.

Source: Advice for applying machine learning


To get round the problem of overfitting, and more generally to maximise the potential of any given data set for successful generalisation to as yet unknown data, a number of statistical techniques can be employed. Their collective aim is to address the challenge which inevitably arises from using a finite, known data set to generate predictions about unlimited and unknown data.

Resampling and separate validation datasets

Two important techniques are:

  • Use a resampling technique to estimate model accuracy
  • Hold back a validation data set

The most popular resampling technique is k-fold cross-validation. The training data is split into k subsets (folds), and the model is trained and tested k times, each time holding out a different fold for testing and training on the rest. Averaging the k results builds up an estimate of the performance of a machine learning model on unseen data.
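A minimal sketch of k-fold cross-validation, written from scratch so that each step is visible, might look like the following. The straight-line model, the synthetic data and the function names are assumptions made for illustration; in practice a library routine would normally be used instead.

```python
import random


def k_fold_indices(n, k, seed=0):
    # Shuffle the row indices, then deal them into k roughly equal folds.
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    # Ordinary least squares for a single feature: y = slope * x + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validate(xs, ys, k=5, seed=0):
    # Train k times, each time testing on the one fold that was left out.
    folds = k_fold_indices(len(xs), k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(slope * xs[i] + intercept - ys[i]) ** 2 for i in held_out]
        scores.append(sum(errors) / len(errors))
    return scores


# Synthetic data, assumed for the example: a noisy straight line.
rng = random.Random(42)
xs = [rng.uniform(0.0, 10.0) for _ in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]

scores = cross_validate(xs, ys, k=5)
mean_score = sum(scores) / len(scores)
print("MSE per fold:", [round(s, 3) for s in scores])
print("mean MSE:", round(mean_score, 3))
```

The mean of the per-fold scores is the performance estimate. The spread between folds gives a rough sense of how sensitive the model is to the particular data it was trained on.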

A validation data set is simply a subset of your training data that you hold back from your machine learning algorithms until the very end of your project. (Terminology varies: a set held back for a final check like this is often called a test set or hold-out set elsewhere.) After you have selected and tuned your machine learning algorithms on your training data set, you can evaluate the learned models on the validation data set to get a final objective idea of how the models might perform on unseen data.
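Holding data back can be as simple as shuffling the rows once and setting a fraction aside. The sketch below assumes a 20% hold-out, a synthetic data set and a deliberately trivial model (a line through the origin); the helper name `holdout_split` is hypothetical.

```python
import random


def holdout_split(rows, holdout_fraction=0.2, seed=0):
    # Shuffle a copy of the rows, then set the first portion aside.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Synthetic (x, y) rows, assumed for the example.
rng = random.Random(7)
rows = [(x, 3.0 * x + rng.gauss(0.0, 1.0)) for x in range(100)]

working, holdout = holdout_split(rows)

# All model selection and tuning uses `working` only.
# Here the "model" is just a least-squares slope through the origin.
slope = sum(x * y for x, y in working) / sum(x * x for x, y in working)

# The held-back rows are used once, at the end, for the final estimate.
holdout_mse = sum((slope * x - y) ** 2 for x, y in holdout) / len(holdout)
print(f"fitted slope={slope:.3f}  hold-out MSE={holdout_mse:.3f}")
```

The important design point is the discipline rather than the code: if the held-back rows influence any tuning decision, the final figure is no longer an objective estimate.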

We’ll look at these techniques in more detail in the next lesson.

Ensemble methods

Another approach to increasing real-world performance is to use what are collectively known as ensemble techniques. Ensemble methods in machine learning use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the individual learning algorithms separately. Ensembles aim to combine multiple hypotheses to form one better hypothesis. Essentially, this means mitigating the limitations of single algorithms by using a mix of complementary algorithms and combining their results.
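One simple way of combining results is a majority vote. The simulation below is only a sketch: instead of real trained models it uses stand-in classifiers that are each right 70% of the time and whose mistakes are independent of one another. That independence is an assumption which real models satisfy only partly, and all names here are hypothetical.

```python
import random
from collections import Counter


def majority_vote(votes):
    # Return the label predicted by the most models.
    return Counter(votes).most_common(1)[0][0]


def noisy_classifier(true_label, accuracy, rng):
    # Stand-in for a trained binary classifier: correct with probability `accuracy`.
    return true_label if rng.random() < accuracy else 1 - true_label


def simulate(trials=10000, accuracy=0.7, n_models=3, seed=0):
    rng = random.Random(seed)
    single_correct = [0] * n_models
    ensemble_correct = 0
    for _ in range(trials):
        truth = rng.randint(0, 1)
        votes = [noisy_classifier(truth, accuracy, rng) for _ in range(n_models)]
        for i, vote in enumerate(votes):
            single_correct[i] += vote == truth
        ensemble_correct += majority_vote(votes) == truth
    return [c / trials for c in single_correct], ensemble_correct / trials


single_acc, ensemble_acc = simulate()
print("individual accuracies:", [round(a, 3) for a in single_acc])
print("majority-vote accuracy:", round(ensemble_acc, 3))
```

An odd number of models is used so that a two-class vote cannot tie. The more the models’ errors overlap, the smaller the gain from combining them, which is why complementary algorithms are preferred.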

We’ll look in more detail at all these techniques in the next lesson.