Measuring success

We’ve got the data, pushed it through a machine learning model and now we have some results. Thing is, are they any good? There are several types of metrics that can be applied to measure the level of success, depending on the problem you are working on. The most commonly reported are root mean square error (RMSE), the confusion matrix, precision and recall.

Root mean square error

This error metric measures the differences between the values predicted by your model and the values observed in your dataset. It is defined as the standard deviation of the prediction errors (often referred to in the literature as residuals). Because it works on continuous values, RMSE is particularly well suited to regression problems, where the model predicts a number rather than a class.
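To make this concrete, here is a minimal sketch in Python of how RMSE might be computed; the observed and predicted values are made up for illustration.

import numpy as np

def rmse(observed, predicted):
    # Residuals: the differences between observed and predicted values
    residuals = np.asarray(observed) - np.asarray(predicted)
    # RMSE is the square root of the mean of the squared residuals
    return np.sqrt(np.mean(residuals ** 2))

observed = [3.0, 5.0, 2.5, 7.0]    # made-up values observed in a dataset
predicted = [2.8, 5.4, 2.0, 6.5]   # made-up values predicted by a model
print(rmse(observed, predicted))   # roughly 0.42

The lower the RMSE, the closer the predictions sit to the observations, with 0 meaning a perfect fit.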

Confusion matrix

If you are dealing with a classification problem, one way to gauge the performance of the model is through a confusion matrix. It lets you see the ‘confusion’ between the classes, that is, which class is commonly mislabelled as another. (We hope that’s not too confusing!)

If your problem is binary (that is, you have 2 classes), then your confusion matrix will look like the one below and will contain the following information:

  • True Positive (TP): observation is positive and is predicted to be positive

  • False Negative (FN): observation is positive but is predicted to be negative

  • True Negative (TN): observation is negative and is predicted to be negative

  • False Positive (FP): observation is negative but is predicted to be positive

Confusion matrix for a binary problem

                                                 Actual class
                                                 Class 1    Class 2
Predicted as    Class 1 (for example, male)      TP         FP
                Class 2 (for example, female)    FN         TN

The above matrix shows just 2 classes; the more classes there are, the more complex the confusion matrix gets. We won’t show more complicated examples lest it becomes too confusing.
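As a rough sketch of how the four counts might be tallied in Python (the lists of labels are made up), treating ‘male’ as the positive class:

actual    = ['male', 'male', 'female', 'female', 'male', 'female']   # made-up true labels
predicted = ['male', 'female', 'female', 'male', 'male', 'female']   # made-up model output

positive = 'male'   # the class we treat as positive

tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))   # 2
fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))   # 1
tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))   # 2
fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))   # 1

These four numbers are exactly the cells of the matrix above, and they feed directly into the metrics that follow.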

Accuracy

It is the rate of correctly classified instances or, in other words, how often your classifier was right. It can be calculated as:

Accuracy = (TP+TN)/(TP+TN+FP+FN)
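Using some made-up counts, the formula translates directly into a line of Python:

tp, tn, fp, fn = 2, 2, 1, 1                     # made-up counts from a confusion matrix
accuracy = (tp + tn) / (tp + tn + fp + fn)      # 4 / 6, roughly 0.67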

Precision

This is the proportion of selected items that are relevant: the true positives divided by everything predicted as positive (true positives plus false positives).

Recall

The number of correctly classified elements of a class (true positives) divided by the total number of elements that actually belong to that class (true positives plus false negatives). Where precision penalises false positives, recall penalises false negatives, that is, missing elements of the class by labelling them as something else.

So in the example above, where a machine is trying to classify males (perhaps from photographs), the precision is:

Precision = TP/(TP+FP)

And recall is:

Recall = TP/(TP+FN)

In both cases, the higher the value the better; if either score comes out low, we wouldn’t advise buying that solution.
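With the same made-up counts as before, both formulas are one line of Python each:

tp, fp, fn = 2, 1, 1            # made-up counts from a confusion matrix
precision = tp / (tp + fp)      # 2 / 3, roughly 0.67
recall = tp / (tp + fn)         # 2 / 3, roughly 0.67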

F1 score

The F1 score is useful to see precision and recall at the same time; it combines both into a single number and is defined as:

F1 = 2 x (precision x recall)/(precision + recall)

Again, the higher the value the better; 1 is perfect (and probably hard to believe)!
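As a short sketch continuing the made-up values above:

precision, recall = 2 / 3, 2 / 3                        # made-up values from the sketch above
f1 = 2 * (precision * recall) / (precision + recall)    # also 2 / 3 here

Because F1 is the harmonic mean of the two, it is dragged down by whichever of precision or recall is lower.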

Jaccard index or similarity coefficient

The Jaccard index is a statistic used to measure the similarity between sample sets. For an image, for example, it is the part that is common to both the ground truth and the machine’s classification (their intersection) divided by the part that appears in one or the other (their union). The value lies between 0 and 1, and the larger it is the better.

The Jaccard distance, which is the complement of the Jaccard index (1 minus it), measures the dissimilarity between the two sets, so the higher the score the greater the difference.
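As a rough illustration (the sets of pixel coordinates are made up), both measures can be computed in Python with simple set operations:

ground_truth = {(0, 0), (0, 1), (1, 0), (1, 1)}     # made-up pixels labelled by a human
prediction   = {(0, 1), (1, 1), (2, 1)}             # made-up pixels labelled by the machine

intersection = ground_truth & prediction            # pixels in both sets
union = ground_truth | prediction                   # pixels in either set

jaccard_index = len(intersection) / len(union)      # 2 / 5 = 0.4
jaccard_distance = 1 - jaccard_index                # 0.6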