iBase Digital Asset Management Blog

How we predicted the Rugby World Cup Winner using Machine Learning Cloud services

2nd November 2015

As cloud experts at iBase, with a preference for using the full gamut of Amazon Web Services (AWS) to serve our digital asset management software, we thought it would be an interesting side project to incorporate the Machine Learning API into our products. In searching for a set of data to analyse we didn’t need to look far: our client World Rugby are hosting the Rugby World Cup 2015 this month and using iBase Trinity to store the massive amount of video content from videographers who have descended on the UK from all around the world. If you have never heard of the game of rugby, this article may not be for you and you should stop reading this very moment.

In summary, we started our prediction software with 27 games of the tournament remaining: 25 were predicted correctly and 2 incorrectly. And of course we predicted the winner, with the model giving the Kiwis an 88% certainty of winning.
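As a quick check on those figures, the hit rate works out as follows. This is a small Python illustration of the arithmetic only; the variable names are ours and it is not part of the original R code.

```python
# Figures quoted in the summary: 27 games remaining, 25 called correctly.
correct = 25
incorrect = 2
total = correct + incorrect

accuracy = correct / total
print(f"{total} games, accuracy {accuracy:.1%}")  # 27 games, accuracy 92.6%
```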

In looking for data sources, we loaded in 15 ‘long term’ training points using the historical data available. Whilst looking at the data it became clear we could also include ‘shorter term’ training points grouped into the subsets of Attacking, Defending and Team Discipline. Attacking is basically metres gained from running or kicking, Defending is the number of tackles missed and Team Discipline is the number of penalties conceded.

A spreadsheet was created containing all of the data points, as shown below:


Once the data was loaded we generated a heatmap using the R statistical language to gain a first impression of how the data appeared, shown below.


The heatmap generated by the above code is shown below.


Looking at the heatmap with an ‘off the cuff’ human eye, there appeared to be a number of obvious direct correlations. We chose one of them to plot: the correlation between winning and the number of metres made from attacking. The graph is shown below.
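For readers who want to see what a single cell of such a heatmap measures, here is a minimal Python sketch of the Pearson correlation between two columns. The function name and the six sample rows are invented purely for illustration; they are not taken from our spreadsheet, and the original analysis was done in R.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)

# Made-up example rows: metres gained in a match, and 1 if that team won.
metres_gained = [310, 420, 515, 380, 600, 450]
won = [0, 0, 1, 0, 1, 1]

r = pearson(metres_gained, won)
print(f"correlation between metres gained and winning: {r:.2f}")
```

A value near 1 means the two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.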


With all of these data points, the next stage was to select a prediction model for classification and regression of proximities between data points.

In choosing a prediction model there were two leading contenders, Oblique Random Forest (ORF) and Random Forest (RF), both of which are built from ensembles of decision trees. After a small amount of research, we chose RF because of its reputation for efficient and accurate performance combined with ease of training for new adopters.

You may wonder what an RF is. A Random Forest for classification is defined as ‘an ensemble of classification trees each built using a sample of the observations and only allowing a random subset of variables for selection at each node split’.
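The two sources of randomness in that definition can be sketched in a few lines of Python. This is only an illustration of the sampling steps, not a full Random Forest and not the R library we used; the function names, the seed and the numbers in the usage lines are assumptions made for the example.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: the 'sample of the observations'
    that each tree in the forest is built from."""
    return [rng.choice(rows) for _ in rows]

def candidate_variables(n_variables, subset_size, rng):
    """Pick the random subset of variable indices that a tree is allowed to
    choose from at one node split."""
    return sorted(rng.sample(range(n_variables), subset_size))

rng = random.Random(2015)           # fixed seed so the sketch is repeatable
rows = list(range(10))              # stand-in for 10 rows of match data
tree_rows = bootstrap_sample(rows, rng)
split_vars = candidate_variables(15, 4, rng)  # 4 of 15 variables per split

print("rows used by this tree:", tree_rows)
print("variables considered at this split:", split_vars)
```

Each tree sees a slightly different version of the data and a different handful of variables at every split, and the forest then combines the votes of all its trees.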

Smart scientists before us had already implemented the Random Forest functionality in a handy library we could plug into our existing R code. The training framework is called caret, which is short for Classification And REgression Training.

Once the Random Forest had been set up, it was necessary to revisit the structure of the data and re-code the Winning Team entries so they could be used to create probabilities for each outcome of any two teams being matched against each other.
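One plausible way to do this kind of re-coding, sketched in Python, is to turn the winner's name into a 0/1 label relative to the two teams in the fixture. The field names and the function are hypothetical and our actual R re-coding may have differed; the two fixtures shown are the 2015 final and one of the semi-finals.

```python
def recode_match(team_a, team_b, winner):
    """Replace a free-text winner with a binary label for the pairing:
    a_won is 1 if team_a won and 0 if team_b won."""
    if winner not in (team_a, team_b):
        raise ValueError(f"{winner!r} did not play in {team_a} v {team_b}")
    return {"team_a": team_a, "team_b": team_b, "a_won": 1 if winner == team_a else 0}

matches = [
    ("New Zealand", "Australia", "New Zealand"),     # final
    ("South Africa", "New Zealand", "New Zealand"),  # semi-final
]

recoded = [recode_match(a, b, w) for a, b, w in matches]
for row in recoded:
    print(row)
```

With the outcome expressed as a label for the pairing, a classifier can report a probability for each side of any fixture.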

The prediction model was trained, and further R code then applied it to incoming data and plotted the result on a graph comparing the two teams. If the value was above the plotted line on the graph then that team would be the winner, as can be seen on the graph below.


Normally the classification trees used in Random Forests split the variable space using orthogonal splits (perpendicular to a variable axis). To tune the Random Forest algorithm we selected the size of the subset of randomly selected variables used at each node split. The following code sets up a tuning grid consisting of several different sizes of this subset and specifies the method used to train the model, in this case 5-fold cross-validation. You can see on line 13 that all of the tuning numbers are below 15, as that is the number of incoming data types on the combined axes (e.g. metres gained). Further reading can be found at http://research.cs.tamu.edu/prism/lectures/iss/iss_l13.pdf
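As a rough illustration in Python, and not the R used for the project, the idea of a tuning grid scored by 5-fold cross-validation can be sketched as below. The grid values, the function names and the placeholder scoring function are all assumptions made for the example; in a real run the scoring function would train a forest on the training rows and return its accuracy on the held-out rows.

```python
import random

N_VARIABLES = 15                           # input variables, as in the article
subset_grid = [2, 4, 6, 8, 10, 12, 14]     # hypothetical grid, all below 15

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train, test) row-index lists for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def tune(n_rows, grid, score_fn, k=5):
    """Average score_fn over the k folds for each grid value; return the best."""
    results = {}
    for subset_size in grid:
        scores = [score_fn(subset_size, train, test)
                  for train, test in k_fold_indices(n_rows, k)]
        results[subset_size] = sum(scores) / len(scores)
    best = max(results, key=results.get)
    return best, results

def placeholder_score(subset_size, train_rows, test_rows):
    """NOT a real model: a stand-in that simply peaks at 6 so the sketch runs."""
    return 1.0 - abs(subset_size - 6) / N_VARIABLES

best_size, all_scores = tune(40, subset_grid, placeholder_score)
print("best subset size:", best_size)
```

Every row is held out exactly once across the five folds, so each candidate subset size is judged on data its trees did not see during training.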


So there you have it in a nutshell. For anyone wishing for an extra bit of technical detail, a closing screenshot of the winning R code has been included below.
