Kimono Worldcup Predictor

Can data predict who will win the World Cup?

We wanted to see if we could use a subset of data from our (un)official World Cup API to predict which team will win the World Cup. Below, you'll find an interactive model, bracket predictions and a detailed explanation of how we did it. Our model calculates a score for any given team based on a combination of historical data for each of the players on the team and the team's performance in the World Cup thus far. The score is used to predict the winner of any hypothetical matchup.

The model's input variables are team cohesion (% of players that play for the same professional club), aggregate goals scored by the team's players in the regular 2013-2014 season, aggregate red and yellow cards across players in that season, total minutes played across all players in that season, and weighted 2014 World Cup momentum.

1 Head to Head

Select two teams below to see who would win a match between them according to our predictive model. You can tweak the model at your discretion by increasing or decreasing the weight of the different factors using the blue dials. The results will update in real time.

The blue dials are initialized to the optimal coefficients for each variable, which we calculated using the correlation and regression analysis described in detail below. All the dials are normalized to a 0 to 100 scale. We also provide a percentage certainty to give an idea of how close the match might be.

      Team Goals Factor

      Red Card Factor

      Yellow Card Factor

      Minutes Played Factor

      Momentum Factor

      Cohesion Factor
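A minimal sketch of how the dial weights might combine into a head-to-head prediction, assuming the score is a plain weighted sum of inputs pre-scaled to 0..1, with cards counting against a team. The function names, the sign convention and the sample numbers are illustrative assumptions, not the model's actual coefficients or data. The percentage certainty is left out because its formula is not given here.

```python
# Hypothetical sketch: dial weights (0-100) multiplied by normalized team inputs.
FACTORS = ["goals", "red_cards", "yellow_cards", "minutes", "momentum", "cohesion"]

# Assumed sign convention: cards count against a team, everything else counts for it.
SIGNS = {"goals": 1, "red_cards": -1, "yellow_cards": -1,
         "minutes": 1, "momentum": 1, "cohesion": 1}


def team_score(inputs, dials):
    """Weighted sum of a team's inputs (each assumed pre-scaled to 0..1)."""
    return sum(SIGNS[f] * dials[f] * inputs[f] for f in FACTORS)


def predict_winner(name_a, inputs_a, name_b, inputs_b, dials):
    """Return the name of the higher-scoring team (ties go to the first team)."""
    score_a = team_score(inputs_a, dials)
    score_b = team_score(inputs_b, dials)
    return name_a if score_a >= score_b else name_b


# Made-up inputs for two placeholder teams, not real data.
dials = {f: 50 for f in FACTORS}
team_a = {"goals": 0.9, "red_cards": 0.2, "yellow_cards": 0.4,
          "minutes": 0.8, "momentum": 0.6, "cohesion": 0.5}
team_b = {"goals": 0.6, "red_cards": 0.1, "yellow_cards": 0.3,
          "minutes": 0.7, "momentum": 0.7, "cohesion": 0.7}

print(predict_winner("Team A", team_a, "Team B", team_b, dials))

# Turning one dial up changes the outcome, as it does on the page.
goal_heavy = dict(dials, goals=100)
print(predict_winner("Team A", team_a, "Team B", team_b, goal_heavy))
```

With every dial at 50 the sketch favors Team B; raising the goals dial to 100 tips it to Team A, which mirrors how tweaking a dial can flip a close matchup.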

      2 Full Tournament Bracket

      We've also included a full interactive bracket prediction, calculated by applying the model to all matches in the round of 16. Changing the dials affects the full bracket, just as it affected the head-to-head predictions above.

      In addition to tweaking the weights of the factors using the dials, you can alter the predicted outcome of a specific match by clicking on the team you think will win it. This overrides the model's prediction at that stage of the bracket and changes the predicted outcomes for the remaining matches, which are still predicted by the model but now build on your overrides. Click reset at any time to clear all of your overrides.

      [Interactive bracket: Round of 16, Quarterfinals, Semifinals, Final]
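The override behavior described above can be sketched as a round-by-round walk through a single-elimination bracket. This is a hedged illustration only: `play_bracket`, the `overrides` mapping and the placeholder teams and scores are assumptions made for the example, not the page's actual code.

```python
def play_bracket(teams, score, overrides=None):
    """Advance a single-elimination bracket round by round.

    teams:     names in bracket order (length a power of two, e.g. 16).
    score:     dict mapping team name to its model score.
    overrides: dict mapping frozenset({team_a, team_b}) to the forced winner.
    Returns a list of rounds, each a list of that round's winners.
    """
    overrides = overrides or {}
    rounds = []
    current = list(teams)
    while len(current) > 1:
        winners = []
        # Pair neighbors: (0, 1), (2, 3), ...
        for a, b in zip(current[0::2], current[1::2]):
            forced = overrides.get(frozenset((a, b)))
            if forced is not None:
                winners.append(forced)  # the user's pick beats the model
            else:
                winners.append(a if score[a] >= score[b] else b)
        rounds.append(winners)
        current = winners
    return rounds


# Four placeholder teams with made-up scores.
teams = ["A", "B", "C", "D"]
score = {"A": 3.0, "B": 2.0, "C": 1.0, "D": 4.0}

print(play_bracket(teams, score))
print(play_bracket(teams, score, {frozenset(("C", "D")): "C"}))
```

In the second call the override sends C through instead of D, and the model then predicts the final between A and C, so the champion changes from D to A.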

      3 Our Analysis

      Our initial hypothesis was that a team’s success could be predicted from the collective strength of its individual players going into the World Cup and the momentum the team gains (or loses) as it plays. First we calculated an incoming ‘seed’ for each team based on two things: how its individual players performed in their respective professional leagues during the 2013-2014 season, and how cohesive a unit it was, measured by how many of its players had played together on the same club teams over the past year.

      We first tested how strongly these performance and cohesion variables correlated with the outcomes of the World Cup matches played so far, using the Pearson correlation coefficient.
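For readers who want to repeat this test, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name and the sample numbers are illustrative assumptions; the real inputs would be one value per team for a given variable, paired with that team's results so far.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# Made-up values: a variable that rises in lockstep with the outcome,
# and the same variable against a reversed outcome.
variable = [1, 2, 3, 4]
outcome = [2, 4, 6, 8]
print(pearson(variable, outcome))        # perfectly correlated
print(pearson(variable, outcome[::-1]))  # perfectly anti-correlated
```

A value near +1 means the variable rises with success, a value near -1 means it falls as success rises, and a value near 0 means there is little linear relationship.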

      Next we ran multivariate regressions to solve for coefficients of a linear equation that yields a predictive score for each team. Let’s start by examining each of our six independent variables. Each contributes either positively or negatively to the predictive model.

      Goals

      Aggregate goal count is correlated with success

      We calculated aggregate goal count by summing the goals scored by each team’s individual players during the 2013-2014 season while playing for their professional teams. Argentina, England, Italy, Spain, Switzerland and Colombia entered the World Cup with the highest aggregate goals. This set of teams has had mixed results in the group stage due to unexpectedly poor performances from England, Spain and Italy. But analyzing all 32 teams, goal count is positively correlated with success.

      Minutes Played

      Aggregate playing time correlates with success

      To estimate each squad’s recent experience, we looked at the total number of minutes each player played in the 2013-2014 season. Teams whose players spent more time on the field during that season tended to perform better. The least ‘experienced’ team in the World Cup has, in aggregate, just over half as many minutes played as the most ‘experienced’ squad.

      Red Cards

      Red cards are anti-correlated with success

      Teams with higher aggregate red cards from the 2013-2014 season tended to score fewer points in this first round of matches. FIFA rules specify that two yellow cards or a red card lead to a player’s suspension for the next match, so the negative correlation makes intuitive sense.

      Yellow Cards

      Yellow cards are anti-correlated with success

      Just as with red cards, yellow cards aggregated across all players have a negative impact on success. Ecuador leads the historic yellow card tallies, while Japan stands out as the best-behaved side.

      Momentum

      Momentum in the early World Cup games matters

      The stunning upset when Spain lost to the Netherlands 1-5 on the second day of group play made it clear that we also needed to account for a team’s momentum in the current World Cup. We hypothesized that teams winning by large goal differentials and beating more highly regarded sides would build confidence, improving their likelihood of success in the next match.

      To estimate a team’s momentum, we multiplied the goal differential for each match played by the relative difficulty of the opposing side and summed them up. By using the weighted sum, a team gets extra credit if they beat a stronger team and gets penalized more harshly if they lose to a weaker team. For example, the -1 goal differential when England lost to Uruguay penalizes England with a momentum term of -1.34.

      We need to be careful when running regressions that include momentum because it will be partly auto-correlated with the outcome. In the momentum equation, h_home is the historic seed of the team in question, h_away,i is the historic seed of the opposing team in the i-th match, and Δgoals_i is the goal differential for the i-th match.
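A minimal sketch of one form consistent with the description above, assumed here for illustration rather than taken from the authors' equation: each match's goal differential is scaled by a ratio of historic seeds, using h_away / h_home for wins and draws and h_home / h_away for losses. That choice gives extra credit for beating a stronger side and a harsher penalty for losing to a weaker one. The function name and the seed values are made up.

```python
def momentum(h_home, matches):
    """Weighted sum of goal differentials over the matches played so far.

    h_home:  historic seed of the team in question.
    matches: list of (goal_diff, h_away) tuples, one per match, where
             goal_diff is goals scored minus goals conceded and h_away
             is the opponent's historic seed.
    The weighting below is an assumption, not the published equation.
    """
    total = 0.0
    for goal_diff, h_away in matches:
        if goal_diff >= 0:
            weight = h_away / h_home  # beating a stronger side counts more
        else:
            weight = h_home / h_away  # losing to a weaker side hurts more
        total += goal_diff * weight
    return total


# Made-up seeds: a team seeded 2.0 loses by one goal to a side seeded 1.0,
# then wins by two goals against a side seeded 3.0.
print(momentum(2.0, [(-1, 1.0), (2, 3.0)]))  # -2.0 + 3.0 = 1.0
```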

      Cohesion

      Players’ experience playing together does not correlate with success

      We calculated cohesion using the number of distinct clubs (n_clubs) represented in the squad.

      So a 23-man team with players belonging to 10 different clubs would have a team cohesion score of 56%, whereas a team with players spread across 5 different clubs would have 78% team cohesion. England, Spain, Russia, Italy and Germany are the most cohesive teams coming into the World Cup; of these only Germany is advancing to the round of 16.
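Both figures in the example above are consistent with cohesion = (squad size - n_clubs) / squad size, truncated to a whole percent. The sketch below assumes that form; the function name is hypothetical and the formula is inferred from the two worked numbers, not quoted from the original equation.

```python
def cohesion_percent(n_clubs, squad_size=23):
    """Cohesion as a whole percentage, assuming (squad_size - n_clubs) / squad_size.

    Integer division truncates toward zero, which matches the 56% quoted
    for 10 clubs (13 / 23 is roughly 56.5%).
    """
    return (100 * (squad_size - n_clubs)) // squad_size


print(cohesion_percent(10))  # 56
print(cohesion_percent(5))   # 78
```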

      Looking across all teams, cohesion was very weakly anti-correlated with success. But, excluding the upsets of England, Spain and Italy as outliers (since they came into the World Cup with some of the highest scores based on historic data), cohesion becomes weakly positively correlated, making it a poor predictor of success.

      Turning these insights into a model

      We rejected variables with weaker correlations and kept these 6 as the independent variables to predict the outcome, and noted down the relative strengths of each correlation so we could factor that in later. Our next step was to find the coefficients for a basic linear equation that would calculate a score that we can use to predict the outcomes. We randomized the 32 teams to create various subsets of 6 and solved the matrix equations to calculate some initial sets of coefficients.
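With 6 teams and 6 variables, each subset gives a square system of linear equations: one row of inputs per team, one unknown coefficient per variable. The sketch below solves such a system with Gaussian elimination in plain Python. It is an illustration under assumptions (the function name, the layout of the matrix and the small 3-by-3 example are all made up), not the code used for the analysis.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting.

    matrix: n rows of n inputs (one row per team in the subset).
    rhs:    n observed outcomes (one per team).
    Returns the list of n coefficients.
    """
    n = len(rhs)
    # Work on an augmented copy so the caller's data is left untouched.
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: this subset has no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution, from the last unknown to the first.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


# Small made-up system whose exact solution is [1, 2, 3].
inputs = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]
outcomes = [4, 10, 14]
coefficients = solve_linear_system(inputs, outcomes)
print([round(c, 6) for c in coefficients])
```

Because a square system is fit exactly by the teams in the subset, the resulting coefficients say nothing about how well they generalize, which is why different subsets can produce very different answers.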

      These varied significantly depending on which 6 teams we were looking at. To find the optimal predictor, we plugged each set of coefficients into our prediction equation and calculated the error. We used the sum of squared errors: subtracting the actual team score from the prediction, squaring the result and summing the individual squared errors across all teams.

      It’s important to separate the training and testing data sets to avoid “over-fitting”, i.e. creating a model that works perfectly for the data it has observed (training) but generalizes poorly to data it has not seen (testing). In this case, that means testing each coefficient set by seeing how well it predicts the matches that were not used to calculate it. And voilà, we have an equation with coefficients that predict the next game’s winner.
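The selection step can be sketched as follows: score each candidate coefficient set by its sum of squared errors on held-out data and keep the one with the smallest error. The function names and numbers are illustrative assumptions, and for simplicity the sketch scores every candidate against one shared held-out set, whereas in practice each candidate would be tested on whatever data was not used to fit it.

```python
def predict(coefficients, inputs):
    """Linear prediction: sum of coefficient * input over all variables."""
    return sum(c * x for c, x in zip(coefficients, inputs))


def sum_squared_error(coefficients, rows, actual):
    """Sum over teams of (prediction - actual score) squared."""
    return sum((predict(coefficients, row) - y) ** 2
               for row, y in zip(rows, actual))


def best_coefficients(candidates, test_rows, test_actual):
    """Pick the candidate with the lowest error on held-out data."""
    return min(candidates,
               key=lambda c: sum_squared_error(c, test_rows, test_actual))


# Two made-up candidate coefficient sets and a tiny held-out set.
candidates = [[1.0, 0.0], [0.5, 0.5]]
test_rows = [[2.0, 4.0], [6.0, 2.0]]
test_actual = [3.0, 4.0]

for c in candidates:
    print(c, sum_squared_error(c, test_rows, test_actual))
print(best_coefficients(candidates, test_rows, test_actual))
```

Here the first candidate misses by 1 and 2 (squared error 5.0) while the second predicts both held-out scores exactly, so the second is kept.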

      This analysis is meant as a thought starter. There’s a lot of additional data and more advanced statistical methods you can apply to build more sophisticated models. One approach that lends itself well to this type of classification problem is the Support Vector Machine. We invite you to use the kimono API to build a better model. Happy hour is on us if you do!