The motivation behind this project was to understand the effect of financial investment on the on-pitch performance of football clubs in the English Premier League (EPL). The approach taken was to collect data covering eleven EPL seasons, integrate the financial and on-pitch aspects, and then select subsets of the integrated data to analyse performance through data mining.
Data mining is a process through which raw data is transformed and summarised, and from which models can be proposed. The techniques used in this project for discovering hidden patterns are based on time series representation, classification, and clustering. The patterns are then evaluated to determine the extent to which financial investment is reflected in a team’s performance.
One aspect of the data collected covers the financial transactions of a club, and another covers its performance on the pitch. In this project, each repository extracted from external data sources was encoded and integrated with the other datasets. A subsequent data transformation step included cleaning the data to ensure consistency and to address any missing values. The transformed data was stored in a database, from which time series datasets were then generated on an ad hoc basis.
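The cleaning and time series generation steps described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record layout (club, matchweek, points), the club names, the helper names `clean_name` and `build_time_series`, and the choice to replace a missing points value with zero. The project's actual schema and cleaning rules are not stated here.

```python
from collections import defaultdict

# Hypothetical match records: (club, matchweek, points).
# None marks a missing value; "club a " is an inconsistent spelling.
RAW_RECORDS = [
    ("Club A", 1, 3), ("Club A", 2, None), ("Club A", 3, 1),
    ("Club B", 1, 0), ("Club B", 2, 3), ("Club B", 3, 3),
    ("club a ", 4, 3),
]

def clean_name(name):
    """Normalise whitespace and capitalisation so spelling variants share one key."""
    return " ".join(name.split()).title()

def build_time_series(records, fill_value=0):
    """Return {club: cumulative points ordered by matchweek}.

    Missing points are replaced with fill_value (an assumed cleaning rule).
    """
    per_club = defaultdict(dict)
    for club, week, points in records:
        per_club[clean_name(club)][week] = fill_value if points is None else points

    series = {}
    for club, weekly in per_club.items():
        total, cumulative = 0, []
        for week in sorted(weekly):
            total += weekly[week]
            cumulative.append(total)
        series[club] = cumulative
    return series

SERIES = build_time_series(RAW_RECORDS)
```

Each club ends up with one cumulative-points series, which is one possible time series representation of a season's campaign.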
The data was analysed using time series data-mining techniques. Time series analysis is the process whereby trends or patterns in data are analysed over a specific period. In this project, these patterns were derived by comparing the on-field performance of a particular team across several seasons, and the performance of all teams within a single season.
In this study, trends and patterns in the data were observed using clustering, based on: distance between two time series instances; segmentation of a year’s campaign by performance; visualisations in the form of dendrograms, where clusters in the graphs would indicate the level of performance; and affinity exhibited by a club in comparison to others. These time series instances were then used to build models to predict the likely performance of an EPL team, taking into account the inventory of its players and the investment made.
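As a hedged illustration of distance-based clustering of time series, the sketch below computes the Euclidean distance between equal-length series and merges the closest clusters step by step, producing the merge history that a dendrogram visualises. The Euclidean measure, the average-linkage rule, the function names, and the four made-up clubs are all assumptions; the distance measure and linkage actually used in the project are not specified here.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length time series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, series):
    """Mean pairwise distance between the members of two clusters."""
    dists = [euclidean(series[p], series[q]) for p in c1 for q in c2]
    return sum(dists) / len(dists)

def agglomerate(series):
    """Repeatedly merge the two closest clusters.

    Returns a list of (cluster_1, cluster_2, distance) tuples in merge order.
    """
    clusters = [(name,) for name in sorted(series)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], series)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical cumulative-points series for four made-up clubs.
POINTS = {
    "Club A": [3, 6, 9, 12],
    "Club B": [3, 6, 7, 10],
    "Club C": [0, 1, 1, 2],
    "Club D": [0, 0, 1, 1],
}
MERGES = agglomerate(POINTS)
```

In this toy data the two low-scoring clubs merge first and the two high-scoring clubs second, so the two branches of the resulting tree separate the clubs by level of performance.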
Figure 1. High-level diagram of the methodology of the project
Figure 2. Dendrogram showing the EPL clubs in the 2020/21 season
Student: Carl Bondin
Supervisor: Dr Joseph Vella