This project applied several algorithms to detect the genre of a musical piece. The algorithms were run on a set of audio tracks and compared with models found in the literature.
Several issues need to be considered when working in this area. It is generally acknowledged that a dataset will not necessarily represent its genres or categories equally. Music-genre detection is used by various websites and applications to differentiate between the types of music being played. It is also used to recommend new music that a user might like, based on music of the same category that the user already listens to.
A piece of music is generally characterised by a number of distinct features, which facilitate classification. A genre can be defined as a single class encompassing musical pieces that share similar features. The Free Music Archive (FMA) dataset was used as the reference standard for musical genres. This dataset offers 106,574 tracks of 30 seconds each, spanning 16 top-level genres and a further 147 sub-genres.
This project explored the process of detecting the genre of a musical composition using a number of machine learning (ML) and deep learning (DL) methods. The methods used were convolutional neural networks (CNNs), support vector machines (SVMs) and ensemble machine learning (EML). These methods were adopted to establish a characteristic pattern for a given genre. For the CNN, images called mel spectrograms were generated from the music pieces, so that the model could be trained on the images of these graphs, rather than on the underlying values, to establish this pattern. The SVM, in turn, used feature values extracted from the FMA dataset to establish the same pattern. Once these two processes were completed, EML was used to combine both techniques through an additional meta-model to improve efficiency.
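As a brief illustration of the frequency axis used by a mel spectrogram, the following sketch converts between hertz and mels and computes evenly spaced band centres on the mel scale. It assumes the widely used formula m = 2595 log10(1 + f/700); the function names, the number of bands and the frequency range are hypothetical choices made for illustration and are not taken from the project, whose tooling is not specified here.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in hertz to mels (assumed formula: 2595*log10(1 + f/700))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel: convert mels back to hertz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centres(n_mels, f_min, f_max):
    """Return n_mels centre frequencies (Hz), equally spaced on the mel scale.

    Hypothetical helper: the band count and range are illustrative only.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


if __name__ == "__main__":
    # Illustrative values: 8 bands between 0 Hz and 8 kHz.
    for centre in mel_band_centres(8, 0.0, 8000.0):
        print(f"{centre:8.1f} Hz")
```

Because the centres are equally spaced in mels, they are packed closely at low frequencies and spread out at high frequencies, which is what gives a mel spectrogram its finer resolution in the lower part of the spectrum.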
The outcome of this project was the set of evaluation metrics obtained by testing each model on the available test audio. The results included confusion matrices and values such as the accuracy, recall and precision for each music genre. A confusion matrix is a table indicating how many times the model classified each genre correctly or incorrectly. The experiment obtained an accuracy of 65% for some of the top genres classified through the proposed model.
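The relationship between a confusion matrix and the reported metrics can be sketched as follows. This is a minimal illustration in plain Python, not the project's evaluation code: the function names and the toy genre labels are hypothetical, and rows are assumed to hold the true genre while columns hold the predicted genre.

```python
def confusion_matrix(true_labels, predicted_labels, genres):
    """Count predictions: rows are true genres, columns are predicted genres."""
    index = {genre: i for i, genre in enumerate(genres)}
    matrix = [[0] * len(genres) for _ in genres]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[index[true]][index[predicted]] += 1
    return matrix


def per_genre_metrics(matrix, genres):
    """Return {genre: (precision, recall)} computed from the confusion matrix."""
    metrics = {}
    for i, genre in enumerate(genres):
        true_positives = matrix[i][i]
        predicted_total = sum(row[i] for row in matrix)  # column sum
        actual_total = sum(matrix[i])                    # row sum
        precision = true_positives / predicted_total if predicted_total else 0.0
        recall = true_positives / actual_total if actual_total else 0.0
        metrics[genre] = (precision, recall)
    return metrics


def accuracy(matrix):
    """Fraction of all tracks whose genre was predicted correctly."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Toy labels for illustration only; not results from the project.
    genres = ["Rock", "Jazz"]
    true = ["Rock", "Rock", "Jazz", "Jazz"]
    predicted = ["Rock", "Jazz", "Jazz", "Jazz"]
    matrix = confusion_matrix(true, predicted, genres)
    print(matrix)
    print(per_genre_metrics(matrix, genres))
    print(accuracy(matrix))
```

Reporting precision and recall per genre, rather than overall accuracy alone, is particularly useful when the genres are unequally represented in the dataset.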
Figure 1. Mel spectrogram for a 30-second audio track
Student: Mario Vella
Supervisor: Dr Josef Bajada