Procedural generation of sound files from small sound samples

Sound design is an expensive and time-consuming part of game and film production. This final-year project seeks to simplify the sound-design process for the benefit of non-experts who need sounds when developing their applications.

The proposed task entailed creating a system that would allow users to input a set of audio files, such as recordings of rain and thunder. The system would classify each sound and break it down into small clips, then arrange the clips in a way that sounds natural, allowing the user to choose the length, number of layers and texture of the final sound file. This required establishing which of two popular audio-sequencing algorithms ‒ Markov chains (MCs) and recurrent neural networks (RNNs) ‒ would yield the better results when used to sequence non-musical sounds.
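
As an illustration only ‒ not the project's actual implementation ‒ a first-order Markov chain for sequencing could be built by counting which clip label tends to follow which, then sampling a walk through that table. The following minimal Python sketch assumes an invented corpus of pitch labels and a simple dead-end rule:

    import random
    from collections import defaultdict

    def build_transitions(sequence):
        """Count first-order transitions between successive clip labels."""
        transitions = defaultdict(list)
        for current, nxt in zip(sequence, sequence[1:]):
            transitions[current].append(nxt)
        return transitions

    def generate(transitions, start, length):
        """Walk the chain, choosing each next clip in proportion to how
        often it followed the current clip in the source material."""
        out = [start]
        for _ in range(length - 1):
            candidates = transitions.get(out[-1])
            if not candidates:  # dead end: jump to a random known state
                candidates = list(transitions)
            out.append(random.choice(candidates))
        return out

    # Illustrative corpus: pitch labels extracted from a recording
    observed = ["A2", "C3", "A2", "E3", "C3", "A2", "C3", "E3"]
    print(generate(build_transitions(observed), "A2", 12))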

An audio classifier was needed to label the inputs. It was built using a convolutional neural network (CNN) trained on the FSD50K dataset, which comprises around 50,000 labelled audio files of varying classes, such as animal sounds and human voices. To evaluate the classifier, the dataset was split into training and test sets; the model was trained on the training split and its accuracy measured on the test split. This accuracy was then compared against published results on the same dataset to determine whether it was strong enough to serve as a baseline. The CNN's prediction of the type of sound inputted was passed on to the audio sequencer.
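
The abstract does not specify the CNN's topology, but a classifier of this kind might resemble the following PyTorch sketch, which maps a log-mel spectrogram patch to class scores; the layer sizes and input shape are assumptions made for illustration (FSD50K defines 200 label classes):

    import torch
    import torch.nn as nn

    class AudioCNN(nn.Module):
        """Small CNN over log-mel spectrograms (1 x n_mels x frames)."""
        def __init__(self, n_classes):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.AdaptiveAvgPool2d(1),  # one value per channel
            )
            self.classifier = nn.Linear(32, n_classes)

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))

    model = AudioCNN(n_classes=200)        # FSD50K has 200 labels
    batch = torch.randn(8, 1, 64, 128)     # 8 spectrogram patches
    print(model(batch).shape)              # -> torch.Size([8, 200])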

Depending on the classifier's output, the system would call the RNN model trained for that sound type. A separate model was trained for each class: pitches detected in that class's audio were fed to the RNN, which learnt to predict the next expected pitch given a set of prior pitches. Sequences for each layer of a sound were then built from the extracted features. Segmentation worked by checking for increases in volume, which denote the possible start of a segment, and by applying the Fourier transform to assign each segment its corresponding pitch, as sketched below.
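
The segmentation and pitch-assignment step could look like the following NumPy sketch: frame energy is tracked, a sharp jump marks a possible segment start, and the strongest FFT bin of each segment gives its pitch. The frame sizes, jump factor and synthetic test clip are illustrative assumptions:

    import numpy as np

    def segment_onsets(signal, frame=1024, hop=512, jump=2.0):
        """Mark a segment start wherever frame energy rises by the
        given factor over the previous frame (a crude onset check)."""
        rms = np.array([
            np.sqrt(np.mean(signal[i:i + frame] ** 2))
            for i in range(0, len(signal) - frame, hop)
        ])
        return [i * hop for i in range(1, len(rms))
                if rms[i] > jump * rms[i - 1] + 1e-8]

    def dominant_pitch(segment, sr):
        """Assign the segment to the frequency of its strongest FFT bin."""
        windowed = segment * np.hanning(len(segment))
        spectrum = np.abs(np.fft.rfft(windowed))
        return np.fft.rfftfreq(len(segment), d=1.0 / sr)[np.argmax(spectrum)]

    # Synthetic clip: one second of quiet 220 Hz, then loud 440 Hz
    sr = 22050
    t = np.arange(sr) / sr
    clip = np.concatenate([0.05 * np.sin(2 * np.pi * 220 * t),
                           np.sin(2 * np.pi * 440 * t)])
    starts = segment_onsets(clip)
    print([round(dominant_pitch(clip[s:s + 2048], sr)) for s in starts])  # ~[440]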

Figure 1: Flow diagram of the process for generating the sound files

The generated audio files were evaluated by means of a survey. Each question presented a pair of audio outputs from the sequencer ‒ one generated by the MC and one by the RNN, using varying numbers of layers ‒ and respondents were asked to select which of the two algorithmically sequenced sounds they felt was the more euphonic. In the event of no clear majority, the two algorithms could be used interchangeably.
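
The abstract does not name a statistical test, but one conventional way to decide whether a clear majority exists in such pairwise votes is a two-sided binomial (sign) test; the sketch below uses SciPy and invented vote counts purely for illustration:

    from scipy.stats import binomtest

    # Hypothetical tally: 41 of 60 pairwise votes preferred the RNN output
    result = binomtest(41, n=60, p=0.5)
    print(f"p-value = {result.pvalue:.4f}")  # small p-value -> clear majority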

Student: Daniel Azzopardi
Course: B.Sc. IT (Hons.) Artificial Intelligence
Supervisor: Dr Kristian Guillaumier