Music Style Transfer using Deep Learning (Part 1)

Everyone enjoys good music, be it a song that pulls you onto the dance floor or one that takes you down memory lane. The music industry has long tried to make music that appeals to a wider audience. This is a daunting task, as a few types of songs cannot meet every individual’s tastes. This is where artists come into the picture and create a new version of a song by:

    • Removing some instruments from the original music.
    • Adding new instruments to the original music.
    • Adding a specific style to the original music, etc.

The possible combinations are too numerous to list exhaustively, which opens up a huge market in the music industry. Cover songs are made for almost every popular song, and many variations of a song are possible. As a result, huge investments are made in artists and singers to produce new cover songs.

If a cover song is made with a particular audience in mind, it may well fail to gain traction among other groups. Is it possible to make different cover songs for different audience groups without burning a hole in the production house’s pocket? Imagine an application where you upload your favorite song, select your favorite singer or artist, and the application plays the song in that singer’s or artist’s style. Artificial Intelligence might be the answer. We have heard of style transfer for images; can we use the same concept for music?

To achieve this, we split the problem into two parts. In the first part, we identify the genre of a song using song classification techniques. In the second part, we will delve deeper into generating music in a different style without losing the characteristics of the original music.

PART 1: Music Classifier

Sound waves consist of compressions of the medium (here, air), each followed by a rarefaction. Our ears perceive a sequence of these compression-rarefaction pairs as sound, and music is built from the same blocks. But how do we make a computer understand compression and rarefaction? To apply any Machine Learning or Deep Learning algorithm, the data has to be a set of arrays (or tensors). Converting a song into a set of numbers can be treated as feature extraction, but which features do we extract? For our task, we extracted 39 MFCCs and 5 aggregate features:

    1. Zero-crossing rate
    2. Spectral centroid
    3. Spectral roll-off: It is a measure of the shape of the signal. It represents the frequency below which a specified percentage of the total spectral energy, e.g. 85%, lies.
    4. Chroma frequencies
    5. MFCC
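
To make the first three features concrete, here is a minimal NumPy sketch that computes them for a single frame of audio. The function names, the 2048-sample frame, the Hann window and the 22050 Hz sample rate are assumptions made for illustration, and the roll-off here is measured on the power spectrum; a library such as librosa may use slightly different conventions.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return np.mean(signs[1:] != signs[:-1])

def magnitude_spectrum(frame, sr):
    """Magnitudes and bin frequencies (Hz) of a Hann-windowed frame."""
    windowed = frame * np.hanning(len(frame))
    mags = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return mags, freqs

def spectral_centroid(mags, freqs):
    """Magnitude-weighted mean frequency of the spectrum."""
    return np.sum(freqs * mags) / np.sum(mags)

def spectral_rolloff(mags, freqs, percent=0.85):
    """Lowest bin frequency below which `percent` of the spectral energy lies."""
    cumulative = np.cumsum(mags ** 2)
    idx = np.searchsorted(cumulative, percent * cumulative[-1])
    return freqs[idx]

# Illustrative input: one frame of a pure 440 Hz tone (assumed values).
sr = 22050
frame = np.sin(2 * np.pi * 440.0 * np.arange(2048) / sr)
mags, freqs = magnitude_spectrum(frame, sr)

print("zero-crossing rate:", zero_crossing_rate(frame))
print("spectral centroid (Hz):", spectral_centroid(mags, freqs))
print("spectral roll-off (Hz):", spectral_rolloff(mags, freqs))
```

For a pure tone, both the centroid and the roll-off land close to the tone's frequency, and the zero-crossing rate is close to twice the frequency divided by the sample rate. Real music spreads its energy over many frequencies, which is what makes these numbers differ between genres.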

A peek at MFCC:

Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC – Mel Frequency Cepstrum. They are derived from a type of cepstral representation of the audio clip (a nonlinear “spectrum-of-a-spectrum”). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system’s response more closely than the linearly spaced frequency bands used in the normal cepstrum. This frequency warping can allow for a better representation of sound, for example, in audio compression.

MFCCs are commonly derived as follows:

    1. Take the Fourier transform of (a windowed excerpt of) a signal.
    2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
    3. Take the logs of the powers at each of the mel frequencies.
    4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
    5. The MFCCs are the amplitudes of the resulting spectrum.
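
The five steps above can be sketched for a single frame in plain NumPy. This is a simplified illustration, not the implementation inside librosa: the helper names, the common 2595 * log10(1 + f / 700) mel formula, the 64 mel filters, the Hann window and the unnormalized DCT-II are all assumptions chosen to keep the example short.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular overlapping filters with centers equally spaced on the mel scale."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_filters + 2))
    bank = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct_ii(x):
    """Unnormalized DCT-II of a 1-D array."""
    n = len(x)
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * m + 1) / (2 * n)) @ x

def mfcc_frame(frame, sr, n_filters=64, n_mfcc=39):
    windowed = frame * np.hanning(len(frame))       # step 1: windowed excerpt
    power = np.abs(np.fft.rfft(windowed)) ** 2      # step 1: power spectrum
    bank = mel_filterbank(n_filters, len(frame), sr)
    mel_power = bank @ power                        # step 2: map onto the mel scale
    log_mel = np.log(mel_power + 1e-10)             # step 3: logs (epsilon avoids log(0))
    return dct_ii(log_mel)[:n_mfcc]                 # steps 4-5: DCT amplitudes

# Illustrative input: one frame of a 440 Hz tone (assumed values).
sr = 22050
frame = np.sin(2 * np.pi * 440.0 * np.arange(2048) / sr)
coeffs = mfcc_frame(frame, sr)
print(coeffs.shape)
```

In practice the coefficients are computed for every frame of a song and then summarized, for example by averaging over time, to obtain a fixed-length feature vector.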

About the Dataset:

We used the GTZAN dataset to train our models. This dataset was used in the well-known genre classification paper “Musical genre classification of audio signals” by G. Tzanetakis and P. Cook, published in IEEE Transactions on Speech and Audio Processing in 2002.

The genres are metal, disco, classical, hip-hop, jazz, country, pop, blues, reggae and rock.

    • The dataset consists of 100 songs for each of the 10 genres, which adds up to 1000 songs. Each song is a 30-second mono audio clip in the .au format.
    • A song is a waveform whose amplitude varies over time. We need to sample the waveform at discrete time steps, because the audio features are extracted from the discretized song.

We chose per-class recall as our metric, since the problem has 10 classes and we want the model to learn all of them equally well.
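
Per-class recall is, for each genre, the fraction of songs of that genre that the model labels correctly. The sketch below computes it with NumPy; the function name and the toy labels are made up for illustration.

```python
import numpy as np

def per_class_recall(y_true, y_pred, n_classes):
    """Recall for each class: correct predictions / actual members of the class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = np.full(n_classes, np.nan)  # NaN marks classes absent from y_true
    for c in range(n_classes):
        actual = y_true == c
        if actual.any():
            recalls[c] = np.mean(y_pred[actual] == c)
    return recalls

# Toy example with 3 classes (hypothetical labels).
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0, 1]
recalls = per_class_recall(y_true, y_pred, n_classes=3)
print(recalls)
```

In this toy example the overall accuracy is 6 out of 9, yet the third class is recognized only half of the time. Reporting recall per class exposes this kind of imbalance, which a single accuracy figure would hide.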

A quick peek into some of the existing works

There have been attempts to use several model architectures to solve the problem of classifying the songs in the GTZAN dataset. Some of them are mentioned below:

Even with such state-of-the-art models, the per-class recall was not stable across genres, as the models struggled to tell some genres apart. Our approach differs in that it uses simple models with rich features. Let us look at the feature extraction details:

1. Feature extraction:

We used Python and librosa to extract the features mentioned above. We tried a few Machine Learning and Deep Learning models on the extracted zero-crossing rate, spectral centroid, spectral roll-off and chroma frequencies along with the 39 MFCC features. We concluded that these features alone do not carry enough information for a simple model to learn from.

2. Data Augmentation:

Another good strategy is to use data augmentation to increase the size of the dataset rather than to try more complex models. How do we increase the size of a dataset that is made up of songs? Here are a few approaches:

  • Adding noise
  • Temporal shift
  • Temporal stretch and squeeze
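
Each of these three approaches can be sketched in a few lines of NumPy. The function names, the noise level and the shift and stretch amounts are illustrative assumptions, not the settings we used. Note also that the stretch below is plain resampling by linear interpolation, which changes the pitch along with the duration; a phase-vocoder method such as `librosa.effects.time_stretch` changes the duration only.

```python
import numpy as np

def add_noise(y, noise_level=0.005, rng=None):
    """Add Gaussian noise with standard deviation `noise_level`."""
    rng = np.random.default_rng() if rng is None else rng
    return y + noise_level * rng.standard_normal(len(y))

def temporal_shift(y, shift):
    """Shift the clip by `shift` samples, wrapping the end around to the start."""
    return np.roll(y, shift)

def temporal_stretch(y, rate):
    """Resample so the clip is `rate` times faster (rate > 1 squeezes, rate < 1 stretches)."""
    n_out = int(round(len(y) / rate))
    positions = np.linspace(0, len(y) - 1, n_out)
    return np.interp(positions, np.arange(len(y)), y)

# Illustrative input: a one-second 440 Hz tone (assumed values).
sr = 22050
y = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
rng = np.random.default_rng(0)

augmented = [
    add_noise(y, rng=rng),
    temporal_shift(y, sr // 10),   # shift by 0.1 s
    temporal_stretch(y, 0.9),      # slightly slower
    temporal_stretch(y, 1.1),      # slightly faster
]
print([len(a) for a in augmented])
```

Each augmented clip keeps the genre label of the song it was derived from, so one labeled song yields several labeled training examples.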

Using the techniques above, we increased the dataset size from 1000 songs (1.3 GB of disk space) to 7020 songs (about 30 GB of disk space). We used scikit-learn’s StandardScaler to scale the NumPy arrays extracted by the above process.
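
Standard scaling subtracts each feature's mean and divides by its standard deviation, so that features measured in very different units (for example, a spectral centroid in Hz and a zero-crossing rate below 1) end up on a comparable scale. The NumPy sketch below shows the computation that scikit-learn's `StandardScaler` performs; the function names and the random data are placeholders, and fitting the statistics on the training rows only is an assumed, common practice.

```python
import numpy as np

def fit_scaler(X_train):
    """Per-feature mean and (population) standard deviation of the training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0  # leave constant features unscaled instead of dividing by 0
    return mean, std

def transform(X, mean, std):
    return (X - mean) / std

# Placeholder data: 100 "songs" with 44 features on very different scales.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(100, 44))
X_test = rng.normal(loc=50.0, scale=10.0, size=(25, 44))

mean, std = fit_scaler(X_train)
X_train_scaled = transform(X_train, mean, std)
X_test_scaled = transform(X_test, mean, std)  # reuse the training statistics
print(X_train_scaled.mean(axis=0)[:3], X_train_scaled.std(axis=0)[:3])
```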

These data augmentation techniques helped us tackle the variance problem and the feature extraction techniques helped us tackle the bias.

Train data size: (5608, 44) – 5608 songs, 44 features

Test data size: (1402, 44) – 1402 songs, 44 features

3. Variants of DL techniques that we tried:

  • Multi-Layer Perceptron
  • Recurrent Neural Network
  • Long Short-term Memory
  • Bi-Directional Long Short-term Memory
  • A Neural Network Ensemble of the above models

We experimented with multiple architectures of varying complexity in terms of depth, layer types and number of neurons. We used the Adam optimizer, with dropout for regularization.
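
For readers unfamiliar with these two ingredients, here is a small NumPy sketch of what each one does. It is framework-independent and purely illustrative: deep learning libraries ship their own implementations, and the class and function names, the dropout rate and the toy objective are assumptions. The Adam defaults shown are the values commonly used in practice, not necessarily the ones we trained with.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of the units and rescale the rest."""
    if not training or rate == 0.0:
        return activations
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

class Adam:
    """Adam update rule with bias-corrected first and second moment estimates."""

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = None  # running mean of gradients
        self.v = None  # running mean of squared gradients
        self.t = 0     # step counter

    def step(self, params, grads):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        self.t += 1
        self.m = self.beta1 * self.m + (1.0 - self.beta1) * grads
        self.v = self.beta2 * self.v + (1.0 - self.beta2) * grads ** 2
        m_hat = self.m / (1.0 - self.beta1 ** self.t)
        v_hat = self.v / (1.0 - self.beta2 ** self.t)
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# Toy objective: minimize ||w - 3||^2, whose gradient is 2 * (w - 3).
w = np.zeros(4)
optimizer = Adam(lr=0.1)
for _ in range(500):
    w = optimizer.step(w, 2.0 * (w - 3.0))
print(w)

rng = np.random.default_rng(0)
hidden = np.ones(100_000)
dropped = dropout(hidden, rate=0.5, rng=rng)
print(dropped.mean())
```

Dropout is active only during training; at prediction time the activations pass through unchanged, and the rescaling during training keeps their expected value the same in both modes.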


  • Most of the deep learning models we built with augmented data performed well on both the training and test data, with per-class recall in the range of 95-98%.
  • Given a choice, we recommend the basic neural network architecture because of its consistent predictions. The architecture of the neural network we used is attached.

Performance of the Neural Network on train and test.

  • Train loss: 0.0091
  • Train accuracy: 0.9979
  • Train recall: 0.9979
  • Validation loss: 0.1008
  • Validation accuracy: 0.9672
  • Validation recall: 0.9629

We will talk about further implementation aspects of style transfer in the next article. So, stay tuned!


