Music Style Transfer using Deep Learning (Part 3)

Style transfer was a novel idea when it was introduced, and it has been applied to images very successfully. Given a content image and a style image, we can compute an output image that keeps the original content but renders it in the new style. Style transfer can also be applied to audio files (music), but this is considerably more challenging. Can we use the same approach as for images? Two questions follow from that:
Is it possible to represent the information in an audio signal as an image?
Can we convert that image back into audio again?

As mentioned in the previous blog (Music Style Transfer using Deep Learning – Part 1), music style transfer can happen in many ways: by changing the instruments, the voice of the singer, the genre, and so on. Here we look at the scenario of changing the genre of the music. Since the vocals cannot be changed in this setting, we used only the background music for our experiments.

Let us now address each of the above questions. In Part 1, we looked at the different properties of audio files, such as the ZCR, the spectrogram, and MFCCs. The spectrogram is, in a sense, an image: it captures how the frequency content of a signal varies with time. We can think of the spectrogram as a 2D representation of the 1D signal, as shown in the figure below.

Fig 1

Fig 1 shows the spectrogram of an audio signal. In the right image of Fig 1, the height of each bar (reinforced by colour) represents the amplitude of the frequencies within that band. The depth dimension represents time, where each new bar is a separate transform of one short window of the signal. We can treat the spectrogram as a 1 × T image with F channels, where T is the number of time frames and F is the number of frequency bins. This lets us apply convolutional neural networks to it.
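
To make the "1 × T image with F channels" view concrete, here is a minimal NumPy sketch. It is an illustration rather than the code from our experiments: the frame length, hop size, synthetic test tone, and the use of log1p for the log-magnitude are all assumptions.

```python
import numpy as np

def log_magnitude_stft(signal, n_fft=512, hop=128):
    """Return a log-magnitude spectrogram of shape (T, F)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Cut the signal into overlapping, windowed frames: (T, n_fft)
    frames = np.stack([signal[t * hop : t * hop + n_fft] * window
                       for t in range(n_frames)])
    # One FFT per frame; F = n_fft // 2 + 1 frequency bins
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(magnitude)

# Synthetic stand-in for a song: one second of a 440 Hz tone (assumed data)
sample_rate = 8000
time = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 440 * time)

spec = log_magnitude_stft(audio)            # (T, F)
# Treat time as the image width and frequency as the channels:
image = spec[np.newaxis, np.newaxis, :, :]  # (batch=1, height=1, width=T, channels=F)
print(image.shape)
```

The last axis plays the role that the three colour channels play in an image, except that there are F of them.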

The second question is addressed by the Griffin-Lim algorithm, which reconstructs a signal from a modified STFT magnitude (the spectrogram) by iteratively estimating the missing phase.
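
The idea behind Griffin-Lim can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation we used: the helper names, the frame length of 256, the hop of 64, the random initial phase, and the iteration count are all assumptions. If the optimized spectrogram is a log-magnitude, the log has to be undone first (for example with np.expm1 if np.log1p was used).

```python
import numpy as np

def stft(signal, n_fft=256, hop=64):
    """Complex STFT of shape (F, T) using a Hann window."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[t * hop : t * hop + n_fft] * window
                       for t in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def istft(spec, n_fft=256, hop=64):
    """Least-squares inverse STFT by windowed overlap-add."""
    window = np.hanning(n_fft)
    frames = np.fft.irfft(spec.T, n=n_fft, axis=1)
    length = n_fft + hop * (frames.shape[0] - 1)
    signal = np.zeros(length)
    norm = np.zeros(length)
    for t in range(frames.shape[0]):
        signal[t * hop : t * hop + n_fft] += frames[t] * window
        norm[t * hop : t * hop + n_fft] += window ** 2
    return signal / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_iter=30, n_fft=256, hop=64, seed=0):
    """Estimate a signal whose STFT magnitude matches `magnitude`."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random start
    for _ in range(n_iter):
        signal = istft(magnitude * phase, n_fft, hop)
        # Keep the known magnitude, take only the phase of the re-analysis
        phase = np.exp(1j * np.angle(stft(signal, n_fft, hop)))
    return istft(magnitude * phase, n_fft, hop)

# Usage on a synthetic tone (assumed data): discard the phase, then rebuild
tone = np.sin(2 * np.pi * 440 * np.arange(4096) / 8000)
magnitude = np.abs(stft(tone))
rebuilt = griffin_lim(magnitude, n_iter=30)
```

Each iteration keeps the target magnitude and replaces only the phase, so the mismatch between the rebuilt signal's spectrogram and the target does not grow from one iteration to the next.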

Style transfer for images is implemented with convolutional neural networks. Images have two spatial dimensions plus colour channels, so 2D convolution is a natural fit. Audio, however, is a 1D signal; to obtain a 2D representation, we transform the audio into a spectrogram using the STFT, which was covered in Part 1.

A 1D convolutional network is sufficient for our purpose. Because each filter slides across a 1 × T image, the convolution takes place only along the time axis, with the frequencies acting as input channels.
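
A minimal NumPy sketch of such a time-only convolution is shown below. The function name, the kernel width of 11, the 32 filters, and the random input are illustrative assumptions; as in most deep learning frameworks, the operation is implemented as a cross-correlation.

```python
import numpy as np

def conv1d_time(spec, kernels):
    """'Valid' 1D convolution along the time axis.

    spec:    (T, F)    time frames x frequency channels
    kernels: (K, F, N) kernel width x input channels x filters
    returns: (T - K + 1, N) feature map
    """
    width = kernels.shape[0]
    n_out = spec.shape[0] - width + 1
    # Gather the K consecutive frames seen at each output position: (n_out, K, F)
    patches = np.stack([spec[t : t + width] for t in range(n_out)])
    # Each filter spans all F channels, so it only ever moves along time
    return np.tensordot(patches, kernels, axes=([1, 2], [0, 1]))

rng = np.random.default_rng(0)
spec = rng.standard_normal((100, 257))             # stand-in spectrogram (T=100, F=257)
kernels = 0.01 * rng.standard_normal((11, 257, 32))  # 32 filters of width 11 (assumed)
features = conv1d_time(spec, kernels)
print(features.shape)
```

Every filter covers the full frequency range at once, which is why no convolution happens along the frequency axis.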

In our notation, ‘x’ denotes the generated log-magnitude STFT, and X denotes its feature map, obtained by convolving over whichever dimension is being considered for style transfer or texture generation.

We followed Dmitry Ulyanov's approach to music style transfer. Accordingly, we use ‘s’ and ‘c’ to denote the log-magnitude STFT representations of the style and content audio respectively, whose feature maps are denoted S and C.

The content loss is the L2 distance between the feature maps of the target audio and the content audio, summed over all spatial indices ‘i’ and channels ‘j’, as expressed in the equation below:
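
Written out from the description above, and assuming the usual squared-difference form with no extra normalization constant, the content loss reads:

```latex
L_{Content}(x, c) = \sum_{i,j} \left( X_{ij} - C_{ij} \right)^2
```

Here the first index runs over positions in the feature map and the second over channels.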

When representing style, local information about where the musical events happen is lost, but the information about how they happen in relation to each other is kept. To represent the style, the inner products of the vectorized feature maps ‘X’ and ‘S’, denoted ‘W’ and ‘G’ respectively, are calculated using the equations below, for each pair of filters ‘i’ and ‘j’.
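
Using k for the position along the convolved axis (an index name introduced here for illustration), and assuming no normalization constant, the two inner-product matrices can be written as:

```latex
W_{ij} = \sum_{k} X_{ki} \, X_{kj}, \qquad G_{ij} = \sum_{k} S_{ki} \, S_{kj}
```

Because the sum runs over every position k, the result does not change if the frames are reordered. That is why the location of events is lost while the co-occurrence of filter responses is kept.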

The style loss, L_Style, is calculated as the sum of the L2 distances between G and W over all pairs of filters i and j among the N_f filters, as given in the following equation:
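
From this description, and again assuming a squared-difference form without a normalization constant, the style loss would be:

```latex
L_{Style}(x, s) = \sum_{i=1}^{N_f} \sum_{j=1}^{N_f} \left( W_{ij} - G_{ij} \right)^2
```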

The total loss is given by the equation below, which uses the parameters α and β to weight the relative importance of the content and style terms.
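
Assuming α weights the content term and β the style term, the total loss is the weighted sum:

```latex
L = \alpha \, L_{Content} + \beta \, L_{Style}
```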

We need to minimize the total loss. The network weights stay fixed while L is minimized; only the log-magnitude STFT representation of the target audio is adjusted.
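
The loss terms described above can be sketched in NumPy as follows. This is an illustration under assumptions, not our training code: feature maps are taken to have shape (T, N_f), no normalization constants are applied, α is assumed to weight the content term, and the random arrays and the alpha and beta values are placeholders.

```python
import numpy as np

def gram(features):
    """Inner products between every pair of filters; features is (T, N_f)."""
    return features.T @ features

def content_loss(X, C):
    """Squared differences between target and content feature maps."""
    return np.sum((X - C) ** 2)

def style_loss(X, S):
    """Squared differences between the two inner-product matrices."""
    return np.sum((gram(X) - gram(S)) ** 2)

def total_loss(X, C, S, alpha, beta):
    """Weighted sum of the content and style terms."""
    return alpha * content_loss(X, C) + beta * style_loss(X, S)

rng = np.random.default_rng(0)
C = rng.standard_normal((50, 8))  # content feature map (placeholder data)
S = rng.standard_normal((50, 8))  # style feature map (placeholder data)
X = rng.standard_normal((50, 8))  # feature map of the audio being generated

loss = total_loss(X, C, S, alpha=1e-2, beta=1.0)
print(loss)
```

During optimization only the input that produces X would change; C and gram(S) can be computed once and reused, since the filters never change.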


  • We tried convolutional neural networks with 4096, 1024, and 2048 filters.
  • Each filter's weights are initialized with the Glorot normal initializer.
  • After the filters are applied to the spectrograms, we have the convolved feature maps of both the content and the style audio.
  • We defined the style and content losses and calculated the total loss.
  • We minimized the total loss, which is the weighted sum of the content and style losses, and obtained the result as a spectrogram.
  • We used the Griffin-Lim algorithm to reconstruct the signal from the spectrogram.
  • Instead of the ReLU used in Ulyanov's original approach, we used SELU, ELU, and leaky ReLU, and observed that the loss was smaller than with the original setup for the songs we considered.
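
For reference, the activation functions and the initializer mentioned in the steps above can be sketched in NumPy. The leaky ReLU slope of 0.2 and the example filter shape are assumptions, and the initializer here draws from a plain normal distribution (some frameworks truncate it); the SELU constants are the standard published values.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, slope=0.2):  # slope is an assumed value
    return np.where(z > 0, z, slope * z)

def elu(z, alpha=1.0):
    # np.minimum avoids overflow in expm1 for large positive inputs
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    return SELU_SCALE * elu(z, SELU_ALPHA)

def glorot_normal(shape, rng):
    """Glorot normal weights for a filter bank of shape (width, in_channels, filters)."""
    width, in_channels, n_filters = shape
    fan_in = width * in_channels
    fan_out = width * n_filters
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=shape)

rng = np.random.default_rng(0)
weights = glorot_normal((11, 64, 128), rng)  # small example shape (assumed)
z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, fn in [("relu", relu), ("leaky_relu", leaky_relu),
                 ("elu", elu), ("selu", selu)]:
    print(name, fn(z))
```

Unlike ReLU, the other three functions keep a nonzero response for negative inputs, so they pass some signal where ReLU outputs zero.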

Code Snippet:


The results for different values of alpha, beta, learning rate, and activation function are summarized below. The songs considered here are the same as in the base paper, and the number of filters is 4096. As the code above shows, adding a max pooling layer and changing the hyperparameters gave us better results than the base configuration. For further analysis, we also tried the approach on multiple other songs.

Beta   Alpha   Learning Rate   Activation Function   Loss
10     1e-3    1e-2            Leaky ReLU            81.79
10     1e-3    1e-2            ELU                   207.1794
10     1e-3    1e-2            SELU                  266.556
1      1e-2    1e-3            ReLU                  559 (Base Paper)

Conclusion and Future Scope:

We conclude that, for texture-synthesis-based style transfer, tuning the hyperparameters as described above can give better results. These are some of the variations we tried for this problem. The approach could be improved further by using MFCC and CQT transformations, and more complex methods such as GANs could also be applied to the same task. The style-transferred songs also contain noticeable noise, which could be filtered out.


