What better way to begin your day than listening to your favorite song or composition and feeling rejuvenated for a fresh start? Wouldn’t it be even more exciting if you could generate music in your own style? Thanks to advances in AI, we can experiment with these ideas using Deep Learning techniques. Deep Learning is on the rise, with applications in fields ranging from computer vision and natural language processing to healthcare, speech recognition, art generation, adding sound to silent movies, machine translation, advertising, and self-driving cars.
Music generation involves two main aspects: composition and performance. Composition focuses on the building blocks of a song, such as notation, tone, pitch, and chords, whereas performance focuses on how the performer plays the notes. The distinctive way a piece combines the two defines its style.
Style transfer of music using Deep Learning is one of the most interesting applications: we generate music by transferring the style of one type of music to another. The idea is to generate music automatically using Deep Learning models. The input is the base music and the style, while the output is the base music in a new style.
Before I start, let us go over some basic terminology and the thought process behind the work:
What is Music?
Music is the science or art of ordering tones or sounds in succession, in combination, and in temporal relationships to produce a composition having unity and continuity. It can be a vocal, instrumental, or mechanical sound having rhythm, melody, and harmony e.g. choral music, piano music, recorded music, etc.
In short, music is nothing but a sequence of musical notes. Our input to the model is a sequence of musical events/notes. Our output will be a new sequence of musical events/notes.
For this task, we used music data from different genres in the form of MIDI files to train the model. The model had to learn the patterns in the music. Once it had learned the patterns, it was able to generate new music for us. The generated music was indeed harmonious and good to hear.
Deciding the Model Architecture:
Recurrent neural networks (RNNs) are a class of artificial neural networks that make use of sequential information present in the data. They are called recurrent because they perform the same function for every single element of a sequence, with the result being dependent on previous computations. In traditional Neural Networks, the outputs are independent of previous computations.
Fig. 1 RNN Structure
Long Short-Term Memory networks (LSTMs) are a special kind of RNN, capable of learning long-term dependencies.
Using a gating mechanism, LSTMs are able to recognize and encode long-term patterns. Since music is a sequence of notes and chords, we built a sequence-to-sequence model using an LSTM network to generate the music sequence pattern. LSTMs are extremely useful for problems where the network has to remember information for a long period of time, as is the case in music and text generation.
In the figure above, input vectors are in red, output vectors are in blue, and green vectors hold the RNN’s state.
A sequence-to-sequence model is a many-to-many model that takes a sequence of items (words, letters, image features, etc.) and outputs another sequence of items. The goal of the LSTM is to estimate the conditional probability $p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T)$, where $(x_1, \dots, x_T)$ is an input sequence and $(y_1, \dots, y_{T'})$ is its corresponding output sequence, whose length $T'$ may differ from $T$.
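For reference, sequence-to-sequence LSTMs usually compute this probability by first encoding the input sequence into a fixed-size state and then factorizing over the output steps. The symbol $v$ for that encoded state is notation introduced here for illustration and does not come from the text above:

```latex
p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T)
  = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \dots, y_{t-1})
```

In a next-note setup, where the model predicts a single note from a fixed window of previous notes, $T' = 1$ and the product reduces to a single distribution over the note vocabulary.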
Data used for generating Music was in MIDI file format from the following 3 genres – Classical, Pop, and Jazz.
Wondering what MIDI files are? Let me explain:
A file with the .MID or .MIDI file extension (pronounced as “mid-ee”) is a Musical Instrument Digital Interface file.
Unlike regular audio files such as MP3 or WAV, MIDI files don’t contain actual audio data and are therefore much smaller. Instead, a MIDI file describes which notes are played, when they’re played, and how long or loud each note should be. In other words, MIDI files are a compact way of representing a sequence of notes played on different instruments.
Here are some music-related terms that help in understanding how a MIDI file is put together:
Note: A note is either a single sound or its representation in notation. Each note consists of a pitch, an octave, and an offset.
Pitch: Pitch refers to the frequency of the sound.
Octave: An octave is the interval between one musical pitch and another with half or double its frequency.
Offset: The position of the note in time within the piece.
Chord: Playing multiple notes at the same time constitutes a chord.
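To make the pitch and octave definitions concrete, the sketch below converts MIDI note numbers to frequencies with the standard equal-temperament formula (A4 is MIDI note 69 at 440 Hz). It is an illustration only: the function names are invented for this example, and the octave numbering assumes the common convention that middle C (MIDI note 60) is C4.

```python
A4_MIDI_NUMBER = 69       # MIDI number of the A above middle C
A4_FREQUENCY_HZ = 440.0   # standard concert pitch


def midi_to_frequency(midi_number):
    """Return the equal-temperament frequency in Hz for a MIDI note number."""
    semitones_from_a4 = midi_number - A4_MIDI_NUMBER
    # Twelve semitones make one octave, and each octave doubles the frequency.
    return A4_FREQUENCY_HZ * 2 ** (semitones_from_a4 / 12)


def octave_of(midi_number):
    """Return the octave number, assuming middle C (MIDI 60) is C4."""
    return midi_number // 12 - 1


# The same pitch name one octave apart: the frequency halves or doubles.
for name, number in [("A3", 57), ("A4", 69), ("A5", 81)]:
    print(name, "octave", octave_of(number), round(midi_to_frequency(number), 2), "Hz")
```

Moving up twelve MIDI numbers keeps the note name and doubles the frequency, which is exactly the octave relationship defined above.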
We used the music21 toolkit to extract the contents of our dataset and to take the output of the neural network and translate it back to musical notation.
Music21 is a Python-based toolkit for computer-aided musicology. The toolkit provides a simple interface for reading the musical notation of MIDI files. We extracted the notes and chords from the input MIDI files using music21 functions.
It also allows creating Note and Chord objects so that we can create our own MIDI files easily.
Input and output sequences for training: We created an array of input and output sequences to train our model. Each input sequence consisted of 100 notes, while the output array stored the 101st note for the corresponding input sequence. A sample is shown below with an input sequence length of 6:
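The windowing described above can be sketched in plain Python. This is a minimal illustration rather than the code used for the experiments: the function name `make_training_pairs` and the eight-note melody are made up for the example, and the window length is 6 to keep the output readable (the article uses 100).

```python
def make_training_pairs(notes, sequence_length):
    """Slide a window over the notes: each window is an input, the next note its target."""
    inputs, targets = [], []
    for start in range(len(notes) - sequence_length):
        inputs.append(notes[start:start + sequence_length])
        targets.append(notes[start + sequence_length])
    return inputs, targets


melody = ["C4", "E4", "G4", "C5", "G4", "E4", "C4", "D4"]  # hypothetical notes
inputs, targets = make_training_pairs(melody, sequence_length=6)

for window, target in zip(inputs, targets):
    print(window, "->", target)
# ['C4', 'E4', 'G4', 'C5', 'G4', 'E4'] -> C4
# ['E4', 'G4', 'C5', 'G4', 'E4', 'C4'] -> D4
```

A file with N notes yields N minus the window length training pairs, so 68 notes with a window of 6 would give 62 pairs.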
Example – Notes from one MIDI file: (Notes Length = 68)
Below is the Sheet Music from the sample MIDI file:
Music is represented by a sequence of musical notes.
Our input to the model is a sequence of musical events/notes.
Our output will be a new sequence of musical events/notes.
In our model, we used four different types of layers:
LSTM layers are recurrent neural network layers that take a sequence as input and return either a full sequence (return_sequences=True) or a single output. The return_sequences flag (True or False) decides whether the output is emitted for each time step or only at the end.
Dropout layers are a regularisation technique for reducing overfitting in neural networks: a fraction of the input units is set to 0 at each update during training.
Dense layers, or fully connected layers, connect each input node to each output node.
The Activation layer determines what activation function our neural network will use to calculate the output of a node.
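The four layer types above can be combined as in the following minimal sketch. It assumes the Keras API bundled with TensorFlow (the text names the layer types and the return_sequences flag but not the framework). The layer sizes, dropout rate, optimizer, and the function name `build_model` are illustrative assumptions, not the configuration used in these experiments.

```python
from tensorflow.keras import Input
from tensorflow.keras.layers import LSTM, Activation, Dense, Dropout
from tensorflow.keras.models import Sequential


def build_model(sequence_length, vocab_size, units=256, dropout_rate=0.3):
    """Stack LSTM, Dropout, Dense and Activation layers for next-note prediction."""
    model = Sequential([
        # Each training example is a window of notes, one normalized value per step.
        Input(shape=(sequence_length, 1)),
        # return_sequences=True emits an output at every time step, so the
        # next LSTM layer still receives a full sequence.
        LSTM(units, return_sequences=True),
        Dropout(dropout_rate),
        # return_sequences defaults to False: only the final step is returned.
        LSTM(units),
        Dropout(dropout_rate),
        # One output node per unique note or chord in the vocabulary.
        Dense(vocab_size),
        # Softmax turns the outputs into a probability distribution over notes.
        Activation("softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
    return model
```

Calling `build_model(sequence_length=100, vocab_size=n)` with `n` set to the number of unique notes gives a model whose output is one probability per note in the vocabulary.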
Experiments: We tried generating music by experimenting with different genres:
Music File Type: MIDI
Genres used – Classical, Pop and Jazz
No. of Music files used
No. of Input Notes
Unique notes (Vocab_size)
No. Of Output Notes
The dataset consisted of MIDI files from the following genres: Classical, Pop, and Jazz.
Extracted notes and chords from MIDI files using the music21 library.
Prepared the input and output notes sequences.
The notes in the input sequence were converted to integers and then normalized.
The notes in the output sequence were one-hot encoded.
Defined the model architecture using LSTM layers, with different variations including a TimeDistributed layer.
Trained the model and saved the weights for prediction.
Predicted the output sequence by loading the best set of weights.
Converted the output into a sequence of notes and chords.
Generated the MIDI file from the output notes.
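The integer conversion, normalization, and one-hot encoding steps from the list above could look like the following NumPy sketch. Dividing by the vocabulary size is one common way to normalize and is an assumption here, as are the function names and the toy data; the text does not say which scaling was used.

```python
import numpy as np


def encode_sequences(input_sequences, target_notes):
    """Turn note tokens into normalized inputs and one-hot encoded targets."""
    seen = {name for sequence in input_sequences for name in sequence}
    vocabulary = sorted(seen | set(target_notes))
    note_to_int = {name: index for index, name in enumerate(vocabulary)}
    vocab_size = len(vocabulary)

    # Inputs: integers scaled to [0, 1), shaped (samples, time steps, 1) for an LSTM.
    x = np.array(
        [[note_to_int[name] for name in sequence] for sequence in input_sequences],
        dtype=np.float32,
    )
    x = x.reshape(len(input_sequences), -1, 1) / float(vocab_size)

    # Targets: one row per sample with a single 1 at the index of the target note.
    y = np.zeros((len(target_notes), vocab_size), dtype=np.float32)
    y[np.arange(len(target_notes)), [note_to_int[name] for name in target_notes]] = 1.0
    return x, y, vocabulary


def decode_prediction(probabilities, vocabulary):
    """Map a vector of predicted probabilities back to the most likely note token."""
    return vocabulary[int(np.argmax(probabilities))]


# Toy data: two windows of three tokens each, with one target per window.
windows = [["C4", "E4", "G4"], ["E4", "G4", "0.4.7"]]
next_notes = ["0.4.7", "C4"]
x, y, vocabulary = encode_sequences(windows, next_notes)
print(x.shape, y.shape, vocabulary)
```

At generation time, `decode_prediction` is applied to each output of the trained model, and the resulting tokens are converted back into notes and chords before the MIDI file is written.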
In these experiments, the output sequences generated from the Classical and Pop files were melodious sequences of notes.
This was an experiment in generating music from different MIDI files of the same genre. Imagine if you could change the style of music from one genre to another. Sounds interesting, doesn’t it? Stay tuned for Part 3…