How to use ELMo Embedding in Bidirectional LSTM model architecture?

Embeddings from Language Models (ELMo)


ELMo was developed by the Allen Institute for AI and introduced in the 2018 paper “Deep contextualized word representations”. It was a state-of-the-art technique in natural language processing (NLP) when it was released. It represents words as deeply contextualized embeddings produced by a bidirectional language model (biLM) that is trained on a large text corpus. Once training is complete, we can reuse these pre-trained embeddings on similar data. This technique is called transfer learning.

Transfer Learning


Transfer learning has become the new buzzword in NLP (Natural Language Processing), after first becoming popular in the image domain. Transfer learning (TL) is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. For example, a model can learn a pattern representation for one problem and then apply the same representation to solve similar problems.

ELMo embeddings can be easily added to existing models and significantly improve the state of the art across challenging NLP problems, including question answering, textual entailment and sentiment analysis.


Why use ELMo embeddings over word2vec and GloVe embeddings

  • Word2vec and GloVe word embeddings are context-independent, i.e. these models output just one vector (embedding) for each word, irrespective of the context in which the word is used, so all the different senses of the word are combined into one vector representation.

▹ Example:

“Jack while talking over the cell phone entered the prison cell to extract blood cell samples of Jill and made an entry in the excel cell about the blood sample collection.”

For the word “cell” in the statement above, word2vec or GloVe will generate a single n-dimensional representation, regardless of where the word occurs in the sentence and regardless of the meaning it carries there. The word “cell” has a different meaning in each of its four contexts, and this information is lost in word2vec or GloVe embeddings.
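The context-independence described above can be illustrated with a toy lookup table. This is only a sketch: the three-dimensional vectors and the names `static_table` and `embed_static` are made up for illustration and do not come from word2vec or GloVe.

```python
# Toy static embedding table: one fixed vector per word (values are made up).
static_table = {
    "cell": [0.12, -0.40, 0.88],
    "phone": [0.75, 0.10, -0.05],
    "prison": [-0.60, 0.33, 0.20],
    "blood": [0.05, 0.91, -0.30],
}


def embed_static(tokens, table, unknown=(0.0, 0.0, 0.0)):
    """Look up each token independently; the surrounding words play no role."""
    return [list(table.get(tok, unknown)) for tok in tokens]


tokens = "cell phone entered the prison cell to extract blood cell samples".split()
vectors = embed_static(tokens, static_table)

# Collect the vector produced at every position where "cell" occurs.
cell_vectors = [vec for tok, vec in zip(tokens, vectors) if tok == "cell"]

print(len(cell_vectors))                                     # occurrences of "cell"
print(all(vec == cell_vectors[0] for vec in cell_vectors))   # same vector every time
```

Because the lookup never sees the neighbouring words, "cell phone", "prison cell" and "blood cell" all receive the same vector for "cell".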

  • ELMo and BERT embeddings are context-dependent, i.e. these models output a different vector representation (embedding) for the same word depending on the context in which it is used.

For the same example, ELMo and BERT would generate four different vectors for the four contexts in which the word “cell” is used.

  • The first “cell” (cell phone) would be close to words like iPhone, Android, ...
  • The second “cell” (prison cell) would be close to words like robbery, crime, ...
  • The third “cell” (blood cell) would be close to words like biology, nucleus, ribosomes, ...
  • The fourth “cell” (Excel cell) would be close to words like Microsoft, datasheets, table, ...
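One way to check the claim above is to compare the vectors produced for each occurrence of a word using cosine similarity. The sketch below uses made-up three-dimensional vectors as stand-ins; with real ELMo output you would pass one 1024-dimensional vector per token instead. The helper names `cosine_similarity` and `compare_occurrences` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def compare_occurrences(tokens, vectors, word):
    """Pairwise similarity between the vectors at every position of `word`."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    return {
        (i, j): cosine_similarity(vectors[i], vectors[j])
        for idx, i in enumerate(positions)
        for j in positions[idx + 1:]
    }


# Made-up stand-ins for contextual vectors (real ELMo vectors have 1024 dimensions).
tokens = ["cell", "phone", "prison", "cell"]
vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.0, 1.0, 0.0]]

similarities = compare_occurrences(tokens, vectors, "cell")
print(similarities)
```

With a static embedding every pair would score 1.0, since the vectors are identical. With contextual embeddings the scores should drop for occurrences that carry different senses.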

Now let’s see how we can use ELMo embeddings on a text dataset. I am assuming you are familiar with the reading and preprocessing steps for text data. If not, you can refer to my GitHub repository or any text preprocessing tutorial. Typically, after preprocessing, we convert text to a numeric representation using traditional techniques (TF-IDF, count vectorizer, word2vec, GloVe, ...). In this article, we will see how to use ELMo embeddings to convert text to context-dependent representations. [The steps below are to be performed after preprocessing the data and before model building.]

Please follow the below steps to implement ELMo embeddings on the text dataset:

Install required libraries

TensorFlow Hub

We will be using TensorFlow Hub, a library of reusable machine learning modules that enables transfer learning by making many pre-trained models available for different tasks. We will access ELMo via TensorFlow Hub in our implementation.

Execute the two commands below in the Anaconda terminal to install tensorflow==1.15.0 and tensorflow-hub, which we use to access ELMo. Please note that the hub.Module API used below is a TensorFlow 1.x API, and at the time of writing TensorFlow 2.0 does not yet support ELMo embeddings.

$ pip install "tensorflow==1.15.0"
$ pip install tensorflow-hub

Import the pre-trained ELMo model using the commands below:

import tensorflow_hub as hub
import tensorflow as tf
# The first argument is the TF Hub URL of the ELMo module (left blank in this listing)
elmo = hub.Module("", trainable=True)

Sample example: let’s run ELMo on the statement above and verify that the embeddings work:

# Statement that we used at the top
sample_statement = ["Jack while talking over the cell phone entered the prison cell to extract blood cell samples of Jill and made an entry in the excel cell about the blood sample collection."]
# Extract ELMo features
embeddings = elmo(sample_statement, signature="default", as_dict=True)["elmo"]
# Inspect the shape of the resulting tensor
embeddings.shape


The shape of the embeddings tensor (embeddings.shape) is “TensorShape([Dimension(1), Dimension(31), Dimension(1024)])”

The output is a 3-dimensional tensor of shape (1, 31, 1024):

  • The first dimension is the number of input strings. It is 1 in our case.
  • The second dimension is the number of tokens in the longest string in the input list. Our single statement has 31 tokens.
  • The third dimension is the length of the ELMo vector, which is 1024.

In simple terms, every word in the input sentence has an ELMo embedding representation of 1024 dimensions.
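The shape above can be sanity-checked without running the model. The sketch below assumes that the module's "default" signature tokenizes by splitting on whitespace, which is consistent with the 31 tokens reported above. The variable names are illustrative only.

```python
sample_statement = ["Jack while talking over the cell phone entered the prison cell to extract blood cell samples of Jill and made an entry in the excel cell about the blood sample collection."]

# Assumption: whitespace tokenization, so punctuation stays attached ("collection.").
tokens = sample_statement[0].split()

# (number of input strings, tokens in the longest string, ELMo vector size)
expected_shape = (
    len(sample_statement),
    max(len(s.split()) for s in sample_statement),
    1024,
)

# Zero-based positions along the second dimension where "cell" occurs.
cell_positions = [i for i, tok in enumerate(tokens) if tok == "cell"]

print(expected_shape)
print(cell_positions)
```

The positions in `cell_positions` are the indices along the second dimension at which the four contextual vectors for "cell" can be read once the tensor has been evaluated.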


Now that the sample code above works, we will build a bidirectional LSTM model architecture that uses ELMo embeddings in the embedding layer.

from tensorflow.keras.layers import Input, Lambda, Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.models import Model

def ELMoEmbedding(input_text):
    # Flatten the (batch, 1) string input to (batch,) and return one ELMo vector per token
    return elmo(tf.reshape(tf.cast(input_text, tf.string), [-1]), signature="default", as_dict=True)["elmo"]

def build_model():
    input_layer = Input(shape=(1,), dtype="string", name="Input_layer")
    # The "elmo" output has shape (batch, tokens, 1024), so the per-sample shape is (None, 1024)
    embedding_layer = Lambda(ELMoEmbedding, output_shape=(None, 1024), name="Elmo_Embedding")(input_layer)
    BiLSTM = Bidirectional(LSTM(1024, return_sequences=False, recurrent_dropout=0.2, dropout=0.2), name="BiLSTM")(embedding_layer)
    Dense_layer_1 = Dense(8336, activation='relu')(BiLSTM)
    Dropout_layer_1 = Dropout(0.5)(Dense_layer_1)
    Dense_layer_2 = Dense(4168, activation='relu')(Dropout_layer_1)
    Dropout_layer_2 = Dropout(0.5)(Dense_layer_2)
    output_layer = Dense(1, activation='sigmoid')(Dropout_layer_2)
    # Underscores instead of spaces keep the model name a valid TensorFlow scope name
    model = Model(inputs=[input_layer], outputs=output_layer, name="BiLSTM_with_ELMo_Embeddings")
    return model

elmo_BiDirectional_model = build_model()
# Compile before training; this loss and optimizer are assumptions that suit the single sigmoid output
elmo_BiDirectional_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

Please note that the model architecture above can be altered. You can add more layers or drop a few. Feel free to play around with these hyper-parameters.
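Before changing the layer sizes, it can help to see how many trainable parameters each layer contributes. The sketch below works this out by hand, assuming the Keras defaults (bias terms enabled, and the outputs of the two LSTM directions concatenated). It does not count the weights inside the ELMo module itself. The helper names `lstm_params` and `dense_params` are hypothetical.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output unit."""
    return input_dim * units + units


elmo_dim = 1024     # size of each ELMo token vector
lstm_units = 1024   # units per LSTM direction in the model above

counts = {
    "BiLSTM": 2 * lstm_params(elmo_dim, lstm_units),   # forward + backward LSTM
    "Dense_1": dense_params(2 * lstm_units, 8336),     # input is the concatenated BiLSTM output
    "Dense_2": dense_params(8336, 4168),
    "Output": dense_params(4168, 1),
}

for name, n in counts.items():
    print(f"{name:8s} {n:>12,d}")
print(f"{'Total':8s} {sum(counts.values()):>12,d}")
```

Most of the parameters sit in the two large Dense layers, so reducing their widths is the quickest way to make the model lighter.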

Model training and predictions

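with tf.Session() as session:
    # Register the session with Keras, then initialize the variables and lookup tables used by ELMo
    tf.keras.backend.set_session(session)
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    model_elmo = elmo_BiDirectional_model.fit(X_train, y_train, epochs=100, batch_size=128)
    train_prediction = elmo_BiDirectional_model.predict(X_train)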

Once training is complete and we have the predictions, we can check how well the model performs. In most cases, it will perform better than the traditional approaches. Please note that training takes a long time, so make sure you have adequate compute resources.
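A simple way to check the predictions is to threshold the sigmoid outputs and compare them with the true labels. The sketch below uses short made-up lists in place of y_train and the model output. In practice the predictions would be flattened to a one-dimensional list first. The 0.5 threshold and the helper names `binarize` and `accuracy` are assumptions for illustration.

```python
def binarize(probabilities, threshold=0.5):
    """Turn sigmoid outputs into 0/1 class labels."""
    return [1 if p >= threshold else 0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of labels that were predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical values standing in for y_train and the flattened train_prediction.
y_true = [1, 0, 1, 1, 0]
predicted_probabilities = [0.91, 0.20, 0.45, 0.77, 0.08]

y_pred = binarize(predicted_probabilities)
train_accuracy = accuracy(y_true, y_pred)
print(y_pred)
print(train_accuracy)
```

Accuracy on the training data alone says little about generalization, so the same check is more informative on a held-out validation or test set.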


  • ELMo embeddings are better than word2vec or GloVe embeddings when context is important for the model.
  • ELMo embeddings are easy to apply to any text data.
  • ELMo embeddings are quite time-consuming to compute. [Compute the embeddings on a GPU, save the model in a pickle file, and use the saved weights during deployment or on test data.]
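The caching idea in the last point can be sketched with the standard pickle module. This version caches computed embeddings rather than a model, and it uses a small nested list as a stand-in for them. In practice the object would be the array obtained by evaluating the ELMo tensor. The function names and the file name are hypothetical.

```python
import os
import pickle
import tempfile


def save_embeddings(embeddings, path):
    """Write the computed embeddings to disk so they are computed only once."""
    with open(path, "wb") as f:
        pickle.dump(embeddings, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_embeddings(path):
    """Read previously cached embeddings back from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Made-up stand-in for embeddings computed once on a GPU machine.
computed = {"sentence_0": [[0.1, 0.2], [0.3, 0.4]]}

cache_path = os.path.join(tempfile.mkdtemp(), "elmo_embeddings.pkl")
save_embeddings(computed, cache_path)

restored = load_embeddings(cache_path)
print(restored == computed)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.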
