Reinforcement Learning: Understand with a Simple Game

Reinforcement learning is an important type of machine learning in which an agent learns how to behave in an environment by performing actions and observing the results. It aims to build intelligent agents that adapt to their environment by analyzing their own experience. In short, it is a computational approach to learning from the consequences of actions.

Let’s understand this in simple terms.

My son Rigved started playing with the TV remote when he was just 2 years old. At the age of 3, he started liking kids' channels, but he did not know how to tune in to them. So he began pressing keys at random and was happy whenever he found one of those channels. Gradually, he memorized the keys (the combinations only, not the numbers).
He failed many times on the way to this stage, but slowly he became perfect at tuning in to those kids' channels.

This is nothing but reinforcement learning: starting without any data, collecting data through random attempts at a problem defined in some domain, and memorizing every result as either a success or a failure.

Let’s understand this better with a simple game: Landing A Rocket

In our demo, the environment is Lunar Lander. It gives the agent its states as observations, along with the rewards the agent receives as it tries to beat the environment. Lunar Lander is an environment taken from OpenAI Gym, a library that provides a huge array of learning environments.

Agent: Lunar Lander
State: An 8-dimensional state vector describing the lander (its position, velocity, angle, angular velocity, and whether each leg touches the ground), not raw pixel data from the screen. This is the state space (continuous).
Action: Lunar Lander has 4 possible actions: do nothing, fire the left orientation engine, fire the main engine, fire the right orientation engine. This is the output space (discrete).
Reward: The reward may be positive or negative. A successful landing earns the agent 200 reward points. If the agent moves away from the landing pad, it loses reward points.
Goal: To land safely on the landing pad.
Episode: An episode finishes if the agent crashes, comes to rest, or reaches 5000 timesteps.
Working: It follows the Monte Carlo approach. Learning is episodic, i.e. the outcome is only known once an episode ends, and that outcome is used as a reference for the next episode.
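
To make the state, action, reward, and episode terms concrete, here is a minimal sketch of one episode loop. It does not use the real Lunar Lander: `ToyLander` is a hypothetical stand-in with made-up dynamics, and it assumes the classic Gym-style interface where `reset()` returns an observation and `step()` returns four values (newer Gymnasium releases return five). The names `run_episode` and `random_policy` are also illustrative.

```python
import random


class ToyLander:
    """Hypothetical stand-in for the real environment: 8 numbers per
    observation, 4 discrete actions, classic Gym-style reset()/step()."""

    def __init__(self, length=20):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0] * 8

    def step(self, action):
        self.t += 1
        obs = [random.uniform(-1.0, 1.0) for _ in range(8)]
        # Made-up reward: a small fuel cost whenever the main engine fires.
        reward = -0.3 if action == 2 else 0.0
        done = self.t >= self.length
        return obs, reward, done, {}


def run_episode(env, policy, max_steps=5000):
    """Play one episode and return (states, actions, total_reward)."""
    states, actions, total_reward = [], [], 0.0
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done, _info = env.step(action)
        states.append(state)
        actions.append(action)
        total_reward += reward
        state = next_state
        if done:
            break
    return states, actions, total_reward


def random_policy(state):
    # 0: do nothing, 1: left engine, 2: main engine, 3: right engine
    return random.randrange(4)


states, actions, total = run_episode(ToyLander(), random_policy)
print(len(states), total)
```

An untrained agent behaves much like `random_policy` here: it tries actions blindly and only finds out at the end of the episode how well it did.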

The expected solution is to land the agent on the landing pad safely and consistently on both legs, with an average cumulative reward of 200 over 100 episodes.
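
The "average of 200 over 100 episodes" criterion can be checked with a few lines of Python. This is only a sketch: the function name `is_solved` and its parameters are assumptions made for illustration, not part of the original training code.

```python
def is_solved(episode_rewards, window=100, target=200.0):
    """True when the mean reward of the last `window` episodes reaches `target`."""
    if len(episode_rewards) < window:
        return False  # not enough episodes played yet
    recent = episode_rewards[-window:]
    return sum(recent) / window >= target


print(is_solved([250.0] * 100))
print(is_solved([250.0] * 99))
```

Only the most recent 100 episodes count, so a few good landings early on do not make up for a long run of crashes afterwards.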

Points to remember before training the agent:

  • The agent always starts at the same starting point.
  • We terminate the episode after a certain number of steps or when the agent reaches the goal state.
  • At the end of the episode, we have a list of states, actions, rewards, and new states.
  • The agent sums the rewards of each episode and updates its experience based on the episodes in the top ‘x’ percentile of total reward.
  • It then starts a new game with this new knowledge.
  • By running more and more episodes, the agent learns to play better and better.
  • The method our agent uses to learn is the Cross-Entropy Method.
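
The percentile step in the list above can be sketched in plain Python. The helper names `select_elites` and `to_training_pairs`, the simple threshold rule, and the tiny hand-made batch are all assumptions for illustration; the original code may compute the percentile differently.

```python
def select_elites(episodes, percentile=70):
    """Keep the episodes whose total reward is at or above the percentile cut.

    `episodes` is a non-empty list of (states, actions, total_reward) tuples.
    """
    rewards = sorted(ep[2] for ep in episodes)
    # Simple rank-based threshold: roughly `percentile` % of rewards lie below it.
    index = min(len(rewards) - 1, int(len(rewards) * percentile / 100))
    threshold = rewards[index]
    elites = [ep for ep in episodes if ep[2] >= threshold]
    return elites, threshold


def to_training_pairs(elites):
    """Flatten elite episodes into (state, action) pairs for supervised training."""
    return [(s, a) for states, actions, _ in elites for s, a in zip(states, actions)]


# Four made-up episodes: (states, actions, total_reward)
batch = [
    ([[0.0] * 8], [0], -150.0),
    ([[0.1] * 8], [2], -20.0),
    ([[0.2] * 8, [0.3] * 8], [1, 2], 35.0),
    ([[0.4] * 8], [3], 210.0),
]
elites, threshold = select_elites(batch, percentile=50)
pairs = to_training_pairs(elites)
print(threshold, len(elites), len(pairs))
```

The resulting (state, action) pairs are what the network is trained on: it learns to repeat the actions taken in the better episodes, and the next batch is played with that improved policy.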

Now we are ready to train the agent. The agent is trained using a feed-forward neural network with an 8 × 200 × 4 architecture (8 nodes in the input layer, 200 nodes in the hidden layer, and 4 nodes in the output layer), and the weights are updated in every session. The number of training sessions is controlled by the session_size variable in the training code.
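
To show what an 8 × 200 × 4 network does with a state, here is a forward pass written in NumPy. It is a sketch only: the original code likely uses a deep learning framework, and the ReLU activation, the random initial weights, and the name `action_probabilities` are assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(0)

N_INPUTS, N_HIDDEN, N_ACTIONS = 8, 200, 4

# Small random initial weights; training would adjust these.
W1 = rng.normal(0.0, 0.1, size=(N_INPUTS, N_HIDDEN))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(0.0, 0.1, size=(N_HIDDEN, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)


def action_probabilities(state):
    """Forward pass: 8 inputs -> 200 hidden units (ReLU) -> 4 action probabilities."""
    hidden = np.maximum(0.0, state @ W1 + b1)
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()


state = np.zeros(N_INPUTS)
probs = action_probabilities(state)
action = rng.choice(N_ACTIONS, p=probs)  # sample one of the 4 actions
print(probs, action)
```

Sampling from the probabilities, rather than always taking the most likely action, keeps some randomness in the agent's behaviour so that it continues to explore.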

Because the agent was new to the environment, he crashed.

The agent was then trained for 50 sessions.

As you can see, he came close to the goal state but fired the right engine more than was required. As a result, he had to undergo more training.

This time the agent was trained for 100 sessions.

This time he was close again but did not reach the goal state. He showed some improvement, though: once he was near the goal state, he did not fire any engines.

Back to training again, with the session size updated to 150.

This was even worse than the previous result. He was a little overconfident and went too fast.

Now update the session size to 200.

This time he was a little careful and successfully landed in the goal state.

Now you must be wondering how he learned this on his own. Some work for your brain. 😛

To implement the deep cross-entropy method, we need to follow a few steps as described in the flowchart below:

The total reward received in each episode is recorded, and a batch of episodes is generated, about 100 episodes per batch.
Once we have gathered the episode data from the batch, we pick the episodes that performed best in that batch. The network is then trained to reproduce the actions taken in those best episodes.

Network Architecture:

8-dimensional state vector (input layer size: 8), hidden layer size: 200, output layer size: 4 (one per action)

A sample network with a hidden layer of 10 nodes.
Loss function: Cross Entropy Loss
Optimizer: Adam (before each training step, the gradients must be reset to zero)
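
For a single training example, the cross-entropy loss is the negative log of the probability the network assigns to the action taken in an elite episode. The sketch below works this out in plain Python; the function names and the example output values (logits) are made up for illustration.

```python
import math


def softmax(logits):
    """Turn raw network outputs into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_action):
    """Negative log-probability assigned to the action from an elite episode."""
    return -math.log(softmax(logits)[target_action])


# The elite episode fired the main engine (action 2) in this state.
confident = cross_entropy([0.0, 0.0, 5.0, 0.0], target_action=2)
unsure = cross_entropy([0.0, 0.0, 0.0, 0.0], target_action=2)
print(confident, unsure)
```

The loss is small when the network already favours the elite action and grows when it does not, so minimizing it pushes the policy toward the behaviour seen in the best episodes.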

The full code with detailed explanation can be found here on my Github –

I hope this article gave you a fair idea of reinforcement learning.
Any questions, feedback, or suggestions for improvement are most welcome. 🙂

