Reinforcement Learning: Understand with a Simple Game
Reinforcement learning is an important type of machine learning in which an agent learns how to behave in an environment by performing actions and observing the results. It aims to create intelligent agents that adapt to their environment by analyzing their own experience. In short, it is a computational approach to learning from action.

Let’s understand this in simple terms.
My son Rigved started playing with the TV remote when he was just 2 years old. At the age of 3, he started liking kids' channels but did not know how to tune in to them. So he began pressing the keys randomly and was happy whenever he found one of those channels. Gradually, he started memorizing the keys (the combinations only, not the numbers).
To reach this stage he failed many times, but slowly he became perfect at tuning in to those kids' channels.
This is nothing but reinforcement learning: starting without any data, collecting data through random attempts at the problem defined in a domain, and memorizing every result as either a success or a failure.
Let’s understand this better with a simple game: Landing A Rocket
In our demo, the environment is Lunar Lander: it gives the agent observations of its state and the rewards the agent receives as it tries to beat the environment. Lunar Lander is an environment taken from OpenAI Gym, a library that provides a huge array of learning environments.
Agent: Lunar Lander
State: An 8-dimensional state vector describing the lander (position, velocity, angle, angular velocity, and leg-contact flags). The state space is continuous.
Action: Lunar Lander has 4 possible actions: do nothing, fire the left orientation engine, fire the main engine, and fire the right orientation engine. The action space is discrete.
Reward: The reward may be positive or negative. A successful landing earns the agent 200 reward points; moving away from the landing pad loses points.
Goal: To safely land on the landing pad.
Episode: An episode finishes when the agent crashes, comes to rest, or after 5000 timesteps.
Working: It follows the Monte Carlo approach, i.e. it is episodic: the outcome of each episode is recorded and used as a reference for the next episode.
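To make this setup concrete, here is a minimal sketch (not the code from the linked repository) of creating the Lunar Lander environment with OpenAI Gym and playing one episode with random actions. It assumes Gym with the Box2D extras installed and uses the classic Gym API, where reset() returns the observation and step() returns four values.

```python
import gym

env = gym.make("LunarLander-v2")   # requires gym[box2d]

obs = env.reset()                   # the 8-dimensional state vector
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()           # one of the 4 discrete actions
    obs, reward, done, info = env.step(action)   # new state, reward, episode-finished flag
    total_reward += reward

print("Total reward of this random episode:", total_reward)
```

A random agent like this almost always crashes, which is exactly the starting point for training.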

The expected solution is to land the agent safely and consistently on the landing pad on both legs, with an average cumulative reward of 200 over 100 episodes.
Points to remember before training the agent:
- The agent always starts at the same starting point.
- We terminate the episode after a certain number of steps or when it reaches the goal state.
- At the end of the episode, we have a list of States, Actions, Rewards, and the New States (see the sketch after this list).
- The agent sums the total rewards and updates its experience based on the top ‘x’ percentile of rewards.
- It then starts a new game with this new knowledge.
- By running more and more episodes, the agent will learn to play better and better.
- The method our agent uses to learn is the Cross-Entropy Method.
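In code, one such episode (a session) could look like the following sketch. It is not the exact code from the repository: generate_session is a hypothetical helper, and it assumes the Gym environment created earlier plus a policy network net like the 8 × 200 × 4 model described in the next section, whose output scores are turned into action probabilities with a softmax.

```python
import numpy as np
import torch
import torch.nn.functional as F

def generate_session(env, net, t_max=5000):
    """Play one episode with the current policy net and record states, actions, and total reward."""
    states, actions = [], []
    total_reward = 0.0
    s = env.reset()
    for _ in range(t_max):
        # The network gives one score per action; softmax turns the scores into probabilities.
        logits = net(torch.FloatTensor([s]))
        probs = F.softmax(logits, dim=1).data.numpy()[0]
        a = np.random.choice(len(probs), p=probs)   # sample an action

        new_s, r, done, _ = env.step(a)
        states.append(s)
        actions.append(a)
        total_reward += r
        s = new_s
        if done:
            break
    return states, actions, total_reward
```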
Now we are ready to train the agent. The agent is trained using a feed-forward neural network with architecture 8 (nodes in the input layer) × 200 (nodes in the hidden layer) × 4 (nodes in the output layer), and the weights are updated after every session. The number of training sessions can be changed by updating the session_size variable in the training code.
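A minimal sketch of that 8 × 200 × 4 network in PyTorch could look like this; the ReLU activation in the hidden layer is an assumption, since the post only specifies the layer sizes.

```python
import torch.nn as nn

# Feed-forward policy network: 8 state inputs -> 200 hidden nodes -> 4 action scores.
net = nn.Sequential(
    nn.Linear(8, 200),
    nn.ReLU(),
    nn.Linear(200, 4),
)
```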
Because the agent was new to the environment, he crashed.
The agent was then trained again for 50 sessions.
As you can see, he came close to the goal state but fired the right engine more than required. As a result, he had to undergo more training.
This time the agent was trained for 100 sessions.
This time he was again close but did not move into the goal state. He showed some improvement though: while he was near the goal state, he did not fire any engines.
Back to training again, update session size to 150.
This was even worse than the previous result. He was a little overconfident and came in too fast.
Update session size to 200 now.
This time he was a little more careful and successfully landed in the goal state.
Now you must be wondering how he learned this on his own. Some work for your brain. 😛
To implement the deep cross-entropy method, we need to follow a few steps, described below:

The total reward received for each episode is recorded. A batch of these episodes is then generated, ~100 episodes per batch.
Once we have gathered the episode data for the batch, we pick the episodes that performed best in that batch.
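A sketch of this elite-selection step, assuming the per-episode states, actions, and total rewards collected by the session-generation helper above; the function name select_elites and the default percentile of 70 are assumptions (the post only says top ‘x’ percentile):

```python
import numpy as np

def select_elites(states_batch, actions_batch, rewards_batch, percentile=70):
    """Keep the (state, action) pairs from episodes whose total reward is
    at or above the chosen percentile of the batch."""
    reward_threshold = np.percentile(rewards_batch, percentile)
    elite_states, elite_actions = [], []
    for states, actions, reward in zip(states_batch, actions_batch, rewards_batch):
        if reward >= reward_threshold:
            elite_states.extend(states)
            elite_actions.extend(actions)
    return elite_states, elite_actions
```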
Network Architecture:
- Input layer size: 8 (the 8-dimensional state vector)
- Hidden layer size: 200
- Output layer size: 4 (one node per action)

(Figure: a sample network with a hidden layer of 10 nodes.)
Loss function: Cross Entropy Loss
Optimizer: Adam (Before each training step, we need to set the gradients of our optimizer back to zero)
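Putting the pieces together, one training step could look like the sketch below. It assumes the net from earlier and the elite (state, action) pairs returned by select_elites; the learning rate of 0.01 is an assumption. The cross-entropy loss pushes the network to assign higher probability to the actions taken in the elite episodes.

```python
import numpy as np
import torch
import torch.nn as nn

objective = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)  # learning rate is an assumption

def train_step(elite_states, elite_actions):
    """One cross-entropy update on the elite (state, action) pairs."""
    optimizer.zero_grad()                                  # reset gradients before each step
    logits = net(torch.FloatTensor(np.array(elite_states)))
    loss = objective(logits, torch.LongTensor(elite_actions))
    loss.backward()
    optimizer.step()
    return loss.item()
```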
The full code with detailed explanation can be found here on my Github – https://github.com/kspkoyyada/Simple-Reinforcement-Learning1
I hope this article was helpful in giving you a fair idea about reinforcement learning.
Any questions, feedback, suggestions for improvement are most welcome. 🙂
References
- Lapan, Maxim — Deep Reinforcement Learning Hands-On, Packt Publishing, 2018
- Sutton R. and Barto A. — Reinforcement Learning: An Introduction, MIT Press, 1998
- Playing Atari with Deep Reinforcement Learning. https://arxiv.org/pdf/1312.5602v1.pdf
- H. Mao, M. Alizadeh, I. Menache, and S. Kandula. Resource Management with Deep Reinforcement Learning. In ACM Workshop on Hot Topics in Networks, 2016.
- I. Arel, C. Liu, T. Urbanik, and A. Kohls, “Reinforcement learning-based multi-agent system for network traffic signal control,” IET Intelligent Transport Systems, 2010.
- http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
- https://medium.com/coinmonks/landing-a-rocket-with-simple-reinforcement-learning-3a0265f8b58c
- https://allan.reyes.sh/projects/gt-rl-lunar-lander/
- https://gym.openai.com/envs
- http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/intro_RL.pdf