By now, you must be familiar with Reinforcement Learning and buzzwords such as environment, agent, policy, goal, and episode. We discussed these in detail in our previous blog, where we explored them through a simple problem of landing a rocket with RL.
Before we start solving the Rubik’s Cube, let’s first understand what a Rubik’s Cube is. A Rubik’s Cube is a puzzle in the form of a plastic cube, covered with multicoloured squares, which the player attempts to twist and turn so that all the squares on each face are of the same colour. A Rubik’s Cube has 6 coloured faces with 9 cubelets on each face. Please check out the figures below:
How hard is it to solve a Rubik’s cube?
It’s pretty hard, and here’s why!
A 3×3 Rubik’s Cube has 6 faces, and each face can be rotated in 2 different directions, which means 12 possible rotations in total. It has been calculated that a 3×3 cube has 4.33 × 10¹⁹ distinct states. Finding an optimal solution among that many states is almost impossible for a human without a strategy or algorithm.
Approaches to solve the Rubik’s Cube:
Algorithm (Brute force search)
In the paper, the authors introduced a technique in which they train a neural network on randomly scrambled cubes. The goal is to obtain a policy that points us from any state toward the goal (solved) state. Essentially, we want the computer to learn to solve the Rubik’s Cube without any human intervention; the hardest part is encoding everything so that the computer can work with it.
Steps that we need to follow are:
Create input data for training the neural network. For this, we need a representation of the cube that the computer can work with.
Train the neural network.
Apply the model to find the optimal solution.
Let’s go through each step in detail:
Create data: Data representation is one of the most important steps. Let’s encode the cube’s two major entities:
Actions: These are the possible rotations that can be made at any given instant. There are 12 ways the cube can be rotated at an instant: each of the six faces (left, right, top, bottom, front and back) can be turned in two directions.
Positions or states: A state is an arrangement of the coloured cubelets. As we discussed, the search space is huge, so it is important to encode states compactly to avoid running out of memory.
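The paper’s 20×24 encoding tracks the 20 movable cubelets (8 corners + 12 edges), each one-hot encoded over its possible position/orientation slots. A minimal sketch, assuming the state is supplied as a length-20 list of slot indices (an input format chosen here for illustration):

```python
import numpy as np

def encode_state(piece_slots):
    """One-hot encode a cube state as a 20x24 array: 20 movable
    cubelets (8 corners + 12 edges), each occupying one of 24
    position/orientation slots."""
    state = np.zeros((20, 24), dtype=np.float32)
    for piece, slot in enumerate(piece_slots):
        state[piece, slot] = 1.0
    return state

# Toy "solved" arrangement: piece i sits in slot i.
solved = encode_state(list(range(20)))
```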
Training a neural network: A machine learning model accepts an input and produces an output after learning a function. Here, the input is the current state of the cube, encoded as a 20×24 tensor. The architecture of the network, as described in the paper, is shown below:
The output of the network is a vector of 12 numbers (the policy), one score per possible move, together with a scalar value estimating how good the current state is.
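The two-headed output can be sketched as a toy NumPy forward pass. The layer sizes below are placeholders, not the paper’s exact architecture, and the weights are random rather than trained:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative layer sizes only (the paper's network is larger).
W1 = rng.normal(0, 0.1, (20 * 24, 512))
W2 = rng.normal(0, 0.1, (512, 256))
Wp = rng.normal(0, 0.1, (256, 12))  # policy head: one score per move
Wv = rng.normal(0, 0.1, (256, 1))   # value head: scalar state estimate

def forward(state_20x24):
    h = relu(relu(state_20x24.reshape(-1) @ W1) @ W2)
    policy = softmax(h @ Wp)  # probability over the 12 moves
    value = float(h @ Wv)     # how good the current state looks
    return policy, value
```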
The training methodology: The training procedure the authors introduce is called Autodidactic Iteration (ADI). The name sounds complicated, but the method is simple and efficient. Let’s discuss it in simple terms:
We start from an assembled cube, which is the goal state. Random moves are applied to scramble it; then, for each scrambled state, each of the 12 possible moves is applied to produce 12 child states, and each child is fed into the network, which returns a value estimate.
The target value is calculated using the formula shown in the figure: each of the 12 children receives a reward of +1 if it is the goal state and -1 otherwise, and the target value is the maximum over the children of that reward plus the network’s value estimate for the child.
Similarly, the target policy is calculated using the formula shown below:
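The target computation just described can be sketched for a single scrambled state. This is a minimal illustration of the ADI targets, assuming the 12 child value estimates and solved-flags have already been computed:

```python
import numpy as np

def adi_targets(children_values, children_solved):
    """Compute ADI training targets for one scrambled state.

    children_values: value estimates v(A(x, a)) for the 12 child states.
    children_solved: booleans, True if child a is the solved cube.
    The reward R is +1 for reaching the solved state, -1 otherwise.
    """
    rewards = np.where(children_solved, 1.0, -1.0)
    q = rewards + np.asarray(children_values)
    y_value = float(q.max())    # target value: max over moves of R + v
    y_policy = int(q.argmax())  # target policy: the move achieving that max
    return y_value, y_policy
```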
This process generates the training data. As the search space is too large for us to simply trust the raw model output, the authors use Monte Carlo Tree Search to find the optimal solving sequence.
Monte Carlo Tree Search
The focus of Monte Carlo tree search is on the analysis of the most promising moves, expanding the search tree based on a random sampling of the search space. The application of Monte Carlo tree search in games is based on many playouts. In each playout, the game is played out to the very end by selecting moves at random. The final game result of each playout is then used to weight the nodes in the game tree so that better nodes are more likely to be chosen in future playouts.
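The playout idea can be illustrated on a toy one-player game rather than the full cube. This is a flat Monte Carlo sketch of the core principle (score each move by random playouts); the full tree search adds selection, expansion and backpropagation on top of it, and DeepCube guides the search with the trained network instead of purely random moves:

```python
import random

# Toy game: start at 0, each move adds +1 or +2, the episode ends
# after a fixed number of moves, and the final total is the "result"
# (a stand-in for whether a playout solved the cube).

def playout(total, moves_left):
    """Play the game to the very end by selecting moves at random."""
    while moves_left:
        total += random.choice((1, 2))
        moves_left -= 1
    return total

def monte_carlo_choose(total, moves_left, n_sim=2000):
    """Score each candidate first move by the average result of random
    playouts that begin with it, then pick the best-scoring move."""
    scores = {}
    for move in (1, 2):
        results = [playout(total + move, moves_left - 1)
                   for _ in range(n_sim // 2)]
        scores[move] = sum(results) / len(results)
    return max(scores, key=scores.get)
```

Here the playouts reliably discover that the +2 move is better, mirroring how playout results weight the more promising nodes in the game tree.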
See the results below after running the experiments: