If you have a dog, you will probably know the way how reinforcement learning works – you just haven’t called it that yet. Let’s assume we have a mediocre intelligent dog that can perform different actions. He can sit, lay down or just do nothing. Now we tell our dog to lay down. Obeying us leads to a reward in form of treats. A defiant dog gets nothing or is yelled at as a kind of “negative reward” (which we obviously don’t do because we are nice people). After a while our imaginary dog starts noticing that he will get more of his beloved food, as soon as he always acts accordingly to our command. In a broader sense the figure below shows this cycle.
We as the owner personify the environment and give our dog a state St in form of a command for example. Our dog is the agent that receives the state and thereupon takes an action At like laying down. The rules that the agent uses to choose an action form the so called the policy. We react to that action At with treats or nothing as reward Rt+1 and return the new state St+1, which can be nothing or a command again.
The objective of RL is to find a policy that maximizes the reward. So let’s switch to the digital world and try reaching the goal in an environment that is causing less costs than rebuying new dog treats all the time.
Working with OpenAI Gym
For this introductory post we’ll use the Gym toolkit to create an agent that is able to play a simple taxi game. I encourage you to read their introduction first to get comfortable with Gym. With its easy API we can dive right into writing our first RL algorithm.
But let’s take a look at the game first. It consist of a 5×5 matrix containing our taxi (green if manned) and four different cabstands labeled with letters. Also, there are some walls in the environment that our taxi can’t pass. The task of the game is to pick up passengers at one of the cabstands and carry them to their destinations.
To do that, our agent has six possible actions to choose from. He can go north, south, east or west and he can try to pick up or drop off a passenger. This is called the action space of our taxi. Besides the action space we also have to define the state space. As we have 5*5 taxi locations, 5 different passenger locations (because we have to include the passenger being in our taxi) and 4 different destinations, the total number of states is 5*5*5*4 = 500.
Performing actions rewards the agent with points. He receives 20 points for a successful drop-off and loses 1 point for every time-step it takes. The latter results in our agent trying to solve the task fairly quick and prevents him from wandering around. There is also a -10 point penalty for illegal pick-up and drop-off actions and -1 penalty for driving against a wall.
The Q-learning agent
A good way to approach a solution is using the simple Q-learning algorithm, which gives our agent a memory in form of a Q-table. In this table of size states x actions we store a value for each state-action combination. Those values estimate the reward we get by taking that action and are called Q-values. Thus Q-values represent the “quality” of an action taken from that state. Higher Q-values imply better chances of getting greater rewards. To calculate them, we use the following function:
It looks pretty complex, but is easy to understand. The new Q-value of the state-action pair is based on the sum of two parts. Ignoring the α for a moment, the first part represents the old Q-value and the second part is the sum of the reward r we got by taking action at at state st and the discounted estimate of optimal future reward. The very last term returns the maximum Q-value in the next state st+1 over all actions a. This is the future reward, which is discounted by the factor γ. We do that, because we want our agent to focus more on the immediate rewards while not fully ignoring the future rewards. Now to parameter α which is the learning-rate. It determines to what proportion we weigh in the two parts into the new Q-value.
Since we initialize the Q-table with zeros, there is no best action in the start. Thus we have to choose randomly. This becomes problematic once one positive Q-value is found. That leads to the Q-function always returning that specific action. We wouldn’t take the optimal strategy as we’d not get to know whether there is an even higher Q-value. That’s where the epsilon parameter comes in to play. It decides whether we are using the Q-function to determine our next action or take a random sample of the action space. This has the advantage of not stopping to explore after we found a Q-value greater zero. Instead we are starting off exploring the action space and after every game played we decrease epsilon until reaching minimum. With enough exploration done, we can start exploiting the learnt. We call that exploration-exploitation trade-off, which is necessary to control the agent’s greed.
Realizing the theory in python
First of all we initialize the gym environment.
env = gym.make("Taxi-v2")
We continue by creating the Q-table as numpy array. The size of the spaces can be accessed as seen below and np.zeros() just creates an array of the given shape filled with zeros.
import numpy as np
state_space = env.observation_space.n
action_space = env.action_space.n
qtable = np.zeros((state_space, action_space))
The last thing that needs to be predefined are the hyper-parameters. The learning-rate and discount-factor in our Q-function can be tweaked to improve the learning process. You can leave them unchanged for now and deal with them later.
epsilon = 1.0 #Greed 100%
epsilon_min = 0.005 #Minimum greed 0.05%
epsilon_decay = 0.99993 #Decay multiplied with epsilon after each episode
episodes = 50000 #Amount of games
max_steps = 100 #Maximum steps per episode
learning_rate = 0.65
gamma = 0.65
All that’s left to do is implementing the procedure of playing games over and over again. In every episode we reset the state. After resetting we choose an action, step the game forward and update our Q-table until the game is over or we reach the maximum steps allowed. Finally we decrease our epsilon each episode.
for episode in range(episodes):
# Reset the game state, done and score before every episode/game
state = env.reset() #Gets current game state
done = False #decides whether the game is over
score = 0
for _ in range(max_steps):
# With the probability of (1 - epsilon) take the best action in our Q-table
if random.uniform(0, 1) > epsilon:
action = np.argmax(qtable[state, :])
# Else take a random action
action = env.action_space.sample()
# Step the game forward
next_state, reward, done, _ = env.step(action)
# Add up the score
score += reward
# Update our Q-table with our Q-function
qtable[state, action] = (1 - learning_rate) * qtable[state, action] \
+ learning_rate * (reward + gamma * np.max(qtable[next_state,:]))
# Set the next state as the current state
state = next_state
# Reducing our epsilon each episode (Exploration-Exploitation trade-off)
if epsilon >= epsilon_min:
epsilon *= epsilon_decay
And that’s the whole agent in less than 100 lines of code!
Another attempt to solve the environment is an agent that just takes random actions. Neither does he learn, nor remember anything. We are only restricting the allowed amount of moves as before. The implementation is a slimmed version of the Q-learning agent – we are leaving the whole Q-table aspect out. I’m aware of this agent presumably not performing very well, but we can use him in contrast to our previous agent.
Now that we have created both a random agent and a Q-learning agent let’s compare them.
The random agent is behaving pretty consistent throughout all games. Most of the games ended up with a score between around -300 and -500, although in some of them the agent played nearly perfectly around the 0 score region. But these games are pretty rare and the score-scale would have likely extended downwards as we have limited the moves of our agent.
The next graph showing the Q-learning agent’s performance looks a lot better. The first 3000 games are similar to the random agent. But after that the average performance goes up pretty rapidly. At 20000 games the performance converges. Still there are some games that end up with up to -100, because at that point we have an exploration rate of about 25%.
Congratulations, you just learned what reinforcement learning is and how to implement a decently performing Q-learning agent – and everything without a reward! Feel free to play around with the code and leave your thoughts in the comments. In the next part of this series, we will take a look at how to solve the same task described here aswell as a new one with a Deep Q-Network. So stay curious!