Reinforcement Learning for Atari Breakout in Python


Reinforcement learning is a subfield of artificial intelligence that is concerned with developing algorithms and models that can learn from feedback and optimize actions to achieve a specific goal. One popular application of reinforcement learning is training an Artificial Intelligence (AI) agent to play video games.

In this blog post, we’ll explore how to use reinforcement learning to train an AI agent to play the classic Atari game, Breakout. We’ll discuss the Deep Q-Network (DQN) algorithm, which is a popular technique for training game-playing AI agents. We’ll also provide a step-by-step tutorial on how to implement the DQN algorithm in Python using the PyTorch library and the OpenAI Gym environment to train an AI agent to play Atari Breakout.

The results of training a model with this algorithm can be seen in the following video:

By the end of this post, you’ll have a good understanding of how to use reinforcement learning to train an AI agent to master Atari games in Python. Let’s start.

What is Reinforcement Learning?

Reinforcement learning (RL) is a subfield of artificial intelligence (AI) that focuses on developing algorithms and models that enable agents to learn from feedback to optimize their actions to achieve a particular goal.

In reinforcement learning, an agent interacts with an environment and receives rewards or penalties for its actions. The agent then uses this feedback to update its decision-making process and improve its future actions to maximize the cumulative reward over time.

RL has been successfully applied in various domains such as robotics, game playing, recommendation systems, and more. The primary advantage of reinforcement learning is its ability to learn from trial-and-error without requiring a pre-defined set of rules or examples.

I don’t intend to explore this topic further in this blog post, as I’m planning to write another one that specifically covers the details of reinforcement learning. I want to make it clear that I’ll be using RL algorithms to train an AI agent to play Atari Breakout. To achieve this, we’ll create an environment where the agent can play the game, make decisions, and learn from the outcomes of those decisions. After several hours of training, the agent should be able to play the game at a high level on its own.

Algorithm composition

The following image has a visual representation of the algorithm that will be presented in this post:

Graphical representation of the main components of the Reinforcement Learning algorithm

Each component has been built as a separate Python script, which can be found in our GitHub repository. Here is a high-level description of each component:

  • Deep Q-Network (DQN): a Python script that defines the neural network architecture, epsilon-greedy action selection strategy, replay memory buffer, and a function for converting input image frames to PyTorch tensors for the Deep Q-Network (DQN) algorithm used to train an agent to play a game from pixel input.
  • Agent trainer: a Python script that trains a neural network to learn how to play the game using reinforcement learning and saves the trained policy network.
  • Atari Wrappers: a Python script containing a set of wrappers for modifying the behavior of Atari environments in OpenAI Gym, with the goal of preprocessing raw game screen frames and providing additional features for training reinforcement learning models.

This project also includes the Renderer script, which loads a trained policy file and uses it to play the game. This allows us to observe the progress of the agent visually and evaluate its real-time performance.

Going forward, I will provide a detailed explanation of each script, allowing you to gain a better understanding of the inner workings of this algorithm.

Deep Q-Network (DQN)

This part of the algorithm has been built in the script. It defines the neural network architecture for the Deep Q-Network (DQN) algorithm, which is used to train an agent to play a game from pixel input. The neural network architecture consists of convolutional layers followed by fully connected layers. The agent learns to play the game using reinforcement learning by interacting with the environment and updating the weights of the neural network accordingly.

Here is a graphical representation of the neural network:

Graphical representation of the Deep-Q Learning Neural Network

The neural network architecture takes 4 consecutive frames as input. Each frame is represented as a 3D matrix of pixels, which can be obtained using the OpenAI Gym tools. The frames are then converted to grayscale, effectively reducing the 3 dimensions of color to a single one, resulting in a 2D matrix representation.

This process is represented in the following image:

Reducing the dimensions of the input is advantageous for training the Neural Network as it reduces the amount of information to process. In this game, color is not crucial for recognizing any significant features that could aid in learning from the environment.

In the Python script, this dimension reduction is done using the fp function:

These code lines transform the image dimensions from 3 to 1 and reshape it to a square form (height by height). This compresses the image and reduces its size, leading to a minor loss of data. However, this transformation simplifies the image processing.
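
As a rough illustration of this kind of preprocessing, the sketch below converts an RGB frame to grayscale and shrinks it to a square image using only NumPy. It is not the fp function from the script: the name preprocess_frame, the 84-pixel target size, and the nearest-neighbour downsampling are all assumptions chosen for illustration.

```python
import numpy as np

def preprocess_frame(frame, size=84):
    """Convert an RGB frame (H, W, 3) to a square grayscale image (size, size).

    Hypothetical helper: the target size and the nearest-neighbour
    resampling are assumptions, not the script's actual settings.
    """
    # Weighted sum of the colour channels (standard luminance weights).
    gray = frame[..., 0] * 0.299 + frame[..., 1] * 0.587 + frame[..., 2] * 0.114
    # Pick evenly spaced rows and columns to shrink the image.
    rows = np.linspace(0, gray.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, size).astype(int)
    return gray[np.ix_(rows, cols)].astype(np.uint8)

# A random array standing in for a 210x160 RGB Atari frame.
raw = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
small = preprocess_frame(raw)
print(raw.shape, "->", small.shape)
```

The colour axis disappears and the remaining 2D matrix is much smaller than the raw frame, which is the effect described above.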

That said, each frame will be represented as a matrix like this:

Frame matrix

By using 4 consecutive frames of the game as input, we can provide better information about what is happening in the game. A single frame doesn’t show enough context, like whether the ball is moving up or down or how fast it’s moving.

These 4 frames are packed into a new matrix, the Frame Stack Matrix:

Frame stack matrix

The Frame Stack matrix may look like a 2D matrix, but it’s actually a 3D matrix. Each index in the matrix contains a Frame matrix, which is a 2D matrix.

Finally, we will combine 32 frame stack matrices into a “batch” matrix, which is a 4D matrix. This 4D matrix, also known as the input tensor, will be passed as input to the neural network.

Input tensor shape

When training the neural network, each element of the input tensor will be passed through the neural network, as shown in the graphical representation of the Deep-Q Learning Neural Network (see above).
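
The way frames, frame stacks, and batches nest can be sketched with NumPy arrays. The 84x84 frame size is an assumption; the stack size of 4 and the batch size of 32 come from the description above. Each dummy frame is filled with its step number so it is easy to see which frames end up in the stack.

```python
from collections import deque
import numpy as np

FRAME_STACK_SIZE = 4   # frames per stack, as described in the post
BATCH_SIZE = 32        # stacks per batch, as described in the post
FRAME_SIDE = 84        # assumed frame height and width

# Keep only the most recent frames; older ones drop out automatically.
frames = deque(maxlen=FRAME_STACK_SIZE)
for step in range(6):
    frames.append(np.full((FRAME_SIDE, FRAME_SIDE), step, dtype=np.uint8))

frame_stack = np.stack(frames)                 # 3D: (4, 84, 84)
batch = np.stack([frame_stack] * BATCH_SIZE)   # 4D: (32, 4, 84, 84)
print(frame_stack.shape, batch.shape)
```

In real training, every stack in the batch would be a different sample drawn from the replay memory; repeating one stack here only serves to show the shapes.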

The structure of the neural network is defined in the constructor of class DQN:

This PyTorch Neural Network code is straightforward and easy to understand, with a clear connection between the code and its graphical representation.

When training this network, the forward function is used:

This function takes the input tensor “x” and calculates the Q-values for the four actions that the agent can choose from, based on the previous four frames of the game.
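
To make the shape of such a network concrete, here is a minimal PyTorch sketch of convolutional layers followed by fully connected layers. The class name SketchDQN and all layer sizes are assumptions: they follow the layout commonly used for DQN with 84x84 inputs and are not necessarily the ones used in the script.

```python
import torch
import torch.nn as nn

class SketchDQN(nn.Module):
    """Convolutional layers followed by fully connected layers.

    Layer sizes are assumptions (a common DQN layout for 84x84 inputs).
    """

    def __init__(self, in_frames=4, n_actions=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x):
        # Scale pixel values to [0, 1] before the convolutions.
        x = x.float() / 255.0
        features = self.conv(x)
        # Flatten everything except the batch dimension.
        return self.fc(features.flatten(start_dim=1))

net = SketchDQN()
batch = torch.zeros(32, 4, 84, 84, dtype=torch.uint8)
q_values = net(batch)
print(q_values.shape)  # torch.Size([32, 4])
```

Each of the 32 frame stacks in the batch yields one row of four Q-values, one per action.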

The 4 possible actions are:

    1. Move the paddle to the left
    2. Move the paddle to the right
    3. Do nothing (i.e., stay in the same position)
    4. Fire the ball (not a commonly used action)

In this context Q-values represent the expected future reward an agent will receive if it takes a certain action in a given state. In the case of the Atari Breakout game, the Q-values calculated by this function represent the expected future reward for each of the four possible actions based on the current state of the game, as observed through the previous four frames. The agent then uses these Q-values to determine which action to take in order to maximize its cumulative reward over time.
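
In DQN, "expected future reward" usually means a discounted sum: rewards that arrive later count a little less than rewards that arrive now. The sketch below computes that sum for a known reward sequence. The function name and the default discount factor of 0.99 are assumptions; the post only names gamma as a hyperparameter.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards where the reward at step k is weighted by gamma**k."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

A Q-value is the network's estimate of this quantity for a given state and action, before the future rewards are actually known.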

At the beginning of the training, the Agent has no experience in the environment and cannot predict the future reward of each action. Therefore, the neural network is unable to produce appropriate Q-values. However, after playing the game multiple times, the neural network will learn to predict the future reward of taking a certain action based on the state of the game (input tensor).

This script contains a function to initialize the weights of the neural network at the start of training:

This creates a blank slate with no prior experience, allowing the agent to learn from its environment by gaining experience while playing the game.

The DQN algorithm uses a combination of exploitation and exploration to learn from the environment. During the early stages of training, the agent needs to explore the environment to learn about the available actions and their effects. To achieve this, the algorithm uses a variable called random_exploration_interval, which determines the number of episodes that the agent should explore before switching to exploitation.

If the number of episodes played is less than the random_exploration_interval, the agent will select actions randomly to explore the environment. This encourages the agent to take risks and try new actions, even if they have not been tried before. In this stage the Q-values are computed but not considered for the action selection.

Once the agent has played enough episodes to surpass the random_exploration_interval, it will switch to exploitation and select actions based on the Q-values learned by the network. However, to ensure that the agent continues to explore the environment and avoid getting stuck in local optima, the algorithm uses a technique called Epsilon Decay or Annealing. This gradually decreases the probability of selecting a random action over time, so that the agent becomes more and more likely to select actions based on the Q-values as it gains experience.

This whole process is implemented in the ActionSelector class:

The constructor takes the following parameters as inputs:

  • initial_epsilon
  • final_epsilon 
  • epsilon_decay
  • random_exp

These parameters are used to model the Annealing function, which exhibits behavior consistent with the response of a first-order system, similar to a discharging RL or RC circuit. The discharge of a first order system can be modeled like this:

Where e is the Euler number, t is time and τ is the time constant. This function produces a time response like this:

First order system time response

First-order systems exhibit an interesting property: they have effectively decayed to zero once a time period equal to 5τ has elapsed. For example, if τ=1 (as in the image above), the value of the function will have decayed to about 0.67% of its initial value (e^(-5) ≈ 0.0067), which can be considered effectively extinguished.
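
The 5τ rule is easy to check numerically. This small sketch evaluates the unit-amplitude decay e^(-t/τ) at whole multiples of τ; the function name first_order_decay is just a label chosen for the example.

```python
import math

def first_order_decay(t, tau=1.0):
    """Value of a unit-amplitude first-order decay at time t."""
    return math.exp(-t / tau)

for multiple in range(6):
    value = first_order_decay(multiple * 1.0, tau=1.0)
    print(f"t = {multiple} tau -> {value:.4f}")
```

At t = 5τ the printed value is about 0.0067, which is the 0.67% mentioned above.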

For the Reinforcement Learning algorithm, time is not important, so we focus on the episode number instead. initial_epsilon will be the initial amplitude of the exponential function; final_epsilon is the final value; epsilon_decay is equivalent to τ; and random_exp acts as an episode delay for the exponential function. We can model this as follows:

Value of Epsilon as function of episodes

In this function:

  • E is the number of the actual episode being played
  • ε is the value of the Epsilon function
  • Rexp is the number of episodes that will be used for the exploration stage of the algorithm
  • εi is the desired initial value of the Epsilon function at the start of Annealing stage
  • εf is the desired final value of the Epsilon function at the end of Annealing stage
  • εdecay is used to control the speed of the Annealing
  • u(E) is a step function to represent the two stages of the function
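
One plausible way to write such a schedule in code is shown here. The piecewise form (fully random until random_exp episodes, then an exponential decay from initial_epsilon towards final_epsilon) is an assumption based on the description of the symbols above, and the default values other than final_epsilon are illustrative only.

```python
import math

def epsilon_at(episode, initial_epsilon=1.0, final_epsilon=0.05,
               epsilon_decay=1000.0, random_exp=500):
    """Probability of picking a random action at a given episode.

    Assumed piecewise form; defaults other than final_epsilon are illustrative.
    """
    if episode < random_exp:
        # Exploration stage: every action is random.
        return 1.0
    # Annealing stage: exponential decay from initial_epsilon to final_epsilon.
    elapsed = episode - random_exp
    return final_epsilon + (initial_epsilon - final_epsilon) * math.exp(-elapsed / epsilon_decay)

for episode in (0, 499, 500, 1500, 5500):
    print(episode, round(epsilon_at(episode), 4))
```

Setting initial_epsilon equal to final_epsilon makes the exponential term vanish, so epsilon stays constant once the exploration stage is over.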

The following chart shows a graphical representation of the Epsilon function:

Epsilon function for action selection

During the initial stage of training, the agent selects actions randomly with a 100% probability. Once the agent has played more than random_exp episodes, the annealing process begins, and the agent starts selecting actions based on its past experience instead of choosing actions randomly.

As the number of episodes played surpasses 5 times the value of epsilon_decay, the algorithm enters its final stage, in which most actions are selected based on the output of the neural network, with a small percentage of actions still selected randomly. This encourages continued exploration of the environment during later stages of training, while also ensuring that the agent mostly selects actions based on what it has learned from experience.

During the development of this algorithm, I have been using a value of 0.05 for the final value of epsilon (εf). This means that after the annealing process is complete, actions will be selected with a 5% probability based on random exploration, and with a 95% probability based on the Q-values learned by the network.
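
Epsilon-greedy selection itself fits in a few lines. This is a generic sketch, not the ActionSelector class from the script: the function name select_action and the example Q-values are made up for illustration.

```python
import random

def select_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda action: q_values[action])

rng = random.Random(0)
q_values = [0.1, 0.7, 0.2, 0.0]   # made-up Q-values for the four actions
picks = [select_action(q_values, 0.05, rng) for _ in range(10_000)]
greedy_share = picks.count(1) / len(picks)
print(f"greedy action chosen {greedy_share:.1%} of the time")
```

The greedy action ends up being chosen slightly more than 95% of the time, because a random pick can also land on it.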

The script also features a ReplayMemory Class, with the following code:

This class represents a data structure for storing experience tuples in the form (state, action, reward, done, next_state) used in reinforcement learning algorithms.

When initialized, the class takes three parameters:

  • capacity: determines the maximum number of tuples that the memory can store
  • state_shape: describes the shape of the state tensor
  • device: indicates whether to use a GPU or CPU for computations.

The __init__ method initializes the memory buffer with empty tensors to store the tuples. The push method adds a new tuple to the memory buffer at the current position, and updates the position and size of the buffer accordingly.

The sample method returns a batch of batch_state experience tuples randomly sampled from the memory buffer. It retrieves the current state, next state, action, reward, and done values for each tuple in the batch, and returns them as tensors.

The __len__ method returns the current size of the memory buffer.
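
The behaviour of such a buffer can be sketched with NumPy arrays. This is a simplified stand-in, not the ReplayMemory class from the script: the class name RingReplayMemory is hypothetical, the device parameter is left out, and the seed argument exists only to make the example reproducible.

```python
import numpy as np

class RingReplayMemory:
    """Fixed-size buffer of (state, action, reward, done, next_state) tuples."""

    def __init__(self, capacity, state_shape, seed=0):
        self.capacity = capacity
        self.states = np.zeros((capacity, *state_shape), dtype=np.uint8)
        self.next_states = np.zeros((capacity, *state_shape), dtype=np.uint8)
        self.actions = np.zeros(capacity, dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=np.bool_)
        self.position = 0   # index where the next tuple is written
        self.size = 0       # number of tuples currently stored
        self.rng = np.random.default_rng(seed)

    def push(self, state, action, reward, done, next_state):
        i = self.position
        self.states[i], self.next_states[i] = state, next_state
        self.actions[i], self.rewards[i], self.dones[i] = action, reward, done
        # Wrap around so the oldest tuple is overwritten once the buffer is full.
        self.position = (self.position + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        # Uniform random indices break the ordering of consecutive steps.
        idx = self.rng.integers(0, self.size, size=batch_size)
        return (self.states[idx], self.actions[idx], self.rewards[idx],
                self.dones[idx], self.next_states[idx])

    def __len__(self):
        return self.size

memory = RingReplayMemory(capacity=100, state_shape=(4, 84, 84))
blank = np.zeros((4, 84, 84), dtype=np.uint8)
for step in range(250):
    memory.push(blank, step % 4, 1.0, False, blank)
states, actions, rewards, dones, next_states = memory.sample(32)
print(len(memory), states.shape, actions.shape)
```

After 250 pushes into a buffer of capacity 100, only the 100 most recent tuples remain.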

Replay Memory is a critical component of reinforcement learning algorithms, especially deep Q-learning networks. The purpose of the replay memory is to store past experiences (state, action, reward, next state) and randomly sample a batch of them to train the neural network. This helps the network to learn from a diverse set of experiences, preventing it from overfitting to a specific set of experiences, leading to better generalization to unseen situations.

Replay memory also allows for the breaking of the sequential correlation between experiences, which can cause issues when training the network. By randomly sampling experiences from the memory buffer, the network is exposed to a more diverse set of experiences, and it learns to generalize better.

All these components are packed in the script, which is used by to train the Reinforcement Learning algorithm to play this game. Let’s move on to the other scripts.

Atari Wrappers

This code is a Python script that contains a set of wrappers for modifying the behavior of Atari environments in OpenAI Gym. These wrappers preprocess raw game screen frames and provide additional features for training reinforcement learning models.

I did not author this code. It is largely based on an OpenAI baseline that can be accessed here, with only minor modifications.

Here’s a brief description of each method in the code, from my understanding:

  • NoopResetEnv: This wrapper adds a random number of “no-op” (no-operation) actions to the start of each episode to introduce some randomness and make the agent explore more.
  • FireResetEnv: This wrapper automatically presses the “FIRE” button at the start of each episode, which is required for some Atari games to start.
  • EpisodicLifeEnv: This wrapper resets the environment whenever the agent loses a life, rather than when the game is over, to make the agent learn to survive for longer periods.
  • MaxAndSkipEnv: This wrapper repeats each action for a fixed number of frames (usually 4) and returns the pixel-wise maximum over the most recent frames (the last two in the OpenAI baseline), to reduce the impact of visual artifacts such as flicker and make the agent learn to track moving objects.
  • ClipRewardEnv: This wrapper clips the reward signal to be either -1, 0, or 1, to make the agent focus on the long-term goal of winning the game rather than short-term rewards.
  • WarpFrame: This wrapper resizes and converts the game screen frames to grayscale to reduce the input size and make it easier for the agent to learn.
  • FrameStack: This wrapper stacks a fixed number of frames (usually 4) together to give the agent some temporal information and make it easier for the agent to learn the dynamics of the game.
  • ScaledFloatFrame: This wrapper scales the pixel values to be between 0 and 1 to make the input data more compatible with deep learning models.
  • make_atari: This function creates an Atari environment with various settings suitable for deep reinforcement learning research, including the use of a NoFrameskip variant of the game and a maximum number of steps per episode.
  • wrap_deepmind: This function applies a combination of the defined wrappers to the given env object, including EpisodicLifeEnv, FireResetEnv, WarpFrame, ClipRewardEnv, and FrameStack. The scale argument can be used to include the ScaledFloatFrame wrapper as well.
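
The core operations behind three of these wrappers are simple enough to show as plain functions. These are not the Gym wrapper classes themselves: the function names are hypothetical, and the tiny 2x2 frames are made up so the result is easy to read.

```python
import numpy as np

def clip_reward(reward):
    """ClipRewardEnv idea: map any reward to -1, 0, or 1 according to its sign."""
    return float(np.sign(reward))

def max_over_frames(frame_a, frame_b):
    """MaxAndSkipEnv idea: pixel-wise maximum of two consecutive frames."""
    return np.maximum(frame_a, frame_b)

def scale_frame(frame):
    """ScaledFloatFrame idea: convert 0..255 pixel values to floats in 0..1."""
    return frame.astype(np.float32) / 255.0

# An object visible only in one of the two frames survives the maximum.
frame_a = np.array([[0, 200], [0, 0]], dtype=np.uint8)
frame_b = np.array([[0, 0], [90, 0]], dtype=np.uint8)
merged = max_over_frames(frame_a, frame_b)

print(clip_reward(7.0), clip_reward(-3.0), clip_reward(0.0))
print(merged)
print(scale_frame(merged).max())
```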

From this code only make_atari and wrap_deepmind are used in the script, which we are going to describe in the next section.

Agent Trainer

The script contains the main training loop for a reinforcement learning agent using the Deep Q-Network (DQN) algorithm. The script initializes the DQN model, sets hyperparameters and creates a replay memory buffer for storing experience tuples.

The script then runs the main training loop for a specified number of episodes, during which the agent interacts with the environment, samples experience tuples from the replay memory buffer, and updates the parameters of the DQN model. The script also periodically evaluates the performance of the agent on a set of evaluation episodes and saves the current DQN model and replay memory buffer to disk for later use.

Now I will go into the details of this code. First I want to describe the hyperparameters of the code:

By changing these values you can greatly affect the performance of the training algorithm. There are some values that shouldn’t be changed, as they are pretty much standard for this task. That is the case for batch_size, gamma, optimizer_epsilon, adam_learning_rate, target_network_update, memory_size, policy_network_update, policy_saving_frequency, num_eval_episode and frame_stack_size.

Yes, you can try to change those values, but I wouldn’t recommend it until you have a full understanding of the algorithm and how each of those parameters is used.

The number of episodes is the most significant parameter at the beginning, as it tells the trainer how many episodes to play. To test this algorithm, it is advisable to set the number of episodes to at least 1,000,000, which has been observed to produce human-like scores of around 35 points. That’s from my experience with this algorithm, with no annealing and with the parameters posted above.

If you set both initial_epsilon and final_epsilon to 0.05, you essentially remove the annealing process from this algorithm. The benefits of annealing can be discussed in a future post, but for now, not using annealing is still effective.

The following chart presents the average reward obtained by an agent with the settings above:

Average reward obtained by agent, as a function of the number of played episodes

To train an agent that achieves a score greater than 350 points in Breakout, you need to train the agent for at least 9 million episodes using the given parameters. The time required to train the model depends on your hardware resources, with a good GPU being faster than a good CPU. For example, the training process that generated the chart above played 10 million episodes in over 48 hours. However, the same algorithm would complete in 10 hours on an RTX 3060, 6GB GPU. That’s a big gap right there.

The training of an agent can be done in multiple stages, because this script has a mechanism to “save” the training of a model into a file. If you want to use this mechanism, you can train a model with a certain number of episodes and save the training into a file. Then, you rename that file to and it will load the policy and avoid starting from a blank slate.

This is done by this function:

That structure makes it easier to train a model in multiple stages. It should be noted that previous_experience provides context to the algorithm, so it will not go through the exploration stage every time experience is added to the policy file.

This script also features the optimize_model function:

This is where the magic takes place: the actual learning. The train flag is used to determine whether to optimize the model or not. If train is False, the function returns without optimizing the model. That is the exploration stage described in the previous section, which depends on the value of the random_exploration_interval hyperparameter.

When train flag is True, a batch of experiences is sampled from memory, and the Q-values for the current state and action are computed using the policy network. The maximum Q-value for the next state is computed using the target network, and the expected Q-value for the current state and action is computed using the Bellman equation.

The loss between the Q-values predicted by the policy network and the expected Q-values is computed using the smooth L1 loss function. The gradients of the optimizer are zeroed out, and the loss is backpropagated through the model. The gradients are then clipped to be between -1 and 1, and the parameters of the model are updated using the optimizer.

Summing it up, the optimize_model function optimizes the PyTorch neural network model by computing and updating the Q-values, expected Q-values, and loss using a batch of experiences stored in memory.
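
The arithmetic of one such update can be followed with a small, framework-free example. It uses NumPy instead of PyTorch and made-up numbers; the function names, the discount factor of 0.99, and the example gradients are assumptions for illustration, not values from the script.

```python
import numpy as np

GAMMA = 0.99  # assumed discount factor

def td_targets(rewards, next_q_values, dones, gamma=GAMMA):
    """Bellman target: r + gamma * max Q_target(s', a'), with no future term at episode end."""
    max_next_q = next_q_values.max(axis=1)
    return rewards + gamma * max_next_q * (1.0 - dones)

def smooth_l1(predicted, target):
    """Quadratic for errors below 1, linear above, averaged over the batch."""
    diff = np.abs(predicted - target)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).mean()

rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])                # the second transition ended its episode
next_q = np.array([[0.5, 2.0, 1.0, 0.0],    # target-network output for each next state
                   [3.0, 1.0, 0.0, 0.0]])
predicted_q = np.array([2.5, 2.0])          # policy-network Q(s, a) for the actions taken

targets = td_targets(rewards, next_q, dones)
loss = smooth_l1(predicted_q, targets)
clipped_grads = np.clip(np.array([-3.2, 0.4, 1.7]), -1.0, 1.0)
print(targets, loss, clipped_grads)
```

The first target is 1 + 0.99 × 2.0 = 2.98, while the second is just the reward (0) because its episode has ended. The last line shows what clipping gradients to the range -1 to 1 does to a few example values.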

The other important function of this code is evaluate:

This code is for evaluating how well an artificial intelligence agent can perform in a video game environment. The agent uses a neural network to make decisions, and this code helps us understand how good those decisions are.

The code takes the agent’s neural network and runs it on the game environment, which is set up to make it easy for the agent to learn. The code also tracks how well the agent is doing by recording the total points it scores in each episode.

After running the agent on the game environment for a set number of episodes, the code calculates the average score across all the episodes and writes it to a text file along with other information, such as the current step and the current level of exploration. This information helps us see how the agent’s performance is improving over time.
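
The averaging logic can be sketched without Gym or PyTorch. Everything here is a stand-in: ToyEnv is a made-up environment with a simplified step() return value, evaluate_policy is a hypothetical name, and the step and epsilon values in the log line are invented.

```python
class ToyEnv:
    """Stand-in environment: episodes last 10 steps and action 1 earns one point."""

    def reset(self):
        self.steps = 0
        return 0  # dummy state

    def step(self, action):
        self.steps += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.steps >= 10
        return 0, reward, done

def evaluate_policy(policy, env, num_episodes):
    """Play num_episodes episodes and return the average total reward."""
    episode_totals = []
    for _ in range(num_episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            state, reward, done = env.step(policy(state))
            total += reward
        episode_totals.append(total)
    return sum(episode_totals) / len(episode_totals)

average = evaluate_policy(lambda state: 1, ToyEnv(), num_episodes=20)
step, epsilon = 50_000, 0.05   # hypothetical values for the log line
log_line = f"{step}\t{average:.2f}\t{epsilon:.2f}"
print(log_line)
```

Appending one such line per evaluation produces a file that can later be plotted as average reward over training time.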

Finally, the main function makes all the components work together:

This is the point where all components of the algorithm come together and training takes place.

Here is a high-level description of the main function:

  • The main() function initializes the game environment, sets up the neural networks, optimizer, and other parameters for training the DQN, and begins the training loop.
  • The game environment is first created using the make_atari() function from the Atari Wrappers script, which returns a raw Atari environment without frame skipping.
  • The environment is then wrapped in a DeepMind wrapper using the wrap_deepmind() function to preprocess the observations and actions, clip rewards, and stack frames to create the input to the neural networks.
  • The neural networks are defined using the DQN() class, which creates a deep convolutional neural network with a specified number of output nodes corresponding to the number of possible actions in the game.
  • The load_pretrained_model() function loads a pre-trained policy network if one is available and synchronizes the target network with the policy network.
  • The optimizer is defined using the Adam algorithm with a specified learning rate and epsilon.
  • The replay memory buffer is initialized using the ReplayMemory() class, which creates a buffer of fixed size to store previous experiences of the agent in the game.
  • The action selector is defined using the ActionSelector() class, which selects actions according to an epsilon-greedy policy.
  • The training loop consists of a series of episodes, where each episode consists of a sequence of time steps.
    • At each time step, the current state of the game is represented using a frame stack of the previous observations, and the action selector selects an action to take based on the current state and exploration policy.
    • The environment is stepped forward using the selected action, and the resulting observation, reward, and done flag are stored in the replay memory buffer.
    • Periodically, the performance of the policy network is evaluated using the evaluate() function, the policy network is optimized using the optimize_model() function, the target network is synchronized with the policy network, and the policy network weights are saved to disk.
    • The training loop terminates after a fixed number of episodes, and the final policy network weights are saved to disk.
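
The "periodically" part of the loop usually comes down to checking a step counter against a few intervals. The skeleton below only counts how often each event would fire. The three interval values are assumptions, as is the choice to count in environment steps; only the hyperparameter names come from the post.

```python
# All three intervals are assumed values, counted in environment steps.
POLICY_NETWORK_UPDATE = 4
TARGET_NETWORK_UPDATE = 10_000
POLICY_SAVING_FREQUENCY = 50_000

counts = {"optimize": 0, "sync_target": 0, "save": 0}
for step in range(1, 100_001):
    # (act in the environment and store the transition in replay memory here)
    if step % POLICY_NETWORK_UPDATE == 0:
        counts["optimize"] += 1      # one gradient update on a sampled batch
    if step % TARGET_NETWORK_UPDATE == 0:
        counts["sync_target"] += 1   # copy policy weights into the target network
    if step % POLICY_SAVING_FREQUENCY == 0:
        counts["save"] += 1          # write the policy weights to disk

print(counts)
```

With these intervals the policy network is optimized far more often than the target network is synchronized, which is what keeps the training targets stable between synchronizations.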

By running this main function, the training will start and the agent will learn from experience. Once it finishes, a policy file will be generated, which can be used for playing the game automatically with the Renderer script.


