Reinforcement Learning in Atari Breakout with Python

Reinforcement Learning is a branch of Artificial Intelligence concerned with developing algorithms and models capable of learning from feedback and optimizing their actions to achieve a specific goal. A popular application of reinforcement learning is training an artificial intelligence (AI) agent to play video games.

In this post we will see how to use Reinforcement Learning to train an AI agent to play the classic Atari game Breakout. We will use an algorithm built around a Deep Q-Network (DQN), a popular technique for training AI agents that play video games.

This will be a step-by-step tutorial on how to implement the DQN algorithm in Python, using PyTorch and the OpenAI Gym environment to train an AI agent to play Atari Breakout.

You can see the result of training an agent with 20 million episodes of experience in the following video:

By the end of this post we will have explained the main components of this algorithm, along with details on how it works and the scripts used to achieve these results. Without further ado, let's get started.

What is Reinforcement Learning?

Reinforcement Learning is a subfield of artificial intelligence (AI) focused on developing algorithms and models that allow agents to learn from feedback, optimizing their actions to achieve a particular goal.

In Reinforcement Learning, an agent interacts with an environment and receives rewards or penalties for its actions. The agent then uses this feedback to update its decision-making process and improve its future actions in order to maximize the cumulative reward over time.

RL has been applied successfully in many fields, such as robotics, games, recommender systems and more. The main advantage of reinforcement learning is its ability to learn by trial and error, without requiring a predefined set of rules or examples.

I do not intend to explore this topic further here, since I plan to write a dedicated post explaining the concept of Reinforcement Learning in detail.

In this post I will use RL algorithms to train an AI agent to play Atari Breakout. To achieve this, we will create an environment where the agent can play the game, make decisions and learn from the outcomes of those decisions. After several hours of training, the agent should be able to play the game at a high level on its own.

Before we start with the code

The code with the algorithm described in this post can be downloaded directly from our GitHub repository.

If you want to replicate the results shown in the video at the top, you can do so by downloading the repository and running the renderer.py file. This file uses the policy we have shared in the repository, which carries 20 million episodes of experience.

To run this code you need to install PyTorch, something we already explained in this post. It is not necessary to install OpenAI Gym, since the library is included in the repository under the gym directory. We decided to do it this way because the algorithm was giving us problems with the latest version of Gym, so we bundled a modified version in which the errors produced when running the algorithm with version 0.26.2 have been fixed.

You will also need to install NumPy and a few other dependencies, but that should not be a problem for anyone with a minimum of Python knowledge.

Composition of the algorithm

The following image shows a visual representation of the scripts that make up this project:

Graphical representation of the main components of the Reinforcement Learning algorithm

Each component is built as a separate Python script, all of which can be found in our GitHub repository. Here is a high-level description of each component:

  • Deep Q-Network (DQN): a Python script that defines the neural network architecture, the epsilon-greedy action-selection strategy, the replay memory buffer and a function that converts the input image frames (screenshots) into PyTorch tensors for the Deep Q-Network (DQN) algorithm.
  • Agent trainer: a Python script that trains a neural network to learn how to play the game using Reinforcement Learning and saves the trained policy network to a file.
  • Atari Wrappers: a Python script containing a set of functions (wrappers) for interacting with the Atari game. This script makes it possible to execute actions in the game, such as moving the paddle left or right. It also exposes the game score and the game state (a screenshot).

The project also includes the rendering script (the previously mentioned renderer.py), which loads a trained policy file and uses it to play the game. This lets us watch the agent's progress visually and evaluate its performance in real time.

Below I will give a detailed explanation of each script, which will help you better understand the inner workings of this algorithm.

Deep Q-Network (DQN)

This part of the algorithm is implemented in the dqn.py script. It defines the neural network architecture for the Deep Q-Network (DQN) algorithm, which is used to train an agent to play a game from pixel input. This was done using PyTorch.

The neural network architecture consists of convolutional layers followed by fully connected layers. The agent learns to play the game through reinforcement learning, interacting with the environment and updating the weights of the neural network accordingly.

Here is a graphical representation of the neural network:

Graphical representation of the Deep Q-Learning neural network

This neural network architecture takes 4 consecutive frames (screenshots) as input. Each frame is represented as a 3D matrix of pixels, which can be obtained through the Atari Wrappers. The frames are then converted to grayscale, effectively reducing the 3 color channels to a single one, which results in a 2D matrix representation.

We have represented this process in the following image:

Reducing the dimensionality of the input is advantageous for training the neural network, since it reduces the amount of information that has to be processed. In this game, color is not crucial for recognizing the meaningful features that help the agent learn from the environment.

In the script, this dimensionality reduction is carried out by the following function:
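
The function itself lives in dqn.py in the repository; as a rough idea of what such a helper looks like, here is a minimal sketch that assumes OpenCV for the image operations (not the repository's exact code):

```python
import cv2
import numpy as np
import torch

def frame_to_tensor(frame: np.ndarray, size: int = 84) -> torch.Tensor:
    """Convert an RGB game frame to a grayscale, square PyTorch tensor.
    Illustrative sketch only; the helper in dqn.py may differ."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)    # 3 color channels -> 1
    square = cv2.resize(gray, (size, size),           # square shape (height x height)
                        interpolation=cv2.INTER_AREA)
    return torch.from_numpy(square)                   # 2D uint8 tensor
```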

These lines of code reduce the image from 3 channels to 1 and resize it into a square shape (height by height). This compresses the image and reduces its size, which leads to a small loss of information; in exchange, the transformation simplifies the processing of the image.

That said, each frame will be represented as a matrix like this one:

Frame matrix

By using 4 consecutive game frames as input, we provide better information about what is happening in the game. A single frame does not show enough context, such as whether the ball is moving up or down, or how fast it is moving.

These 4 frames are packed into a new matrix, the Frame Stack Matrix:

Frame stack matrix

The Frame Stack Matrix may look like a 2D matrix, but it is actually a 3D matrix. Each index of the matrix contains a frame, which is itself a 2D matrix.

Finally, we combine 32 Frame Stack Matrices into a «batch», which is a 4D matrix. This 4D matrix, also known as the input tensor, is what gets passed as input to the neural network.

Input tensor shape
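
As a quick illustration of that shape in code (assuming the common 84×84 frame size; the repository may use a different resolution):

```python
import torch

# 32 frame stacks, each holding 4 grayscale frames of 84x84 pixels
input_tensor = torch.zeros((32, 4, 84, 84))   # (batch, frames, height, width)
print(input_tensor.shape)                     # torch.Size([32, 4, 84, 84])
```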

When training the neural network, each element of the input tensor is passed through the network, as shown in the graphical representation of the Deep Q-Learning neural network (see above).

The structure of the neural network is defined in the constructor of the DQN class:
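
The exact constructor is in dqn.py; the sketch below follows the classic DeepMind layout, and its layer sizes are assumptions rather than necessarily those used in the repository:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions: int = 4):
        super().__init__()
        # Convolutional feature extractor: input is a stack of 4 grayscale frames
        self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        # Fully connected head: maps the extracted features to one Q-value per action
        self.fc1 = nn.Linear(64 * 7 * 7, 512)   # 7x7 feature maps for 84x84 input
        self.fc2 = nn.Linear(512, n_actions)
```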

This PyTorch Neural Network code is straightforward and easy to understand, with a clear connection between the code and its graphical representation.

When training this network, the forward function is used:
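
Again, the real code is in dqn.py; continuing the hypothetical sketch above, the forward pass would look roughly like this:

```python
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, 4, 84, 84), with pixel values scaled to [0, 1]
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.conv3(x))
        x = x.view(x.size(0), -1)       # flatten the feature maps
        x = torch.relu(self.fc1(x))
        return self.fc2(x)              # one Q-value per possible action
```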

This function takes the input tensor «x» and calculates the Q-values for the four actions that the agent can choose from, based on the previous four frames of the game.

The 4 possible actions are:

    1. Move the paddle to the left
    2. Move the paddle to the right
    3. Do nothing (i.e., stay in the same position)
    4. Fire the ball (not a commonly used action)

In this context Q-values represent the expected future reward an agent will receive if it takes a certain action in a given state. In the case of the Atari Breakout game, the Q-values calculated by this function represent the expected future reward for each of the four possible actions based on the current state of the game, as observed through the previous four frames. The agent then uses these Q-values to determine which action to take in order to maximize its cumulative reward over time.

At the beginning of the training, the Agent has no experience in the environment and cannot predict the future reward of each action. Therefore, the neural network is unable to produce appropriate Q-values. However, after playing the game multiple times, the neural network will learn to predict the future reward of taking a certain action based on the state of the game (input tensor).

This script contains a function to initialize the weights of the neural network at the start of training:
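
The repository's function is not reproduced here; a minimal sketch of such an initializer (the init scheme itself is an assumption) could be:

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    """Reset convolutional and linear layers so training starts from scratch."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_uniform_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Usage: policy_net.apply(init_weights)
```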

This creates a blank slate with no prior experience, allowing the agent to learn from its environment by gaining experience while playing the game.

The DQN algorithm uses a combination of exploitation and exploration to learn from the environment. During the early stages of training, the agent needs to explore the environment to learn about the available actions and their effects. To achieve this, the algorithm uses a variable called random_exploration_interval, which determines the number of episodes that the agent should explore before switching to exploitation.

If the number of episodes played is less than the random_exploration_interval, the agent will select actions randomly to explore the environment. This encourages the agent to take risks and try new actions, even if they have not been tried before. In this stage the Q-values are computed but not considered for the action selection.

Once the agent has played enough episodes to surpass the random_exploration_interval, it will switch to exploitation and select actions based on the Q-values learned by the network. However, to ensure that the agent continues to explore the environment and avoid getting stuck in local optima, the algorithm uses a technique called Epsilon Decay or Annealing. This gradually decreases the probability of selecting a random action over time, so that the agent becomes more and more likely to select actions based on the Q-values as it gains experience.

All this process has been coded in the ActionSelector class:
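
The class lives in dqn.py; the sketch below shows one way such a selector could be implemented. The parameter names follow the post, while the extra constructor arguments (network, device and number of actions) are assumptions:

```python
import math
import random
import torch

class ActionSelector:
    """Epsilon-greedy action selection with exponential annealing (illustrative)."""
    def __init__(self, initial_epsilon, final_epsilon, epsilon_decay,
                 random_exp, n_actions, policy_net, device):
        self._eps_i = initial_epsilon
        self._eps_f = final_epsilon
        self._decay = epsilon_decay
        self._random_exp = random_exp
        self._n_actions = n_actions
        self._policy_net = policy_net
        self._device = device

    def epsilon(self, episode: int) -> float:
        if episode < self._random_exp:
            return 1.0                      # pure exploration stage
        # Exponential annealing from initial_epsilon down to final_epsilon
        return self._eps_f + (self._eps_i - self._eps_f) * math.exp(
            -(episode - self._random_exp) / self._decay)

    def select_action(self, state: torch.Tensor, episode: int) -> int:
        with torch.no_grad():
            # Q-values are computed even when the chosen action ends up random
            q_values = self._policy_net(state.to(self._device))
        if random.random() < self.epsilon(episode):
            return random.randrange(self._n_actions)    # explore
        return int(q_values.argmax(dim=1).item())       # exploit
```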

The constructor takes the following parameters as inputs:

  • initial_epsilon
  • final_epsilon 
  • epsilon_decay
  • random_exp

These parameters are used to model the Annealing function, which exhibits behavior consistent with the response of a first-order system, similar to a discharging RL or RC circuit. The discharge of a first order system can be modeled like this:
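
The formula is shown as an image in the original post; written out, the discharge of a first-order system has the familiar form f(t) = A · e^(−t/τ), where A is the initial amplitude.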

Where e is Euler's number, t is time and τ is the time constant. This function produces a time response like this:

First order system time response

First-order systems exhibit an interesting property: they effectively decay to zero after a time period equal to 5τ has elapsed. For example, if τ = 1 (as in the image above), after t = 5 the value of the function will have decayed to about 0.67% of its initial value (e⁻⁵ ≈ 0.0067), which can be considered effectively extinguished.

For the Reinforcement Learning algorithm, time is not important, so we focus on the episode number instead. initial_epsilon will be the initial amplitude of the exponential function; final_epsilon is its final value; epsilon_decay is equivalent to τ; and random_exp acts as an episode delay for the exponential function. We can model this as follows:

Value of Epsilon as function of episodes
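
The formula itself appears as an image; a reconstruction consistent with the variable definitions listed below is:

ε(E) = 1, for E < Rexp
ε(E) = εf + (εi − εf) · e^(−(E − Rexp) / εdecay), for E ≥ Rexp

where the step function u(E) is what switches between the two stages at E = Rexp.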

In this function:

  • E is the number of the current episode being played
  • ε is the value of the Epsilon function
  • Rexp is the number of episodes that will be used for the exploration stage of the algorithm
  • εi is the desired initial value of the Epsilon function at the start of Annealing stage
  • εf is the desired final value of the Epsilon function at the end of Annealing stage
  • εdecay is used to control the speed of the Annealing
  • u(E) is a step function used to represent the two stages of the function

The following chart shows a graphical representation of the Epsilon function:

Epsilon function for action selection

During the initial stage of training, the agent selects actions randomly with a 100% probability. Once the agent has played more than random_exp episodes, the annealing process begins, and the agent starts selecting actions based on its past experience instead of choosing actions randomly.

As the number of episodes played surpasses 5 times the value of epsilon_decay, the algorithm enters its final stage, in which most actions are selected based on the output of the neural network, with a small percentage of actions still selected randomly. This encourages continued exploration of the environment during later stages of training, while also ensuring that the agent mostly selects actions based on what it has learned from experience.

During the development of this algorithm, I have been using a value of 0.05 for the final value of epsilon (εf). This means that after the annealing process is complete, actions will be selected with a 5% probability based on random exploration, and with a 95% probability based on the Q-values learned by the network.

The dqn.py script also features a ReplayMemory Class, with the following code:
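
The full class is in dqn.py; the following sketch is an illustration of the structure described below, not the repository's exact code:

```python
import torch

class ReplayMemory:
    def __init__(self, capacity, state_shape, device):
        c, h, w = state_shape                       # e.g. (4, 84, 84)
        self.capacity, self.device = capacity, device
        # Pre-allocated tensors for the (state, action, reward, done, next_state) tuples
        self.states = torch.zeros((capacity, c, h, w), dtype=torch.uint8)
        self.next_states = torch.zeros((capacity, c, h, w), dtype=torch.uint8)
        self.actions = torch.zeros((capacity, 1), dtype=torch.long)
        self.rewards = torch.zeros((capacity, 1), dtype=torch.float32)
        self.dones = torch.zeros((capacity, 1), dtype=torch.bool)
        self.position, self.size = 0, 0

    def push(self, state, action, reward, done, next_state):
        i = self.position
        self.states[i] = state
        self.next_states[i] = next_state
        self.actions[i, 0] = action
        self.rewards[i, 0] = reward
        self.dones[i, 0] = done
        self.position = (self.position + 1) % self.capacity    # circular buffer
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        idx = torch.randint(0, self.size, (batch_size,))
        to_dev = lambda t: t[idx].to(self.device)
        return (to_dev(self.states).float() / 255.0, to_dev(self.actions),
                to_dev(self.rewards), to_dev(self.dones),
                to_dev(self.next_states).float() / 255.0)

    def __len__(self):
        return self.size
```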

This class represents a data structure for storing experience tuples in the form (state, action, reward, done, next_state) used in reinforcement learning algorithms.

When initialized, the class takes three parameters:

  • capacity: determines the maximum number of tuples that the memory can store
  • state_shape: describes the shape of the state tensor
  • device: indicates whether to use a GPU or CPU for computations.

The __init__ method initializes the memory buffer with empty tensors to store the tuples. The push method adds a new tuple to the memory buffer at the current position, and updates the position and size of the buffer accordingly.

The sample method returns a batch of experience tuples randomly sampled from the memory buffer. It retrieves the current state, next state, action, reward, and done values for each tuple in the batch, and returns them as tensors.

The __len__ method returns the current size of the memory buffer.

Replay Memory is a critical component of reinforcement learning algorithms, especially deep Q-learning networks. The purpose of the replay memory is to store past experiences (state, action, reward, next state) and randomly sample a batch of them to train the neural network. This helps the network to learn from a diverse set of experiences, preventing it from overfitting to a specific set of experiences, leading to better generalization to unseen situations.

Replay memory also allows for the breaking of the sequential correlation between experiences, which can cause issues when training the network. By randomly sampling experiences from the memory buffer, the network is exposed to a more diverse set of experiences, and it learns to generalize better.

All these components are packed in the dqn.py script, which is used by agent_trainer.py to train the Reinforcement Learning algorithm to play this game. Let’s move on to the other scripts.

Atari Wrappers

This code is a Python script that contains a set of wrappers for modifying the behavior of Atari environments in OpenAI Gym. These wrappers preprocess raw game screen frames and provide additional features for training reinforcement learning models.

I did not author this code. It is largely based on an OpenAI baseline that can be accessed here, with only minor modifications.

Here’s a brief description of each wrapper and helper function in the code, from my understanding:

  • NoopResetEnv: This wrapper adds a random number of «no-op» (no-operation) actions to the start of each episode to introduce some randomness and make the agent explore more.
  • FireResetEnv: This wrapper automatically presses the «FIRE» button at the start of each episode, which is required for some Atari games to start.
  • EpisodicLifeEnv: This wrapper resets the environment whenever the agent loses a life, rather than when the game is over, to make the agent learn to survive for longer periods.
  • MaxAndSkipEnv: This wrapper skips a fixed number of frames (usually 4) and returns the maximum pixel value from the skipped frames, to reduce the impact of visual artifacts and make the agent learn to track moving objects.
  • ClipRewardEnv: This wrapper clips the reward signal to be either -1, 0, or 1, to make the agent focus on the long-term goal of winning the game rather than short-term rewards.
  • WarpFrame: This wrapper resizes and converts the game screen frames to grayscale to reduce the input size and make it easier for the agent to learn.
  • FrameStack: This wrapper stacks a fixed number of frames (usually 4) together to give the agent some temporal information and make it easier for the agent to learn the dynamics of the game.
  • ScaledFloatFrame: This wrapper scales the pixel values to be between 0 and 1 to make the input data more compatible with deep learning models.
  • make_atari: This function creates an Atari environment with various settings suitable for deep reinforcement learning research, including the use of the NoFrameskip wrapper and a maximum number of steps per episode.
  • wrap_deepmind: This function applies a combination of the defined wrappers to the given env object, including EpisodicLifeEnv, FireResetEnv, WarpFrame, ClipRewardEnv, and FrameStack. The scale argument can be used to include the ScaledFloatFrame wrapper as well.

From this code only make_atari and wrap_deepmind are used in the agent_trainer.py script, which we are going to describe in the next section.
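
As a quick illustration of how these two helpers are typically combined (assuming the script is saved as atari_wrappers.py and using the Breakout environment id without frame skipping; the exact call in agent_trainer.py may differ):

```python
from atari_wrappers import make_atari, wrap_deepmind

# Raw Breakout environment with no built-in frame skipping
env = make_atari("BreakoutNoFrameskip-v4")

# Apply the DeepMind-style preprocessing described above
env = wrap_deepmind(env,
                    episode_life=True,    # reset on every lost life
                    clip_rewards=True,    # rewards clipped to -1, 0 or 1
                    frame_stack=True,     # stack 4 consecutive frames
                    scale=False)          # keep raw uint8 pixel values
```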

Agent Trainer

The agent_trainer.py script contains the main training loop for a reinforcement learning agent using the Deep Q-Network (DQN) algorithm. The script initializes the DQN model, sets hyperparameters and creates a replay memory buffer for storing experience tuples.

The script then runs the main training loop for a specified number of episodes, during which the agent interacts with the environment, samples experience tuples from the replay memory buffer, and updates the parameters of the DQN model. The script also periodically evaluates the performance of the agent on a set of evaluation episodes and saves the current DQN model and replay memory buffer to disk for later use.

Now I will go into the details of this code. First, I want to describe its hyperparameters:
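
The actual values are defined at the top of agent_trainer.py; the listing below is only an illustration that uses common DQN defaults for the parameter names mentioned in this section, not necessarily the numbers used in the repository:

```python
# Illustrative values only (common DQN defaults); the numbers in agent_trainer.py may differ.
num_episodes = 1_000_000               # how many episodes the trainer will play
batch_size = 32                        # experiences sampled per optimization step
gamma = 0.99                           # discount factor for future rewards
adam_learning_rate = 1e-4              # Adam step size
optimizer_epsilon = 1.5e-4             # Adam numerical-stability term
initial_epsilon = 1.0                  # epsilon at the start of annealing
final_epsilon = 0.05                   # epsilon after annealing (see previous section)
epsilon_decay = 100_000                # "time constant" of the annealing curve
random_exploration_interval = 50_000   # pure-exploration episodes before annealing
memory_size = 100_000                  # replay buffer capacity
target_network_update = 10_000         # steps between target-network syncs
policy_network_update = 4              # steps between optimization calls
policy_saving_frequency = 100_000      # steps between checkpoints saved to disk
num_eval_episode = 10                  # episodes played per evaluation run
frame_stack_size = 4                   # consecutive frames stacked per state
```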

By changing these values you can greatly affect the performance of the training algorithm. There are some values that shouldn’t be changed, as they are pretty much standard for this task. That is the case of batch_size, gamma, optimizer_epsilon, adam_learning_rate, target_network_update, memory_size, policy_network_update, policy_saving_frequency, num_eval_episode and frame_stack_size.

Yes, you can try changing those values, but I wouldn’t recommend it until you have a full understanding of the algorithm and how each of those parameters is used.

The number of episodes is the most significant parameter at the beginning, as it tells the trainer how many episodes to play. To test this algorithm, it is advisable to set the number of episodes to at least 1,000,000, which has been observed to produce human-like scores of around 35 points. That’s from my experience with this algorithm, with no annealing and with the parameters posted above.

If you set both initial_epsilon and final_epsilon to 0.05, you essentially remove the annealing process from this algorithm. The benefits of annealing can be discussed in a future post, but for now, not using annealing is still effective.

The following chart presents the average reward obtained by an agent with the settings above:

Average reward obtained by agent, as a function of the number of player episodes

To train an agent that achieves a score greater than 350 points in Breakout, you need to train the agent for at least 9 million episodes using the given parameters. The time required to train the model depends on your hardware resources, with a good GPU being faster than a good CPU. For example, the training process that generated the chart above played 10 million episodes in over 48 hours. However, the same algorithm would complete in 10 hours on an RTX 3060, 6GB GPU. That’s a big gap right there.

The training of an agent can be done in multiple stages, because this script has a mechanism to «save» the training of a model into a new_policy_network.pt file. If you want to use this mechanism, train a model for a certain number of episodes and save the training into a file. Then, rename that file to trained_policy_network.pt and the trainer will load the policy instead of starting from a blank slate.

This is done by this function:
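
The function in agent_trainer.py is not reproduced here; a minimal sketch of what such a loader does (the file names follow the post, the body is illustrative):

```python
import os
import torch

def load_pretrained_model(policy_net, target_net, path="trained_policy_network.pt"):
    """Load a previously saved policy if one exists, otherwise start from scratch."""
    if os.path.isfile(path):
        policy_net.load_state_dict(torch.load(path, map_location="cpu"))
        print(f"Loaded previous experience from {path}")
    else:
        print("No pretrained policy found, starting from a blank slate")
    # Keep the target network in sync with the (possibly pre-trained) policy network
    target_net.load_state_dict(policy_net.state_dict())
```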

That structure makes it easier to train a model in multiple stages. It should be noted that the previous experience provides context to the algorithm, so it will not go through the exploration stage every time new experience is added to the policy file.

This script also features the optimize function:
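
The real implementation is in agent_trainer.py; the sketch below mirrors the steps described in the next paragraphs (sampling a batch, Bellman target, smooth L1 loss, gradient clipping) and is illustrative rather than the repository's exact code:

```python
import torch
import torch.nn.functional as F

def optimize_model(policy_net, target_net, optimizer, memory,
                   batch_size, gamma, train):
    if not train:
        return                          # still in the exploration stage

    state, action, reward, done, next_state = memory.sample(batch_size)

    # Q(s, a) predicted by the policy network for the actions actually taken
    q_values = policy_net(state).gather(1, action)

    # max_a' Q_target(s', a') from the target network (no gradients needed)
    with torch.no_grad():
        next_q = target_net(next_state).max(dim=1, keepdim=True)[0]
        expected_q = reward + gamma * next_q * (~done).float()   # Bellman target

    loss = F.smooth_l1_loss(q_values, expected_q)

    optimizer.zero_grad()
    loss.backward()
    for param in policy_net.parameters():        # clip gradients to [-1, 1]
        param.grad.data.clamp_(-1, 1)
    optimizer.step()
```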

This is where the magic happens: the actual learning. The train flag is used to determine whether to optimize the model or not. If train is False, the function returns without optimizing the model; that is the exploration stage described in the previous section, which depends on the value of the random_exploration_interval hyperparameter.

When train flag is True, a batch of experiences is sampled from memory, and the Q-values for the current state and action are computed using the policy network. The maximum Q-value for the next state is computed using the target network, and the expected Q-value for the current state and action is computed using the Bellman equation.

The loss between the Q-values predicted by the policy network and the expected Q-values is computed using the smooth L1 loss function. The gradients of the optimizer are zeroed out, and the loss is backpropagated through the model. The gradients are then clipped to be between -1 and 1, and the parameters of the model are updated using the optimizer.

Summing it up, the optimize_model function optimizes the PyTorch neural network model by computing and updating the Q-values, expected Q-values, and loss using a batch of experiences stored in memory.

The other important function of this code is evaluate:
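
The code itself is in agent_trainer.py; here is a rough sketch of such an evaluation loop. It assumes the classic 4-value Gym step API, channel-last frame stacks coming from the wrappers and a simple text log format, all of which may differ from the repository:

```python
import numpy as np
import torch

def evaluate(policy_net, eval_env, device, num_eval_episode, step, epsilon,
             log_path="rewards.txt"):
    total_reward = 0.0
    for _ in range(num_eval_episode):
        obs, done, episode_reward = eval_env.reset(), False, 0.0
        while not done:
            # Frame stack (H, W, 4) -> float tensor of shape (1, 4, H, W)
            state = torch.as_tensor(np.array(obs), dtype=torch.float32,
                                    device=device).permute(2, 0, 1).unsqueeze(0) / 255.0
            with torch.no_grad():
                action = int(policy_net(state).argmax(dim=1).item())
            obs, reward, done, info = eval_env.step(action)
            episode_reward += reward
        total_reward += episode_reward

    avg_reward = total_reward / num_eval_episode
    # Append one line per evaluation so progress can be plotted later
    with open(log_path, "a") as f:
        f.write(f"step={step} avg_reward={avg_reward:.2f} epsilon={epsilon:.3f}\n")
```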

This code is for evaluating how well an artificial intelligence agent can perform in a video game environment. The agent uses a neural network to make decisions, and this code helps us understand how good those decisions are.

The code takes the agent’s neural network and runs it on the game environment, which is set up to make it easy for the agent to learn. It also records the total points the agent scores in each episode.

After running the agent on the game environment for a set number of episodes, the code calculates the average score across all episodes and appends it to a text file along with other information, such as the current step number and the current level of exploration (epsilon). This log makes it easy to track how the agent’s performance improves over time.

Finally, the main function ties all the components together:

This is the point where all components of the algorithm come together and training takes place.

Here is a high-level description of the main function:

  • The main() function initializes the game environment, sets up the neural networks, optimizer, and other parameters for training the DQN, and begins the training loop.
  • The game environment is first created using the make_atari() function from the Atari Wrappers script, which returns a raw Atari environment without frame skipping.
  • The environment is then wrapped in a DeepMind wrapper using the wrap_deepmind() function to preprocess the observations and actions, clip rewards, and stack frames to create the input to the neural networks.
  • The neural networks are defined using the DQN() class, which creates a deep convolutional neural network with a specified number of output nodes corresponding to the number of possible actions in the game.
  • The load_pretrained_model() function loads a pre-trained policy network if one is available and synchronizes the target network with the policy network.
  • The optimizer is defined using the Adam algorithm with a specified learning rate and epsilon.
  • The replay memory buffer is initialized using the ReplayMemory() class, which creates a buffer of fixed size to store previous experiences of the agent in the game.
  • The action selector is defined using the ActionSelector() class, which selects actions according to an epsilon-greedy policy.
  • The training loop consists of a series of episodes, where each episode consists of a sequence of time steps.
    • At each time step, the current state of the game is represented using a frame stack of the previous observations, and the action selector selects an action to take based on the current state and exploration policy.
    • The environment is stepped forward using the selected action, and the resulting observation, reward, and done flag are stored in the replay memory buffer.
    • Periodically, the performance of the policy network is evaluated using the evaluate() function, the policy network is optimized using the optimize_model() function, the target network is synchronized with the policy network, and the policy network weights are saved to disk.
    • The training loop terminates after a fixed number of episodes, and the final policy network weights are saved to disk.

By running this main function, training starts and the agent learns from experience. Once it finishes, a policy file is generated, which can be used to play the game automatically with the renderer script.

 

 
