Meta Reinforcement Learning for steering tasks
Use case: AWAKE beamline at CERN
Implementation example for the RL4AA'24 workshop
Simon Hirlaender, Jan Kaiser, Chenran Xu, Andrea Santamaria Garcia
Today!
In this tutorial notebook we will implement all the basic components of a Meta Reinforcement Learning (Meta RL) algorithm to solve a steering task in a linear accelerator.
- Getting started
- Part I: Quick introduction
- Part II: Running PPO on our problem
- Part III: Running MAML on our problem
Getting started
You will need Python 3.9 or higher to run this code ❗
You will require about 1 GB of free disk space ❗
Start by cloning the tutorial repository locally:
git clone https://github.com/RL4AA/rl4aa24-tutorial.git
Getting started
Using Conda
If you don't have conda installed already, you can install Miniconda as described here.
conda env create -f environment.yml
This should create an environment named rl-tutorial
and install the necessary packages inside.
Afterwards, activate the environment using
conda activate rl-tutorial
Getting started
Using venv
If you don't have conda installed, you can alternatively create a virtual environment with
python3 -m venv rl-tutorial
and activate it with $ source <venv>/bin/activate (bash) or C:> <venv>/Scripts/activate.bat (Windows).
Then, install the packages with pip within the activated environment:
python -m pip install -r requirements.txt
Afterwards, you should be able to run the provided scripts.
Part I: Quick introduction
AWAKE Accelerator
AWAKE (The Advanced Proton Driven Plasma Wakefield Acceleration Experiment) is an accelerator R&D project based at CERN. It investigates the use of plasma wakefields driven by a proton bunch to accelerate charged particles.
Plasmas can support extremely strong electric fields, with accelerating gradients of GV/m over meter-scale distances, which can reduce the size of future accelerators.
- Momentum: 10-20 MeV/c
- Electrons per bunch: 1.2e9
- Bunch length: 4 ps
- Pulse repetition rate: 10 Hz
Reference
"Acceleration of electrons in the plasma wakefield of a proton bunch" - Nature volume 561(2018)The accelerator problem we want to solve
The goal is to minimize the distance $\Delta x_i$ of an initial beam trajectory to a target trajectory at different points $i$ across the accelerator (here marked as "position") in as few steps as possible.
Reference
"Ultra fast reinforcement learning demonstrated at CERN AWAKE" - IPAC (2023)Formulating the RL problem
The problem is formulated in an episodic manner.
Actions
The actuators are the strengths of 10 corrector magnets that can steer the beam. They are normalized to [-1, 1]. In this tutorial, we apply the action by adding a delta change $\Delta a$ to the current magnet strengths.
Formulating the RL problem
States/Observations
The observations are the readings of ten beam position monitors (BPMs), which measure the position of the beam at particular points along the beamline. The states are also normalized to [-1, 1], corresponding to $\pm$ 100 mm in the real accelerator.
Formulating the RL problem
Reward
The reward is the negative RMS value of the distance to the target trajectory:
$$ r(x) = - \sqrt{ \frac{1}{10} \sum_{i=1}^{10} \Delta x_{i}^2} \,, \ \ \ \Delta x_{i} = x_{i} - x^{\text{target}}_{i} $$
where $x^{\text{target}}=\vec{0}$ for a centered trajectory.
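As a minimal sketch of this formula (assuming the BPM readings come as a normalized NumPy array; the function name is just for illustration):

```python
import numpy as np

def trajectory_reward(bpm_readings, target=None):
    """Negative RMS distance of the BPM readings to the target trajectory (zeros by default)."""
    target = np.zeros_like(bpm_readings) if target is None else target
    delta_x = bpm_readings - target
    return -np.sqrt(np.mean(delta_x**2))
```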
Formulating the RL problem
Successful termination condition
If the RMS distance to the target trajectory falls below a threshold (10 mm in our case, 0.1 in normalized scale, i.e. a reward above -0.1), the episode ends successfully. We cannot require exactly 0 because of the finite resolution of the BPMs.
Unsuccessful termination (safety) condition
If the beam hits the wall (any state ≤ -1 or ≥ 1 in normalized scale, i.e. ±10 cm), the episode is terminated unsuccessfully. In this case, the agent receives a large negative reward (all BPM readings downstream of the impact are set to the largest value) to discourage this behavior.
Episode initialization
All episodes are initialised such that the RMS distance to the target trajectory is large. This ensures that the task is not too easy and that the beam starts relatively close to the boundaries, probing the safety settings.
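Putting the pieces above together, a minimal Gymnasium-style sketch of this episode logic could look as follows; the class name, the initialization range, and the way the response matrix is passed in are illustrative placeholders, not the tutorial's actual implementation:

```python
import numpy as np
import gymnasium as gym

class ToySteeringEnv(gym.Env):
    """Illustrative sketch of the AWAKE steering logic: 10 correctors, 10 BPMs, linear response."""

    def __init__(self, response_matrix, threshold=0.1):
        self.R = response_matrix            # 10x10 response matrix of the current task
        self.threshold = threshold          # success threshold on the RMS (0.1 normalized = 10 mm)
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(10,))
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(10,))

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # Start from corrector settings that give a large RMS distance to the target trajectory.
        self.correctors = self.np_random.uniform(-0.8, 0.8, size=10)
        self.state = self.R @ self.correctors
        return self.state, {}

    def step(self, delta_a):
        # Actions are delta changes added to the current corrector strengths.
        self.correctors = np.clip(self.correctors + delta_a, -1.0, 1.0)
        self.state = self.R @ self.correctors
        hit_wall = np.any(np.abs(self.state) >= 1.0)
        if hit_wall:
            # Penalize wall hits: downstream readings are pinned to the maximum value.
            first = np.argmax(np.abs(self.state) >= 1.0)
            self.state[first:] = 1.0
        reward = -np.sqrt(np.mean(self.state**2))   # negative RMS, target trajectory is zero
        terminated = hit_wall or (-reward < self.threshold)
        return self.state, reward, terminated, False, {}
```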
Agents
In this tutorial we will use: PPO (Proximal Policy Optimization) and MAML (Model Agnostic Meta Learning)
Formulating the RL problem
Environments/Tasks
In this tutorial we will use a variety of environments or tasks:
Fixed tasks for evaluation ❗
Randomly sampled tasks from a task distribution for meta-training ❗
We generate them from the original, nominal optics by applying a random scaling factor to the quadrupole strengths.
Formulating the RL problem
Environments/Tasks
The environment dynamics are determined by the response matrix, which in linear systems can encapsulate the dynamics of the problem.
More specifically: given the response matrix $\mathbf R$, the change in actions $\Delta a$ (corrector magnet strengths), and the change in states $\Delta s$ (BPM readings), we have:
\begin{align} \Delta s &= \mathbf{R}\,\Delta a \end{align}
Defining a benchmark policy
During this tutorial we want to compare the trained policies we obtain with different methods to a benchmark policy.
For this problem, our benchmark policy is just the inverse of the environment's response matrix.
$\implies$ Actions from benchmark policy: \begin{align} \Delta a &= \mathbf{R}^{-1}\Delta s \end{align}
$\implies$ Actions from deep RL policy: the trained policy network maps the observed state directly to an action, $\Delta a = \pi_\theta(s)$, where $s$ is the vector of current BPM readings (see the sketch below).
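As an illustration (NumPy only; the response matrix and state below are random placeholders), one benchmark step looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=(10, 10))            # placeholder response matrix (the real one comes from the optics)
state = rng.uniform(-0.5, 0.5, size=10)  # current BPM readings; the target trajectory is zero

# Benchmark policy: invert the linear response to move the beam towards the target,
# then clip so that the step stays within the allowed normalized action range [-1, 1].
delta_s_wanted = -state                  # desired change in the BPM readings
delta_a = np.clip(np.linalg.solve(R, delta_s_wanted), -1.0, 1.0)

# Deep RL policy: a trained network would instead map the state directly to the action,
# e.g. delta_a = policy_network(state)   (forward pass only during evaluation)
```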
Cheatsheet on RL training 🧐
Training stage
During the training stage, experience is gathered in a buffer that is used to update the weights of the policy through gradient descent. The samples in the buffer can be passed to the gradient descent algorithm in batches, and gradient descent is performed for a number of epochs. This is how the agent "learns".
Evaluation/validation stage
The policy is fixed (no weight updates) and only forward passes are performed.
So how do we compare policies in the evaluation stage? 🧐
- At the beginning of each episode we reset the environment to random, suboptimal corrector strengths.
- For each step within the episode we use the inverse of the response matrix (benchmark) or the trained policy to compute the next action (forward passes) until the episode ends (convergence or termination); see the sketch after this list.
- This will be performed for different evaluation tasks, just to assess how the policy performs in different lattices.
Side note:
- The benchmark policy will not immediately find the settings for the target trajectory, because the actions are limited so that the maximum step is within $[-1,1]$ in the normalized space.
- We can then compare the metrics of both policies.
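A minimal sketch of one such evaluation episode, using the standard Gymnasium loop; the policy argument stands in for either the benchmark or the trained network:

```python
def evaluate_episode(env, policy):
    """Roll out one episode with a fixed policy (forward passes only, no weight updates)."""
    obs, info = env.reset()                 # random, suboptimal corrector settings
    episode_reward, episode_length = 0.0, 0
    done = False
    while not done:
        action = policy(obs)                # benchmark: clipped R^-1 @ (-obs); RL: network forward pass
        obs, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        episode_length += 1
        done = terminated or truncated
    return episode_reward, episode_length
```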
So how do we compare policies in the evaluation stage? 🧐
- There are 5 fixed evaluation tasks.
- We can choose to evaluate our policy on one of them, several, or all of them.
Part II: Running PPO on our problem
Files relevant to the PPO agent
- ppo.py: runs the training and evaluation stages sequentially.
- configs/maml/verification_tasks.pkl: contains 5 tasks (environments/optics) upon which the policies will be evaluated.
PPO agent settings 🧐
- n_env = 1
- n_steps = 2048 (default params)
- n_epochs = 10 (default params)
- buffer_size = n_steps x n_env = 2048
- backprops = (total_timesteps / buffer_size) * n_epochs
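For orientation, here is a rough sketch of how such settings map onto a Stable-Baselines3 PPO setup (the defaults above match SB3's PPO, but this is a generic sketch rather than the repository's ppo.py; Pendulum-v1 is only a stand-in for the AWAKE environment):

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Placeholder environment: in the tutorial, ppo.py builds the AWAKE steering environment instead.
env = gym.make("Pendulum-v1")

model = PPO(
    "MlpPolicy",
    env,
    n_steps=2048,   # samples collected per environment before each update (buffer_size = n_steps * n_envs)
    n_epochs=10,    # gradient-descent passes over the buffer per update
    verbose=1,
)
model.learn(total_timesteps=100)  # compare this with buffer_size = 2048 (see the questions below)
```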
Questions 💻
Consider total_timesteps = 100.
This parameter specifies the total number of timesteps (or steps) that the training process should run across all environments.
$\implies$ Considering the PPO agent settings: will we fill the buffer? What do you expect will happen?
Questions 💻
Run the code in the terminal with python ppo.py --train --steps 100 and observe the plot that pops up.
$\implies$ What is the difference in episode length between the benchmark policy and PPO?
$\implies$ Look at the cumulative episode length, which policy takes longer?
$\implies$ Compare both cumulative rewards, which reward is higher and why?
$\implies$ Look at the final reward (-10*RMS(BPM readings)) and consider the successful (in red) and unsuccessful termination conditions mentioned before. What can you say about how the episode was ended?
Plot for 100 steps
Questions 💻
Set total_timesteps to 50,000 this time. Run it in the terminal with python ppo.py --train --steps 50000
$\implies$ What are the main differences between the untrained and trained PPO policies?
Train a bit longer by setting total_timesteps to 100,000. Run it in the terminal with python ppo.py --train --steps 100000
$\implies$ After how many steps does it converge compared to the previous training runs?
Plot for 50000 steps
Part III: Running MAML on our problem
Meta RL
Meta-learning occurs when one learning system progressively adjusts the operation of a second learning system, such that the latter operates with increasing speed and efficiency. It is also called "learning to learn". There are many flavors of meta RL.
This scenario is often described in terms of two ‘loops’ of learning, an outer loop (meta training) that uses its experiences over many task contexts to gradually adjust parameters that govern the operation of an inner loop (adaptation), so that the inner loop can adjust rapidly to new tasks.
Optimization-based meta RL in this tutorial
In this tutorial we will adapt the parameters of our model (policy) through gradient descent with the MAML algorithm.
- We have a meta policy $\phi(\theta)$, where $\theta$ are the weights of a neural network. The meta policy starts untrained $\phi_0$.
Step 1: outer loop
We randomly sample a number of tasks $i$ (in our case $i\in \{1,\dots,8\}$ different lattices, called meta-batch-size in the code) from a task distribution, each one with its particular initial task policy $\varphi_{0}^i=\phi_0$.
Step 2: inner loop (adaptation)
For each task, we gather experience for several episodes, store the experience, and use it to perform gradient descent and update the weights of each task policy $\varphi_{0}^i \rightarrow \varphi_{1}^i$
This is repeated for $k$ gradient descent steps to generate $\varphi_{k}^i$.
Optimization-based meta RL in this tutorial
Step 3: outer loop (meta training)
We generate episodes with the adapted task policies $\varphi_{k}^i$. We sum the losses calculated for each task $\tau_{i}$ and perform gradient descent on the meta policy $\phi_0 \rightarrow \phi_1$
$\beta$ is the meta learning rate and $\alpha$ is the fast learning rate (for the inner-loop gradient updates); see the update rules below.
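Written out in standard MAML notation (with $\mathcal{L}_{\tau_i}$ the loss of task $\tau_i$ and $\theta$ the meta-policy weights), the two loops correspond to:
$$ \text{inner loop (adaptation):}\quad \varphi_{j+1}^{i} = \varphi_{j}^{i} - \alpha \nabla_{\varphi_{j}^{i}} \mathcal{L}_{\tau_i}\!\left(\varphi_{j}^{i}\right), \qquad \varphi_{0}^{i} = \theta $$
$$ \text{outer loop (meta training):}\quad \theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{i} \mathcal{L}_{\tau_i}\!\left(\varphi_{k}^{i}\right) $$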
Meta RL: summary
We start with a random meta policy, and we initialize the task policies with it: $\phi_0 = \varphi_{0}^i$
1 meta_step:                            # Outer loop
    sample 8 tasks
    for task in tasks:
        for fast_step in num_steps:     # Inner loop
            for fast_batch in fast_batch_size:
                rollout 1 episode:
                    reset corrector_strength
                    while not stopped:
                        env.step()
We have gathered experience and trained 8 task policies: $$ \varphi_{0}^1 \rightarrow \varphi_{k}^1$$ $$\vdots$$ $$\varphi_{0}^8 \rightarrow \varphi_{k}^8 $$
The losses from the task policies are summed, and gradient descent is applied to update the meta policy $\phi_0 \rightarrow \phi_1$
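To make the double loop concrete, here is a self-contained toy sketch of the MAML update in PyTorch. The quadratic per-task loss and plain SGD outer update are stand-ins chosen for brevity; the tutorial's actual code uses REINFORCE in the inner loop and TRPO in the outer loop (see the maml_rl/ files listed under "MAML logic" below). All names and numbers are illustrative:

```python
import torch

# Toy illustration of the MAML double loop. The per-task quadratic loss stands in for the
# RL objective; the real tutorial code uses REINFORCE (inner loop) and TRPO (outer loop).
torch.manual_seed(0)
theta = torch.zeros(2, requires_grad=True)   # meta-policy parameters (phi in the slides)
alpha, beta, k_steps = 0.1, 0.05, 3          # fast lr, meta lr, inner-loop gradient steps
meta_optimizer = torch.optim.SGD([theta], lr=beta)

def task_loss(params, target):
    # Each "task" simply pulls the parameters towards a different target vector.
    return ((params - target) ** 2).sum()

for meta_step in range(100):                 # outer loop (meta-training)
    task_targets = torch.randn(8, 2)         # sample 8 tasks from the task distribution
    meta_loss = 0.0
    for target in task_targets:
        varphi = theta                       # task policy initialized from the meta policy
        for _ in range(k_steps):             # inner loop (adaptation)
            grad, = torch.autograd.grad(task_loss(varphi, target), varphi, create_graph=True)
            varphi = varphi - alpha * grad   # fast gradient step on the task policy
        meta_loss = meta_loss + task_loss(varphi, target)  # loss of the adapted policy
    meta_optimizer.zero_grad()
    meta_loss.backward()                     # differentiates through the inner-loop updates
    meta_optimizer.step()                    # meta update: phi_t -> phi_{t+1}
```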
Important files
- train.py: performs the meta-training on the AWAKE problem.
- test.py: performs the evaluation of the trained policy.
- configs/: stores the YAML files for the training configurations.
Evaluation of a random task policy 💻
- We will look at the inner loop only.
- We consider only 1 task for now (task 0), out of the 5 fixed evaluation tasks.
- The policy $\varphi_0^0$ starts as random and adapts for 500 steps (showing the progress every 50 steps).
- This code does the training and evaluation.
$\implies$ Run the following code to train the task policy $\varphi_0^0$ for 500 steps:
python test.py --experiment-name tutorial --experiment-type adapt_from_scratch --num-batches 500 --plot-interval 50 --task-ids 0
Once it has run, you can look at the adaptation progress by running:
python read_out_train.py --experiment-name tutorial --experiment-type adapt_from_scratch
$\implies$ Run it now for all tasks:
python test.py --experiment-name tutorial --experiment-type adapt_from_scratch --num-batches 500 --plot-interval 50 --task-ids 0 1 2 3 4
$\implies$ Save the plot for comparison later
Evaluation of random policy
- If the code didn't work for you, this is the plot you should get (see below).
- We can see that it fails at the beginning, but it learns with time.
Meta training
Training
The meta-training takes about 30 mins for the current configuration. Therefore we have provided a pre-trained policy which can be used for evaluation later.
Meta-learning consumes a considerable amount of data.
Evaluation of the pre-trained meta-policy 💻
We will now use a pre-trained policy located in awake/pretrained_policy.th and evaluate it against a certain number of fixed tasks.
$\implies$ Run the following code:
python test.py --experiment-name tutorial --experiment-type test_meta --use-meta-policy --policy awake/pretrained_policy.th --num-batches 500 --plot-interval 50 --task-ids 0 1 2 3 4
- Use --task-ids 0 1 2 3 4 to run the evaluation against all 5 tasks, or e.g. --task-ids 0 to evaluate only task 0.
- Here we set the flag --use-meta-policy so that the pre-trained policy is used.
$\implies$ Afterwards, you can look at the adaptation progress by running:
python read_out_train.py --experiment-name tutorial --experiment-type test_meta
Evaluation of the trained meta-policy
$\implies$ What difference can you see compared to the untrained policy (previous plot saved)?
We can observe that the pre-trained meta policy can solve the problem for different tasks (i.e. lattices) within a few adaptation steps!
Overall, meta RL performs better from the start.
MAML logic 🧐
This part is important if you want to have a deeper understanding of the MAML algorithm.
- maml_rl/metalearners/maml_trpo.py: implements the TRPO algorithm for the outer loop.
- maml_rl/policies/normal_mlp.py: implements a simple MLP policy for the RL agent.
- maml_rl/utils/reinforcement_learning.py: implements the REINFORCE algorithm for the inner loop.
- maml_rl/samplers/: handles the sampling of the meta-trajectories of the environment using the multiprocessing package.
- maml_rl/baseline.py: a linear baseline for the advantage calculation in RL.
- maml_rl/episodes.py: a custom class to store the results and statistics of the episodes for meta-training.
Further Resources
Getting started in RL
- OpenAI Spinning Up - Very understandable explanations of RL and the most popular algorithms, accompanied by easy-to-read Python implementations.
- Reinforcement Learning with Stable Baselines 3 - YouTube playlist giving a good introduction on RL using Stable Baselines3.
- Build a Doom AI Model with Python - Detailed 3h tutorial of applying RL using DOOM as an example.
- An introduction to Reinforcement Learning - Brief introduction to RL.
- An introduction to Policy Gradient methods - Deep Reinforcement Learning - Brief introduction to PPO.
Further Resources
Papers about RL in Particle Accelerators and Large-Scale Facilities
- Learning-based optimisation of particle accelerators under partial observability without real-world training - Tuning of electron beam properties on a diagnostic screen using RL.
- Sample-efficient reinforcement learning for CERN accelerator control - Beam trajectory steering using RL with a focus on sample-efficient training.
- Autonomous control of a particle accelerator using deep reinforcement learning - Beam transport through a drift tube linac using RL.
- Basic reinforcement learning techniques to control the intensity of a seeded free-electron laser - RL-based laser alignment and drift recovery.
- Real-time artificial intelligence for accelerator control: A study at the Fermilab Booster - Regulation of a gradient magnet power supply using RL and real-time implementation of the trained agent using field-programmable gate arrays (FPGAs).
- Magnetic control of tokamak plasmas through deep reinforcement learning - Landmark paper on RL for controlling a real-world physical system (plasma in a tokamak fusion reactor).
Further Resources
RL Books
- R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, Second edition, Adaptive Computation and Machine Learning series. Cambridge, Massachusetts: The MIT Press, 2020.
- A. Agarwal, N. Jiang, S. M. Kakade, W. Sun: Reinforcement Learning: Theory and Algorithms, 2022 https://rltheorybook.github.io/
- K. P. Murphy, Probabilistic Machine Learning: An introduction. MIT Press, 2022. https://probml.github.io/pml-book/book1.html
- K. P. Murphy, Probabilistic Machine Learning: Advanced Topics. MIT Press, 2023. http://probml.github.io/book2
Packages
- Gymnasium - De facto standard for implementing custom environments. Also provides a library of RL tasks widely used for benchmarking.
- Stable Baselines3 - Provides reliable, benchmarked and easy-to-use implementations of the most important RL algorithms.
- Ray RLlib - Part of the Ray Python package providing implementations of various RL algorithms with a focus on distributed training.
Courses Online
- Chelsea Finn (Stanford): Deep Multi-Task and Meta Learning
- Sergey Levine (UC Berkeley): Deep Reinforcement Learning
- Emma Brunskill (Stanford): Reinforcement Learning