Meta Reinforcement Learning for steering tasks
Use case: AWAKE beamline at CERN
Implementation example for the RL4AA'24 workshop
Simon Hirlaender, Jan Kaiser, Chenran Xu, Andrea Santamaria Garcia
Today!
In this tutorial notebook we will implement all the basic components of a Meta Reinforcement Learning (Meta RL) algorithm to solve a steering task in a linear accelerator.
- Getting started
- Part I: Quick introduction
- Part II: Running PPO on our problem
- Part III: Running MAML on our problem
Getting started
You will need Python 3.9 or higher to run this code ❗
You will require about 1 GB of free disk space ❗
Start by cloning the tutorial repository locally:
git clone https://github.com/RL4AA/rl4aa24-tutorial.git
Getting started
Using Conda
If you don't have conda installed already, you can install Miniconda as described here.
conda env create -f environment.yml
This should create an environment named rl-tutorial
and install the necessary packages inside.
Afterwards, activate the environment using
conda activate rl-tutorial
Getting started
Using venv
If you don't have conda installed, you can alternatively create a virtual environment with
python3 -m venv rl-tutorial
and activate it with $ source <venv>/bin/activate (bash) or C:> <venv>/Scripts/activate.bat (Windows).
Then, install the packages with pip within the activated environment:
python -m pip install -r requirements.txt
Afterwards, you should be able to run the provided scripts.
Part I: Quick introduction
AWAKE Accelerator
AWAKE (The Advanced Proton Driven Plasma Wakefield Acceleration Experiment) is an accelerator R&D project based at CERN. It investigates the use of plasma wakefields driven by a proton bunch to accelerate charged particles.
Plasmas can support extremely strong electric fields, with accelerating gradients of GV/m over meter-scale distances, which can reduce the size of future accelerators.
- Momentum: 10-20 MeV/c
- Electrons per bunch: 1.2e9
- Bunch length: 4 ps
- Pulse repetition rate: 10 Hz
Reference
"Acceleration of electrons in the plasma wakefield of a proton bunch" - Nature volume 561(2018)The accelerator problem we want to solve
The goal is to minimize the distance $\Delta x_i$ of an initial beam trajectory to a target trajectory at different points $i$ across the accelerator (here marked as "position") in as few steps as possible.
Reference
"Ultra fast reinforcement learning demonstrated at CERN AWAKE" - IPAC (2023)Formulating the RL problem
The problem is formulated in an episodic manner.
Actions
The actuators are the strengths of 10 corrector magnets that can steer the beam. They are normalized to [-1, 1]. In this tutorial, we apply the action by adding a delta change $\Delta a$ to the current magnet strengths.
Formulating the RL problem
States/Observations
The observations are the readings of ten beam position monitors (BPMs), which measure the position of the beam at particular points along the beamline. The states are also normalized to [-1, 1], corresponding to $\pm$ 100 mm in the real accelerator.
Formulating the RL problem
Reward
The reward is the negative RMS value of the distance to the target trajectory:
$$ r(x) = - \sqrt{ \frac{1}{10} \sum_{i=1}^{10} \Delta x_{i}^2} \,, \ \ \ \Delta x_{i} = x_{i} - x^{\text{target}}_{i} $$
where $x^{\text{target}}=\vec{0}$ for a centered trajectory.
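As a minimal sketch of this formula (assuming the BPM readings come as a normalized NumPy array; the function name is just for illustration):

```python
import numpy as np

def trajectory_reward(bpm_readings, target=None):
    """Negative RMS distance of the BPM readings to the target trajectory (zeros by default)."""
    target = np.zeros_like(bpm_readings) if target is None else target
    delta_x = bpm_readings - target
    return -np.sqrt(np.mean(delta_x**2))
```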
Formulating the RL problem
Successful termination condition
If the RMS distance to the target trajectory falls below a threshold (10 mm in our case, 0.1 in normalized scale, i.e. a reward above -0.1), the episode ends successfully. We cannot require exactly 0 because of the finite resolution of the BPMs.
Unsuccessful termination (safety) condition
If the beam hits the wall (any state ≤ -1 or ≥ 1 in normalized scale, i.e. ±10 cm), the episode is terminated unsuccessfully. In this case, the agent receives a large negative reward (all BPM readings downstream of the impact are set to the largest value) to discourage this behavior.
Episode initialization
All episodes are initialised such that the RMS distance to the target trajectory is large. This ensures that the task is not too easy and that the beam starts relatively close to the boundaries, probing the safety settings.
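Putting the pieces above together, a minimal Gymnasium-style sketch of this episode logic could look as follows; the class name, the initialization range, and the way the response matrix is passed in are illustrative placeholders, not the tutorial's actual implementation:

```python
import numpy as np
import gymnasium as gym

class ToySteeringEnv(gym.Env):
    """Illustrative sketch of the AWAKE steering logic: 10 correctors, 10 BPMs, linear response."""

    def __init__(self, response_matrix, threshold=0.1):
        self.R = response_matrix            # 10x10 response matrix of the current task
        self.threshold = threshold          # success threshold on the RMS (0.1 normalized = 10 mm)
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(10,))
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(10,))

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # Start from corrector settings that give a large RMS distance to the target trajectory.
        self.correctors = self.np_random.uniform(-0.8, 0.8, size=10)
        self.state = self.R @ self.correctors
        return self.state, {}

    def step(self, delta_a):
        # Actions are delta changes added to the current corrector strengths.
        self.correctors = np.clip(self.correctors + delta_a, -1.0, 1.0)
        self.state = self.R @ self.correctors
        hit_wall = np.any(np.abs(self.state) >= 1.0)
        if hit_wall:
            # Penalize wall hits: downstream readings are pinned to the maximum value.
            first = np.argmax(np.abs(self.state) >= 1.0)
            self.state[first:] = 1.0
        reward = -np.sqrt(np.mean(self.state**2))   # negative RMS, target trajectory is zero
        terminated = hit_wall or (-reward < self.threshold)
        return self.state, reward, terminated, False, {}
```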
Agents
In this tutorial we will use: PPO (Proximal Policy Optimization) and MAML (Model Agnostic Meta Learning)
Formulating the RL problem
Environments/Tasks
In this tutorial we will use a variety of environments or tasks:
Fixed tasks for evaluation ❗
Randomly sampled tasks from a task distribution for meta-training ❗
We generate them from the original, nominal optics by applying a random scaling factor to the quadrupole strengths.
Formulating the RL problem
Environments/Tasks
The environment dynamics are determined by the response matrix, which in linear systems can encapsulate the dynamics of the problem.
More specifically: given the response matrix $\mathbf R$, the change in actions $\Delta a$ (corrector magnet strengths), and the change in states $\Delta s$ (BPM readings), we have:
\begin{align} \Delta s &= \mathbf{R}\,\Delta a \end{align}
Defining a benchmark policy
During this tutorial we want to compare the trained policies we obtain with different methods to a benchmark policy.
For this problem, our benchmark policy is just the inverse of the environment's response matrix.
$\implies$ Actions from benchmark policy: \begin{align} \Delta a &= \mathbf{R}^{-1}\Delta s \end{align}
$\implies$ Actions from deep RL policy: the trained policy network maps the observed state directly to an action, $\Delta a = \pi_\theta(s)$, where $s$ is the vector of current BPM readings (see the sketch below).
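As an illustration (NumPy only; the response matrix and state below are random placeholders), one benchmark step looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=(10, 10))            # placeholder response matrix (the real one comes from the optics)
state = rng.uniform(-0.5, 0.5, size=10)  # current BPM readings; the target trajectory is zero

# Benchmark policy: invert the linear response to move the beam towards the target,
# then clip so that the step stays within the allowed normalized action range [-1, 1].
delta_s_wanted = -state                  # desired change in the BPM readings
delta_a = np.clip(np.linalg.solve(R, delta_s_wanted), -1.0, 1.0)

# Deep RL policy: a trained network would instead map the state directly to the action,
# e.g. delta_a = policy_network(state)   (forward pass only during evaluation)
```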
Cheatsheet on RL training 🧐
Training stage
During the training stage, experience is gathered in a buffer that is used to update the weights of the policy through gradient descent. The samples in the buffer can be passed to the gradient descent algorithm in batches, and gradient descent is performed for a number of epochs. This is how the agent "learns".
Evaluation/validation stage
The policy is fixed (no weight updates) and only forward passes are performed.
So how do we compare policies in the evaluation stage? 🧐
- At the beginning of each episode we reset the environment to random, suboptimal corrector strengths.
- For each step within the episode we use the inverse of the response matrix (benchmark) or the trained policy to compute the next action (forward passes) until the episode ends (convergence or termination); see the sketch after this list.
- This will be performed for different evaluation tasks, just to assess how the policy performs in different lattices.
Side note:
- The benchmark policy will not immediately find the settings for the target trajectory, because the actions are limited so that the maximum step is within $[-1,1]$ in the normalized space.
- We can then compare the metrics of both policies.
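A minimal sketch of one such evaluation episode, using the standard Gymnasium loop; the policy argument stands in for either the benchmark or the trained network:

```python
def evaluate_episode(env, policy):
    """Roll out one episode with a fixed policy (forward passes only, no weight updates)."""
    obs, info = env.reset()                 # random, suboptimal corrector settings
    episode_reward, episode_length = 0.0, 0
    done = False
    while not done:
        action = policy(obs)                # benchmark: clipped R^-1 @ (-obs); RL: network forward pass
        obs, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        episode_length += 1
        done = terminated or truncated
    return episode_reward, episode_length
```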
So how do we compare policies in the evaluation stage? 🧐
- There are 5 fixed evaluation tasks.
- We can choose to evaluate our policy on one of them, several, or all of them.
Part II: Running PPO on our problem
Files relevant to the PPO agent
- ppo.py: runs the training and evaluation stages sequentially.
- configs/maml/verification_tasks.pkl: contains 5 tasks (environments/optics) upon which the policies will be evaluated.
PPO agent settings 🧐
- n_env = 1
- n_steps = 2048 (default params)
- n_epochs = 10 (default params)
- buffer_size = n_steps x n_env = 2048
- backprops = (total_timesteps / buffer_size) * n_epochs
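For orientation, here is a rough sketch of how such settings map onto a Stable-Baselines3 PPO setup (the defaults above match SB3's PPO, but this is a generic sketch rather than the repository's ppo.py; Pendulum-v1 is only a stand-in for the AWAKE environment):

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Placeholder environment: in the tutorial, ppo.py builds the AWAKE steering environment instead.
env = gym.make("Pendulum-v1")

model = PPO(
    "MlpPolicy",
    env,
    n_steps=2048,   # samples collected per environment before each update (buffer_size = n_steps * n_envs)
    n_epochs=10,    # gradient-descent passes over the buffer per update
    verbose=1,
)
model.learn(total_timesteps=100)  # compare this with buffer_size = 2048 (see the questions below)
```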
Questions 💻
Consider total_timesteps = 100.
This parameter specifies the total number of timesteps (or steps) that the training process should run across all environments.
$\implies$ Considering the PPO agent settings: will we fill the buffer? What do you expect will happen?
Questions 💻
Run the code in the terminal with python ppo.py --train --steps 100 and observe the plot that pops up.
$\implies$ What is the difference in episode length between the benchmark policy and PPO?
$\implies$ Look at the cumulative episode length, which policy takes longer?
$\implies$ Compare both cumulative rewards, which reward is higher and why?
$\implies$ Look at the final reward (-10*RMS(BPM readings)) and consider the successful (in red) and unsuccessful termination conditions mentioned before. What can you say about how the episode was ended?
Plot for 100 steps
Questions 💻
Set total_timesteps to 50,000 this time. Run it in the terminal with python ppo.py --train --steps 50000
$\implies$ What are the main differences between the untrained and trained PPO policies?
Train a bit longer by setting total_timesteps to 100,000. Run it in the terminal with python ppo.py --train --steps 100000
$\implies$ After how many steps does it converge compared to the previous training runs?
Plot for 50000 steps
Part III: Running MAML on our problem
Meta RL
Meta-learning occurs when one learning system progressively adjusts the operation of a second learning system, such that the latter operates with increasing speed and efficiency. It is also called "learning to learn". There are many flavors of meta RL.
This scenario is often described in terms of two ‘loops’ of learning, an outer loop (meta training) that uses its experiences over many task contexts to gradually adjust parameters that govern the operation of an inner loop (adaptation), so that the inner loop can adjust rapidly to new tasks.
Optimization-based meta RL in this tutorial
In this tutorial we will adapt the parameters of our model (policy) through gradient descent with the MAML algorithm.
- We have a meta policy $\phi(\theta)$, where $\theta$ are the weights of a neural network. The meta policy starts untrained $\phi_0$.
Step 1: outer loop
We randomly sample a number of tasks $i$ (in our case $i\in \{1,\dots,8\}$ different lattices, called meta-batch-size in the code) from a task distribution, each one with its particular initial task policy $\varphi_{0}^i=\phi_0$.
Step 2: inner loop (adaptation)
For each task, we gather experience for several episodes, store the experience, and use it to perform gradient descent and update the weights of each task policy $\varphi_{0}^i \rightarrow \varphi_{1}^i$
This is repeated for $k$ gradient descent steps to generate $\varphi_{k}^i$.
Optimization-based meta RL in this tutorial
Step 3: outer loop (meta training)
We generate episodes with the adapted task policies $\varphi_{k}^i$. We sum the losses calculated for each task $\tau_{i}$ and perform gradient descent on the meta policy $\phi_0 \rightarrow \phi_1$
$\beta$ is the meta learning rate and $\alpha$ is the fast learning rate (for the inner-loop gradient updates); see the update rules below.
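Written out in standard MAML notation (with $\mathcal{L}_{\tau_i}$ the loss of task $\tau_i$ and $\theta$ the meta-policy weights), the two loops correspond to:
$$ \text{inner loop (adaptation):}\quad \varphi_{j+1}^{i} = \varphi_{j}^{i} - \alpha \nabla_{\varphi_{j}^{i}} \mathcal{L}_{\tau_i}\!\left(\varphi_{j}^{i}\right), \qquad \varphi_{0}^{i} = \theta $$
$$ \text{outer loop (meta training):}\quad \theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{i} \mathcal{L}_{\tau_i}\!\left(\varphi_{k}^{i}\right) $$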
Meta RL: summary
We start with a random meta policy, and we initialize the task policies with it: $\phi_0 = \varphi_{0}^i$
1 meta_step:                            # Outer loop
    sample 8 tasks
    for task in tasks:
        for fast_step in num_steps:     # Inner loop
            for fast_batch in fast_batch_size:
                rollout 1 episode:
                    reset corrector_strength
                    while not stopped:
                        env.step()
We have gathered experience and trained 8 task policies: $$ \varphi_{0}^1 \rightarrow \varphi_{k}^1$$ $$\vdots$$ $$\varphi_{0}^8 \rightarrow \varphi_{k}^8 $$
The losses from the task policies are summed, and gradient descent is applied to update the meta policy $\phi_0 \rightarrow \phi_1$
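To make the double loop concrete, here is a self-contained toy sketch of the MAML update in PyTorch. The quadratic per-task loss and plain SGD outer update are stand-ins chosen for brevity; the tutorial's actual code uses REINFORCE in the inner loop and TRPO in the outer loop (see the maml_rl/ files listed under "MAML logic" below). All names and numbers are illustrative:

```python
import torch

# Toy illustration of the MAML double loop. The per-task quadratic loss stands in for the
# RL objective; the real tutorial code uses REINFORCE (inner loop) and TRPO (outer loop).
torch.manual_seed(0)
theta = torch.zeros(2, requires_grad=True)   # meta-policy parameters (phi in the slides)
alpha, beta, k_steps = 0.1, 0.05, 3          # fast lr, meta lr, inner-loop gradient steps
meta_optimizer = torch.optim.SGD([theta], lr=beta)

def task_loss(params, target):
    # Each "task" simply pulls the parameters towards a different target vector.
    return ((params - target) ** 2).sum()

for meta_step in range(100):                 # outer loop (meta-training)
    task_targets = torch.randn(8, 2)         # sample 8 tasks from the task distribution
    meta_loss = 0.0
    for target in task_targets:
        varphi = theta                       # task policy initialized from the meta policy
        for _ in range(k_steps):             # inner loop (adaptation)
            grad, = torch.autograd.grad(task_loss(varphi, target), varphi, create_graph=True)
            varphi = varphi - alpha * grad   # fast gradient step on the task policy
        meta_loss = meta_loss + task_loss(varphi, target)  # loss of the adapted policy
    meta_optimizer.zero_grad()
    meta_loss.backward()                     # differentiates through the inner-loop updates
    meta_optimizer.step()                    # meta update: phi_t -> phi_{t+1}
```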
Important files
- train.py: performs the meta-training on the AWAKE problem.
- test.py: performs the evaluation of the trained policy.
- configs/: stores the YAML files for the training configurations.
Evaluation of a random task policy 💻
- We will look at the inner loop only.
- We consider only 1 task for now (task 0), out of the 5 fixed evaluation tasks.
- The policy $\varphi_0^0$ starts as random and adapts for 500 steps (showing the progress every 50 steps).
- This code does the training and evaluation.
$\implies$ Run the following code to train the task policy $\varphi_0^0$ for 500 steps:
python test.py --experiment-name tutorial --experiment-type adapt_from_scratch --num-batches 500 --plot-interval 50 --task-ids 0
Once it has run, you can look at the adaptation progress by running:
python read_out_train.py --experiment-name tutorial --experiment-type adapt_from_scratch
$\implies$ Run it now for all tasks:
python test.py --experiment-name tutorial --experiment-type adapt_from_scratch --num-batches 500 --plot-interval 50 --task-ids 0 1 2 3 4
$\implies$ Save the plot for comparison later
Evaluation of random policy
- If the code didn't work for you, this is the plot you should get (see below).
- We can see that it fails at the beginning, but it learns with time.
Meta training
Training
The meta-training takes about 30 mins for the current configuration. Therefore we have provided a pre-trained policy which can be used for evaluation later.
Meta-learning consumes a considerable amount of data.
Evaluation of the pre-trained meta-policy 💻
We will now use a pre-trained policy located in awake/pretrained_policy.th and evaluate it against a certain number of fixed tasks.
$\implies$ Run the following code:
python test.py --experiment-name tutorial --experiment-type test_meta --use-meta-policy --policy awake/pretrained_policy.th --num-batches 500 --plot-interval 50 --task-ids 0 1 2 3 4
- Use --task-ids 0 1 2 3 4 to run the evaluation against all 5 tasks, or e.g. --task-ids 0 to evaluate only task 0.
- Here we set the flag --use-meta-policy so that the pre-trained policy is used.
$\implies$ Afterwards, you can look at the adaptation progress by running:
python read_out_train.py --experiment-name tutorial --experiment-type test_meta
Evaluation of the trained meta-policy
$\implies$ What difference can you see compared to the untrained policy (previous plot saved)?
We can observe that the pre-trained meta policy can solve the problem for different tasks (i.e. lattices) within a few adaptation steps!
Overall, meta RL performs better from the start.
MAML logic 🧐
This part is important if you want to have a deeper understanding of the MAML algorithm.
- maml_rl/metalearners/maml_trpo.py: implements the TRPO algorithm for the outer loop.
- maml_rl/policies/normal_mlp.py: implements a simple MLP policy for the RL agent.
- maml_rl/utils/reinforcement_learning.py: implements the REINFORCE algorithm for the inner loop.
- maml_rl/samplers/: handles the sampling of the meta-trajectories of the environment using the multiprocessing package.
- maml_rl/baseline.py: a linear baseline for the advantage calculation in RL.
- maml_rl/episodes.py: a custom class to store the results and statistics of the episodes for meta-training.
Further Resources
Getting started in RL
- OpenAI Spinning Up - Very understandable explanations of RL and the most popular algorithms, accompanied by easy-to-read Python implementations.
- Reinforcement Learning with Stable Baselines 3 - YouTube playlist giving a good introduction on RL using Stable Baselines3.
- Build a Doom AI Model with Python - Detailed 3h tutorial of applying RL using DOOM as an example.
- An introduction to Reinforcement Learning - Brief introduction to RL.
- An introduction to Policy Gradient methods - Deep Reinforcement Learning - Brief introduction to PPO.
Further Resources
Papers about RL in Particle Accelerators and Large-Scale Facilities
- Learning-based optimisation of particle accelerators under partial observability without real-world training - Tuning of electron beam properties on a diagnostic screen using RL.
- Sample-efficient reinforcement learning for CERN accelerator control - Beam trajectory steering using RL with a focus on sample-efficient training.
- Autonomous control of a particle accelerator using deep reinforcement learning - Beam transport through a drift tube linac using RL.
- Basic reinforcement learning techniques to control the intensity of a seeded free-electron laser - RL-based laser alignment and drift recovery.
- Real-time artificial intelligence for accelerator control: A study at the Fermilab Booster - Regulation of a gradient magnet power supply using RL and real-time implementation of the trained agent using field-programmable gate arrays (FPGAs).
- Magnetic control of tokamak plasmas through deep reinforcement learning - Landmark paper on RL for controlling a real-world physical system (plasma in a tokamak fusion reactor).
Further Resources
RL Books
- R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, Second edition, Adaptive Computation and Machine Learning series. Cambridge, Massachusetts: The MIT Press, 2020.
- A. Agarwal, N. Jiang, S. M. Kakade, W. Sun: Reinforcement Learning: Theory and Algorithms, 2022 https://rltheorybook.github.io/
- K. P. Murphy, Probabilistic Machine Learning: An introduction. MIT Press, 2022. https://probml.github.io/pml-book/book1.html
- K. P. Murphy, Probabilistic Machine Learning: Advanced Topics. MIT Press, 2023. http://probml.github.io/book2
Packages
- Gymnasium - De facto standard for implementing custom environments. Also provides a library of RL tasks widely used for benchmarking.
- Stable Baselines3 - Provides reliable, benchmarked and easy-to-use implementations of the most important RL algorithms.
- Ray RLlib - Part of the Ray Python package providing implementations of various RL algorithms with a focus on distributed training.
Courses Online
- Chelsea Finn (Stanford): Deep Multi-Task and Meta Learning
- Sergey Levine (UC Berkeley): Deep Reinforcement Learning
- Emma Brunskill (Stanford): Reinforcement Learning