Applying Reinforcement Learning to Particle Accelerators: An Introduction
Use case: Transverse beam steering at ARES linear accelerator at DESY
Tutorial at 4th ICFA beam dynamics mini-workshop on machine learning applications for particle accelerators
Today!
In this tutorial notebook we will implement all the basic components of a reinforcement learning algorithm to solve a problem in particle accelerators, with a focus on reward definition.
- Part I: Introduction
- Part II: Algorithm implementation in Python
- Part III: Reward definition!
- Part IV: Training an RL agent
Download the repository
Once you have Git installed, open your terminal, go to your desired directory, and type:
git clone https://github.com/RL4AA/rl-tutorial-ares-basic.git
Then enter the downloaded repository:
cd rl-tutorial-ares-basic
Install dependencies
You need to install the dependencies before running the notebooks.
Using conda
If you don't have conda installed already and want to use conda for environment management, you can install Miniconda as described here.
- Create a conda env with
conda create -n rl-icfa python=3.10
- Activate the environment with
conda activate rl-icfa
- Install the required packages via
pip install -r requirements.txt
- Additional installation steps:
python -m jupyter contrib nbextension install --user
python -m jupyter nbextension enable varInspector/main
- After the tutorial you can remove your environment with
conda remove -n rl-icfa --all
Install dependencies
You need to install the dependencies before running the notebooks.
Using venv only
If you do not have conda installed, you can alternatively create the virtual environment with venv from the standard library:
python -m venv rl-icfa
and activate the environment with
source rl-icfa/bin/activate
Then, install the packages with pip within the activated environment:
python -m pip install -r requirements.txt
Finally, install the notebook extensions if you want to view the notebooks in slide mode:
python -m jupyter contrib nbextension install --user
python -m jupyter nbextension enable varInspector/main
# Importing the required packages
from time import sleep
import matplotlib.pyplot as plt
import names
import numpy as np
from gymnasium.wrappers import RescaleAction
from IPython.display import clear_output, display
from stable_baselines3 import PPO
from utils.helpers import (
evaluate_ares_ea_agent,
plot_ares_ea_training_history,
show_video,
)
from utils.train import ARESEACheetah, make_env, read_from_yaml
from utils.train import train as train_ares_ea
from utils.utils import NotVecNormalize
Part I: Introduction
Formulating the RL problem
We need to define:
- Actions
- Observations
- Reward
- Environment
- Agent
ARES (Accelerator Research Experiment at SINBAD)
ARES is an S-band radio frequency linac at the DESY Hamburg site equipped with a photoinjector and two independently driven traveling wave accelerating structures. The main research focus is the generation and characterization of sub-femtosecond electron bunches at relativistic particle energies. The generation of short electron bunches is of high interest for radiation generation, e.g. by free-electron lasers.
- Final energy: 100-155 MeV
- Bunch charge: 0.01-200 pC
- Bunch length: 30 fs - 1 ps
- Pulse repetition rate: 1-50 Hz
The accelerator problem we want to solve
We would like to focus and center the electron beam on a diagnostic screen using corrector and quadrupole magnets
Formulating the RL problem
Overview of our study case
Discussion
$\implies$ Is the action space continuous or discrete?
$\implies$ Is the problem fully observable or partially observable?
Formulating the RL problem
Actions
In the ARES transverse tuning task we have 3 quadrupoles and 2 corrector magnets
The actions are:
- Quadrupole magnet strength $k_{1,2,3}$ $[1/m^2]$
- Corrector deflection angle $\theta_\mathrm{v, h}$ $[mrad]$ (vertical and horizontal)
In our control system we can set these derived values directly according to the beam energy
$\implies$ actions $=[k_{\mathrm{Q1}},k_{\mathrm{Q2}},\theta_\mathrm{CV},k_{\mathrm{Q3}},\theta_\mathrm{CH}]$ is a 5-dimensional array
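As a reference, such a 5-dimensional continuous action space could be declared in Gymnasium as a `Box`. This is a sketch only; the limits below are illustrative placeholders, not the actual ARES magnet limits.

# Sketch of a 5-dimensional continuous action space (illustrative limits only)
import numpy as np
from gymnasium import spaces

# Order: [k_Q1, k_Q2, theta_CV, k_Q3, theta_CH]
action_space = spaces.Box(
    low=np.array([-30.0, -30.0, -3e-3, -30.0, -3e-3], dtype=np.float32),
    high=np.array([30.0, 30.0, 3e-3, 30.0, 3e-3], dtype=np.float32),
)
sample_action = action_space.sample()  # a random 5-dimensional action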
Formulating the RL problem
Observation / state
Observation is the information an agent receives about the current state of the environment
It should provide enough information so that the agent can solve this problem.
The observation does not necessarily cover the entire (internal) state of the environment.
Discussion
$\implies$ What should be included in the observation?
$\implies$ What can be observed in simulation?
$\implies$ What cannot be observed in real world?
$\implies$ How does this relate to the environment?
The screen is made from scintillating material and glows when hit by electrons
The camera films the screen
Formulating the RL problem
The environment's state
The state can be fully described by four components:
- The target beam: the beam we want to achieve, our goal
- as a 4-dimensional array $b^\mathrm{(t)}=[\mu_x^{(\mathrm{t})},\sigma_x^{(\mathrm{t})},\mu_y^{(\mathrm{t})},\sigma_y^{(\mathrm{t})}]$, where $\mu$ denotes the position on the screen, $\sigma$ denotes the beam size, and $t$ stands for "target".
- The incoming beam: the beam that enters the EA upstream
- $I = [\mu_x^{(\mathrm{i})},\sigma_x^{(\mathrm{i})},\mu_y^{(\mathrm{i})},\sigma_y^{(\mathrm{i})},\mu_{xp}^{(\mathrm{i})},\sigma_{xp}^{(\mathrm{i})},\mu_{yp}^{(\mathrm{i})},\sigma_{yp}^{(\mathrm{i})},\mu_s^{(\mathrm{i})},\sigma_s^{(\mathrm{i})}]$, where $i$ stands for "incoming"
- The magnet strengths and deflection angles
- $[k_{\mathrm{Q1}},k_{\mathrm{Q2}},\theta_\mathrm{CV},k_{\mathrm{Q3}},\theta_\mathrm{CH}]$
- The transverse misalignments of quadrupoles and the diagnostic screen
- $[m_{\mathrm{Q1}}^{(\mathrm{x})},m_{\mathrm{Q1}}^{(\mathrm{y})},m_{\mathrm{Q2}}^{(\mathrm{x})},m_{\mathrm{Q2}}^{(\mathrm{y})},m_{\mathrm{Q3}}^{(\mathrm{x})},m_{\mathrm{Q3}}^{(\mathrm{y})},m_{\mathrm{S}}^{(\mathrm{x})},m_{\mathrm{S}}^{(\mathrm{y})}]$
Discussion
$\implies$ Do we (fully) know or can we observe the state of the environment?
Formulating the RL problem
Our definition of observation
The observation for this task contains three parts:
- The target beam: the beam we want to achieve, our goal
- as a 4-dimensional array $b^\mathrm{(t)}=[\mu_x^{(\mathrm{t})},\sigma_x^{(\mathrm{t})},\mu_y^{(\mathrm{t})},\sigma_y^{(\mathrm{t})}]$, where $\mu$ denotes the position on the screen, $\sigma$ denotes the beam size, and $t$ stands for "target".
- The current beam: the beam we currently have
- $b^\mathrm{(c)}=[\mu_x^{(\mathrm{c})},\sigma_x^{(\mathrm{c})},\mu_y^{(\mathrm{c})},\sigma_y^{(\mathrm{c})}]$, where $c$ stands for "current"
- The magnet strengths and deflection angles
- $[k_{\mathrm{Q1}},k_{\mathrm{Q2}},\theta_\mathrm{CV},k_{\mathrm{Q3}},\theta_\mathrm{CH}]$
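For illustration, one way to assemble such an observation is to concatenate the three parts into a single flat array. This is a sketch with hypothetical helper names, not the actual `ARESEA` implementation (which may use a dict-style observation space).

# Sketch: observation as a flat 13-dimensional vector
# [target beam (4), current beam (4), magnet settings (5)]
import numpy as np
from gymnasium import spaces

observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(13,), dtype=np.float32)

def build_observation(target_beam, current_beam, magnet_settings):
    # Concatenate the three observation parts into a single flat array
    return np.concatenate([target_beam, current_beam, magnet_settings]).astype(np.float32)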
Discussion
$\implies$ Does this observation definition fulfill the Markov property? (Does the probability distribution of the next beam depend only on the observation, or is it affected by other state information?)
Formulating the RL problem
Goal and reward
Our goal is divided into two tasks:
- to steer the beam to the desired positions
- to focus the beam to the desired beam size
Discussion
$\implies$ How should we define our reward function? Give it a go!
$\implies$ We will look into the reward definitions in the following section.
Formulating the RL problem
Agent / algorithm
image from RL Tips and Tricks - A. Raffin
Discussion
$\implies$ What would you choose and why?
Part II: Algorithm implementation in Python
About libraries for RL
There are many libraries with already implemented RL algorithms, and frameworks to implement an environment to interact with. In this notebook we use:
- Stable-Baselines3 for the RL algorithms
- Gymnasium for the environment
More info here
Note:
- Gymnasium is the successor of OpenAI Gym.
- Stable-Baselines3 now has an early-stage JAX implementation, sbx.
Agent / algorithm
- As mentioned, we use the Stable-Baselines3 (SB3) package to implement the reinforcement learning algorithms.
- In this tutorial we focus on two examples: PPO (proximal policy optimization) and TD3 (twin delayed DDPG)
Environment
We take all the elements of the RL problem we defined previously and represent the tuning task as a `gym` environment; Gymnasium is the standard library for RL tasks.
A custom `gym.Env` would contain the following parts:
- Initialization: sets up the environment and declares the allowed `observation_space` and `action_space`
- `reset` method: resets the environment for a new episode and returns the 2-tuple `(observation, info)`
- `step` method: main logic of the environment. It takes an `action`, changes the environment to a new `state`, gets the new `observation`, computes the `reward`, and finally returns the 5-tuple `(observation, reward, terminated, truncated, info)`
  - `terminated` checks if the current episode should be terminated according to the underlying MDP (goal reached, or some thresholds exceeded)
  - `truncated` checks if the current episode should be truncated for reasons outside of the underlying MDP (e.g. a time limit)
- `render` method: visualizes the environment (a video, or just some plots)
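As a minimal sketch (schematic placeholders only, not the actual `ARESEA` code), a custom environment with these parts could look like this:

# Schematic skeleton of a custom Gymnasium environment (placeholders only)
import gymnasium as gym
import numpy as np
from gymnasium import spaces

class MyTuningEnv(gym.Env):
    def __init__(self):
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(13,), dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(5,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        observation = self.observation_space.sample()  # placeholder observation
        return observation, {}

    def step(self, action):
        observation = self.observation_space.sample()  # placeholder observation
        reward = 0.0                                    # placeholder reward
        terminated = False  # e.g. goal reached according to the MDP
        truncated = False   # e.g. time limit exceeded
        return observation, reward, terminated, truncated, {}

    def render(self):
        return None  # e.g. return a screen image as a numpy array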
An overview of this RL project
Code directory structure
We list the most relevant parts of the project structure below:
- `utils/train.py` contains the gym environments and the training script
  - `ARESEA` implements the ARES Experimental Area transverse tuning task as a `gym.Env`. It contains the basic logic, such as the definition of the observation space, action space, and reward. How an action is applied is implemented in child classes with specific backends.
  - `ARESEACheetah` is derived from the base class `ARESEA` and uses the `cheetah` simulation as a backend.
  - `make_env` initializes an `ARESEA` environment and wraps it with the required gym wrappers for convenient features (e.g. monitoring the progress, ending the episode when the time limit is reached, rescaling the action, normalizing the observation, ...)
  - `train` is a convenience function for training the RL agent. It calls `make_env`, sets up the RL algorithm, starts training, and saves the results in `utils/recordings`, `utils/monitors`, and `utils/models`.
Code directory structure
We list the most relevant parts of the project structure below:
- `utils/helpers.py` contains some utility functions
  - `evaluate_ares_ea_agent` takes a trained agent and evaluates its performance using different metrics.
  - `plot_ares_ea_training_history` shows the progress during training
What is Cheetah?
- RL algorithms require a large number of samples to learn ($10^5-10^9$), and getting those samples in the real accelerator is often too costly.
- This is why a common approach is to train the agent in simulation, and then deploy it in the real machine
- In our case we would train with optics simulation codes for accelerators, such as OCELOT
- These codes were developed to support the design phases of accelerators, not to generate training data, so their computing time is too high for RL.
- Cheetah is a tensorized approach for transfer matrix tracking, which saves computation time and overhead compared to OCELOT
You can find more information in the paper and the code repository.
The ARES-EA (ARES Experimental Area) Environment
- We formulated the ARES-EA task as a `gym` environment, which allows our algorithm to easily interface with both the simulation and real-machine backends as shown before.
- In this part, you will get familiar with the environment for beam focusing and positioning at the ARES accelerator.
Some methods:
- `reset`: in both the real and simulation cases, resets the magnets to their initial values. In simulation, it regenerates the incoming beam and (optionally) resets the magnet misalignments.
- `step`: sets the magnets to new settings and observes the beam (runs a simulation, or observes the screen image in the real world).
Now let's create the environment:
# Create the environment
env = ARESEACheetah()
env.target_beam_mode = "constant"
Set a target beam you want to achieve
$\implies$ Let's define the position $(\mu_x, \mu_y)$ and size $(\sigma_x, \sigma_y)$ of the beam on the screen
$\implies$ Modify the `target_beam` array below, where the order of the arguments is $[\mu_x,\sigma_x,\mu_y,\sigma_y]$
$\implies$ Take into account the dimensions of the screen ($\pm$ 2e-3 m)
$\implies$ The target beam will be represented by a blue circle on the screen
target_beam = np.array([1e-3, 2e-4, 1e-3, 2e-4]) # Change it
env.target_beam_values = target_beam
env.reset()
plt.figure(figsize=(7, 4))
plt.imshow(env.render()) # Plot the screen image
Get familiar with the Gym environment
$\implies$ Change the magnet values, i.e. the actions
$\implies$ The actions are normalized to 1, so valid values are in the [-1, 1] interval
$\implies$ The values of the `action` array in the cell below follow this magnet order: [Q1, Q2, CV, Q3, CH]
action = np.array([1, 0.5, 0.5, 1, 0.6]) # put your action here
Perform one step: update the env, observe new beam!
env = RescaleAction(env, -1, 1) # rescales the action to the interval [-1, 1]
env.reset()
env.step(action)
plt.figure(figsize=(7, 4))
plt.imshow(env.render())
$\implies$ Observe the plot above: what beam does that magnet configuration yield? Can you center and focus the beam by hand?
- Let's now use the environment in a loop, and perform 10 steps
- The function below will linearly vary the value of the vertical corrector
env.reset()
steps = 10
def change_vertical_corrector(q1, q2, cv, q3, ch, steps, i):
    # Linearly increase the vertical corrector setting with the step index i
    action = np.array([q1, q2, cv + 1 / steps * i, q3, ch])
    return action
fig, ax = plt.subplots(1, figsize=(7, 4))
for i in range(steps):
action = change_vertical_corrector(0.2, -0.2, -0.5, 0.3, 0, steps, i)
env.step(action)
img = env.render()
ax.imshow(img)
display(fig)
clear_output(wait=True)
sleep(0.5)
Part III: Reward definition!
- In the following, we reduce our problem to focusing the beam only, and the actuators to the 3 quadrupole magnets
- In this way, we can train an RL agent with fewer steps
Training a good agent revolves primarily around finding the right setup for the environment and the correct reward function. In order to iterate over and compare many different options, our training function takes a dictionary called `config`. The dictionary keys, or "configurations", are explained below.
Configurations
In the following, we use a `config` dictionary to set up the training. This allows us to easily switch between different training conditions. Below we show some selected configurations that have the most influence on the training results; the parameters can mostly be divided into two parts.
Configurations
Environment configurations
- `action_mode`: Set the magnet strengths directly or apply a delta action. You may set this to `"direct"` or `"delta"`. You should find that `"delta"` trains faster; setting `"delta"` is also crucial for running the agent on the real accelerator (see the sketch after this list).
- `reward_mode`: How the reward is calculated. Can be set to `negative_objective`, `objective_improvement`, or `sum_of_pixels`.
- `time_reward`: Whether the agent is penalized for taking another step; this is intended to make the tuning faster.
- `rescale_action`: Takes the limits of the magnet settings and rescales them into the given range.
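To make the difference between `"direct"` and `"delta"` concrete, here is a hedged sketch (hypothetical helper, not the actual `ARESEA` code) of how a step could translate the action into magnet settings:

# Sketch: translating an action into magnet settings for the two action modes
import numpy as np

def settings_from_action(action, current_settings, action_mode="delta"):
    if action_mode == "direct":
        # The action *is* the new set of magnet settings
        return np.asarray(action)
    elif action_mode == "delta":
        # The action is a (small) change applied to the current settings
        return np.asarray(current_settings) + np.asarray(action)
    raise ValueError(f"Unknown action_mode: {action_mode}")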
Configurations
Environment configurations
Termination conditions:
- `abort_if_off_screen`: If this property is set to `True`, episodes are aborted when the beam is no longer on the screen.
- `time_limit`: Number of interactions the agent gets to tune the magnets within one episode.
- `target_sigma_x_threshold`, `target_sigma_y_threshold`: Thresholds on the beam parameters. If all beam parameters are within the threshold of their target, the episode ends and the agent stops optimising.
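A hedged sketch (hypothetical variable names) of how these conditions could translate into the `terminated` and `truncated` flags returned by `step`:

# Sketch: deciding whether to end the episode
import numpy as np

def episode_flags(current_beam, target_beam, thresholds, step_count, time_limit,
                  beam_on_screen, abort_if_off_screen):
    # Terminated: the underlying task is done (goal reached) or aborted (beam off screen)
    goal_reached = bool(np.all(np.abs(current_beam - target_beam) < thresholds))
    terminated = goal_reached or (abort_if_off_screen and not beam_on_screen)
    # Truncated: cut off for reasons outside the MDP, e.g. the time limit
    truncated = step_count >= time_limit
    return terminated, truncated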
Question
$\implies$ What does the existence of termination conditions say about the nature of the problem? Is it episodic or continuing?
What could go wrong?
Let's load some pre-trained models using different combinations of the `config` dictionary and different reward definitions.
Pre-trained Agent 1: "Gary Buchwald"
Relevant `config` parameters
"abort_if_off_screen": True
"reward_mode": "objective_improvement"
"target_sigma_x_threshold": None
"target_sigma_y_threshold": None
"time_reward": -1.0
"action_mode": "delta"
Reward = objective_improvement
Difference of the objective:
$$ r_\mathrm{obj-improvement} = ( \mathrm{obj}_{j-1} - \mathrm{obj}_{j} ) / \mathrm{obj}_0 $$
$$ \mathrm{obj} = \sum_{i}|b_i^\mathrm{(c)} - b_i^\mathrm{(t)}|$$
where $j$ is the index of the current time step.
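A minimal sketch of this reward (assuming the environment keeps track of the previous and initial objectives; not the exact `ARESEA` code):

# Sketch: objective_improvement reward
import numpy as np

def objective(current_beam, target_beam):
    return np.sum(np.abs(current_beam - target_beam))

def objective_improvement_reward(current_beam, target_beam, previous_objective, initial_objective):
    # Normalised decrease of the objective since the previous step
    obj = objective(current_beam, target_beam)
    return (previous_objective - obj) / initial_objective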
Question
$\implies$ What do you expect to happen, why?
agent_name = "Gary Buchwald" # names are randomly generated in training
loaded_model = PPO.load(f"utils/models/{agent_name}/model")
loaded_config = read_from_yaml(f"utils/models/{agent_name}/config")
env = make_env(loaded_config, record_video=False)
env = NotVecNormalize(env, f"utils/models/{agent_name}/normalizer")
terminated = False
truncated = False
observation, _ = env.reset()
while not (terminated or truncated):
action, _ = loaded_model.predict(observation)
observation, reward, terminated, truncated, info = env.step(action)
img = env.render()
ax.imshow(img)
display(fig)
clear_output(wait=True)
sleep(0.5)
Pre-trained Agent 2: "David Archibald"
Relevant `config` parameters
"abort_if_off_screen": False
"reward_mode": "sum_of_pixels"
"target_sigma_x_threshold": None
"target_sigma_y_threshold": None
"time_reward": 0.0
"action_mode": "delta"
Reward = sum_of_pixels (focusing-only)
$$r_\mathrm{sum-pixel} = - \sum_\text{all pixels} \text{pixel-value}$$
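A sketch of this reward, assuming `screen_image` is a 2D array of pixel intensities (not the exact `ARESEA` code):

# Sketch: sum_of_pixels reward
import numpy as np

def sum_of_pixels_reward(screen_image):
    # A smaller, more focused beam lights up fewer pixels,
    # so the (negative) total intensity increases as the beam is focused
    return -float(np.sum(screen_image))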
Question
$\implies$ What do you expect to happen, why?
agent_name = "David Archibald" # names are randomly generated in training
loaded_model = PPO.load(f"utils/models/{agent_name}/model")
loaded_config = read_from_yaml(f"utils/models/{agent_name}/config")
env = make_env(loaded_config, record_video=False)
env = NotVecNormalize(env, f"utils/models/{agent_name}/normalizer")
terminated = False
truncated = False
observation, info = env.reset()
while not (terminated or truncated):
action, _ = loaded_model.predict(observation)
observation, reward, terminated, truncated, info = env.step(action)
img = env.render()
ax.imshow(img)
display(fig)
clear_output(wait=True)
sleep(0.5)
Pre-trained Agent 3: "Bertha Sparkman"
Relevant `config` parameters
"abort_if_off_screen": False
"reward_mode": "objective_improvement"
"target_sigma_x_threshold": None
"target_sigma_y_threshold": None
"time_reward": 0.0
"action_mode": "direct"
Reward = objective_improvement
Difference of the objective:
$$ r_\mathrm{obj-improvement} = ( \mathrm{obj}_{j-1} - \mathrm{obj}_{j} ) / \mathrm{obj}_0 $$ $$ \mathrm{obj} = \sum_{i}|b_i^\mathrm{(c)} - b_i^\mathrm{(t)}|$$
where $j$ is the index of the current time step.
Question
$\implies$ What do you expect to happen?
$\implies$ What is the difference between Agent 1: "Gary Buchwald" and this agent?
agent_name = "Bertha Sparkman" # names are randomly generated in training
loaded_model = PPO.load(f"utils/models/{agent_name}/model")
loaded_config = read_from_yaml(f"utils/models/{agent_name}/config")
env = make_env(loaded_config, record_video=False)
env = NotVecNormalize(env, f"utils/models/{agent_name}/normalizer")
terminated = False
truncated = False
observation, info = env.reset()
while not (terminated or truncated):
action, _ = loaded_model.predict(observation)
observation, reward, terminated, truncated, info = env.step(action)
img = env.render()
ax.imshow(img)
display(fig)
clear_output(wait=True)
sleep(0.5)
Pre-trained Agent 4: "Betty Gordon"
Relevant `config` parameters
"abort_if_off_screen": False
"reward_mode": "objective_improvement"
"target_sigma_x_threshold": None
"target_sigma_y_threshold": None
"time_reward": 0.0
"action_mode": "delta"
Reward = objective_improvement
Difference of the objective:
$$ r_\mathrm{obj-improvement} = ( \mathrm{obj}_{j-1} - \mathrm{obj}_{j} ) / \mathrm{obj}_0 $$ $$ \mathrm{obj} = \sum_{i}|b_i^\mathrm{(c)} - b_i^\mathrm{(t)}|$$
where $j$ is the index of the current time step.
Question
$\implies$ What do you expect to happen?
$\implies$ What is the difference between Agent 1: "Gary Buchwald", Agent 3: "Bertha Sparkman", and this agent?
agent_name = "Betty Gordon" # names are randomly generated in training
loaded_model = PPO.load(f"utils/models/{agent_name}/model")
loaded_config = read_from_yaml(f"utils/models/{agent_name}/config")
env = make_env(loaded_config, record_video=False)
env = NotVecNormalize(env, f"utils/models/{agent_name}/normalizer")
terminated = False
truncated = False
observation, info = env.reset()
while not (terminated or truncated):
action, _ = loaded_model.predict(observation)
observation, reward, terminated, truncated, info = env.step(action)
img = env.render()
ax.imshow(img)
display(fig)
clear_output(wait=True)
sleep(0.5)
Pre-trained Agent 5: "Sean Kelley"
Relevant `config` parameters
"abort_if_off_screen": False
"reward_mode": "negative_objective"
"target_sigma_x_threshold": None
"target_sigma_y_threshold": None
"time_reward": 0.0
"action_mode": "delta"
Reward = negative_objective
$$ \mathrm{obj} = \sum_{i}|b_i^\mathrm{(c)} - b_i^\mathrm{(t)}|$$
$$ r_\mathrm{neg-obj} = -1 \cdot \mathrm{obj} / \mathrm{obj}_0 $$
where $b = [\mu_x,\sigma_x,\mu_y,\sigma_y]$, $b^\mathrm{(c)}$ is the current beam, and $b^\mathrm{(t)}$ is the target beam. $\mathrm{obj}_0$ is the initial objective after `reset`.
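A minimal sketch of this reward (hypothetical helper, not the exact `ARESEA` code):

# Sketch: negative_objective reward
import numpy as np

def negative_objective_reward(current_beam, target_beam, initial_objective):
    # Negative distance to the target beam, normalised by the objective at reset
    obj = np.sum(np.abs(current_beam - target_beam))
    return -obj / initial_objective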
Question
$\implies$ What do you expect to happen, why?
agent_name = "Sean Kelley" # names are randomly generated in training
loaded_model = PPO.load(f"utils/models/{agent_name}/model")
loaded_config = read_from_yaml(f"utils/models/{agent_name}/config")
env = make_env(loaded_config, record_video=False)
env = NotVecNormalize(env, f"utils/models/{agent_name}/normalizer")
terminated = False
truncated = False
observation, info = env.reset()
while not (terminated or truncated):
action, _ = loaded_model.predict(observation)
observation, reward, terminated, truncated, info = env.step(action)
img = env.render()
ax.imshow(img)
display(fig)
clear_output(wait=True)
sleep(0.5)
Part IV: Training an RL agent
What is inside an actor-critic agent like PPO?
- An `actor model`, often a neural network, takes the `observation` of the current `state` and predicts an `action` to be taken (forward pass)
  - In the ARES case, it observes the accelerator and predicts the magnet settings
- A `critic model`, also a neural network, takes the `observation` of the current `state` and predicts the value function of the state (i.e. it evaluates how good the action taken by the `actor model` is)
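As a schematic illustration (not SB3's actual implementation; SB3 builds and trains these networks for you), the actor and critic could look like two small PyTorch networks:

# Schematic actor and critic networks (illustration only)
import torch
import torch.nn as nn

obs_dim, act_dim = 13, 5  # illustrative dimensions

# Actor: observation -> action (here, the mean of a Gaussian policy)
actor = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, act_dim),
)

# Critic: observation -> scalar value estimate of the state
critic = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 1),
)

observation = torch.randn(obs_dim)
action_mean = actor(observation)   # forward pass of the actor
state_value = critic(observation)  # forward pass of the critic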
What actually happens when you train a PPO agent?
Step 1: collect samples
- `n_samples = n_steps * n_envs` is the total number of samples, or interactions with the environment, in one `epoch` (more on what that means later)
  - One sample is collected at each step
  - We can initialize `n_envs` parallel environments, in which the agent will take `n_steps`
  - The total number of samples then has to account for the samples gathered in all environments
At each step:
- The agent takes actions according to the current `actor model` prediction (forward pass of the model NN)
- The `critic model` predicts the value functions of the states during the episode (forward pass of the model NN)
The samples (actions, rewards, ...) from all environments are stored in a `buffer`, where `buffer_size = n_samples`
What actually happens when you train a PPO agent?
Step 2: update the models (weights of NNs)
After performing `n_steps` in a particular environment (and therefore gathering `n_steps` samples per environment), it is time to update the actor and critic models (backpropagation of the NNs). Let's consider only 1 environment now for simplicity.
- One can split the `n_samples` into mini-batches of a certain `batch_size`
  - This means that the model will be completely updated (i.e. has seen all the samples) after `n_samples` / `batch_size` backpropagations
- Once the model is updated, it can be trained again on the same samples for a certain number of `n_epochs` (number of iterations over the training set)
- This process can be repeated a certain number of `epochs` (yes...)
  - The total number of samples across the epochs is `total_timesteps`, where `total_timesteps = n_steps * n_envs * epochs = n_samples * epochs`
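To make this bookkeeping concrete, here is a small sketch with illustrative numbers (different from the example used in the questions below, so you can still work those out yourself):

# Sketch: counting samples and model updates (illustrative numbers only)
n_steps, n_envs = 64, 4
batch_size, n_epochs = 64, 4
epochs = 5  # number of rollout-collection / update cycles

n_samples = n_steps * n_envs                               # samples per rollout: 256
updates_per_rollout = n_epochs * n_samples // batch_size   # gradient updates per rollout: 16
total_timesteps = n_samples * epochs                       # total environment interactions: 1280
total_updates = updates_per_rollout * epochs               # total gradient updates: 80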
What actually happens when you train a PPO agent?
Question
$\implies$ What is the advantage of having a buffer?
What actually happens when you train a PPO agent?
Example
Let's consider the following training parameters:
- `n_steps` = 100
- `n_envs` = 2
- `batch_size` = 50
- `n_epochs` = 3
- `epochs` = 2
Question
$\implies$ What is `total_timesteps`?
$\implies$ What is the total number of batches `n_batch` in 1 epoch?
$\implies$ What is the total number of model updates?
Training time!
Now, set the config below and train your first reinforcement learning agent!
Apart from the reward definition, `time_reward`, etc. that we discussed before, here are some other configurations that you can change:
- `net_arch`: Architecture of the policy network (number of neurons in each layer)
- `gamma`: Discount factor of the RL problem. Set it lower to make immediate rewards more important than later rewards (usually above 0.9)
- `normalize_observation`: Normalize observations throughout training by fitting a running mean and standard deviation to them
- `normalize_reward`: Normalize rewards throughout training by fitting a running mean and standard deviation to them
# Feel free to change some of the configurations here.
config = {
"n_envs": 40,
"n_steps": 50,
"batch_size": 100,
"n_epochs": 10,
"total_timesteps": 200_000,
"abort_if_off_screen": False,
"action_mode": "delta",
"gamma": 0.99,
"frame_stack": None,
"net_arch": [64, 64],
"normalize_observation": True,
"normalize_reward": True,
"rescale_action": (-3, 3),
"reward_mode": "negative_objective",
"run_name": names.get_full_name(),
"target_sigma_x_threshold": None,
"target_sigma_y_threshold": None,
"threshold_hold": 5,
"time_limit": 25,
"time_reward": -0.0,
}
Questions
Looking at the `config` dictionary in the cell above:
$\implies$ How many epochs does it correspond to?
$\implies$ How many model updates (backpropagation) would you be doing in total?
You will train the agent by executing the cell below. Note: this could take about 10 minutes on a laptop.
# Toggle comment to re-run the training (can take very long)
%time train_ares_ea(config)
Training metrics
Let's look at the training metrics to see how the agent did.
To check the training history of a different agent, comment out the first line in the cell below and set `agent_under_investigation` to that agent's name instead.
agent_under_investigation = config["run_name"]
# agent_under_investigation = "Donna Brown"
# Training curves from this training
# Change config["run_name"] to "ml_workshop" to see curves from the example training.
plot_ares_ea_training_history(agent_under_investigation)
Check the videos
To look at videos of the agent during training:
- Find the first output line of the training cell. Your agent should have a name (e.g. Fred Rogers).
- Find the subdirectory `utils/recordings/`.
- There should be a directory named after your agent containing video files. The `ml_workshop` directory contains videos from an example training.
Agent evaluation
Run the following cell to evaluate your agent. The result is the mean deviation of the beam parameters from the target; lower is better. If you are training agents that include the dipoles, set the function's argument `include_position=True`.
plt.figure(figsize=(7, 4))
evaluate_ares_ea_agent(agent_under_investigation, include_position=False, n=200)
We can also test the trained agent on a simulation.
If you want to see an example agent instead of the one you just trained, set `agent_name = "ml_workshop"`.
# Run final agent
fig, ax = plt.subplots()
agent_name = agent_under_investigation
loaded_model = PPO.load(f"utils/models/{agent_name}/model")
loaded_config = read_from_yaml(f"utils/models/{agent_name}/config")
env = make_env(loaded_config, record_video=True)
env = NotVecNormalize(env, f"utils/models/{agent_name}/normalizer")
terminated = False
truncated = False
observation, _ = env.reset()
while not (terminated or truncated):
action, _ = loaded_model.predict(observation)
observation, reward, terminated, truncated, info = env.step(action)
img = env.render()
ax.imshow(img)
display(fig)
clear_output(wait=True)
sleep(0.3)
Running in the real world
Below you can see one of our final trained agents optimising position and focus of the beam on the real ARES accelerator.
Keep in mind that this agent has never seen the real accelerator before. All it has ever seen is a very simple linear beam dynamics simulation. Despite that, it performs well on the real accelerator, where all kinds of other effects come into the mix.
Note that this does not happen by itself and is the result of various careful decisions when designing the training setup.
Once trained, the agent is, however, trivial to use and requires no further tuning or knowledge of RL.
# Show polished donkey running (on real accelerator)
show_video("utils/real_world_episode_recording.mp4")
Further Resources
Getting started in RL
- OpenAI Spinning Up - Very understandable explanations of RL and the most popular algorithms, accompanied by easy-to-read Python implementations.
- Reinforcement Learning with Stable Baselines 3 - YouTube playlist giving a good introduction on RL using Stable Baselines3.
- Build a Doom AI Model with Python - Detailed 3h tutorial of applying RL using DOOM as an example.
- An introduction to Reinforcement Learning - Brief introduction to RL.
- An introduction to Policy Gradient methods - Deep Reinforcement Learning - Brief introduction to PPO.
Further Resources
Papers
- Learning to Do or Learning While Doing: Reinforcement Learning and Bayesian Optimisation for Online Continuous Tuning
- Cheetah: Bridging the Gap Between Machine Learning and Particle Accelerator Physics with High-Speed, Differentiable Simulations
- Learning-based optimisation of particle accelerators under partial observability without real-world training - Tuning of electron beam properties on a diagnostic screen using RL.
- Sample-efficient reinforcement learning for CERN accelerator control - Beam trajectory steering using RL with a focus on sample-efficient training.
- Autonomous control of a particle accelerator using deep reinforcement learning - Beam transport through a drift tube linac using RL.
- Basic reinforcement learning techniques to control the intensity of a seeded free-electron laser - RL-based laser alignment and drift recovery.
- Real-time artificial intelligence for accelerator control: A study at the Fermilab Booster - Regulation of a gradient magnet power supply using RL and real-time implementation of the trained agent using field-programmable gate arrays (FPGAs).
- Magnetic control of tokamak plasmas through deep reinforcement learning - Landmark paper on RL for controlling a real-world physical system (plasma in a tokamak fusion reactor).
Further Resources
Literature
- Reinforcement Learning: An Introduction - Standard textbook on RL.
Packages
- Gymnasium, (successor of OpenAI Gym) - De facto standard for implementing custom environments. Also provides a library of RL tasks widely used for benchmarking.
- Stable Baselines3 - Provides reliable, benchmarked and easy-to-use implementations of the most important RL algorithms.
- Ray RLlib - Part of the Ray Python package providing implementations of various RL algorithms with a focus on distributed training.