In this tutorial notebook we will implement all the basic components of a reinforcement learning algorithm to solve a problem in particle accelerators, with a particular focus on how the reward is defined.
ARES is an S-band radio frequency linac at the DESY Hamburg site, equipped with a photoinjector and two independently driven traveling wave accelerating structures. The main research focus is the generation and characterization of sub-femtosecond electron bunches at relativistic particle energies. The generation of short electron bunches is of high interest for radiation generation, e.g. in free-electron lasers.
We would like to focus and center the electron beam on a diagnostic screen using corrector and quadrupole magnets
We need to define:
$\implies$ Is the action space continuous or discrete?
$\implies$ Is the problem deterministic or stochastic?
In the ARES transverse tuning task we have 3 quadrupoles and 2 corrector magnets.
The actions are the strengths of the three quadrupoles and the deflection angles of the two correctors. In our control system we can set these derived values directly, taking the beam energy into account.
$\implies$ actions $= [k_{\mathrm{Q1}}, k_{\mathrm{Q2}}, \theta_\mathrm{CV}, k_{\mathrm{Q3}}, \theta_\mathrm{CH}]$ is a 5-dimensional array
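As a minimal sketch (not the actual ARESEA definition; the normalized limits here are an assumption), such an action space could be declared as a 5-dimensional Box:

import numpy as np
from gym import spaces

# Hypothetical action space: 5 normalized magnet settings in the order
# [k_Q1, k_Q2, theta_CV, k_Q3, theta_CH]
action_space = spaces.Box(low=-1.0, high=1.0, shape=(5,), dtype=np.float32)
sample_action = action_space.sample()  # a random, valid set of magnet settings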
The observation is the information an agent receives about the current state of the environment.
It should provide enough information so that the agent can solve the problem.
The observation does not necessarily cover the entire (internal) state of the environment.
$\implies$ What should be included in the observation?
$\implies$ What can be observed in simulation?
$\implies$ What cannot be observed in the real world?
$\implies$ How does this relate to the environment?
The screen is made from scintillating material and glows when hit by electrons
The camera films the screen
The state can be fully described with four components:
$\implies$ Do we know or can we observe the state of the environment?
The observation for this task contains three parts:
$\implies$ Does this state definition fulfil the Markov property? (Does the probability distribution of the next beam depend only on the present state, or is it affected by information about the past?)
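As an illustration only (the actual contents are defined in ARESEA; the three parts below are an assumption for the sake of the example), an observation made of the current beam on the screen, the current magnet settings, and the target beam could be declared like this:

import numpy as np
from gym import spaces

# Hypothetical observation space; limits are placeholders, not the ARESEA values
observation_space = spaces.Dict({
    "beam": spaces.Box(low=-np.inf, high=np.inf, shape=(4,)),     # [mu_x, sigma_x, mu_y, sigma_y] on the screen
    "magnets": spaces.Box(low=-np.inf, high=np.inf, shape=(5,)),  # current [k_Q1, k_Q2, theta_CV, k_Q3, theta_CH]
    "target": spaces.Box(low=-np.inf, high=np.inf, shape=(4,)),   # desired [mu_x, sigma_x, mu_y, sigma_y]
})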
Our goal is divided into two tasks:
$\implies$ How should we define our reward function? Give it a go!
$\implies$ We have a whole section dedicated to reward formulation later on
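If you want a starting point, here is one possible (and deliberately simple) choice, assuming the current and target beam parameters are available as 4-element arrays [mu_x, sigma_x, mu_y, sigma_y]:

import numpy as np

def my_reward(current_beam, target_beam):
    # Toy reward: negative mean absolute error between current and target beam
    # parameters, so that getting closer to the target yields a higher reward
    return -float(np.mean(np.abs(current_beam - target_beam)))

# A beam 0.5 mm off target in mu_x scores worse than a perfect match (reward 0.0)
print(my_reward(np.array([5e-4, 2e-4, 0.0, 2e-4]), np.array([0.0, 2e-4, 0.0, 2e-4])))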
Image from "RL Tips and Tricks" by A. Raffin
$\implies$ What would you choose and why?
There are many libraries that provide ready-to-use implementations of RL algorithms, as well as frameworks for implementing environments to interact with. In this notebook we use:
More info here
Note: OpenAI Gym is slowly being succeeded by Gymnasium, a fork of Gym maintained by the Farama Foundation.
We take all the elements of the RL problem we defined previously and represent the tuning task as an OpenAI Gym environment; Gym is a standard library for RL tasks.
A custom gym.Env would contain the following parts:

- observation_space and action_space
- reset method: resets the environment for a new episode and returns the 2-tuple (observation, info)
- step method: main logic of the environment. It takes an action, changes the environment to a new state, gets the new observation, computes the reward, and finally returns the 4-tuple (observation, reward, done, info). Here done flags whether the current episode should be terminated (goal reached, or some threshold exceeded).
- render method: visualizes the environment (a video, or just some plots)
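As a rough, self-contained sketch (not the actual ARESEA implementation; the dynamics below are made up, and exact reset/step signatures differ slightly between Gym versions), such an environment could look like this:

import gym
import numpy as np
from gym import spaces

class ToyTuningEnv(gym.Env):
    """Illustrative skeleton only -- not the real ARESEA environment."""

    def __init__(self):
        # 5 normalized magnet settings in, 4 beam parameters out
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(5,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32)
        self._beam = np.zeros(4, dtype=np.float32)

    def reset(self):
        # Start a new episode with a random "beam" and return (observation, info)
        self._beam = np.random.uniform(-1.0, 1.0, size=4).astype(np.float32)
        return self._beam.copy(), {}

    def step(self, action):
        # Toy dynamics: the action nudges the beam parameters;
        # the reward is the negative distance of the beam from the "target" at zero
        self._beam += 0.1 * np.asarray(action, dtype=np.float32)[:4]
        observation = self._beam.copy()
        reward = -float(np.abs(self._beam).sum())
        done = bool(np.abs(self._beam).max() < 0.05)  # terminate once "on target"
        return observation, reward, done, {}

    def render(self, mode="human"):
        print(f"beam = {self._beam}")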
We list the most relevant parts of the project structure below:

utils/train.py contains the gym environments and the training script:

- ARESEA implements the ARES Experimental Area transverse tuning task as a gym.Env. It contains the basic logic, such as the definition of the observation space, action space, and reward. How an action is actually applied is implemented in child classes with specific backends.
- ARESEACheetah is derived from the base class ARESEA and uses the cheetah simulation as its backend.
- make_env initializes an ARESEA environment and wraps it with the required gym.wrappers for convenient features (e.g. monitoring the progress, ending the episode when the time limit is reached, rescaling the action, normalizing the observation, ...).
- train is a convenience function for training the RL agent. It calls make_env, sets up the RL algorithm, starts training, and saves the results in utils/recordings, utils/monitors, and utils/models.
utils/helpers.py contains some utility functions:

- evaluate_ares_ea_agent takes a trained agent and evaluates its performance using different metrics.
- plot_ares_ea_training_history shows the progress during training.

from time import sleep
import matplotlib.pyplot as plt
import names
import numpy as np
from gym.wrappers import RescaleAction
from IPython.display import clear_output, display
from stable_baselines3 import PPO
from utils.helpers import (
evaluate_ares_ea_agent,
make_ares_ea_training_videos,
plot_ares_ea_training_history,
show_video,
)
from utils.train import ARESEACheetah, make_env, read_from_yaml
from utils.train import train as train_ares_ea
from utils.utils import NotVecNormalize
Some methods:

- reset: in both the real and the simulation case, resets the magnets to their initial values. In simulation, it also regenerates the incoming beam and (optionally) resets the magnet misalignments.
- step: sets the magnets to the new settings and observes the beam (runs a simulation, or observes the screen image in the real world).

Now let's create the environment:
# Create the environment
env = ARESEACheetah()
env.target_beam_mode = "constant"
$\implies$ Let's define the position $(\mu_x, \mu_y)$ and size $(\sigma_x, \sigma_y)$ of the beam on the screen
$\implies$ Modify the target_beam list below, where the order of the arguments is $[\mu_x,\sigma_x,\mu_y,\sigma_y]$
$\implies$ Take into account the dimensions of the screen ($\pm$ 2e-3 m)
$\implies$ The target beam will be represented by a blue circle on the screen
target_beam = np.array([1e-3, 2e-4, 1e-3, 2e-4]) # Change it
env.target_beam_values = target_beam
env.reset()
plt.figure(figsize = (7, 4))
plt.imshow(env.render(mode="rgb_array")) # Plot the screen image
$\implies$ Change the magnet values, i.e. the actions
$\implies$ The actions are normalized, so valid values lie in the [-1, 1] interval
$\implies$ The values of the action list in the cell below follow this magnet order: [Q1, Q2, CV, Q3, CH]
action = np.array([1, 0.5, 0.5, 1, 0.6]) # put your action here
Perform one step: update the env, observe new beam!
env = RescaleAction(env, -1, 1) # rescales the action to the interval [-1, 1]
env.reset()
env.step(action)
plt.figure(figsize = (7, 4))
plt.imshow(env.render(mode="rgb_array"))
$\implies$ Observe the plot above: what beam does that magnet configuration yield? Can you center and focus the beam by hand?
env.reset()
steps = 10
def change_vertical_corrector(q1, q2, cv, q3, ch, steps, i):
action = np.array([q1, q2, cv + 1 / steps * i, q3, ch])
return action
fig, ax = plt.subplots(1, figsize = (7, 4))
for i in range(steps):
action = change_vertical_corrector(0.2, -0.2, -0.5, 0.3, 0, steps, i)
env.step(action)
img = env.render(mode="rgb_array")
ax.imshow(img)
display(fig)
clear_output(wait=True)
sleep(0.5)
Training a good agent revolves primarily around finding the right setup for the environment and the correct reward function. In order to iterate over and compare many different options, our training function takes a dictionary called config. The dictionary keys, or "configurations", are explained below.

In the following, we use a config dictionary to set up the training. This allows us to easily switch between different training conditions. Below we show some selected configurations that have the most influence on the training results; the parameters can mostly be divided into two parts.
- action_mode: whether to set the magnet strengths directly or to apply a delta action. You may set this to "direct" or "delta". You should find that "delta" trains faster. Setting "delta" is also crucial for running the agent on the real accelerator.
- reward_mode: how the reward is calculated. Can be set to negative_objective, objective_improvement, or sum_of_pixels.
- time_reward: whether the agent is penalized for taking another step; this is intended to make the tuning faster.
- rescale_action: takes the limits of the magnet settings and scales them into the specified range.

Termination conditions:

- abort_if_off_screen: if set to True, episodes are aborted when the beam is no longer on the screen.
- time_limit: number of interactions the agent gets to tune the magnets within one episode.
- target_sigma_x_threshold, target_sigma_y_threshold: thresholds for the beam parameters. If all beam parameters are within the threshold of their target, the episode ends and the agent stops optimising.

$\implies$ What does the existence of termination conditions say about the nature of the problem? Is it episodic or continuous?
Let's load some pre-trained models that use different combinations of the config dictionary and different reward definitions.
config parameters:

- "abort_if_off_screen": True
- "reward_mode": "objective_improvement"
- "target_sigma_x_threshold": None
- "target_sigma_y_threshold": None
- "time_reward": -1.0
- "action_mode": "delta"
Reward mode: objective_improvement, where $j$ is the index of the current time step.
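The exact formula is implemented in ARESEA; as a rough sketch of the idea (the mean-absolute-error objective and the normalization by $\mathrm{obj}_0$ are assumptions), an objective_improvement reward could look like this:

import numpy as np

def objective(beam, target):
    # Toy objective: mean absolute error between the current and the target beam
    # parameters [mu_x, sigma_x, mu_y, sigma_y] (the real definition may differ)
    return float(np.mean(np.abs(np.asarray(beam) - np.asarray(target))))

def objective_improvement_reward(obj_prev, obj_curr, obj_0):
    # Reward the improvement of the objective from step j-1 to step j,
    # normalized by the initial objective obj_0
    return (obj_prev - obj_curr) / obj_0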
$\implies$ What do you expect to happen, why?
agent_name = "Gary Buchwald" # names are randomly generated in training
loaded_model = PPO.load(f"utils/models/{agent_name}/model")
loaded_config = read_from_yaml(f"utils/models/{agent_name}/config")
env = make_env(loaded_config, record_video=False)
env = NotVecNormalize(env, f"utils/models/{agent_name}/normalizer")
done = False
observation = env.reset()
while not done:
action, _ = loaded_model.predict(observation)
observation, reward, done, info = env.step(action)
img = env.render(mode="rgb_array")
ax.imshow(img)
display(fig)
clear_output(wait=True)
sleep(0.5)
config parameters:

- "abort_if_off_screen": False
- "reward_mode": "sum_of_pixels"
- "target_sigma_x_threshold": None
- "target_sigma_y_threshold": None
- "time_reward": 0.0
- "action_mode": "delta"

Reward mode: sum_of_pixels (focusing only)

$\implies$ What do you expect to happen, why?
agent_name = "David Archibald" # names are randomly generated in training
loaded_model = PPO.load(f"utils/models/{agent_name}/model")
loaded_config = read_from_yaml(f"utils/models/{agent_name}/config")
env = make_env(loaded_config, record_video=False)
env = NotVecNormalize(env, f"utils/models/{agent_name}/normalizer")
done = False
observation = env.reset()
while not done:
action, _ = loaded_model.predict(observation)
observation, reward, done, info = env.step(action)
img = env.render(mode="rgb_array")
ax.imshow(img)
display(fig)
clear_output(wait=True)
sleep(0.5)
config parameters:

- "abort_if_off_screen": False
- "reward_mode": "objective_improvement"
- "target_sigma_x_threshold": None
- "target_sigma_y_threshold": None
- "time_reward": 0.0
- "action_mode": "direct"

Reward mode: objective_improvement, where $j$ is the index of the current time step.
$\implies$ What do you expect to happen?
$\implies$ What is the difference between Agent 1: "Gary Buchwald" and this agent?
agent_name = "Bertha Sparkman" # names are randomly generated in training
loaded_model = PPO.load(f"utils/models/{agent_name}/model")
loaded_config = read_from_yaml(f"utils/models/{agent_name}/config")
env = make_env(loaded_config, record_video=False)
env = NotVecNormalize(env, f"utils/models/{agent_name}/normalizer")
done = False
observation = env.reset()
while not done:
action, _ = loaded_model.predict(observation)
observation, reward, done, info = env.step(action)
img = env.render(mode="rgb_array")
ax.imshow(img)
display(fig)
clear_output(wait=True)
sleep(0.5)
config parameters:

- "abort_if_off_screen": False
- "reward_mode": "objective_improvement"
- "target_sigma_x_threshold": None
- "target_sigma_y_threshold": None
- "time_reward": 0.0
- "action_mode": "delta"

Reward mode: objective_improvement, where $j$ is the index of the current time step.
$\implies$ What do you expect to happen?
$\implies$ What is the difference between Agent 1: "Gary Buchwald", Agent 3: "Bertha Sparkman", and this agent?
agent_name = "Betty Gordon" # names are randomly generated in training
loaded_model = PPO.load(f"utils/models/{agent_name}/model")
loaded_config = read_from_yaml(f"utils/models/{agent_name}/config")
env = make_env(loaded_config, record_video=False)
env = NotVecNormalize(env, f"utils/models/{agent_name}/normalizer")
done = False
observation = env.reset()
while not done:
action, _ = loaded_model.predict(observation)
observation, reward, done, info = env.step(action)
img = env.render(mode="rgb_array")
ax.imshow(img)
display(fig)
clear_output(wait=True)
sleep(0.5)
config parameters:

- "abort_if_off_screen": False
- "reward_mode": "negative_objective"
- "target_sigma_x_threshold": None
- "target_sigma_y_threshold": None
- "time_reward": 0.0
- "action_mode": "delta"

Reward mode: negative_objective, where $b = [\mu_x,\sigma_x,\mu_y,\sigma_y]$, $b^\mathrm{(c)}$ is the current beam, $b^\mathrm{(t)}$ is the target beam, and $\mathrm{obj}_0$ is the initial objective after reset.
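Again as a rough sketch only (reusing the toy objective from the earlier sketch; the normalization by $\mathrm{obj}_0$ is an assumption), a negative_objective reward could look like this:

def negative_objective_reward(obj_curr, obj_0):
    # Penalize the current objective itself at every step, normalized by the
    # initial objective obj_0 (illustrative; the real definition may differ)
    return -obj_curr / obj_0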
$\implies$ What do you expect to happen, why?
agent_name = "Sean Kelley" # names are randomly generated in training
loaded_model = PPO.load(f"utils/models/{agent_name}/model")
loaded_config = read_from_yaml(f"utils/models/{agent_name}/config")
env = make_env(loaded_config, record_video=False)
env = NotVecNormalize(env, f"utils/models/{agent_name}/normalizer")
done = False
observation = env.reset()
while not done:
action, _ = loaded_model.predict(observation)
observation, reward, done, info = env.step(action)
img = env.render(mode="rgb_array")
ax.imshow(img)
display(fig)
clear_output(wait=True)
sleep(0.5)
The actor model, often a neural network, takes the observation of the current state and predicts an action to be taken (forward pass).
The critic model, also a neural network, takes the observation of the current state and predicts the value function of that state (it evaluates how good the action taken by the actor model is).

n_samples = n_steps * n_envs is the total number of samples, or interactions with the environment, in one epoch (more on what that means later).

The agent interacts with n_envs parallel environments, in which it takes n_steps each. At each step:

- the actor model predicts an action (forward pass of the model NN)
- the critic model predicts the value functions of the states visited during the episode (forward pass of the model NN)

The samples (actions, rewards, ...) from all environments are stored in a buffer, where buffer_size = n_samples.

After performing n_steps in a particular environment (and therefore gathering n_steps samples per environment), it is time to update the actor and critic models (backpropagation of the NNs). Let's consider only 1 environment now for simplicity:

- The n_samples are processed in mini-batches of a certain batch_size, which gives n_samples_tot / batch_size backpropagations.
- This pass over the collected samples is repeated n_epochs times (number of iterations on the training set).
- The whole collect-and-update cycle is repeated for a number of epochs (yes, a different parameter from n_epochs...), and the training runs for total_timesteps, where total_timesteps = n_steps * n_envs * epochs = n_samples * epochs.

$\implies$ What is the advantage of having a buffer?
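To make these relationships concrete, here is a small bookkeeping sketch with made-up numbers (they are not the quiz values below, and the formulas assume total_timesteps is a whole multiple of n_steps * n_envs):

n_steps = 10      # steps collected per environment before each update
n_envs = 4        # parallel environments
batch_size = 20   # mini-batch size used during the update
n_epochs = 2      # passes over the collected buffer per update
epochs = 5        # number of collect-and-update cycles

n_samples = n_steps * n_envs                   # samples collected per cycle (buffer size)
total_timesteps = n_samples * epochs           # total environment interactions
n_batches_per_pass = n_samples // batch_size   # mini-batches per pass over the buffer
model_updates = epochs * n_epochs * n_batches_per_pass  # total backpropagations

print(total_timesteps, n_batches_per_pass, model_updates)  # 200 2 20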
Let's consider the following training parameters:
- n_steps = 100
- n_envs = 2
- batch_size = 50
- n_epochs = 3
- epochs = 2

$\implies$ What is total_timesteps?
$\implies$ What is the total number of batches n_batch in 1 epoch?
$\implies$ What is the total number of model updates?
Now, set the config below and train your first reinforcement learning agent!
Apart from the reward definition, time_reward, etc. that we discussed before, here are some other configurations that you can change:

- net_arch: architecture of the policy network (number of neurons in each layer)
- gamma: discount factor of the RL problem. Set it lower to make immediate rewards more important than later rewards (usually above 0.9).
- normalize_observation: normalize the observations throughout training by fitting a running mean and standard deviation to them
- normalize_reward: normalize the rewards throughout training by fitting a running mean and standard deviation to them

# Feel free to change some of the configurations here.
config = {
"n_envs": 40,
"n_steps": 50,
"batch_size": 100,
"n_epochs": 10,
"total_timesteps": 200_000,
"abort_if_off_screen": False,
"action_mode": "delta",
"gamma": 0.99,
"frame_stack": None,
"net_arch": [64, 64],
"normalize_observation": True,
"normalize_reward": True,
"rescale_action": (-3, 3),
"reward_mode": "negative_objective",
"run_name": names.get_full_name(),
"target_sigma_x_threshold": None,
"target_sigma_y_threshold": None,
"threshold_hold": 5,
"time_limit": 25,
"time_reward": -0.0,
}
Looking at the config dictionary in the cell above:
$\implies$ How many epochs does it correspond to?
$\implies$ How many model updates (backpropagation) would you be doing in total?
You will train the agent by executing the cell below:
# Toggle comment to re-run the training (can take very long)
%time train_ares_ea(config)
==> Training agent "Marie Willis"
Eval num_timesteps=20000, episode_reward=-14.00 +/- 1.08
Episode length: 25.00 +/- 0.00
New best mean reward!
Eval num_timesteps=40000, episode_reward=-6.88 +/- 0.79
Episode length: 25.00 +/- 0.00
New best mean reward!
Eval num_timesteps=60000, episode_reward=-6.28 +/- 1.96
Episode length: 25.00 +/- 0.00
New best mean reward!
Eval num_timesteps=80000, episode_reward=-4.76 +/- 0.96
Episode length: 25.00 +/- 0.00
New best mean reward!
Eval num_timesteps=100000, episode_reward=-4.57 +/- 0.50
Episode length: 25.00 +/- 0.00
New best mean reward!
Eval num_timesteps=120000, episode_reward=-5.31 +/- 1.91
Episode length: 25.00 +/- 0.00
Eval num_timesteps=140000, episode_reward=-5.31 +/- 1.02
Episode length: 25.00 +/- 0.00
Eval num_timesteps=160000, episode_reward=-4.97 +/- 0.40
Episode length: 25.00 +/- 0.00
Eval num_timesteps=180000, episode_reward=-4.77 +/- 0.52
Episode length: 25.00 +/- 0.00
Eval num_timesteps=200000, episode_reward=-5.42 +/- 1.10
Episode length: 25.00 +/- 0.00
CPU times: user 8min 22s, sys: 1.35 s, total: 8min 23s
Wall time: 8min 24s
Let's look at the training metrics to see how the agent did.
To check the training history of a different agent instead, comment out the first line in the cell below and set agent_under_investigation to that agent's name.
agent_under_investigation = config["run_name"]
# agent_under_investigation = "Donna Brown"
# Training curves from this training
# Change `config["run_name"]` to `"ml_workshop"` to see curves from the example training.
plot_ares_ea_training_history(agent_under_investigation)
To look at videos of the agent during training, have a look in utils/recordings/. The ml_workshop directory contains videos from an example training.

If you are training agents that include the dipoles, set the function's argument include_position=True.
plt.figure(figsize = (7,4))
evaluate_ares_ea_agent(agent_under_investigation, include_position=False, n=2000)
Evaluation results (2000 evaluations)
----------------------------------------
==> Mean MAE = 5.207024056733189e-05
==> RMSE = 7.323487962575908e-05
==> Mean no. of steps = 25.0
We can also test the trained agent on a simulation.
If you want to see an example agent instead of the one you just trained, set agent_name="ml_workshop".
# Run final agent
agent_name = agent_under_investigation
loaded_model = PPO.load(f"utils/models/{agent_name}/model")
loaded_config = read_from_yaml(f"utils/models/{agent_name}/config")
env = make_env(loaded_config, record_video=True)
env = NotVecNormalize(env, f"utils/models/{agent_name}/normalizer")
done = False
observation = env.reset()
while not done:
action, _ = loaded_model.predict(observation)
observation, reward, done, info = env.step(action)
img = env.render(mode="rgb_array")
ax.imshow(img)
display(fig)
clear_output(wait=True)
sleep(0.5)
Below you can see one of our final trained agents optimising position and focus of the beam on the real ARES accelerator.
Keep in mind that this agent has never seen the real accelerator before. All it has ever seen is a very simple linear beam dynamics simulation. Despite that, it performs well on the real accelerator, where all kinds of other effects come into the mix.
Note that this does not happen by itself; it is the result of various careful decisions made when designing the training setup.
Once trained, however, the agent is trivial to use and requires no further tuning or knowledge of RL.
# Show polished donkey running (on real accelerator)
show_video("utils/real_world_episode_recording.mp4")