Review and experimental benchmarking of machine learning algorithms for efficient optimization of cold atom experiments

The generation of cold atom clouds is a complex process which involves the optimization of noisy data in high dimensional parameter spaces. Optimization can be challenging both in and especially outside of the lab due to lack of time, expertise, or access for lengthy manual optimization. In recent years, it was demonstrated that machine learning offers a solution since it can optimize high dimensional problems quickly, without knowledge of the experiment itself. In this paper we present results showing the benchmarking of nine different optimization techniques and implementations, alongside their ability to optimize a Rubidium (Rb) cold atom experiment. The investigations are performed on a 3D $^{87}$Rb molasses with 10 and 18 adjustable parameters, respectively, where the atom number obtained by absorption imaging was chosen as the test problem. We further compare the best performing optimizers under different effective noise conditions by reducing the Signal-to-Noise ratio of the images via adapting the atomic vapor pressure in the 2D+ MOT and the detection laser frequency stability.


Introduction
Cold atom systems are an essential part of the quantum revolution, facilitating new generations of quantum technologies.Although the theory behind laser and evaporative cooling techniques is well understood [1,2,3,4], in practice each apparatus is unique and optimal parameters must be found experimentally, typically in a lengthy manual optimization routine involving a large number of inter-dependent parameters.Such a labour intensive approach requires a large amount of technical expertise and patience, and will often only find a local efficiency maximum due to the size and complexity of the ‡ O.A. and V.A.H. are joint first authors.
parameter space.Thus, by instead optimizing via machine learning, one can improve the productivity of cold atom experiments and also take steps towards autonomous operation or control by non-specialist operators.This will be particularly essential for mobile quantum technologies such as gravimeters [5,6,7,8,9] and lattice clocks [10,11,12,13,14] where the environmental conditions vary, or space missions on manned and unmanned spacecraft [15,16,17,18,19,20,21,22].However, the optimization of problems involving noisy data and large parameter sets is computationally heavy and requires the use of an appropriate optimization algorithm.
The topic of optimization has been of significant interest in the community, with a variety of approaches being taken.Earlier examples utilised heuristic optimization algorithms, such as differential evolution algorithms, to maximize atom number in a magnetic trap loaded from a magneto-optical trap (MOT) [23,24] and demonstrated a higher atom number than that achieved by manual optimization.Here up to 21 parameters were optimized in 5 h 45 min or 5000 evaluations.Later, neural networks were used to optimize optical depth in a MOT by optimizing 63 parameters [25], and to optimize an evaporation ramp for 87 Rb Bose-Einstein Condensate (BEC) using optical depth (OD) as indicator for the phase space density [26].Reinforcement learning is used in [27] to optimize atom number and temperature simultaneously, as well as preparing a defined number of atoms in the MOT.The optimisation technique is performed on both a simulation and a real world experiment, using laser detuning as a single optimization parameter.Currently the most popular machine learning (ML) technique within the field is Bayesian optimization (BO) using Gaussian process regression [28,29,30,31,32].Such routines were, for example, used to optimize the evaporation ramp for a crossed optical dipole trap (cODT) via a 16 parameter ramp.In this case, the minimization of the number of atoms in the wings of the cloud was used as an optimization goal [28].They were also able to identify which parameters are important to optimize and which could be excluded from the optimization as unimportant.Another work benchmarks a different Bayesian optimizer against a random search, and shows the relative importance of different stages of an evaporation ramp for atoms in a cODT [29]; here 8 parameters are used to optimize the number of atoms in a circular area of interest, with 300 trials requiring 3 h, and repeating the optimization 16 times to reduce the effects of noise.Further benchmarking work has occurred in [30], comparing BO to neural networks and differential evolution by optimizing 37 parameters for cooling and subsequent evaporation to Bose-Einstein condensate (BEC) in a time-averaged orbiting potential trap; this showed the significant advantage of BO over neural networks and differential evolution.
While these results are promising in terms of applying BO for the preparation of ultra-cold atom systems, more systematic comparisons between different methods are required to facilitate the choice of the most suitable algorithms.BO as well as other ML methods can have a significant computational overhead for calculating the next parameter sample due to the required training and evaluation of the underlying ML model.On one hand, this calls for an efficient implementation of ML methods.But on the other hand, a quantification of the overhead versus the runtime of the coldatom experiment is required to decide between ML methods and less computationally expensive heuristic methods.
Another major challenge that influences the choice of the best optimization method is the inherently noisy data that is produced in the changing environments, especially for mobile experiments.Atom numbers and temperatures are typical optimization objectives.They are calculated from the fitting of the atom cloud absorption profiles.This fitting can be very sensitive to fluctuations, especially at low atom numbers.This can degrade the performance of algorithms to varying degrees, depending on the underlying strategy.This difficulty can be tackled in a range of ways.For example, one can take the average of several optimization cycles [29].An alternative approach is to adjust the algorithm, for example by resampling old population members in a differential evolution algorithm [24], or by using a cost function which is less sensitive to noise [30,29,28].Many of the alternative cost functions have the additional benefit of requiring less computation time by using parameters such as optical depth as an indicator for atom number or phase space density.
In this paper, we study and benchmark how noise and dimensionality influences the performance of a large set of heuristic and ML optimization methods.We consider the heuristic methods particle swarm optimization (PSO) [33,34] and its adaptation LILDE [24], differential evolution (DE), covariance matrix adaptation evolution strategy (CMA-ES) [35] and downhill simplex, also known as Nelder-Mead search [36].As a baseline uninformed method, we include random sampling (RS) from a uniform distribution.In addition, machine learning-based BO in various implementations [37,38] is considered, with particular emphasis on algorithms that can handle noise in an explicit manner.Moreover, an extension of BO is studied, which is explicitly designed to cope with noisy data without the need for repeated optimization or resampling [39].Although the extended BO method is more elaborate, we developed an efficient implementation that facilitates the optimization of a large number of parameters using a fitted atom number as a cost function with a small computational overhead [40].

Optimization algorithms for tuning experimental parameters
There is a large number of algorithms that are suitable for minimizing different kinds of objective functions.The laboratory experiments considered in this work can be regarded as a function f : p ∈ X ⊂ R d → R that maps a d-dimensional vector of experimental control parameters p to the objective value f (p).The training data fed to the minimization methods f (p) = f (p) + ǫ is corrupted by a random noise ǫ that is assumed to have zero mean.In many cases, its probability distribution is well modelled by a normal distribution with a constant variance η 2 noise , i.e. ǫ ∼ N (0, η 2 noise ).The goal is to find the parameter vector p min that minimizes E[ f (p)] = f (p) in the search space X .The function f (p) is in general non-convex and derivative information is unattainable, rendering gradient-based methods inappropriate.Since d has a moderately large value (here d = 10 or d = 18) and the objective is expensive to evaluate, an exhaustive search in the full space X is also infeasible.In such a case, one often resorts to heuristic minimization strategies.These do not guarantee convergence to a global minimum, however they can explore larger parameter spaces and are to a certain degree robust towards noise.

Heuristic minimization methods
All considered heuristic strategies start from a random population of N vectors which is updated in each iteration.
Particle swarm optimization (PSO) is a global minimization strategy that draws its inspiration from the field of sociobiology and the intelligence of animal groups (e.g. a flock of birds) [41].The swarm members move with individual velocities through the parameter space.These velocities are updated by a randomized weighted sum of the current velocities, velocities directed to the individual best seen parameters p best,i , i = 1, . . ., N, and the best seen parameters of the whole swarm p best,swarm .
Differential evolution (DE) is an evolutionary optimization algorithm inspired by the mutation, crossover and selection processes occurring in nature [42].In the mutation step, for each member p i of the population, a mutated genome is created as where F is the differential weight and a, b, c are distinct randomly selected population members.Random entries of p i , selected according to a crossover probability, are replaced with the mutated genome p mut , forming a new candidate.The candidate replaces p i in the next population if its objective value is lower.
The covariance matrix adaptation evolution strategy (CMA-ES) is another evolutionary algorithm [43].It draws its N population members from a multivariate normal distribution N (µ, Σ) with mean vector µ ∈ R d and covariance matrix Σ ∈ R d×d .In contrast to DE, it does not replace members on an individual basis, but uses a weighted subset of M < N members with the lowest objective values to update the mean µ and covariance matrix Σ for the next iteration.
Downhill simplex (simplex) or Nelder-Mead search also updates a population of vectors in each iteration [36].However, this population is a simplex of only N = d + 1 vectors (i.e. a triangle in R 2 , a tetrahedron in R 3 etc.) which is much smaller than typical populations of DE and CMA-ES.New candidates are created through processes called reflection, expansion and contraction of the worst-performing vector of the simplex.If those three candidates do not lead to a specific improvement, the simplex shrinks towards its best performing vector p best,simplex .Due to the small population, simplex is a more local minimization method that typically converges faster than other heuristic methods as in the worst case d + 2 evaluations, or in the best case only one evaluation of the objective is required per iteration.
None of the above algorithms handle noise in an explicit manner.However, the population updates are often based on many vectors and corresponding objective evaluations (e.g.N individual comparisons for DE, averaging over M vectors for CMA-ES) such that the impact of noise is diminished.A different approach is taken in the LILDE extension of DE.This algorithm re-evaluates members that have survived a specific number of generations [24], further reducing the impact of noise by removing population members whose fitness value is only good due to noise.We speculate that PSO and simplex can be most severely misled by noise.For these methods a single vector and corresponding noisy evaluation can act as an attractor of the population (p best,swarm for PSO and p best,simplex for simplex).As long as no lower value is observed, these vectors are not updated.

Bayesian optimization
BO is a sequential method that uses previous observations D = { f (p 1 ), . . ., f (p k )} to train a stochastic machine learning model which is used to determine a new parameter vector p k+1 to evaluate [44].The stochastic model, a Gaussian process, can be regarded as a posterior distribution over random functions given the observations [45].For each parameter vector p * it predicts a normal distribution of possible function values Here, µ 0 and σ 2 0 are hyperparameters for the mean and variance of the objective function.The covariance matrix [Σ] ij = σ 2 0 κ(p i , p j )+η 2 δ ij depends on a positive definite covariance function κ(•, •) and on an additional hyperparameter for the noise variance η 2 .The best choice of the covariance function depends on the assumed differentiability of the objective f (p) [45].In this work, the considered BO optimizers use the Matérn-5/2 covariance function, The length scales l 1 , . . ., l d , at which the covariance decreases, are another set of hyperparameters.The value of all hyperparameters is determined by their maximum likelihood estimate [45].
The next sampling point p k+1 is chosen by some infill criterion.A common choice is to maximise the expected improvement of the predicted objective value distribution with respect to the lowest seen objective value y min .
In the noiseless case (η 2 = 0), the predicted variance σ 2 (p * ), and thus EI(p * ), tends to zero if the number of neighbouring observations increases.Therefore, a local minimum will always be eventually escaped, since EI(p * ) will be larger in parameter regions with less data and more predicted variance.In fact, it has been shown that the expected improvement strategy converges in the noiseless case at a near optimal rate to the global minimum if the objective belongs to the reproducing kernel Hilbert space with kernel σ 2 0 κ(p 1 , p 2 ) [46].

Bayesian optimization of noisy functions
In the noisy case, σ 2 (p * ) and thus EI(p * ) do not tend to zero, even if p * is an observed point.This can lead to an oversampling of local minima.Moreover, the best observed value y min might be corrupted by noise.Several extensions of BO have been proposed for noisy settings [47,39].In this work, we focus on the noisy expected improvement strategy (NEI) [39].For this, F sets of random function values F i = {y i,1 , . . ., y i,k }, i = 1, . . ., F are drawn from the noisy Gaussian process posterior at the observed points p 1 , . . ., p k .The fantasies F i are thus possible values of f (p) that are compatible with the noisy observations D. Each F i is used to train the noiseless Gaussian process and to determine a noiseless expected improvement EI i (p * ) with respect to the corresponding minimum y min,i = min(F i ).The noisy expected improvement is then defined as the average over the fantasies

Different implementations of Bayesian optimization
BO is a relatively complex method for which implementation details can have a large influence on the convergence and computational overhead.Therefore, in this work three different implementations of BO are considered, the open-source package FMFN [38], the open-source AX framework [37] provided by Meta Platforms Inc., and the BO optimizer (JCM) included in the commercial software JCMsuite, developed by some of the authors.The three methods follow, for example, different strategies to maximise acquisition functions (either EI(p * ) or NEI(p * )).The open-source package FMFN [38] uses the best result from 10 000 random samples of the acquisition function and 10 local L-BFGS-B optimizations started from random initial positions.Random sampling can result in samples of low quality for larger dimensionality d of the parameter space.For example, for d = 18 parameters, 10 000 samples correspond to effectively about 1.7 samples per dimension which can be considered extremely sparse.The AX framework first evaluates the acquisition function at, by default, 1000 random positions.The best 20 positions are further optimized by the local L-BFGS-B or SLSQP method.The JCM optimizer performs an initial DE maximisation of the acquisition function.The obtained maximum and the previous sample with the lowest objective value are then tuned by a local L-BFGS-B optimization.
AX and JCM can detect the noise level η 2 in order to perform a maximisation of NEI(p * ).FMFN assumes by default that the objective function is noiseless.The same holds for the JCM optimizer with disabled noise detection.
Within our collaboration [40], we developed different strategies with the goal of limiting the computational overhead of optimization based on noisy experimental data.For this work, the methods where implemented into the JCM optimizer.To compute NEI(p * ) without requiring F additional hyperparameter optimizations, the optimized length scale hyperparameters of the noisy Gaussian process are used as proposed in [39].Exploiting the fact that the noiseless covariance matrix [Σ] noiseless ij = σ 2 0 κ(p i , p i ) is identical for each of the F Gaussian processes up to different optimal prefactors σ 2 0 , the expensive step of the matrix inversion (i.e.Cholesky decomposition) has to be done only once.Moreover, following the batch optimization strategy in [39] the next sampling point is already computed during a pending evaluation of the objective.The numerical effort of this computation, e.g. the maximum number of DE iterations, is adapted to the runtime of the experiment to reduce the waiting time, in the ideal case, to zero [48].The apparatus used within this paper is described in Fig. 1.It is based on a commercial RuBECi vacuum system, alongside a 2D+ MOT platform, magnetic coils [49], and an optical system for the 3D MOT beams.A rendering can be seen in the box labelled 'physics package' in Fig. 1.The vacuum chamber consists of two glass cells separated by a differential pumping stage.One chamber is used to create a 2D+ MOT which acts as an atom source.The second chamber acts as a 'physics package', where atoms are trapped in a 3D-MOT before further cooling.Here, two separately controllable coils, aligned along the x-axis, provide a magnetic gradient.Additionally, two pairs of Helmholtz coils, aligned along the y-and z-axis of the experiment, are used to generate offset fields for cancelling stray magnetic fields.The laser light for trapping in the 3D-MOT is fed into the system via three collimators providing beams with a diameter of 13.2 mm which are retro-reflected.The atoms are imaged using absorption imaging.

Experimental set-up
As shown in the 'optical setup' box of Fig. 1, cooling (F = 2 → F ′ = 3) and repumping (F = 1 → F ′ = 2) light is generated via two micro-integrated DFB-MOPA modules [50].The lasers are frequency stabilized via offset locks to a reference laser locked to the 85 Rb F = 3 → F ′ = 3/4 crossover.The optical power delivered to the experiment is modulated and switched via acousto-optic modulators (AOM), then combined on a 50:50 fibre splitter, before additional extinction provided by shutters.
No active intensity stabilization is present in the setup.
In short, the experimental sequence starts with the loading of the 3D-MOT by a 2D+ MOT, followed by the compression and molasses phases, and ends with the imaging of the atoms.A selection of experimental parameters from this sequence will be used in order to perform the benchmarking.
During the MOT phase, the detuning and optical power of the cooling and repumper lasers can be varied, as well as the currents delivered to both gradient coils and the zoffset coils.
Within the molasses phase, the power and detuning of cooling and repumper lasers can be ramped over a variable duration or switched, and the coil currents can also be modified as in the MOT phase.These tunable molasses parameters are the start point of the ramp for the optical power of the repumper, the endpoints of the ramps for the optical power of cooler and repumper, the current applied to all coils responsible for generating offset fields, and the overall length of the molasses phase.Since the molasses phase is short compared to the rest of the sequence, this parameter does not significantly influence the run-time and thus the result of the comparison.This sequence is used for all measurements presented in the following benchmarking measurements.A list of the parameters including their limits can be found in A1.

Server Optimizer interface
Sequence player

Data acquisition Compiler
Figure 2. Schematic (block diagram) of the control software showing the server's workflow running the experiment with its sub-processes.Using the client, a sequence is created to be used by the server for compilation and later used by the sequence player.Feedback for the optimizer is provided using the acquired data via the data evaluation software.An adapted sequence is then fed into the compiler.
After a sequence is performed by the experiment, absorption images are evaluated and post-processed in tailored Python analysis code [51].Atom number and optical density can be evaluated via either a full 2D Gaussian fit, or via two 1D fits of row and column sums.This information is then fed into the optimization algorithm which uses the result to generate a new sequence.This is shown in Fig. 2.
The offset frequencies, optical powers, coil currents, and timings can be controlled via the Sinara system, a schematic of which is shown in the 'Control electronics' box of Fig. 1.The Sinara system is a Field Programmable Gate Array (FPGA) based real-time system, developed by Quartiq [52].The FPGA board is augmented with digital in/out breakout boards as well as DDS and DAC cards for controlling the experiment.By using these cards we control our frequency offset locks, coil drivers, shutters and AOMs.The control software of the experiment is a server-client based system using Python.It is described in Appendix D.

Results
In the following sections we present benchmarking results of the performance of several optimizers using post-molasses atom number after 10 ms of free expansion as the test problem.The problem is chosen due to its resilience to false positives and the potential to adjust the noisiness of the data by worsening the imaging lasers frequency lock and reducing the rubidium gas pressure in the 2D+ MOT.The fitness function for this problem is f = −N atoms , which converts the maximisation of atom number into a minimization problem as required by the optimizers used.We also note that this experiment typically takes approximately 2.5 s per shot.
In the 10-dimensional case we vary parameters which influence the number of trapped atoms, such as laser frequencies, optical powers and magnetic fields.A complete list of the parameters chosen including the limits can be found in the table in Appendix Appendix A. The MOT loading time was excluded due to the effect it would have on the duration of the experiment and thus the comparability of the results.In addition, we included the start of the molasses phase: the frequencies of the cool and repumper lasers, as well as the starting point of the power ramp of the cooling laser.
For the 18-dimensional case we extended this list of parameters to include all remaining variables influencing the molasses, namely the endpoint of the cooler power ramp, the start and end point of the repumper power ramp, the offset fields during molasses, and the temporal length of the molasses phase.Since the length of the molasses phase is short, including this as an optimization parameter does not noticeably affect the results of the comparison.The power ramps are linear and frequency is instantaneously switched.
The optimizers were tested using the following procedure.Each algorithm is run using the default settings for 400 iterations and repeated 10 times to obtain an average performance.The choice of 400 iterations allows for many algorithms to reach a plateau whilst also limiting the time needed for data acquisition to a reasonable amount.In order to remove systematics due to experimental drift, the algorithms are interleaved rather than run sequentially, that is, each algorithm is run once, one after another, and then this process repeated 10 times.Parameter limits are chosen such as to reduce the size of the parameter space where no atoms are measured.
In the following figures, we show the evolution of the mean fitness (i.e.atom number) over time, with the shaded area representing the standard error of the 10 runs.The optimizers received no initial set and are initialized randomly within the defined parameter space.
As described in Sec. 2, in order to provide comprehensive benchmarking, we compare at least one algorithm per family: CMA-ES ; the BO methods JCM with noise detection, FMFN and AX; PSO; variations on differential evolution LILDE and DE; RS; and simplex.We benchmark for both 10-and 18-dimensional space as well as two different noise levels.

10-dimensional optimization
Results for 10-dimensional optimization are shown in Fig. 3. RS is considered a baseline optimization that does not take the information from previous evaluations into account.
In our benchmarking, it swiftly finds sets of parameters producing a signal, which indicates that a signal can be found throughout the majority of the parameter space.
Although RS is considered a baseline method, we see that simplex consistently performs worse, reaching a maximum of (2.0±0.7)×10 8 atoms compared to (3.9±0.2)×10 8 atoms for RS, as well as taking longer to reach these values.As described in Sec.2.1, simplex is a local minimization method and does not incorporate strategies to limit effects of noise.Noisy observations can deteriorate the algorithm by attracting population members to regions with relatively low performance.Following simplex and RS in performance are DE, LILDE and PSO.All three reach similar fitness values ((4.5 ± 0.3) × 10 8 , (4.7 ± 0.2) × 10 8 , and (4.7 ± 0.2) × 10 8 atoms respectively) but do not plateau within the given number of iterations.The graphs show jumps in performance at certain times for all three.This is most visible in the LILDE algorithm at 400 s and 1000 s.The jumps are due to these optimizers relying on a generational approach where gained knowledge is collected and combined after a defined number of iterations to generate new sets.The updated process produces a sharp increase in the reached fitness value.This indicates that an adjustment of the default generation size may increase the performance.Bayesian optimizers FMFN and AX lie in third and fourth place.Both perform surprisingly equally in terms of the reached fitness with (5.1 ± 0.3) × 10 8 and (5.2 ± 0.2) × 10 8 atoms respectively.In terms of optimization speed, they perform similarly to begin with, however, over the full iteration number a clear advantage of FMFN over AX is visible.We attribute the significant computational overhead of AX to the more elaborate computation of expected improvement which requires predictions from F (default F = 20) Gaussian processes instead of only one.Based on the plateauing of fitness it seems that neither optimizer would significantly benefit from a higher number of iterations.
CMA-ES performs second best.Here we see a fitness of (5.4 ± 0.3) × 10 8 atoms which is close to the best reachable value.Upon reaching 400 iterations, the fitness is still increasing.As a result, it is possible that a higher fitness value could be reached with further iterations and it may perform similar to the best algorithm given enough time.It is surprising that such a simple optimizer still yields such good performance compared to more elaborate approaches.As such, we conclude that the described approach of updating the mean vector from a set of points is robust against the noise levels in this problem.
The best performing optimizer in our list is the JCM Bayesian optimizer, which was extended for this work.It outperforms the others from start to end.The computation time is short compared to AX due to strategies like precomputing next samples and sharing information between the F Gaussian process regressions (see the theory section 2.4).It reaches the best fitness ((5.6 ± 0.3) × 10 8 atoms) after ≈1500 s reaching a plateau afterwards.The parameters used to achieve the best fitness are shown for the 4 best performing optimizers in App.Appendix B.

18-dimensional optimization
In the 18-dimensional case we find that all optimizers perform consistently worse compared to the 10-dimensional case.Since all limits stayed the same as those used in the 10-dimensional case, one would expect the optimizers to reach the same, or possibly better, fitness as before.We assume that the larger parameter space makes the optimization problem much harder such that only lower fitness values are found within the same number of iterations.However, we note that we are unable to rule out systematic drifts in the experiment between the time the 10-and 18-dimensional data was taken.
When we look at each optimizer individually, we find that FMFN is now the worst performing optimizer, notably worse even than simplex and RS.We attribute this behaviour to its very sparse search for good samples in higher dimensional problems, as described in Sec.2.4.
LILDE and DE now barely perform better than RS.By default, in these algorithms, the population size is 15 times the number of dimensions.Therefore, for d = 18 dimensions, there are 270 evaluations per iteration meaning that these optimization approaches reduce to effectively random sampling within 400 iterations.
As in the 10-dimensional example, the high overhead of AX results in it taking far longer than any other optimizer to finish 400 iterations.Despite the higher computational effort and thus long optimization times, AX still performs well for higher dimensional problems, reaching a plateau of (4.6 ± 0.3) × 10 8 atoms.
PSO performs similarly to AX in terms of fitness (reaching (4.4 ± 0.3) × 10 8 atoms) but takes much less time.
The best performing optimizers are again CMA-ES and the JCM optimizer reaching similar fitness values of (4.7±0.3)×10 8 and (4.8±0.3)×10 8 atoms respectively.CMA-ES takes marginally less time for 400 iterations, and does not reach a plateau value, therefore its fitness may eventually exceed the JCM optimizer.Despite this, the JCM optimizer reaches a close-to-optimal fitness value far more quickly than any other optimizer.
In conclusion, one can see the commercial Bayesian optimizer, JCM, performing best, closely followed by CMA-ES representing the best open-source optimizer available for this use case.Other optimizers perform consistently worse and degrade in performance, when compared to the lower dimensional problem.We see this most prominently for FMFN.The parameters used to achieve the best performing sequences are shown for the best four optimizers in the table in App.Appendix C.

Influence of noise on the experiment optimization
In order to explore the robustness of the optimizers to noise, we tested the algorithms at two different noise levels for the 10-dimensional problem described above.The different noise levels were achieved in two ways: by changing the dispenser current in the 2D+ MOT we can reduce the number of atoms loaded into the 3D MOT; and by changing the PID parameters of the imaging laser we can introduce instabilities in its frequency and thus in the number of atoms measured.The lower atom number reached in the higher noise test case is due to the experimental conditions (i.e. a reduced loading rate) rather than the performance of the optimizers.The two noise levels have a fluctuation in atom number of ±5.6 % and ±19.5 %, where we note that the lower noise level is the typical performance of our experiment.We compared the highest performing optimizers, namely AX, FMFN, CMA-ES and JCM, alongside the JCM optimizer without its noise detection feature.
In the low noise case (Fig. 4), the same trends are visible as discussed in Sec.4.1.However, by comparing the JCM algorithm with and without noise detection we can see that the additional feature does not increase early optimization speeds, instead it results in a slightly higher fitness value of (5.6 ± 0.3) × 10 8 (with noise detection) compared to (5.2 ± 0.2) × 10 8 (without noise detection) atoms and in that higher fitness being reached faster.
In the high noise case, all five algorithms tested perform relatively similarly in terms of fitness.FMFN reaches the lowest fitness value ((1.03 ± 0.06) × 10 8 in 2000 s), followed by AX at (1.08 ± 0.04) × 10 8 in 4800 s, JCM with and without noise detection reach (1.07 ± 0.03) × 10 8 and (1.09 ± 0.04) × 10 8 atoms in 2000 s, and finally CMA-ES reaches a fitness value of (1.10 ± 0.03) × 10 8 atoms in 1900 s.Although we have ranked the algorithms here according to the mean fitness value, the differences between them are almost all within error margins.The main differences in the algorithms can actually be seen in the speed of optimization.For example, AX is the slowest optimizer at all points in the process, and FMFN slows dramatically after performing fastest in very early stages.
Due to the additional noise detection capabilities of the algorithm developed here, it was expected to perform the best in a high noise environment.The benchmarking data shows that it does indeed optimize faster than all other optimizers and to a higher fitness than all optimizers except CMA-ES.In fact, it reaches close to its final value within 720 s.This is roughly half the time taken for CMA-ES to reach the same value.We assume that CMA-ES performs so well in terms of final fitness due to averaging over many vectors which limits the effect of noise (Sec.2.1).Nevertheless, the noise detection function provided by JCM significantly improves the speed of optimization compared to the algorithm without it, and enables it to perform better than the other BOs tested.In general we find that most optimizers do not a plateau in the available iterations.
In summary, the influence of noise extends the iterations and thus time needed for the optimization.It also generally reduces the initial speed of optimization considerably, which means that more time is required to achieve a 'good enough' value.There is a trade-off in performance between speed and final fitness, with CMA-ES providing a marginally better final fitness, and the JCM with noise detection algorithm reaching a high fitness value much quicker than other algorithms.

Outlook
In conclusion, we tested different optimizers and showed their performance for different dimensionalities.Additionally, the developed implementation of JCM's efficient noise detection feature was benchmarked against the three best performing algorithms at two different noise levels.The best performing optimizers were found to be JCM (with noise detection) and the open source CMA-ES, with JCM providing faster optimization.
Optimization of the problem using 18 dimensions rather than 10 resulted in a worse performance for all optimizers.This shows indications that optimizing more parameters at once is not always better when it comes to the optimization of real world experiments.We also tested the influence of noise on optimizer performance, finding that with increased noise AX, JCM, FMFN and CMA-ES reach similar fitness, however, JCM and CMA-ES continued to offer the best performance in terms of speed and fitness.We were able to confirm that the noisy expected improvement strategy implemented into the JCM optimizer for this work improves the speed of optimization and in some cases also the final fitness value.
When choosing a suitable optimizer for an experiment, one must make a trade-off between speed, final fitness, and cost.The open source algorithm CMA-ES provided a marginally better final fitness, whereas the JCM optimizer with noise detection reached a high fitness value much quicker than other algorithms.Therefore, CMA-ES would be better suited to an experiment which would be optimized once or infrequently.However, for experiments requiring very regular optimization, such as mobile experiments, the speed benefits of the JCM optimizer would enable more efficient operation.
The results presented could be extended further via investigations at longer iteration numbers or in more noisy conditions.

Figure 1 .
Figure 1.A block diagram showing the relevant parts of the experimental setup used within this paper, alongside a rendering of the apparatus (on the right).Dashed lines indicate electrical and solid lines optical signals, similarly green shading represents electronics components and yellow shading optical components.

Figure 3 .
Figure 3. Development of the fitness function (maximization of atom number) over time for different optimizers for (a) a 10-dimensional parameter space and (b) a 18dimensional parameter space.Each algorithm runs for 400 iterations.The lines represent the mean value, while the shaded areas show the standard deviation over 10 repetitions.

Figure 4 .
Figure 4. Development of the fitness function (maximization of atom number) over time for the highest performing optimizers with different noise levels.Figure (a) shows optimization with an atom number noise level of ±5.6 %, while (b) shows the same for a noise level of ±19.5 %.The JCM optimizer is tested with and without noise detection activated, with this comparison highlighted in the insets.The reduced atom number reached in (b) is due to the experimental conditions choosen to increase the noise level.Each algorithm runs for 400 iterations.The lines represent the mean value, while the shaded area shows the standard deviation over 10 repetitions.

Table C1 .
Mean values of the input parameters, including their standard deviation, used to reach the best fitness.The parameters of the four best performing optimizers are shown.A big standard deviation shows either little influence of this parameter on the result or the optimizer having difficulties in finding the optimal set.