An adaptive Bayesian approach to gradient-free global optimization

Many problems in science and technology require finding global minima or maxima of complicated objective functions. The importance of global optimization has inspired the development of numerous heuristic algorithms based on analogies with physical, chemical or biological systems. Here we present a novel algorithm, SmartRunner, which employs a Bayesian probabilistic model informed by the history of accepted and rejected moves to decide on its next random trial. Thus, SmartRunner intelligently adapts its search strategy to a given objective function and moveset, with the goal of maximizing fitness gain (or energy loss) per function evaluation. Our approach is equivalent to adding a simple adaptive penalty to the original objective function, with SmartRunner performing hill ascent on the modified landscape. The adaptive penalty can be added to many other global optimization schemes, enhancing their ability to find high-quality solutions. We have explored SmartRunner's performance on a standard set of test functions, the Sherrington–Kirkpatrick spin glass model, and Kauffman's NK fitness model, finding that it compares favorably with several widely used alternative approaches to gradient-free optimization.


Introduction
Many models in fields of enquiry as diverse as the natural and social sciences, engineering, machine learning, and quantitative medicine are described by complex non-linear functions of many variables. Often, the task is to find globally optimal solutions of these models, which is equivalent to finding global minima or maxima of the corresponding model functions. The global optimization problem arises in engineering design, economic and financial forecasting, biological data analysis, potential energy models in physics and chemistry, robot design and manipulation, and numerous other settings. Notable examples include finding the minimum of protein free energy in computer simulations of protein folding,1,2 finding high-fitness solutions in evolving populations subject to mutation, selection, recombination, and genetic drift[3][4][5] (biological fitness quantifies the degree of reproductive success of an organism in an evolving population), and minimizing the error function in deep-learning neural network models.6,7

Mathematically, the global optimization problem is defined as finding the maximum (or the minimum) of a real-valued function F(X), where X denotes a collection of discrete or continuous variables that describe the state of the system. The states of the system may be subject to nonlinear constraints. Here we focus on maximizing F(X), which we will refer to as the fitness function; with F(X) = −E(X), this is equivalent to minimizing an energy or error function E(X). In the energy function case, E(X) may signify the energy of a microstate or the free energy of a coarse-grained/mesoscopic state. The number of variables in X may be large in real-world applications, and F(X) may be costly to evaluate, making it highly desirable to develop efficient global optimization algorithms which require as few fitness function evaluations as possible to reach high-quality solutions. The set of fitness values assigned to all states of the system forms a fitness landscape: a high-dimensional surface which global optimization algorithms must traverse on their way to the mountain peaks that correspond to high-scoring solutions.
If the fitness function is concave everywhere, the fitness landscape consists of a single peak and the global maximum is easy to find. However, in most problems of interest fitness landscapes contain multiple local maxima and saddle points which can trap the optimizer. There is no guarantee of finding the global maximum in this case unless all system states can be examined, which is usually not feasible because their number is exponentially large. A well-known worst-case scenario is a "golf-course" landscape which is flat everywhere apart from a few states that form a basin of attraction for an isolated deep hole, or a tall peak. In protein folding, this scenario is known as Levinthal's paradox:8 proteins cannot fold on biologically reasonable time scales if they need to sample a sizable fraction of their microscopic configurations. While Levinthal's paradox has been resolved by introducing the concept of a protein folding funnel,1,2,9,10 generally there is no guarantee of finding the global maximum in a reasonable number of steps, and global optimization is demonstrably an NP-hard problem.11

If the gradient of the fitness function can be computed efficiently, it should be used to guide the search because the gradient vector indicates the direction of steepest ascent. Here, we focus on systems with discrete or discretized states and assume that the gradient is not available. Namely, we consider an undirected graph with N nodes or vertices, where N is the total number of system states, which may be astronomically large or even unknown. Each node i = 1 . . . N is assigned a state X_i and a corresponding fitness value F(X_i). This definition describes a vast number of systems that are either naturally discrete (e.g., spin glasses12) or discretized by superimposing a lattice on a continuous landscape. Besides the fitness function, a global optimization algorithm requires a move set: a deterministic or stochastic rule for moving between states on the fitness landscape. A move set defines state neighborhoods: the set of states reachable from a given state in a single jump. The size of the neighborhood is typically fixed but may also change in complex ways, e.g. with the recombination moves described below.
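As a concrete illustration of a move set, a nearest-neighbor rule on a discretized continuous landscape can be sketched as follows (a minimal sketch; the step size dx and the periodic box [lo, hi) are illustrative parameters, not values taken from this work):

```python
import random

def nnb_neighbors(x, dx=0.1, lo=-5.0, hi=5.0):
    """State neighborhood of x under a nearest-neighbor move set:
    change one coordinate by +/- dx, with periodic boundary conditions."""
    span = hi - lo
    nbrs = []
    for d in range(len(x)):
        for step in (dx, -dx):
            y = list(x)
            y[d] += step
            if y[d] >= hi:       # periodic wrap at the upper edge
                y[d] -= span
            elif y[d] < lo:      # periodic wrap at the lower edge
                y[d] += span
            nbrs.append(tuple(y))
    return nbrs

def random_nnb_move(x, dx=0.1, lo=-5.0, hi=5.0):
    """Stochastic move rule: pick one of the 2*len(x) neighbors uniformly."""
    return random.choice(nnb_neighbors(x, dx, lo, hi))
```

Under this rule a d-dimensional state has exactly 2d neighbors, so the neighborhood size is fixed, in contrast to, e.g., recombination move sets where it can vary.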
Numerous empirical approaches have been developed over the years to tackle the problem of gradient-free optimization. Usually, these algorithms are based on an analogy with a physical, chemical or biological process in which some kind of optimization is known to occur. For example, the celebrated simulated annealing algorithm13 is a Monte Carlo technique based on an analogy with a physical annealing process in which the material starts at a high temperature to enable constituent molecules or atoms to move around. The temperature is gradually decreased, allowing the material to relax into low-energy crystalline states. The rate of temperature decrease is a key parameter of the simulated annealing algorithm.14 Numerous modifications of the basic simulated annealing approach have been developed over the years: parallel tempering Monte Carlo,15 replica Monte Carlo,16 population annealing,17 simulated tempering,18 and many others. Generally speaking, the idea of these algorithms is to overcome free energy barriers by simulating a broad range of temperatures. Besides estimating various thermodynamic quantities by Monte Carlo sampling, some of these algorithms have also been applied to combinatorial optimization problems such as the search for the ground states of Ising spin glasses.19 Genetic or evolutionary algorithms[20][21][22] are based on an analogy with the evolution of a biological population: a population of candidate solutions is subjected to multiple rounds of recombination, mutation, and selection, enabling "the survival of the fittest". Harmony search is a music-inspired algorithm, applying such concepts as playing a piece of music from memory, pitch adjustment, and composing new notes to an evolving population of harmonies.23,24 Particle swarm algorithms draw their inspiration from the collective behavior of bird flocks and schools of fish.25,26

Taboo search is a deterministic strategy in which all nearest neighbors of the current state are examined and the best move is accepted.27 To avoid returning to previously examined states via deterministic cycles, a fixed-length "taboo" list is kept of the recently visited states, which are temporarily excluded from the search. Stochastic hill climbing employs a procedure in which moves are accepted or rejected using a sigmoid (two-state) function with a fixed temperature T.28 As in simulated annealing, this strategy allows for deleterious moves whose frequency depends on the value of T. Many other heuristic search strategies have been proposed as well.[29][30][31][32][33][34][35]

Here we propose a novel global optimization algorithm which we call SmartRunner. SmartRunner is not based on an analogy with a physical, chemical or biological system. Instead, the algorithm uses previously accumulated statistics on rejected and accepted moves to make a decision about its next move. Thus, SmartRunner adapts its search strategy intelligently as a function of both local and global landscape statistics collected earlier in the run, with the goal of maximizing the overall fitness gain. Generally speaking, SmartRunner can be viewed as a stochastic extension of the Taboo search policy. However, unlike the Taboo algorithm, it does not need to evaluate the fitness values of every neighbor of the current state, which may be computationally expensive. Moreover, it replaces the infinite penalties assigned to the states in the "taboo" list with node-dependent penalties which only become infinite when all the nearest neighbors of the node in question have already been explored. We benchmark SmartRunner on a set of challenging global optimization problems and show that it consistently outperforms several other state-of-the-art algorithms. Moreover, we demonstrate that the SmartRunner approach amounts to hill climbing on a dynamically redefined fitness landscape. This redefinition can be used to enhance the performance of many other global search approaches such as simulated annealing or evolutionary algorithms.

Materials and Methods
Bayesian estimation of the probability to find a novel beneficial move. Unweighted moves. Consider a fitness landscape with a move set that defines N nearest neighbors for each discrete system state X_i (i = 1 . . . N). We divide all neighbors of the state X_i into two disjoint subsets: one set S_i^p of size U_i^p ≥ 0 contains all states with fitness ≤ F_i, while the other set S_i of size U_i = N − U_i^p ≥ 0 contains all states with fitness > F_i. Moves between X_i and any state in the set S_i^p are deleterious or neutral, while moves to any state in the set S_i are beneficial. Generally, we expect the size of S_i to be small, U_i ≪ N ≃ U_i^p, because as a rule it is more difficult to find a beneficial move than a deleterious or neutral one. We assign the system state X_i to node i on a network, with nodes representing system states and edges representing nearest-neighbor jumps. We consider a single random walker that explores the network. At each step, the walker is equally likely to initiate a jump to any of the N neighbors of the current node. Suppose the random walker is currently at node i and has made n unsuccessful attempts to make a move i → j ∈ nnb(i), where nnb(i) = S_i^p ∪ S_i is the set of all nearest neighbors of node i (for simplicity, let us assume for the moment that all deleterious and neutral moves are rejected, while a beneficial move, once found, is immediately accepted). After n trials, we have data D = {K_p, m_p, K, m}, where K_p is the total number of visits to the nodes in S_i^p and K = n − K_p is the total number of visits to the nodes in S_i. Furthermore, m_p ≤ K_p and m ≤ K are the numbers of unique visited nodes in S_i^p and S_i, respectively. The probability of observing D is given by the likelihood in Eq. (1). Correspondingly, the probability of U_i given the data follows from Bayes' rule (Eq. (2)), where P(U) is the prior probability that there are U nearest neighbors of node i whose fitness is higher than F_i. Choosing an uninformative prior, we obtain P(U) = 1/(N + 1) for U = 0, . . . , N (Eq. (3)); note that ∑_{U=0}^{N} P(U) = 1.

Then Eq. (2) yields the posterior in Eq. (4), with the model evidence Z given by the corresponding normalization sum. Focusing on the K = 0, m = 0 limit (that is, on the case where no beneficial moves have yet been found) and assuming U_i ≪ N, we obtain the approximate likelihood of Eq. (5), where γ = n/N. Furthermore, the evidence takes the form of Eq. (6), where N̄ = N − m_p + 1. Note that the exponential substitution becomes inaccurate for the terms in the sum in which U′ approaches N − m_p; however, since m_p ≤ N, these terms are suppressed in the n ≫ 1 limit compared to the accurately approximated terms with U′ ≪ N − m_p. We observe that m_p is a stochastic variable whose expectation value can be shown to be given by Eq. (7), where the last approximation requires N ≫ 1. Combining these results yields P(U_i > 0|D) in Eq. (8). If N ≫ 1 and N ≫ m_p, Eq. (8) yields Eq. (9), where the last approximation is valid for n ≫ 1. Thus, the probability to find beneficial moves, P(U_i > 0|D) ≃ e^{−n/N̄}, decreases exponentially with n. Note that if n = 0 (no random trials have been made), Eq. (8) yields P(U_i > 0|D) = N/(N + 1), consistent with the prior probability in Eq. (3), which assigns equal weights to all N + 1 values of U_i. Thus, to begin with, the system is very optimistic that a beneficial move will be found. However, if m_p = N (that is, all moves have been tried and none are beneficial), Eq. (8) yields P(U_i > 0|D) = 0, as expected. Thus, the system gradually loses its optimism about finding a beneficial move as it makes more and more unsuccessful trials.

Finally, we compute p_f, the probability of finding a higher-fitness target in the next step (Eq. (10)). In the beginning of the search, n ≪ N and, correspondingly, m_p ≪ N. If, in addition, n ≫ 1 (which implies N ≫ 1), Eq. (10) simplifies considerably, with p_f ≃ 1/n to leading order (Eq. (11)); note that in this limit p_f is independent of N. If m_p = N, then N̄ = 1 and p_f = 0 in Eq. (10), as expected. Note that if n = m_p = 0, then N̄ = N + 1 and γ = 0, so that Z = N + 1 from Eq. (6), leading to the simplification p_f = 1/2 of Eq. (10) (Eq. (12)). Thus, not surprisingly, the probability of finding a higher-fitness target before making any moves is 1/2. After making a single move and not finding a higher-fitness target (n = 1, m_p = 1), N̄ = N and γ = 1/N. With the additional assumption that N ≫ 1, we obtain Eq. (13). In summary, the probability of finding a beneficial move, p_f, starts out at 0.5 and decreases with the number of trials until either a beneficial move is found (in which case Eq. (10) is no longer applicable) or there are no more novel moves to find (in which case p_f = 0). The asymptotic ≃ 1/n behavior of p_f is universal in the n ≫ 1 limit (Eq. (11)).
Finally, we observe that the above formalism can be extended to any subsets S_i^p and S_i, since the initial division into deleterious/neutral moves in S_i^p and beneficial moves in S_i was arbitrary. Thus, even if a beneficial move is found, we can add it to S_i^p and regard S_i as the set of remaining, or novel, beneficial moves.
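The growth of m_p with n that enters the derivation is the standard coupon-collector expectation, E[m_p] = N(1 − (1 − 1/N)^n) ≈ N(1 − e^{−n/N}) for N ≫ 1. A quick numerical check (a sketch, with N and n chosen arbitrarily for illustration) might look like:

```python
import math
import random

def expected_unique(N, n):
    """Exact expectation of the number of distinct neighbors visited
    after n uniform draws (with replacement) among N neighbors."""
    return N * (1.0 - (1.0 - 1.0 / N) ** n)

def simulated_unique(N, n, trials=2000, seed=0):
    """Monte Carlo estimate of the same expectation."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += len({rng.randrange(N) for _ in range(n)})
    return total / trials

N, n = 50, 100
exact = expected_unique(N, n)             # N * (1 - (1 - 1/N)^n)
approx = N * (1.0 - math.exp(-n / N))     # large-N approximation
# exact and approx differ by a fraction of a percent here, and the
# simulation fluctuates tightly around the exact value.
```

This is the sense in which the exploration statistics at a node saturate: once n is a few multiples of N, nearly all neighbors have been seen.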
Weighted moves. The probability to find a novel beneficial move (Eq. (10)) was derived under the assumption that the total number of neighbors N is known and that the move set is unweighted: each new move is chosen with equal probability 1/N. However, move sets may be intrinsically weighted: for example, in systems with recombination the relative weights of recombination moves depend on the genotype frequencies in the population. In addition, it may be of interest to assign separate weights to classes of moves, such as one- and two-point mutations in sequence systems, or one- and two-spin flips in spin systems. In this section, we relax the assumption of unweighted moves, while still treating N as a known constant.
Specifically, we consider a set of weights {w_j}_{j=1}^{N} associated with the i → j ∈ nnb(i) moves. The probability of an i → j jump is then given by p(i → j) = w_j/W, where W = ∑_{j=1}^{N} w_j = ∑_{j∈S_i^p} w_j + ∑_{j∈S_i} w_j ≡ W_{U_i^p} + W_{U_i} is the sum over all nearest-neighbor weights, and W_{U_i^p} and W_{U_i} are partial sums over the weights in S_i^p and S_i, respectively. Consequently, the likelihood takes the form of Eq. (14), which in the K = 0 case reduces to Eq. (15). Next, we integrate the likelihood over the edge weights (Eq. (16)). We represent the probability distribution of edge weights by a Gaussian mixture model, which can be used to describe multimodal distributions of arbitrary complexity36 (Eq. (17)), where Ω is the normalization constant, P is the number of Gaussian components, and p_k is the relative weight of component k: ∑_{k=1}^{P} p_k = 1. In the N ≫ 1 limit, we expect W ≃ ⟨W⟩ = N ∑_k p_k w̄_k ≡ N w̄, such that Eq. (16) simplifies to Eq. (18), with the exponent β defined in Eq. (19). Here, erfc(x) = (2/√π) ∫_x^∞ dt e^{−t²} is the complementary error function. The normalization constant is given by Eq. (20). Note that if all the Gaussians are narrow (σ_k ≪ w̄_k, ∀k), erfc(−w̄_k/(√2 σ_k)) → 2 and thus Ω → 1, as expected.
If the edge weights are Gaussian distributed with mean w̄ and standard deviation σ (i.e., P = 1), Eq. (19) becomes Eq. (21), where α = γ − γ²σ²/(2w̄²) and c = γσ²/w̄. If in addition all weights are equal, σ/w̄ → 0 and β → γ in Eq. (21), such that Eq. (18) for the likelihood reduces to Eq. (5). Thus, the difference between β and γ is due to fluctuation corrections. The model evidence Z, the posterior probability P(U_i|D) and p_f, the probability of finding a higher-fitness target in the next step, are found by substituting γ → β into Eqs. (6), (8) and (10), respectively.
Note that if n → 0, β → 0 as well, and therefore p_f → 1/2, since the argument leading to Eq. (12) still holds. Moreover, Eq. (10) still yields p_f = 0 when all the neighbors have been explored (m_p = N). Finally, if n, m_p ≪ N and n ≫ 1, α ≃ γ and the ratio of complementary error functions in Eq. (21) is ≃ 1. Then the argument leading to Eq. (11) also holds, yielding p_f ≃ 1/n asymptotically even in the weighted-move case. Thus, introducing weighted moves does not lead to qualitative differences in the dependence of p_f on n: any substantial differences are localized to the intermediate region, 1 ≤ n ≤ 30 or so, and in many systems the p_f curves for weighted and unweighted moves overlap almost completely (cf. red and blue curves in Fig. S1A-D).
Next, we consider the exponential probability distribution of edge weights, an archetypal distribution often found in natural and artificial networks37 (Eq. (22)), where w̄ denotes the mean of the exponential distribution, such that ⟨W⟩ = N w̄. It is easy to show that the likelihood P(D|U_i) is given by Eq. (18) with β_exp = log(1 + n/N). Consequently, as in the case of the Gaussian mixture model, the model evidence Z, the posterior probability P(U_i|D) and p_f are given by Eqs. (6), (8) and (10), respectively, but with β_exp instead of γ. Clearly, β_exp → 0 as n → 0, and therefore p_f → 1/2 as in the Gaussian mixture case. Similarly, p_f = 0 once m_p = N. Lastly, the n, m_p ≪ N limit yields α ≃ γ, which in turn leads to p_f ≃ 1/n under the additional assumption of n ≫ 1. Thus, the dependence of p_f on n for exponentially distributed weights is again qualitatively similar to the p_f functions in the corresponding unweighted cases (cf. red and blue curves in Fig. S1E,F).
Simplified treatment of p_f. The computation of p_f for the unweighted and the exponentially distributed cases requires knowledge of m_p and N, besides the number of trials n. For weighted move sets in the Gaussian mixture model, one would additionally require p_k, w̄_k and σ_k for each Gaussian component. Unless these parameters are known a priori, they would have to be estimated from a sample of edge weights, increasing the computational burden. Moreover, this extra effort may not be justified, in view of the nearly universal dependence of p_f on n in all three cases considered above. A further complication arises in population-based algorithms with recombination moves.[20][21][22] With recombination, N depends on the current state of the population and therefore generally changes with time. Hence, computing N at each step would increase the complexity of the algorithm.
We propose to capitalize on the approximate universality of p_f by creating a minimal model for it which depends only on the number of trials n. Specifically, we define p_f(n) = 1/(n + 2) (Eq. (23)). This model has the right asymptotics at n = 0 and in the n, m_p ≪ N, n ≫ 1 limit, but does not go to 0 identically when m_p = N, because enforcing this condition requires knowledge of N. However, if N ≫ 1, as can be expected with complex move sets, n ≫ 1 at m_p = N, and the difference between the small but non-zero value of p_f in Eq. (23) and zero will be immaterial (cf. green curves in Fig. S1 for p_f(n) in several model systems).
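The minimal model is trivial to implement; the sketch below assumes the simplest functional form consistent with the stated limits, p_f(n) = 1/(n + 2), which gives p_f = 1/2 at n = 0 and the universal ~1/n decay for n ≫ 1:

```python
def p_f_minimal(n):
    """Minimal model for the probability that the next random trial at
    the current node uncovers a novel beneficial move, after n
    unsuccessful trials.  Assumed form: 1/(n + 2), so p_f(0) = 1/2 and
    p_f ~ 1/n for n >> 1."""
    return 1.0 / (n + 2)

def expected_search_steps(n):
    """l_i = rnd[1/p_f]: expected number of further trials needed to
    find a novel beneficial move at this node."""
    return round(1.0 / p_f_minimal(n))
```

Under this form the expected search time grows linearly with the trial count (l_i = n + 2), which is what makes the occupancy penalty introduced later cheap to track: only n_i per node is needed.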
Implementation of the optimal search policy: the SmartRunner algorithm. Given the Bayesian probabilistic estimates of finding novel moves between a given node and its nearest neighbors, we need to formulate an optimal search policy that maximizes the expected fitness gain over the course of a run with l_tot random trials.
Assuming that the walker is currently at node i, there are two options after each random trial: stay at node i (thereby rejecting the move) or jump to a neighboring node j. If the walker stays at node i, we expect it to search for l_i steps before finding a higher-fitness node which has not been detected before. The value of the policy of staying at i can then be evaluated as S_i = −R l_i + ∆F_b + R l_rem ≡ −R l_i + C (Eq. (24)), where ∆F_b = F_k − F_i is the expected fitness gain of the newly found beneficial move to a node k ∈ nnb(i), and R is the expected rate of fitness gain per step. Furthermore, l_rem ≤ l_tot is the number of steps remaining in the simulation, and l_i = rnd[1/p_f^i] (Eq. (25)) is the expected number of steps needed to find k, where p_f^i is given by Eq. (10) or Eq. (23) (with the dependence on the node index i made explicit for clarity) and rnd[·] is the rounding operator. Finally, C = ∆F_b + R l_rem denotes a constant contribution independent of the node index, under the assumption that ∆F_b is the same for all nodes.
The value of the policy of jumping from i to a downstream node j is given by L_{i→j} = (F_j − F_i) − R l_j − R + C (Eq. (26)), where the extra factor of −R accounts for the jump between the current node i and the new node j, which reduces the total number of remaining steps by 1.
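Dropping the node-independent constant C, the two policy values reduce to S_i = −R l_i and L_{i→j} = (F_j − F_i) − R l_j − R. A minimal sketch of the comparison (the numbers below are illustrative, and the l values are assumed to have been estimated already from the trial counts):

```python
def stay_value(R, l_i):
    """Value of remaining at node i, with the constant C dropped."""
    return -R * l_i

def jump_value(R, F_i, F_j, l_j):
    """Value of jumping i -> j: fitness gain, minus the exploration
    penalty at j, minus R for the step spent on the jump itself."""
    return (F_j - F_i) - R * l_j - R

# Example: a slightly worse neighbor that is far less explored can
# still win if R (the exploration incentive) is large enough.
R = 0.5
S = stay_value(R, l_i=20)                    # heavily explored node i
L = jump_value(R, F_i=10.0, F_j=9.5, l_j=2)  # fresh neighbor j
# S = -10.0, L = -2.0  ->  jumping is preferred
```

Since C cancels when comparing any two policies, it never needs to be computed in practice.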
We represent each visited state as a node and each rejected or accepted move as an edge on a directed graph G, implemented using the DiGraph class from NetworkX (https://networkx.org/documentation/stable/reference/classes/digraph.html). Thus, node i is part of the directed graph G, which contains information about all previously attempted moves. Depending on which previous moves have been explored, node i may be connected to multiple downstream nodes by directed edges; in general, these edges may form directed cycles. To see if a jump to one of the nodes downstream of node i will yield a more beneficial policy, we traverse G recursively starting from node i, for up to l_max steps. For computational convenience, G is implemented with two types of nodes: 'regular' nodes i, j, k, . . ., which denote states on the fitness landscape and are therefore assigned fitness values F_i, F_j, F_k, . . . (black circles in Fig. 1), and 'terminal' nodes i_t, j_t, k_t, . . ., which are assigned fitness values −R l_i, −R l_j, −R l_k, . . . (green circles in Fig. 1). Note that we have omitted the node-independent contribution C in Eqs. (24) and (26). The edges of G connecting two regular nodes, m → n (solid black arrows in Fig. 1), are assigned a weight of F_n − F_m. The edges of G connecting a regular node to a terminal node, m → m_t (dashed green arrows in Fig. 1), are assigned a weight of −R l_m. By construction, terminal nodes do not have descendants, and each regular node has exactly one terminal descendant. The policy values are computed using a set of recursive paths on the directed graph G. All valid paths must start from node i and end at one of the terminal nodes reachable in ≤ l_max steps. The goal is to identify the path which has the maximum weight among all paths. Note that with a single step, the only valid path is i → i_t, and its weight is given by S_i from Eq. (24). The minimum allowed value of l_max is thus 2, because this enables computation of the path weight as L_{i→j} in Eq. (26) for j ∈ nnb(i). Larger values of l_max will enable longer jumps if any are available; longer jumps entail repeated application of Eq. (26) to compute the total path weight. If the winning path is i → i_t, the walker stays at node i and makes another random trial, updating its p_f^i accordingly. If the winning path is i → j_t (where j may be several steps away, depending on the value of l_max), the walker jumps to node j and makes a new random trial from that node. The node j statistics, such as n and m_p, are initialized if the node has not been visited before in the run, and updated otherwise.
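The recursive path search can be sketched as follows. For brevity this uses a plain adjacency dict in place of the NetworkX DiGraph, marks terminal nodes by a trailing '_t', and leaves out the per-jump −R correction; all names and edge weights are illustrative:

```python
def best_path(graph, start, l_max):
    """Enumerate all paths of <= l_max edges from `start` that end at a
    terminal node, returning (best_weight, best_path).
    `graph` maps node -> list of (neighbor, edge_weight) pairs; terminal
    nodes have no outgoing edges and names ending in '_t'."""
    best = (float('-inf'), None)

    def walk(node, weight, path, steps):
        nonlocal best
        if node.endswith('_t'):          # reached a terminal node
            if weight > best[0]:
                best = (weight, path)
            return
        if steps == l_max:               # path length budget exhausted
            return
        for nbr, w in graph.get(node, []):
            if nbr not in path:          # avoid directed cycles
                walk(nbr, weight + w, path + [nbr], steps + 1)

    walk(start, 0.0, [start], 0)
    return best

# Toy graph: staying at i (weight -R*l_i = -4) vs. a two-step path
# through neighbor j (F_j - F_i = 1.5, then -R*l_j = -0.5).
G = {
    'i': [('i_t', -4.0), ('j', 1.5)],
    'j': [('j_t', -0.5)],
}
w, p = best_path(G, 'i', l_max=2)
# w = 1.0, p = ['i', 'j', 'j_t']: jumping to j beats staying at i.
```

With l_max = 2 this reproduces the pairwise comparison of S_i against the L_{i→j}; larger l_max simply lets the enumeration look further downstream.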
Note that if Eq. (10) is used to compute p_f^i, it is possible to obtain p_f^i = 0 and therefore l_i = ∞ in Eq. (25), which is represented computationally by a large positive constant. The case in which both node i and all its neighbors j reachable in ≤ l_max steps are in this category requires special treatment, because the R l_i and R l_j penalties cancel out and SmartRunner essentially becomes a local optimizer driven solely by fitness differences. To avoid being trapped in local fitness maxima in this special case, SmartRunner employs two alternative strategies. In the first strategy, a random path is chosen from the ensemble of all paths with ≤ l_max steps, instead of the path with the maximum sum of edge weights. In the second strategy, a longer random path to the boundary of the p_f = 0 region is constructed explicitly; the random path can have up to 10^3 steps. In both strategies, if the boundary of the p_f = 0 region is not reached, the procedure is repeated at subsequent steps, resulting in an overall random walk to the boundary of the "maximally explored" region.
The SmartRunner algorithm depends on R, the expected rate of fitness gain per step. Larger positive values of R encourage jumps to less-explored regular nodes even if those have slightly lower fitness values, and therefore promote landscape exploration. Smaller positive values of R encourage more thorough exploration of the current node, but do not fully prevent deleterious moves. Negative values of R, however, would prevent all further exploration. We adjust the value of R adaptively as follows. The algorithm starts out with a user-provided initial value R_init. For each move, either accepted or rejected, the corresponding fitness value is recorded in a fitness trajectory array. Once M values are accumulated in the array, a linear model is fit to the fitness trajectory, yielding the fitted slope R_fit. Finally, R is computed as R = α R_fit if R_fit ≥ ϵ, and R = α ϵ otherwise (Eq. (27)), where ϵ is a small positive constant. Note that the second branch serves to 'rectify' the values of R_fit that fall below the threshold ϵ, preventing R from ever becoming negative. The value of R is recomputed every M steps using Eq. (27), providing adaptive feedback throughout the run. The positive hyperparameter α is the level of 'optimism': how much more optimistic the system is about its future success compared to its past performance. As discussed above, larger values of α will promote landscape exploration.
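The adaptive update of R can be sketched as follows, assuming the rectified form R = α · max(R_fit, ε) described above; the least-squares slope is computed here with a simple closed-form fit rather than a library routine:

```python
def fitted_slope(fitness_window):
    """Least-squares slope of fitness vs. step index over the last M
    recorded fitness values (accepted and rejected moves alike)."""
    M = len(fitness_window)
    x_mean = (M - 1) / 2.0
    y_mean = sum(fitness_window) / M
    num = sum((x - x_mean) * (y - y_mean)
              for x, y in enumerate(fitness_window))
    den = sum((x - x_mean) ** 2 for x in range(M))
    return num / den

def update_R(fitness_window, alpha=2.0, eps=1e-4):
    """Adaptive R: alpha times the fitted rate of fitness gain,
    'rectified' from below by eps so that R never turns negative."""
    return alpha * max(fitted_slope(fitness_window), eps)

# A steadily improving trajectory gives a positive slope...
R_up = update_R([0.0, 1.0, 2.0, 3.0])    # slope 1.0 -> R = 2.0
# ...while a declining one is clipped at alpha * eps.
R_down = update_R([3.0, 2.0, 1.0, 0.0])  # slope -1.0 -> R = 2e-4
```

The values alpha = 2.0, eps = 1e-4 and the window contents are placeholders for illustration; in practice α, ϵ and M are the user-set hyperparameters discussed in the text.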
The SmartRunner algorithm can be summarized as the following sequence of steps:

INITIALIZE:
1. Initialize the regular node X_0 with fitness F(X_0) and its terminal node X_0,t.
2. Add an edge X_0 → X_0,t with a weight −R l_{X_0}.

do:
1. Propose a random move X → X′ from the current state X.
2. If X′ ∉ G: add X′ as a regular node and add an edge X → X′ with a weight F(X′) − F(X).
3. Update the statistics for X, recompute l_X and update the X → X_t edge weight.
4. Recursively compute sums of edge weights for all paths of length ≤ l_max starting at X and ending at downstream terminal nodes. If l_X = ∞ for the X → X_t path and l_{Y_k} = ∞ for all other paths X . . . Y_k → Y_k,t in the ensemble, initiate a random walk; otherwise, stay at X or jump to Y_k according to the path with the maximum sum of edge weights.
while l ≤ l_tot

OUTPUT:
Globally best state: X_best, F(X_best). Fitness trajectory: {F_l}, l = 1, . . . , l_tot. Total number of fitness function evaluations: f_eval.

Adaptive fitness landscape. The stay-or-leave policy defined by Eqs. (24) and (26) amounts to an adaptive redefinition of the fitness landscape: F̃_i = F_i − R l_i (Eq. (28)), where R l_i is a positive occupancy penalty whose overall magnitude is controlled by the hyperparameter R. The penalty increases as node i is explored more and more, resulting in progressively larger values of l_i. Note that if Eq. (23) is used to estimate p_f, the only additional piece of information required to compute F̃_i from F_i is the total number of trials n_i at node i, which is easy to keep track of. Thus, F̃_i can serve as input not only to SmartRunner, which in this view amounts to hill climbing on the F̃ landscape, but to any global optimization algorithm. In algorithms where individual moves are accepted or rejected sequentially (e.g., Simulated Annealing, Stochastic Hill Climbing), we compare F̃_i with F̃_j − R to account for the fact that jumping from node i to node j decreases the total number of remaining steps by 1 (cf. Eq. (26)). In algorithms which involve non-sequential scenarios (e.g., Evolutionary Algorithm), the modified fitnesses F̃ from Eq. (28) are used directly instead of F.
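Because the modified fitness differs from the original only by the occupancy penalty R·l_i, any accept/reject scheme can be fed the penalized landscape simply by tracking per-state trial counts. A minimal wrapper (a sketch assuming the simplified p_f = 1/(n+2) model, under which l_i = n_i + 2; the toy fitness and R value are illustrative):

```python
from collections import defaultdict

class AdaptivePenalty:
    """Wraps a fitness function with the occupancy penalty R * l_i,
    where l_i grows as state i accumulates unsuccessful trials.
    Assumes the minimal p_f model, so that l_i = n_i + 2."""
    def __init__(self, fitness, R):
        self.fitness = fitness
        self.R = R
        self.n = defaultdict(int)    # trial counts per state

    def record_trial(self, state):
        self.n[state] += 1

    def modified_fitness(self, state):
        l = self.n[state] + 2        # rnd[1/p_f] with p_f = 1/(n + 2)
        return self.fitness(state) - self.R * l

# Toy 1D example: repeated trials at a state steadily lower its
# modified fitness, nudging any optimizer toward unexplored states.
fp = AdaptivePenalty(lambda s: -s * s, R=0.1)
before = fp.modified_fitness(3)      # -9 - 0.1*2  ≈ -9.2
for _ in range(10):
    fp.record_trial(3)
after = fp.modified_fitness(3)       # -9 - 0.1*12 ≈ -10.2
```

A simulated-annealing or hill-climbing loop would then call modified_fitness in its acceptance test (comparing F̃_i against F̃_j − R for sequential schemes, as described above) while still reporting the original fitness of the best state found.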

Results
SmartRunner can climb out of deep local maxima. To demonstrate the ability of SmartRunner to traverse local basins of attraction leading to suboptimal solutions, we have constructed a 2D fitness landscape defined by a weighted sum of two Gaussians (Fig. 2). The left basin of attraction leads to a local maximum (F⋆ = 50.17) which is much lower than the global maximum on the right (F⋆ = 78.48). The two basins of attraction are separated by a steep barrier. We start the SmartRunner runs from the left of the local basin of attraction, making sure that the walker rapidly falls there first, reaching the local maximum in a few thousand steps (Fig. 2A). Afterwards, the walker explores the local basin of attraction more and more extensively (Fig. 2B,C) until the barrier is finally overcome and the global maximum is found (Fig. 2D). The exploration strategy is automatically adapted to the features of the fitness landscape rather than being driven by external parameters such as the simulated annealing temperature.
SmartRunner performance on 4D test functions. Next, we have explored SmartRunner performance on three standard 4D test functions often used to benchmark global optimization algorithms:29 Rastrigin, Ackley and Griewank (see SI Methods for function definitions). The test functions are defined in standard hypercube ranges and supplemented with periodic boundary conditions. The resulting fitness landscapes are discretized using the same step size ∆x in all 4 directions, resulting in 1.63 × 10^9, 1.17 × 10^10 and 2.08 × 10^12 distinct fitness states for the Rastrigin, Ackley and Griewank functions, respectively. All three test functions are characterized by multiple local maxima; the unique global maximum is located at x = (0, 0, 0, 0) and corresponds to F = 0. The landscapes are explored by randomly choosing one of the directions and then increasing or decreasing the corresponding coordinate by ∆x (the nnb moveset). Fig. 3 shows the performance of SmartRunner on the Rastrigin test function. Fig. 3A is a hyperparameter scan which shows no consistent trend in the dependence of the average best-fitness values ⟨F_best⟩ on R_init, the initial rate of fitness gain per step. This is expected because the value of R is reset adaptively during the run (cf. Eq. (27)). In contrast, there is a slight preference for lower values of the optimism α. Fig. 3B shows the corresponding average number of function evaluations (unique fitness function calls), which can be used as a measure of algorithm performance, especially in cases where fitness function calls are expensive, making it advisable to focus on maximizing the average fitness gain per function evaluation. As expected, the optimal values of α correspond to the lower number of function evaluations, since lower values of α tend to favor exploitation (i.e., a more thorough search of the neighbors of the current state) over exploration (which favors more frequent jumps between landscape states). Figs. S2A,B and S2C,D show the corresponding results for the Ackley and Griewank test functions, respectively. Lower values of α work better for Ackley, while α ≥ 5 is preferable for Griewank, indicating that in general a scan over several values of α may be required. Since the Griewank landscape is considerably larger and the global maximum is not always found, we also show the maximum best-fitness value found over 50 independent runs, and the corresponding number of function evaluations (Fig. S2E,F). For lower values of α, the global maximum is not always found; rather, another high-fitness solution is reached.
With reasonable hyperparameter settings, all 50 SmartRunner runs find the global maximum of the Rastrigin landscape (Fig. 3C), requiring ≃ 15500 function evaluations on average. Fig. 3E shows three representative fitness trajectories: rapid convergence to the vicinity of the global maximum is observed in ≤ 4 × 10^4 steps, regardless of the starting state. The dynamics of global optimization strongly depend on the moveset type. To explore whether SmartRunner can adapt to movesets with non-local moves, we have considered the spmut moveset, in which a randomly chosen coordinate is changed to an arbitrary new value on the discretized landscape. Thus, instead of updating a given coordinate in ±∆x increments, most moves change the coordinate by many multiples of ∆x, creating a densely connected landscape: for example, the number of nearest neighbors is 200 × 4 = 800 for the Rastrigin function, instead of just 8 with the nnb moveset (the abbreviation spmut stands for single-point mutations, since a given x_i can 'mutate' into any other x_j, j ≠ i, from a discrete set). Fig. 4 shows that SmartRunner reliably finds the global maximum with the spmut moveset. The dependence on R_init is weak, and lower values of α are preferable (Fig. 4A). The number of fitness function calls is much higher for the same total number of steps (10^5) as with the nnb moveset (Fig. 4B,D). All 50 runs find the global maximum with optimal or nearly-optimal hyperparameter settings (Fig. 4C), and fitness trajectories quickly converge to high-quality solutions (Fig. 4E). Similar behavior is observed with the Ackley and Griewank functions: lower values of α work better, and the number of function evaluations is several times larger compared to the nnb moveset (Fig. S3). Thus, using the nnb moveset is preferable for all three landscapes.
Next, we have asked whether the observed differences in SmartRunner performance at different hyperparameter values are statistically significant. Using the Rastrigin function as an example, we have employed one-sided Kolmogorov-Smirnov (KS) tests for the best-fitness distributions (Fig. S4). The distribution with the highest average of best-fitness values in Figs. 3A and 4A was compared with all the other distributions. We find that the differences between the distributions are not statistically significant with the nnb moveset (Fig. S4A). In contrast, using α ≥ 6 with the spmut moveset leads to a statistically significant degradation of SmartRunner performance (Fig. S4B). We have also used KS tests to investigate the effects of the l_max hyperparameter (Fig. S5). Since for all three test functions the best-fitness distributions with l_max = 3 are not significantly better than the corresponding l_max = 2 distributions with the same α and R_init hyperparameter settings, we typically use l_max = 2, as it is less expensive computationally.
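The spmut moveset described above can likewise be sketched in Python. Again this is our own illustration; the 201-point grid is a hypothetical discretization chosen so that each coordinate has 200 alternative values, matching the 200 × 4 = 800 neighbor count quoted for the Rastrigin function.

```python
import random

def spmut_move(x, grid):
    """Propose a single-point-mutation (spmut) move: a randomly chosen
    coordinate 'mutates' to any other value on the discretized grid."""
    y = list(x)
    i = random.randrange(len(y))
    y[i] = random.choice([v for v in grid if v != y[i]])
    return y

random.seed(1)
grid = [round(-5.12 + 0.0512 * k, 4) for k in range(201)]  # hypothetical 201-point grid
x0 = [grid[100]] * 4                     # grid[100] == 0.0, the global maximum
y = spmut_move(x0, grid)
n_neighbors = len(x0) * (len(grid) - 1)  # 4 * 200 = 800 spmut neighbors per state
```

Compared with the nnb moveset, a single spmut trial can move the system far across the box, which is why the landscape becomes densely connected.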
The effects of the occupancy penalty on other global optimization algorithms. As mentioned above, the SmartRunner algorithm can be viewed as hill climbing on a fitness landscape F modified with the occupancy penalties (Eq. (28)). However, the modified fitness landscape can also be explored using other empirical global optimization approaches. Here, we focus on three widely used algorithms: Simulated Annealing (SA) [13], Stochastic Hill Climbing (SHC) [28], and the Evolutionary Algorithm (EA) [20][21][22] (see SI Methods for implementation details). SA is based on an analogy with a metallurgy technique involving heating followed by controlled cooling of a material to alter its physical properties [13]. The algorithm is implemented as a series of Metropolis Monte Carlo move trials [38] with a slowly decreasing temperature. SA's hyperparameters are the initial temperature T_i and the final temperature T_f, plus the expected rate of fitness gain R when the occupancy penalty is included. We use a linear cooling schedule in this work. SHC is a version of hill climbing which accepts downhill moves with the probability p = 1/(1 + exp[(F_current − F_new)/T]) [28]. Thus, p ≃ 0 in the F_new/T ≪ F_current/T limit, and p ≃ 1 in the opposite limit. SHC's search strategy is controlled by the temperature T, along with R in the case of modified landscapes. Finally, EA is inspired by the process of biological evolution [21,22]. It involves creating a population of N_pop 'organisms' (i.e., putative solutions; we use N_pop = 50 in this work). The population is initialized randomly and subjected to repeated rounds of recombination, mutation, and selection. Besides the population size, EA's hyperparameters are the crossover (recombination) rate r_x, the mutation rate µ and, for modified landscapes, the expected rate of fitness gain R.
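The SHC acceptance rule quoted above translates directly into code (a minimal sketch; the function name is ours):

```python
import math

def shc_accept_prob(f_current, f_new, T):
    """SHC probability of accepting a move from fitness f_current to f_new:
    p -> 1 for strongly uphill moves, p -> 0 for strongly downhill moves."""
    return 1.0 / (1.0 + math.exp((f_current - f_new) / T))

p_up = shc_accept_prob(0.0, 5.0, T=0.5)       # strongly uphill: p close to 1
p_down = shc_accept_prob(5.0, 0.0, T=0.5)     # strongly downhill: p close to 0
p_neutral = shc_accept_prob(1.0, 1.0, T=0.5)  # neutral move: p = 0.5
```

As T grows, the rule approaches a fair coin flip regardless of the fitness difference; as T → 0 it reduces to strict hill climbing, which is the regime discussed below for the enhanced (ESHC) runs.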
The original algorithm names (SA, SHC, EA) are reserved for runs with R = 0; runs with modified landscapes are referred to as 'enhanced' (ESA, ESHC, EEA). Fig. S6 shows the performance of ESA as a function of the initial temperature T_i and the expected rate of fitness gain R for our three test functions with the nnb moveset (although we have also performed a scan over the final temperature T_f, the dependence is weak and the results are not shown). We observe that T_i ≃ 1 values are preferable and, as expected, are accompanied by a lower number of function evaluations. Strikingly, the hyperparameter settings with the best average performance always have non-zero R: T_i = 1.0, T_f = 0.002, R = 0.1 for the Rastrigin function (the corresponding ⟨F_best⟩ = −0.017); T_i = 1.0, T_f = 0.001, R = 0.15 for the Ackley function (⟨F_best⟩ = −2.078); and T_i = 1.0, T_f = 0.001, R = 0.2 for the Griewank function (⟨F_best⟩ = −0.067). Thus, ESA outperforms SA: when using simulated annealing, the best global optimization strategy is to augment the original fitness values with the occupancy penalties. Fig. S7 shows that these observations are statistically significant.
Occupancy penalties dramatically improve SA's performance when it is run with suboptimal values of the initial temperature (Fig. 5A, Fig. S8). In fact, the results are better than those with the higher, SA-optimal values of T_i: ⟨F_best⟩ = −0.005 for Rastrigin (T_i = 0.02, T_f = 0.003, R = 0.2), ⟨F_best⟩ = 0.000 for Ackley (T_i = 0.01, T_f = 0.001, R = 0.25), and ⟨F_best⟩ = −0.015 for Griewank (T_i = 0.01, T_f = 0.003, R = 0.25). Thus, the best overall strategy is to run SA at very low temperatures (where it reduces to simple hill climbing), but on the modified fitness landscape. This is precisely the strategy implemented in SmartRunner.
Qualitatively similar results are obtained with SHC: non-zero values of R are preferable at the higher, SHC-optimal values of T (Fig. S9); the effect is statistically significant (Fig. S10). However, as Fig. 5B and Fig. S11 demonstrate, the enhancement is especially dramatic when the values of T become very low, much lower than the SHC-optimal values explored in Fig. S9. Similar to SA, low-T runs with R ≠ 0 yield the highest-quality solutions, again indicating that in the presence of occupancy penalties the best strategy is straightforward hill ascent.
Finally, occupancy penalties can rescue EA from being stuck in local maxima (Fig. 5C, Fig. S12A,C): with the nnb moveset, the population tends to condense onto a local maximum and become monomorphic. Local mutations of population members in such locally optimal states are mostly deleterious and therefore tend to get eliminated from the population. The population as a whole is thus unable to keep exploring new states, as evidenced by the low number of function evaluations in Fig. S12B,D,F compared to the other algorithms. This drawback is fixed by making the fitness landscape adaptive with the help of the occupancy penalty.
SmartRunner can be viewed as a stochastic generalization of Taboo Search (TS), a deterministic policy in which all nearest neighbors of the current state are explored one by one and the move to the neighbor state with the best fitness is accepted [27]. To prevent backtracking to already-explored states, a list of 'taboo' states is kept to which jumps are forbidden; the length of this list, L_tabu, is a hyperparameter. By construction, TS avoids visiting neighbor states more than once and is always guaranteed to find the best neighboring state to jump into; however, we expect it to lose efficiency in systems characterized by very large numbers of neighbors, since all of these neighbors have to be tried and most of them do not correspond to good solutions. In contrast, SmartRunner can make a decision to accept a move before all neighbors are explored, based on the move/neighbor statistics collected up to that point. In any event, with L_tabu ≥ 400 TS demonstrates high performance on all three test functions, requiring a relatively low number of function evaluations to achieve this result (Fig. S13).
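A minimal sketch of one Taboo Search step, as described above. This is our own simplified implementation (a fixed-length taboo list of visited states); the actual code in the SmartRunner repository may differ in details.

```python
from collections import deque

def taboo_step(x, neighbors, fitness, taboo, l_tabu):
    """One deterministic Taboo Search step: evaluate every non-taboo
    neighbor and jump to the fittest one; record the move in the taboo list."""
    candidates = [y for y in neighbors(x) if y not in taboo]
    if not candidates:
        return x                # all neighbors are taboo: stay put
    best = max(candidates, key=fitness)
    taboo.append(best)
    while len(taboo) > l_tabu:
        taboo.popleft()         # forget the oldest taboo state
    return best

# toy usage: climb the 1D fitness f(s) = -(s - 3)^2 starting from s = 0
taboo = deque()
x = 0
for _ in range(3):
    x = taboo_step(x, lambda s: [s - 1, s + 1], lambda s: -(s - 3) ** 2, taboo, l_tabu=10)
```

Note that a single step costs as many fitness evaluations as there are non-taboo neighbors, which is why TS becomes expensive on densely connected landscapes such as the spmut one.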
Interestingly, the situation is reversed with the spmut moveset: with SA and SHC, better performance is achieved when R = 0 (Fig. S14). This observation is not surprising given the somewhat special nature of the test functions we consider. As it turns out, with the Rastrigin and Ackley functions it is possible to use TS to reach the global maximum in exactly 4 steps, regardless of the initial state: each step sets one of the coordinates to 0.0 until the global maximum is attained (see Fig. S15A,B for representative trajectories). With the Griewank function, optimization depends on the initial conditions and, as a rule, additional steps are required, since the first 4 steps only bring the system to the vicinity of the global maximum (Fig. S15C). Thus, with this landscape structure it is not beneficial to jump to a new node before all neighbors of the current node are explored. In other words, premature jumping between nodes simply resets the search. In this case, R = 0 is indeed preferable and is correctly identified by our methods; however, this is a special case which we do not expect to hold true in general.
Comparison of global optimization algorithms. Different global optimization algorithms use different notions of a single step. While SA, SHC, and SmartRunner define a single stochastic trial as an elementary step, a TS step involves querying all nearest neighbors, and an EA step involves rebuilding a population subjected to crossover and mutation. To ensure a fair comparison, we have allocated a fixed number of novel fitness function evaluations to each algorithm and observed the resulting performance (SA had to be left out because its performance depends on the cooling schedule, such that stopping SA at T > T_f puts it at an unfair disadvantage). We note that with the nnb moveset, SmartRunner consistently shows the best performance (Fig. 6). As expected, the worst performance with this moveset is exhibited by EA, as it is unable to utilize more and more function evaluations to find states with better fitness; in fact, EA often uses fewer function evaluations than were allocated to it, terminating instead when the maximum number of steps is exceeded.
SmartRunner tests on the SK spin glass and Kauffman's NK models: quenched disorder. Next, we have turned to two challenging discrete-state systems with complex fitness landscapes. One is the Sherrington-Kirkpatrick (SK) spin glass model [12], with N ±1 spins coupled by random interactions that are independently sampled from the standard Gaussian distribution (SI Methods). The other is Kauffman's NK model used in evolutionary theory [39,40], in which each of the N binary sites interacts with 0 ≤ K ≤ N − 1 other sites chosen by random sampling. The fitness function for a given binary sequence is a sum over N single-site contributions; each single-site contribution is obtained by sampling from the standard uniform distribution (SI Methods). The model parameter K serves to tune the degree of landscape ruggedness: the number of local maxima increases rapidly as K goes up. In both systems, the moveset consists of changing the binary state at a single site. Thus, each of the 2^N states has N nearest neighbors. First, we focus on systems with quenched disorder, where the random parameters of the system are generated only once and subsequently used in all comparisons of global optimization algorithms.
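To make the NK construction concrete, here is a schematic generator of an NK fitness function. This is our own sketch: the interaction sampling, the lazily filled contribution tables, and the 1/N normalization are illustrative conventions; the exact definitions are in the paper's SI Methods.

```python
import random

def make_nk_fitness(N, K, seed=0):
    """Build a random NK fitness function over binary sequences of length N.
    Site i interacts with K other randomly chosen sites; its contribution is a
    uniform random number indexed by the joint state of site i and its partners."""
    rng = random.Random(seed)
    partners = [rng.sample([j for j in range(N) if j != i], K) for i in range(N)]
    tables = [{} for _ in range(N)]   # lazily sampled single-site lookup tables

    def fitness(s):
        total = 0.0
        for i in range(N):
            key = (s[i],) + tuple(s[j] for j in partners[i])
            if key not in tables[i]:
                tables[i][key] = rng.random()  # draw the contribution on first use
            total += tables[i][key]
        return total / N                       # averaged contributions (our choice)
    return fitness

f = make_nk_fitness(N=20, K=4)
s = [0, 1] * 10
F = f(s)
```

Flipping a single site changes the contribution of that site and of every site that lists it as a partner, which is why the landscape becomes increasingly rugged as K grows.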
We have carried out a SmartRunner hyperparameter search for the SK model with N = 200 spins (Fig. S16). We find that among all of the values tried, α = 0.1 is clearly preferable (Fig. S16A,C), with a statistically significant improvement in performance (Fig. S16E). On the other hand, the dependence on R_init is very weak. As expected, the number of function evaluations increases with α, as novel states are explored more frequently (Fig. S16B,D). The same conclusions are reached with the NK model with N = 200 sites and K = 8 couplings per site, with α = 0.1 identified again as the optimal value (Fig. S17). We have also explored the α settings in models with 200, 500, and 1000 spins/sites (Fig. S18). While α = 0.1 is confirmed as the optimal choice for the N = 200 models, α = 0.01 is preferable for N = 500, 1000. Finally, we carry out a side-by-side comparison of the performance of all 5 algorithms (SR, TS, SA, SHC, and EA) on the SK models (Table 1) and the NK models (Table 2). To mimic a realistic situation in which computer resources are a limiting factor and the fitness landscapes are exceedingly large, we have chosen a single set of hyperparameter settings for each algorithm. Thus, SR was run with α = 0.01, even though the above analysis shows that α = 0.1 is in fact a better choice for N = 200. The only exception to this rule is SHC, where we carried out a mini-scan over the values of T to optimize performance. All algorithms except for SmartRunner were run on the original landscapes without occupancy penalties. For SA, T_f ≃ 0 should be reasonable, while T_i = 1.0 is dictated by the overall scale of the landscape. For EA, a 3D scan over µ, r_x, and N_pop is not feasible, so we had to settle for 'typical' values. Thus, more complex algorithms with several hyperparameters are implicitly penalized, as they are unlikely to be run at their optimal settings. By these measures, SmartRunner is second for the N = 200 model and first for the larger models with 500 and 1000 sites. The somewhat weaker performance of SmartRunner on the N = 200 systems could be improved by switching to α = 0.1 (Figs. S16, S17). However, this would give SmartRunner an unfair advantage in the context of this competition, in which every algorithm was run with a single reasonable set of hyperparameters.
Prediction of the SK ground state energies averaged over disorder. Next, we have investigated the ability of SmartRunner to reproduce finite-size corrections to the average ground state energies of the SK model (Fig. 7). The ground state energy per spin averaged over random spin couplings is known theoretically to be −0.7633 in the N → ∞ limit of the SK model [42], with the 2/3 scaling exponent for finite-size corrections (i.e., ⟨E_best(N)⟩ ∼ N^(−2/3)) available from both theoretical [43] and numerical [41] investigations. This provides a baseline against which SmartRunner's ability to find the global minima of the SK energy can be judged. We find that the average ground-state energy per spin predicted by SmartRunner is reasonably close to the expected straight line in Fig. 7, although there are statistically significant deviations for the three largest systems (N = 300, 350, 400), indicating that SmartRunner does not quite reach the true ground states in these cases. Overall, SmartRunner's performance on these systems is less reliable than that of Extremal Optimization, a heuristic algorithm specifically adapted to the SK model and requiring a simplified probabilistic model for spin couplings [41,44].
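The finite-size extrapolation above amounts to a straight-line fit in the variable u = N^(−2/3). A minimal closed-form least-squares sketch (as a sanity check, we feed it points generated exactly on the fitted line quoted in the Fig. 7 caption, rather than actual run data):

```python
def fit_finite_size(Ns, Es):
    """Least-squares fit of <E_best(N)> = E_inf + m * N**(-2/3),
    i.e. a straight line in the transformed variable u = N**(-2/3)."""
    us = [n ** (-2.0 / 3.0) for n in Ns]
    k = len(us)
    u_mean = sum(us) / k
    e_mean = sum(Es) / k
    m = sum((u - u_mean) * (e - e_mean) for u, e in zip(us, Es)) / \
        sum((u - u_mean) ** 2 for u in us)
    e_inf = e_mean - m * u_mean   # intercept = asymptotic energy per spin
    return e_inf, m

# points lying exactly on the line reported in Fig. 7: E_inf = -0.7633, m = 0.7047
Ns = [50, 100, 150, 200, 250, 300, 350, 400]
Es = [-0.7633 + 0.7047 * n ** (-2.0 / 3.0) for n in Ns]
e_inf, m = fit_finite_size(Ns, Es)
```

On exact input the fit recovers the intercept and slope, so deviations seen with real SmartRunner data can be attributed to the optimizer rather than the fitting procedure.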

Discussion and Conclusion
In this work, we have developed a novel approach to global optimization called SmartRunner. Instead of relying on qualitative similarities with physical, chemical, or biological systems, SmartRunner employs an explicit probabilistic model for accepting or rejecting a move on the basis of the immediate previous history of the optimization process. The key quantity guiding SmartRunner decisions is p_f, the probability of finding a higher-fitness target in the next random trial. This probability has nearly universal asymptotics and can be effectively represented by a function that depends only on n, the number of previously rejected attempts to change the current state of the system. In other words, the dependence of SmartRunner's behavior on such details of the system as the number of nearest neighbors and the transition rates is fairly weak, making our approach applicable to a wide range of objective functions and movesets. Overall, SmartRunner can be viewed as an adaptive search policy designed to maximize fitness gain per step.
Interestingly, SmartRunner's global optimization policy amounts to hill ascent on a fitness landscape modified with an easily computed adaptive occupancy penalty. The occupancy penalty makes rejecting moves less and less favorable as the number of unsuccessful attempts to change the current state grows. Ultimately, one of the nearest neighbors is accepted even if the step is deleterious on the original fitness landscape. This behavior allows SmartRunner to climb out of local basins of attraction (Fig. 2). In principle, the adaptive fitness landscape given by Eq. (28) can be explored using any global optimization algorithm.
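To illustrate the idea, here is a toy hill climber on a penalty-modified 1D landscape. The linear visit-count penalty used here is purely illustrative and is not the paper's Eq. (28); it merely shows how a growing occupancy penalty lets a plain hill ascender escape a local basin of attraction.

```python
import random
from collections import defaultdict

def hill_ascend_penalized(x0, neighbors, fitness, penalty_rate, n_steps, seed=0):
    """Schematic hill ascent on a modified landscape F(x) - penalty_rate * visits[x].
    The penalty grows each time a state is revisited, so a walker stuck on a
    local maximum eventually accepts a sideways or downhill move."""
    rng = random.Random(seed)
    visits = defaultdict(int)
    x, best = x0, x0
    for _ in range(n_steps):
        visits[x] += 1
        y = rng.choice(neighbors(x))
        f_mod_x = fitness(x) - penalty_rate * visits[x]
        f_mod_y = fitness(y) - penalty_rate * visits[y]
        if f_mod_y >= f_mod_x:       # plain hill ascent on the modified landscape
            x = y
        if fitness(x) > fitness(best):
            best = x                 # best is tracked on the ORIGINAL landscape
    return best

# toy landscape: local maximum at index 3 (value 3), global maximum at index 12 (value 6)
vals = [0, 1, 2, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6]
nbrs = lambda s: [max(0, s - 1), min(len(vals) - 1, s + 1)]
best = hill_ascend_penalized(0, nbrs, lambda s: vals[s], penalty_rate=0.5, n_steps=5000)
```

Plain hill climbing from index 0 would stall on the local maximum at index 3; with the occupancy penalty the walker keeps moving and eventually locates the global maximum.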
We have tested SmartRunner's performance on a standard set of functions routinely used to evaluate the performance of global optimization algorithms [29]. These 4D functions are characterized by numerous local maxima that make it challenging to find the single global maximum. We find that SmartRunner exhibits the highest fitness gain per novel fitness function evaluation compared to three other state-of-the-art gradient-free algorithms (Fig. 6). This is especially important in situations where fitness function calls are computationally expensive. Interestingly, when adaptive fitness landscapes were given as input to other global optimization algorithms, the best results were obtained when the other algorithms' policy for accepting and rejecting moves closely resembled the SmartRunner policy of hill climbing on the modified fitness landscape (Fig. 5). For example, with simulated annealing the globally best strategy was to set the initial temperature to a very low value, essentially reducing simulated annealing to hill ascent. Finally, we observe that the SmartRunner approach is flexible enough to adapt to substantial changes in the moveset, from O(10^0) local moves to O(10^2 − 10^3) random updates of a single randomly chosen coordinate (Figs. 3, 4).
We have also tested SmartRunner on two challenging models with long-range couplings and multiple local minima or maxima: the Sherrington-Kirkpatrick spin glass model [12] and Kauffman's NK model of fitness [39,40]. In systems with quenched disorder, SmartRunner performs very well compared with four other general-purpose global optimization algorithms (Tables 1, 2). It is also fairly reliable in locating ground-state energies averaged over disorder in the SK model, although the results are inferior to those obtained by Extremal Optimization, a heuristic algorithm specifically adapted to finding the ground states in the SK model [41,44] (Fig. 7).
In summary, SmartRunner implements a novel global optimization paradigm which offers a viable alternative to current algorithms. The SmartRunner approach described here works on discrete or discretized fitness landscapes and does not make use of the gradient of the objective function in implementing its stochastic policy. In the future, we intend to adapt SmartRunner to carry out global optimization on continuous landscapes where the gradient of the objective function can be computed efficiently. Such optimization will be of great interest in modern machine learning; for example, training artificial neural networks relies on the differentiability of objective functions and on optimization methods based on stochastic gradient descent [6,7], which may get trapped in local minima.

Software Availability
The Python3 code implementing SmartRunner and four other gradient-free global optimization algorithms discussed here is available at https://github.com/morozov22/SmartRunner.

Figure 1: A schematic representation of the directed graph G that represents the search process. Regular nodes (system states) are represented by black circles with the corresponding fitness values; terminal nodes are shown as green circles. Directed edges connecting regular and terminal nodes are shown as dashed green lines and assigned a value of −R l_m, where l_m is the expected number of steps to find a novel beneficial move starting from regular node m, and R is the expected rate of fitness gain per step. Directed edges connecting two regular nodes are shown as solid black lines and assigned a value of the fitness difference between the two nodes. Note that the set of children of a given regular node always has one terminal node and m_p = (0 . . . N) regular nodes, depending on how much exploration has been done. In general, G may contain directed cycles.

Figure 3: SmartRunner exploration of the Rastrigin test function: nnb moveset. (A) A scan over SmartRunner hyperparameters (l_max = 2): the initial value of the expected rate of fitness gain per step R_init and the level of optimism α. Each cell in the heatmap represents best fitness values found in each run, averaged over 50 independent runs with l_tot = 10^5 steps each and randomly chosen starting states. (B) Same as A but with the average taken over the number of fitness function evaluations (unique fitness function calls) in each run. (C) A histogram of best fitness values for the heatmap cell with R_init = 0.1, α = 1.0. (D) A histogram of the number of unique fitness function calls for the heatmap cell with R_init = 0.1, α = 1.0. (E) Plots of 3 representative SmartRunner trajectories (R_init = 0.1, α = 1.0).

Figure 4: SmartRunner exploration of the Rastrigin test function: spmut moveset. Same as Fig. 3 (including SmartRunner settings and hyperparameter value settings in panels C-E), but with the spmut moveset.

Figure 5: Occupancy penalties enhance performance of global optimization algorithms: nnb moveset. (A) A scan over ESA hyperparameters (linear cooling schedule) for the Rastrigin function: the expected rate of fitness gain per step R and the initial temperature T_i (the final temperature is set to T_f = 0.001). See Fig. S8 for details and for the Ackley and Griewank functions. (B) A scan over ESHC hyperparameters for the Ackley function: the expected rate of fitness gain per step R and the temperature T. See Fig. S11 for details and for the Rastrigin and Griewank functions. (C) A scan over EEA hyperparameters for the Griewank function: the expected rate of fitness gain per step R and the mutation rate µ (the crossover rate is set to r_x = 0.2 and the population size to N_pop = 50). See Fig. S12 for details and for the Rastrigin and Ackley functions. Each heatmap cell represents best fitness values found in each run, averaged over 50 independent runs with l_tot = 10^5 steps each and randomly chosen starting states.

Figure 6: Comparison of the algorithms conditioned on the number of unique fitness function calls: nnb moveset. The maximum number of allowed function calls was set to {6250, 12500, 25000, 37500, 50000} for all algorithms. In panels A-C we show the best fitness values found in each run, averaged over 100 independent runs with randomly chosen starting states. (A) Rastrigin function. (B) Ackley function. (C) Griewank function. (D) Same as C but for the globally best fitness values obtained over all 100 runs instead of the averages. EA: Evolutionary Algorithm (µ = 0.1, r_x = 0.1, N_pop = 50); SHC: Stochastic Hill Climbing (T = 0.5); SR: SmartRunner (l_max = 2, R_init = 0.1, α = 1.0 for Rastrigin and Ackley, α = 10.0 for Griewank); TS: Taboo Search (L_tabu = 500).

Figure 7: Finite-size corrections to the ground state energy per spin in the SK model. Cyan dots: average of the best-energy values found by SmartRunner on SK landscapes with N = 50, 100, 150, 200, 250, 300, 350, 400 spins. For each value of N, 18 to 23 independent runs with l_max = 2, R_init = 0.01, α = 0.1 were carried out, each one with randomly generated spin couplings and starting from a random spin configuration. The total number of steps l_tot ranged from 4.5 × 10^6 to 3.0 × 10^7 depending on N. Error bars represent the errors of the mean. Blue dots: numerical results for the finite-size corrections to the SK ground state energy reported by S. Boettcher using the Extremal Optimization algorithm (Table 1 in Ref. [41]). Dashed green line: a linear fit to Boettcher's ground state energies yielding ⟨E_best(N)⟩ = ⟨E_best(∞)⟩ + m N^(−2/3), where m = 0.7047 is the slope and ⟨E_best(∞)⟩ = −0.7633 is the asymptotic Parisi energy for the infinite system [42]. All energies are divided by the number of spins N to produce intensive quantities.

Figure S2: SmartRunner exploration of the Ackley and Griewank test functions: nnb moveset. (A) A scan over SmartRunner hyperparameters (l_max = 2, nnb moveset) for the Ackley function: the initial value of the expected rate of fitness gain per step R_init and the level of optimism α. Each cell in the heatmap represents best fitness values found in each run, averaged over 50 independent runs with l_tot = 10^5 steps each and randomly chosen starting states. (B) Same as A but with the average taken over the number of fitness function evaluations (unique fitness function calls) in each run. (C) Same as A but for the Griewank function. (D) Same as B but for the Griewank function. (E) Same as C but with the globally best fitness value obtained over all 50 runs shown instead of the average. (F) Same as D but with the number of fitness function evaluations corresponding to the globally best run from E shown instead of the average over 50 runs.

Figure S4: Kolmogorov-Smirnov (KS) p-value analysis of the SmartRunner hyperparameter settings: Rastrigin test function. (A) One-sided KS tests were used to investigate the significance of the differences between two distributions of best-fitness values: the heatmap cell with the highest average of best-fitness values (over 50 independent SmartRunner runs) vs. every other cell. Low p-values (< 0.05) indicate that the cell with the highest average has a significantly better distribution of best-fitness values than the other cell. High p-values (≥ 0.05) indicate that the cell with the highest average has a distribution of best-fitness values that is either indistinguishable from, or worse than, the distribution in the other cell. SmartRunner was used with the nnb moveset. (B) Same as A but with the spmut moveset.

Figure S5: Kolmogorov-Smirnov (KS) p-value analysis of the SmartRunner l_max hyperparameter setting. (A) Rastrigin test function. One-sided KS tests were used to investigate the significance of the differences between distributions of best-fitness values yielded by SmartRunner runs with l_max = 3 and l_max = 2. Each KS test compared two best-fitness distributions with the same values of α and R_init. Low p-values (< 0.05) indicate that l_max = 3 runs have yielded a significantly better distribution of best-fitness values than the runs with l_max = 2. High p-values (≥ 0.05) indicate that l_max = 3 runs have yielded a distribution of best-fitness values that is either indistinguishable from, or worse than, the distribution with l_max = 2. (B) Same as A but for the Ackley test function. (C) Same as A but for the Griewank test function. All runs employed the nnb moveset.

Figure S6: Enhanced simulated annealing (ESA) hyperparameter scan: nnb moveset. (A) A scan over ESA hyperparameters (nnb moveset, linear cooling schedule) for the Rastrigin function: the expected rate of fitness gain per step R and the initial temperature T_i (the final temperature is set to T_f = 0.001 in all plots). Each cell in the heatmap represents best fitness values found in each run, averaged over 50 independent runs with l_tot = 10^5 steps each and randomly chosen starting states. (B) Same as A but with the average taken over the number of fitness function evaluations (unique fitness function calls) in each ESA run. (C) Same as A but for the Ackley function. (D) Same as B but for the Ackley function. (E) Same as A but for the Griewank function. (F) Same as B but for the Griewank function.

Figure S7: Kolmogorov-Smirnov (KS) p-value analysis of the ESA hyperparameter settings: nnb moveset. (A) One-sided KS tests were used to investigate the significance of the differences between two distributions of best-fitness values: the heatmap cell with the highest average of best-fitness values (over 50 independent ESA runs) vs. every other cell. Low p-values (< 0.05) indicate that the cell with the highest average has a significantly better distribution of best-fitness values than the other cell. High p-values (≥ 0.05) indicate that the cell with the highest average has a distribution of best-fitness values that is either indistinguishable from, or worse than, the distribution in the other cell. ESA was run on the Rastrigin function with the nnb moveset and a linear cooling schedule (Fig. S6A). (B) Same as A but for the Ackley function (Fig. S6C). (C) Same as A but for the Griewank function (Fig. S6E).

Figure S8: Enhanced simulated annealing (ESA) hyperparameter scan: low T_i, nnb moveset. Same as Fig. S6 but with suboptimally low initial temperatures T_i employed with a linear cooling schedule.

Figure S9: Enhanced stochastic hill climbing (ESHC) hyperparameter scan: nnb moveset. (A) A scan over ESHC hyperparameters (nnb moveset) for the Rastrigin function: the expected rate of fitness gain per step R and the temperature T. Each cell in the heatmap represents best fitness values found in each run, averaged over 50 independent runs with l_tot = 10^5 steps each and randomly chosen starting states. (B) Same as A but with the average taken over the number of fitness function evaluations (unique fitness function calls) in each ESHC run. (C) Same as A but for the Ackley function. (D) Same as B but for the Ackley function. (E) Same as A but for the Griewank function. (F) Same as B but for the Griewank function.

Figure S10: Kolmogorov-Smirnov (KS) p-value analysis of the ESHC hyperparameter settings: nnb moveset. (A) One-sided KS tests were used to investigate the significance of the differences between two distributions of best-fitness values: the heatmap cell with the highest average of best-fitness values (over 50 independent ESHC runs) vs. every other cell. Low p-values (< 0.05) indicate that the cell with the highest average has a significantly better distribution of best-fitness values than the other cell. High p-values (≥ 0.05) indicate that the cell with the highest average has a distribution of best-fitness values that is either indistinguishable from, or worse than, the distribution in the other cell. ESHC was run on the Rastrigin function with the nnb moveset (Fig. S9A). (B) Same as A but for the Ackley function (Fig. S9C). (C) Same as A but for the Griewank function (Fig. S9E).

Figure S12: Enhanced evolutionary algorithm (EEA) hyperparameter scan: nnb moveset. (A) A scan over EEA hyperparameters (nnb moveset) for the Rastrigin function: the expected rate of fitness gain per step R and the mutation rate µ (the crossover rate is set to r_x = 0.2 and the population size to N_pop = 50 in all plots). Each cell in the heatmap represents best fitness values found in each run, averaged over 50 independent runs with l_tot = 2000 steps each and randomly chosen starting states (each EEA step involves evaluating fitness functions for all 50 population members). (B) Same as A but with the average taken over the number of fitness function evaluations (unique fitness function calls) in each EEA run. (C) Same as A but for the Ackley function. (D) Same as B but for the Ackley function. (E) Same as A but for the Griewank function. (F) Same as B but for the Griewank function.

Figure S13: Taboo search (TS) hyperparameter scan: nnb moveset. (A) A scan over the length of the taboo list, L_tabu (nnb moveset), for the Rastrigin function. Each cell in the heatmap shows the best fitness value found in each run, averaged over 50 independent runs with l_tot = 12500 steps each and randomly chosen starting states (each TS step involves evaluating the fitness function for all 8 nearest neighbors of the current node). (B) Same as A, but with each cell showing the average number of fitness function evaluations (unique fitness-function calls) per TS run. (C) Same as A but for the Ackley function. (D) Same as B but for the Ackley function. (E) Same as A but for the Griewank function. (F) Same as B but for the Griewank function.

Figure S15: Representative TS fitness trajectories: spmut moveset. Shown is the L_2 distance D between the current 4D state and the global maximum as a function of the number of steps ℓ for 3 randomly chosen TS trajectories. TS was run with L_tabu = 100. (A) Rastrigin function. (B) Ackley function. (C) Griewank function.

Figure S16: SmartRunner (SR) hyperparameter scan: SK model with N = 200 spins. (A) A scan over SmartRunner hyperparameters (l_max = 2): the initial value of the expected rate of fitness gain per step, R_init, and the level of optimism, α. Each cell in the heatmap shows the best fitness value found in each run, averaged over 100 independent runs with l_tot = 1.5 × 10^6 steps each and randomly chosen starting states. (B) Same as A, but with each cell showing the average number of fitness function evaluations (unique fitness-function calls) per run. (C) Same as A, but with the globally best fitness value obtained over all 100 runs shown instead of the average. (D) Same as B, but with the number of fitness function evaluations in the globally best run from C shown instead of the average over 100 runs. (E) One-sided KS tests were used to investigate the significance of the differences between two distributions of best-fitness values: the heatmap cell with the highest average best-fitness value in panel A vs. every other cell. Low p-values (< 0.05) indicate that the cell with the highest average has a significantly better distribution of best-fitness values than the other cell. High p-values (≥ 0.05) indicate that its distribution is either indistinguishable from, or worse than, the distribution in the other cell.

Figure S17: SmartRunner (SR) hyperparameter scan: NK model with N = 200 sites and K = 8 nearest neighbors. Same as Fig. S16 but for the NK model.

Figure S18: SmartRunner (SR) hyperparameter scan: SK and NK models. The SK models have N = 200, 500, and 1000 spins, while the NK models have N = 200, 500, and 1000 sites, with K = 8 randomly chosen intra-sequence couplings per site. All SR runs were carried out with l_max = 2 and R_init = 0.01, and with the total number of steps l_tot = 1.5 × 10^6, 10^6, and 5 × 10^5 for the models with 200, 500, and 1000 spins/sites, respectively. (A) SK model: shown are the best fitness values found in each run, averaged over 10 independent runs with randomly chosen starting states. Error bars show standard deviations. (B) SK model: shown are the globally best fitness values obtained over all 10 runs. (C) Same as A but for the NK model. (D) Same as B but for the NK model. The dashed horizontal line is drawn at 0.7 to guide the eye.
Fitness landscape function: X → F
Move set function: X_old → X_new
Total number of iterations: l_tot
Maximum length of paths explored from each state: l_max
Initial guess of the fitness rate: R_0
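The inputs listed above suggest a driver of the following shape. This is only an interface sketch with a plain hill-ascent body standing in for SmartRunner's penalized ascent; the adaptive Bayesian penalty, the l_max path exploration, and the updating of the fitness-rate estimate are not reproduced here, and all names are illustrative:

```python
import random

def smart_runner_stub(fitness, propose_move, x0,
                      l_tot=10**5, l_max=2, r_init=0.01, seed=0):
    """Interface sketch matching the listed inputs. l_max and r_init
    parameterize the real algorithm's path exploration and initial
    fitness-rate guess R_0; this stub keeps them only to mirror the
    inputs and performs plain stochastic hill ascent instead."""
    rng = random.Random(seed)
    x, f = x0, fitness(x0)
    best_x, best_f = x, f
    for _ in range(l_tot):
        x_new = propose_move(x, rng)
        f_new = fitness(x_new)
        if f_new >= f:  # accept non-worsening moves only
            x, f = x_new, f_new
            if f > best_f:
                best_x, best_f = x, f
    return best_x, best_f
```

Each loop iteration corresponds to one step of the l_tot budget; the number of unique fitness-function calls (tracked separately in the figures above) can be smaller than l_tot when previously visited states are proposed again.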