Photonic architecture for reinforcement learning

The last decade has seen an unprecedented growth in artificial intelligence and photonic technologies, both of which drive the limits of modern-day computing devices. In line with these recent developments, this work brings together the state of the art of both fields within the framework of reinforcement learning. We present the blueprint for a photonic implementation of an active learning machine incorporating contemporary algorithms such as SARSA, Q-learning, and projective simulation. We numerically investigate its performance within typical reinforcement learning environments, showing that realistic levels of experimental noise can be tolerated or even be beneficial for the learning process. Remarkably, the architecture itself enables mechanisms of abstraction and generalization, two features which are often considered key ingredients for artificial intelligence. The proposed architecture, based on single-photon evolution on a mesh of tunable beamsplitters, is simple, scalable, and a first integration in quantum optical experiments appears to be within the reach of near-term technology.


I. INTRODUCTION
Modern computing devices are rapidly evolving from handy resources to autonomous machines [1]. On the brink of this new technological revolution [2], reinforcement learning (RL) has emerged as a powerful and flexible tool to enable problem solving at an unprecedented scale [3][4][5][6][7]. This breakthrough development was in part spurred by the technological achievements of the last decades, which unlocked vast amounts of data and computational power. One of the key ingredients for this advancement was ultra-large-scale integration [8], which led to the massive capabilities of current portable devices. Meanwhile, in the wake of this technological progress, neuromorphic engineering [9] was developed to mimic neuro-biological systems on application-specific integrated circuits (ASICs) [10]. Their improved performance is rooted in parallelized operation and in the absence of a clear separation between memory and processing unit, which eliminates off-circuit data transfers. Furthermore, new materials and ASICs are being reported to boost neuromorphic applications [11]. Among them, photonic devices represent a promising technological platform due to their fast switching time, high bandwidth and low crosstalk [12].
Inspired by the outstanding success of both RL and ASICs, here we present a novel photonic architecture for the implementation of active learning agents. More specifically, we consider an RL approach to artificial intelligence [13], where an autonomous agent learns through interactions with an environment. Within this framework, the proposed architecture can operate using any of three learning models: SARSA [14], Q-learning [15] and projective simulation (PS) [16]. The main contribution of this paper is twofold. (i) First, we describe a photonic architecture that enables RL algorithms to act directly within optical applications. For this purpose, we focus on linear-optical circuits for their intuitive description, well-developed fabrication techniques and promising features as compared to electronic processors [17][18][19][20]. For instance, nanosecond-scale routing and reconfigurability have already been demonstrated [21][22][23], while encoding information in photons enables decision-making at the speed of light, limited only by the generation and detection rates. Moreover, the use of phase-change materials for in-memory information processing [24] promises to enhance the energy efficiency, since their properties can be modified without continuous external intervention [25,26]. (ii) The second contribution is the development of a specific variant of PS based on binary decision trees (tree-PS, or t-PS for short), which is closely connected to the standard PS and suitable for implementation on a photonic circuit. Furthermore, we discuss how this variant enables key features of artificial intelligence, namely abstraction and generalization [27,28].
The paper is structured as follows. In Sec. II we summarize the theoretical framework of RL, exemplified by three common approaches: SARSA, Q-learning, and PS. In Sec. III we describe the blueprint for a fully integrated, photonic RL agent. We then numerically investigate its performance within two standard RL tasks and under realistic experimental imperfections in Sec. IV. Finally, in Sec. V we discuss promising features of this architecture within the context of t-PS.

II. REINFORCEMENT LEARNING
In this section, we briefly introduce the RL framework, which is the focus of this work. Within RL, the agent learns through a cyclic interaction with the environment (Fig. 1). The agent starts with no prior knowledge and randomly probes the environment by performing actions. The environment, in turn, responds to the actions by changing its state, which is observed by the agent through perceptual input, and by providing a reward that quantifies how well the agent is performing. The goal of the agent is then to maximize its long-term expected reward [29]. In the following, we will first describe two standard RL algorithms, SARSA [14] and Q-learning [15], before introducing the more recent PS [16].

A. SARSA and Q-learning
As for all RL algorithms, SARSA (State-Action-Reward-State-Action) and Q-learning aim at adjusting the agent's behavior until it performs optimally, in the sense we discuss in the following. The agent's behavior is defined by the policy $\pi_{a|s}$, which governs the choice of an action $a \in A$ given a state $s \in S$. The evolution of the environment under the agent's action can be described by a conditional probability distribution over all state-action-state transitions. Each transition that is taken has an associated reward $\lambda$. For a given policy $\pi_{a|s}$, the value of each state is defined by the expected future reward
$$V^{\pi}_{s} = \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t} \lambda_{t}\right].$$
Here, $\lambda_t$ is the reward received from the environment at time $t$, while the so-called discount factor $\gamma \in [0, 1]$ sets the relative importance of immediate rewards over delayed rewards, up to a temporal horizon $T$. The goal of the agent is to learn the optimal policy $\pi^{*}_{a|s}$ that maximizes the value $V^{\pi}_{s}$ for all states $s$. The expected future reward is estimated and iteratively updated through the experience the agent gains from its interactions with the environment. Instead of the value $V^{\pi}_{s}$, this estimate is more conveniently described by the Q-value $Q_{s,a}$, which quantifies the quality of a state-action pair at a given time (Fig. 2a). For both SARSA and Q-learning, this quantity is updated at each step according to
$$Q_{s,a} \leftarrow (1-\alpha)\, Q_{s,a} + \alpha \left[\lambda + \gamma f(Q_{s',a'})\right], \qquad (1)$$
where $\lambda + \gamma f(Q_{s',a'})$ is the new estimate due to taking action $a'$ in the state $s'$ observed after $s$, $f$ is a suitable function that depends on the algorithm and the learning rate $\alpha$ determines to what extent this estimate overrides the old value. Given $N$ actions and as many Q-values for state $s$, the expected future reward can be estimated as
$$V_{s} \approx \sum_{k=1}^{N} \pi_{a_k|s}\, Q_{s,a_k}.$$
In both algorithms, decision-making is usually done by sampling actions according to a probability distribution that depends on the Q-values. In the context of RL, the softmax function is a convenient choice,
$$\pi_{a|s} = \frac{e^{\beta Q_{s,a}}}{\sum_{a' \in A} e^{\beta Q_{s,a'}}}, \qquad (2)$$
where the parameter $\beta$ governs the drive for exploration within the agent. The difference between Q-learning and SARSA lies in the choice of the function $f$. In SARSA, $f$ updates the value of the current state-action pair $Q_{s,a}$ with the estimate for the following state-action pair $Q_{s',a'}$, i.e. $f$ is the identity function. In state $s'$, the action $a'$ is chosen according to the agent's policy. Thus, SARSA is called an on-policy algorithm. Q-learning, on the other hand, is an off-policy algorithm because, given the state $s'$, $f$ selects the action $a'$ with the maximal value $Q_{s',a'}$, i.e. $f = \max_{a' \in A}$, so that the update is independent of the next action chosen according to the agent's policy.
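The difference between the two algorithms can be illustrated with a short Python sketch. It assumes the standard temporal-difference form $Q_{s,a} \leftarrow (1-\alpha) Q_{s,a} + \alpha[\lambda + \gamma f(Q_{s',a'})]$ and a softmax policy with parameter $\beta$; the function names and the nested-list Q-table are illustrative choices, not part of the original work.

```python
import math

def softmax_policy(q_values, beta):
    """Return action probabilities proportional to exp(beta * Q)."""
    shift = max(q_values)  # subtracting the maximum avoids overflow
    weights = [math.exp(beta * (q - shift)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def td_update(q, s, a, reward, s_next, a_next, alpha, gamma, algorithm):
    """Update q[s][a] in place and return the new value.

    q is a list of lists indexed as q[state][action] (hypothetical layout).
    """
    if algorithm == "sarsa":
        # On-policy: f is the identity, evaluated at the action actually chosen.
        future = q[s_next][a_next]
    elif algorithm == "q_learning":
        # Off-policy: f is the maximum over all actions in the next state.
        future = max(q[s_next])
    else:
        raise ValueError("algorithm must be 'sarsa' or 'q_learning'")
    q[s][a] = (1 - alpha) * q[s][a] + alpha * (reward + gamma * future)
    return q[s][a]
```

With identical inputs, the two rules differ only when the action chosen in $s'$ is not the one with the largest Q-value, which is exactly the on-policy versus off-policy distinction.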

B. Projective simulation
FIG. 1. Reinforcement learning in a photonic circuit. In RL, an agent learns by interacting with its environment. Each new observation is internally processed until an action is chosen and performed on the environment. The processing unit, characterized by χ-values, adapts the agent's behavior according to a specific update rule in order to maximize the expected future reward within a given environment. This unit can be implemented on an integrated photonic circuit.
PS is a recent, physically-motivated RL model [16], which has already found several applications ranging from robotics [30] and quantum error correction [31] to the study of collective behavior [32] and automated experiment design [33]. Decision-making in PS occurs in a network of clips that constitutes the agent's episodic and compositional memory (ECM) (Fig. 2a). Each clip represents a remembered percept, a remembered action or a more complex combination thereof. The ECM can accommodate a multilayer structure, where intermediate layers represent abstract clips and connections. Decision-making is carried out by a random walk through the ECM, starting at a percept clip and ending at an action clip which triggers the corresponding action. The random walk is guided by transition probabilities between pairs of clips $(c_i, c_j)$, connected by edges carrying weights $h_{c_i,c_j}$, by considering probabilities proportional to $h_{c_i,c_j}$ or by using the softmax function $\pi_{c_j|c_i}(h_{c_i,c_j})$ as in Eq. 2. Learning occurs by updating the clip network in the agent's memory, i.e. by changing its topology or the edge weights $h_{c_i,c_j}$. In the latter case, the update rule at time $t$ has the form
$$h^{(t+1)}_{c_i,c_j} = h^{(t)}_{c_i,c_j} - \gamma \left(h^{(t)}_{c_i,c_j} - 1\right) + g^{(t)}_{c_i,c_j}\, \lambda, \qquad (3)$$
where $\gamma \in [0, 1)$ is a damping parameter, $\lambda$ is the reward and $g_{c_i,c_j} \in [0, 1]$ is the so-called edge-glow value or g-value.
Here, $\gamma$ and $g_{c_i,c_j}$ implement mechanisms that take into account forgetting and delayed rewards, respectively. More specifically, the damping parameter $\gamma$ is essential for environments that change over time, effectively damping h-values at each time step. The edge-glow values serve to backpropagate discounted rewards to earlier sequences of actions. The g-values are updated at each time step: whenever an edge $(c_i, c_j)$ is traversed, $g_{c_i,c_j}$ is set to 1, and from then on its value is discounted as
$$g^{(t+1)}_{c_i,c_j} = (1 - \eta)\, g^{(t)}_{c_i,c_j}, \qquad (4)$$
where $\eta \in [0, 1]$ is the glow parameter. Consequently, g-values are rescaled according to $g_{c_i,c_j} = (1 - \eta)^{\delta t_{c_i,c_j}}$, where $\delta t_{c_i,c_j}$ is the number of steps between the round when $(c_i, c_j)$ is traversed and the round when a reward is issued. Intuitively, values of $\eta$ close to 1 reward sequences of actions only in the immediate past, while values close to 0 are used to reward longer sequences. Edge glow is thus relevant for environments with delayed rewards, such as the GridWorld [13] discussed in Sec. IV. For a more detailed description of PS we refer the reader to Refs. [28,35,36].
FIG. 2. [...] and χ-values for their photonic implementation. b) Arbitrary probabilistic transitions can be implemented on a photonic platform with a cascade of beamsplitters, whose transmissivities $\tau$ reproduce the distribution given by the current policy. c) Tunable beamsplitters at each node $(k, l)$ can be implemented using Mach-Zehnder interferometers with tunable phase shifts $\theta_{c,kl}$ [34]. Phases are adjusted according to a quantity $\chi_{c,kl}$, which is updated during the learning process.
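The interplay between damping and edge glow can be made concrete with a minimal Python sketch. It assumes the standard PS update $h \leftarrow h - \gamma(h - 1) + g\lambda$ and the glow decay $g \leftarrow (1-\eta) g$; the dictionary layout, the function name and the ordering of the sub-steps (decay, mark the traversed edge, then apply the reward) are illustrative assumptions.

```python
def ps_update(h, g, traversed_edge, reward, gamma, eta):
    """Apply one time step of the PS learning rule in place.

    h and g are dicts with identical keys, one per edge (c_i, c_j).
    """
    # Glow of all edges decays by a factor (1 - eta) at each step ...
    for edge in g:
        g[edge] *= (1.0 - eta)
    # ... except for the edge just traversed, whose glow is set to 1.
    g[traversed_edge] = 1.0
    # Damping pulls h back towards 1; the reward is weighted by the glow.
    for edge in h:
        h[edge] = h[edge] - gamma * (h[edge] - 1.0) + g[edge] * reward
```

In this sketch an edge traversed $\delta t$ steps before the reward arrives receives the share $(1-\eta)^{\delta t} \lambda$, in line with the rescaling described above.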

III. PHOTONIC REINFORCEMENT LEARNING
In order to implement RL on a photonic platform we need to satisfy two requirements: (i) implement arbitrary probabilistic transitions between clips and (ii) update the corresponding probability distributions in a controlled and effective way. A practical platform has to satisfy further criteria that are crucial for any implementation, such as scalability, ease of fabrication and miniaturizability. In this section, we will describe a linear optical architecture that is tailored to the task at hand, i.e. designing integrated photonic hardware for RL, in the spirit of neuromorphic engineering [12].

A. Decision trees as linear optical circuits
Using a bottom-up approach [37], we focus on the implementation of state-action (SARSA and Q-learning) or clip-to-clip (PS) transitions, as shown in Fig. 2a. For PS, each clip-to-clip transition is a building block for the random walk in the agent's memory. For brevity, we will only consider clip-to-clip transitions $(c,c')$, which are equivalent to state-action pairs $(s, a)$ for a two-layer ECM. Each transition is governed by the probability distribution of detecting a single photon over the output modes. The architecture we present consists of a cascade of reconfigurable beamsplitters arranged in a tree structure (Fig. 2b), which maps a single input mode (associated with a clip) to $N$ output modes (corresponding to as many clips). Such an association can be initialized randomly or according to prior knowledge about the environment. Fully-reconfigurable linear-optical interferometers like this one allow one to engineer arbitrary probability distributions over the optical modes and, given a probability distribution, it is possible to determine a set of phases that reproduces it exactly (see Sec. VII A). In the next section, we also provide further considerations on various layouts that can be adopted.
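Decision-making in such a tree can be mimicked classically by following a single photon from the root to one output mode. The Python sketch below assumes that the photon takes the upper output of node $(k,l)$ with probability $\sin^2\theta_{c,kl}$, as in Sec. VII A, and that the upper branch leads to the lower-indexed leaves; the function name and the nested-list layout of the phases are illustrative.

```python
import math
import random

def sample_output_mode(thetas, rng):
    """Follow one photon through a binary tree of tunable beamsplitters.

    thetas[k][l] is the phase at node l of layer k (both 0-indexed here),
    so layer k holds 2**k phases. Returns the index of the output mode.
    """
    node = 0
    for layer in thetas:
        p_upper = math.sin(layer[node]) ** 2
        takes_upper = rng.random() < p_upper
        # Children of node l in the next layer are 2l (upper) and 2l + 1 (lower).
        node = 2 * node + (0 if takes_upper else 1)
    return node

# Example: a balanced two-layer tree selects each of the 4 clips with probability 1/4.
balanced = [[math.pi / 4], [math.pi / 4, math.pi / 4]]
chosen_clip = sample_output_mode(balanced, random.Random(1))
```

Only the phases along the sampled path influence a given decision, which is why a tree of depth $n$ needs $n$ beamsplitter settings per transition even though it addresses $2^n$ outputs.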
To employ this architecture for RL, we consider the following operational scenario: the current policy is stored electronically [18,20] in the phase shifters that define the single-photon evolution in the circuit and, consequently, the probabilistic decision-making. Each phase shifter $\theta_{c,kl}$ at node $(k,l)$ is set to implement the transition probabilities for the corresponding clip-to-clip connections. Decision-making (Fig. 2a) is hence realized as a single-photon evolution in a mesh of tunable beamsplitters (Fig. 2b), where the transition to the next state is made by detecting [38] the photon in one of the output modes. Overall, this approach satisfies the requirement of arbitrary probabilistic transitions (i) described at the beginning of this section. Furthermore, it provides a solution that is scalable (one only needs to store the phases that implement a given transition) and that can be fully integrated on a miniaturized photonic chip [20]. Importantly, sensors could be integrated on an optical chip, gyroscopes and magnetometers being first examples in this direction [18].
Concerning the second requirement (ii), to learn an optimal policy we want the agent to autonomously adjust the phases $\theta$ according to a suitable update rule. To this end, we first consider the path $\Gamma_{c,c'}$ that connects clips $(c,c')$ and express the phases $\theta_{c,kl}$ in the transition probability as a function of a quantity $\chi$ that is updated during the learning process, namely $\theta(\chi) = \theta_0 + \theta_\chi$. Here, $\theta_0 = \pi/4$ corresponds to the configuration where all transitions are equally probable, while $\theta_\chi$ spans the whole range of transition probabilities, namely $\theta_\chi \in [-\pi/4, \pi/4]$. Suitable candidates for $\theta_\chi$ are the sigmoid functions [39], which are monotonically increasing, bounded, and defined over all real numbers. We then use the function $\theta_\chi = \tanh\chi$, rescaled to this interval, so that
$$\theta_{c,kl} = \frac{\pi}{4}\left(1 + \tanh\chi_{c,kl}\right), \qquad (5)$$
where the quantity $\chi$ is updated according to a suitable update rule within the framework of RL. For SARSA (S) and Q-learning (QL), we update $\chi$ according to the rules of Eqs. 6 and 7, which involve the depth of the circuit and the quantities $R_c$ and $M_c$. In the notation used in Sec. II A for SARSA and Q-learning, subscripts in $R_c$ and $M_c$ refer to states. Comparing the original Q-value update rule in Eq. 1 with the update rule in Eq. 6, we emphasize that Eq. 6 does not simply reproduce Eq. 1 using $\chi$. The reason is that Q- and χ-values provide different information, the former quantifying the quality of a clip-to-clip connection, the latter defining the splitting ratio at each beamsplitter. Indeed, though related once the agent has properly learned the policy, the two quantities are not directly linked during the learning process. For instance, when one clip-to-clip connection $(c,c')$ is favorable (large Q-value) the policy $\pi_c$ is peaked (i.e. χ-values far from zero), but when multiple $(c,c')$ pairs are favorable (large Q-values) the policy $\pi_c$ is less peaked (χ-values closer to zero). Therefore, a feature we demand is to keep track of the overall quality of each state, from which the χ-values will reproduce the relative quality of each $(c,c')$ connection. We fulfill this task in Eq. 7, introducing a new parameter (in addition to the $N - 1$ phases) that updates the agent's confidence in the quality of clip $c$. Also, the peakedness of each policy can be quantified by the average deviation from 0 (corresponding to a flat distribution) of the χ-values in each path, as done in $M_c$ in Eq. 6.
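The role of $\chi$ as an unbounded learning variable can be illustrated with a few lines of Python. The sketch assumes that the tanh parametrization is rescaled so that $\theta(\chi) = \frac{\pi}{4}(1 + \tanh\chi)$ covers $[0, \pi/2]$, and that the upper-path probability at a node is $\sin^2\theta$ as in Sec. VII A; both function names are hypothetical.

```python
import math

def chi_to_phase(chi):
    """Map an unbounded chi-value to a phase in [0, pi/2] (assumed rescaling)."""
    return (math.pi / 4) * (1.0 + math.tanh(chi))

def upper_probability(chi):
    """Probability of taking the upper path, assuming p = sin^2(theta)."""
    return math.sin(chi_to_phase(chi)) ** 2

# chi = 0 gives a balanced beamsplitter; large |chi| saturates the decision.
for chi in (-3.0, 0.0, 3.0):
    print(chi, round(upper_probability(chi), 4))
```

Because tanh saturates, arbitrarily large updates of $\chi$ never push the phase outside its physical range, which is the practical reason for choosing a sigmoid.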
Besides SARSA and Q-learning, we can choose to operate in the framework of PS. In this case, we evolve $\chi$ according to the rule
$$\chi^{(t+1)}_{c,kl} = \chi^{(t)}_{c,kl} - \gamma\, \chi^{(t)}_{c,kl} + g^{(t)}_{c,kl}\, \lambda, \qquad (8)$$
which is equivalent to the update rule for $h_{c,c'}$ in Eq. 3, considering that $h_{c,c'}$ ($\chi_{c,kl}$) is initialized to 1 (0). Notably, the choice $\theta_\chi = \tanh\chi$ in Eq. 5 establishes a formal connection between the proposed architecture and a specific variant of PS, which we call tree-PS (t-PS). This connection is derived in Sec. VII B 1. In t-PS, every clip-to-clip transition is implemented as a binary decision tree between the input and the output clips. In Sec. VII B 2 and Sec. VII B 3 we prove that t-PS can reproduce the operation of the two-layer PS, which has been discussed extensively in the literature. While the two models appear to have the same representational power, t-PS provides an additional structure that can be exploited to enhance the learning process, as we describe below in Sec. V A.

B. Photonic architecture for the agent's memory
The architecture described in Fig. 2b, which represents the building block for decision-making, can take advantage of an efficient design enabled by its fractal geometry [40]. In this section, we will outline three approaches to implement learning and decision-making starting from such a building block. First, we can adopt a simple strategy where the circuit consists of a single decision tree: once a photon is detected (thus selecting a clip in the next layer), all phase shifters are adjusted to implement the next transition and another photon is injected into the same circuit. Similarly, we can devise a loop-based implementation where photons are redirected back to the input while the circuit is reconfigured. Though appealing, this approach is more challenging since it requires nonlinearities to detect the presence of a photon in the output modes [41]. Finally, we can conceive a more sophisticated scheme that fully exploits the advantages of a photonic platform. Here, all building blocks are arranged in a planar structure (Fig. 3) that represents the memory of the agent (Fig. 2a). In the latter configuration, decision-making corresponds to a single-photon random walk from the input to the output layer. Input photons are routed through a bus waveguide and optical switches [34,44] to one layer (out of $L$), where a clip-to-clip transition is performed in a decision tree. The layered architecture is meaningful only for PS, where it represents the $L$-layer structure of an acyclic ECM, while for SARSA and Q-learning it is a convenient geometry to make the integrated circuit more compact. Fast and efficient routing [23,45], controlled by a feedback system that also monitors photon losses, guides single photons to the appropriate building block. Photons exit the tree in one of $N$ waveguides (forming a second, reversed binary tree [46]), whose root node leads to a second bus waveguide connected to the detection stage. To find out which clip (i.e. output waveguide) was selected, one possibility is to add $N - 1$ different delay lines [47,48] to the reversed tree and look at the time bin where the photon was detected.
FIG. 3. Single photons are routed by optical switches through $L$ layers, where state-action ($L = 1$) or clip-to-clip transitions are performed (Fig. 2). For PS, in this paper we only consider acyclic ECMs. Photons are then time-multiplexed, using delay lines, before reaching the detection stage in a single waveguide (whose outcome controls the optical switches and possible updates). In principle the system can even be self-stabilized [34,42,43].
An interesting feature of this approach is that it can take advantage of phase-change materials (PCM) [25,26] to realize the phase shifters, whose physical properties can be modified in a reversible and controlled way with a single write operation [24]. The intuition is that only the phases corresponding to traversed paths need to be updated, while the others remain fixed without any additional power consumption. Hence, the number of updates scales only logarithmically with the number of output clips. In Sec. VII C, we discuss how both computational complexity and energy consumption are comparable even to those of an electronic ASIC that exploits high locality and specialized data structures. Notably, using the circuit for self-optimization in optical interferometers eliminates the need for separate generation and detection stages, since photons can be part of the embedding application. In addition, decision-making after learning consumes practically no power since phase shifters no longer need to be adjusted.

IV. TESTING THE ARCHITECTURE
In this section, we employ the proposed architecture in a standard testbed for RL, the GridWorld environment [29]. This task is of broad relevance since any stationary, fully observable environment can be reformulated in this frame [29], notable examples being Atari games [3] and Super Mario Bros. [49]. Henceforth, we will focus on (two-layer) PS, both because of its simpler update rule (Eq. 8) and in order to investigate the potential of t-PS. Indeed, GridWorld has already been investigated in the context of PS [35], a relevant example being the design of optical experiments, which was shown to be representable as a generalized GridWorld [33]. Furthermore, note that for both SARSA and Q-learning we numerically observed a performance very similar to that of PS.
In the simplest formulation of the problem, the goal for the agent is to maximize its long-term expected reward while navigating an environment structured as a planar grid-like maze.
The agent starts from a fixed location $p_A = (x_A, y_A)$ and is challenged to learn the shortest path that leads to a reward at location $p_R = (x_R, y_R)$. Available to the agent is a set of actions $(x_\pm, y_\pm)$, where $x_\pm$ corresponds to a movement in the positive/negative x-direction. The learning process is divided into a sequence of episodes, or trials, where the agent interacts with the environment until a predetermined condition is met.
In our analyses, the agent is reset if either the number of interactions in one episode exceeds $10^3$ or a reward is obtained. To account for delayed rewards, the edge-glow mechanism (see Sec. II B) rescales the reward $\lambda$, assigned to a traversed transition $(c_i, c_j)$, by a quantity that decreases exponentially with the number of steps that pass until a reward is received [29,35].
The above formulation can be extended to more complex scenarios, which include higher-dimensional mazes with walls, sophisticated moves and/or penalties. For our investigation we employed a 3D GridWorld with walls: whenever the agent tries to move onto the border of the grid or onto a wall, a time step is counted but no movement occurs. We chose a 3D maze, rather than a 2D or a 4D grid, to investigate more complex configurations that could still be visually inspected. As an example (see inset in Fig. 4), we considered a 10 × 10 × 10 GridWorld where the agent starts at position $p_A = (3, 1, 4)$ and a reward is hidden at position $p_R = (9, 9, 9)$. Fig. 4a shows the average learning curve numerically simulated for a photonic agent navigating this maze. We observe that the average path length rapidly decreases with the number of trials, from $\sim 10^3$ (where the agent behaves like a random walker) to values close to the minimum path length (19 in this case).
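The environment described above is simple enough to be sketched in a few lines of Python. The class below is an illustrative stand-in, not the simulation code used for Fig. 4: the 0-based coordinates, the unit reward, the treatment of the grid border as an impassable boundary and the wall set passed by the caller are all assumptions.

```python
class GridWorld3D:
    """Minimal 3D GridWorld with walls and an episode cap (illustrative)."""

    MOVES = {
        "x+": (1, 0, 0), "x-": (-1, 0, 0),
        "y+": (0, 1, 0), "y-": (0, -1, 0),
        "z+": (0, 0, 1), "z-": (0, 0, -1),
    }

    def __init__(self, size=10, start=(3, 1, 4), goal=(9, 9, 9),
                 walls=(), max_steps=1000):
        self.size = size
        self.start = start
        self.goal = goal
        self.walls = set(walls)
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        self.position = self.start
        self.steps = 0
        return self.position

    def step(self, action):
        """Return (position, reward, done) after attempting the move."""
        self.steps += 1  # a time step is counted even if the move is blocked
        delta = self.MOVES[action]
        candidate = tuple(p + d for p, d in zip(self.position, delta))
        inside = all(0 <= c < self.size for c in candidate)
        if inside and candidate not in self.walls:
            self.position = candidate
        rewarded = self.position == self.goal
        done = rewarded or self.steps >= self.max_steps
        return self.position, (1.0 if rewarded else 0.0), done
```

Without walls on the way, the shortest path from (3, 1, 4) to (9, 9, 9) is the Manhattan distance 6 + 8 + 5 = 19, which matches the minimum path length quoted above.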
The same numerical analysis was carried out simulating a non-ideal implementation of photonic PS, to test to what extent experimental imperfections are expected to spoil the process. To this end, each time phases were adjusted in the simulated device, Gaussian noise was added on top of the ideal value (a more detailed description of how imperfections were modeled is reported in Sec. VII D). Remarkably, we find that a realistic amount of noise can even aid the learning process, a feature that can be ascribed to an enhanced tendency of the agent to explore new paths. In Sec. VII D, we also expand on this aspect, which is reminiscent of the phenomenon of stochastic resonance [50], providing a visual intuition in support of this interpretation. Ultimately, the fact that realistic levels of noise can enhance the agent's learning process makes the present approach even more appealing for a concrete implementation. Indeed, not only does the architecture exhibit a natural resilience to noise, but this very resilience also relaxes the (often challenging) technological requirements for isolation and stability.

V. t-PS WITH GENERALIZATION AND ABSTRACTION
While the two-layer and the tree-based implementations of PS have the same representational power (see Sec. VII B), t-PS provides an additional structure that can be exploited to boost the learning process. As we will see, this feature allows an agent to exhibit simple forms of abstraction and generalization, which play a central role in artificial intelligence [27]. Abstraction is the ability of an agent to filter out less relevant details, a process that involves a modification in the representation of the object. Generalization corresponds to the ability to identify similarities between objects, without necessarily affecting their representation. In this section, we will describe how an agent can take advantage of these features by suitably ordering the clips over the output modes according to some measure of relevance, such as the reward.

A. Generalization and abstraction
To introduce the notions of generalization and abstraction in the present architecture, let us start by considering the simplest case of a 2D GridWorld in the XY plane without walls. Given the tree structure of t-PS, we can expect there to be a beneficial arrangement of action clips over the outputs. Nodes in t-PS can represent meaningful sub-decisions towards a final decision made at the leaf nodes. Since nodes closer to the root are updated more regularly, sub-decisions can, in principle, be learned before the final policy is obtained. Of course, initially, nodes are not necessarily ordered in a way that has a meaningful interpretation. However, the agent can sort them during the learning process such that intermediate nodes obtain meaning which, in turn, guides the agent's decision-making.
Motivated by the above considerations, we propose a simple mechanism, which we call defragmentation, that is specifically designed to address this issue, though its benefits are not limited to this scenario. The name defragmentation is inspired by the usual process that occurs in hard disks, which improves performance by reallocating fragments of memory according to dependencies and usage. The mechanism consists of (1) keeping track of the cumulative reward assigned to each action and (2) sorting actions over the output modes according to their respective cumulative reward. More sophisticated rules can also be designed for step (1), perhaps tailored to capture correlations in time or more intricate patterns between actions. From a practical perspective, step (2) only requires computing the new phases that produce the reordered probability distribution (see Sec. VII A). In any case, whenever there are two or more rewarded actions, this mechanism favors the separation between good and unfavorable actions. It is precisely in its capability of grouping together actions of comparable relevance, e.g. similar collected rewards in the present context, that the agent expresses an elementary form of generalization [28]. For instance, in a 2D GridWorld actions can be conveniently organized according to a hierarchy of criteria (Fig. 5), e.g. move 'forward' or 'backwards' and move 'along X' or 'along Y', resulting in composite actions such as 'up' ('forward' and 'Y') or 'left' ('backwards' and 'X'). Numerical analyses involving defragmentation on both 2D and 3D GridWorld show that the agent does autonomously discover structures analogous to the one in Fig. 5, suggesting that this generalization feature is beneficial and informative, and that it can be used in more complex scenarios.
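The two steps of defragmentation can be sketched in Python as follows. The function name, the data layout and the choice of a descending, stable sort are illustrative assumptions; the original work may order or break ties differently.

```python
def defragment(actions, probabilities, cumulative_reward):
    """Reorder actions over the output modes by cumulative reward.

    actions[i] is the action attached to output mode i, probabilities[i]
    its current selection probability, and cumulative_reward maps each
    action to the total reward collected so far (step 1 of the mechanism).
    Returns the reordered actions and their probabilities (step 2).
    """
    order = sorted(range(len(actions)),
                   key=lambda i: cumulative_reward[actions[i]],
                   reverse=True)
    new_actions = [actions[i] for i in order]
    new_probabilities = [probabilities[i] for i in order]
    return new_actions, new_probabilities

# Example: two rewarded actions end up under the same branch of the tree.
acts, probs = defragment(
    ["x+", "x-", "y+", "y-"],
    [0.1, 0.4, 0.2, 0.3],
    {"x+": 0.0, "x-": 5.0, "y+": 1.0, "y-": 5.0},
)
```

After sorting, the policy itself is unchanged (each action keeps its probability), so the only hardware operation required is recomputing the phases that reproduce the reordered distribution.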
Naturally, defragmentation, as a way of knowledge exploitation, consumes time that has to be balanced with that reserved for exploration. Nevertheless, in the usual compromise between exploration and exploitation [29], the longer the agent explores the environment to assess the quality of an action, the more reliable and successful its generalization process will be. At a certain time, once a stronger representation is built in its memory, the agent could even perform a sort of abstraction by cutting out the least relevant actions, so as to focus only on those that are deemed more favorable. In RL tasks with large-scale action spaces, this process could even be iterated to progressively reduce the search space for good actions. Indeed, the photonic architecture enables this mechanism to be straightforwardly implemented, by simply setting specific transition probabilities to 0 or 1, which isolates all the subsequent branches of optical components. This feature could, in turn, entail a reduction in computational resources and, possibly, in learning time.

B. Exploiting the tree-like structure
To provide quantitative evidence for the above considerations, we numerically applied defragmentation to another standard problem in RL, the multi-armed bandit [29]. In its general formulation, an agent is presented with $N$ bandits (for instance, slot machines) characterized by a probabilistic reward function and, at each time step, the agent is allowed to pull the arm of one of the bandits (which issues a reward drawn from the corresponding distribution). Effectively, this gives an environment with one state and $N$ possible actions. We consider a variant of the problem with additional structure in its action space, referred to in the literature as combinatorial multi-armed bandit [51]. In this task, bandits (i.e., actions) are grouped in sub-categories according to a set of features. In the example described above, these features could be the casino, city, country, etc. in which the slot machine is situated. This structure is provided to the agent at an abstract level (the dependence between features is not specified) by dividing the allowed actions into several sub-actions. As a result, the action space $A = \{1, ..., N\}$ factorizes to $A = A_1 \times A_2 \times ... \times A_k$, where $|A_i| = n_i$ is the number of possible choices for sub-action $A_i$, and $N = \prod_i n_i$. This kind of factorization is analogous to the decomposition of the state and action space into categories that was considered in Ref. [28], except that the structure we consider here is imposed on actions. For simplicity, let us assume that a deterministic reward $r_a$ is associated with each action $a = (a_1, ..., a_k)$, but that this reward distribution depends (partially) on the structure of the action space. The agent can then exploit the factorized structure to choose the best sub-actions according to their influence on the reward. In this regard, the proposed architecture can be particularly effective since consecutive levels can separately focus on each $A_i$. Moreover, a mechanism to rearrange the layers (such as the defragmentation described in Sec. V A) can shift the layers associated with the most relevant sub-actions closer to the root, capturing correlations between actions and facilitating learning. In the above example, the agent could learn that the choice of a city is more relevant than the choice of a particular casino in that city, because casinos in a certain city are more lucrative, and choose the city earlier in the deliberation.
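The factorization $A = A_1 \times ... \times A_k$ with $N = \prod_i n_i$ amounts to a mixed-radix encoding of the action index, as the following Python sketch shows. The 0-based indexing and the function names are illustrative assumptions; the text labels actions from 1 to $N$.

```python
from math import prod

def decompose_action(index, sizes):
    """Split a flat action index in range(N) into sub-action indices.

    sizes = [n_1, ..., n_k]; the first sub-action is the most significant
    digit, i.e. the one decided closest to the root of the tree.
    """
    if not 0 <= index < prod(sizes):
        raise ValueError("index out of range for the given sizes")
    digits = []
    for n in reversed(sizes):
        index, remainder = divmod(index, n)
        digits.append(remainder)
    return tuple(reversed(digits))

def compose_action(sub_actions, sizes):
    """Inverse of decompose_action."""
    index = 0
    for a, n in zip(sub_actions, sizes):
        index = index * n + a
    return index

# Example: 2 cities with 3 casinos each give N = 6 bandits.
city, casino = decompose_action(4, [2, 3])
```

Reordering the entries of `sizes` corresponds to moving a sub-action closer to or further from the root, which is the degree of freedom that the layer rearrangement described above exploits.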
We expand on the above considerations in more detail in Sec. VII E with a simple example. In the following, we will focus on the performance boost induced by the defragmentation of the action space. Fig. 6 shows quantitative evidence of this boost in an instance of the bandit problem where two actions are always rewarded. Analogous advantages can be found in the 3D GridWorld described in Sec. IV, where only a subset of directions is relevant and grouping them is beneficial for the agent. In particular, these numerical results show that defragmentation speeds up the learning process, i.e. fewer trials are required to find an optimal policy. This situation is indeed typical in RL, where exploitation of current knowledge reduces the time spent on exploration. From a practical perspective, this feature facilitates learning scenarios where interactions with the environment are costly. For these reasons, the proposed t-PS appears as a promising platform in the framework of RL, being able to support key features for artificial intelligence (in the form of a basic generalization and abstraction) while preserving good control over its operation and performance.

VI. DISCUSSION
The development of autonomous agents capable of learning by interacting with an environment has seen a tremendous surge of interest over the past decade [3][4][5]7]. Recently, RL has even claimed its place in the list of the top breakthrough technologies with the largest and broadest impact [52]. Similarly, the design of neuromorphic application-specific hardware has attracted massive attention due to its enhanced computational capabilities in terms of speed and energy efficiency [9]. In this work, we propose a blueprint for an application-specific integrated photonic architecture capable of solving problems in RL. Within this framework, the architecture easily accommodates various well-established RL algorithms such as SARSA, Q-learning, and PS. Also, its simple and scalable design warrants near-term implementations and is apt for embedding in portable devices. Indeed, all required optical components have already been experimentally demonstrated on integrated circuits [17][18][19][20][21][22].
We investigated the proposed platform both numerically and analytically, confirming the efficacy of the model also under realistic, imperfect experimental conditions. Besides its efficacy, the architecture enables a novel implementation of PS (t-PS) that is inspired by the geometry of the integrated circuit. This model not only exhibits some key features of artificial intelligence, namely generalization and abstraction, but can also boost its learning performance via autonomous defragmentation of its memory. Indeed, both numerical and analytical results suggest that t-PS performs at least as well as the simulated standard PS model, which has already found various applications [30][31][32][33]. Eventually, we envisage the experimental realization of a photonic RL agent which successfully exploits all these features within an optical environment.

VII. APPENDIX
A. Programming the architecture
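Here we describe how to create an arbitrary output probability distribution in t-PS by tuning the parameters available in a photonic architecture, i.e. the phases $\theta$. Given a clip $c$ and an associated output probability distribution $\{q_{c,i}\}$, we can analytically retrieve the set of phases $\theta_{c,kl}$ that reproduces the probability distribution in the $n$-layer tree architecture. To this end, we consider the ratio $\xi_{c,kl}$ of the probabilities of taking the upper ($p_{c,kl}$) or the lower ($1 - p_{c,kl}$) path at node $(k,l)$ (Fig. 2),

$$\xi_{c,kl} = \frac{p_{c,kl}}{1 - p_{c,kl}} = \frac{\sum_{i \in U_{kl}} q_{c,i}}{\sum_{i \in D_{kl}} q_{c,i}},$$

where $k \in [1, n]$, $l \in [1, 2^{k-1}]$ and the sum in the numerator (denominator) runs over the output modes associated with the upper (lower) path. In particular, if we label the output nodes from 1 to $2^n$, we find that

$$U_{kl} = \{\, i : (2l-2)\,2^{n-k} < i \le (2l-1)\,2^{n-k} \,\}, \qquad D_{kl} = \{\, i : (2l-1)\,2^{n-k} < i \le 2l\,2^{n-k} \,\}. \tag{10}$$

Writing $p_{c,kl} = \sin^2 \theta_{c,kl}$ (as we are dealing with phases in the photonic t-PS), the ratio becomes $\xi_{c,kl} = \tan^2 \theta_{c,kl}$ and we finally get $\theta_{c,kl} = \arctan \sqrt{\xi_{c,kl}}$.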
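The procedure above can be illustrated with a minimal Python sketch. It assumes the convention $p = \sin^2\theta$ for the upper path, so that $\xi = p/(1-p) = \tan^2\theta$, and it labels output modes from 0 rather than from 1. The function names `phases_from_distribution` and `distribution_from_phases` are hypothetical and are not taken from the original work.

```python
import math

def phases_from_distribution(q):
    """Return {(k, l): theta} for a binary tree whose 2**n leaves follow q.

    Node (k, l) has k in 1..n and l in 1..2**(k-1), as in the text.
    Assumption: p_upper = sin(theta)**2 at every node.
    """
    n = len(q).bit_length() - 1
    if 2 ** n != len(q):
        raise ValueError("len(q) must be a power of two")
    thetas = {}
    for k in range(1, n + 1):
        width = 2 ** (n - k)  # number of leaves below each child of a layer-k node
        for l in range(1, 2 ** (k - 1) + 1):
            start = (l - 1) * 2 * width
            up = sum(q[start:start + width])
            down = sum(q[start + width:start + 2 * width])
            # tan(theta)**2 = up / down; atan2 also handles down == 0
            # (theta = pi/2) and returns 0 for unreachable nodes (up == down == 0).
            thetas[(k, l)] = math.atan2(math.sqrt(up), math.sqrt(down))
    return thetas

def distribution_from_phases(thetas, n):
    """Propagate a single walker through the tree and return leaf probabilities."""
    probs = []
    for leaf in range(2 ** n):
        p, l = 1.0, 1
        for k in range(1, n + 1):
            bit = (leaf >> (n - k)) & 1  # 0 = upper path, 1 = lower path
            p_up = math.sin(thetas[(k, l)]) ** 2
            p *= p_up if bit == 0 else 1.0 - p_up
            l = 2 * l - 1 + bit  # index of the chosen child in layer k + 1
        probs.append(p)
    return probs

target = [0.1, 0.2, 0.3, 0.4]
thetas = phases_from_distribution(target)
recovered = distribution_from_phases(thetas, 2)
```

Here `recovered` should agree with `target` up to floating-point error, which is a quick consistency check on the indexing of the upper and lower sets.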

B. Update rules for t-PS
In this section, we discuss three relevant rules to update phases in the photonic t-PS architecture. The section is structured as follows: first, we derive the rule of Eq. 5. Then, we show that this architecture can also simulate the behavior of a two-layered PS where probability distributions are either calculated from normalized h-values (Sec. VII B 2) or the softmax function of h-values (Sec. VII B 3). Hence, t-PS can reproduce the results reported in the literature on PS.

2. Reproducing two-layer PS with standard probabilities
We look for an update rule $\theta_{c,kl} \to f(\theta_{c,kl})$ in t-PS that reproduces the update on the standard probabilities $q_{c,i} = h_{c,i} / \sum_j h_{c,j}$ of the two-layer PS [16]. Clearly, t-PS can, in principle, reproduce the probabilities in the two-layer PS since it can reproduce any probability distribution, as we showed in Sec. VII A. However, it is not obvious that there exists an update rule on the parameters $\{\theta\}$ that simulates an update on the h-values in the two-layer PS. Therefore, we will first show that (i) there exists a local update rule $g(\cdot)$ on $\{p_{c,kl}\}$ that reproduces the update on $\{q_{c,i}\}$ at every time step $t$. Then, (ii) we will express $p_{c,kl}$ using $\theta_{c,kl}$, which gives the desired update rule $\theta_{c,kl} \to f(\theta_{c,kl})$. For brevity, in the following we ignore the time index because it suffices to consider a single update.
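(i) We start by considering the ratio $\xi_{c,kl}$ of the transition probabilities at node $(k,l)$,

$$\xi_{c,kl} = \frac{p_{c,kl}}{1 - p_{c,kl}} = \frac{u_{c,kl}}{d_{c,kl}}, \tag{14}$$

where $U_{kl}$ and $D_{kl}$ are defined in Eq. 10, $B_{kl} = U_{kl} \cup D_{kl}$ is the set of branch indexes associated with all output modes reachable from node $(k,l)$ and

$$u_{c,kl} = \sum_{j \in U_{kl}} h_{c,j}, \qquad d_{c,kl} = \sum_{j \in D_{kl}} h_{c,j}. \tag{15}$$

Since Eq. 14 holds at each time step, when the transition to a certain clip $c' \in B_{kl}$ is rewarded (by a value $\lambda$, i.e. $h_{c,c'} \to h_{c,c'} + \lambda$) we have

$$\xi_{c,kl} \to \frac{u_{c,kl} + \lambda}{d_{c,kl}} \quad \text{or} \quad \xi_{c,kl} \to \frac{u_{c,kl}}{d_{c,kl} + \lambda}, \tag{16}$$

depending on whether the rewarded action is related to the upper path or to the lower path. Defining $N_{c,kl} = u_{c,kl} + d_{c,kl}$, so that $u_{c,kl} = N_{c,kl}\,\xi_{c,kl}/(1+\xi_{c,kl})$ and $d_{c,kl} = N_{c,kl}/(1+\xi_{c,kl})$, we obtain

$$\xi_{c,kl} \to \xi_{c,kl} \left[ 1 + \frac{\lambda\,(1 + \xi_{c,kl})}{N_{c,kl}\,\xi_{c,kl}^{\,1 - c'_k}} \right]^{1 - 2c'_k}, \qquad N_{c,kl} \to N_{c,kl} + \lambda, \tag{17}$$

where $c'_k = 0$ ($c'_k = 1$) if $c' \in U_{kl}$ ($c' \in D_{kl}$), and with the initial settings of $\xi_{c,kl}$ and $N_{c,kl}$ fixed by the initial h-values. Here $c'_k$ can be seen as the $k$th digit of $c'$ written in base 2 ($c'_k = 0$ for upper paths, $c'_k = 1$ for lower ones). Eq. 17 shows that there exists an update $g(\cdot)$ on $\{p_{c,kl}\}$ that reproduces the update on $\{q_{c,i}\}$ in the two-layered PS.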
(ii) By inserting $p_{c,kl} = \sin^2 \theta_{c,kl}$ into Eq. 14 we obtain $\xi_{c,kl} = \tan^2 \theta_{c,kl}$, i.e. $\theta_{c,kl} = \arctan \sqrt{\xi_{c,kl}}$. This connection allows the reformulation of Eq. 17 in terms of $\theta_{c,kl}$, which gives the update rule we were looking for to reproduce the two-layer PS in t-PS.
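As a numerical check of steps (i) and (ii), the following Python sketch stores only the ratio $\xi$ and the sum $N = u + d$ at each node and updates them along the rewarded path. The closed form used in `reward` follows from $u = N\xi/(1+\xi)$ and $d = N/(1+\xi)$; it is a derivation under these definitions and may differ in form from the expression printed in the original paper. The uniform initialization with unit h-values, the 0-based leaf labels, and all function names are assumptions made for illustration.

```python
def init_tree(n, h0=1.0):
    """Uniform start: every h-value equals h0 (assumed), so xi = 1 at all nodes."""
    xi, total = {}, {}
    for k in range(1, n + 1):
        for l in range(1, 2 ** (k - 1) + 1):
            xi[(k, l)] = 1.0
            total[(k, l)] = h0 * 2 ** (n - k + 1)  # N = u + d below node (k, l)
    return xi, total

def reward(xi, total, n, leaf, lam):
    """Apply h[leaf] += lam using only the xi and N values on the path to leaf."""
    l = 1
    for k in range(1, n + 1):
        bit = (leaf >> (n - k)) & 1  # 0 = upper path, 1 = lower path
        x, big_n = xi[(k, l)], total[(k, l)]
        gain = lam * (1.0 + x) / big_n  # equals lam / d at this node
        xi[(k, l)] = x + gain if bit == 0 else x / (1.0 + gain)
        total[(k, l)] = big_n + lam
        l = 2 * l - 1 + bit

def leaf_probs(xi, n):
    """Leaf probabilities obtained with p_upper = xi / (1 + xi) at every node."""
    probs = []
    for leaf in range(2 ** n):
        p, l = 1.0, 1
        for k in range(1, n + 1):
            bit = (leaf >> (n - k)) & 1
            x = xi[(k, l)]
            p *= x / (1.0 + x) if bit == 0 else 1.0 / (1.0 + x)
            l = 2 * l - 1 + bit
        probs.append(p)
    return probs

# Compare against a direct update of the h-values of a two-layer PS.
n = 3
xi, total = init_tree(n)
h = [1.0] * 2 ** n
for leaf, lam in [(5, 2.0), (0, 1.0), (5, 0.5), (7, 3.0)]:
    reward(xi, total, n, leaf, lam)
    h[leaf] += lam
tree_q = leaf_probs(xi, n)
direct_q = [v / sum(h) for v in h]
```

The phases then follow from `math.atan(math.sqrt(xi[node]))`. Only the $n$ nodes on the rewarded path are touched per update, which is the locality property that the argument relies on.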

3. Reproducing two-layer PS with softmax function
We now describe an update rule on t-PS that simulates the two-layer PS with softmax function. The softmax function (see Eq. 2) is a convenient tool to construct a probability distribution $\{p_i\}$ from a set of non-normalized quantities $\{h_i\}$. The derivation of the update rule in this case develops in the same manner as in Sec. VII B 2, through two main steps.
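(i) Considering the softmax function and the update rule $h_{c,c'} \to h_{c,c'} + \lambda$, following Eq. 14 and Eq. 15 we have

$$\xi_{c,kl} = \frac{U_{c,kl}}{D_{c,kl}}, \qquad U_{c,kl} = \sum_{j \in U_{kl}} e^{\beta h_{c,j}}, \qquad D_{c,kl} = \sum_{j \in D_{kl}} e^{\beta h_{c,j}}.$$

Depending on the rewarded path as in Eq. 16, we get

$$\xi_{c,kl} \to \xi_{c,kl} + (e^{\beta\lambda} - 1)\,\frac{e^{\beta h_{c,c'}}}{D_{c,kl}} \quad \text{or} \quad \xi_{c,kl} \to \xi_{c,kl} \left[ 1 + (e^{\beta\lambda} - 1)\,\frac{e^{\beta h_{c,c'}}}{D_{c,kl}} \right]^{-1}. \tag{19}$$

Ideally, we would like an update rule that involves only the quantity that is being updated, i.e. $\xi$. In order to express the ratio $e^{\beta h_{c,c'}} / D_{c,kl}$ in terms of $\xi$, we first observe that

$$\frac{e^{\beta h_{c,c'}}}{D_{c,kl}} = (1 + \xi_{c,kl})\, p(c'|kl), \qquad p(c'|kl) = \prod_{(v,w)} p_{c,vw}^{\,1 - c'_v}\,(1 - p_{c,vw})^{c'_v},$$

where $p(c'|kl)$ is the probability of a photon exiting the output mode corresponding to the next rewarded clip $c'$ given that it is at node $(k,l)$, evaluated as the product over all nodes $\{(v,w)\}$ that connect $(k,l)$ to $c'$ in the binary tree. Since $p_{c,kl} = \xi_{c,kl} / (1 + \xi_{c,kl})$, we can finally express Eq. 19 as follows

$$\xi_{c,kl} \to \xi_{c,kl} \left[ 1 + (e^{\beta\lambda} - 1) \prod_{(v,w),\, v > k} \frac{\xi_{c,vw}^{\,1 - c'_v}}{1 + \xi_{c,vw}} \right]^{1 - 2c'_k}. \tag{22}$$

Note that this expression only involves the quantity $\xi$.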
(ii) Using again $\theta_{c,kl} = \arctan \sqrt{\xi_{c,kl}}$, which follows from $p_{c,kl} = \sin^2 \theta_{c,kl}$, Eq. 22 provides the update rule to simulate the two-layer PS with softmax function in the t-PS architecture.
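The softmax case can be checked numerically in the same spirit. The sketch below assumes that, for a node on the rewarded path, the ratio is multiplied (upper path) or divided (lower path) by $1 + (e^{\beta\lambda} - 1)\,P$, where $P$ is the probability of reaching the rewarded output from the child of that node; this follows from the definitions of $\xi$ and of the softmax, but its exact form is a reconstruction rather than a quotation of the original equation. Function names, the uniform start, and 0-based leaf labels are hypothetical choices.

```python
import math

def uniform_tree(n):
    """All h-values equal (assumed start), hence xi = 1 at every node."""
    return {(k, l): 1.0 for k in range(1, n + 1) for l in range(1, 2 ** (k - 1) + 1)}

def softmax_reward(xi, n, leaf, lam, beta):
    """Apply h[leaf] += lam for softmax probabilities, using only xi values."""
    path, l = [], 1
    for k in range(1, n + 1):
        bit = (leaf >> (n - k)) & 1  # 0 = upper path, 1 = lower path
        path.append(((k, l), bit))
        l = 2 * l - 1 + bit
    factor = math.exp(beta * lam) - 1.0
    below = 1.0  # probability of reaching the leaf from the child of the node
    for node, bit in reversed(path):  # deepest node first
        x = xi[node]
        growth = 1.0 + factor * below
        xi[node] = x * growth if bit == 0 else x / growth
        # Extend the suffix product with the OLD step probability of this node.
        below *= x / (1.0 + x) if bit == 0 else 1.0 / (1.0 + x)

def leaf_probs(xi, n):
    """Leaf probabilities obtained with p_upper = xi / (1 + xi) at every node."""
    probs = []
    for leaf in range(2 ** n):
        p, l = 1.0, 1
        for k in range(1, n + 1):
            bit = (leaf >> (n - k)) & 1
            x = xi[(k, l)]
            p *= x / (1.0 + x) if bit == 0 else 1.0 / (1.0 + x)
            l = 2 * l - 1 + bit
        probs.append(p)
    return probs

# Compare against a direct softmax over explicitly stored h-values.
n, beta = 3, 0.7
xi = uniform_tree(n)
h = [0.0] * 2 ** n
for leaf, lam in [(2, 1.0), (6, 0.5), (2, 2.0), (7, -1.0)]:
    softmax_reward(xi, n, leaf, lam, beta)
    h[leaf] += lam
weights = [math.exp(beta * v) for v in h]
direct_q = [w / sum(weights) for w in weights]
tree_q = leaf_probs(xi, n)
```

Processing the path from the deepest node upwards lets the product over the lower nodes be accumulated in a single pass, using the values of $\xi$ before the update.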

C. Processing time and energy consumption
In this section, we discuss the computational complexity of the proposed photonic platform and of an ideal application-specific integrated circuit (ASIC), employing sampling algorithms and data structures which are best suited for the present application. To this end, let us assume that both the photonic hardware and the ASIC store weights, i.e. $\chi_{c,kl}$ and $h_{ij}$ respectively, in an on-board memory, ideally a cache. Both architectures must perform three computational tasks: (i) updating and (ii) preprocessing $N$ weights, and (iii) sampling from preprocessed data. Let us discuss each part in order.
(i) In the photonic architecture, updating the in-memory weights requires adjusting $\log N$ $\chi$-values along a path in the binary tree of Fig. 2. This operation corresponds to $O(\log N)$ FLOPS. Similarly, we only update a single h-value in the ASIC. However, in order to make steps (ii) and (iii) efficient, we demand that the h-values are kept in a sorted list. A single update may then disturb this sorting and require up to $O(N)$ operations to restore it. Therefore, we assume that the h-values are stored in a self-balancing tree data structure, a so-called B-tree [53]. This data structure not only allows access in $O(\log N)$ computational time but also includes insertion and deletion operations that maintain the order of elements with the same logarithmic time complexity.
(ii) Preprocessing in the photonic architecture requires adjusting $\log N$ PCM phase-shifters by evaluating $\theta(\chi)$ for the updated values, each requiring $\sim 10^2$ pJ at the nanosecond scale [24]. This is comparable to the energy consumption of ideal, specialized computing devices at $\sim 1$ pJ/FLOP [54] and may be improved due to the broad applicability of PCMs for energy storage, information processing, and optical communication [25,26]. For comparison, a general-purpose computing device requires $\sim 1$ nJ per DRAM access and $\sim 10$ pJ per cache access [55]. In the ASIC, we prepare for sampling by creating auxiliary data from the sorted list of weights, in accordance with the preprocessing outlined in the SORTEDPROPORTIONALSAMPLING algorithm proposed in Ref. [56]. This preprocessing requires $O(\log^2 N)$ computational time when data are stored as a B-tree.
(iii) In both cases sampling takes constant time: in the photonic device, sampling reduces to the generation and detection of a single photon, while the query complexity of SORTEDPROPORTIONALSAMPLING is $O(1)$ once preprocessing is concluded [56].
In summary, both the photonic architecture presented in the main text and the ASIC described here have about the same computational complexity $O(\log N)$ [55]. Note that, in principle, we need to take into account both memory access operations and FLOPS when estimating the energy cost. However, assuming a highly localized architecture approximately equalizes the energy consumption of memory accesses and FLOPS.
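To make the tree-based sampling concrete, here is a small software analogue, offered as a sketch rather than as a model of the hardware. A walker makes one biased binary choice per layer, so a software draw costs $\log_2 N$ random decisions, whereas in the photonic device the same walk is carried out by a single photon in one pass. The names `build_tree` and `sample` are hypothetical, and Python's `random` module merely stands in for the physical source of randomness.

```python
import random

def build_tree(q):
    """Map each node (k, l) to its probability of taking the upper path."""
    n = len(q).bit_length() - 1
    if 2 ** n != len(q):
        raise ValueError("len(q) must be a power of two")
    p_up = {}
    for k in range(1, n + 1):
        width = 2 ** (n - k)
        for l in range(1, 2 ** (k - 1) + 1):
            start = (l - 1) * 2 * width
            up = sum(q[start:start + width])
            total = up + sum(q[start + width:start + 2 * width])
            # Nodes that are never reached get an arbitrary balanced setting.
            p_up[(k, l)] = up / total if total > 0 else 0.5
    return p_up, n

def sample(p_up, n, rng):
    """Walk from the root to a leaf; return (leaf index, number of decisions)."""
    l, leaf, steps = 1, 0, 0
    for k in range(1, n + 1):
        bit = 0 if rng.random() < p_up[(k, l)] else 1
        leaf = 2 * leaf + bit
        l = 2 * l - 1 + bit
        steps += 1
    return leaf, steps

q = [0.05, 0.15, 0.10, 0.20, 0.25, 0.05, 0.10, 0.10]
p_up, n = build_tree(q)
rng = random.Random(0)  # fixed seed, only for reproducibility of the sketch
shots = 20000
counts = [0] * len(q)
for _ in range(shots):
    leaf, steps = sample(p_up, n, rng)
    counts[leaf] += 1
freqs = [c / shots for c in counts]
```

With enough shots, `freqs` should approach `q`, and each draw uses exactly `n` decisions, mirroring the logarithmic depth of the circuit.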

D. Role of experimental imperfections
Experimental noise and fabrication imperfections represent an unavoidable issue for any implementation. Their detrimental effects on device fidelities can also increase rapidly for applications that involve multiphoton interference in large-size interferometers [57][58][59]. As we discuss below, however, the tolerance to noise in the proposed architecture is comparatively high for at least two reasons. (i) The approach described in this work involves only single-photon evolutions in linear-optical circuits, reducing the influence of unbalanced phases that is critical for multiphoton interference. Also, the circuit depth scales logarithmically with the number of modes, thus limiting propagation losses. (ii) The additional randomness induced by noise can play a positive role in the operation of the device. In fact, since decision-making consists of single-photon random walks, random deviations from the ideal probability distributions lead to a tendency to explore alternative paths, without sticking to the estimated policy (as opposed to greedy approaches).
To investigate this aspect, in Fig. 7 we consider a noisy architecture used to solve a 3D GridWorld analogous to Fig. 4. To model noise, we follow a standard approach for tunable photonic circuits, where each beamsplitter $U^{\mathrm{BS}}_{c,kl}$ is physically implemented as a Mach-Zehnder interferometer with a tunable phase-shifter between two symmetric beamsplitters. Gaussian noise is then added to the phases $\theta_{c,kl}$, to simulate imperfect settings or mechanical instabilities. Specifically, in Fig. 7 we assume ideal beamsplitter transmissivities to isolate the contribution of phase errors; however, a similar behavior is observed when noisy beamsplitters are considered. We observe that noise-free implementations (Fig. 7a) tend to retain only a few very good paths in the agent's memory. Conversely, noisy implementations (Fig. 7b) tend to explore many more effective paths, eventually giving rise to a cloud of paths that connect to the reward from different locations. We emphasize that, even though the plot only displays the behavior of a single agent on a single maze, the above results were found to hold for practically all the agents inspected. Numerical evidence for this advantage is provided in Fig. 4, which shows that realistic levels of noise can indeed speed up the learning process.
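The effect of phase noise on a learned policy can be illustrated with a short Python sketch. It assumes that each node takes the upper path with probability $\sin^2\theta$ and that independent Gaussian noise of standard deviation `sigma` (a hypothetical parameter, not a value taken from the simulations) is added to every phase; beamsplitters are treated as ideal. Function names are illustrative.

```python
import math
import random

def leaf_probs(thetas, n):
    """Output distribution of the tree when p_upper = sin(theta)**2 at each node."""
    probs = []
    for leaf in range(2 ** n):
        p, l = 1.0, 1
        for k in range(1, n + 1):
            bit = (leaf >> (n - k)) & 1  # 0 = upper path, 1 = lower path
            p_up = math.sin(thetas[(k, l)]) ** 2
            p *= p_up if bit == 0 else 1.0 - p_up
            l = 2 * l - 1 + bit
        probs.append(p)
    return probs

def add_phase_noise(thetas, sigma, rng):
    """Return a copy of the phases with independent Gaussian errors added."""
    return {node: th + rng.gauss(0.0, sigma) for node, th in thetas.items()}

# A fully greedy two-layer policy: every node always takes the upper path.
n = 2
greedy = {(k, l): math.pi / 2 for k in range(1, n + 1) for l in range(1, 2 ** (k - 1) + 1)}
rng = random.Random(42)  # fixed seed, only for reproducibility of the sketch
ideal = leaf_probs(greedy, n)
noisy = leaf_probs(add_phase_noise(greedy, sigma=0.1, rng=rng), n)
```

In the ideal case all probability sits on the first output, while the noisy phases leak a small amount of probability to the other outputs. This is the extra exploration discussed above, and the distribution stays normalized because each node still splits its input into two complementary probabilities.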

E. t-PS in factorized action spaces
The tree structure of t-PS is particularly convenient for problems with factorized action spaces. This is due to its architecture being able to capture the hierarchical structure of a problem, namely the correlation between different action subspaces. In Sec. V A, we discussed how defragmentation of the agent's memory, which consists in reordering the way actions are assigned to the output modes, could allow forms of generalization and abstraction. In this section, we will show how defragmentation can capture the absence of correlation between action subspaces. Specifically, we take a closer look at the internal operation of a simulated photonic agent, which is challenged to learn the optimal policy in an instance of the multi-armed bandit problem with independent action spaces [29]. Let us consider a problem with three sub-actions associated with the spaces $(A_1, A_2, A_3)$ of size (2, 4, 2), i.e. $A_1 = (a_{1,1}, a_{1,2})$, $A_2 = (a_{2,1}, a_{2,2}, a_{2,3}, a_{2,4})$, $A_3 = (a_{3,1}, a_{3,2})$, for a total of 16 actions (Fig. 8a). This construction is not natural in the formulation of the problem presented in Sec. V B (casino, country, ...), since we assume full independence between the components of the actions. To investigate the dynamics of the internal settings, let us label the output modes $(m_1, ..., m_{16})$ by the action (or node) sequence, i.e. $m_1 = (a_{1,1}, a_{2,1}, a_{3,1})$, $m_2 = (a_{1,1}, a_{2,1}, a_{3,2})$ until $m_{16} = (a_{1,2}, a_{2,4}, a_{3,2})$. Also, let us assign rewards to the subspaces according to $\Lambda_1 = (0.95, 0.05)$, $\Lambda_2(x) = \epsilon\,(2 - 2x + 2x^2)^{-1}\,(0, x^2, 1, (1 - x)^2)$ and $\Lambda_3 = (0.05, 0.95)$ (Fig. 8b), $x$ and $\epsilon$ being a variable parameter and a rescaling factor, respectively. In this scenario, the beamsplitters in the first and last layer respectively control the behavior of $A_1$ and $A_3$, while those in the intermediate layers control $A_2$. Hence, we can monitor all the beamsplitters' transmissivities as rewards change with $x$ (Fig. 8c). As we show in Fig. 8d, the probability $p_{c,kl}$ of taking the upper path at each node resembles the shape of $\Lambda_2(x)$ in Fig. 8b; in particular, $p_{c,kl} = 1$ for $A_1$ since the agent learns to choose the first action regardless of the value of $x$. Overall, it is possible to visually relate the curves in Fig. 8d to the underlying conditions described in Fig. 8c and (using colors that match curves and beamsplitters) in Fig. 8a. Furthermore, the fact that almost half of the beamsplitters (green) are not updated ($p_{c,kl} = 0.5$, since their behavior is not relevant) can be seen as a form of abstraction that naturally occurred in the agent's memory. Eventually, this connection between factorized action spaces and internal parameters encourages the design of further mechanisms to enhance the learning process, which could simplify tasks in factorized (or factorizable) problems of higher dimensionality.

FIG. 7. Noise-enhanced exploration in GridWorld. Noise in the photonic implementation represents an additional source of randomness in the learning process, which enhances the likelihood of exploring new paths and helps to avoid getting stuck in suboptimal behavior. This figure shows a comparison between the PS policies learned (a) without noise or (b) with noise: noisy plots tend to exhibit larger clouds, meaning that more paths have been explored and reinforced in the same amount of time. Here, the green sphere ($p_A = (2, 2, 2)$) and the blue sphere ($p_R = (2, 9, 9)$) represent the PS agent and the reward, respectively. The learning parameters are $\lambda = 8$ and $\eta = 0.11$, and damping with $\gamma = 0.999$ is applied every 100 steps as in Fig. 3. Green arrows describe the most probable action the agent would take in each cell, with a size proportional to the probability. Black arrows highlight a single path taken by the agent after the learning process.

FIG. 8. Operation of the agent's memory in factorized problems. In each layer, an independent policy can be learned for a factorized action space, which allows the agent to boost the learning process. See text in Sec. VII E for a description of the example shown here. a) Connection between beamsplitters and action subspaces within t-PS. Beamsplitter branches are labeled as $(i, a)$, where $i$ numbers the subspaces and $a$ the actions, and colors are used to link them to panel (d). b) Evolution of the rewards $\Lambda_2(x)/\epsilon = (2 - 2x + 2x^2)^{-1}\,(0, x^2, 1, (1 - x)^2)$ associated with subspace $A_2$ as a function of a parameter $x$. Labels $(2, a)$, with $a = 1, ..., 4$, link the curves to the four actions of $A_2$. c) Evolution of the full landscape of 16 rewards, here normalized to the maximum values. Separate colors are used to follow the evolution of each bar, and are not connected with the color scheme in panels (a) and (d). d) Corresponding evolution of the probability $p_{c,kl}$ of taking the upper path at each node $(k,l)$. Curves are colored according to the layout in panel (a). Beamsplitters that do not change are shown in green. Values are averaged over $10^3$ agents, after $3 \times 10^3$ trials and rescaling the rewards by $\epsilon = 0.004$. Clearly, the frequency with which beamsplitters in $A_3$ are traversed depends on the reward distribution $\Lambda_2(x)$. The reason why $p_{c,kl} \neq 0$ for the first four beamsplitters in $A_3$, even though the action space is factorized, is that we are reporting only the average of their values, which oscillate between 0 (as expected) and 0.5 (when agents take other paths and beamsplitters are not forced to change).
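As a small worked check of the reward vector used for subspace $A_2$ in Sec. VII E, the following Python sketch evaluates $(2 - 2x + 2x^2)^{-1}\,(0, x^2, 1, (1-x)^2)$ multiplied by the rescaling factor. Since $x^2 + 1 + (1-x)^2 = 2 - 2x + 2x^2$, the four entries always sum to the rescaling factor, so changing $x$ redistributes the reward among the actions without changing the total. The function name `lambda2` and the argument name `eps` are hypothetical.

```python
def lambda2(x, eps=1.0):
    """Rewards of the four actions of subspace A2 for parameter x.

    eps is the rescaling factor (the Fig. 8 caption quotes 0.004 for it).
    """
    norm = 2.0 - 2.0 * x + 2.0 * x ** 2  # strictly positive for every real x
    return [eps * v / norm for v in (0.0, x ** 2, 1.0, (1.0 - x) ** 2)]

# Sample the landscape on a grid of x values in [0, 1].
grid = [i / 10 for i in range(11)]
landscape = [lambda2(x) for x in grid]
totals = [sum(row) for row in landscape]
midpoint = lambda2(0.5)
```

Every entry of `totals` should equal `eps` up to rounding. At `x = 0.5` the third action carries two thirds of the total reward and the second and fourth actions one sixth each, while the first action is never rewarded.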

FIG. 3. Photonic architecture for learning and decision-making. Single photons are routed by optical switches through $L$ layers, where state-action ($L = 1$) or clip-to-clip transitions are performed (Fig. 2). For PS, in this paper we only consider acyclic ECMs. Photons are then time-multiplexed, using delay lines, before reaching the detection stage in a single waveguide (whose outcome controls the optical switches and possible updates). In principle the system can even be self-stabilized [34,42,43].

FIG. 4. Simulating the photonic architecture in GridWorld. Average path length required by a PS agent to reach the reward in a 10 × 10 × 10 GridWorld, shown in the inset, as a function of the number of trials. The same analysis is carried out for implementations with ideal (blue) and noisy (orange) phase-shifters. See Sec. VII D for details on how experimental imperfections were modeled. Curves are averaged over $10^4$ agents ($\lambda = 8$, $\eta = 0.11$ and damping $\gamma = 0.999$ applied every 100 steps), while the gray band excludes lengths below the minimum (19 steps). Inset: Path taken by a single, noisy, random agent after 150 trials. The green sphere ($p_A = (3, 1, 4)$) and the blue sphere ($p_R = (9, 9, 9)$) represent the agent and the reward, respectively, while blocks represent untraversable 3D walls.

FIG. 5. Generalization in GridWorld. t-PS can exploit symmetries in a task environment to boost the learning process. In a 2D GridWorld (a) where the agent (circle) and the reward (star) are initially located at opposite corners, we can associate with modes (1,2,3,4) the actions (→, ↑, ←, ↓), so that the agent can learn to focus on the first two by adjusting just one parameter (b).

FIG. 6. Boosting the learning process in t-PS. Learning can be sped up in tasks with structured action spaces like the combinatorial multi-armed bandit [51], by taking advantage of the tree-like structure. a) Difference (boost) between the average reward collected with and without defragmentation of the agent's memory, i.e. a dynamical rearrangement of the actions over the output modes. The analysis is carried out for action spaces of size $2^d$, with $d = 3, ..., 6$, with only two actions rewarded (b): one fixed on the first output mode, the other one displaced progressively further over the other modes. For each $d$, the magnitude of the boost depends on the number of layers (from 1 to $d - 1$) where the rewarded paths differ: neighboring (faraway) modes lead to smaller (higher) boosts. For clarity, curves are interpolated connecting one point every 10. Averages are computed over $5 \times 10^3$ PS agents ($\lambda = 0.025$, $\gamma = 0.9975$).