Automated gadget discovery in the quantum domain

In recent years, reinforcement learning (RL) has become increasingly successful in its application to the quantum domain and the process of scientific discovery in general. However, while RL algorithms learn to solve increasingly complex problems, interpreting the solutions they provide becomes ever more challenging. In this work, we gain insights into an RL agent's learned behavior through a post-hoc analysis based on sequence mining and clustering. Specifically, frequent and compact subroutines, used by the agent to solve a given task, are distilled as gadgets and then grouped by various metrics. This process of gadget discovery develops in three stages: first, we use an RL agent to generate data; then, we employ a mining algorithm to extract gadgets; and finally, the obtained gadgets are grouped by a density-based clustering algorithm. We demonstrate our method by applying it to two quantum-inspired RL environments. First, we consider simulated quantum optics experiments for the design of high-dimensional multipartite entangled states, where the algorithm finds gadgets that correspond to modern interferometer setups. Second, we consider a circuit-based quantum computing environment, where the algorithm discovers various gadgets for quantum information processing, such as quantum teleportation. This approach for analyzing the policy of a learned agent is agent- and environment-agnostic and can yield interesting insights into any agent's policy.


Introduction
From playing games to predicting the structure of proteins [1][2][3], deep reinforcement learning (DRL) has become immensely successful in solving complex problems. This rapid progress has inspired the application of DRL in various scientific disciplines, ranging from biology [4] and medicine [5] to quantum physics [6,7]. These applications focus specifically on acquiring novel or optimized solutions to specific problems. In contrast, machine-assisted scientific discovery goes beyond finding an optimal solution to a narrowly defined task and asks for generic methods that can assist scientists in basic research [8][9][10][11][12][13][14][15][16][17][18][19] and may be of particular use in quantum information tasks. Here, problem solving is only one part of the scientific discovery process and can be complemented by methods such as explainable reinforcement learning (XRL) to improve the transparency of black box algorithms. In this way, we provide human users with tools to interpret complex reinforcement learning (RL) policies and even learn from them.
ML-assisted research requires systems that deliver insights into the solution strategy. For example, elements extracted from the observation of the agent's behavior can inform non-expert users about the strengths and limitations of the RL agent [20]. Often enough, the path towards a result is as interesting as the result itself. This is especially true for problems in quantum physics, where our intuition easily fails to grasp how a certain method led to the desired result. In [11], for example, an active learning agent learns to prepare highly entangled quantum states in simulated quantum experiments. Interestingly, in an expert post-hoc analysis of the agent's episodic memory, the authors (re-)discovered multiple tools (so-called gadgets) that were particularly useful for this task, some of which had previously been studied in other contexts.

Related work
As the autonomy and complexity of learning algorithms increase, the need for explainable approaches that are both transparent and interpretable becomes progressively more important. In the context of artificial intelligence (AI)-driven science, we deem transparency and interpretability vital. If AI systems are used as research assistant systems, they must be able to provide the human user with insight into their solution process. Despite the relevance of the term interpretability for AI research, there is no commonly agreed upon definition for it [27]. In this work, we adopt the notion of interpretability introduced in [27]: 'the ability to not only extract or generate explanations for the decisions of the model but also to present this information in a way that is understandable by human (non-expert) users to, ultimately, enable them to predict a model's behavior.' The topic of explainability has been widely explored in RL using different approaches. However, to date there is no set of agreed-upon benchmark tasks to test these approaches. This lack of testbeds likely stems both from the difficulty of tackling explainability simultaneously in all aspects of an RL process, and from the fact that it is a relatively recent field of research. This challenge is also reflected by recent reviews, which have yet to reach a consensus on taxonomy [28][29][30]. To contextualize our work within the broader field of XRL and its recent developments, we draw upon the classifications introduced by Puiutta and Veith [27] and Vouros [29]. Puiutta and Veith propose a classification along two dimensions, time and scope. The time of information extraction can be divided into intrinsic and post-hoc, where intrinsic refers to methodologies that are inherently interpretable at the time of training, and post-hoc applies to methods that provide interpretability after training of either interpretable or black-box approaches. The scope is instead separated into local and global, where a local method explains the behavior of an agent focusing on specific actions and states, while a global method provides more generalized explanations. One can then further refine this taxonomy using, for example, the work by Vouros [29], who identifies three subtasks that may be both intrinsic and post-hoc: two local subtasks, (i) the model inspection problem and (ii) the outcome explanation problem, and one global subtask, (iii) the model explanation problem, which includes the present work. In the following, we describe each of these settings in more detail, providing examples to contextualize our work with recent advances in the field of XRL.
(i) The model inspection problem aims at providing an explanation for the properties of a model through, for example, a careful inspection of its inner workings. Ribeiro et al proposed the method of local interpretable model-agnostic explanations [31], which uses a local interpretable model to explain predictions and to understand the correlations between input features and model responses. In a more recent work [32], the authors present an in-depth analysis of the AlphaZero network trained to play chess, where the network seems to learn concepts similar to those used by human chess players. Their analysis makes it possible to identify where and when such concepts are represented in the network. Our approach, instead of explaining the predictions of the agent's internal model, aims at providing insight into the agent's overall behavior. (ii) The goal in the outcome explanation problem is to gain insights into the agent's responses, such as skills and actions used in specific states. Additionally, it also aims at providing a local interpretation of the objective function. The reward decomposition approach by Juozapaitis et al [33] is a local, intrinsic XRL method designed to enhance a model's explainability by enabling it to distinguish different reward types. This approach allows the authors to gain insights into the decision-making process, specifically why a certain action is chosen based on the reward type received in specific states. However, Rietz et al [34] realized that local one-step explanations can also be ambiguous and propose instead to use hierarchical goals to provide context for local explanations. The focus on local explanations is why our framework departs from this problem setting, since gadgets serve as summaries of the overall behavior of an agent, which cannot be limited to specific state-action configurations. (iii) The model explanation problem aims at providing insights into the agent's responses at a larger scale.
Here, Vouros further distinguishes between the objective explanation problem and the policy explanation problem, providing global explanations for the objectives and the policy, respectively. As an example of objective explanation, Finkelstein et al [35] provide explanations in the form of transformations that adapt a model environment until the behavior of one agent (e.g. a model-based agent) matches the behavior of another agent (e.g. a model-free one) on the actual environment. Our framework, instead, falls into the category of policy explanation problems. Here, we discuss and compare specifically those algorithms that share similarities with our approach. HIGHLIGHTS [36], for example, is an algorithm that selects important states of the environment and constructs summaries to provide an overview of the agent's behavior. Important states are those where taking actions that deviate from the ones sampled via the policy significantly impacts future rewards. HIGHLIGHTS-DIV [36] is an extension of HIGHLIGHTS that emphasizes diverse aspects of the agent's behavior by excluding trajectories with similar states, ensuring that the summaries highlight distinct behaviors within different regions of the state space. Further building upon this work, [20] introduces an algorithm that searches for so-called interestingness elements, which denote interesting interactions across four different categories: frequency, execution certainty, transition value and learned strategies. These elements are extracted from a state-action transition graph using a best-first search, and are presented as visual summaries of the policy.
In our work, we extract interesting action sequences using pattern mining, which assesses the interestingness of a pattern across two dimensions: its frequency and its cohesion, which quantifies the average distance between elements. While pattern mining has the advantage of detecting interesting correlations between data samples, a challenge arises with respect to the interpretability of a large set of output patterns. We circumvent this issue by employing a clustering method, which reduces the complexity for the end-user by suitably grouping the output patterns. This is a major step towards making sequence selections for explainability more viable in practice, even compared to methods such as that in [20]. Moreover, the proposed algorithm is tailored to the needs of scientific discovery in the quantum domain, where gadgets represent compact experimental setups or useful subroutines in complex quantum protocols. Our framework was tested in both of these settings within the quantum domain. While this paper was under review, the framework of [20] has also been extended to improve explainability through methods such as clustering [37].

Methods
In this section, we will introduce the main methods used in the proposed architecture (see figure 1). With this approach, we develop an XRL method for scientific discovery in the quantum domain by (i) collecting data from the policy of an RL agent (see section 3.1), (ii) extracting gadgets from the RL data (see section 3.2), and (iii) clustering the gadgets (see section 3.3). In this way, we improve the interpretability of both the gadgets and the environment for the human user.

Reinforcement learning
In RL, an agent interacts with its environment. At each time step $t$, the agent perceives information about the state of the environment in the form of an observation $o \in O$, takes an action $a \in A$, and receives feedback in the form of a reward $r \in \mathbb{R}$. In this work, we consider the action set $A$ and the observation set $O$ to be finite and discrete sets. An episode describes all interactions between an agent and its environment until a termination condition is fulfilled. To simplify the interpretability of interaction sequences, we assume feature encodings for both actions and observations. That is, an action $a$ can be described by $n$ features, $a = (f_1, f_2, \ldots, f_n)$. For example, if actions describe movements, their features may be direction and speed. Even with minimal knowledge of the environment, such an encoding is typically easy to define. At the same time, this encoding gives rise to two complementary ways of clustering gadgets (see section 3.3) and allows for more control of the mining procedure (see section 3.2.3).
In RL, the agent's behavior is described by a conditional probability distribution called the policy $\pi(a|o)$. The agent's interaction with the environment is captured by a sequence of observations, actions and rewards. A standard figure of merit to evaluate the agent's performance is given by the return, which is the sum of rewards discounted by an environment-specific discount factor $\gamma$. For our purpose, different from standard formulations of the return (see appendix A.2), we define it as
$$G(d) = \sum_{t=1}^{T} r_t, \qquad (1)$$
where $d = (o_1, a_1, r_1, \ldots, o_T, a_T, r_T)$ is the interaction sequence and $T$ is a fixed integer length called the horizon. The goal of the agent is to find the optimal policy, which maximizes the expected return. Scientific tasks can be phrased as RL problems with their solutions encoded in the agent's policy. This technique has already been applied successfully in various fields. In this work, we analyze the obtained policy by extracting gadgets, i.e. frequent and compact subsequences of actions with large returns (see section 3.2). To extract gadgets, we collect sequences $d$ of fixed length $T$ with return values $G(d)$ larger than a threshold $G_{\min}$. These gadgets provide a way to interpret and explain the agent's policy, ultimately allowing us to draw conclusions about the structure of the environment. This is particularly useful in environments such as the one discussed in [11], where the authors can conclude that certain interferometers are particularly useful for generating quantum entanglement.
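For illustration, the following minimal Python sketch (not the authors' implementation; names such as `Step` and `rewarded_action_sequences` are ours) shows how fixed-horizon returns can be computed and how action sequences with $G(d) \geqslant G_{\min}$ can be collected for the subsequent mining stage.

```python
from dataclasses import dataclass

# Minimal sketch: an interaction step and the fixed-horizon return G(d)
# used to pre-select sequences for gadget mining. All names are illustrative.

@dataclass
class Step:
    observation: tuple   # environment observation o_t
    action: tuple        # feature-encoded action a_t = (f_1, ..., f_n)
    reward: float        # reward r_t

def episode_return(episode):
    """Undiscounted return G(d) of a fixed-length interaction sequence d."""
    return sum(step.reward for step in episode)

def rewarded_action_sequences(episodes, g_min):
    """Keep only episodes with G(d) >= G_min and restrict them to their actions."""
    return [
        tuple(step.action for step in episode)
        for episode in episodes
        if episode_return(episode) >= g_min
    ]

# Example: two toy episodes of horizon T = 3 with simple action features.
episodes = [
    [Step((0,), ("BS", 1), 0.0), Step((1,), ("DP", 2), 0.0), Step((2,), ("BS", 1), 1.0)],
    [Step((0,), ("Holo", 3), 0.0), Step((1,), ("BS", 2), 0.0), Step((2,), ("DP", 1), 0.0)],
]
print(rewarded_action_sequences(episodes, g_min=1.0))  # only the first episode survives
```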
To ensure that we gain as much information about the environment as possible, we employ an exploration-driven RL algorithm as opposed to the usual exploitation-exploration-balanced approach. That is, while large returns are important, we prioritize a diverse set of action sequences. For example, we use an agent with intrinsic motivation (IM) that receives internal rewards to facilitate exploration. We refer to appendix A for a consideration of the different types of RL environments and how they facilitate the use of different RL agents. Independent of the choice of RL agent, gadgets are extracted from the collected dataset by the data mining algorithm described in the following subsection.

Gadget mining
In this work, we consider gadget mining as an extension of sequence mining to interaction sequences. Pattern mining, in turn, consists in discovering interesting and possibly useful patterns in a database. When pattern mining has to take into account the temporal order of events, it is called sequential pattern mining, or sequence mining for short [38]. In this work, we rank subsequences of labeled, sequential data as patterns by their support, cohesion, and interestingness, following the approach and notation introduced in [24].

Support, cohesion, and interestingness
Consider a dataset $D$ consisting of $|D|$ sequences. Each sequence in the dataset $D$ can be described as a sequence of events $e = (i, \tau)$, where $i \in J$ is an item and $\tau \in \mathbb{N}$ is the corresponding time stamp. $J$ denotes the set of all items that can be arranged as a sequence. A sequence of $l$ events is denoted as $d = (e_1, e_2, \ldots, e_l)$. A sequence $\tilde d = (\tilde e_1, \tilde e_2, \ldots, \tilde e_m)$ is a subsequence of the sequence $d$, denoted as $\tilde d \sqsubseteq d$, if there exist $m$ indices $j_k$ with $1 \leqslant j_k < j_{k+1} \leqslant l$ such that $\tilde e_k = e_{j_k}$ for all $k \in \{1, \ldots, m\}$. Subsequences that start at timestamp $\tau_s$ and end at timestamp $\tau_e$ are denoted as $\tilde d^{(\tau_s, \tau_e)}$.
Following [24], the support $F_D(\tilde d)$ of a subsequence $\tilde d$ quantifies the frequency of sequences $d$ that contain the subsequence $\tilde d$ within the dataset $D$:
$$F_D(\tilde d) = \frac{|N_D(\tilde d)|}{|D|},$$
where $N_D(\tilde d) = \{d \in D \mid \tilde d \sqsubseteq d\}$ is the set of all sequences $d$ that contain $\tilde d$. The cohesion quantifies the average compactness of a subsequence within a dataset $D$ by
$$C_D(\tilde d) = \frac{|\tilde d|}{\bar W_D(\tilde d)},$$
where the average shortest interval length $\bar W_D(\tilde d)$ is the average, over the sequences in $N_D(\tilde d)$, of the length of the shortest interval that contains $\tilde d$. With these definitions, the interestingness $I_D(\tilde d)$ of a subsequence $\tilde d$ is then simply the product of its support and its cohesion:
$$I_D(\tilde d) = F_D(\tilde d)\, C_D(\tilde d).$$
In the following, we always consider that the support $F_D$, cohesion $C_D$, and interestingness $I_D$ are defined with respect to the full input dataset $D$, which allows us to ease the notation introduced in [24] by omitting the subindex $D$.
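As a concrete illustration, the brute-force Python sketch below computes support, cohesion, and interestingness for a toy dataset following the definitions above; it is written for clarity rather than efficiency, and all function names are ours.

```python
from itertools import combinations

# Illustrative sketch of the interestingness measures of [24] for small datasets.

def occurrence_windows(pattern, sequence):
    """All (start, end) index windows in `sequence` containing `pattern` as an ordered subsequence."""
    windows = []
    for idx in combinations(range(len(sequence)), len(pattern)):
        if all(sequence[j] == p for j, p in zip(idx, pattern)):
            windows.append((idx[0], idx[-1]))
    return windows

def support(pattern, dataset):
    containing = [seq for seq in dataset if occurrence_windows(pattern, seq)]
    return len(containing) / len(dataset)

def cohesion(pattern, dataset):
    containing = [seq for seq in dataset if occurrence_windows(pattern, seq)]
    if not containing:
        return 0.0
    # average length of the shortest window containing the pattern
    avg_window = sum(
        min(end - start + 1 for start, end in occurrence_windows(pattern, seq))
        for seq in containing
    ) / len(containing)
    return len(pattern) / avg_window

def interestingness(pattern, dataset):
    return support(pattern, dataset) * cohesion(pattern, dataset)

dataset = [("a", "b", "c", "b"), ("a", "x", "x", "b"), ("c", "c", "a", "b")]
print(support(("a", "b"), dataset), cohesion(("a", "b"), dataset), interestingness(("a", "b"), dataset))
```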

Patterns and gadgets
Patterns of a sequential dataset $D$ are defined as subsequences that have a sufficiently large minimal support, cohesion, and interestingness with respect to the dataset:

Definition 1 (Pattern). Given thresholds $F_{\min}$, $C_{\min}$, $I_{\min}$, a subsequence $\tilde d$ is a pattern w.r.t. $D, F_{\min}, C_{\min}, I_{\min}$ if $F(\tilde d) \geqslant F_{\min}$, $C(\tilde d) \geqslant C_{\min}$, and $I(\tilde d) \geqslant I_{\min}$.

Gadgets are sequences of actions that are also patterns. To define gadgets, we first define the action sequence $d_A \sqsubseteq d$ of an interaction sequence $d$ as its restriction to the action set $A$, i.e. $d_A := d|_A \in A^T$.

Definition 2 (Gadget). Given a return threshold $G_{\min} \in \mathbb{R}$, consider a dataset $D_R$ of rewarded interaction sequences such that $G(d) \geqslant G_{\min}$ for all $d \in D_R$. Let $D_A := \{d|_A \mid d \in D_R\}$ be the associated dataset of action sequences. A pattern w.r.t. $D_A, F_{\min}, C_{\min}, I_{\min}$ is a gadget w.r.t. $D_A, F_{\min}, C_{\min}, I_{\min}, G_{\min}$.
Notably, we ignore observations and rewards in the definition of gadgets. However, we will use this information for clustering in the last stage of the algorithm, as described in section 3.3. In appendix B, we describe an efficient approach for identifying gadgets based on the SPADE algorithm [25], which was originally designed to distill patterns from sequential datasets. The algorithm takes as input a dataset D A of rewarded action sequences and hyperparameters F min , C min , I min and outputs a set of gadgets, with no predefined elements or length.

Pre- and postprocessing
To facilitate the interpretability of gadgets, we use additional data processing steps whenever possible. For example, as mentioned in section 3.1, knowledge about the structure of the action space can be used to encode each action a by a feature vector a = (f 1 , f 2 , . . . , f n ). If features possess a hierarchy, this allows controlling the granularity of the mining procedure by removing irrelevant features from the encoding before mining. In the same way, we can apply additional filters in postprocessing to remove gadgets with certain undesirable properties or to restrict the total amount of found gadgets.

Clustering gadgets
The mining algorithm described in appendix B returns gadgets as defined in section 3.2. All gadgets can be associated with and ranked according to the three quantities presented in section 3.2.1. However, these quantities alone do not provide any relational information about gadgets that could be particularly interesting to a user. For instance, similar gadgets should be grouped autonomously by different criteria. We use the well-established, density-based clustering algorithm hierarchical density-based spatial clustering of applications with noise (HDBSCAN) [26] to cluster gadgets in a metric space induced by two different distance measures. More specifically, clusters are associated with dense regions of gadgets, while gadgets that are farther apart according to these distances are more likely to be associated with different clusters. The algorithm is described in more detail in appendix C. In addition, we also make use of any extra information provided by the environment to refine the clustering process and group gadgets in a more meaningful way.
In the following, we consider two complementary approaches to measuring distances for clustering: The utility-based method described in section 3.3.1, which directly compares gadgets, and the context-based method in section 3.3.2, which compares the states of the environment when gadgets are used (context). Both methods serve to increase the interpretability of the gadgets, as we discuss in section 2 under the broader heading of interpretability.

Utility-based clustering
Utility-based clustering groups gadgets by their utility, which separates the gadgets by their action composition. As defined in section 3.2.2, gadgets consist of sequences of actions that are described by a set of features (see section 3.2.3). These features can be manipulated and used to encode the gadgets and to define a suitable distance measure, which is used during the clustering stage. For instance, a simple and convenient encoding could be a tally vector that counts the number of actions of a certain type each gadget $g$ contains. In general, an encoded gadget can be described as a vector $\phi(g) \in \mathbb{R}^m$. With this encoding, we can also define a utility weight $\alpha_i \in \mathbb{R}$ for each vector component $i \in \{1, \ldots, m\}$, which weights the importance of a feature with respect to the other components. The encoding of a gadget $g$ weighted by a list of utilities is denoted as $\phi_w(g) = (\alpha_1 \phi_1(g), \ldots, \alpha_m \phi_m(g))$. A distance measure $M(g_1, g_2)$ between two gadgets is then given by the distance between their weighted encoding vectors $\phi_w(g_1)$ and $\phi_w(g_2)$. Importantly, we remark that the specific encoding, utility weights, and distance measure are all user-defined quantities, which provide extra degrees of freedom that can be especially useful when prior knowledge about the task environment is available. In the following sections, we will return to these definitions and discuss the specific choices made for each environment under investigation. The resulting clusters can then be further analyzed by determining the most frequent feature composition of the gadgets within each cluster and comparing it to that of the gadgets in other clusters.
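A minimal sketch of such an encoding, assuming the simple tally representation and unit utility weights described above (element names, example gadgets, and function names are illustrative only), could look as follows; the actual clustering of the resulting distance matrix is described in appendix C.

```python
import numpy as np

# Sketch of a utility-based gadget encoding: tally the actions of each element type
# in a gadget and weight each component by a user-defined utility alpha_i.

ELEMENT_TYPES = ["BS", "DP", "Holo", "Refl"]   # order fixes the vector components
UTILITIES = np.array([1.0, 1.0, 1.0, 1.0])     # alpha_i, all set to one as in section 4.1

def encode(gadget):
    """Weighted tally vector phi_w(g) of a gadget g = (a^(1), a^(2), ...)."""
    counts = np.array([sum(action[0] == t for action in gadget) for t in ELEMENT_TYPES], dtype=float)
    return UTILITIES * counts

def utility_distance(g1, g2):
    """Euclidean distance between the weighted encodings of two gadgets."""
    return float(np.linalg.norm(encode(g1) - encode(g2)))

mach_zehnder = (("BS", 1), ("Refl", 2), ("DP", 2), ("BS", 1))   # illustrative gadget
variant      = (("BS", 1), ("Holo", 2), ("Refl", 2), ("BS", 1))
print(utility_distance(mach_zehnder, variant))  # sqrt(2): they differ by one DP and one Holo
```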

Context-based clustering
In this section, we consider clustering gadgets by the context they are used in, that is, by the observations that triggered their use. For each gadget $g = (a^{(1)}, a^{(2)}, \ldots)$, consider the dataset of all interaction sequences $D_g$ that contain $g$ as a subsequence, i.e. $D_g := \{d \in D \mid g \sqsubseteq d\}$. Recall from section 3.1 that an interaction sequence is given by $d = (o_1, a_1, r_1, \ldots, o_T, a_T, r_T)$. For every interaction sequence $d$ in $D_g$, we collect the observation prior to the first appearance of the first element $a^{(1)}$ of the gadget $g$ in a list. This list is referred to as the initialization list $I_g$; as its name suggests, it contains the observations (the context) that initialized the use of the gadget.
To cluster gadgets $g_1, g_2$ by their context, we need to define a distance between two initialization lists. Consider two observations $o, q$ and their distance $m(o,q)$, which is defined by the user (we use the Hamming distance, as described in appendix D.5). Then, we define the distance between initialization lists as
$$M(I_{g_1}, I_{g_2}) = \frac{1}{|I_{g_1}|} \sum_{o \in I_{g_1}} \min_{q \in I_{g_2}} m(o, q).$$
This distance is thus the average minimum distance between the elements of the two initialization lists. This definition ensures that $M(I_{g_1}, I_{g_2}) = 0$ if $I_{g_1} = I_{g_2}$. Here, we assumed that $|I_{g_1}| = |I_{g_2}|$, which is enforced by setting a maximum size for all initialization lists.
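For concreteness, the following sketch computes this distance for Hamming-distance contexts; it assumes the one-directional average-minimum form written above, and all names are illustrative.

```python
# Sketch of the context-based distance between initialization lists.
# Observations are fixed-length discrete vectors; m(o, q) is the Hamming distance,
# and the list distance averages, for each context in I_g1, the closest context in I_g2.

def hamming(o, q):
    """Elementwise Hamming distance between two equally long observation vectors."""
    return sum(x != y for x, y in zip(o, q))

def context_distance(init_list_1, init_list_2):
    """Average minimum Hamming distance from contexts in I_g1 to contexts in I_g2."""
    return sum(min(hamming(o, q) for q in init_list_2) for o in init_list_1) / len(init_list_1)

I_g1 = [(0, 1, 0, 0), (1, 1, 0, 0)]
I_g2 = [(0, 1, 0, 0), (0, 0, 1, 0)]
print(context_distance(I_g1, I_g2))  # 0.5: one exact match, one context at distance 1
```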

Results
In this section, we demonstrate our architecture in two quantum-inspired RL environments (see figure 2). The reason for this choice is that quantum physics is a field of science where human intuition is often challenged or fails. Therefore, an automated procedure that does not depend on our biased idea of reality may be of help for distilling key properties of the RL environment. In section 4.1, we start from the environment of [11] and use an RL agent with intrinsic motivation (IM) to design quantum optics experiments that produce multipartite high-dimensional entanglement. In section 4.2, we consider a quantum information environment where a double deep Q-network (DDQN) [39] is tasked to design quantum circuits that preserve quantum information. For each task, we describe the environment and agent in more detail before we present the results of the gadget mining and clustering. A summary of the two environments is given in table 1. A brief introduction to quantum mechanics is given in appendix D.1 for non-expert readers.

Discovering quantum optics gadgets
In this section, we apply our algorithm to discover quantum optics gadgets in experimental setups that challenge our human intuition, which is based on classical physics.

Quantum optics environment
In this RL environment, an agent interacts with simulated quantum optics experiments (see [11] and figure 2 (left)). As observations, the agent receives information about the current experimental setup. As actions, the agent sequentially places linear optical elements such as beam splitters, mirrors, Dove prisms, and holograms along the path of a light beam. The degree of freedom that is manipulated is the orbital angular momentum (OAM) of photons [40], which spans a high-dimensional (in principle infinite) Hilbert space. The task of the agent is to create high-dimensional entanglement between three of the four photons. Since entanglement is a genuinely quantum phenomenon with no classical analog [41], designing experimental settings to generate entanglement routinely challenges our classical intuition [21,22]. Entanglement is characterized by the so-called Schmidt-rank vector (SRV) [42]: this is a three-dimensional vector $(r_1, r_2, r_3)$ where each $r_i$ quantifies the amount of entanglement between photon $i$ and the others. The reward function in this environment depends on the SRV: a positive reward is given only for SRVs with $r_i \geqslant 2$ for all $i = 1, 2, 3$. We will not distinguish different orderings of the three SRV components. As an initial setup, the agent can operate on two pairs of entangled photons created by spontaneous parametric down-conversion in two nonlinear crystals [21]. The detection of one photon at the end, in combination with the linear optical elements mentioned above, enables the probabilistic generation of entanglement. Further details on the environment and a brief background on quantum optics are provided in appendix D.2.

Intrinsically motivated agent
The environment described above has a unique starting state corresponding to two pairs of entangled photons in four distinct spatial modes. As the task of standard RL agents is to maximize the future expected (extrinsic) reward as defined in equation (1), we can expect that a trained agent always places elements that reproduce the simplest successful setup. To ensure continuous exploration, we use an intrinsically motivated agent based on a variant of Monte Carlo tree search (MCTS) [43]. In standard MCTS [44], a policy is derived from a search tree built for each state, by keeping track of the rewards and number of state visits during the interaction with a known model of the environment. The IM is added to the MCTS in the form of a novelty count and a boredom mechanism. The novelty count guarantees that each successful action sequence is only rewarded once and the boredom mechanism further favors the exploration of less explored action sequences. Further details of the RL algorithm are provided in appendix A. We also analyze the performance of the agent in this environment in appendix D.2.

Gadget analysis
Once data has been collected, we analyze the gadgets and clusters found by the data mining and clustering algorithms described in sections 3.2 and 3.3, respectively. We separately mined 10 datasets $D_A^{(i)}$ generated by as many independent MCTS agents, labeled by $i \in \{1, \ldots, 10\}$. These datasets have a fixed size.

Figure 2. Quantum optics and quantum information environments. (Left) In the quantum optics (QO) environment, an agent observes an optical setup and places new elements to generate multipartite high-dimensional entanglement in the orbital angular momentum degree of freedom of photons. The initial state consists of two entangled photon pairs produced by spontaneous parametric down-conversion (SPDC). The set of available optical elements consists of beam splitters (BS), Dove prisms (DP), mirrors (Refl), and holograms (Holo). (Right) In the quantum information (QI) environment, an agent observes a quantum circuit and places circuit elements to protect the information encoded in the first (top) qubit against measurements. The elements placed on the dotted lines marked with the letter E correspond to the initial circuit randomly generated by the environment in this example. The actions of the agent fill up the empty slots (or layers) between the elements of the initial circuit. The set of available actions consists of various one-qubit rotations and two-qubit entangling gates (G), as well as one-qubit measurements and two-qubit Bell measurements (M).

To facilitate a direct comparison to [11], we consider only gadgets that are rewarded in a similar way as in [11], that is, gadgets that produce quantum states with SRVs $(r_1, r_2, r_3)$ such that $r_i \geqslant 2$ for all $i = 1, 2, 3$. We consider this an additional postprocessing step as described in section 3.2.3; all other, more generic postprocessing steps are detailed in appendix D.3. Once all datasets have been mined, we consider two ways of clustering them: clustering gadgets by their SRV and by utility, respectively, as described in section 3.3. Both methods were applied to the same gadgets.
SRV-based clustering. Figure 3 illustrates the results of gadgets clustered by their SRV. We find that the cluster labeled by the SRV (3, 3, 2) indeed contains the same gadgets retrieved in [11]. A representative example of a gadget in this cluster, the so-called Mach-Zehnder interferometer, is further analyzed in appendix D.4. The other clusters contain additional gadgets that also correspond to interferometric setups. Further details on these clusters are shown in table 4. This clustering method can be used to look up how to generate different SRV states. Utility-based clustering can then be used to gain insights into the composition of the gadgets, as we discuss in the next section.
Utility-based clustering. We illustrate the results of utility-based clustering in figure 4. For this, we represent each gadget as a four-component vector, where each component counts how many optical elements of one type (beam splitter, hologram, Dove prism, mirror) appear in the gadget. The utility of each component is set to one: $\alpha_i = 1$ for all $i \in \{1, \ldots, 4\}$. The distance between two gadgets is the Euclidean distance between their composition encodings. Thereby, gadgets are clustered by their composition, yielding insight into the optical elements needed to generate a specific SRV state. In SRV-based clustering, the cluster associated with SRV (3, 3, 2) contains the largest number of gadgets. In contrast, utility-based clustering breaks this cluster into five smaller clusters, each corresponding to specific sequences of elements that generate the (3, 3, 2)-state. Cluster C4 mostly contains gadgets that consist of a mirror, a hologram, and two beam splitters. Cluster C2 contains the two equivalent Mach-Zehnder interferometers described in [11] (see also appendix D.4): two equivalent arrangements of a mirror, a Dove prism, and two beam splitters. We also find that, compared to cluster C2, the gadgets in cluster C3 often contain one additional optical element: a hologram. Equivalently, C1 contains one extra element, a hologram, w.r.t. C4. An additional hologram is useful to reach higher SRV states when combined with an additional beam splitter. Two gadgets in C0, on the other hand, produce a (3, 3, 2)-state with additional holograms and mirrors. There is only a single unclustered element; it is the only gadget that creates a (3, 3, 2)-state with the same elements as C4 plus one additional mirror. Thus, C2 and C4 contain the shortest gadgets that produce the (3, 3, 2)-states. Cluster C0 differs from the others, as six out of the eight gadgets produce states where at least one SRV component is greater than three. Interestingly, gadgets that generate these larger SRV states have a similar composition: all of them consist of holograms followed by beam splitters. Such a setup increases the dimensionality of the photons. A single hologram followed by a beam splitter is a gadget that is also mentioned in [11], where it was only found through an extensive post-hoc analysis of the agent's episodic memory. Here, instead, it arises naturally in a single cluster. Further details on the clusters are shown in table 5.

Figure captions (figures 3 and 4). Gadgets have been mined from datasets generated by 10 intrinsically motivated MCTS agents. We find various interferometers in the clusters, each consisting of a combination of at most four different types of optical elements: mirrors (Refl), Dove prisms (DP), holograms (Holo), and beam splitters (BS). Most notably, the cluster corresponding to SRV (3, 3, 2) (green) contains the most important gadgets found in [11], i.e. a Mach-Zehnder interferometer and an equivalent (non-local) version thereof; these two equivalent Mach-Zehnder interferometers (shown in the green C2 cluster) are contained in the same cluster even though they have two different representations. Since the (3, 3, 2)-state is most easily generated by the agent in this environment, the majority of the clusters (C1-C4) contain mostly gadgets that generate this state, while only one cluster (C0) contains the gadgets that generate the less frequent larger SRV states.
Interpretation. It is difficult to understand the behavior, i.e. the policy, of an exploration-driven agent such as the MCTS algorithm, because its behavior constantly evolves. Here, we provide an alternative way of analyzing its policy by means of clustering gadgets. As an example, from these gadgets alone we can learn that interferometers are particularly useful in this environment. In addition, the clusters in figures 3 and 4 give us a strong starting point to study similarities between gadgets, which could potentially, if further automation is applied, lead the agent to the discovery of the Klyshko wavefront picture [45,46], in which the two gadgets displayed in cluster C2 in figure 4 are equivalent. We provide a detailed list of all gadgets in appendix D.3.

Discovering quantum information gadgets
In this section, we use our architecture to discover gadgets useful for quantum information. Quantum information processing functions in a way that is fundamentally different from classical information processing. For example, measuring the quantum information that describes one observable entails that the information about another, complementary observable is destroyed. At the same time, quantum information theory allows for fascinating and counter-intuitive phenomena like teleportation and non-locality. Here, we want to explore these phenomena by analyzing the gadgets that give rise to them.

Quantum information environment
In this RL environment, an agent interacts with a simulated four-qubit quantum circuit (see figure 2(right)). As observation, the agent perceives the current quantum circuit consisting of various gates and measurements. As actions, the agent can sequentially place quantum gates and measurements, including one-qubit rotations and two-qubit entangling gates, as well as one-qubit measurements and two-qubit Bell measurements. The task of the agent is 'to keep the quantum information alive', that is, to avoid losing the information that is initially encoded in the first (top) qubit of the circuit. As the initial setup, the agent is confronted with an incomplete quantum circuit, consisting of random quantum operations only at fixed positions, i.e. layers (marked with an E in figure 2(right)). The agent performs actions until the circuit has been completed. Whenever the circuit preserves the quantum information until the end, the environment issues a positive reward. Conversely, whenever the quantum information is destroyed, e.g. through measurements, a negative reward is given. A more detailed description of this environment and the reward function can be found in appendix D.5, together with a short introduction to quantum information theory.

Deep RL agent
The environment described above has so-called random starts, i.e. the initial observation is sampled randomly from a finite set of starting circuits. For an agent to learn in this kind of environment, it needs to be able to adapt to many different situations and develop multiple strategies at the same time. In such environments with large observation spaces, deep RL is particularly advantageous due to its ability to generalize. Generalization is the ability to perform well when confronted with unseen data by identifying meaningful similarities between previously seen and novel (i.e. unseen) data. Here, we apply a DDQN [39]. DDQN is a Q-learning algorithm based on the standard deep Q-network (DQN) [47], which features two neural networks (NNs) to increase the stability of the prediction of Q-values.
Further details on the RL algorithm are provided in appendix A. We also analyze the performance of the agents in this environment in appendix D.5.

Gadget analysis
We analyze the gadgets and clusters found by the data mining and clustering algorithms described in sections 3.2 and 3.3, respectively.
We consider three individual DDQN agents, each of which independently generates a dataset $D^{(i)}$, labeled by $i \in \{1, 2, 3\}$. For this environment, the mining algorithm further preprocesses these datasets, transforming each dataset $D^{(i)}$ into a preprocessed dataset of action sequences $D'^{(i)}_A$, as described below. The values for the thresholds are automatically adjusted to output between one and ten gadgets per dataset (see appendix B). The final threshold values are given in appendix D.6.
Each operation that the agent can place on the circuit is associated with four features (see also appendix D.5), which describe (i) whether the operation is a gate or a measurement, (ii) whether it is a one-qubit or two-qubit operation, (iii) the qubits it is applied to and (iv) the specific gate or measurement. This action encoding provides a minimal structure to the actions without inputting prior knowledge about the environment. Features (i) and (ii) are coarse-grained descriptions of the available operations, and feature (iii) is potentially useful, especially if there is no prior assumption about symmetries in the environment. As explained in section 3.2.3, giving this additional structure to the actions allows us to control the granularity of the mining process. Here, a coarse-grained mining is implemented by removing the last feature of each action. Once the mining process has ended and the algorithm has output gadgets for each dataset $D'^{(i)}_A$, all gadgets are collected and clustered by utility or by context as described in section 3.3. Both clustering methods were applied to the same gadgets. A detailed description of the distance measures used for clustering is given in appendix D.6.
Utility-based clustering. We illustrate the result of clustering gadgets by their utility in figure 5. We assume that each action feature has the same utility. As we describe in appendix D.6, each gadget is encoded into a vector that counts how many one-qubit and two-qubit gates and measurements it contains. We define the distance between two gadgets as the Euclidean distance between their encodings (see also appendix D.6), which implies that two gadgets with similar tallies are close together, even if the order of their operations differs and the operations do not commute. Indeed, this is what we find in the clusters of figure 5. A list of all gadgets and the corresponding cluster labels is given in table 8. Even though, in this case, these results do not give us new insight into the environment, they show that this clustering method is a convenient way of grouping gadgets based on the operations they contain, which can be particularly useful when the set of available operations is large.
Context-based clustering. We illustrate the results of clustering gadgets by their context in figure 6. This clustering method complements and refines the utility-based method as it does not depend on the encoding of the gadget but rather on that of the observation space of the environment. In this metric space, gadgets are closer together if the agent had observed similar circuits before using the gadgets. Gadgets that belong to cluster C1 are placed at the beginning of the circuit when the initial circuit is harmless (it does not contain any measurement that destroys the quantum information). Gadgets in cluster C0 are placed later in the circuit (see example in figure 6). These gadgets are also applied when the circuit does not need to be corrected to protect the information. The gadgets in cluster C2 are used by the agent to correct initial circuits that contain a measurement on the first qubit (see example in figure 6). Interestingly, gadgets in this cluster contain two-qubit gates and measurements that teleport the quantum information from the first qubit to some other qubit in the circuit to circumvent the destructive effect of the measurement initially placed by the environment. An illustrative example of a gadget in this cluster is analyzed in appendix D.7, where we explicitly compute the reward before and after the placement of the gadget, thereby showing how quantum information propagates through the circuit. Detailed information on these results is given in table 9 and further discussed in appendix D.6.
Interpretation. As this environment is inherently random, it is difficult to understand the policy of an RL agent, i.e. why the agent does what it does, especially if it is a black-box NN. We provide an alternative way of analyzing such a policy by means of gadgets that are then clustered according to several methods. From the gadgets alone, we do not learn much, as they consist of various combinations of all actions. However, the clustering discloses relevant information about the environment. First, we can distinguish gadgets by the types of operations they use (see figure 5). Second, we can analyze them by the similarity of the contexts they appear in (see figure 6). The latter approach provides interesting insights into the environment. For example, we observe that entangling gates on the first qubit are used when the context involves a measurement on that qubit.

Figure 6. Quantum information gadgets clustered by context. In the center, we show examples of quantum information gadgets for each cluster. The cluster labels on the sides show an example of a circuit from the initialization list (context) of the depicted gadget. The empty square in the circuit indicates where the gadget will be placed. Operations marked with an E are randomly chosen by the environment as the initial start. We find that the clustering algorithm distinguishes gadgets that are used in circuits that do not need to be corrected to protect the information (clusters C0 and C1) from teleportation gadgets (cluster C2) in the form of entangling gates between qubits and measurements. Circuits in the initialization lists of gadgets in C2 usually contain a measurement on the first qubit, which destroys the quantum information. To correct these circuits, the agent learned to use the gadgets in C2, which teleport the information to a different qubit to circumvent the destruction of the quantum information through the measurement on the first qubit. The gadget depicted here as an example in cluster C2 is explicitly analyzed in appendix D.7. A complete list of all gadgets and their cluster labels can be found in table 9 (appendix D.6).

Conclusion
Machine learning has already been acknowledged as part of the modern-day physicist's toolbox [48], and it promises to take on a more fundamental role that goes beyond problem-solving.
One challenge to the development of automated research assistants is the fact that most modern ML approaches operate as black boxes. While research assistant systems that solve a complicated problem are already a valuable asset, we aim to develop them further specifically for the quantum domain to gain insights into the solution process.
In this work, we develop an architecture that operates on top of RL agents to autonomously extract promising patterns (gadgets) from the agent's behavior in an automated, post-hoc analysis. In this way, we provide an additional analysis tool that enhances our understanding of the agent's learning and problem solving beyond extracting a single solution to the problem at hand. We demonstrate this method in two quantum physics environments, which are particularly interesting because of the often counter-intuitive nature of quantum mechanics. We subjected the resulting gadgets and clusters to a human expert analysis and confirmed that most of them are indeed meaningful in these environments.
We expect that the framework developed in this work will form the basis of more general architectures that further improve the transparency of RL agents in quantum physics. For example, we currently use intrinsically motivated RL agents to steer the discovery of gadgets in simulated quantum optics experiments. In the future, we expect IM to become part of the gadget discovery architecture, such that information about current gadgets is fed back into the RL agent via an intrinsic reward function. In this way, we can stimulate either the discovery of new gadgets or the use of already known gadgets (i.e. in the spirit of hierarchical learning [49]). This approach can be facilitated and made efficient by the use of well-established data stream pattern mining methods [50,51]. Moreover, while the current architecture focuses on discrete action sequences, in the future this can be extended to continuous action spaces, as well as actions sampled from a policy. Indeed, we observe that gadgets are closely related to so-called options [52], a well-established concept in RL. Specifically, options are triples (I, π, T) consisting of a sub-policy π, an initialization condition I, and a termination condition T. In the context of our architecture, we would consider π as a gadget and use I and T in the same way that we use the initialization list for gadget clustering now. However, many interesting problem settings involve a discrete action space [7,53], where the presented architecture can already be applied out of the box.

Data availability statement
The data that support the findings of this study will be openly available at the following URL/DOI: https:// www.doi.org/10.5281/zenodo.8304550.

Acknowledgments
We acknowledge support by the Austrian Science Fund (FWF) through the DK-ALM: W1259-N27 and SFB BeyondC F7102, and by the Volkswagen Foundation (Az:97721). This project has received funding from the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement, Grant Nos. 801110 and 885567. It reflects only the author's view, the EU Agency is not responsible for any use that may be made of the information it contains. ESQ has received funding from the Austrian Federal Ministry of Education, Science and Research (BMBWF). This work was also supported by the European Union (ERC, QuantAI, Project No. 101055129). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them.

Appendix A. Reinforcement learning
Here, we motivate and discuss in more detail the RL algorithms that are used to collect data in section 4. Let us consider the two different types of RL environments that appear in section 4 and how they motivate our choice of RL agents. Specifically, we consider environments with and without random starts. Our goal is to discover gadgets that are representative of the environment or policy. If the environment is deterministic and has a fixed starting state (such as the quantum optics environment in section 4.1), a standard agent is likely to converge to a single strategy, which would result in just one or two gadgets. In such an environment, we thus employ intrinsically motivated agents to ensure exploration. In contrast, if we consider so-called random-start environments (such as the quantum information environment in section 4.2), the initial state is randomly sampled at every start of an episode. This already ensures sufficient exploration but necessitates the use of an agent that can generalize over a large set of observations. This is why we use a deep RL agent for this environment. In the following, we will describe both types of RL agents used in this work.

A.1. Monte Carlo tree search with intrinsic motivation
In environments with sparse extrinsic rewards, methods such as IM can enhance an agent's performance. In this work, we use IM to enhance the exploration by the agents when there is a deterministic single-start RL environment. An IM agent receives intrinsic rewards to drive exploration. For example, agents may be self-driven using a sense of surprise [54,55], curiosity [56][57][58], boredom [59][60][61], need for control [62,63], or empowerment [64,65].
The explorative version of MCTS described here draws upon three of the four main components of the original algorithm: expansion, backpropagation, and selection. Additionally, we consider changes to these subroutines which we partially adopted from [43]. In the MCTS algorithm, a decision tree is built from interactions with an environment; each node in the tree corresponds to a state encountered in the environment and each directed edge represents a transition to the next state triggered by an action. The first step of IM MCTS is a multi-step expansion that starts from the root node and iteratively builds up a tree. At each node with depth $d$, if there is an action that has not been taken from the corresponding state, a new node with depth $d+1$ is added as a child node. This procedure is repeated until the depth is equal to an expansion horizon $T$. The environment we consider is episodic with an episode length of $T$ and a single start, thus every rollout starts from the same root node and ends at a leaf node at depth $T$. In order to facilitate exploration, we define an intrinsic reward $r$ that only rewards unique goal-states. Each child $s'$ of node $s$ is uniquely defined by the tuple $(s, a)$ and has an associated novelty count $U(s, a)$, which is the number of unique rewarded goal-states reachable from this node. Additionally, the number of visits to a node is tracked with a visit count $N(s, a)$. At the end of each rollout, during backpropagation, the novelty counts $U(s, a)$ and the visit counts $N(s, a)$ along the visited path are updated. The parent visit count $N(s) = \sum_a N(s, a)$ is given by the sum of visit counts of all children nodes. Actions are then scored by a UCT-style value
$$\mathrm{UCB}(s, a) = \frac{U(s, a)}{N(s, a)} + C \sqrt{\frac{\ln N(s)}{N(s, a)}},$$
where the first term drives exploitation, while the second term drives exploration, so that the hyperparameter $C$ balances exploration and exploitation. The policy for selecting the next action $a$ is greedy:
$$a = \arg\max_{a'} \mathrm{UCB}(s, a').$$
In the quantum optics environment, an explorative agent is essential, which is why we introduce an additional hyperparameter, a boredom threshold $B$, which resets exploitative agents after some time. The corresponding boredom mechanism resets the novelty count $U(s, a)$ as soon as it reaches the chosen boredom threshold $B$, ensuring further diversity of the collected dataset.
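A minimal sketch of such a selection step (a generic UCT-style variant with novelty counts and a boredom reset, not the exact implementation used here; all names are illustrative) is given below.

```python
import math
from dataclasses import dataclass

# Sketch of an intrinsically motivated selection step: a UCT-style score built from
# novelty counts U(s, a) and visit counts N(s, a), with a boredom threshold B that
# resets over-exploited branches.

@dataclass
class Child:
    U: int = 0      # novelty count: unique rewarded goal-states reachable via this edge
    N: int = 0      # visit count

def ucb_score(U, N, N_parent, C):
    if N == 0:
        return float("inf")                     # always try untested actions first
    return U / N + C * math.sqrt(math.log(N_parent) / N)

def select_action(children, C=1.0, B=50):
    """Greedy selection over UCB scores; novelty counts above B are reset (boredom)."""
    for child in children.values():
        if child.U >= B:
            child.U = 0
    N_parent = max(sum(child.N for child in children.values()), 1)
    return max(children, key=lambda a: ucb_score(children[a].U, children[a].N, N_parent, C))

children = {"BS": Child(U=3, N=10), "DP": Child(U=1, N=2), "Holo": Child(U=0, N=0)}
print(select_action(children))  # "Holo": unvisited actions are selected first
```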

A.2. Double deep-Q network
Deep RL methods are employed in environments with large observation or action spaces, where the generalization capabilities of NNs are beneficial for the agent to improve its performance. At their core, many of these methods are based on adapting the policy of the agent to optimize the return
$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},$$
with the discount factor $\gamma \in [0, 1)$. Assuming this figure of merit, each state and action pair $(s, a)$ can be assigned an action-value that quantifies the return expected to be received when starting from state $s$ in step $t$, taking action $a$, and subsequently following policy $\pi$:
$$q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right].$$
The aim is to find the optimal policy, i.e. a policy with a greater or equal expected return compared to all other policies for all states. Such a policy can be derived from the optimal action-value function $q_*$. The recursive relationship between the value of the current state and the next state can be expressed in the Bellman optimality equation,
$$q_*(s, a) = \mathbb{E}\left[r_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a\right],$$
which, when solved, yields the optimal policy. Instead of directly solving the Bellman optimality equation, in value-based RL the aim is to learn the optimal action-value function from data samples and to derive the optimal policy from the learned values. One of the most prominent value-based RL algorithms is Q-learning, which was designed for discrete state and action spaces, where each state-action pair $(s, a)$ is assigned a so-called Q-value $Q(s, a)$ which is updated to approximate $q_*$. Starting from an initial guess for all values $Q(s, a)$, the values are updated for each state-action pair $(s, a)$ while the agent interacts with the environment according to the following update rule:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right],$$
where $\alpha$ is the learning rate and $s'$ is the next encountered state after taking action $a$ in state $s$. This algorithm has been proven to converge under certain criteria, one of which is a policy that is infinitely exploring but greedy in the limit. In practice, a popular choice for a sufficiently explorative policy is the $\epsilon$-greedy policy, which selects $\arg\max_a Q(s, a)$ with probability $1-\epsilon$ and a uniformly random action otherwise. In the following, we will look at how Q-learning can be extended to large spaces through the use of NNs as function approximators.
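For reference, a textbook tabular Q-learning update with an ε-greedy policy (generic code, not specific to our environments; all names are illustrative) can be sketched as follows.

```python
import random
from collections import defaultdict

# Generic tabular Q-learning with an epsilon-greedy policy, matching the update rule above.

Q = defaultdict(float)           # Q[(state, action)], initialized to 0

def epsilon_greedy(state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.99):
    target = reward + gamma * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (target - Q[(state, action)])

# Tiny usage example with abstract states and actions.
actions = ["gate", "measure"]
a = epsilon_greedy("s0", actions)
q_update("s0", a, reward=1.0, next_state="s1", actions=actions)
print(dict(Q))
```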
NNs work exceedingly well, solving tasks that could not be handled by prior RL methods. An essential requirement for NN training, independently and identically distributed data, is not naturally provided by the sequential nature of RL data. This problem is circumvented by experience replay: the NN is trained with batches of experiences that are stored in a replay memory and randomly sampled from it, rather than in the order they were collected. To stabilize training, two NNs are employed: a policy network that is continuously updated, and a target network that is an earlier copy of the policy network. The policy network is used to estimate the current value, while the target network is used to provide a more static target value
$$Y^{\mathrm{DQN}} = r + \gamma \max_{a'} Q_{\mathrm{target}}(s', a').$$
In the double deep Q-network (DDQN) algorithm, the action for the target value is instead selected by the policy network to reduce the overestimation bias present in standard DQN. The corresponding target is defined by
$$Y^{\mathrm{DDQN}} = r + \gamma\, Q_{\mathrm{target}}\left(s', \arg\max_{a'} Q_{\mathrm{policy}}(s', a')\right).$$
This target value is approximated using a chosen loss function; we choose a smooth L1 loss.
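A generic PyTorch sketch of the DDQN target and loss computation (illustrative, not the exact training code used in this work; tensor names are assumptions) is given below.

```python
import torch
import torch.nn.functional as F

# Sketch of the DDQN target and loss: the policy network picks the next action,
# the target network evaluates it.

def ddqn_loss(policy_net, target_net, batch, gamma=0.99):
    obs, actions, rewards, next_obs, done = batch          # tensors from the replay memory
    q_taken = policy_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_actions = policy_net(next_obs).argmax(dim=1, keepdim=True)
        next_q = target_net(next_obs).gather(1, next_actions).squeeze(1)
        target = rewards + gamma * (1.0 - done) * next_q
    return F.smooth_l1_loss(q_taken, target)
```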

Appendix B. Data mining algorithms
Here, we describe an efficient algorithm for identifying interesting subsequences, based on the SPADE algorithm [25] to find frequent subsequences. The algorithm starts by collecting all length-one subsequences with $I(\tilde d) > I_{\min}$ into a set of interesting subsequences $\tilde D$. The minimum interestingness $I_{\min}$ is a user-defined threshold. All length-one subsequences in this set are then combined into length-two subsequences. Every length-two subsequence with $F(\tilde d) > F_{\min}$, $C(\tilde d) > C_{\min}$ and $I(\tilde d) > I_{\min}$ is added to the set of interesting subsequences $\tilde D$. The minimum cohesion $C_{\min}$ and the minimum support $F_{\min}$ are user-defined thresholds. The set of interesting subsequences $\tilde D$ is iteratively filled with longer subsequences until a predefined maximum subsequence length is reached. After the generation of the most frequent and compact subsequences, the output patterns may be filtered in a postprocessing step. For our purposes, we filtered by length and coverage. The length filter discards all patterns that are shorter or longer than a pre-defined interval. The coverage filter ensures that the maximum number of patterns that appear in each sequence is limited by the maximum coverage value. This filter works as follows: for each pattern with more than one element, starting from the patterns with the highest interestingness, it checks in which sequences the pattern appears, or, in other words, which sequences are 'covered' by the pattern. Each sequence can only be covered by a maximum number of patterns, set by the user. Patterns that are left with no sequence to cover are discarded. Hence, choosing a low maximum coverage value means that more patterns will be filtered out. Additionally, we define an RL data-specific mechanism that filters patterns by their reward.
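The level-wise candidate generation can be sketched as follows; this is a simplified illustration that reuses the support, cohesion, and interestingness functions sketched in section 3.2.1 and omits the SPADE-specific id-list joins as well as the postprocessing filters.

```python
# Simplified sketch of the level-wise (SPADE-inspired) candidate generation described above.
# `support`, `cohesion`, and `interestingness` are the functions sketched in section 3.2.1.

def mine_patterns(dataset, f_min, c_min, i_min, max_len):
    items = {item for seq in dataset for item in seq}
    # level 1: interesting single items
    level = [(item,) for item in items if interestingness((item,), dataset) > i_min]
    found = list(level)
    # iteratively extend surviving patterns by one item until max_len is reached
    while level and len(level[0]) < max_len:
        candidates = {p + (item,) for p in level for item in items}
        level = [
            c for c in candidates
            if support(c, dataset) > f_min
            and cohesion(c, dataset) > c_min
            and interestingness(c, dataset) > i_min
        ]
        found.extend(level)
    return found
```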
We added a hyperparametrization scheme to the mining procedure that automatically adjusts the threshold values $F_{\min}$, $C_{\min}$, $I_{\min}$ to output a number of patterns that lies within a given interval. Therefore, the user can either set fixed values for the thresholds or choose an interval for the number of output patterns instead, to ensure that the algorithm yields neither too few nor too many patterns.
The algorithm described above is straightforwardly extended to find gadgets by restricting the underlying dataset to a set of action sequences of interaction sequences d that have a return G(d) ⩾ G min as described in section 3.1.

Appendix C. HDBSCAN
To cluster the gadgets we use the HDBSCAN algorithm [26], a hierarchical clustering algorithm that extracts a set of clusters based on their stability. This is done in a series of steps, which we briefly review in this section. First, the data is transformed such that dense regions stay dense and sparse regions become even sparser. This is achieved by transforming the metric space using the so-called core distance core_k(x), which is the distance within which the nearest k neighboring data points can be reached. Then, the mutual reachability distance, which either preserves the original distance or increases it to the larger core distance of the two data points, can be defined as

d_mreach-k(a, b) = max{core_k(a), core_k(b), d(a, b)},

where d(a, b) is the distance between the data points a and b under the original distance metric.
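A minimal sketch of this metric transformation, assuming a Euclidean base metric:

import numpy as np
from scipy.spatial.distance import cdist

def mutual_reachability(X, k=5):
    """Pairwise mutual reachability distances for the points in X."""
    D = cdist(X, X)                       # original pairwise distances d(a, b)
    core = np.sort(D, axis=1)[:, k]       # core_k(x): distance to k-th neighbor
    # replace d(a, b) by the larger of the two core distances where needed
    return np.maximum(D, np.maximum(core[:, None], core[None, :]))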
Then, the data is represented as a graph, where each node represents a data point and the edges between the nodes are weighted by the mutual reachability distance. If all edges above a certain mutual-reachability-distance threshold are removed, the graph breaks into subgraphs. A hierarchy is then established that orders the graph from completely connected to disconnected as the distance threshold varies. Such a hierarchy can be extracted from the minimum spanning tree, a loop-free subgraph that contains all M nodes of the graph and the M − 1 edges with the lowest sum of weights among all original weighted edges. Prim's algorithm is used to obtain this tree.
After building the hierarchy from progressively more disconnected subgraphs, the hierarchy can be further condensed by introducing a minimal cluster size. The clusters are retrieved from the hierarchy using the following definition of the stability of a cluster:

S(C) = ∑_{p ∈ C} (λ_p − λ_birth),

where C is a cluster, λ_p is the time when the data point p is removed from the cluster, and λ_birth is the time when this cluster was formed. The clusters are then determined starting from the leaf nodes, declaring all leaf clusters as selected. If the sum of the stabilities of the children is higher than the stability of the parent, the stability of the parent is set to the sum of its children's stabilities. If the sum of stabilities is lower, the parent is declared selected instead of its children. When the root node is reached, the algorithm terminates and outputs the set of selected clusters. Further details on the algorithm can be found in this tutorial.

Appendix D. Further results
In this section, we describe the environments and the obtained results in more detail. Specifically, we list all gadgets that were found and analyze the different clusters obtained by the clustering.

D.1. Brief introduction to quantum mechanics
In this section, we provide a brief introduction to quantum mechanics in order to facilitate the assessment of our results. A quantum system is described as a vector in a so-called Hilbert space H, and it is denoted (using Dirac notation) as |ψ ⟩. One key property of quantum systems is that they can be in a superposition state, which is a linear combination of the basis vectors {|i ⟩} in H = C d , i.e.
|ψ⟩ = ∑_i α_i |i⟩,

where α_i ∈ C and ∑_i |α_i|^2 = 1. Transformations of a quantum state are described by unitary matrices U, with U^{-1} = U^†, which are linear and preserve the norm. Another crucial feature of quantum systems is the stochastic nature of the measurement process. The property that is measured, called an observable, is described by a Hermitian matrix O = O^† with spectral decomposition O = ∑_m a_m P_m. The coefficients a_m are the (real) eigenvalues of the matrix and represent the possible outcomes of the measurement, while P_m is the projector onto the subspace that corresponds to the eigenvalue a_m. The probability of obtaining the outcome a_m when measuring |ψ⟩ is given by

p(a_m) = ⟨ψ|P_m|ψ⟩.

As a result of the measurement, the state of the quantum system is projected onto the subspace corresponding to the measurement outcome a_m as

|ψ⟩ → P_m|ψ⟩ / √(⟨ψ|P_m|ψ⟩).

Note that, unlike other state transformations, which are described by unitary matrices, measurements are non-unitary and, thus, irreversible. The above notions can be generalized to more complex, composite quantum systems by taking the tensor product of the Hilbert spaces corresponding to the individual systems, i.e. H = C^d ⊗ · · · ⊗ C^d = (C^d)^{⊗n}. Composite quantum systems can be in so-called entangled states, which present correlations that cannot be observed in classical systems. A state is called entangled when it cannot be expressed as a product state of the individual components, i.e. as |ϕ⟩ = |ψ_1⟩_1 ⊗ · · · ⊗ |ψ_n⟩_n. In the following sections, we introduce two quantum environments, one in the field of quantum optics and one in quantum information. For each, we provide a brief introduction with examples to illustrate how these properties apply in the given scenario.
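The following short numpy example illustrates the measurement rule for a qutrit (d = 3) measured in the computational basis; the state and observable are illustrative only.

import numpy as np

d = 3
psi = np.array([1, 1j, 1]) / np.sqrt(3)                    # superposition state
projectors = [np.outer(e, e.conj()) for e in np.eye(d)]    # P_m = |m><m|

# outcome probabilities p(a_m) = <psi|P_m|psi>
probs = [float(np.real(psi.conj() @ P @ psi)) for P in projectors]
print("outcome probabilities:", np.round(probs, 3))        # each ~ 0.333

m = 1                                                      # suppose outcome a_1 occurs
post = projectors[m] @ psi
post = post / np.linalg.norm(post)                         # projected, renormalized state
print("post-measurement state:", post)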

D.2. Quantum optics environment
Here, we describe in more detail the quantum optics environment introduced in [11] and studied in section 4.1.
Background.-Quantum optics studies the interaction between the individual quanta of light, called photons, and atoms or molecules. In this learning environment, the quantum systems (see section D.1) are thus photons. In the framework of quantum mechanics introduced in section D.1, the basis states of the Hilbert space describe the possible states of a degree of freedom of the photon. In this work, we consider the orbital angular momentum (OAM) of single photons, which is associated with the azimuthal phase of the complex electric field. There are, in principle, infinitely many discrete values for the OAM of a photon.
Here, the task is to construct an optical experiment where different optical elements are suitably arranged to produce a final multi-photon entangled state, that is, a state that cannot be described as the tensor product of single-photon states. An example of a two-photon entangled state is (|0⟩_a ⊗ |0⟩_b + |1⟩_a ⊗ |1⟩_b)/√2. In quantum optics, the unitary operations that manipulate photons are implemented by optical elements. For example, a beam splitter BS_{a,b} splits a beam into two, in this case transforming a quantum state |m⟩ with OAM m into a superposition of two OAM beams [11, 21]. Examples of similar transformations are listed in table 2.
Interferometers are setups where photons travel along different paths and recombine at optical elements, thereby creating interference patterns. As shown in [21], a combination of optical elements arranged as a Mach-Zehnder interferometer produces a high-dimensional entangled state of the form given in equation (21), where we use the compact notation |000⟩ = |0⟩_a ⊗ |0⟩_b ⊗ |0⟩_c, dropping the labels that identify the individual photons. From an initial four-photon state, this interferometer selects photons b and c with the same parity (via constructive interference) and, by measuring photon d, projects the remaining three-photon state onto the entangled state of equation (21) (see also equation (19)). In the rest of this section, we describe an environment in which the agent is tasked to generalize this procedure and find other optical setups that generate this type of entanglement. As briefly introduced above, in this environment the agent has to generate high-dimensional, multipartite entangled states by suitably interfering single photons. At each time step, the agent interacts with a simulation of an optical experiment, placing optical elements on the four photon paths. At the beginning of each episode, the state is initialized to a product of two OAM-entangled photon pairs in the paths a, b, c, d (the state |ψ(0)⟩ of equation (22)). In every episode, the agent can place 12 elements before the episode terminates. At the end of each episode, the quantum state is measured in path d and the SRVs of the resulting conditional three-photon states are evaluated. The SRV is a vector containing the rank of the reduced density matrix of each subsystem. That is, considering the tripartite system after the measurement, the vector takes the form (r_a, r_b, r_c), where r_i = rank(ρ_i) is the rank of the reduced density matrix ρ_i = tr_ī(ρ) and ī is the complement of i ∈ {a, b, c}. The environment issues a reward of one if r_i ⩾ 2 for all i; otherwise, the reward is zero. There is one exception to this rule: since the (4, 3, 3)-state is easily generated, this state is not rewarded, so as to encourage the agent to find a variety of states.

Table 3. Optimized threshold values for the support F and cohesion C for each agent in the gadget mining. The algorithm is set to output between 1 and 5 gadgets, so that the threshold values are automatically adjusted accordingly. The minimum interestingness I_min is the product of the minimum support F_min and the minimum cohesion C_min.
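As an illustration of the SRV evaluation described above, the rank of each reduced density matrix of a pure tripartite state can be computed from the reshaped state vector; the dimensions and example state below are illustrative.

import numpy as np

def srv(psi, dims):
    """Schmidt-rank vector of a pure tripartite state.
    psi: state vector, dims: (d_a, d_b, d_c)."""
    tensor = psi.reshape(dims)
    ranks = []
    for i in range(3):
        # move subsystem i to the front and flatten the remaining two;
        # the matrix rank equals rank(rho_i) for a pure state
        mat = np.moveaxis(tensor, i, 0).reshape(dims[i], -1)
        ranks.append(int(np.linalg.matrix_rank(mat)))
    return tuple(ranks)

# example: the GHZ-like state (|000> + |111> + |222>)/sqrt(3) has SRV (3, 3, 3)
d = 3
psi = np.zeros(d**3)
for m in range(d):
    psi[m * d**2 + m * d + m] = 1 / np.sqrt(3)
print(srv(psi, (d, d, d)))    # -> (3, 3, 3)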
In this environment, the dataset D consists of unique rewarded sequences collected during exploration. Gadgets are retrieved from the dataset D by further processing the dataset with the sequence mining algorithm described in section 3.2. The obtained gadgets are listed and analyzed in the following section.

D.3. Quantum optics gadgets
Gadget mining. The dataset D of unique, rewarded action sequences collected during exploration is mined by the sequence mining algorithm (see section 3.2 and appendix B). Each sequence consists of 12 optical elements. Instead of fixing the threshold values for the minimum support F min , minimum cohesion C min and minimum interestingness I min , the algorithm is set to output between 1 and 5 gadgets for each of the ten agents run in parallel. The thresholds are then automatically adjusted until the correct number of gadgets is retrieved. The resulting threshold values for each agent are given in table 3. When selecting gadgets for this environment, we apply three additional filters (see appendix B for details): (1) Gadgets are filtered by length to allow only gadgets of length 4-6. (2) Gadgets are filtered by coverage to ensure that at most 2 gadgets can cover each sequence. (3) We filter gadgets by their reward, which means that only those gadgets are kept that themselves constitute successful experiments. The latter filter is in line with the considerations made in [11].
We collect the gadgets from 10 different agents and cluster them in two different ways as described in the following.
Gadget clustering. In order to help us interpret the list of gadgets obtained from 10 different agents, we cluster them by (i) information provided by the environment in the form of the SRV and (ii) the type of operations they contain (see section 3.3.1 for a description of utility-based clustering).
SRV-based clustering. Gadgets are sorted into clusters by the SRV that the gadget would yield if seen as a full experiment. All gadgets, their interestingness, and their clusters are listed in table 4.
Utility-based clustering. The clustering is done by the HDBSCAN algorithm (see appendix C) after encoding the gadgets using action features. In RL environments, one can assume knowledge about the structure of the action space. Typically, this is used to encode actions by their features. In this case, we describe each action by only one of its features, namely the type of optical element. This yields a coarse-grained description of the gadgets, which we use to define a distance between them. As described in section 4.1, the four types of optical elements are BS, DP, Refl, and Holo. Each gadget is then encoded as a vector of the form ϕ(g) = (n_BS, n_DP, n_Refl, n_Holo), where each component is the total number of occurrences of one element type within the gadget. Additionally, we define a utility vector α = (α_BS, α_DP, α_Refl, α_Holo) that assigns a weight to each optical-element type. Since we do not assume any element to be more costly than the others, we weight all four types equally, i.e. α = (1, 1, 1, 1). The distance M(g_1, g_2) between two gadgets is then defined as the Euclidean distance between ϕ(g_1) and ϕ(g_2). Clustering gadgets with this distance, we obtain the clusters in table 5. We observe that gadgets with similar compositions are clustered together. For example, cluster C2 contains only gadgets consisting of two BS, one DP and one Refl. All gadgets, their interestingness, and their clusters are listed in table 5.
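A sketch of this encoding and clustering step, using the hdbscan Python library and a handful of made-up gadgets; the clustering parameters here are chosen for the tiny example and differ from those listed in appendix E.

import numpy as np
import hdbscan

TYPES = ["BS", "DP", "Refl", "Holo"]

def encode(gadget, utility=(1.0, 1.0, 1.0, 1.0)):
    """Encode a gadget as (n_BS, n_DP, n_Refl, n_Holo), weighted by utilities."""
    counts = np.array([sum(op.startswith(t) for op in gadget) for t in TYPES],
                      dtype=float)
    return counts * np.asarray(utility)

# illustrative gadgets (sequences of optical elements)
gadgets = [
    ["BS_bc", "DP_b", "Refl_c", "BS_bc"],          # Mach-Zehnder-like gadget
    ["BS_ad", "DP_d", "Refl_a", "BS_ad"],
    ["BS_ab", "DP_a", "Refl_b", "BS_ab"],
    ["Holo_a", "BS_ab", "Holo_c", "BS_cd"],
    ["Holo_d", "BS_ad", "Holo_b", "BS_bc"],
    ["Refl_a", "Refl_b", "DP_c", "DP_d"],
]
X = np.array([encode(g) for g in gadgets])

clusterer = hdbscan.HDBSCAN(min_cluster_size=2, min_samples=2, metric="euclidean")
labels = clusterer.fit_predict(X)                  # label -1 marks noise points
print(labels)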

D.4. Illustration of a quantum optics gadget
In this section, we provide an explicit example of how a quantum optics gadget operates. Through the mining process we obtained the Mach-Zehnder interferometer as a gadget. Generally, interferometers are important tools in optics for combining light from multiple paths and generating interference patterns. A Mach-Zehnder interferometer in particular consists of the elements BS_{b,c}, DP_b, Refl_c, BS_{b,c} and produces a state with SRV (3, 3, 2) when applied to the initial state |ψ(0)⟩ of equation (22). Here, we analyze a simpler example where the input is a single photon in path c in a superposition of two OAM values, (|0⟩_c + |1⟩_c)/√2. Using this example, we illustrate how the Mach-Zehnder interferometer sorts the input OAM values by parity into the different output paths. First, we apply the beam splitter BS_{b,c} (see table 2). Second, we apply the transformations of the Dove prism DP_b and the mirror Refl_c on paths b and c, respectively. Finally, we apply the last beam splitter BS_{b,c} and obtain a state in which the even OAM value exits in path b and the odd OAM value exits in path c. This can be generalized to the case where the input photon is in a superposition of more than two OAM values; hence, this gadget is referred to as a parity sorter. The situation becomes more involved if two photons enter this gadget, for example with the starting state |ψ(0)⟩ of equation (22). In this case, a final measurement of path d in the state (|0⟩_d + |−1⟩_d)/√2 leads to the high-dimensional entangled state of equation (21), with SRV (3, 3, 2) [21].

Table 4. List of gadgets mined from the datasets collected by ten agents, clustered by SRV. The column labeled 'Agent' shows from which agent's dataset the gadget was retrieved. The SRV corresponding to each gadget is listed in the second column. In each cluster, the gadgets are sorted from highest to lowest interestingness value. Additionally, the support and cohesion, whose product yields the interestingness, are displayed.

Table 5. List of gadgets mined from 10 datasets and clustered by utility. The clusters are ordered from highest to lowest interestingness. Each gadget is assigned a probability (column labeled 'Prob') of belonging to its cluster. If this probability is zero, the sample is considered noise for the corresponding cluster. Cluster −1 contains all gadgets that are not assigned to any of the other clusters. The clusters 1-4 predominantly contain interferometers that generate the (3, 3, 2)-state, while the gadgets in cluster 0 generate higher-SRV states by using the combination of a hologram and a beam splitter. Within each cluster, the gadgets are ordered by interestingness. Additionally, for each gadget, the support and cohesion are displayed.

D.5. Quantum information environment
Here, we describe in detail the quantum information environment of section 4.2. We start by providing a brief background on quantum information processing, which is instrumental in describing the learning environment studied in this work.
Background.-In quantum information theory, the unit of information is the quantum bit, or qubit for short. A qubit is a two-level quantum system of the form

|ψ⟩ = α_0|0⟩ + α_1|1⟩,

where α_0, α_1 ∈ C (see also equation (17)), and {|0⟩, |1⟩} are the states of the computational basis in the Hilbert space H = C^2. In matrix form, |0⟩ = (1, 0)^T and |1⟩ = (0, 1)^T. In this work, we consider a multi-qubit system consisting of four qubits. A basis of the Hilbert space of the composite system H = (C^2)^{⊗4} is given by the tensor product of the basis states of each individual subsystem, i.e. {|0000⟩, |1000⟩, |0100⟩, . . . , |1111⟩}, using the notation of equation (21). Quantum circuits consist of several operations (represented as boxes in figure 2, with boxes to the left applied first) that are applied to qubits (represented as lines in figure 2) to transform the quantum information encoded in the initial quantum state. These operations can be either quantum gates or measurements, and can involve one or two qubits. Gates are described by unitary matrices. The one-qubit gates we consider in this work are X (the bit flip, X|0⟩ = |1⟩ and X|1⟩ = |0⟩), Z (the phase flip, Z|0⟩ = |0⟩ and Z|1⟩ = −|1⟩), the Hadamard gate H (H|0⟩ = |+⟩ and H|1⟩ = |−⟩) and the phase gate S (S|0⟩ = |0⟩ and S|1⟩ = i|1⟩). As for the two-qubit gates, they typically involve one control qubit and one target qubit: the effect of the transformation on the target qubit depends on the state of the control qubit. Here, we consider the CNOT and the CZ gates, which apply X or Z to the target qubit, respectively, if the control qubit is in |1⟩. If instead the control qubit is in the state |0⟩, they leave the target qubit unchanged.
As briefly introduced in appendix D.1, entangled states are states that cannot be expressed as a product state of the individual qubits. An example of an entangled state is the two-qubit Bell state |ϕ+⟩ = (|00⟩ + |11⟩)/√2, which can be obtained by applying a CNOT to the state (|0⟩ + |1⟩)/√2 ⊗ |0⟩:

CNOT [(|0⟩ + |1⟩)/√2 ⊗ |0⟩] = (|00⟩ + |11⟩)/√2 = |ϕ+⟩.

One-qubit measurements project states onto the eigenstates of the corresponding observable (e.g. the eigenstates of Z are |0⟩, with eigenvalue 1, and |1⟩, with eigenvalue −1). Two-qubit Bell measurements project states onto a Bell state. Note that an initial state of the form (α_0|0⟩ + α_1|1⟩)_A ⊗ |ϕ+⟩_BC can be rewritten as

(α_0|0⟩ + α_1|1⟩)_A ⊗ |ϕ+⟩_BC = (1/2) [ |ϕ+⟩_AB (α_0|0⟩ + α_1|1⟩)_C + |ϕ−⟩_AB (α_0|0⟩ − α_1|1⟩)_C + |ψ+⟩_AB (α_1|0⟩ + α_0|1⟩)_C + |ψ−⟩_AB (α_0|1⟩ − α_1|0⟩)_C ],

where |ϕ−⟩, |ψ+⟩, |ψ−⟩ denote the other Bell states and {A, B, C} are the qubit labels. One can see that, if this state is projected onto one of the four Bell states, the resulting state-i.e. one of the four summands in the above equation-carries (up to a local unitary) the quantum information that was initially encoded in the first qubit, now teleported to the third qubit. In the following, the above transformations (i.e. one- and two-qubit gates, one- and two-qubit measurements, and teleportation procedures) are used by the agent to fill in the missing operations in the circuits generated by the environment.

Table 6.
              One-qubit          Two-qubit
Gates         X, Z, H, S         CNOT, CZ
Measurements  P_X, P_Y, P_Z      P_Bell, (S ⊗ I) P_Bell (S† ⊗ I), (I ⊗ S) P_Bell (I ⊗ S†)

After this brief introduction to quantum information, we now proceed to introduce the learning environment. In this environment, the agent needs to complete an initial quantum circuit without losing the quantum information that is encoded in the first qubit. The four-qubit circuit given to the agent consists of three operations randomly chosen by the environment, placed at fixed positions (marked with an E in figure 2 (right)). The agent then completes the circuit by sequentially placing six additional operations in between the initial ones (see figure 2 (right)). Thus, the final circuit has nine operations: three placed by the environment and six by the agent. The list of possible operations is given in table 6. Each operation is encoded in four features: feature one distinguishes between a gate and a measurement (G/M), feature two declares a one-qubit or a two-qubit operation (1/2), feature three labels the register on which the action is applied, and feature four labels the specific operation from table 6. A circuit element is thus described by a four-feature vector. For example, a Hadamard gate on the third register is denoted as (G, 1, 3, H) or, to ease the notation, G1_3H.
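The Bell-state preparation and the teleportation identity above can be checked numerically. The following sketch uses explicit state vectors with qubit ordering A, B, C and illustrative amplitudes.

import numpy as np

ket0, ket1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
plus = (ket0 + ket1) / np.sqrt(2)

CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

# |phi+> = CNOT (|+> tensor |0>)
phi_plus = CNOT @ np.kron(plus, ket0)
print(np.allclose(phi_plus,
                  (np.kron(ket0, ket0) + np.kron(ket1, ket1)) / np.sqrt(2)))  # True

# Teleportation: project qubits A, B of (a0|0> + a1|1>)_A tensor |phi+>_BC onto |phi+>
a0, a1 = 0.6, 0.8
psi_A = a0 * ket0 + a1 * ket1
state = np.kron(psi_A, phi_plus)                      # 8-dimensional vector |A B C>
P = np.kron(np.outer(phi_plus, phi_plus), np.eye(2))  # |phi+><phi+|_AB tensor I_C
collapsed = P @ state
# take the C amplitudes attached to the |00>_AB component and renormalize
qubit_C = collapsed.reshape(4, 2)[0]
qubit_C = qubit_C / np.linalg.norm(qubit_C)
print(np.allclose(qubit_C, psi_A))                    # amplitudes teleported to C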
The agent's observation is the current circuit, given by the sequence of operations that have already been placed. Thus, the state space consists of all the possible circuits that result from combining the operations given in table 6, which can be applied to the different qubits. Each operation is described by four features as explained above. The DDQN receives as input a further encoded vector, where each operation of the circuit is encoded as the concatenation of the one-hot encoding of each feature.
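A sketch of this observation encoding; the feature alphabets (and the single register label used here even for two-qubit operations) are simplifying assumptions for illustration.

import numpy as np

FEATURES = [("G", "M"),                   # gate or measurement
            (1, 2),                       # one- or two-qubit operation
            (1, 2, 3, 4),                 # register label
            ("X", "Z", "H", "S", "CNOT", "CZ", "PX", "PY", "PZ", "PBell")]

def one_hot(value, alphabet):
    v = np.zeros(len(alphabet))
    v[alphabet.index(value)] = 1.0
    return v

def encode_operation(op):
    """op: tuple such as ('G', 1, 3, 'H') for a Hadamard on the third register."""
    return np.concatenate([one_hot(f, alph) for f, alph in zip(op, FEATURES)])

def encode_circuit(ops):
    """Concatenate the one-hot encodings of all operations placed so far."""
    return np.concatenate([encode_operation(op) for op in ops])

print(encode_operation(("G", 1, 3, "H")).shape)   # (18,) = 2 + 2 + 4 + 10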
At each time step, the agent observes the current circuit and chooses an action, i.e. which circuit element to place next (from those given in table 6) and on which qubits it is applied. For a four-qubit circuit, the total number of different possible actions is 58. Following each action, the environment evaluates the circuit and issues a reward. The reward function distinguishes whether the quantum information, which is initially located on the first register, has been lost or not. In order to check whether a circuit C keeps the quantum information alive, we verify that the circuit maps orthogonal input states to orthogonal output states. That is, we consider the starting states |0⟩|φ_i⟩ and |1⟩|φ_i⟩, where the first qubit is either in the state |0⟩ or |1⟩, while all other qubits are initialized either in |φ_1⟩ = |000⟩ or in |φ_2⟩ = |+++⟩. If the resulting states |ψ_a⟩ = C|0⟩|φ_i⟩ and |ψ_b⟩ = C|1⟩|φ_i⟩ are nonzero and orthogonal for at least one of the two cases i = 1, 2, the reward is R = 1. Otherwise, the reward is negative, R = −1. A reward (R ∈ {1, −1}) is issued every time the agent places a new operation on the circuit. Once the agent has completed the full circuit (i.e. it has placed six operations, completing a nine-operation circuit), it receives a reward that is rescaled by a factor of 5 (R ∈ {5, −5}). This rescaling factor improved the learning performance of the agents, as we found in a coarse-grained hyperparameter search.
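The reward criterion can be summarized by the following sketch, where the (possibly non-unitary, due to measurements) circuit map C is represented as a 16 × 16 matrix acting on the four-qubit space.

import numpy as np

def keeps_information(C, atol=1e-9):
    """True if C maps |0>|phi_i> and |1>|phi_i> to nonzero, orthogonal states
    for at least one of the reference states |phi_1> = |000>, |phi_2> = |+++>."""
    ket0, ket1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    plus = (ket0 + ket1) / np.sqrt(2)
    phis = [np.kron(np.kron(ket0, ket0), ket0),    # |000>
            np.kron(np.kron(plus, plus), plus)]    # |+++>
    for phi in phis:
        psi_a = C @ np.kron(ket0, phi)
        psi_b = C @ np.kron(ket1, phi)
        nonzero = np.linalg.norm(psi_a) > atol and np.linalg.norm(psi_b) > atol
        orthogonal = abs(np.vdot(psi_a, psi_b)) < atol
        if nonzero and orthogonal:
            return True
    return False

def reward(C, final=False):
    r = 1 if keeps_information(C) else -1
    return 5 * r if final else r                   # rescaled by 5 for the full circuit

print(reward(np.eye(16)))                          # the empty circuit keeps the information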
The learning curves of the three independently trained DDQN agents are shown in figure 7. The performance is measured as the fraction of circuits that the agent completed successfully. The agent completes one circuit per training episode. Once the agent finishes 4 × 10 5 training episodes, the learning process stops and the agent starts collecting positively rewarded circuits until it completes a dataset D with |D| = 8 × 10 4 unique, correct circuits of length 9. This dataset D is then input to the sequence mining (see section 3.2) to mine for gadgets. The gadgets obtained in this environment are listed and analyzed in the following section.
D.6. Quantum information gadgets

Gadget mining. The dataset D of unique, positively rewarded quantum circuits that the agent collected after training is input to the sequence mining algorithm (see section 3.2) to generate a set of gadgets. Of the nine operations that each circuit contains, we choose to mine only the sequence of six operations, i.e. actions, placed by the agent. In this way, we gain insights into the policy that the agent has learned to complete the circuits effectively. In addition, as a preprocessing step within the sequence mining, the last feature of each action (i.e. the information on the specific gate or measurement basis) is removed. The threshold values F_min, C_min and I_min are automatically adjusted such that the sequence mining outputs between one and ten gadgets. The resulting threshold values for each agent are given in table 7. When selecting gadgets for this environment, we apply two additional filters (see appendix B for details).

Figure 7. Learning performance. Performance (measured as the fraction of circuits that the agent completed successfully) of the three agents trained in the quantum information environment. The agent completes one circuit per episode. Average and one standard deviation over 1000 episodes.

Table 7. Threshold values for the support F and cohesion C for each agent in the gadget mining. We fix a range between 1 and 10 gadgets that the algorithm should output, and it automatically adjusts the threshold values accordingly. For the minimum interestingness, we set I_min = F_min · C_min.

Gadget clustering. In order to help us interpret the list of gadgets, collected from three independent DDQN agents, they are clustered by (i) the type of operations that they contain (see section 3.3.1) and (ii) the context in which they were used (see section 3.3.2). In both cases, the clustering is done by the HDBSCAN algorithm (see appendix C); the difference lies in the distance measure that is used.
Utility-based clustering. As for many RL environments, we understand the actions of the agent well enough to encode them by their features. In this case, we can categorize actions into four broad types, namely one- and two-qubit gates (denoted G1, G2) as well as one- and two-qubit measurements (denoted M1, M2). We make use of this coarse-grained information to define a distance measure between gadgets. Each gadget is encoded as a vector of the form ϕ(g) = (n_G1, n_G2, n_M1, n_M2), where each component is the total number of operations of a given type that the gadget contains. We can also define a utility vector α = (α_G1, α_G2, α_M1, α_M2) that assigns a weight to each operation type. In this case, we assign the same utility to each operation type, i.e. α = (1, 1, 1, 1). The distance M(g_1, g_2) between two gadgets is then defined as the Euclidean distance between ϕ(g_1) and ϕ(g_2). Clustering gadgets with this distance, we obtain the clusters in table 8. We observe that gadgets with the same types of operations are clustered together. For example, all gadgets that contain at least two one-qubit gates belong to cluster C1.

Table 8. List of gadgets obtained by the three DDQN agents and classified through utility-based clustering. The column 'Prob' indicates the probability that the gadget belongs to the given cluster. A zero probability means that this data point is considered noise. For each gadget, we also provide its interestingness, support, and cohesion. The last gadgets are unclassified.
Context-based clustering. We define another distance between two gadgets as the distance between their corresponding initialization lists (see section 3.3.2). First, the initialization list I_g of each gadget g is obtained from the dataset of interaction sequences D_g that contain g: I_g is the list of observations prior to the placement of gadget g. We collect initialization lists of size |I_g1| = |I_g2| = 2000. Then, the distance m(o, q) for each pair of observations o ∈ I_g1 and q ∈ I_g2 is computed from feature-wise distances, where the observations o and q are defined as sequences of actions o = (a^(1), . . . , a^(L)) and q = (b^(1), . . . , b^(L)). Note that only sequences of the same length |o| = |q| = L are compared. Each action a^(k) in the sequence is an operation described by a feature vector a^(k) = (a^(k)_1, a^(k)_2, a^(k)_3) (the last feature was removed in the gadget mining). We choose as the distance between feature i of the operations a^(k) and b^(k) the normalized Hamming distance. Analogously to the utilities defined in section 3.3.2, the user can introduce weights w_i for each feature i; in this case, we choose equally distributed weights. The distance M(I_g1, I_g2) used to cluster gadgets g_1 and g_2 is then computed according to equation (5). Table 9 shows the classification by context. In this case, we can gain insights into the policy and the environment by analyzing not only the clusters and gadgets themselves but also the initialization lists of the corresponding gadgets. In table 9, we provide information on (i) the average length of the circuits inside I_g for each g, which indicates at which point of the circuit the agent, on average, decides to place the gadget; and (ii) the average reward of circuits in I_g, which informs us about whether the agent places the gadget to correct the circuit (negative InitList Reward) or not. By visually inspecting some circuits in the initialization lists of gadgets in cluster C2 with negative InitList Reward, we see that, as expected, they contain measurements on the first register (see e.g. figure 8). We confirmed that the agent learned to use the gadgets in C2 to correct the circuit by analyzing the reward of the circuit before and after placing one of these gadgets, which turned from negative to positive (see figure 8 for an example of such a case). In addition, we observe that the circuits in the initialization list I_g of gadgets g that belong to cluster C0, and of the unclassified gadgets, are longer on average. Again, by inspecting the circuits in the corresponding I_g, we see that, for g in cluster C0, almost all of them start with a one-qubit gate (either G1_1 or G1_2). An example can be seen in figure 6: the circuit displayed in the label of C0 is an element of the initialization list of the gadget (G1_4, G1_4), where the first operation is G1_1; 80.7% of the circuits in its context start with G1_1. This percentage is higher for other gadgets in C0, for example (G1_4, M1_2), for which 98.9% of the circuits in its initialization list start with G1_1.
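A sketch of the per-pair observation distance is given below; the aggregation of the feature-wise normalized Hamming distances into a single number is written here as a weighted average, which is one consistent reading of the description above, and the pairwise distances m(o, q) would then be combined over the initialization lists according to equation (5).

import numpy as np

def observation_distance(o, q, weights=(1/3, 1/3, 1/3)):
    """o, q: equal-length sequences of operations, each a (type, arity, register)
    feature tuple; returns a weighted average of per-feature Hamming distances."""
    o, q = np.asarray(o), np.asarray(q)       # shape (L, 3), features as strings
    assert o.shape == q.shape
    # fraction of positions where feature i differs (normalized Hamming distance)
    per_feature = (o != q).mean(axis=0)
    return float(np.dot(weights, per_feature))

o = [("G", 1, 1), ("G", 2, 2), ("M", 1, 3)]
q = [("G", 1, 2), ("G", 2, 2), ("M", 2, 3)]
print(observation_distance(o, q))             # 2 mismatches out of 9 slots -> 2/9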
Table 10 shows the percentages of circuits that start with a given operation in the initialization list of gadgets belonging to C0, C1, and the unclassified gadgets. This explains why gadgets in C0 are clustered separately.

D.7. Illustration of a quantum information gadget
To provide a more in-depth understanding of the quantum information gadgets, we further analyze the circuit displayed in figure 8. First, we describe the initial random circuit before the gadget is placed. As an example, we choose the first gate to be Z, the second one to be H and the two-qubit measurement to be P_Bell, so the circuit is described (note the inverse ordering of the operations in the sense of operator composition on the Hilbert space) as C = P_Bell(14) H_2 Z_1, where the subscripts {1, 2, 3, 4} label the first, second, third and fourth qubit, respectively. As explained in appendix D.5, to obtain the reward, we first compute |ψ_a⟩ = C|0+++⟩ and |ψ_b⟩ = C|1+++⟩. In matrix notation, |+⟩ = 1/√2 (1, 1)^T and |−⟩ = 1/√2 (1, −1)^T, which are the eigenstates of X. After the first two operations, i.e. Z_1 and H_2, the states read |ψ^(2)_a⟩ = |00++⟩ and |ψ^(2)_b⟩ = −|10++⟩, where the superscript indicates the number of operations that have already been applied. We assume that the two-qubit measurement projects the state onto the Bell state |ϕ+⟩, which leads to the final states |ψ_a⟩ = |ψ_b⟩ = |ϕ+⟩_14 ⊗ |0⟩_2 ⊗ |+⟩_3. Since the two final states are not orthogonal, i.e. ⟨ψ_a|ψ_b⟩ ≠ 0, the reward of circuit C is negative. Note that if we instead consider |ψ_a⟩ = C|0000⟩ and |ψ_b⟩ = C|1000⟩, then |ψ_b⟩ = 0, so the reward is also negative.

Table 9. List of gadgets obtained by the three DDQN agents classified by context. The column 'Prob' indicates the probability that the gadget belongs to the given cluster. The columns 'InitList Length' and 'InitList Reward' indicate the average length and average reward, respectively, of the circuits in the initialization list of the gadget. The column labeled 'Agent' shows from which agent's dataset the gadget was retrieved. The last gadgets are unclassified.

Figure 8. Teleportation gadget. Example of a circuit in the context of gadget G2_12, M2_13. Before placing the gadget, the circuit yields a negative reward (R = −1) due to the measurement on the first register. After the agent places the gadget, the circuit is positively rewarded because the quantum information is teleported.

Table 10. Fraction of circuits that start with the same operation (indicated in the column label) in the initialization list of gadgets that belong to clusters C0 and C1.

Now we analyze the circuit with the gadget G2_12, M2_13 placed before the three initial operations. We take as an example G2_12 = CZ_12 and M2_13 = P_Bell(13). Again, we take the initial states |0+++⟩ and |1+++⟩ and first apply the operations that form the gadget, and then circuit C. After the first operation, the states read

|ψ^(1)_a⟩ = |0+++⟩ and |ψ^(1)_b⟩ = |1−++⟩.

After the second operation, we assume that the state of the first and third qubits is projected onto the Bell state |ϕ+⟩, leading to

|ψ^(2)_a⟩ = |ϕ+⟩_13 ⊗ |+⟩_2 ⊗ |+⟩_4 and |ψ^(2)_b⟩ = |ϕ+⟩_13 ⊗ |−⟩_2 ⊗ |+⟩_4.
Lastly, we apply circuit C to both states in the same way as we have done above and obtain

|ψ_a⟩ = |ϕ+⟩_14 ⊗ |0⟩_2 ⊗ |−⟩_3 and |ψ_b⟩ = |ϕ+⟩_14 ⊗ |1⟩_2 ⊗ |−⟩_3,

where we see that now, after placing the gadget, the states are orthogonal, ⟨ψ_a|ψ_b⟩ = 0, which leads to a positive reward.

Appendix E. Implementation details
For the quantum optics environment, we use an intrinsically motivated MCTS with horizon T = 12, a parameter C = 0.01 that balances exploration and exploitation, and a boredom threshold of B = 1000 time steps after which exploitative agents are reset. For the quantum information environment, we showcase that a standard implementation of a deep RL agent can be integrated and analyzed with the present architecture. In particular, we use the DDQN implementation from the GitHub repository accompanying the paper [7]. The NN for this implementation has three layers with 150 neurons each. We choose the ADAM optimizer with a learning rate of 0.00001. The training batches of the main network are of size 100, and the capacity of the replay memory is 2000 interactions, updated in a first-in first-out manner. The target network is updated every 1000 calls to the replay memory, which occurs every time step once the first batch is filled. The agent's policy is ϵ-greedy, with ϵ decaying exponentially with factor 0.999991 down to a minimum value of ϵ = 0.01.
Details on the sequence mining parameters and filters chosen for the quantum optics and quantum information environments are given in appendices D.3 and D.6, respectively.
For clustering, we applied HDBSCAN with a minimum number of samples of 2 and a minimum cluster size of 5, in all cases.
For further implementation details, we refer to our open-source repository on GitHub.