Operationally meaningful representations of physical systems in neural networks

To make progress in science, we often build abstract representations of physical systems that meaningfully encode information about the systems. The representations learnt by most current machine learning techniques reflect statistical structure present in the training data; however, these methods do not allow us to specify explicit and operationally meaningful requirements on the representation. Here, we present a neural network architecture based on the notion that agents dealing with different aspects of a physical system should be able to communicate relevant information as efficiently as possible to one another. This produces representations that separate different parameters which are useful for making statements about the physical system in different experimental settings. We present examples involving both classical and quantum physics. For instance, our architecture finds a compact representation of an arbitrary two-qubit system that separates local parameters from parameters describing quantum correlations. We further show that this method can be combined with reinforcement learning to enable representation learning within interactive scenarios where agents need to explore experimental settings to identify relevant variables.


Introduction
Neural networks are among the most versatile and successful tools in machine learning [1][2][3] and have been applied to a wide variety of problems in physics (see [4][5][6] for recent reviews). Many of the earlier applications have focused on solving specific problems that are intractable analytically and for which conventional numerical methods deliver only unsatisfactory results. Conversely, neural networks may also lead to new insights into how the human brain develops physical intuition from observations [7][8][9][10][11][12][13][14].
Recently, the potential role that machine learning might play in the scientific discovery process has received increasing attention [15][16][17][18][19][20][21][22]. This direction of research is not only concerned with machine learning as a useful numerical tool for solving hard problems, but also seeks ways to establish artificial intelligence methodologies as general-purpose tools for scientific research. This is motivated from various directions: from an artificial intelligence perspective, having machines autonomously discover scientific concepts about the world is often seen as an important step towards artificial general intelligence [23]; from the perspective of science, machine learning might complement human scientific research to both speed up scientific discovery and make it less susceptible to human biases.
hendrik.poulsen-nautrup@uibk.ac.at, tmetger@ethz.ch. * These authors contributed equally to this work.
An important step in the scientific process is to convert experimental data, which can be seen as a very high-dimensional and noisy representation of a physical system, to a more succinct representation that is amenable to a theoretical treatment. For example, when we observe the trajectory of an object, the natural experiment is to record the position of the object at different times; however, our theories of kinematics do not use time series of positions as variables, but rather describe the system using quantities, or parameters, such as velocity and initial position. Concepts such as velocity are more versatile because they can be used in different ways for making predictions in many different physical settings.
When using neural networks to find such parameterisations, one encounters the limitation of standard techniques from representation learning [24][25][26], an area of machine learning devoted to problems of this type. With these standard techniques, we are typically not able to specify explicit criteria on the parameterisation, such as which aspects of a system should be stored in distinct parameters. Instead, a separation or disentanglement typically arises implicitly from the statistical distribution of the training data set. This works well for many practical problems [26]; however, for scientific applications, it is desirable that different parameters in the representation are relevant for different experiments one can actually perform on the system. Otherwise it is likely that our model reflects biases we implicitly, and likely unknowingly, had in collecting the experimental data. In the following we will call a representation that fulfills this desideratum an operationally meaningful representation.
Figure caption: We assume that different agents only deal with parts of the experimental settings to achieve their objective. This induces structure on our description of Nature. When the agents build a model of Nature, i.e., learn to parameterise their experimental settings, we want this model to reflect the structure enforced by the requirement that agents compress and communicate parameters in the most efficient way.
Naturally, formulating operationally meaningful requirements for a representation and translating them to a neural network implementation depends heavily on the specific scenario one is interested in. In this work, we consider a scenario which is particularly relevant in the context of scientific discovery. Specifically, we impose structure on the parameterisation of experimental data by assuming that different agents, which deal with different sets of questions, each only require knowledge of a subset of the parameters to successfully answer any specific question from their respective set of questions. For instance, two agents may each have to predict the movement of a charged particle in the presence of an electrostatic field: the field's strength is relevant for both agents, whereas the individual parameters of each particle, such as charge, mass, etc., are only relevant for one agent; or many agents have to make sequences of operations to answer whether (and if so, how) various phenomena, such as high-dimensional entanglement [16], can be generated in an experiment.
More generally, the criterion for imposing structure can be understood in terms of communicating agents. An ensemble of agents would like to predict the results of various experimental settings. However, only one agent A has access to reference data from the experimental setting (which is the high-dimensional full representation of the experimental setting). This agent has to identify and communicate the relevant parameters of the system to the other agents, each of whom only requires partial information to solve their question. Agent A therefore splits the parameters in such a way that the remaining agents can share them optimally, in the sense that each agent requires the smallest possible subset of parameters and that parameters required by multiple agents are shared without redundancies. We formalise this notion in Sec. 3.
In this work, we introduce a network architecture that allows us to explicitly impose the aforementioned operational criterion on the parameters used by a neural network to represent a physical system [15]. The model architecture is detailed in Sec. 4. In Sec. 5 we provide two illustrative examples of a scenario in which agents are given (a high-dimensional representation of) an experimental setting and are required to make a prediction w.r.t. a specific question. In an example from classical mechanics, the network autonomously distinguishes parameters that are only relevant to predict the behaviour of an individual particle from parameters that affect the interaction between particles. This structure arises naturally in an experiment with multiple charged particles, when different agents have to predict the motion of their charged particle in the presence or absence of the other agents' charged particles. The method is agnostic to the theory underlying an experiment and can thus also be applied to quantum mechanical experiments. We illustrate this by learning a representation of a two-qubit system that separates parameters relevant for two individual qubits from those parameters describing the quantum correlations between qubits. In fact, this parameterisation is similar to the standard, analytic representation described in Refs. [28,29].
In Sec. 6, we consider scenarios where the answer to a specific question can be described as a sequence of actions; such a sequence either does or does not achieve a specific goal, i.e., feedback about the quality of an action may be discrete and delayed. For instance, this tends to be the case for optimisation problems such as the design and control of complex systems or the development of gadgets and software solutions for different scientific and technological purposes [30]. In the context of scientific discovery, the specific goal may be to build experimental settings which bring about a specific phenomenon, e.g., entanglement [16]. In such a scenario, we may first explore the space of experimental settings and learn solutions through reinforcement learning [31] before applying the criterion of minimal communication to impose structure on the parameterisation of experimental data. Therefore, we provide a formal description of reinforcement learning environments where our architecture may capture operationally meaningful structure, and demonstrate this by means of an illustrative example.

Related work
The field of representation learning is concerned with feature detection in raw data. While, in principle, all deep neural network architectures learn some representation within their hidden layers, most work in representation learning is dedicated to defining and finding good representations [24]. A desirable feature of such representations is the interpretability of their parameters (stored in different neurons in a neural network). Standard autoencoders, for instance, are neural networks which compress data during the learning process. In the resulting representation, different parameters are often highly correlated and do not have a straightforward interpretation. A lot of work in representation learning has recently been devoted to disentangling such representations in a meaningful way (see e.g. [26,32,33,34,35]). In particular, these works introduce criteria, also referred to as priors in representation learning, by which we can disentangle representations.
β-variational autoencoders. Autoencoders are one particular architecture used in the field of representation learning, whose goal is to map a high-dimensional input vector x to a lower-dimensional latent vector z using an encoding mapping E(x) = z. For autoencoders, z should still contain all information about x, i.e., it should be possible to reconstruct the input vector x by applying a decoding function D to z. The encoder E and the decoder D can be implemented using neural networks and trained unsupervised by requiring D(E(x)) = x. β-variational autoencoders (β-VAEs) are autoencoders where the encoding is regularised in order to capture statistically independent features of the input data in separate parameters [26].
In Ref. [15] a modified β-VAE, called SciNet, was used to answer questions about a physical system. The criterion by which the latent representation is disentangled is statistical independence, as in standard β-VAE methods. In the present work, we use a similar architecture but impose an operational criterion in terms of communicating agents for the disentanglement of parameters.
Another prior that was recently proposed to disentangle a latent representation is the consciousness prior [36]. There, the author suggests disentangling abstract representations via an attention mechanism by assuming that, at any given time, only a few internal features or concepts are sufficient to make a useful statement about reality.
State Representation Learning. State representation learning (SRL) is a branch of representation learning for interactive problems [37]. For instance, in reinforcement learning [31] it can be used to capture the variation in an environment created by an agent's action [34,35,38,39]. In Ref. [34] the representation is disentangled by an independence prior which encourages that independently controllable features of the environment are stored in separate parameters. A similar approach was recently introduced in Ref. [35] where model-based and model-free reinforcement learning are combined to jointly infer a sufficient representation of the environment. The abstract representation becomes expressive by introducing representation and interpretability priors. Similarly, in Ref. [39] robotic priors are introduced to impose a structure reflecting the changes that occur in the world and in the way a robot can interact with it. As shown in Ref. [35] and [39], such requirements can lead to very natural representations in certain scenarios such as creating an abstract representation of a labyrinth or other navigation tasks.
In Ref. [40] many reinforcement learning agents with different tasks share a common representation which is being developed during training. They demonstrate that learning auxiliary tasks can help agents to improve learning of the overall objective. One important auxiliary task is given by a feature control prior where the goal is to maximise the activations of hidden neurons in an agent's neural network as they may represent task-relevant high-level features [41,42]. However, this representation is not expressive or interpretable to the human eye since there is no criterion for disentanglement.
Projective Simulation. The projective simulation (PS) model for artificial intelligence [43] is a model for agency which employs a specific form of an episodic and compositional memory to make decisions. It has found applications in various areas of science, from quantum physics [16,44,45] to robotics [46,47] and the modelling of animal behaviour [48]. Its memory consists of a network of so-called clips which can represent basic episodic experiences as well as abstract concepts. Besides their use for generalisation [49,50], these clip networks have already been used to represent abstract concepts in specific settings [17,47]. In Ref. [17], PS was used to infer the existence of unobserved variables such as mass, charge or size which make an object respond in certain experimental settings in different ways. In this context, the authors point out the significance of exploration when considering the design of experiments, and thereby adopt the notion of reinforcement learning similar to Ref. [16]. In line with previous works, we will also discuss reinforcement learning methods for the design of experimental settings. Unlike previous works, however, we provide an interpretation and formal description of decision processes which are specifically amenable to representation learning. Moreover, we employ neural network architectures to infer continuous parameters from experimental data. In contrast, PS is inherently discrete and therefore better suited to infer high-level concepts.
In this work, we propose to disentangle the latent representation of a neural network according to an operationally meaningful principle, by which agents should communicate as efficiently as possible to share the information relevant to solving their tasks. Technically, we disentangle the representation according to different questions or tasks, as described in more detail in the following section.

Formal setting
Our setting is inspired by the idea that the physically meaningful parameters are those which are useful for answering different questions about or solving different tasks related to the same physical system. For instance, we use the parameters mass m and charge q to describe a particle because there exist operationally meaningful questions about the particle in which only the mass or only the charge is relevant, and other questions for which both are required. If we stored m + q and m − q instead (in some fixed units), we would still have the same information, but to answer a question just involving the mass, we would need both parameters instead of one; in contrast, there are few, if any, operationally meaningful questions for which only m + q is relevant. Therefore, we say that m and q are operationally meaningful parameters, whereas m + q and m − q are not. Note that we assume a specific notion of experiments in this paper and will continue to do so implicitly in the following. Here we understand an experiment as a stochastic function which maps a space of input parameters onto an output space representing measurement data such that the output distribution is reproducible for fixed parameters. An experimental setting is then an instance of an experiment with specified parameters. We assume that we can sample many different experimental settings, i.e., sample many instances of the same experiment with different parameters. For example, in an experiment involving a mass, we can sample many experimental settings with different (but possibly unknown) values for this mass.
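The m and q example can be made concrete with a small worked sketch. The function names and numerical values below are hypothetical and serve only to illustrate that both parameterisations carry the same information, while a question about the mass alone reads one stored value in the (m, q) parameterisation and two in the (m + q, m − q) parameterisation.

```python
def to_sum_diff(m, q):
    """Re-encode (m, q) as (m + q, m - q); no information is lost."""
    return m + q, m - q


def mass_from_sum_diff(s, d):
    """Recovering the mass requires BOTH stored parameters."""
    return (s + d) / 2


def mass_from_direct(params):
    """In the (m, q) parameterisation a single stored parameter suffices."""
    return params[0]


m, q = 2.0, 0.5  # arbitrary illustrative values in some fixed units
s, d = to_sum_diff(m, q)
print(mass_from_direct((m, q)), mass_from_sum_diff(s, d))
```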

Communicating agents
Here, we consider the following generic setting (see Fig. 2(a)). Various agents have access to a physical system in the form of, e.g., measurement data. However, not all agents have access to the same data, and some agents need to communicate with each other in order to answer a question or solve a task within this physical system. The constraint that agents need to communicate efficiently imposes structure on the representation of the communicated data. In the simplest case, a single encoding agent $A$ makes an observation $o \in O$ on a physical system with randomly chosen unknown parameters and generates a parameterised representation $r \in R$, where $O$ is the set of possible observations and $R$ is a representational parameter space. For instance, the agent could observe a time series of particle positions and represent the velocity parameter. Other, decoding agents $B_1, \dots, B_k$ are given questions $q_1, \dots, q_k$ randomly sampled from $Q_1, \dots, Q_k$, respectively, and are required to produce an answer $a_i(o, q_i)$. For now, we assume that both the observation $o$ and the optimal answer $a_i^*(o, q_i) \in A_i$ may be obtained directly from the respective experimental setting. In Sec. 6, we consider the case where the optimal answer is not immediately apparent from an observation $o$ but may be learnt through reinforcement learning. Formally, one data sample consists of $(o, q_1, \dots, q_k, a_1^*(o, q_1), \dots, a_k^*(o, q_k))$. To generate the training data set, we collect such samples for many configurations of the unknown parameters of the physical system and many randomly chosen questions. By contrast, in Sec. 6 the training data is effectively generated by a trained reinforcement learning agent. In practice, we can represent observations, questions and answers as tuples of real numbers, and we will do so implicitly for the rest of this paper. Instead of having access to the entire observation $o$, $B_1, \dots, B_k$ only receive (part of) the encoding $r$. That is, $A$ is required to communicate part of its representation to the other agents such that they can solve their respective tasks optimally.
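To illustrate the structure of a data sample, the following minimal sketch generates training data for a toy experiment. Everything specific here is an assumption made for illustration and is not taken from the paper: the system is a particle moving at constant velocity, the observation is a short time series of positions, each question is a time, and the optimal answer is the position at that time.

```python
import random


def make_sample(rng, n_obs=10, k=2):
    """Return one sample (o, [q_1..q_k], [a*_1..a*_k]) for a toy experiment."""
    x0 = rng.uniform(-1.0, 1.0)  # hidden parameter: initial position
    v = rng.uniform(-1.0, 1.0)   # hidden parameter: velocity
    # Observation o: positions recorded at times 0, 1, ..., n_obs - 1.
    o = [x0 + v * t for t in range(n_obs)]
    # Each decoding agent B_i receives a randomly sampled question q_i (a time).
    qs = [rng.uniform(0.0, 20.0) for _ in range(k)]
    # Optimal answers a*_i(o, q_i): the position at the queried time.
    answers = [x0 + v * q for q in qs]
    return o, qs, answers


rng = random.Random(0)  # fixed seed so the data set is reproducible
dataset = [make_sample(rng) for _ in range(1000)]
```

The hidden parameters x0 and v never appear in the sample itself; they are only encoded in the observation, which is the situation the encoding agent faces.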
The values $(o, q_i, a_i^*)$ are related to our notion of experiments (see Fig. 3) in the following way. An experiment $E = (E_1, \dots, E_k)$ maps a parameter space $\Phi$ and a question space $Q_i$ to results in $A_i$, i.e., $E_i : \Phi \times Q_i \to A_i$. The parameters in $\Phi$ may be (partially) hidden or not directly observable, and we would like to learn a representation for them. Therefore, we construct a reference experiment $E_r$ which can be used to generate measurement data $o$ as a high-dimensional representation of the hidden parameters, i.e., $E_r : \Phi \to O$. This experiment is labeled the reference because it provides the data which is used, or referenced, in parts by the other agents to make predictions. Questions may be considered as additional parameters of the experiment of which we do not seek to find a representation, but which are useful for finding meaningful representations of the parameters in $\Phi$. Given the hidden parameters (encoded in $o \in O$) and question (encoded in $q_i \in Q_i$), the experiment produces some results $a_i^* \in A_i$ which may be used to evaluate the answer given by an agent $B_i$.
Figure 2: (a) In the generic communication setting that we consider, agents have access to information (e.g., observations) about the environment, which can be considered as a physical system, and can interact with part of it (e.g., by making predictions). Different agents may observe or interact with different subsystems, but might not have access to other parts of the environment. To solve certain tasks, agents may require information that can only be accessed by other agents. Therefore, the agents have to communicate. To this end, they encode their information into a representation that can be communicated efficiently to other agents, i.e., they have to find an efficient "language" to share relevant information. Arrows passing through boundaries represent interactions with the physical system in the form of data gathering or predictions. Arrows between agents suggest possible communication channels. (b) In the specific settings that we consider, an encoding agent maps an observation (given as a sample) obtained from the current experimental setting onto a latent representation, part of which has to be communicated to decoding agents. Decoding agents receive additional information specifying the question which they are required to answer; for example, the question may just define the problem setting. In our architecture, we thus view the question as coming from the environment, but in general, it could also include information from other agents. The functions $E$, $\phi_i$, $D_i$, representing encoder, filter, and decoder respectively, are each implemented as neural networks. To answer a given question, each decoder receives the part of the representation that is transmitted by its filter. The cost function is designed to minimise the error of the answer and the amount of information that is being transmitted from the encoder to decoders, which can be seen as minimising the number of parameters that have to be communicated between agents.

Learning objectives
The operational criteria or learning objectives imposing structure on a representation take the form of different losses, which are often referred to as priors in representation learning [35,36,39]. In our case, the representation is generated under two criteria:
• With a prediction loss, we impose that agents need to learn to answer their questions as accurately as possible, given (part of) the representation.
• With a communication loss, we impose that agents have to share the representation in the most data-efficient way.
In other words, the objective of the ensemble of agents $A, B_1, \dots, B_k$ is to correctly answer as many questions as possible, while also minimising the communication between $A$ and the other agents. Therefore, $A$ needs to disentangle its representation in a way that allows it to communicate the relevant parameters. More formally, we specify the encoding agent $A$ by a function $E : O \to R \equiv \mathbb{R}^l$ for some $l$ (see Fig. 2). This function can be thought of as an encoding from the high-dimensional experimental observation $o$ to a lower-dimensional vector of physically relevant parameters. In representation learning, the output of this function is called the representation. Each decoding agent $B_i$ is specified by a filter $\phi_i : \mathbb{R}^l \to \mathbb{R}^{l_i}$ and a function $D_i : \mathbb{R}^{l_i} \times Q_i \to A_i$ (see Fig. 2). The filter effectively restricts the agent's access to the representation by only transmitting a part of the representation; formally, $\phi_i(r_1, \dots, r_l) = (r_{j_1}, \dots, r_{j_{l_i}})$. We call $l_i$ the dimension $\dim(\phi_i)$ of the filter. Intuitively, one may imagine that the dimension and the indices $j_1, \dots, j_{l_i}$ of the transmitted components can be chosen by the agent. It is important that the filter is independent of the observation and question, since the transmission of parameters to agents should not depend on a particular data sample, but is instead viewed as a property of the theory that applies to all data samples equally. The function $D_i$, called a decoder, takes the transmitted part of the representation and the question and produces an answer. Ideally, agent $A$ produces a representation which allows each agent $B_i$ to answer its questions correctly while only accessing the smallest possible part of the representation.
Figure 3: Here, we illustrate the type of experiments that is used to provide data to agents. In our notion of experiments, experimental measurement results are governed by (hidden) parameters depicted as gears. When designing experiments (pictured as a complex network of gears) for the specific examples in Sec. 5, we start with reference experiments whose measurement results (data) comprise a high-dimensional encoding of the hidden parameters. This data is referenced by other agents to make predictions, hence the name. With the encoding and additional information given as a question, our architecture can produce predictions (or answers) about the results of similar experimental settings, dubbed prediction experiments. Insofar as different agents answer different questions, their prediction experiments may yield distinct outcomes. Those parts of an experiment that are associated with a (coloured) agent are depicted by the same colouring. In our setting, questions may influence the prediction experiment in various ways. For instance, we consider experiments where questions are encoded by specifying additional parameters of the experiment given to an agent as a low- or high-dimensional representation (see Sec. 5). We also study the minimal case where questions are constant and just distinguish the tasks of different agents in Sec. 6.
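The composition of encoder, filter and decoder can be sketched as follows. This is a minimal illustration under stated assumptions, not the trained architecture: the encoder and decoder are hand-written stand-ins for neural networks, and the names make_filter, agent_answer, toy_encoder and toy_decoder are hypothetical. The point is that the filter is a fixed selection of indices that does not depend on the observation or the question.

```python
def make_filter(indices):
    """Return phi_i, which transmits only the latent components in `indices`."""
    idx = tuple(indices)  # fixed once; independent of observation and question

    def phi(r):
        return tuple(r[j] for j in idx)

    phi.dim = len(idx)  # dim(phi_i), the number of transmitted parameters
    return phi


def agent_answer(encoder, phi, decoder, o, q):
    """Compute a_i(o, q_i) = D_i(phi_i(E(o)), q_i)."""
    return decoder(phi(encoder(o)), q)


def toy_encoder(o):
    """Stand-in for E: maps a position time series to (initial position, velocity)."""
    return (o[0], o[1] - o[0])


def toy_decoder(r_part, q):
    """Stand-in for D_1: displacement after time q, which needs only the velocity."""
    return r_part[0] * q


phi_1 = make_filter([1])  # agent B_1 is sent the velocity component only
answer = agent_answer(toy_encoder, phi_1, toy_decoder, [1.0, 1.5, 2.0], 4.0)
```

Because the question about displacement does not involve the initial position, a filter of dimension one suffices for this agent.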

Multiple encoding agents
Up to now, we have assumed that there exists one agent $A$ who has access to the entire system to make an observation and to communicate its representation. However, just as different decoding agents $B_i$ only deal with a part of the system, we can consider the more general scenario of having multiple encoding agents $A_1, \dots, A_j$. In this scenario, each agent $A_i$ makes different measurements on the system. For example, one agent might make a collision experiment between two particles, while another observes the trajectory of a particle in an external field. Here, only the aggregate observations of all agents $A_1, \dots, A_j$ provide sufficient information about the system required for the agents $B_1, \dots, B_k$ to make predictions. The formalisation is analogous to the previous section and we only sketch it here: we associate to each agent $A_i$ an encoder function $E_i$. The domain of the filter functions of the agents $B_1, \dots, B_k$ is now a Cartesian product of the output spaces of the encoders (i.e., the output vectors of the encoders are concatenated and used as inputs to the filters).
In the case where a physical system has an operationally natural division into $k$ interacting subsystems, a typical case would be to have the same number of encoding agents $A_1, \dots, A_k$ as decoding agents $B_1, \dots, B_k$, where both $A_i$ and $B_i$ act on the same $i$-th subsystem. Here, we expect that $A_i$ and $B_i$ are highly correlated, i.e., the filter for $B_i$ transmits almost all information from $A_i$, but less from other agents $A_j$. In this case, one can intuitively think of a single agent per subsystem $i$, that first makes an observation about that subsystem, then communicates with the other agents to account for the interaction between subsystems, and uses the information obtained from the communication to make a prediction about subsystem $i$.
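The concatenation of encoder outputs in the multi-encoder setting can be sketched in a few lines. The helper name concat_representations and the two toy encoders are hypothetical; in the actual architecture each encoder would be a neural network.

```python
def concat_representations(encoders, observations):
    """Apply each encoder E_i to its own observation o_i and concatenate the outputs."""
    r = []
    for encode, o_i in zip(encoders, observations):
        r.extend(encode(o_i))
    return tuple(r)  # this joint vector is the input to every filter


def encoder_a1(o):
    """Toy stand-in for E_1 with a one-dimensional output."""
    return (o[0],)


def encoder_a2(o):
    """Toy stand-in for E_2 with a two-dimensional output."""
    return (o[0], o[1])


joint = concat_representations([encoder_a1, encoder_a2], [[1.0], [2.0, 3.0]])
```

A filter for a decoding agent then selects indices of this joint vector, so it can draw on parameters produced by any of the encoding agents.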

Model implementation
Here, we discuss the details of the implementation and training of $E$, $\phi_i$ and $D_i$ (see Fig. 2). For brevity, we consider the case of a single encoding agent. The implementation of the multi-encoder scenario is analogous. The functions $E$, $\phi_i$, $D_i$ are each implemented as neural networks. The encoder and decoder functions of the agents can be easily implemented using fully connected deep neural networks analogously to the architecture from Ref. [15]. To be more precise, the encoder is simply a deep neural network that maps a high-dimensional input to a low-dimensional output consisting of a few so-called latent neurons. After being passed through the filter functions (which will be described in detail later), the representation is forwarded to all decoders. Additionally, each decoder receives a corresponding question vector as input. The decoder's neural network maps these to an output representing the answer.
While encoder and decoder are easy to implement, the implementation of the filter functions $\phi_i$ poses a difficulty because these essentially need to learn a binary value, "on" or "off", for each of the latent neurons. Learning such discontinuous functions is not possible with standard backpropagation-based gradient descent. Therefore, we will need to introduce a smoothed version of this problem. However, we first need to understand the measure of success, i.e., the loss function that will be minimised.

Learning objectives as loss functions
As described above, the learning objective can be expressed in terms of loss or cost functions which are to be minimised by the ensemble of agents. The overall performance of the ensemble is quantified by a weighted sum of the following terms:
• A prediction loss, comparing each answer $a_i(o, q_i)$ with the optimal answer $a_i^*(o, q_i)$, that measures how well the decoder answers the question.
• A communication loss $L_f = \sum_i \dim(\phi_i)$ that counts the total number of parameters transmitted to the agents $B_1, \dots, B_k$.
In order to minimise the total cost, the neural network corresponding to the agent ensemble is then trained on a set of samples $(o, q_1, \dots, q_k, a_1^*(o, q_1), \dots, a_k^*(o, q_k))$. As described in the previous section, this data is provided in the form of measurement data obtained from various experimental settings.
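As a minimal sketch of the weighted sum, the total cost for one sample could be computed as below. The use of a mean squared error for the prediction term, the scalar answers, and the parameter name weight are assumptions made for illustration; the text only specifies that the prediction loss measures answer quality and that the communication loss counts the transmitted parameters.

```python
def prediction_loss(answers, optimal_answers):
    """Mean squared error between answers a_i and optimal answers a*_i (MSE is an assumption)."""
    n = len(answers)
    return sum((a - a_opt) ** 2 for a, a_opt in zip(answers, optimal_answers)) / n


def communication_loss(filter_dims):
    """Ideal communication loss: the sum over agents of dim(phi_i)."""
    return sum(filter_dims)


def total_loss(answers, optimal_answers, filter_dims, weight=0.1):
    """Weighted sum of both terms; `weight` is a hypothetical hyperparameter name."""
    return prediction_loss(answers, optimal_answers) + weight * communication_loss(filter_dims)


# Two agents with scalar answers; the filters transmit 1 and 2 parameters.
loss = total_loss([1.0, 2.0], [1.0, 4.0], [1, 2], weight=0.5)
```

Note that communication_loss is integer-valued and therefore not differentiable with respect to the filter, which is why a smooth replacement is needed for training.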

Implementation of filters
Due to the difficulty of implementing a binary value function with neural networks, we need to replace the ideal cost $L_f$ by a comparable version with a smooth filter function. To this end, instead of viewing the latent layer as the deterministic output of the encoder (the generalisation to multiple decoders is immediate), we consider each latent neuron $j$ as being sampled from a normal distribution $N(\mu_j, \sigma_j)$. The sampling is performed using the reparameterisation trick [51], which allows gradients to propagate through the sampling step. The encoder outputs the expectation values $\mu_j$ for all latent neurons. The logarithms of the standard deviations $\log(\sigma_j)$ are provided by neurons, which we call selection neurons, that take no input and output a bias; the value of the bias can be modified during training using backpropagation. Using the logarithm of the standard deviation has the advantage that it can take any value, whereas the standard deviation itself is restricted to positive values. The ideal filter loss $L_f$ is replaced by a smoothed loss $\tilde{L}_f$ that decreases as the standard deviations $\sigma_j$ increase.
The intuition for this scheme is as follows: when the network chooses $\sigma_j$ to be small (where the standard deviation of $\mu_j$ over the training set is used as normalisation), the decoder will usually obtain a sample that is close to the mean $\mu_j$; this corresponds to the filter transmitting this value. In contrast, for a large value of $\sigma_j$, a sample from $N(\mu_j, \sigma_j)$ is usually far from the mean $\mu_j$; this corresponds to the filter blocking this value. The loss $\tilde{L}_f$ is minimised when many of the $\sigma_j$ are large, i.e., when the filter blocks many values.
Instead of thinking of probability distributions, one can also view this scheme as adding noise to the latent variables, with $\sigma_j$ specifying the amount of noise added to the $j$-th latent neuron. If $\sigma_j$ is large, the noise effectively hides the value of this latent neuron, so the decoder cannot make use of it.
We also note that the smoothed filter loss $\tilde{L}_f$ is in principle unbounded. However, in practice this does not present a problem since the decoder can only approximately, but not perfectly, ignore the noisy latent neurons. For sufficiently large $\sigma_j$, the noise will therefore noticeably affect the decoders' predictions, and the additional loss incurred by worse predictions dominates the reduction in $\tilde{L}_f$ obtained from larger values for $\sigma_j$.
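The noisy filter can be sketched without any deep learning framework. The sampling step below is a standard reparameterised Gaussian sample; the specific surrogate loss, the negative sum of the log standard deviations, is an assumption that matches the qualitative description (unbounded, and small when many standard deviations are large) but is not stated explicitly in the text. The normalisation by the spread of the means over the training set is omitted here.

```python
import math
import random


def sample_latent(mu, log_sigma, rng):
    """Reparameterised sample: z_j = mu_j + sigma_j * eps_j with eps_j ~ N(0, 1)."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_sigma)]


def smoothed_filter_loss(log_sigma):
    """Assumed surrogate loss: -sum_j log(sigma_j), which rewards large sigma_j."""
    return -sum(log_sigma)


rng = random.Random(0)
mu = [0.3, -1.2]         # encoder outputs for two latent neurons
log_sigma = [-6.0, 3.0]  # selection neurons: neuron 0 transmitted, neuron 1 blocked
z = sample_latent(mu, log_sigma, rng)
```

With these values, the first component of z stays very close to its mean (the noise scale is exp(-6)), whereas the second is dominated by noise of scale exp(3), so a decoder could not reliably recover the second mean from it.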
How well this method approximates a binary filter depends on the weighting of the success loss relative to the communication loss. This weight is a hyperparameter of the machine learning system.

Examples with simple systems
We demonstrate our method, both for single and multiple encoders, on two examples, one from classical mechanics and one from quantum mechanics. In all cases, the network finds a representation that complies with our operational requirements. We emphasise again that we use the term experiment to describe a function mapping input parameters onto measurement data. An experimental setting is then an instance of an experiment with specified parameters, and we assume access to a sampling method that produces experimental settings with varying parameters. In designing the following example experiments, we follow the approach in Fig. 3, specifying reference and prediction experiments.
Figure 4: (a) There are two separated decoding agents, so there is no interaction between their individual experimental setups. Each agent is required to shoot a mass m i into a hole in the presence of a fixed gravitational field. They do this by elastically colliding a projectile of fixed mass m fix with the mass m i . The distance to the hole is fixed. The agents are given the velocity with which the projectile is shot as a question. The correct answer is the angle out of the plane of the table such that, if the projectile is fired with the given velocity at this angle, the mass m i lands directly in the hole. (b) Now we consider two decoding agents that are subject to the Coulomb interaction between their charged masses. Each agent is again required to shoot a projectile at its charged mass (m i , q i ). The charged mass will move in the Coulomb field of the other agent's charge. Similar to the first situation, each agent is given a velocity v i as a question and has to predict an angle (this time in the plane of the table, i.e., gravity does not play a role) such that, if the mass is fired at this angle, it will roll into the hole while the position of the other charge stays fixed. (The experiment is then repeated with the roles of the agents reversed, i.e., the agent that first fired its mass now fixes it at its starting position, and vice versa.)

Setup
We consider the setup shown in Fig. 4: take particles with masses m 1 , m 2 and charges q 1 , q 2 , where both masses and charges are parameters that are varied between training examples. To generate the input data which is provided to the encoding agent A, we perform the following two reference experiments:
2. For each of the particles (m i , q i ), we place the particle at the origin at rest, and place a reference particle (m ref , q ref ) with fixed mass and charge at a fixed distance d 0 . Both particles are free to move. We observe a time series of positions of the particle (m i , q i ) as it moves due to the Coulomb interaction between itself and the reference particle.
Different agents are now required to answer different questions about the system in the form of a prediction experiment (cf. Fig. 3). In this context, these questions can most easily be phrased as the agents trying to win games, both involving a target hole. The initial positions of the particles and the target holes are fixed.
• Agents B 1 and B 2 each are given projectiles with a fixed mass m fix . As question input, they are given the (variable) velocity v i with which this projectile will hit m i . They can vary the angle α i in the yz-plane with which they shoot this projectile against the mass m i . After being hit, the mass will fly towards the target hole under the influence of gravity. The agent's goal is to hit the mass in precisely such a way that it lands directly in the hole, similar to a golfer attempting a lob shot that lands directly in the hole without bouncing. The prediction loss is given by the squared difference between the angle chosen by the agent and the correct angle that would have landed the mass directly in the hole; this correct angle can be determined by experiments on the system. Alternatively, one could use the minimal distance of the trajectory of the particle to the hole as a cost function.
• Similarly, agents B 3 and B 4 are given projectiles. The velocities of these projectiles are again given as a question input. The goal of the agent is to choose the angle ϕ i in the xy-plane so that, when the mass moves in the Coulomb field of the other mass, it will roll into the hole. The other mass stays fixed; the experiment is then repeated for the other agent with the roles of the moving and fixed masses reversed.
In both cases, we restrict the velocities given as questions to ones where there actually exists a (unique) angle that makes the particle land in the hole.
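To make the game of agents B 1 and B 2 concrete, the following sketch computes a "correct" lob angle under strong simplifying assumptions that are not stated in the text: the collision is head-on and perfectly elastic with the mass m i initially at rest, the mass leaves at the launch angle α i , the hole lies at the same height as the starting point, air resistance is neglected, and the steeper of the two ballistic solutions is taken as the lob shot. The value of g and all function names are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed value)

def target_speed_after_collision(v_projectile, m_fix, m_target):
    """Speed of the initially resting target after a head-on elastic collision."""
    return 2.0 * m_fix * v_projectile / (m_fix + m_target)

def lob_angle(v_projectile, m_fix, m_target, distance):
    """Steeper launch angle (radians) at which the target lands at `distance`.

    Returns None if the target is too slow to reach the hole at any angle.
    """
    u = target_speed_after_collision(v_projectile, m_fix, m_target)
    s = G * distance / u**2  # equals sin(2 * alpha) on level ground
    if s > 1.0:
        return None
    return 0.5 * (math.pi - math.asin(s))  # steep ("lob") branch

def landing_distance(u, alpha):
    """Range on level ground for launch speed u and launch angle alpha."""
    return u**2 * math.sin(2.0 * alpha) / G

# Hypothetical numbers: projectile speed 10, m_fix = 1, m_i = 2, hole at distance 3.
alpha = lob_angle(10.0, 1.0, 2.0, 3.0)
u = target_speed_after_collision(10.0, 1.0, 2.0)
```

In this simplified model the answer depends on the setting only through m i , which is in line with the requirement that these agents need the mass but not the charge. The `None` branch mirrors the restriction to velocities for which a valid angle exists.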

Results
To analyse the learnt representation, we plot the activation of the latent neurons for different examples with different (known) values of m 1 , m 2 , q 1 , q 2 against those known values. This corresponds to comparing the learnt representation to a hypothesised representation that we might already have. The plots are shown in Fig. 5. The first and second latent neurons are linear in m 1 and m 2 , respectively, and independent of the charges; the third latent neuron has an activation that resembles the function q 1 · q 2 and is independent of the masses. This means that the first and second latent neurons store the masses individually, as would be expected since the setup in Fig. 4(a) only requires individual masses and no charges. The third neuron roughly stores the product of the charges, i.e., the quantity relevant for the strength of the Coulomb interaction between the charges. This is used by the agents dealing with the setup in Fig. 4(b), where the particle's trajectory depends on the Coulomb interaction with the other particle.

Multiple encoders
One can easily adapt the above example to the multi-encoder setting described in Sec. 3.3. Instead of having a single agent A, we use two agents A 1 and A 2 , where agent A i only observes the results of the reference experiment associated with particle i. We provide detailed results in Appendix A. The main finding is that the encoding agents can no longer directly encode the product of the charges q 1 · q 2 , because each agent only has access to reference experiments involving a single charge. Instead, the representation produced by each encoding agent now stores q i individually (in addition to the mass m i as before). Hence, the additional structure imposed by splitting the encoding agent in two yields further disentanglement of the physical parameters of the system, allowing us to identify the individual charges rather than merely their product.

Setup
We consider a two-qubit system, i.e., a four-dimensional quantum system. Finding a representation of such a system from measurement data is a non-trivial task called quantum state tomography [52]. In our operational setting, an agent A has access to a reference experiment consisting of two devices, where the first device creates (many copies of) a quantum system in a state ρ, i.e., a positive semi-definite 4 × 4 matrix with unit trace, which depends on the parameters of the device. The second device can perform binary measurements (with output "zero" or "one"), described by projections |ψ⟩⟨ψ|, where |ψ⟩ is a pure state of two qubits.³ For the reference experiments, we fix 75 randomly chosen binary measurements |ψ 1 ⟩⟨ψ 1 |, ..., |ψ 75 ⟩⟨ψ 75 |. For a given state ρ, the input to the encoder A then consists of the probabilities to get "one" for each of the fixed 75 measurements, respectively. The state ρ is varied between training examples. Three agents B 1 , B 2 and B 3 are now required to answer different questions about prediction experiments with the two-qubit system:
• Agents B 1 and B 2 are asked questions about measurement output probabilities on the first and second qubit, respectively.
• Agent B 3 is asked to predict joint measurement output probabilities on both qubits.
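The encoder input described above can be sketched in a few lines of plain Python. The probability of obtaining "one" for a projection |ψ⟩⟨ψ| on a state ρ is the Born-rule value ⟨ψ|ρ|ψ⟩. Everything else in the sketch is an assumption made for illustration: the function names, the way random pure states are drawn (normalised Gaussian amplitudes), and the two example states.

```python
import math
import random

def projector_probability(rho, psi):
    """Born-rule probability <psi|rho|psi> of outcome "one" for |psi><psi|."""
    n = len(psi)
    value = sum(
        complex(psi[i]).conjugate() * rho[i][j] * psi[j]
        for i in range(n)
        for j in range(n)
    )
    return value.real

def density_matrix(psi):
    """Density matrix |psi><psi| of a pure state given as a list of amplitudes."""
    return [[a * complex(b).conjugate() for b in psi] for a in psi]

def random_pure_state(dim, rng):
    """Normalised random state vector (Gaussian amplitudes, an illustrative choice)."""
    amps = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(dim)]
    norm = math.sqrt(sum(abs(a) ** 2 for a in amps))
    return [a / norm for a in amps]

rng = random.Random(0)
# The 75 reference measurements are drawn once and then kept fixed.
measurements = [random_pure_state(4, rng) for _ in range(75)]

def encoder_input(rho):
    """Vector of 75 probabilities that would be fed to the encoder A."""
    return [projector_probability(rho, psi) for psi in measurements]

bell = [1 / math.sqrt(2), 0.0, 0.0, 1 / math.sqrt(2)]
rho_bell = density_matrix(bell)
rho_mixed = [[0.25 if i == j else 0.0 for j in range(4)] for i in range(4)]
```

For the maximally mixed state every entry of the input vector equals 1/4, whereas for the entangled state `rho_bell` the entries vary with the measurement.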

Results
We find that three latent neurons are used for each of the local qubit representations as required by agents B 1 and B 2 . These local representations store combinations of the x-, y- and z-components of the Bloch sphere representation ρ = 1/2 (1 + x σ x + y σ y + z σ z ) of a single qubit (see Fig. 6), where σ x , σ y , σ z denote the Pauli matrices. In general, a two-qubit mixed state ρ is described by 15 parameters, since a Hermitian 4 × 4 matrix is described by 16 parameters, and one parameter is determined by the others due to the unit trace condition. Indeed, we find that the agent who has to predict the outcomes of the joint measurements accesses 15 latent neurons, including the ones storing the two local representations. Having chosen a network structure with 20 latent neurons, the 5 superfluous neurons are successfully recognised and ignored by all of the agents B 1 , B 2 and B 3 . These numbers correspond to the numbers found in the analytical approach in Ref. [28].
Figure 5: Results for the classical mechanics example with charged masses. The used network has 3 latent neurons and each column of plots corresponds to one latent neuron. For the first row we generated input data with fixed charges q 1 = q 2 = 0.5 and variable masses m 1 , m 2 in order to plot the activation of latent neurons as a function of the masses. We observe that latent neurons 1 and 2 store the masses m 1 and m 2 , respectively, while latent neuron 3 remains constant. In the second row, we plot the neurons' activation in response to q 1 , q 2 with fixed masses m 1 = m 2 = 5. Here, the third latent neuron approximately stores q 1 · q 2 , which is the relevant quantity for the Coulomb interaction, while the other neurons are independent of the charges. The third row shows which decoder receives information from the respective latent neuron. Roughly, the y-axis quantifies how much information of the latent neuron is transmitted by the 4 filters to the associated decoder as a function of the training epoch. Positive values mean that the filter does not transmit any information. Decoders 1 and 2 perform non-interaction experiments with particles (m 1 , q 1 ) and (m 2 , q 2 ), respectively. Decoders 3 and 4 perform the corresponding interaction experiments. As expected, we observe that the information about m 1 (latent neuron 1) is received by decoders 1 and 3 and the information about m 2 (latent neuron 2) is used by decoders 2 and 4. Since decoders 3 and 4 answer questions about interaction experiments, the product of charges (latent neuron 3) is received only by them (the green line of decoder 3 in the last plot is hidden below the red one).

Reinforcement learning
So far, we have considered scenarios where agents make predictions about specific experimental settings and disentangle a latent representation by answering various questions. There, we understood answering different questions as making predictions about different aspects of a subsystem. Instead, we could have understood answers as sequences of actions that achieve a specific goal. For example, such a (delayed) goal may arise when building experimental settings that bring about a specific phenomenon, or more generally when designing or controlling complex systems. In particular, we may view a prediction as a one-step sequence.
In the case of predictions, it is easy to evaluate the quality of a prediction, since we are predicting quantities whose actual value we can directly observe in Nature. In contrast, the correct sequences of actions may not be easily accessible from a given experimental setting: upon taking a first action, we do not yet know whether this was a good or bad action, i.e., whether it is part of a "correct" sequence of actions or not. Instead, we might only receive a few, sparsely distributed, discrete rewards while taking actions. In the typical case, there is only a binary reward at the end of a sequence of actions, specifying whether we reached the desired goal or not. Even in a setting where a single action suffices to reach a goal, such a binary reward would prevent us from defining a useful answer loss in the same manner as before. To see this, consider the toy example in Fig. 4a again: the agent had to choose an angle α i , given a (representation of the) setting, specified by the parameters (m fix , m 1 , q 1 , m 2 , q 2 ) and a question v i , in order to shoot the particle into the hole. We assumed that we can evaluate the "quality" of the angle chosen by the agent by comparing it to the optimal angle (or equivalently measuring the distance between the agent's shot and the hole). If we instead only have access to a binary reward specifying whether or not the agent successfully hit the (finite-sized) hole, we cannot define a smooth answer loss, which is required for training a neural network.
Figure 6: Results for the quantum mechanics example with two-qubit states. We consider a quantum-mechanical system of two qubits. An encoder A maps tomographic data of a two-qubit state to a representation of the state. Three agents B 1 , B 2 and B 3 are asked questions about the measurement output probabilities on the two-qubit system, where a question is given as the parameterisation of a measurement. Agents B 1 and B 2 are asked to predict measurement outcome probabilities on the first and second qubit, respectively. The third agent B 3 is tasked to predict measurement probabilities for arbitrary measurements on the full two-qubit system. Starting with 20 available latent neurons, we find that only 15 latent neurons are used to store the parameters required to answer the questions of all agents B 1 , B 2 and B 3 . Agent B 3 requires access to all parameters, while agents B 1 and B 2 need only access to two disjoint sets of three parameters, encoded in latent neurons 3, 4, 13 and 9, 10, 17, respectively. The plots show the activation values for these latent neurons in response to changes in the local degrees of freedom of each qubit, with the bottom axes of the plots denoting the components of the reduced one-qubit state ρ = 1/2 (1 + x σ x + y σ y + z σ z ) on either qubit 1 or 2.
The problems of discrete and of delayed feedback from the environment, i.e., the reward, can both be solved by viewing the situation as a reinforcement learning environment: given a representation of the setting (described by the masses and charges) and a question (a velocity), the agent can take different actions (corresponding to different angles at which the mass is shot) and receives a binary reward if the mass lands in the hole. Therefore, we can employ reinforcement learning techniques and learn the optimal answer.
In reinforcement learning [31], an agent learns to choose actions that maximise its expected, cumulative, future, discounted reward. In the context of our toy example, we would expect a trained agent to always choose the optimal angle. Hence, predicting the behaviour of a trained agent would be equivalent to predicting the optimal answer and would impose the same structure on the parameterisation. In this example, the optimal solution consists of a single choice. In a more complex setting, it might not be possible to perform a (literal and metaphorical) hole-in-one. Generally, an optimal answer may require sequences of (discrete or continuous) actions, as is the case, for example, in most control scenarios. In the settings we henceforth consider, questions might no longer be parameterised or given to the agent at all. That is, the question may be constant and just label the task that the agent has to solve.
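As a small worked illustration of the cumulative discounted reward mentioned above, the following sketch evaluates the standard discounted return, the sum over time steps t of γ^t r t , for a binary reward that arrives only at the end of a sequence. The function name and the discount factor 0.9 are illustrative choices, not values taken from the experiments.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated from the last step backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A binary reward received at the end of a one-step or a three-step sequence.
one_step = discounted_return([1.0], 0.9)
three_steps = discounted_return([0.0, 0.0, 1.0], 0.9)
```

Because the reward is discounted by one factor of γ per step, a shorter successful sequence yields a larger return, which is why a trained agent is expected to reach its goal with as few actions as possible.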
In this section, we impose structure on the parameterisation of an experimental setting by assuming that different agents only require a subset of parameters to take a successful sequence of actions given their respective goals. To this end, we explain how experimental settings may be understood in terms of instances of a reinforcement learning environment and demonstrate that our architecture is able to generate an operationally meaningful representation of a modified standard reinforcement learning environment by predicting the behaviour of trained agents.
Moreover, in Appendix C, we lay out the details of the algorithm that allows us to generate and disentangle the parameterisation of a reinforcement learning environment given various reinforcement learning agents trained on different tasks within the same environment. There, we also prove that this algorithm produces agents which are at least as good as the trained agents while only observing part of the disentangled abstract representation. The detailed architecture used for learning is described in Appendix D and combines methods from GPU-accelerated actor-critic architectures [53] and deep energy-based models [54] for projective simulation [43].

Experiments as reinforcement learning environments
In Ref. [16] the design of experimental settings has been framed in terms of reinforcement learning [31] and here we formulate a similar setting: an agent interacts with experimental settings to achieve certain results. At each step the agent observes the current measurement data and/or setting and is asked to take an action regarding the current setting. This action may for instance affect the parameters of an experimental setting and hence might change the obtained measurement data. The measurement results are subsequently evaluated and the agent might receive a reward if the results are identified as "successful". The correspondence between experiments as described in this section and reinforcement learning environments can be understood as follows (cf. Fig. 7a). An experimental setting is interpreted as the current, internal state of an environment. The measurement data then corresponds to the observation received from the environment. The agent performs an action according to the current observation and its question. Actions may affect the internal state of the experimental setting. For instance, the experimental parameters describing the setting can be adjusted or chosen by an agent through actions. The reward function, which takes the current measurement data as input, describes the objective that is to be achieved by an agent.
Since the same experiment can serve more than one purpose, we can have many agents interact with the same experimental setting to achieve different results. In fact, we can expect most experiments to be highly complex and have many applications. For instance, photonic experiments have a plethora of applications [55] and various experimental and theoretical gadgets have been developed with these tools for different tasks [56][57][58]. In this context, we may task various agents to develop gadgets for different tasks. At first, we assume that all reinforcement learning agents have access to the entire measurement data. Once they have learnt to solve their respective tasks, we can employ our architecture from the previous section to predict each agent's behaviour. Effectively, we can then factorise the representation of the measurement data by imposing that only a minimal amount of information be required to predict the behaviour of each trained reinforcement learning agent. That is, we interpret the space of possible results in an experiment as a high-dimensional manifold. When solving a given task, however, an agent may only need to observe a submanifold, which we want to parameterise.
Due to the close resemblance to reinforcement learning, we consider a standard problem in reinforcement learning in the following and demonstrate that our architecture is able to generate an operationally meaningful representation of the environment. More formally, we consider partially-observable Markov decision processes [59] (POMDP). Given the stationary policy of a trained agent, we impose structure on the observation and action space of the POMDP by discarding observations and actions which are rarely encountered. This structure defines the submanifold which we attempt to parameterise with our architecture. A detailed description of these environments is provided in Appendix B.

Setup
Figure 7: (a) An agent can also interact with an experimental setting (pictured as a complex network of gears) to answer its question by, e.g., adjusting some control parameters (represented as red gears). It receives perceptual information in the form of measurement data, which may also have been analysed to provide an additional assessment of the current setting. (b) Sub-grid world environment. In this modified, standard reinforcement learning environment, agents are required to find a reward in a 3D grid world. Different agents are assigned different planes in which their respective rewards are located. Agents observe their position in the 3D grid world and can move along any of the three spatial dimensions. An agent receives a reward once it has found the X in the grid. Then, the agent is reset to an arbitrary position and the reward is moved to a fixed position in a plane intersecting the agent's initial position.
Here, we consider the simplest version of a task that is defined on a high-dimensional manifold while the behaviour of a trained agent may become restricted to a submanifold. Consider a simple grid world task [31] where all agents can move freely in a three-dimensional space whereas only a subspace is relevant to finding their respective rewards (see Fig. 7b). Despite the apparent simplicity of this task, actual experimental settings may be understood as navigation tasks in complicated mazes [16]. This reinforcement learning environment can be phrased as a simple game.
• Three reinforcement learning agents are positioned randomly within a discrete 12×12×12 grid world.
• The rewards for the agents are located in an (x, y)-, (y, z)- and (x, z)-plane relative to their respective initial positions. The locations of the rewards in their respective planes are fixed to (6, 11), (11, 6) and (6, 6).
• The agents observe their position in the grid, but not the grid itself nor the reward.
• The agents can move freely along all three spatial dimensions but cannot move outside the grid.
• An agent receives a reward if it can find the rewarded site within 400 steps. Otherwise, it is reset to a random position and the reward is re-positioned appropriately in the corresponding plane.
Generally, in reinforcement learning the goal is to maximise the expected future reward. In this case, this requires an agent to minimise the number of steps until a reward is encountered. Therefore, the optimal policy of an agent is to move on the shortest path towards the position of the reward within the assigned plane. Clearly, to predict the behaviour of an optimal agent, we require only knowledge of its position in the associated plane. We refer to Appendix C for a concise protocol to predict behaviour of a reinforcement learning agent. A detailed description of the architecture can be found in Appendix D.
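The game and the shortest-path policy described above can be sketched as follows. This is a simplified stand-in, not the environment from the accompanying repository: the class and function names, the zero-based coordinates 0 to 11, the action encoding as (axis, step) pairs and the hand-written policy are all assumptions. Note that the policy reads only the two coordinates of the assigned plane.

```python
import random

SIZE = 12  # grid extends from 0 to 11 along each axis (assumed indexing)

class SubGridWorld:
    """3D grid world whose reward lies in a plane through the start position."""

    def __init__(self, plane_axes, target, seed=0):
        self.plane_axes = plane_axes  # e.g. (0, 1) for the (x, y)-plane
        self.target = target          # fixed reward coordinates in that plane
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.position = [self.rng.randrange(SIZE) for _ in range(3)]
        # The reward shares the start position's coordinate outside the plane.
        self.reward_site = list(self.position)
        for axis, value in zip(self.plane_axes, self.target):
            self.reward_site[axis] = value
        return tuple(self.position)

    def step(self, action):
        axis, delta = action
        # Moves that would leave the grid are clipped to the boundary.
        self.position[axis] = min(SIZE - 1, max(0, self.position[axis] + delta))
        done = self.position == self.reward_site
        return tuple(self.position), (1.0 if done else 0.0), done

def shortest_path_action(observation, plane_axes, target):
    """Move one step towards the reward, using only the in-plane coordinates."""
    for axis, goal in zip(plane_axes, target):
        if observation[axis] != goal:
            return (axis, 1 if goal > observation[axis] else -1)
    return None  # already at the reward site

def run_episode(env, max_steps=400):
    """Follow the shortest-path policy and return (start position, steps taken)."""
    obs = env.reset()
    start, steps = obs, 0
    while steps < max_steps:
        action = shortest_path_action(obs, env.plane_axes, env.target)
        if action is None:
            break
        obs, _, done = env.step(action)
        steps += 1
        if done:
            break
    return start, steps

# The three agents with their planes and reward locations from the list above.
agents = [((0, 1), (6, 11)), ((1, 2), (11, 6)), ((0, 2), (6, 6))]
```

Under these assumptions the number of steps equals the in-plane Manhattan distance between start and reward, independently of the third coordinate.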

Results
The optimal policy of an agent is to move on the shortest path towards the position of the reward within its assigned plane. To predict the behaviour of an optimal agent, we require only knowledge of its position in the associated plane. Hence, the information about the coordinates should be separated such that the different agents have access to (x, y), (y, z) and (x, z), respectively. Using the minimal number of parameters, this is only possible if the encoding agent A encodes the x, y, z coordinates of the agents B 1 , B 2 and B 3 and communicates their respective position in the plane.⁴ We verify this by comparing the learnt representation to a hypothesised representation. For instance, we can test whether certain neurons respond to certain features in the experimental setting, i.e., reinforcement learning environment. Indeed, it can be seen from Fig. 8 that the neurons of the latent layer only respond separately to changes in the x, y or z position of an agent, respectively. Note that the encoding agent uses a nonlinear encoding of the x- and z-parameters. Interestingly, this reflects the symmetries in the problem: the reward is located at position x = z = 6 whenever x or z are relevant coordinates for an agent, whereas for the y-coordinate, the reward is located at position 11. The encoding used by the network in this example suggests that an encoding of discrete bounded parameters may carry additional information about the hidden reward function, which may eventually help to improve our understanding of the underlying theory.
Figure 8: Results for the reinforcement learning example. We consider a 12 × 12 × 12 3D grid world. The used network has 3 latent neurons and each column of plots corresponds to one latent neuron. For the first and second row we generated input data in which the agent's position is varied along two axes and fixed to 6 in the remaining dimension. The latent neuron activation is plotted as a function of the agent's position. We observe that the latent neurons 1, 2 and 3 respond to changes in the x-, y- and z-position, respectively. The third row shows which decoder receives information from each latent neuron. Roughly, the y-axis quantifies how much of the information in the latent neuron is transmitted by the 3 filters to the associated decoder as a function of the training episode. Positive values mean that the filter does not transmit any information. Decoder 1 has to make a prediction about the performance of a trained reinforcement learning agent whose goal is located within an (x, y)-plane relative to its starting position. We observe that decoder 1 indeed only receives information about the agent's x- and y-position, i.e., latent variables 1 and 2. Similarly, predictions made by decoders 2 and 3 only require knowledge of the agents' (y, z)- and (x, z)-position, respectively, which is confirmed by the selection neuron activations (the blue line of decoder 1 in the second plot is hidden behind the orange one).

Conclusion
Machine learning is rapidly developing into the newest tool in the physicists' toolbox [60]. In this context, neural networks have become one of the most versatile and successful methods [2,3]. However, deep neural networks, while performing very well on a variety of tasks, often lack interpretability [61]. Therefore, representation learning, and in particular methods for learning interpretable representations, have recently received increased attention [17,26,34,35,36,39]. In the scientific process in particular, representations of physical systems play a central role. To this end, we have developed a neural network architecture that can generate operationally meaningful representations within experimental settings. Roughly, we call a representation operationally meaningful if it can be shared efficiently between various agents that have different goals. We have demonstrated our methods for small toy examples in classical and quantum mechanics. Moreover, we have also considered cases where the experimental process may be framed as an interactive reinforcement learning scenario [16]. Our architecture also works in such a setting and generates representations which are physically meaningful and relatively easy to interpret.
In this work, we have interpreted the learnt representation by comparing it to some known or hypothesised representation. Instead, we could also seek to automate this process by employing unsupervised learning techniques that categorise experimental data by a metric defined by the response of different latent neurons. For the toy examples that we considered here, the learnt representation is small and simple enough to be interpretable by hand. However, for more complex problems, additional methods for making the representation more interpretable may be required. For example, instead of using a single layer of latent neurons to store the parameters, recent work has shown the potential of semantically constrained graphs for this task [62]. We expect that these methods can be integrated into our architecture to produce interpretable and meaningful representations even for highly complex latent spaces.
While we used an asynchronous, deep energy-based projective simulation model for reinforcement learning, our method for representation learning within reinforcement learning environments is independent of the exact reinforcement learning model and can be combined with other state-of-the-art techniques such as asynchronous, advantage actor-critic (A3C) methods [63]. In fact, it may even be applied in settings with auxiliary tasks [40] to develop meaningful representations.

Source code and implementation details
The source code, as well as details of the network structure and training process, including parameters, is available at https://github.com/tonymetger/communicating_scinet (for the first examples) and https://github.com/HendrikPN/reinforced_scinet (for the reinforcement learning part). The networks were implemented using the TensorFlow [64] and PyTorch [65] libraries, respectively.

Contributions
HPN, TM and RI contributed equally to the initial development of the project and composed the manuscript. HPN and TM performed the numerical work. SJ and LMT contributed to the theoretical and numerical development of the reinforcement learning part. HJB and RR initialised and supervised the project. All authors have discussed the results and contributed to the conceptual development of the project.
In this Section, we provide details about the representation learnt by a neural network with two encoders for the example involving charged masses introduced in Sec. 5.1. The setup is the same as that in Section 5.1, with the only difference being that we now use two encoders (the number of decoders and the predictions they are asked to make remain the same). Accordingly, we split the input into two parts: the measurement data from the reference experiments involving particle 1 are used as input for encoder 1, and the data for particle 2 are used as input for encoder 2. Each encoder has to produce a representation of its input. We stress that the two encoders are separated and have no access to any information about the input of the other encoder. The representations of the two encoders are then concatenated and treated like in the single-encoder setup; that is, for each decoder, a filter is applied to the concatenated representation and the filtered representation is used as input for the decoder.
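The data flow just described, separate encoders, concatenation, and one filter per decoder, can be sketched as follows. The sketch is illustrative only: the encoder outputs are hard-coded hypothetical numbers rather than network outputs, the filter is modelled as additive Gaussian noise with a per-neuron log standard deviation, and the ordering of the latent neurons (charge and mass of particle 1, then mass and charge of particle 2) is chosen to mirror the result reported for this setup.

```python
import math
import random

def concatenate(representations):
    """Join the separate encoder outputs into one latent vector."""
    return [value for rep in representations for value in rep]

def apply_filter(latent, log_sigma, rng):
    """Add Gaussian noise with standard deviation exp(log_sigma_j) to neuron j."""
    return [z + math.exp(ls) * rng.gauss(0.0, 1.0) for z, ls in zip(latent, log_sigma)]

rng = random.Random(1)
rep_encoder_1 = [0.8, -0.1]  # hypothetical output of encoder 1: (q_1, m_1)
rep_encoder_2 = [0.4, 1.5]   # hypothetical output of encoder 2: (m_2, q_2)
latent = concatenate([rep_encoder_1, rep_encoder_2])

# Hypothetical selection-neuron values: very negative means "transmit".
filters = {
    "decoder_1": [5.0, -20.0, 5.0, 5.0],      # receives m_1 only
    "decoder_3": [-20.0, -20.0, 5.0, -20.0],  # receives q_1, m_1 and q_2
}
decoder_inputs = {name: apply_filter(latent, ls, rng) for name, ls in filters.items()}
```

Each decoder thus sees the full concatenated vector, but only the neurons its filter transmits arrive essentially undistorted.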
The results for this case are shown in Fig. 9. Comparing this result with the single-encoder case in the main text, we observe that here, the charges q 1 and q 2 are stored individually in the latent representation, whereas the single encoder stored the product q 1 · q 2 . This is because, even though the decoders still only require the product q 1 · q 2 , no single encoder has sufficient information to output this product: the inputs of encoders 1 and 2 only contain information about the individual charges q 1 and q 2 , respectively, but not their product. Hence, the additional structure imposed by splitting the input among two encoders yields a representation with more structure, i.e., with the two charges stored separately.
Figure 9: Results for the example with charged masses using two encoders. The used network has 4 latent neurons and each column of plots corresponds to one latent neuron. For an explanation of how these plots are generated, see the caption of Fig. 5. We observe that latent neurons 2 and 3 store the masses m 1 and m 2 , respectively, while latent neurons 1 and 4 are independent of the mass. Latent neurons 1 and 4 store (a monotonic function of) the charges q 1 and q 2 , respectively, and are independent of m 1 and m 2 . The third row shows that the charges q 1 and q 2 are only transmitted to decoders 3 and 4, which are asked to make predictions about interaction experiments (the blue line of decoder 1 and the green line of decoder 3 are hidden under the orange and red lines, respectively, in both of these plots). The mass m 1 , stored in latent neuron 2, is transmitted to decoders 1 and 3, which are the two decoders that make predictions about particle 1. Analogously, m 2 is transmitted to decoders 2 and 4, which make predictions about particle 2.

B Reinforcement learning environments for representation learning
In this appendix, we give a formal description of the reinforcement learning environments that we consider for representation learning. As we will see, the sub-grid world example in the main text is a simple instance of this class of environments. In general, we consider a reinforcement learning problem where the environment can be described as a Partially Observable Markov Decision Process (POMDP) [59], i.e., an MDP in which the agent does not observe the full state of the environment. We work with an observation space $O = \{o_1, \dots, o_N\}$, an action space $A = \{a_1, \dots, a_M\}$ and a discount factor $\gamma \in [0, 1)$. This choice of environment does not reflect our specific choice of learning algorithm used to train the agent, as the latter does not construct the so-called belief states that are commonly required to learn optimal policies in a POMDP. Rather, we want to show that our approach is applicable to slightly more general environments than Markov Decision Processes (MDPs), for which the learning algorithms we use are proven to converge to optimal policies in the limit of infinitely many interactions with the environment [31,66]. The generalisation to POMDPs still preserves the "Markovianity" of the environments and allows us to consider only stationary (but not necessarily deterministic) policies $\pi(a|o)$, associated with stationary expected returns $R_\pi(o)$. Now consider an agent that exhibits some non-random behaviour in this environment, characterised by a larger expected return than that of a completely random policy. Such a stationary policy may restrict the observation-action space $O \times A$ to a subset $(O \times A)|_{\pi(a|o) \geq 1/|A|}$ of observations and actions likely to be experienced by the agent, depending on its learnt policy $\pi$ and the environment dynamics.
This notation indicates that, for any given observation, we discard actions that the agent takes with a probability lower than that of a uniformly random choice (i.e., lower than $1/|A|$); keeping only the remaining actions reflects that the agent's policy has learnt to favour or disfavour certain actions. In general, discarding actions also restricts the observation space. The subset $(O \times A)|_{\pi(a|o) \geq 1/|A|}$, along with the POMDP dynamics, describes a new environment. For simplicity, we assume that the restricted environment can be described by an MDP. This is trivially the case if the original environment is itself an MDP, and it is also the case for the sub-grid world environment discussed in the main text. The MDP inherits the discount factor $\gamma \in [0, 1)$ of the original POMDP, which allows us to consider, w.l.o.g., finite-horizon MDPs (see footnote 5), i.e., MDPs with finite episode lengths (here, we set the maximum length to $3 l_{\max}$). A conceptual view of this POMDP restricted by policies is provided in Fig. 10.
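As an illustration of the restriction $(O \times A)|_{\pi(a|o) \geq 1/|A|}$, the following minimal sketch computes which observation-action pairs survive for a small tabular policy. The policy table and the function name `restricted_subset` are hypothetical, and the sketch does not model the additional effect that discarding actions may make some observations unreachable.

```python
import numpy as np

def restricted_subset(policy):
    """Return a boolean mask of pairs (o, a) with pi(a|o) >= 1/|A|.

    policy: array of shape (num_observations, num_actions); each row sums to 1.
    """
    policy = np.asarray(policy, dtype=float)
    num_actions = policy.shape[1]
    # Small tolerance so that exactly uniform rows are kept despite rounding.
    return policy >= 1.0 / num_actions - 1e-12

# Hypothetical policy over 3 observations and 4 actions.
policy = np.array([
    [0.70, 0.10, 0.10, 0.10],  # strongly prefers action 0
    [0.25, 0.25, 0.25, 0.25],  # uniformly random: every action is kept
    [0.05, 0.45, 0.45, 0.05],  # prefers actions 1 and 2
])

mask = restricted_subset(policy)
kept_pairs = [(int(o), int(a)) for o, a in zip(*np.nonzero(mask))]
```

For this table, observation 0 keeps only action 0, observation 1 keeps all four actions, and observation 2 keeps actions 1 and 2.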

C Representation learning in reinforcement learning environments
In our approach to factorising abstract representations of reinforcement learning agents, we assume that an agent's policy can impose structure on an environment (as described in Appendix B) and we want this structure to be reflected in its latent representation. Therefore, decoders need to predict the behaviour of a reinforcement learning agent while requiring minimal knowledge of the latent representation. However, we still lack a definition of what it means for a decoder to predict the behaviour of an agent. Here, we consider decoders predicting the expected rewards for these agents given the representation communicated by the encoder. Later, we show that this is enough to produce a policy which is at least as good as the policy of the reinforcement learning agent. To be precise, each decoder attempts to learn the expected return $R_\pi(o, a)$ given an observation-action pair $(o, a) \in (O \times A)|_{\pi(a|o) \geq 1/|A|}$ under the policy $\pi$ of an agent. To observation-action pairs outside the restricted subset we assign the value 0. The input space of the decoder and the restriction to the subset are illustrated in Fig. 10. In fact, decoders not only learn to predict $R$ for a single action but for a sequence of actions $\{a^{(1)}, \dots, a^{(l)}\}$ of length $l \geq 1$. This can help stabilise the latent representation of environments with small action spaces and simple reward functions. In practice, however, $l = 1$ is sufficient to obtain a proper representation. In the same way, we can help to stabilise the latent representation by forcing an additional decoder to reconstruct the input from the latent representation. For brevity, we write $\{a^{(i)}\}_l$ for sequences of actions of length $l$.
Footnote 5: An infinite-horizon MDP with discount factor $\gamma \in [0, 1)$ can be $\varepsilon$-approximated by a finite-horizon MDP with horizon $l_{\max} = \log_\gamma(\dots)$.
The method described in this appendix allows us to pick a number of reinforcement learning agents that have learnt to solve various problems on a specific kind of reinforcement learning environment (see Appendix B) and parameterise the subspaces relevant for solving their respective tasks. Specifically, the procedure splits into three parts:
(i) Train reinforcement learning agents.
(ii) Generate training data for representation learning from reinforcement learning agents (see Appendix C.1).
(iii) Train encoders with decoders on training data such that they can reproduce (w.r.t. performance) the policy of the reinforcement learning agents (see Appendix C.2).
The purpose of this appendix is to prove that the trained decoders contain enough information to derive policies that perform as well as the ones learnt by their associated agents. Only if this is the case can we claim that the structure imposed by the decoder reflects the structure imposed on the environment by an agent's policy. To that end, we start by (ii) introducing the method to generate the training data, followed by (iii) a construction of a policy from a trained decoder with given performance bounds.

C.1 Training data generation
The decoders are trained to predict the return values $R_\pi(o, \{a^{(i)}\}_l)$ for observations $o$ and sequences of actions $\{a^{(i)}\}_l$ of arbitrary length $l \leq l_{\max}$, given a policy $\pi$. The training data is then generated as follows (see Figure 11):
1. Sample two numbers $m, l$ uniformly at random from $\{1, \dots, l_{\max}\}$.
Note that this algorithm does not require any additional control over the environment beyond initialisation and performing actions. That is, the training data can be generated on-line while interacting with the environment.
In the case of a deterministic MDP and policy, one iteration of this algorithm yields the exact values of $R_\pi(o, \{a^{(i)}\}_l)$. In the case of a stochastic MDP or policy, one instead obtains an unbiased estimate of these values, which fluctuates because of the stochasticity of the environment dynamics and the policy. Repeating the algorithm and averaging the resulting estimates decreases the estimation error. We neglect this estimation error in the next section.
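The effect of averaging repeated estimates can be illustrated with a toy example. The sketch below is not the data-generation algorithm of this appendix: it assumes a hypothetical environment in which every step yields a reward of 1 with probability `p_reward` and 0 otherwise, independently of the actions, and it assumes that discounting starts at $\gamma^0$. Under these assumptions the exact expected return is known, so the averaged estimate can be compared against it.

```python
import numpy as np

def sample_return(rng, gamma, horizon, p_reward):
    """One rollout: the discounted sum of random 0/1 rewards (unbiased estimate)."""
    rewards = rng.random(horizon) < p_reward
    discounts = gamma ** np.arange(horizon)
    return float(np.sum(discounts * rewards))

def averaged_return(rng, num_rollouts, gamma=0.9, horizon=10, p_reward=0.5):
    """Average several independent single-rollout estimates."""
    samples = [sample_return(rng, gamma, horizon, p_reward) for _ in range(num_rollouts)]
    return float(np.mean(samples))

rng = np.random.default_rng(1)

# Exact expected return for the toy environment: p * sum_t gamma^t.
exact = 0.5 * float(np.sum(0.9 ** np.arange(10)))

single_estimate = sample_return(rng, gamma=0.9, horizon=10, p_reward=0.5)
averaged_estimate = averaged_return(rng, num_rollouts=20000)
```

A single rollout can deviate substantially from `exact`, whereas the average over many rollouts concentrates around it, with a standard error that shrinks as the inverse square root of the number of rollouts.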
The collected tuples are used to train the encoder and decoder through the answer loss $L_a$ as discussed in the main text. In practice, short action sequences are sufficient to factorise the abstract representation of the trained agents. In the example of the main text, $l = 1$ was used. We kept the general description of the return function with arbitrary sequence lengths as a possible extension for more stable factorisations.

C.2 Reinforcement learning policy from trained decoders
Let us call $R_{\mathrm{NN}}$ the function learnt by the decoder. We prove that a policy $\pi'$ satisfying $R_{\pi'}(o^{(0)}) \geq R_{\pi}(o^{(0)})\ \forall o^{(0)}$ in the MDP can be constructed from the decoder if it was trained with a certain loss $\varepsilon$.
Theorem 1. Given a POMDP with observation-action space $O \times A$ and a policy $\pi$ that restricts the POMDP into an MDP with observation-action space $(O \times A)|_{\pi(a|o) \geq 1/|A|}$, there exists a policy $\pi'$ that satisfies $R_{\pi'}(o^{(0)}) \geq R_{\pi}(o^{(0)})\ \forall o^{(0)}$ in the MDP and that can be derived from a function which is $\varepsilon$-close (in terms of a mean squared error), with $\varepsilon > 0$, to the return function $R_\pi$ on this restricted subset.
Proof. For clarity, we first prove that the construction of $\pi'$ is possible if the return values are learnt perfectly, i.e., the training loss $L$ is zero. Later, we relax this assumption and show that the proof still holds for nonzero values of the loss. We choose the loss function to be a weighted mean square error on the subset extended to action sequences of arbitrary length, i.e., $(O \times_{k=1,\dots,l_{\max}} A_k)|_{\pi(a|o) \geq 1/|A|}$. An analogous approach yields similar results for other loss functions. In this loss, $P_\pi(o)$ is the probability that the observation $o$ is obtained given that the agent follows the policy $\pi$, and $A_i$ is the action space from which the action $a^{(i)}$ is sampled, as restricted by the subset. Now, let us further restrict the sum to action sequences of length one and denote the resulting loss by $L'$, for which it is easily verified that $L' \leq L$.
Using $R_{\mathrm{NN}}$, we derive the policy $\pi'$ of Eq. (1); maximising the learnt return hence leads to a return at least as large as that obtained under $\pi$.
In the following, we discuss the implications of the decoder not learning to reproduce $R_\pi$ perfectly, i.e., $L = \varepsilon > 0$. More precisely, we derive a bound on $\varepsilon$ under which a policy $\pi'$ satisfying $R_{\pi'}(o^{(0)}) \geq R_{\pi}(o^{(0)})\ \forall o^{(0)}$ in the MDP can still be constructed from the decoder.
The decoder can be used to construct the policy $\pi'$ defined in Eq. (1) if the approximation error of $R_{\mathrm{NN}}$ is small enough to distinguish the largest and second-largest return values $R_\pi(o, a)$ given an observation $o$.
In the worst case, this difference can be as small as the smallest difference between any two returns given an observation, which depends on $\delta_R = \min_i |r_{i+1} - r_i|$, the minimal non-zero difference between any two values the reward function of the environment can assign (including a reward $r = 0$). It is sufficient for $R_{\mathrm{NN}}$ to approximate $R_\pi$ with precision $\varepsilon/4$, and it is therefore sufficient to bound the error of the loss function $L$ accordingly. This worst-case analysis shows that the error needs to be exponentially small with respect to the parameters of the problem so that we can derive strong performance bounds of the policy on the entire subset. In practice, we expect to be able to derive a functional policy even with higher losses during the training of the decoder.
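The core of this argument can be illustrated numerically. The sketch below assumes that the policy derived from the decoder is greedy, i.e., that it picks the action with the largest predicted return for each observation; this is an assumption about the form of Eq. (1), which is not reproduced here. It then checks the elementary fact that the greedy choice is unchanged whenever the approximation error stays below half the gap between the largest and second-largest true returns. The return table and the function names are hypothetical.

```python
import numpy as np

def greedy_policy(return_table):
    """For each observation (row), pick the action with the largest predicted return."""
    return np.argmax(return_table, axis=1)

def smallest_top_gap(return_table):
    """Smallest gap between the largest and second-largest return over all observations."""
    ordered = np.sort(return_table, axis=1)
    return float(np.min(ordered[:, -1] - ordered[:, -2]))

# Hypothetical exact returns R_pi(o, a) for 2 observations and 3 actions.
true_returns = np.array([
    [1.0, 0.50, 0.00],
    [0.0, 0.25, 0.75],
])
gap = smallest_top_gap(true_returns)

# Hypothetical decoder output: exact returns plus an error bounded by 0.2 < gap / 2.
rng = np.random.default_rng(0)
error = rng.uniform(-0.2, 0.2, size=true_returns.shape)
learnt_returns = true_returns + error

actions_from_decoder = greedy_policy(learnt_returns)
actions_from_truth = greedy_policy(true_returns)
```

Because every entry is perturbed by less than half of `gap`, the ordering of the best and second-best action cannot flip, so both greedy policies coincide.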

D Model implementation for representation learning in reinforcement learning environments
In this appendix, we give the details of the architecture that has been used to factorise the abstract representation of a reinforcement learning environment. The code has been made available at https://github.com/HendrikPN/reinforced_scinet. For convenience, we repeat the training procedure here:
(i) Train reinforcement learning agents.
(ii) Generate training data for representation learning from reinforcement learning agents (see Appendix C.1).
(iii) Train encoders with decoders on training data to learn an abstract representation (see Appendix C.2).
The whole procedure is encompassed by a single algorithm (see Fig. 12).
Figure 12: Architecture for representation learning in reinforcement learning settings. We store neural network models in the shared memory of a graphics processing unit (GPU). As in Ref. [53], we make use of an asynchronous approach to reinforcement learning. $N_A$ copies of each of $k$ different agents interact with copies of the environment. Observations are queued and transferred to the GPU by prediction processes, which also distribute policies, returned by the GPU, to the agents. Batches of observations, actions and rewards are queued and transferred to the GPU by trainer processes for updating the neural networks. On the GPU, batches of observations obtained from predictors are evaluated with deep energy-based projective simulation models [54] to obtain a policy, and batches from policy trainers are used to update the model via the loss in Eq. (2). Everything above the red dotted line concerns the training of the reinforcement learning agents' policy, analogous to Ref. [53]. Below the dotted line, we depict our architecture, which is trained by predicting discounted rewards obtained by trained reinforcement learning agents (see Appendix C). We allow switching between training the policy and training the representation (i.e., selection of latent neurons). From training the policy to learning the representation, the training data changes only slightly. Importantly, in both cases, the data can be created on-line by reinforcement learning agents.

D.1 Asynchronous reinforcement and representation learning
Due to the highly parallelisable setting, we make use of asynchronous methods for reinforcement learning [53]. That is, the neural network models are stored in the shared memory of a graphics processing unit (GPU) at all times. Both predicting and training are therefore outsourced to the GPU, while the interactions of the various agents with their environments happen in parallel on central processing units (CPUs). The interface between the GPU and the CPUs is provided by two main kinds of processes, each assigned its own threads on the CPUs: predictor processes (see footnote 7) and training processes. Predictor processes get observations from a prediction queue and batch them in order to transfer them to the GPU, where a forward pass of the deep reinforcement learning model is performed to obtain the policies (i.e., probability distributions over actions), which are then redistributed to the respective agents.
Training processes batch training data as appropriate for the learning model, in the same way as predictors batch observations. This data is transferred to the GPU to update the neural network. In our case, we need to be able to switch between two such training processes: one for training a policy, as in Ref. [53] and as required by step (i) of our training procedure, and one for representation learning, as required by step (iii). Interestingly, the training data used by the policy trainers in step (i) is very similar to the training data used by the selection trainers in step (iii). Therefore, in the transition from step (i) to step (iii), we only have to slightly alter the data that is sent to the training queue, as required by the algorithm in Sec. C.1. Note that the similarity of the training data for the two training processes is due to the specific deep reinforcement learning model under consideration, as described in the following section. For further details on the implementation of asynchronous reinforcement learning methods on GPUs, see Ref. [53].
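The role of a predictor process can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions rather than the implementation referenced above: the GPU forward pass is replaced by a hypothetical `toy_model` running in NumPy, and the names `drain_batch`, `predictor_step` and `return_queues` are invented for the example. In an actual asynchronous setup, agents and predictors would run concurrently and communicate through such queues.

```python
import queue
import numpy as np

def drain_batch(prediction_queue, max_batch_size):
    """Collect up to max_batch_size (agent_id, observation) items without blocking."""
    agent_ids, observations = [], []
    while len(agent_ids) < max_batch_size:
        try:
            agent_id, observation = prediction_queue.get_nowait()
        except queue.Empty:
            break
        agent_ids.append(agent_id)
        observations.append(observation)
    return agent_ids, np.array(observations)

def toy_model(batch):
    """Stand-in for the GPU forward pass: linear map (3 features -> 2 actions) plus softmax."""
    logits = batch @ np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def predictor_step(prediction_queue, return_queues, max_batch_size=8):
    """Batch queued observations, evaluate them once, and send each policy back."""
    agent_ids, batch = drain_batch(prediction_queue, max_batch_size)
    if not agent_ids:
        return 0
    policies = toy_model(batch)
    for agent_id, policy in zip(agent_ids, policies):
        return_queues[agent_id].put(policy)
    return len(agent_ids)

# Three hypothetical agents each submit one observation.
prediction_queue = queue.Queue()
return_queues = {agent_id: queue.Queue() for agent_id in range(3)}
for agent_id in range(3):
    prediction_queue.put((agent_id, np.array([float(agent_id), 1.0, 0.0])))

served = predictor_step(prediction_queue, return_queues)
```

Batching amortises the cost of one forward pass over many agents, which is the main motivation for routing all predictions through a small number of predictor processes.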

D.2 Deep energy-based projective simulation model
The deep learning model used for the numerical results obtained here is a deep energy-based projective simulation (DPS) model as first presented in Ref. [54]. We chose this model because it allows us to easily switch between training the policy and training the decoders since the training data is almost the same for both. In fact, besides different initial biases and network sizes, the models used as reinforcement learning agents and the models used for decoders are the same.
Note that we are free to choose other loss functions such as the mean square error or a Huber loss. We want the current $h$-value to be updated such that it maximises the future expected reward. Approximating this reward at time $t$ for a given discount factor, we write
$$R = \sum_{j=1}^{l_{\max}} (1 - \eta)^{j-1} r_{t+j},$$
where $\eta \in (0, 1]$ is the so-called glow parameter accounting for the discount of rewards $r_{t+j}$ obtained after observing $o$ and taking action $a$ at time $t$ up to a temporal horizon $l_{\max}$. The target $h$-value is then associated with this discounted reward through a relation involving $\gamma_{\mathrm{PS}} \in [0, 1)$, the so-called forgetting parameter used for regularisation. The $h$-values are used to derive a policy through the softmax function,
$$\pi(a|o) = \frac{e^{\beta h(o, a)}}{\sum_{a'} e^{\beta h(o, a')}},$$
where $\beta > 0$ is an inverse temperature parameter which governs the drive for exploration versus exploitation. The tabular approach to projective simulation has been proven to converge to an optimal policy in the limit of infinitely many interactions with certain MDPs [66] and has been shown to perform as well as standard approaches to reinforcement learning on benchmarking tasks [67]. For a detailed description and motivation of the DPS model, we refer to Ref. [54]. Note that the training data required to define the loss in Eq. (2) consists of tuples containing observations, actions and discounted rewards $(o, a, R)$. Since this is in line with the training data required for training the decoders as described in Appendix C.1, this model is particularly well suited for the combination with representation learning as introduced in this paper.
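The two quantities discussed in this section can be made concrete with a short sketch. It assumes that the discounted reward is the sum of $(1 - \eta)^{j-1} r_{t+j}$ over the temporal horizon and that the policy is the standard softmax over $\beta$-scaled $h$-values; both forms are assumptions consistent with the description above, and the update rule for the target $h$-value is deliberately left out. The function names are hypothetical.

```python
import numpy as np

def glow_discounted_reward(rewards, eta):
    """Sum over j >= 1 of (1 - eta)^(j - 1) * r_{t+j} for the given future rewards."""
    rewards = np.asarray(rewards, dtype=float)
    weights = (1.0 - eta) ** np.arange(len(rewards))
    return float(np.sum(weights * rewards))

def softmax_policy(h_values, beta):
    """Probability of each action given its h-value and inverse temperature beta."""
    scaled = beta * np.asarray(h_values, dtype=float)
    scaled -= scaled.max()  # numerical stability; does not change the result
    weights = np.exp(scaled)
    return weights / weights.sum()

# With eta = 0.5, rewards (1, 1, 1) are weighted by 1, 0.5 and 0.25.
discounted = glow_discounted_reward([1.0, 1.0, 1.0], eta=0.5)

# With eta = 1, only the immediate reward r_{t+1} contributes.
immediate_only = glow_discounted_reward([2.0, 5.0, 7.0], eta=1.0)

# A larger beta concentrates probability on the action with the largest h-value.
exploring = softmax_policy([1.0, 0.0], beta=1.0)
exploiting = softmax_policy([1.0, 0.0], beta=10.0)
```

Small values of $\beta$ keep the policy close to uniform (exploration), whereas large values make it nearly deterministic (exploitation).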

E Classical mechanics derivation for charged masses
In this section, we provide the analytic solution to the charged masses example in Sec. 5.1 that we use to evaluate the cost function for training the neural networks. This is a fairly direct application of the generic Kepler problem, but we include the derivation for the sake of completeness. We use the notation of Ref. [68].
The setup we consider is shown in Fig. 13. Our goal is to derive a function $v_0(\phi)$ that, for fixed $q$, $Q$, $d_0$ and given $\phi$, outputs an initial velocity for the left mass such that the mass will reach the hole. Introducing the inverse radial coordinate $u = 1/r$, the orbit $r(\theta)$ of the left mass obeys the following differential equation (see, e.g., Ref. [68, Sec. 4.3]):
$$\frac{d^2 u}{d\theta^2} + u = \frac{k}{l^2}, \quad (3)$$
with the constant
$$k = -\frac{qQ}{4\pi\varepsilon_0 m} \quad (4)$$
and the mass-normalised angular momentum $l = r^2 \, \frac{d\theta}{dt}$.
This is a conserved quantity and we can determine it from the initial conditions of the problem. The general solution to Eq. (3) is given by
$$r(\theta) = \frac{1}{A \cos(\theta - \theta_0) + k/l^2},$$
where $A$ and $\theta_0$ are constants to be determined from the initial conditions. The initial conditions are imposed at $\theta = 0$, where $r(\theta = 0) = 1/(A \cos(\theta_0) + k/l^2)$.
Combining these yields Eqs. (10) and (11), which fix the constants $A$ and $\theta_0$.
The condition that the mass reaches the hole is expressed in terms of $r(\theta)$. Using $\cos(\pi/4 - \theta_0) = \cos(\theta_0)/\sqrt{2} + \sin(\theta_0)/\sqrt{2}$ and the definition of $l$ as well as Eqs. (10) and (11), we can solve this condition for $v_0$. Restricting $\phi$ to a suitably small interval, the resulting function $v_0(\phi)$ is injective and has a well-defined inverse $\phi(v_0)$. The neural network has to compute this inverse from operational input data. To generate valid question-answer pairs, we evaluate $v_0(\phi)$ on a large number of randomly chosen $\phi$ (inside the interval where the function is injective).
Figure 13: Setup and variable names for a charged mass being shot into a hole. A charged particle with mass $m$ and charge $q$ moves in the electrostatic field generated by another charge $Q$ at a fixed position. The initial conditions are given by the velocity $v_0$ and the angle $\phi$. We want to determine the value for $\phi$ that will result in the particle landing in the target hole, given a velocity $v_0$.
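As a sanity check on the orbit equation used in this derivation, the following sketch verifies numerically that $u(\theta) = A \cos(\theta - \theta_0) + k/l^2$ satisfies $d^2u/d\theta^2 + u = k/l^2$. The form of the equation is an assumption that follows from the definitions of $k$ and $l$ given above, and the numerical values of $A$, $\theta_0$, $k$ and $l$ are arbitrary, hypothetical choices. The sketch does not evaluate $v_0(\phi)$ itself.

```python
import numpy as np

def inverse_radius(theta, amplitude, theta_0, k, l):
    """u(theta) = 1 / r(theta) for the assumed general solution of the orbit equation."""
    return amplitude * np.cos(theta - theta_0) + k / l**2

# Hypothetical parameter values, chosen only for the numerical check.
amplitude, theta_0, k, l = 0.3, 0.4, 2.0, 1.5

theta = np.linspace(0.0, 1.0, 2001)
step = theta[1] - theta[0]
u = inverse_radius(theta, amplitude, theta_0, k, l)

# Second derivative via central finite differences on the interior grid points.
u_second = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / step**2

# Residual of d^2u/dtheta^2 + u - k/l^2, which should vanish up to discretisation error.
residual = u_second + u[1:-1] - k / l**2
max_residual = float(np.max(np.abs(residual)))
```

The residual is limited only by the finite-difference discretisation and floating-point rounding, which supports the stated solution form for any choice of the two integration constants.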