The R-mAtrIx Net

We present a novel neural network architecture that can: i) output the R-matrix for a given quantum integrable spin chain, ii) search for an integrable Hamiltonian and the corresponding R-matrix under assumptions of certain symmetries or other restrictions, iii) explore the space of Hamiltonians around already learned models and reconstruct the family of integrable spin chains to which they belong. The neural network is trained by minimizing loss functions encoding the Yang-Baxter equation, regularity and other model-specific restrictions such as hermiticity. Holomorphy is implemented via the choice of activation functions. We demonstrate the work of our neural network on two-dimensional spin chains of difference form. In particular, we reconstruct the R-matrices for all 14 classes. We also demonstrate its utility as an \textit{Explorer}, scanning a certain subspace of Hamiltonians and identifying integrable classes after clustering. The latter strategy can be used in the future to carve out the map of integrable spin chains in higher dimensions and in more general settings where no analytical methods are available.


Introduction
Neural networks and deep learning have recently emerged as a competitive computational tool in many areas of theoretical physics and mathematics, in addition to their several impressive achievements in computer vision and natural language processing [1]. In String Theory and Algebraic Geometry, for instance, the application of these methods was initiated in [2][3][4][5]. Since then, deep learning has seen several interesting and remarkable applications in the field, both on the computational front [6][7][8][9][10][11][12][13][14][15] as well as towards the explication of foundational questions [16]. These methods have also appeared in the context of Conformal Field Theory, critical phenomena, spin systems and Matrix Models [17][18][19][20][21][22][23]. More generally, deep learning has found interesting applications in mathematics, ranging from the solution of nonlinear partial differential equations [24,25], to symbolic calculations [26] and to hypothesis generation [27,28].
Interestingly, deep learning is starting to play an increasingly important role in symbolic regression, i.e. the extraction of exact analytical expressions from numerical data [29]. While it is difficult to pinpoint any one solitary reason for this confluence of several fields into deep learning, there are some important themes that do seem to play a recurring role. Firstly, deep neural networks are a highly flexible parametrized class of functions and provide an efficient way to approximate various functional spaces and scan over them [30][31][32][33]. The same neural network, as we shall shortly see, can learn a Jacobi elliptic function as easily as it does a trigonometric or an exponential function. Such approximations train well for a variety of loss landscapes, including non-convex ones. Secondly, over the past several years, robust frameworks for the design and optimization of neural networks have been developed, both as an explication of best practices [34][35][36][37][38][39] and through the development of standardized software for implementation [40][41][42]. This has made it possible to reliably train increasingly deeper networks which are optimized to carry out increasingly sophisticated tasks, such as the direct computation of Ricci-flat metrics on Calabi-Yau manifolds [12][13][14][15] and the solution of differential equations without necessarily providing the neural network data obtained from explicitly sampling the solution. Further, in recent interesting developments, deep learning has been applied to analyze various aspects of symmetry in physical systems, ranging from their classification to their automated detection [19,[43][44][45][46][47].
The profound role played by symmetry in theoretical physics and mathematics is hard to overstate.
Probably its most compelling expression in theoretical physics is found in the bootstrap program, which rests on the idea that a theory may be significantly or even fully constrained just by the use of general principles and symmetries, without analysis of the microscopic dynamics. For example, the S-matrix Bootstrap bounds the space of allowed S-matrices relying only on unitarity, causality, crossing, analyticity and global symmetries [48][49][50]. This provides rigorous numerical bounds on the coupling constants and significantly restricts the space of self-consistent theories [51][52][53]. This line of consideration finds its ultimate realization in two dimensions once applied to integrable theories. The Integrable Bootstrap complements the aforementioned constraints with one extra functional equation, the Yang-Baxter equation, which manifests scattering factorization and allows us to fix the S-matrix completely [54]. The same Yang-Baxter (YB) equation appears in the closely related context of integrable spin chains. Now, instead of the S-matrix, it constrains the R-matrix operator, whose existence allows one to construct a commuting tower of higher charges and prove integrability. Practically, one has to solve the functional YB equation in a certain functional space. There is no known general method to do so, and all existing approaches are limited in their scope of application; they fall into three groups. The first class of methods is algebraic in nature, exploiting the symmetry of the R-matrix [55,56]. The second approach aims to directly solve the functional equation or the related differential equation [57]. The third alternative utilizes the boost operator to generate higher charges and impose their commutativity [58][59][60].
In this paper we shall demonstrate how neural networks and deep learning provide an efficient way to numerically solve the Yang-Baxter equation for integrable quantum spin chains. On an immediate front, we are motivated by recent interesting work on classical integrable systems using machine learning [43][44][45][61]. The approach taken in the work [61], of learning classical Lax pairs for integrable systems by minimization of loss functions encoding a flatness condition, has a particularly close parallel to our approach. However, to the best of our knowledge, the present work is the first attempt to apply machine learning to quantum integrability, the analysis of R-matrices and the Yang-Baxter equation.
Our analysis utilizes neural networks to construct an approximator for the R-matrix and thereby solve the functional Yang-Baxter equation, while also allowing for the imposition of additional constraints.
We look into a sub-class of all possible R-matrices, namely those that are regular and holomorphic, and incorporate the Yang-Baxter equation into the loss function. Upon training for a given integrable Hamiltonian, we successfully learn the corresponding R-matrix to a prescribed precision. Using spin chains with two-dimensional local space as our main playground, we reproduce all R-matrices of difference form, which were recently classified in [58]. Moreover, this Solver can be turned into an Explorer, which scans the space (or a certain subspace) of all Hamiltonians looking for integrable models, and in principle allows us to discover new integrable models inaccessible to other methods. Below we provide a summary of the neural network and its training, as well as an overview of the paper.

Summary of Neural Network and Training:
The functional Yang-Baxter equation, see Equation (2.3) below, is holomorphic in the spectral parameter u ∈ C and as such holds over the entire complex plane. In this paper, we shall restrict our training to the interval Ω = (−1, 1) on the real line, but design our neural network so that it analytically continues to a holomorphic function over the complex plane. Each entry of the R-matrix is separately modeled by a multi-layer perceptron (MLP) with two hidden layers of 50 neurons each, taking as input the variable u ∈ Ω. More details are available in Section 3.2 and Appendix B. All the neurons are swish activated [62], except for the output neurons, which are linearly activated. Training proceeds by optimizing the loss functions that encode the Yang-Baxter equation (3.7), regularity (3.12), and constraints on the form of the spin chain Hamiltonian, for instance via (3.13). Hermiticity of the Hamiltonian, if applicable, is imposed by the loss (3.15). Optimization is done using Adam [63] with a starting learning rate of η = 10^{-3}, which is annealed to η = 10^{-8} in steps of 10^{-1} by monitoring the Yang-Baxter loss (3.7) on validation data for saturation.
Adam's hyperparameters β_1 and β_2 are fixed to 0.9 and 0.999, respectively. In the following, we will refer to this learning rate policy as the standard schedule. We apply this framework to explore the space of R-matrices using the following strategies: 1. Exploration by Attraction: The Hamiltonian loss (3.13) is imposed by specifying target numerical values for the two-particle Hamiltonian, or some ansatz/symmetries instead (like 6-vertex, 8-vertex, etc.). We also formally include here the ultimate case of a general search, when no restrictions are imposed on the Hamiltonian at all. This strategy is predominantly used in Section 4.1.

2. Exploration by Repulsion:
We can generate new solutions by repelling away from an ansatz or a given spin chain. This requires us to activate the loss function (3.17) for a few epochs in order to move away from the specified Hamiltonian. This strategy is employed in Section 4.2.
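The standard schedule described in the summary above (annealing η from 10^{-3} down to 10^{-8} in factors of 10^{-1} when the monitored Yang-Baxter validation loss saturates) can be sketched as a minimal plateau scheduler. The patience value below is an illustrative assumption, not a number taken from the paper:

```python
class PlateauSchedule:
    """Anneal the learning rate by a factor of 0.1 whenever the monitored
    validation loss stops improving, down to a floor of 1e-8."""

    def __init__(self, lr=1e-3, factor=0.1, patience=2, min_lr=1e-8):
        self.lr = lr
        self.factor = factor
        self.patience = patience      # epochs to wait before annealing
        self.min_lr = min_lr
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss      # improvement: reset the counter
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr

sched = PlateauSchedule()
for loss in [0.5, 0.5, 0.5, 0.5, 0.5]:   # a saturated validation loss
    lr = sched.step(loss)
# lr has been annealed twice: 1e-3 -> 1e-4 -> 1e-5
```

In practice one would call `step` once per epoch with the validation Yang-Baxter loss and feed the returned rate to the optimizer.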
Further, we also have two schemes for initializing training.

1. Random initialization:
We randomly initialize the weights of the neural network using He initialization [64]. This samples the weights from either a uniform or a normal distribution centered around 0, with a variance that scales as the inverse of the layer width.

2. Warm-start: we use the weights and biases of an already learnt solution.
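The He initialization of scheme 1 can be sketched for a single weight matrix as follows; the defining feature is the 2/fan_in variance, while the layer sizes here are arbitrary illustrative choices:

```python
import numpy as np

def he_normal(fan_in, fan_out, rng):
    """Sample a (fan_out, fan_in) weight matrix from N(0, 2/fan_in)."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W = he_normal(1000, 1000, rng)
# The empirical standard deviation is close to sqrt(2/1000) ~ 0.0447
```

The scaling keeps the variance of pre-activations roughly constant from layer to layer, which is why it is a common default for deep MLPs.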
A brief overview of this paper is as follows. In Section 2 we quickly introduce the R-matrix and other key concepts from the quantum integrability of spin chains relevant to this paper. In particular, in subsection 2.1, we review the classification program of 2-D spin chains of difference form through the boost automorphism method [58]. Section 3 contains a review of neural networks with a view towards machine learning the R-matrix given an ansatz for the two-particle Hamiltonian. Our methodology for this computation is provided in Section 3.2. We then present our results in Section 4. Section 4.1 focuses on hermitian XYZ and XXZ models (Section 4.1.1), and prototype examples from the 14 gauge-inequivalent classes of models in [58] (Section 4.1.2). The latter sub-section also contrasts training behaviour for integrable and non-integrable models. Section 4.2 presents a preliminary search strategy for new models, which we illustrate within a toy-model setting: rediscovering the two integrable subclasses of 6-vertex Hamiltonians. Section 5 discusses ongoing and future research directions.

An Overview of Spin Chains and Quantum Integrability
Quantum integrability, like its classical counterpart, hinges on the presence of a tower of conserved charges in involution, i.e. operators that mutually commute. In this paper we consider quantum integrable spin chains, and the goal of this section is to introduce such systems and provide a brief overview of the R-matrix construction in their context.
The Hilbert space of the spin chain is an L-fold tensor product of local spaces, H = V_1 ⊗ V_2 ⊗ ··· ⊗ V_L. The Hamiltonian H of a spin chain with nearest-neighbour interaction is a sum of two-site Hamiltonians H_{i,i+1}:

H = Σ_{i=1}^{L} H_{i,i+1},    (2.1)

where we assume periodic boundary conditions: H_{L,L+1} ≡ H_{L,1}. The textbook example of an integrable spin chain is the spin-1/2 XYZ model:

H = Σ_{i=1}^{L} Σ_α J_α S^α_i S^α_{i+1},    (2.2)

where α ∈ {x, y, z} and the S^α_i are Pauli matrices acting in the two-dimensional space V_i = C^2 of the i-th site. In the particular case J_x = J_y it reproduces the XXZ model, while in the case of three equal coupling constants J_x = J_y = J_z = J the Hamiltonian reduces to the XXX spin chain. These famous magnet models are just a few examples of integrable spin chains, and we now turn to the general construction.
The central element of the whole construction and of the proof of quantum integrability is the R-matrix operator R_{ij}(u), which acts in the tensor product V_i ⊗ V_j of two spin sites² and satisfies the Yang-Baxter equation:

R_{12}(u−v) R_{13}(u) R_{23}(v) = R_{23}(v) R_{13}(u) R_{12}(u−v),    (2.3)

where the operators on the left and right sides act in the tensor product V_1 ⊗ V_2 ⊗ V_3, and the R-matrix is assumed to be an analytic function of the spectral parameter u. Further, in order to guarantee locality of the interaction in (2.1), it must reduce to the permutation operator P_{ij} when evaluated at u = 0, i.e.

R_{ij}(0) = P_{ij}.    (2.4)
This condition will be referred to as regularity in the following sections. We next turn to defining the monodromy matrix T_a(u). This matrix acts on the spin chain plus an auxiliary spin site labeled by a, with Hilbert space V_a ≅ C^d. It is defined as a product of R-matrices R_{a,i}(u) acting on the auxiliary site and one of the spin chain sites:

T_a(u) = R_{a,L}(u) R_{a,L−1}(u) ··· R_{a,1}(u).    (2.5)

The transfer matrix T(u) ∈ End(H) is obtained by taking a trace over the auxiliary vector space V_a:

T(u) = tr_a T_a(u).    (2.6)

From the Yang-Baxter equation one can derive the following RTT relation constraining the monodromy matrix entries:

R_{ab}(u−v) T_a(u) T_b(v) = T_b(v) T_a(u) R_{ab}(u−v).    (2.7)

This condition can be used to prove that the transfer matrices commute at different values of the spectral parameter:

[T(u), T(v)] = 0.    (2.8)

The above condition implies that the transfer matrix T(u) encodes all the commuting charges Q_i via its series expansion in u:

T(u) = T(0) exp( Σ_{r≥2} Q_r u^{r−1} ).    (2.9)

Hence we have

Q_r ∝ (d/du)^{r−1} log T(u) |_{u=0},   [Q_r, Q_s] = 0.    (2.10)

The Hamiltonian density H_{i,i+1} introduced earlier in equation (2.1) can be generated from the R-matrix using

H_{i,i+1} = P_{i,i+1} (d/du) R_{i,i+1}(u) |_{u=0},    (2.11)

where P_{i,i+1} is the permutation operator between sites i, i+1. Also, we emphasize that while the charges are conventionally computed in Equation (2.10) at u = 0, this computation can equally well be done at generic values of u to extract mutually commuting charges. The only difference is that we no longer recover the Hamiltonian directly as one of the commuting charges.
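As a quick numerical illustration of equation (2.11), the sketch below extracts the Hamiltonian density from Yang's rational R-matrix R(u) = u·1 + P, a standard regular solution used here purely as an example: the finite-difference derivative at u = 0, multiplied by P, recovers H = P, which is the XXX density up to additive and multiplicative constants:

```python
import numpy as np

# Permutation operator P on C^2 (x) C^2
P = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 1]], dtype=float)

def R(u):
    """Yang's rational R-matrix: regular, since R(0) = P."""
    return u * np.eye(4) + P

# H_{i,i+1} = P d/du R(u) |_{u=0}, via a central finite difference
eps = 1e-6
Rprime = (R(eps) - R(-eps)) / (2 * eps)
H = P @ Rprime   # equals P for this R-matrix
```

For a trained neural-network R-matrix, the same finite-difference (or automatic differentiation) step yields the two-particle Hamiltonian to be matched against the target.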
The Yang-Baxter equation (2.3) should be supplemented with certain analytical properties of the R-matrix.
For example, as was already mentioned, we assume that the R-matrix is a holomorphic function of the spectral parameter u and equal to the permutation matrix at u = 0 (2.4). Furthermore, one can impose extra physical constraints like braided unitarity, crossing symmetry, and possibly additional global symmetries. We shall also impose restrictions on the form of the resulting Hamiltonian. These restrictions may follow from requirements such as hermiticity and from symmetries of the spin chain. In addition, given a solution of the Yang-Baxter equation, one can generate a whole family of solutions by acting with the following transformations:

1. Similarity transformation:

R(u) → (Ω ⊗ Ω) R(u) (Ω ⊗ Ω)^{−1},

where Ω ∈ Aut(V) is a basis transformation.
It transforms the commuting charges as Q_r → Ω^{⊗L} Q_r (Ω^{⊗L})^{−1}.

2. Rescaling of the spectral parameter: u → cu, for any c ∈ C. This leads to a scaling of the charges, Q_r → c^{r−1} Q_r.

3. Multiplication by any scalar holomorphic function f(u) preserving the regularity condition: R(u) → f(u)R(u), f(0) = 1. This degree of freedom can be used to set one of the entries of the R-matrix to one, or to any other fixed function.
4. Permutation, transposition and their composition: P R(u) P, R(u)^T, P R(u)^T P. They transform the commuting charges as well. The Hamiltonian H is transformed to P H P, P H^T P, H^T, respectively.
In general, one should always be careful about these redundancies when comparing a trained solution against analytic results. Following [58], we shall fix the above symmetries when presenting our results in Section 4.1.2 and Appendix A. We also look at gauge-equivalent solutions, by introducing similarity transformations in Section 4.1.2.

Reviewing two-dimensional R-matrices of the difference form
We will illustrate the work of our neural network using two-dimensional spin chains as a playground.
The regular difference-form integrable models in this context have recently been classified using the boost operator in [58]. Here, we present a brief overview of the methods and results of that work.
The boost automorphism method allows one to find integrable Hamiltonians by reducing the problem to a set of algebraic equations. Let us focus on a spin chain with two-dimensional local space V = C^2 and nearest-neighbour Hamiltonian (2.1). One formally defines the boost operator B [65] as

B = Σ_n n H_{n,n+1},

which generates the higher charges Q_n, n ≥ 3, from the Hamiltonian Q_2 via action by commutation:

Q_{n+1} = [B, Q_n].

This was used in [58] to successfully classify all 2-dimensional integrable Hamiltonians by solving the system of algebraic equations arising from imposing vanishing conditions on commutators between the Q_i, up to some finite value of i. Surprisingly, it turns out that for the considered models the vanishing of the first non-trivial commutator [Q_2, Q_3] = 0 is a sufficient condition to ensure the vanishing of all other commutators. Then, making an ansatz for the R-matrices and solving the Yang-Baxter equation in the small-u limit, the authors constructed the corresponding R-matrices and confirmed the integrability of the discovered Hamiltonians. The solutions can be organized into two classes, XYZ-type and non-XYZ-type, distinguished by the non-zero entries appearing in the Hamiltonian.
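To make the construction concrete, the sketch below checks [Q_2, Q_3] = 0 numerically for the XXX chain (density H = P) on L = 4 sites with periodic boundary conditions. It uses the standard fact that, up to normalization and boundary terms, the boost commutator [B, Q_2] reduces to the local sum Σ_n [H_{n,n+1}, H_{n+1,n+2}]:

```python
import numpy as np

def two_site(h, i, j, L):
    """Embed a two-site operator h (4x4) acting on sites i, j of an
    L-qubit chain into the full 2^L-dimensional space."""
    dim = 2 ** L
    A = np.kron(h, np.eye(2 ** (L - 2)))          # h placed on sites (0, 1)
    A = A.reshape([2] * (2 * L))
    order = [i, j] + [k for k in range(L) if k not in (i, j)]
    inv = list(np.argsort(order))                 # route (0, 1) -> (i, j)
    A = A.transpose(inv + [L + p for p in inv])
    return A.reshape(dim, dim)

L = 4
P = np.array([[1, 0, 0, 0], [0, 0, 1, 0],
              [0, 1, 0, 0], [0, 0, 0, 1]], dtype=float)  # XXX density

h = [two_site(P, n, (n + 1) % L, L) for n in range(L)]   # H_{n,n+1}
Q2 = sum(h)
# Q3 ~ [B, Q2]: for a periodic chain this is the local sum of commutators
Q3 = sum(h[n] @ h[(n + 1) % L] - h[(n + 1) % L] @ h[n] for n in range(L))
# [Q2, Q3] vanishes for an integrable Hamiltonian density
```

For a candidate non-integrable density, the same check produces a non-zero commutator norm, which is exactly the algebraic condition solved in [58].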
Generically, all non-zero entries would be complex valued. Hermiticity, for the actual XYZ model and its XXZ and XXX limits, places additional constraints. Integrability also imposes additional algebraic constraints between the non-zero entries, none of which involve complex conjugation, in contrast to hermiticity. Amongst the XYZ-type models, there are 8 distinct solutions, each corresponding to some set of algebraic constraints among the matrix elements of H.

Figure 1: Schematic of a multi-layer perceptron. The input is fed via the Input layer (green) to a series of three Fully Connected layers (purple) containing four neurons each, and finally into the Output layer of three neurons (orange). Every neuron in a given layer receives inputs from all neurons in the preceding layer, and in turn, its output is passed as input to all neurons in the next layer.

Neural Networks for the R-matrix
This section reviews several essential facts about neural networks before presenting our own network design for deep learning and the associated custom loss functions. Further details regarding the network architecture and training schedule can be found in Appendix B.

An overview of Neural Networks
The central computation in this paper is the utilization of neural networks to construct R-matrices that correspond to given integrable spin chain Hamiltonians. We therefore furnish a lightning overview of neural networks in this section, along with the details of our implementation of the neural network solver for Yang-Baxter equations. We will focus on dense neural networks, also known as multi-layer perceptrons (MLPs), schematically displayed in Figure 1. These networks consist of an input layer a_in ∈ R^{n_0}, followed by a series of fully connected layers, and terminate in an output layer a_out ∈ R^{n_{L+1}}. Data is read into the network at the input layer and the output is collected at the output layer. There are L fully connected layers in this network, where the l-th layer contains n_l neurons. Each neuron a^{(l)}_m in the l-th fully connected layer receives inputs from all the neurons in the previous (l−1)-th layer, and the output of the neuron is in turn fed as input to the neurons in the succeeding layer:

a^{(l)} = h( w^{(l)} a^{(l−1)} + b^{(l)} ),   l = 1, …, L+1,    (3.1)
where w^{(l)} ∈ M(n_l, n_{l−1}, R) is a weight matrix, b^{(l)} ∈ R^{n_l} is a bias vector, and h(z) is in general a non-linear, non-polynomial function known as the activation function, acting component-wise. In (3.1) we also identify a^{(0)} = a_in and a^{(L+1)} = a_out with the input and output layers respectively. Introducing the shorthand notation A^{(l)}(a^{(l−1)}) ≡ w^{(l)} a^{(l−1)} + b^{(l)} for the affine transformations in equation (3.1), the neural network a_out(a_in) : R^{n_0} → R^{n_{L+1}} can be expressed as a composition of affine transformations and activation functions:

a_out = A^{(L+1)} ∘ h ∘ A^{(L)} ∘ ··· ∘ h ∘ A^{(1)} (a_in).    (3.2)

The output function of the neural network is tuned by tuning the weights and biases. It is by now well established that such neural networks are a highly expressive framework capable of approximating extremely complex functions, and indeed there exists a series of mathematical proofs which attest to their universal approximation property, e.g. [30][31][32][66,67]. This property, along with the feature-learning capability of deep neural networks, is the key driver of the automated search for R-matrices which we have implemented here.
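The composition above can be written out directly. Below is a minimal numpy MLP with the architecture used in this paper (input u, two hidden layers of 50 swish-activated neurons, one linearly activated output); the weights here are randomly initialized purely for illustration:

```python
import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))   # swish(z) = z * sigmoid(z)

def mlp(u, params):
    """a_out = A^(L+1) o h o A^(L) o ... o h o A^(1) (a_in)."""
    a = np.atleast_1d(u)
    for W, b in params[:-1]:
        a = swish(W @ a + b)        # hidden layers: affine + activation
    W, b = params[-1]
    return W @ a + b                # output layer: affine only (linear)

rng = np.random.default_rng(1)
sizes = [1, 50, 50, 1]              # u -> 50 -> 50 -> one R-matrix entry
params = [(rng.normal(0, np.sqrt(2 / n_in), (n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
y = mlp(0.3, params)                # scalar input, one-component output
```

In the actual setup, one such network is trained per non-zero R-matrix entry, with the weights and biases tuned by gradient descent on the custom losses below.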
In the standard supervised setting, one optimizes a loss function that penalizes the deviation of the network output from target data while allowing for the possibility of outliers. A popular class of loss functions is

L_q = (1/N) Σ_{i=1}^{N} | a_out(x_i) − y_i |^q,

where q = 1 corresponds to the mean absolute error and q = 2 to the mean squared error, respectively. We will shortly see that, in contrast to this classic supervised learning set-up, our loss functions impose constraints on the neural network output functions rather than train on a dataset of input/output pairs that directly samples R(u) at various values of u.

Machine Learning the R-Matrix
We are now ready to describe our proposed methodology for constructing R-matrices R(u) by optimizing a neural network using appropriate loss functions. An R-matrix has elements R_{ij}(u), at least some of which are non-zero. In the following, we shall focus solely on the R_{ij}(u) which are not identically zero as functions of u. We also restrict the training to real values of the spectral parameter u ∈ Ω = (−1, 1) and exclusively use holomorphic activation functions in order to guarantee the holomorphy of the resulting R-matrix. We decompose each matrix element as R_{ij}(u) = a_{ij}(u) + i b_{ij}(u) in order to learn complex-valued functions R_{ij} while training with real MLPs on the real interval Ω. In this paper, purely for uniformity, we have modeled each such a_{ij}(u) and b_{ij}(u) using an MLP containing two hidden layers of 50 neurons each and one linearly activated output neuron. We emphasize that the identification of a_{ij}(u) and b_{ij}(u) with the real and imaginary parts of R_{ij}(u) is only valid over the real line; these functions separately continue into holomorphic functions over the complex plane whose sum R_{ij}(u) is holomorphic by construction. Now, R_{ij}(u) is required to solve the Yang-Baxter equation (2.3) subject to (2.4). We may also place constraints on the corresponding two-particle Hamiltonian H given by (2.11). These criteria are encoded into loss functions which the R-matrix R_{ij}(u) aims to minimize by training. For example, in order to train the network to satisfy (2.3) for all values of the spectral parameter u from the set Ω ⊂ C, we introduce the following loss function:

L_YB = Σ_{u,v∈Ω} ‖ R_{12}(u−v) R_{13}(u) R_{23}(v) − R_{23}(v) R_{13}(u) R_{12}(u−v) ‖².    (3.7)

In practice, of course, one cannot scan across the space of all functions and is restricted to a hypothesis class. Here the hypothesis class is implicitly defined by the design of the neural network: the choice of the number of layers, the number of neurons in each layer, as well as the activation function. Varying the weights and biases of the neural network allows us to scan across this hypothesis space. While in general the exact
R-matrix may not belong to this hypothesis class, and the loss function would then be strictly positive, deep learning may allow us to approach the desired function to a high degree of accuracy. In summary, if we restrict to a hypothesis class which does not include an actual solution of the Yang-Baxter equation, then L_YB > 0 at its minimum, where ideally this minimum would be small, indicating that we have obtained a good approximation to the true solution. We expect that scanning across wider and wider hypothesis classes would bring the minimum closer and closer to zero. Further, while the RTT equation (2.7) follows from the Yang-Baxter equation (2.3), it can also be imposed separately as a loss function on the network in order to improve the training:

L_RTT = Σ_{u,v∈Ω} ‖ R_{ab}(u−v) T_a(u) T_b(v) − T_b(v) T_a(u) R_{ab}(u−v) ‖².

Next, we have constraints that must be imposed on the R-matrix at u = 0. Following equation (2.4) and equation (2.11) of the previous section, we require that⁷

R(0) = P,   P (d/du) R(u)|_{u=0} = H,

where H is the two-particle Hamiltonian. They can both be encoded in loss functions:

L_reg = ‖ R(0) − P ‖²,    (3.12)

L_H = ‖ P (d/du) R(u)|_{u=0} − H ‖².    (3.13)

Here, we should mention that we have some flexibility in the manner in which we implement the Hamiltonian constraint L_H. Firstly, one can fix the exact numerical values for the entries of H and learn the corresponding R-matrix. We will also consider extensions of this loss function where we supply only algebraic constraints, restricting the search space for target Hamiltonians to those with certain symmetries or belonging to certain gauge-equivalence classes. In general, such Hamiltonian constraints give us the requisite control to converge to the different classes of integrable Hamiltonians, and we will refer to this regime as exploration by attraction.
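As a concrete illustration of the Yang-Baxter loss in difference form, the sketch below evaluates the residual of equation (2.3) for Yang's rational R-matrix R(u) = u·1 + P, which solves it exactly; for a trained network one would simply replace R by the MLP output and average the squared residual over sampled (u, v):

```python
import numpy as np

I2 = np.eye(2)
P = np.array([[1, 0, 0, 0], [0, 0, 1, 0],
              [0, 1, 0, 0], [0, 0, 0, 1]], dtype=float)

def R(u):
    return u * np.eye(4) + P        # Yang's rational R-matrix

S23 = np.kron(I2, P)                # swap of sites 2 and 3

def yb_residual(u, v):
    """|| R12(u-v) R13(u) R23(v) - R23(v) R13(u) R12(u-v) || on C^2 x C^2 x C^2."""
    R12 = np.kron(R(u - v), I2)
    R13 = S23 @ np.kron(R(u), I2) @ S23   # conjugate R12 by the 2<->3 swap
    R23 = np.kron(I2, R(v))
    return np.linalg.norm(R12 @ R13 @ R23 - R23 @ R13 @ R12)

# Monte-Carlo estimate of the loss over Omega = (-1, 1)
rng = np.random.default_rng(2)
pts = rng.uniform(-1, 1, size=(100, 2))
loss = np.mean([yb_residual(u, v) ** 2 for u, v in pts])
# loss ~ machine precision, since Yang's R solves the YBE exactly
```

During training, the same sampled-point average is what gradient descent drives toward zero.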
In the same spirit, when working with the XYZ spin chain or its XXZ and XXX limits, we also require that the two-particle Hamiltonian computed from R(u) is hermitian, i.e. H = H^†. We impose this condition by means of the loss function

L_† = Σ_{ij} | H_{ij} − H̄_{ji} |²,    (3.15)

where the H_{ij} are the matrix elements of H. We shall therefore train our neural network with the loss function

L = L_YB + L_reg + λ_RTT L_RTT + λ_H L_H + λ_† L_†,    (3.16)

where putting a coefficient λ_α, for α ∈ {RTT, H, †}, to zero removes the corresponding loss term from the training.
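A sketch of the hermiticity term and the weighted combination of losses; the matrix H below is a hypothetical two-particle Hamiltonian used only to exercise the functions, and the λ weights default to 1 as in the text:

```python
import numpy as np

def hermiticity_loss(H):
    """L_dagger = sum_ij |H_ij - conj(H_ji)|^2 = ||H - H^dagger||_F^2."""
    return np.sum(np.abs(H - H.conj().T) ** 2)

def total_loss(L_YB, L_reg, L_RTT, L_H, L_dag, lam=(1.0, 1.0, 1.0)):
    """Weighted sum of the individual losses; setting a lambda to zero
    removes the corresponding term from training."""
    lam_RTT, lam_H, lam_dag = lam
    return L_YB + L_reg + lam_RTT * L_RTT + lam_H * L_H + lam_dag * L_dag

# A hermitian matrix gives zero hermiticity loss
H = np.array([[1.0, 2.0 + 1.0j], [2.0 - 1.0j, 3.0]])
```

In the actual training loop, each partial loss is evaluated on the network outputs and the weighted sum is differentiated with respect to the weights and biases.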
The loss function (3.16) produces a very complicated landscape, and the NN should approach its minimum during training. Usually, this search is performed with gradient-based optimization methods.
One might be skeptical about getting stuck in some local minimum instead of finding the global minimum of such a complicated loss function in a very high-dimensional hypothesis space. However, recent analysis revealed that deep NNs end up having all their local minima at almost the same value as the global one [68], [69]. In other words, there are many different configurations of weights and biases resulting in a function of similar accuracy to the one corresponding to the global minimum. There are also many saddle points, some of which have big plateaus and just a small fraction of descent directions, making them practically indistinguishable from local minima. However, most of their losses are close to the global minimum as well. Those with significantly higher losses have a bigger number of descent directions and thus can be escaped during the learning [68], [69].

Footnote 7: It is tempting to consider a variation of our method which involves residual learning a la the ResNet family of networks [64]. As opposed to learning deviations from the identity, which is typically the approach adopted in the ResNet architecture, we may define R(u) = P + R̃(u), where R̃(u) is the target function of the neural network, which we design to identically satisfy R̃(0) = 0. While this is possible in principle, in practice it turns out that, since the neural network is learning a function in the vicinity of P, which trivially minimizes the Yang-Baxter equation and all other constraints imposed, it almost invariably collapses to the trivial solution and learns R̃(u) = 0 for all u. It would nonetheless be interesting to identify such architectures that successfully learn non-trivial R-matrices, and this is in progress.
We find that the training converges to yield simultaneously low values for each of the above losses as applicable. Further, while the hyper-parameters {λ} are tunable experimentally, setting them all to 1 is a useful default. However, for fine-tuning the training it is also useful to tune these parameters to reflect the specific task at hand. We provide the requisite details in Section 4, where we discuss specific training methodologies and the corresponding results. We will also discuss there a new loss function which is useful for fine-tuning the training to access new integrable Hamiltonians H in the neighbourhood of previously known integrable Hamiltonians H_0; we will call this regime exploration by repulsion.
As a final observation on the choice of activation functions, we note that at the level of the discussion above, any holomorphic activation function, such as the sigmoid, tanh or sinh, would suffice. In practice we find that the training converges faster and more precisely using the swish activation [62]. This is given by

swish(z) = z σ(z) = z / (1 + e^{−z}).    (3.18)

We have provided some comparison tests across activation functions in Appendix B.2.

Results
We present our results for learning R-matrices within the restricted setting of two-dimensional spin chains of difference form. Our analysis is divided into three parts. First, we learn the hermitian XYZ model and its well-known XXZ and XXX limits, comparing our deep-learning results against the analytic solutions. Then we remove hermiticity and reproduce all 14 classes of solutions from [58]. The last set of experiments demonstrates how our neural network in the Explorer mode can search for integrable models by exploring the space of Hamiltonians.

Specific integrable spin chains
In this sub-section we look at specific physical models, by imposing tailored conditions on the Hamiltonian derived from the trained R-matrix. This includes constraints on the Hamiltonian entries at u = 0, and hermiticity of the Hamiltonian.

Hermitian models: XYZ spin chain and its isotropic limits
Imposing hermiticity on the 8-vertex Hamiltonian, we learn the classic XYZ integrable spin chain and its symmetric XXZ limit. We start with the following 8-vertex ansatz for the R-matrix,

R(u) = [[R_{11}, 0, 0, R_{14}], [0, R_{22}, R_{23}, 0], [0, R_{32}, R_{33}, 0], [R_{41}, 0, 0, R_{44}]],

and impose the loss functions for the YBE, the Hamiltonian constraint, regularity, and hermiticity (see equation (3.16)). The target Hamiltonians comprise a 2-parameter family H_XYZ(J_x, J_y, J_z) given by

H_XYZ = Σ_α J_α σ^α ⊗ σ^α,

where we have set J_x + J_y equal to 2. The symmetric limit of the XXZ model is realised for J_x = J_y = 1.
A useful reparametrisation for these models is in terms of the parameters (η, m) [70]. The analytic solution for the XYZ R-matrix is given in terms of Jacobi elliptic functions, with ω = 2 sn(2η | m), where m is the elliptic modular parameter. Our model consistently learns the R-matrices of the XYZ model for generic values of the free parameters η, m. Figure 2 gives the time evolution of the different loss terms during training. Figure 3 plots the R-matrix component ratios with respect to R_{12} as functions of the spectral parameter, and compares them with the corresponding analytic functions for a generic choice of deformation parameters η = π/3 and m = 0.6. Setting m = 0, we recover the XXZ models for generic values of η.
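Comparing against the analytic solution requires numerical Jacobi elliptic functions. A sketch using scipy, evaluating sn(2η | m) for the parameters η = π/3, m = 0.6 quoted above; the check at m = 0, where sn degenerates to the ordinary sine and the trigonometric XXZ regime is recovered, serves as a sanity test:

```python
import numpy as np
from scipy.special import ellipj

eta, m = np.pi / 3, 0.6
sn, cn, dn, ph = ellipj(2 * eta, m)   # Jacobi sn, cn, dn at argument 2*eta
omega = 2 * sn                        # omega = 2 sn(2*eta | m)

# Sanity check: sn(u | 0) = sin(u)
sn0, _, _, _ = ellipj(1.0, 0.0)
```

Note that `scipy.special.ellipj` takes the parameter m = k² directly, matching the modular-parameter convention used here.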

Two-dimensional classification
Here, we lift the hermiticity constraint on the Hamiltonian, thus allowing for more generic integrable models. As we shall see below, the neural network successfully learns all 14 classes [58] of difference-form integrable (not necessarily hermitian) spin chain models with 2-dimensional space at each site. The R-matrices corresponding to each of these classes are written down explicitly in Appendix A. Towards the end of this sub-section, we also present results for learning solutions in a generic gauge, obtained by similarity transformations of integrable Hamiltonians from the aforementioned 14 classes. We shall discuss the results in two parts: XYZ-type models and non-XYZ-type models.
The first set of Hamiltonians under consideration are generalisations of the XYZ model (discussed in the previous sub-section), with at most 8 non-zero elements in the Hamiltonian density,

H = [[a_1, 0, 0, d_1], [0, b_1, c_1, 0], [0, c_2, b_2, 0], [d_2, 0, 0, a_2]],

where the coefficients can take generic complex values; the XYZ model corresponds to a constrained subset of these couplings. As discussed in Section 2.1, these models can be further sub-divided into four-, six-, seven- and eight-vertex models. On the other hand, there are 6 distinct classes of non-XYZ-type solutions. Here we will discuss the training results for one example each from the XYZ and non-XYZ types, since the training behaviour is similar within these two types. The rest of the models will be presented in Appendix A. Figure 4 plots the R-matrix components as ratios with respect to R_{00} for a generic 6-vertex model with d_1 = d_2 = 0 and a_1 = a_2. The figure also includes the absolute and relative errors with respect to the corresponding analytic R-matrix (see equation (A.2)).
From the non-XYZ classes, we will focus on a 5-vertex Hamiltonian with non-zero entries a_1, …, a_5 (see Appendix A for its explicit form). For integrability, we require an additional algebraic condition among these entries. Training with the Hamiltonian constraint (3.13) for the generic values a_1 = 0.5, a_2 = 0.3, a_3 = 0.9, a_4 = 1.5, a_5 = 0.4, which satisfy the integrability condition, we reach 0.1%-level accuracy after training for ∼100 epochs.
Figure 5 plots the trained R-matrix components and absolute errors with respect to the analytic R-matrices in equation (A.20), for the above choice of target Hamiltonian. We have also surveyed more general solutions beyond the representative solutions of the 14 classes à la [58], by changing the gauge. For comparison with analytic formulae, we normalised our results by taking ratios with respect to a fixed component $R_{00}$, i.e. we plot $R_{ij}/R_{00}$. As a result of starting from the XXZ model, the R-matrix $R_{\rm XXZ}$ in the general gauge has a highly symmetric form, so we only plot the entries $R_{00}$, $R_{01}$, $R_{03}$, $R_{10}$, $R_{11}$, $R_{12}$, $R_{30}$. Since there exists an overall normalisation ambiguity, we should only compare ratios of R-matrix entries with the analytic solution written in the same gauge.
These are the models with Hamiltonians $H_{6v,1}$, $H_{6v,2}$ in appendix A. We introduce a metric, shown in Figure 7a, which measures the relative error in the approximate solution of the Yang-Baxter equation. This metric is evaluated for the trained R-matrix for both integrable and non-integrable models in Figure 8 (for the $H_{6v,1}$ model) and Figure 9 (for the $H_{class-4}$ model). We see that the normalized error can be up to two orders of magnitude larger in the non-integrable case. Note that irrespective of the choice of Hamiltonian, there are two lines, $u = v$ and $v = 0$, on which the Yang-Baxter equation is trivially satisfied due to regularity. This metric can also detect anomalous situations in which the learned solution satisfies the Hamiltonian constraint at $u = 0$ but then quickly evolves into a true solution of the Yang-Baxter equation no longer tied to the target Hamiltonian, producing a relatively small YB loss (3.7). In this case we will see a big spike in (4.13) around zero, which indicates that the found solution is fake.
The above considerations show that one can define metrics which together indicate the closeness of a given system to an integrable Hamiltonian. However, a final conclusion in the binary form "integrable/non-integrable" for a given spin chain can be made only asymptotically: increasing the number of neurons, the density of points and the training time, one will see the normalized YB loss (4.13) decrease uniformly to zero for integrable Hamiltonians, while in the non-integrable case it remains bounded from below by some positive value. Let us also stress that this problem is specific to the solver mode, where we stick to a given Hamiltonian; when the Hamiltonian restrictions are relaxed, as we will see in the next section, the neural network moves to a true solution of the Yang-Baxter equation.
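As an illustration of such a normalized metric, the sketch below evaluates a pointwise relative Yang-Baxter error for the standard trigonometric 6-vertex R-matrix (a stand-in for the trained network) and for an ad hoc non-integrable deformation of it. The value of η, the deformation, and all function names are our own illustrative choices; this is a minimal numpy model of the metric, not the implementation used in the paper.

```python
import numpy as np

I2 = np.eye(2)
P = np.eye(4)[[0, 2, 1, 3]]           # permutation (swap) operator on C^2 x C^2

def R_6v(u, eta=0.4):
    """Standard trigonometric 6-vertex R-matrix, used here as an integrable reference."""
    a, b, c = np.sin(u + eta), np.sin(u), np.sin(eta)
    return np.array([[a, 0, 0, 0],
                     [0, b, c, 0],
                     [0, c, b, 0],
                     [0, 0, 0, a]], dtype=complex)

def yb_residual(R, u, v):
    """Entrywise norm of R12(u-v) R13(u) R23(v) - R23(v) R13(u) R12(u-v) on C^2 x C^2 x C^2,
    together with the norm of both sides for normalization."""
    P23 = np.kron(I2, P)
    R12 = lambda w: np.kron(R(w), I2)
    R23 = lambda w: np.kron(I2, R(w))
    R13 = lambda w: P23 @ R12(w) @ P23          # conjugate R12 by the (2,3) swap
    lhs = R12(u - v) @ R13(u) @ R23(v)
    rhs = R23(v) @ R13(u) @ R12(u - v)
    return np.abs(lhs - rhs).sum(), np.abs(lhs).sum() + np.abs(rhs).sum()

def normalized_yb_error(R, u, v):
    # relative (normalized) Yang-Baxter error, in the spirit of the metric (4.13)
    num, den = yb_residual(R, u, v)
    return num / den

def R_deformed(u):
    # an ad hoc non-integrable deformation of the 6-vertex R-matrix
    M = R_6v(u)
    M[1, 1] += 0.3
    return M
```

For the integrable reference the normalized error sits at machine precision everywhere (and vanishes identically along u = v due to regularity), while the deformation produces an error many orders of magnitude larger at generic points.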

Explorer: new from existing
In this section we will present two kinds of experiments that illustrate how the neural network presented above can be used to scan the landscape of two-dimensional spin-chains for integrable models.The training schedule adopted in this section is visualized in Figure 10 and relies essentially on two new ingredients which distinguish it from the previous solver framework.These are warm-start and repulsion.
We will illustrate each by an example.In the first case we shall simply use warm-start, and in the second, we shall combine warm-start with repulsion.Finally, we shall use unsupervised learning methods such as t-SNE and Independent Component Analysis to identify distinct classes of Hamiltonians within the set of integrable models thus discovered.Collectively, these strategies make up our explorer framework.
The first key new ingredient is a warm-start initialization. As mentioned previously, the standard solver framework of the previous section uses He initialization [64] to instantiate the weights and biases of
the neural network. In warm-start initialization, we instead use the knowledge of integrable systems previously discovered by the neural network to find new systems in their vicinity. The idea, at least intuitively, is that it should be possible to find new integrable systems more efficiently than with random initialization, by exploring the vicinity in weight-space of previously determined solutions using an iterative procedure such as gradient-based optimization. On doing so, we find a significant acceleration in training convergence, with new solutions typically being discovered in about 5 epochs of training after warm-start initialization.
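The effect of warm starting can be mimicked in a deliberately tiny toy: fitting a "target model" sin(u + η) by gradient descent in a fixed two-function basis, first from a cold faraway initialization and then warm-started from the converged solution of a neighboring η. Every name and value here is an illustrative assumption, not the paper's actual network or schedule.

```python
import numpy as np

def fit_eta(eta, w_init, lr=0.1, tol=1e-8, max_steps=20000):
    """Least-squares fit of sin(u + eta) in the basis [sin u, cos u] by plain
    gradient descent. Returns the weights and the number of steps to reach tol."""
    u = np.linspace(-1.0, 1.0, 64)
    X = np.stack([np.sin(u), np.cos(u)], axis=1)   # fixed feature basis
    y = np.sin(u + eta)                            # "target Hamiltonian" stand-in
    w = np.array(w_init, dtype=float)
    for step in range(max_steps):
        if np.mean((X @ w - y) ** 2) < tol:
            return w, step
        w -= lr * 2 * X.T @ (X @ w - y) / len(u)   # gradient of the mean-square loss
    return w, max_steps

# cold start: initialization far from any solution
w_cold, steps_cold = fit_eta(np.pi / 4 + 0.1, [5.0, -5.0])
# warm start: reuse the converged solution of the nearby model eta = pi/4
w_prev, _ = fit_eta(np.pi / 4, [5.0, -5.0])
w_warm, steps_warm = fit_eta(np.pi / 4 + 0.1, w_prev)
```

Because the warm start already sits close to the new optimum (here the exact solution is w = (cos η, sin η)), it converges in noticeably fewer steps, which is the intuition behind warm-starting the solver.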
For definiteness, we consider the hermitian XYZ model discussed earlier in Section 4.1.1. This has a two-parameter family of solutions, corresponding to independent choices of the parameters η and m of the Jacobi elliptic functions, as seen from Equation (4.4). The XXZ model is embedded in this space as the m = 0 subspace of solutions.
We now describe how the above strategy can be used to quickly generate the cluster of XYZ R-matrices starting from a particular one, which we choose from the XXZ subclass. We begin by pre-training our neural network using the solver mode of the previous section, but with the learning rate of the Adam optimizer set to $10^{-3}$. Our next key new ingredient for the Explorer mode is repulsion, which is added to the previous strategy of warm-start initialization. In principle, it should allow us to rediscover all 14 classes of integrable spin chains. However, for the sake of simplicity, we will illustrate it now with a toy-model example and return to the general analysis later [71]. Namely, we consider the class of 6-vertex Hamiltonians with unrestricted $a_1$ and $a_2$. It includes both integrable 6-vertex classes $H_{6v,1}$, $H_{6v,2}$ (A.2, A.4) as well as non-integrable models. In order to mimic the general situation in which all integrable classes intersect at zero, we begin by pre-training the neural network to a Hamiltonian belonging to the intersection of the classes $H_{6v,1}$ and $H_{6v,2}$, i.e. the Hamiltonian
whose matrix elements satisfy the constraints $a_1 = a_2$ and $a_1 + a_2 = b_1 + b_2$ simultaneously. The results mentioned in this paper correspond to a particular setting of these parameters. Having arrived at this model, we would like to navigate to neighboring models, not by specifying target values of the Hamiltonian, but by scanning the neighborhood of the current model. To do so, we employ a two-step strategy. First, we navigate to two⁹ new 6-vertex integrable Hamiltonians by randomly scanning the vicinity of the current model without giving specific target values. We shall use these new models as our warm-start points. From each of them, we navigate away by using the repulsion loss term (3.17) for 1 epoch, followed by training for another 5 epochs. Note that in this step we still train within the restricted class of 6-vertex models, by fixing the corresponding entries of the R-matrix to zero. We repeat this schedule 25 times starting from either of the saved models. This way, we generate fifty 6-vertex integrable Hamiltonians with accuracy better than 1%¹⁰. The training curve displaying how the Yang-Baxter loss evolves is shown in Figure 12.
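A minimal caricature of this repulsion schedule, under assumptions entirely of our own (a two-parameter "Hamiltonian", a surrogate loss whose four isolated minima on the unit circle play the role of integrable models intersecting classes, and a Gaussian repulsion term standing in for the loss (3.17)), might look as follows.

```python
import numpy as np

rng = np.random.default_rng(1)
MINIMA = [np.array(v, dtype=float) for v in ((1, 0), (-1, 0), (0, 1), (0, -1))]

def grad_int(p):
    # gradient of the surrogate "integrability" loss (x*y)^2 + (x^2 + y^2 - 1)^2,
    # whose four isolated minima on the unit circle play the role of integrable models
    x, y = p
    return np.array([2 * x * y * y, 2 * x * x * y]) + 4 * (x * x + y * y - 1) * p

def grad_rep(p, anchors, width=0.6):
    # gradient of a Gaussian repulsion term away from already-found solutions,
    # standing in for the repulsion loss term of the paper
    g = np.zeros(2)
    for a in anchors:
        d = p - a
        g += np.exp(-(d @ d) / width**2) * (-2.0 / width**2) * d
    return g

def explore(rounds=6, lr=0.05):
    found, p = [], np.array([0.9, 0.1])
    for _ in range(rounds):
        for _ in range(200):                      # "repulsion epoch": move away
            p = p - lr * (grad_int(p) + grad_rep(p, found))
        for _ in range(400):                      # plain training: settle on a model
            p = p - lr * grad_int(p)
        found.append(p.copy())
        p = p + 0.05 * rng.standard_normal(2)     # warm start for the next round
    return found

models = explore()
```

Each round first pushes the parameters away from every previously found model, then lets the plain loss pull them onto a (new) minimum, so the explorer visits several distinct "integrable" points instead of reconverging to the same one.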
The learnt models are classified into two classes using clusterisation methods as shown in Figure 13.
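A sketch of such a clusterisation step, on hypothetical data standing in for the learned Hamiltonians: we generate noisy samples from the two 6-vertex families and separate them using the coordinates of Figure 14 rather than t-SNE or ICA. All parameter names, margins and noise levels are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_models(n=25, noise=1e-3, margin=0.2):
    """Hypothetical learned 6-vertex Hamiltonians, summarized by (a1, a2, b1, b2):
    family H6v,1 obeys a1 + a2 = b1 + b2, family H6v,2 obeys a1 = a2,
    each up to a small training error `noise`."""
    sign = lambda: rng.choice([-1.0, 1.0])
    class1, class2 = [], []
    for _ in range(n):
        b1, b2 = rng.uniform(-1, 1, 2)
        s = sign() * rng.uniform(margin, 1.0)      # keep a1 - a2 away from zero
        class1.append([(b1 + b2 + s) / 2, (b1 + b2 - s) / 2, b1, b2])
        a = rng.uniform(-1, 1)
        t = sign() * rng.uniform(margin, 1.0)      # keep a1 + a2 - b1 - b2 away from zero
        c1 = rng.uniform(-1, 1)
        class2.append([a, a, c1, 2 * a - t - c1])
    H = np.array(class1 + class2)
    return H + noise * rng.standard_normal(H.shape)

H = sample_models()
# the two coordinates of Figure 14: each vanishes on one integrable family
f1 = H[:, 0] + H[:, 1] - H[:, 2] - H[:, 3]        # ~0 for H6v,1
f2 = H[:, 0] - H[:, 1]                            # ~0 for H6v,2
labels = (np.abs(f1) < np.abs(f2)).astype(int)    # 1 -> H6v,1, 0 -> H6v,2
```

Assigning each model to the family whose integrability condition it satisfies better recovers the two classes exactly, mirroring the 100% classification accuracy reported for models lying near the two axes.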

Conclusions and Future directions
In this paper we constructed a neural network for solving the Yang-Baxter equation (2.3) in various contexts. Firstly, it can learn the R-matrix corresponding to a given integrable Hamiltonian, or search for an integrable spin chain and the corresponding R-matrix within a certain class specified by imposed symmetries or other restrictions. We refer to this as the solver mode. Next, in the explorer mode, it can search for new integrable models by scanning the space of Hamiltonians.
We demonstrated the use of our neural network on two-dimensional spin chains of difference form.
In the solver mode, the network successfully learns all fourteen distinct classes of R-matrices identified in [58], to accuracies of the order of 0.01-0.1%. We also demonstrated the work of the Explorer mode, restricting the search to a space of spin chains containing both classes of 6-vertex models as well as non-integrable Hamiltonians. Starting from the Hamiltonian at the intersection of the two classes, the Explorer found 50 integrable Hamiltonians which, after clusterisation, clearly fall into two families corresponding to
⁹ We stop the scanning once we have found a representative from each of the two classes, because we know that there are only two integrable families here. In the general case one should of course generate sufficiently many points in order to find all classes.
We will return to this subtle point later in [71].
¹⁰ If we further train the individual models for more epochs, we can improve the accuracy of the obtained solutions to levels similar to those obtained in the examples presented in Section 4.1.
two integrable classes of the 6-vertex model. Working in the explorer mode, we find that warm-starting our training from the vicinity of a previously learnt integrable model greatly speeds up convergence, allowing us to identify typically about 50 new integrable models in the time that random initialization takes to converge to a single model.
The main focus of this paper was creating the neural network architecture and demonstrating its robustness in various solution generating frameworks using known integrable models as a testing ground.
However, we expect that this program can be extended to various scenarios, such as the exploration and classification of the space of integrable Hamiltonians in dimensions greater than two. This would be of great interest, since the general classification of models is currently limited to two dimensions. Our experiments with exploration and clustering are a promising starting point in this regard. In our setup the strategy is quite straightforward [71]. Because every integrable family of Hamiltonians can be multiplied by an arbitrary scalar, we only need to scan the Hamiltonians on the unit sphere, which is compact. Scanning over a sufficiently dense set of points on the sphere will allow us to identify integrable Hamiltonians from the various classes. Then we can use the Explorer to reconstruct the corresponding families in full and perform clusterisation in order to identify them. On another footing, it would also be interesting to extend our study to R-matrices of non-difference form, as these are particularly relevant to the AdS/CFT correspondence [72][73][74][75].
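The unit-sphere scan described above can be sketched in a few lines; the dimension 16 (a two-site 4×4 Hamiltonian density, flattened) and the sample count are illustrative assumptions of ours, as is the standard Gaussian-normalization trick used to sample the sphere uniformly.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_hamiltonians_on_sphere(n_points, dim=16):
    """Uniform samples on the unit sphere S^{dim-1}, obtained by normalizing
    Gaussian draws. Each point is a candidate Hamiltonian density, with the
    overall scalar normalisation freedom used to fix its norm to 1."""
    x = rng.standard_normal((n_points, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

pts = sample_hamiltonians_on_sphere(1000)
```

Each sampled point would then seed the solver/Explorer pipeline, with clusterisation applied afterwards to group the points that converged onto the same integrable family.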
While our network learns a numerical approximation to the R-matrix, it can also be useful for the reconstruction of analytical solutions using symbolic regression [29,76]. Alternatively, one may try to use the learnt numerical solution to reconstruct the symmetry algebra, such as the Yangian, and then arrive at the analytical solution. Remarkably, machine learning is already proving helpful in the analysis of symmetry in physical systems. In particular, one may verify the presence of a conjectured symmetry or even automate its search using machine learning [19,[43][44][45][46][47]77]. It would be very interesting to explicate the role of our program in this broader line of investigation.
In addition, the flexibility of our approach would also allow us to implement various additional symmetries or other restrictions, both at the level of the R-matrix and of the Hamiltonian. It would therefore be very interesting to develop an 'R-matrix bootstrap' in the spirit of the two-dimensional S-matrix bootstrap, and to analyze the interplay between various symmetries. For example, all 14 families of R-matrices considered in this paper satisfy the condition of braided unitarity (2.12), and it would be interesting to rediscover them from braided unitarity and other symmetries without imposing the Yang-Baxter condition, similar to how integrable two-dimensional S-matrices have been identified in the S-matrix Bootstrap approach [51,53,78].
With mild modifications, we can adapt our architecture to the analysis of the Yang-Baxter equation for integrable S-matrices in two dimensions. The only new feature to implement is the analytic structure in the s-plane, which can be naturally realized with the use of holomorphic networks.
Learning solutions for different classes with the same architecture, we noticed that the number of epochs needed to reach the same precision varies between classes, while being roughly the same for Hamiltonians from the same class. It would thus be tempting to use the training losses to define a notion of complexity for spin chains. Ideally, one should be able to go beyond the class of integrable models and see that they sit at the minima of this complexity, matching the common belief that integrable models are the "simplest" ones.

A Classes of 2D integrable spin chains of difference form
In this appendix, we list the 14 gauge-inequivalent integrable Hamiltonians of difference form and the corresponding R-matrices. Amongst the XYZ type models, the simplest solution is a diagonal 4-vertex model, with Hamiltonian and R-matrix as follows. In 6-vertex models, we have two distinct classes, depending on whether the Hamiltonian entries $H_{00}$ and $H_{33}$ are equal or not. In the first case, the R-matrix $R_{6v,1}(u)$ and its associated Hamiltonian $H_{6v,1}$ are given below. Figure 4 gives a representative training vs actual plot for this class.
For the case $H_{00} \neq H_{33}$, the R-matrix $R_{6v,2}(u)$ is given below. The 7-vertex models similarly come in two solutions, distinguished by the Hamiltonian entries $H_{00}$, $H_{33}$ being equal or not. In the first case, Figure 17 plots the predicted R-matrix components as ratios with respect to the (12) component against the analytic results, and their differences, for a generic choice of parameters. In the second case, for $H_{00} \neq H_{33}$, Figure 18 plots the predicted R-matrix components as ratios with respect to the (12) component against the analytic results, and their differences, for a generic choice of parameters. The first class of 8-vertex XYZ-type solutions has R-matrix (A.11), with Hamiltonian coefficients given in terms of the free parameters $b_1$, $\eta$, $m$, $\delta_1$, $\delta_2$. Figure 19 plots the predicted R-matrix components as ratios with respect to the (12) component against the analytic results, and their differences, for a generic choice of parameters. The second class of 8-vertex XYZ-type solutions has Hamiltonian $H_{8v,2}$ and R-matrix $R_{8v,2}(u)$, with the Hamiltonian coefficients given below. The third class of 8-vertex XYZ-type solutions has Hamiltonian $H_{8v,3}$ and R-matrix $R_{8v,3}(u)$. In the class-2 solution above, the non-zero R-matrix components are explicitly given below.
and class 6 respectively, with generic Hamiltonian parameters. We also note that allowing for complex parameters results in generically complex R-matrices. We compare the predictions against the analytic formulae by taking ratios with respect to the real part of the (00) component for classes 2-5, and the (12) component for class 6.

B Designing the Neural Network
This appendix contains an extensive overview of the architecture of our neural network solver, as well as details of the hyperparameters with which the network is trained. Our starting point is the close analogy between our problem of machine-learning R-matrices by imposing constraints and the design of Siamese Neural Networks [79,80]. These were designed to function in settings where the canonical supervised learning approach of (3.5) for classification becomes infeasible, due to the large number of target classes {y} and the paucity of training examples {x_α} corresponding to each class y_α. In such a situation, one may instead define a similarity relation and train the neural network to learn a function $\phi(x): \mathbb{R}^D \to \mathbb{R}^d$ such that the Euclidean distance between the representatives $\phi(x_1)$, $\phi(x_2)$ of two input vectors $x_1$, $x_2$ that are similar to each other is small, while the distance between dissimilar data is large. This requirement is expressed schematically in Equation (B.2) and visualized in Figure 25.
There are many loss functions by which such networks may be trained; see for example [79][80][81][82]. For definiteness, we mention the contrastive loss function of [79,80], in which Y = 1 if $x_1 \sim x_2$ and Y = 0 otherwise. Clearly, this loss function causes the network to learn a function φ such that similar inputs are clustered together, while dissimilar inputs are pushed at least a distance $r_0$ apart. This realizes our naive criterion for φ laid out in Equation (B.2). We also see very explicitly that the loss function in Equation (B.3) does not directly depend on the values y, in contrast to Equation (3.5). Instead, the network is trained to learn a function $\phi(x)$ which obeys a property that is not given point-wise for each input x, but is instead expressed as a non-linear constraint (B.2) on $\phi(x)$ evaluated at two points $x_1$ and $x_2$.
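A direct transcription of a contrastive loss of this kind (similar pairs pulled together, dissimilar pairs pushed at least $r_0$ apart) might read as follows; the function name and margin value are our own choices.

```python
import numpy as np

def contrastive_loss(phi1, phi2, similar, r0=1.0):
    """Contrastive loss: for similar pairs (similar=1) the loss is the squared
    distance d^2, pulling them together; for dissimilar pairs (similar=0) it is
    max(0, r0 - d)^2, pushing them at least a distance r0 apart."""
    d = np.linalg.norm(np.asarray(phi1) - np.asarray(phi2))
    return similar * d**2 + (1 - similar) * max(0.0, r0 - d)**2
```

Note that the loss never references a target label y: it only constrains the learned map φ at pairs of points, which is the structural feature shared with our constraint-based training.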

B.1 The Neural Network Architecture and Forward Propagation
We now provide some more details about our implementation of $R(u)$ and the training done to converge to solutions of the Yang-Baxter equation (2.3), consistent with additional requirements such as regularity (2.4). As already mentioned in Section 3.2, each matrix element $R_{ij}$ is decomposed into the sum $a_{ij} + i\,b_{ij}$, whose terms are individually modeled by MLPs. In principle each MLP is independent of the rest and can be individually designed. We shall however take all MLPs to contain two hidden layers of 50 neurons each, followed by a single output neuron which is linearly activated¹¹. The possible activations for the hidden layers are compared in Appendix B.2. We also evaluate $R(0)$ and $R'(0)$, thus completing the forward propagation. Next, we compute the losses (3.7), (3.12) and (3.13), as well as possibly (3.15). The loss function is trained using the Adam optimizer [63], with an initial learning rate η of $10^{-3}$ which is annealed to $10^{-8}$ in steps of $10^{-1}$, by monitoring the saturation of the Yang-Baxter loss computed on the validation set over 5-10 epochs.
¹¹ One might also construct an alternate formulation of the neural network where a single MLP of the kind shown in Figure 1 accepts a real input u and outputs all the requisite real scalar functions that comprise $R(u)$. So far we have observed that such a network does not perform as well as our current formulation of independent neural networks for each real function. Nonetheless, it is possible that this formulation may eventually prove competitive with our current one, and the question remains under investigation.
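A minimal numpy sketch of this forward pass, assuming the layer widths quoted above (two hidden layers of 50 swish neurons, a linear output neuron, and one pair of real networks per complex entry $R_{ij} = a_{ij} + i\,b_{ij}$); the initialization details and helper names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def swish(x):
    return x / (1.0 + np.exp(-x))          # swish(x) = x * sigmoid(x)

def make_mlp(width=50):
    # a 1 -> width -> width -> 1 network with He-style scaled Gaussian weights
    shapes = [(1, width), (width, width), (width, 1)]
    return [(rng.standard_normal(s) * np.sqrt(2.0 / s[0]), np.zeros(s[1]))
            for s in shapes]

def mlp(params, u):
    h = np.array([[u]], dtype=float)       # single scalar input: spectral parameter
    for W, b in params[:-1]:
        h = swish(h @ W + b)               # swish-activated hidden layers
    W, b = params[-1]
    return (h @ W + b)[0, 0]               # linearly activated output neuron

# one pair of real MLPs (a_ij, b_ij) per complex entry of the 4x4 R-matrix
nets = [[(make_mlp(), make_mlp()) for _ in range(4)] for _ in range(4)]

def R(u):
    """Complex 4x4 R-matrix: R_ij(u) = a_ij(u) + i * b_ij(u)."""
    return np.array([[mlp(a, u) + 1j * mlp(b, u) for a, b in row]
                     for row in nets])
```

In the actual training loop, $R(u)$ evaluated on batches of spectral parameters, together with $R(0)$ and $R'(0)$, would feed the Yang-Baxter, regularity and Hamiltonian losses.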
The effect of this annealing of the learning rate is also visible in the training histories in Figures 2 and 27, where the step-wise falls in the losses correspond to the drops in the learning rate. Across the board, training converges in about 100 epochs and is terminated by early stopping.
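The annealing rule just described can be sketched as a small plateau detector; the patience value below is our own assumption, while the factor of 10 per step and the floor of $10^{-8}$ follow the schedule quoted above.

```python
def anneal_lr(lr, val_history, patience=7, factor=0.1, lr_min=1e-8):
    """Drop the learning rate by `factor` once the validation Yang-Baxter loss
    has stopped improving over the last `patience` epochs, down to `lr_min`."""
    if len(val_history) > patience:
        recent, earlier = val_history[-patience:], val_history[:-patience]
        if min(recent) >= min(earlier):    # no new best value: loss saturated
            return max(lr * factor, lr_min)
    return lr
```

Called once per epoch with the running validation-loss history, this reproduces the step-wise learning-rate drops visible in the training curves.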

B.2 Comparing different activation functions
We now turn to a brief comparison of the performance of different activation functions within the above setup. Again for uniformity, we use one activation throughout for all the MLPs $a_{ij}$ and $b_{ij}$, except for the output neuron, which is linearly activated. We then compared the performance of this neural network architecture while learning the Hamiltonian (B.4). We observed that the swish activation converges sooner and to lower losses. This is stable across multiple runs. See also Figure 27.
which is 6-vertex Type 1 in the classification of [58]; see Equation (A.2) above. The neural network was trained with the loss functions (3.7), (3.13) and (3.12), setting $\lambda_H$ and $\lambda_{reg}$ to 1 each. The training was carried out for 200 epochs, on observing that the networks did not perform better when trained for longer. Further, we set a batch size of 16 and optimized using Adam with a starting learning rate of $10^{-3}$, which was annealed to $10^{-8}$ using the saturation of the Yang-Baxter loss over the validation set as the criterion, as mentioned above. We conducted this training using the activations sigmoid, tanh and swish, all of which are holomorphic, as well as elu and relu. The last two are not holomorphic but have been included for completeness. The evolution of the Yang-Baxter and the Hamiltonian loss for all these activations is shown in Figure 27 and Table 1. On the whole, we see that the swish activation tends
to outperform the others quite significantly.While these are the results of a single run, we found that the result is consistent across several runs and tasks, leading us to adopt the swish activation uniformly across the board for all the analyses shown in this paper.

B.3 Proof of Concept: Training with a single hidden neuron
As a final observation, we present a simple proof of concept of our approach of solving the Yang-Baxter equation along with other constraints by optimizing suitable loss functions. Here, instead of attempting to deep-learn the solution, we use a single layer of neurons for each R-matrix function and pick an activation function based on the form of the known analytic solution. In effect, rather than rely on the feature-learning properties of a deep MLP, as we have done in the rest of our paper, we ourselves provide activation functions which furnish a natural basis in which to express the known analytic solutions. Since this choice presupposed our knowledge of the exact solution (in effect, the true R-matrix lay within our hypothesis class), the model trained to losses of the order of $10^{-8}$, which is several orders of magnitude below the typical end-of-training losses we observed in the standardized framework. Further, we also obtain excellent performance even outside the domain of training, which is usually not the case in machine learning.
Neural networks learn a target function f(x) of the input data x by optimizing a cost function L which provides a measure of the discrepancy between the actual and desired properties of the function f. The parameters w and b are then tuned to minimize this discrepancy. The best known and most canonical examples of this are supervised learning problems, where the neural network is supplied with data D = {(x, y)} consisting of pairs of input vectors x with their expected output values y. The neural network then tunes its weights and biases to minimize the cost function, so that the output function f learned by the network approximately reproduces the target values y. In the Yang-Baxter loss, $||\cdot||$ is a matrix norm defined as $||A|| = \sum_{\alpha,\beta=1}^{n} |A_{\alpha\beta}|$ for a complex-valued n×n matrix A. During the forward propagation we sample a mini-batch of u and v values, from which the corresponding $u - v$ is constructed. Throughout this paper, the spectral parameters u and v run over a discrete set of 20000 points randomly chosen from the interval Ω.⁶ The loss function $L_{YBE}$ is positive semi-definite, vanishing only when R(u) solves the Yang-Baxter equation. In principle, one may imagine a scan across the space of all functions, in which case solutions of the Yang-Baxter equation would minimize the loss (3.7) to zero.
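The norm and the mini-batch sampling just described can be transcribed directly; the interval Ω = (−1, 1) and the batch size below are illustrative assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def mat_norm(A):
    # the entrywise norm ||A|| = sum_{a,b} |A_ab| used in the Yang-Baxter loss
    return np.abs(np.asarray(A)).sum()

def sample_batch(batch_size, omega=(-1.0, 1.0)):
    """Mini-batch of spectral parameters u, v drawn from the interval Omega;
    the difference u - v is what enters R12(u - v) in the Yang-Baxter loss."""
    u = rng.uniform(*omega, size=batch_size)
    v = rng.uniform(*omega, size=batch_size)
    return u, v, u - v
```

The batch loss would then average mat_norm of the Yang-Baxter defect over the sampled (u, v) pairs, vanishing exactly when R(u) solves the equation on Ω.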

Figure 2 :
Figure 2: The evolution of training losses for the XYZ model, shown on a log scale. The losses tend to fall in a step-wise manner, corresponding approximately to the learning rate schedule the network is trained with.

If $R(u)$ satisfies the Yang-Baxter equation, so does $R_\Omega(u)$. A generic similarity matrix $\Omega = \begin{pmatrix} v_{11} & v_{12} \\ v_{21} & v_{22} \end{pmatrix}$ (4.10), with non-zero off-diagonal entries $v_{12}, v_{21} \neq 0$, results in conjugated R-matrices and Hamiltonians with all 16 entries non-zero. We trained 16-vertex Hamiltonians resulting from the XYZ model in the general gauge and recovered the corresponding R-matrix with a relative error of order O(0.1%). Generic XYZ type models, as well as non-XYZ type models, gave similar results for different gauges. Figure 6 plots the learnt R-matrix components for the XXZ model with $\eta = \pi/3$ conjugated by the matrix Ω.

Figure 6: (a) Predicted R-matrix component ratios w.r.t. $R_{00}$, for the conjugated XXZ model with $\eta = \pi/3$.
Figure 7a compares the training for a generic Hamiltonian with coefficients satisfying none of the above conditions against the training for an $H_{6v,1}$-type model. We see that while the Hamiltonian constraint (3.13) saturates to similarly low values in both cases, the Yang-Baxter loss saturates at approximately one order of magnitude higher. Similar behavior holds for the non-XYZ type models as well. The training for a generic class-4 Hamiltonian with coefficients $a_1 = 0.5$, $a_2 = 0.3$, $a_3 = 0.4$, $a_4 = 0.9$ (see Equation A.19) and a non-integrable deformation is shown in Figure 7b. One can further discriminate between integrable and non-integrable models by checking the pointwise values of the Yang-Baxter losses in the two cases. Let us define the metric

Figure 7 :
Figure 7: Comparing the training history of the XYZ-type and non-XYZ-type models against corresponding non-integrable Hamiltonians. There is approximately an order of magnitude difference between the Yang-Baxter losses for the integrable and non-integrable cases after the training saturates, indicated by the gray region in the graph. The step-wise drops in the loss functions correspond approximately to the learning rate schedule. The presented Hamiltonians are the same as in Fig. 8 and Fig. 9.

Figure 9 :
Figure 9: (a) The normalized Yang-Baxter error (4.13), plotted on a logarithmic scale at the end of training, for the Hamiltonian $H_{class-4}$ from (A.19) with $a_1 = 0.5$, $a_2 = 0.3$, $a_3 = 0.4$, $a_4 = 0.9$. (b) A non-integrable deformation with the same Hamiltonian parameters as in the integrable case, except for $H_{13} = -0.9$.

Figure 10 :
Figure 10: Visualizing the Explorer scheme. We start with random initializations, marked by lightning symbols, and perform solver learning, represented by red curved arrows. Once we reach a submanifold of integrable Hamiltonians, we explore it using repulsion to identify new integrable models.

Adam optimizer set to $10^{-3}$.
The pre-training is stopped when all losses saturate below $O(10^{-3})$, which typically requires about 50 epochs of training. We carried out this pre-training setting arbitrary reference values of η, but with m fixed to zero. The results shown here correspond to $\eta = \pi/4$. The weights thus obtained constitute our warm-start values. Then we shift the target Hamiltonian values to correspond to $\eta \to \eta + \delta\eta$, where the δη are randomly chosen $O(10^{-1})$ numbers, and m can take on non-zero values as well. We then retrain the model with a smaller learning rate, $10^{-4}$, for a few epochs until all loss terms fall to $O(10^{-4})$, which typically takes about 5 epochs, upon which we update the target Hamiltonian by updating η and m and continue training. This strategy generates about 10 XYZ models within the same time-scale (i.e. about 100 to 200 epochs of training) as we earlier needed for a single model. For best results, while we randomly update η, we systematically anneal the modular parameter m upwards.
(a) Evolution of the Yang-Baxter Loss (b) Evolution of the Hamiltonian Loss

Figure 11 :
Figure 11: The convergence to XYZ models from XXZ models trained with different parameters. The XXZ model was trained for 50 epochs at $\eta = \pi/3$ and m = 0. Then, it was trained for 5 more epochs at $\eta = \pi/4$ and $\eta = \pi/6$, still with m = 0. After that, 5 non-zero values of m were used for each XXZ model, and we trained for another 15 epochs. Loss spikes occur when the target Hamiltonian values are reset. The final training was run in parallel for convenience, but it can be run sequentially.

Figure 12 :
Figure 12: Time evolution of the Yang-Baxter loss as the neural network explores the space of integrable Hamiltonians of the 6-vertex models $H_{6v,1}$, $H_{6v,2}$ by repulsion. The loss evolves together until the 50th epoch, after which it fragments slightly as the training converges to the two warm-start points by the 60th epoch. For the remaining epochs the losses fragment completely, as the neural network seeks out different new Hamiltonians; training is terminated when the loss reaches the neighborhood of $1 \times 10^{-4}$.

Figure 13 :
Figure 13: Clustering of Hamiltonians from the 2 classes of gauge-inequivalent 6-vertex models, obtained by the Explorer using repulsion from the solution at the intersection of both classes (visualized with FastICA).

Figure 14 :
Figure 14: The 6-vertex models learnt by exploration. The graph visualizes the obtained Hamiltonians by plotting their values along the $a_1 + a_2 - b_1 - b_2$ and $a_1 - a_2$ axes. The models $H_{6v,1}$ lie along the y-axis and the models $H_{6v,2}$ along the x-axis, with an error margin of order $10^{-3}$, as shown in the telescoped inset plots.

Figure 14 plots
Figure 14 plots the trained models in terms of coordinates defined by the integrability conditions of the Hamiltonians $H_{6v,1}$, $H_{6v,2}$. Models lying near the two axes were classified correctly into the two classes in Figure 13 with 100% accuracy.

Figure 21 :
Figure 21 plots the predicted R-matrix components as ratios with respect to the (12) component against the above analytic results, and their differences, for a generic choice of parameters $a_1 = 1$, $b_1 = -0.45$, $d_1 = 0.6$, $d_2 = 0.75$.

Amongst the above non-XYZ type models, we have already looked at the training for the Class 1 model in section 4.1. Figures 22, 23 and 24 plot the training vs actual R-matrix components for classes 2, 3, 4, class 5,

Figure 26 :
Figure 26: Visualizing the forward propagation of the neural network R(u). This has a very strong parallel with Figure 25, with the function R(u) playing the role of the map φ. The only difference is that R(u) also has additional constraints on $R(0)$ and $R'(0)$, which are unique to our problem.

Table 1 :
Performance of different activation functions on learning the Hamiltonian (B.4). The saturation epoch is the approximate epoch after which the model did not train further. The final values of the Yang-Baxter and Hamiltonian losses after saturation are also listed.

Figure 27 :
Figure 27: The evolution of the Yang-Baxter loss (left) and the Hamiltonian loss (right) for a variety of activation functions when training for 200 epochs.The swish activation tends to outperform the others.

For definiteness, consider the XXZ model at $\eta = \pi/3$.
The non-zero entries in this R-matrix are $R_{00}(u) = \sin(u+\eta)$, $R_{11}(u) = \sin(u-\eta) = R_{22}(u)$, $R_{12}(u) = \sin(\eta) = R_{21}(u)$ (B.5), as may be observed by setting m = 0 in Equation (4.4). We define the networks $a_{ij}$ and $b_{ij}$ to have a single hidden layer containing a solitary neuron activated by the sin function. This means that the functions learnt by the network are simply of the form $a_{ij} = \widetilde{W} \circ \sin(W \circ u)$ (B.6), and similarly for the $b_{ij}$. Here $W$ and $\widetilde{W}$ are the weight and bias of the hidden and output neurons respectively, and the composition $W \circ u$ is shorthand for the affine transformation $wu + b$. Next, we train the network imposing the losses (3.7), (3.12), (3.13) and (3.15), each with weight λ = 1, and the Adam optimizer with our standard learning rate scheduling. Figure 28 plots the trained XXZ R-matrix components for u lying in the range (−10, 10).

Figure 28 :
Figure 28: The figure on the left shows the XXZ model R-matrix with $\eta = \pi/3$, obtained by training a single sin-activated hidden neuron on the range $u \in (-1, 1)$, shown in gray. The solution remains valid outside the training domain as well. The figure on the right shows the corresponding training curves.