Machine learning using magnetic stochastic synapses

The impressive performance of artificial neural networks has come at the cost of high energy usage and CO$_2$ emissions. Unconventional computing architectures, with magnetic systems as a candidate, have potential as alternative energy-efficient hardware but still face implementation challenges, such as stochastic behaviour. Here, we present a methodology for exploiting the traditionally detrimental stochastic effects in magnetic domain-wall motion in nanowires. We demonstrate functional binary stochastic synapses alongside a gradient learning rule that allows their training with applicability to a range of stochastic systems. The rule, utilising the mean and variance of the neuronal output distribution, finds a trade-off between synaptic stochasticity and energy efficiency depending on the number of measurements of each synapse. For single measurements, the rule results in binary synapses with minimal stochasticity, sacrificing potential performance for robustness. For multiple measurements, synaptic distributions are broad, approximating better-performing continuous synapses. This observation allows us to choose design principles depending on the desired performance and the device's operational speed and energy cost. We verify performance on physical hardware, showing it is comparable to a standard neural network.


INTRODUCTION
The meteoric rise of artificial intelligence (AI) as a part of modern life has brought many advantages. However, as AI programs become increasingly complex, their energy footprint becomes larger [1,2], with the training of one of today's state-of-the-art natural language processing models now requiring similar energy consumption to the childhood of an average American citizen [3]. Several nontraditional computing architectures aim to reduce this energy cost, including non-CMOS technologies [4-7]. However, competitive performance with non-CMOS technologies requires overcoming the latent advantage of years of development in CMOS.
In biological neural networks, synapses are considered all-or-none or graded and non-deterministic, unlike the fully analogue synapses modelled in artificial networks [8]. Inspired by biology, several approaches have considered networks with binary synapses and neurons, with the view that binary operations are simpler to compute and thus lower energy [9-12]. However, while these binarised neural networks are more robust to noise, they suffer from lower performance than analogue versions. In contrast, networks with stochastic synapses provide sampling mechanisms for probabilistic models [13] and can rival analogue networks at the expense of long sampling times [14-19]. Adapted training methods are required to provide higher performance for a lower number of samples, while implementations require hardware that can natively (with low energy cost) provide the stochasticity required.
Magnetic architectures are one possible route for unconventional computing. They have long promised a role in computing logic following the strong interest in the field stemming from the data storage market [6,7,20-26]. The non-volatility of magnetic elements naturally allows for data storage, while ultra-low-power control mechanisms, such as spin-polarised currents or applied strain [27,28], offer routes towards energy-efficient logic-in-memory computing. Ongoing developments have shown how to manipulate magnetic domains to both move data and process it [22,29-31]. However, magnetic domain wall logic is limited by stochastic effects, particularly when compared to the low error tolerance environment of CMOS computing [32,33].
Here we propose a methodology in which stochastic effects, rather than being eliminated, become a crucial part of our computing architecture. As a proof of concept, we demonstrate how a nanowire can be used as a stochastic magnetic synapse able to perform handwritten digit recognition by multiplexing a single hardware synapse.
We have developed a learning rule that can effectively train artificial neural networks made of such "noisy" synapses by considering the synaptic distribution. Suppose we allow a single measurement to identify the state of the synapse. In that case, the learning rule will adjust its parameter, i.e. the field at which the wall is propagated, to reduce the synaptic stochasticity. If we allow multiple measurements, the gradient rule will find parameters that allow for a broad synaptic distribution, mimicking a continuous synapse and improving performance. Without the stochasticity, the operation would be limited to binary operations, which lack the resolution power of analogue synapses. With stochasticity, we have a flexible system tunable between quick-run-time approximation and long-run-time performance. Our learning rule provides efficient network training despite the high or variable noise environment and differs from other stochastic neural network computing schemes that employ mean-field-based learning rules [14,16,19]. Here, the inclusion of the network variance allows the training to find better solutions in low sampling regimes, providing a trade-off between operational speed/energy cost and test accuracy.
We have verified the model performance experimentally by transferring the trained weights to a network utilising such a hardware synapse, with excellent agreement between the experimental performance and that of a simulated network. Our observations allow for a design framework where we can identify the number of required measurements (and hence energy requirements) for a given desired accuracy and vice versa.
This work opens up the prospect of utilising the low-energy-cost benefits of spintronic-based logic [5-7,34]. In particular, it enables the use of domain wall-based nanowire devices [24,31,35,36] whilst transforming the hitherto hindrance of noisy operation [32,33] into the basis of a high-performance stochastic machine learning paradigm.

Hardware stochastic synapse
Our proposed elementary computation unit is a binary stochastic synapse based on a ferromagnetic nanowire with two favourable magnetic orientations. The transitions between regions of differing magnetisation orientation are known as domain walls (DWs). While different forms of DWs exist, here they form a 'vortex' pattern with a cyclical magnetisation texture. Our synapse was a 400 nm wide, 54 nm thick permalloy nanowire with notches patterned halfway along its length to create an artificial defect site. Figure 1.a shows an SEM image of the system, with the inset enlarging the notch. DWs were nucleated at the left-hand side of the wire (false-coloured blue) by applying a voltage pulse across a gold current line (false-coloured orange).
The operation of this system as a stochastic synapse is described schematically in figure 1.b. A vortex DW [37] can be injected into the wire by applying a current pulse in the line. This corresponds to presenting the synapse with an input of 1, while no DW injection corresponds to an input of 0. An applied magnetic field is used to propagate the DW along the length of the wire. If the propagation field is sufficiently high, the DW does not pin at the defect site and can pass to the end of the wire, resulting in an output of 1. If the propagation field is low, the DW is pinned at the notch, resulting in an output of 0. For intermediate values of the field, the behaviour becomes stochastic but with a well-defined pinning probability. We can therefore treat the field as the control for the weight of a binary synapse, with the detection of a DW on the right-hand side of the nanowire as the output of the synapse.
As the propagation field is tuned, the probability of the DW passing changes. Figure 1.c shows this passing probability, as measured using the focused Magneto-Optical Kerr effect (FMOKE), as a function of the propagation field. The probability of passing behaves in a sigmoid-like manner, and the orange dashed line shows a fit using a logistic sigmoid function $f(h_{ij})$ (see methods).
Therefore, a binary stochastic synapse is determined by

$$ w_{ij} = \begin{cases} 1 & \text{with probability } f(h_{ij}), \\ 0 & \text{otherwise}, \end{cases} \qquad (1) $$

where $f(h_{ij})$ is the DW passing probability function and $h_{ij}$ is the propagation field for the synapse connecting input neuron $j$ with output neuron $i$. Through this definition our synapses are purely excitatory, which corresponds to the physical representation of a magnetic DW being pinned or not, rather than the complementary binary scheme with values $\{-1, 1\}$, which is not naturally represented by the physical system. Compared to binary synapses, neural networks with analogue or graded synapses tend to perform better due to the wider range of states [38,39]. Here, we adopt a scheme similar to that of stochastic computing, where the average of a series of binary measurements or samples is used to represent a value. Thus, we allow for $K \geq 1$ measurements to identify the state of a synapse and denote the equivalent mean weight as

$$ \bar{w}_{ij} = \frac{1}{K} \sum_{k=1}^{K} w_{ij}^{(k)}, \qquad (2) $$

where $K$ is the total number of samples taken and the superscript $(k)$ indicates the individual sampling of the synaptic weights as per eq. 1. The mean synapse has $1 + K$ states, e.g. for $K = 1$ the two states will be 0 and 1, while for $K = 2$ the states will be 0, 0.5, and 1. It follows that for $K \to \infty$, $\bar{w}_{ij}$ will be equivalent to a sigmoidally-shaped continuous synapse, bounded between 0 and 1. An example demonstrating the average weight as a function of the number of samples can be seen in figure 1.d, where we plot eq. 2 for $K = 1$ (purple squares), 4 (blue diamonds) and 128 (green circles). Each example is calculated by sampling $w_{ij}$ the desired number of times with a fixed $h_{ij}$ that was selected randomly. In each case only discrete levels are available, but when $K = 128$ the sampling is sufficient to provide an almost continuous representation. In this way, our proposed binary stochastic synapse can be used to construct neural networks that will approach a bounded analogue network when multiple samples are taken. Physically, this is achieved by repeated operation of the hardware devices to accumulate the average values.

Figure 1. a, SEM image of the nanowire device. The inset shows detail of the artificial notch. The field (green) and current (white) axes are marked. b, Schematic of the operating principle of the stochastic synapse. The current line allows input ($x_j$) of 1 (current pulse, DW injected) or 0 (no pulse, no DW). A field in line with the NW drives the DW (if present) through the system: high fields pass the DW through the notch and produce an output of 1, low fields result in the notch blocking the DW and an output of 0. Intermediary fields (not shown) provide intermediate probabilities of passing the notch. c, Experimentally measured probability of an injected domain wall passing the notch. Tuning the propagation field can control this probability across the whole range in a logistic sigmoid-like fashion. Points are averages of 1000 samples, x error bars represent precision in choice of propagation field, and y error bars are given by $p(1 - p)/\sqrt{1000}$. The logistic sigmoid fit is given in methods. The nucleation field with no input (no injection, $x_j = 0$) is (10.74 ± 0.07) mT. Therefore, below 10 mT the passing probability for no input is zero. d, Average synaptic weight, as defined in eq. 2. Depending on the number of samples, i.e. repetitions of the operation in b, the effective synaptic weight varies from purely binary (one repetition/sample, $K = 1$) to almost continuous ($K = 128$ samples).
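The sampling scheme of eqs. 1 and 2 can be illustrated with a short simulation. The following is a minimal sketch, not the authors' code: the logistic parameters `h0` and `width` and the field value are hypothetical placeholders (the fitted values are given in the methods and are not reproduced here), and a pseudo-random number generator stands in for the physical device.

```python
import math
import random


def passing_probability(h, h0=8.0, width=0.25):
    """Logistic sigmoid model of the DW passing probability f(h).

    h0 (field of 50 % passing, in mT) and width are illustrative
    placeholders, NOT the fitted values from the methods.
    """
    return 1.0 / (1.0 + math.exp(-(h - h0) / width))


def sample_synapse(h, rng):
    """One binary measurement: 1 with probability f(h), otherwise 0 (eq. 1)."""
    return 1 if rng.random() < passing_probability(h) else 0


def mean_weight(h, K, rng):
    """Average of K binary measurements of the same synapse (eq. 2)."""
    return sum(sample_synapse(h, rng) for _ in range(K)) / K


rng = random.Random(0)
h = 8.2  # hypothetical propagation field in mT
for K in (1, 4, 128):
    # K = 1 gives 0 or 1; larger K gives one of the K + 1 levels k / K
    print(K, mean_weight(h, K, rng))
```

For large K the printed value settles near `passing_probability(h)`, which is the bounded, continuous limit described above.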

Stochastic network
We embed these synapses in an artificial neural network where the output of neuron $i$ is given by

$$ y_i = \sum_j \bar{w}_{ij} x_j, \qquad (3) $$

where $j$ is an index over the input dimension.
We trained the network as a classifier for a problem of $C$ classes with $C$ independent neurons (perceptrons), where each neuron represented one class. This task was based on the well-known MNIST dataset but with each image downsampled to 14 by 14 pixels instead of the standard 28 by 28. This was necessary to reduce the time of the operation when running on the prototype experimental hardware (see methods). In figure 2.a we depict the perceptron that corresponds to class "0". If we present to the neuron a representative of its corresponding class (in this case an image of the digit "0"), the neuron should produce a high activity for recognising the input as zero.

Figure 2. a, The perceptron corresponding to class "0". Here, the weighted inputs are summed to give the neuron's activity (as in eq. 3). In the case of the MNIST task, there is a neuron for each of the 10 classes (numbers "0" to "9"; $y_0$ to $y_9$). When trained, the neuron for the class corresponding to the correct input, here $y_0$, should have the highest activity. Each mean weight in our network ($\bar{w}_{ij}$) is the average of multiple measurements ($K \geq 1$) of the output of a synapse with individual weight $w_{ij}$ set by its trained propagation field (see eqs. 1 & 2). A clear distinction from traditional neural networks is that these weights are stochastic and will vary for each run of the network. The individual weights take the value "1" with the probability $f(h_{ij})$ (DW passing probability, as characterised in figure 1.c) or "0" otherwise. The mean weights, therefore, take values from the distributions shown in figure 1.d. b, The architecture of the hardware network. For the purpose of demonstrating successful performance, only the stochastic synapses are run directly on the hardware. The perceptrons are stored on a computer, which requests results ("1" or "0") from the magnetic stochastic synapses for a given synaptic parameter (trained propagation field, $h_{ij}$). After this is repeated for each synapse, summations are performed to predict the correct class of the input. c, Idealised operation of single synapses in materia for the neuron $y_0$. The data path is shown for two inputs, or pixels, for the case of a correct image for the class ("0") and an incorrect image ("5"). The value of the weight control, the propagation field $h_{ij}$, is expected to be correlated with pixels in images from the correct class: where the pixels are "on" for correct images, high values of the weight control are expected; when "off", low values. If the input pixel value is "0", the synapse is bypassed as the result is "0" by construction. However, if it is "1", a result is requested from the hardware using the corresponding weight control. As shown in the top graph, high propagation fields result in the DW directly passing the notch (only a single step is seen), which is interpreted as an output of "1". As in the lower graph, low propagation fields result in a two-step procedure where the DW initially pins at the notch before depinning at a higher driving field. This is interpreted as a "0". In practice, the results from the synapses will vary stochastically, reflecting the passing probability $f(h_{ij})$.
The experimental process is shown in figure 2.b. For ease of demonstration, only a single hardware synapse is used, with operations serialised in time. Potential devices would have multiple synapses running in parallel with a summation performed during the measurement. The perceptron parameters are stored on a computer, which sends the input and synaptic parameter to the external hardware synapse and requests the result. The process is repeated until $K$ samples per synapse (see eq. 2) are collected. Summation of the results takes place on the computer with an additional bias term applied. To avoid redundant measurements, pixels corresponding to inputs of "0" (white pixels in our example image) were omitted, since the output is deterministically "0" by design. A synapse receiving a black pixel ($x_j = 1$) will produce "1" if the field is set at a high value or "0" if the field is set at a low value, see figure 2.c. Intermediate field values will produce outputs that vary stochastically, reflecting the passing probability $f(h_{ij})$.
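The serialised measurement loop described above can be sketched as follows. This is an illustrative simulation under stated assumptions, not the control software used in the experiment: the function names, the toy 4-pixel inputs, the field values and the logistic parameters are all hypothetical, and a random number generator replaces the request to the hardware synapse.

```python
import math
import random


def f(h, h0=8.0, width=0.25):
    """Placeholder logistic passing probability (parameters are assumptions)."""
    return 1.0 / (1.0 + math.exp(-(h - h0) / width))


def neuron_output(x, h_row, bias, K, rng):
    """Sum of mean synaptic outputs for one perceptron, plus a bias term."""
    total = 0.0
    for x_j, h_ij in zip(x, h_row):
        if x_j == 0:
            continue  # output is deterministically 0, so no measurement is requested
        # K repeated binary measurements of the same synapse (simulated here)
        hits = sum(rng.random() < f(h_ij) for _ in range(K))
        total += x_j * hits / K
    return total + bias


def classify(x, fields, biases, K, rng):
    """Return the index of the most active neuron and all neuron outputs."""
    outputs = [neuron_output(x, h_row, b, K, rng)
               for h_row, b in zip(fields, biases)]
    best = max(range(len(outputs)), key=outputs.__getitem__)
    return best, outputs


# Toy example: 2 classes, 4 binary "pixels"; high fields where the class expects "on" pixels
fields = [[10.0, 10.0, 6.0, 6.0],
          [6.0, 6.0, 10.0, 10.0]]
biases = [0.0, 0.0]
print(classify([1, 1, 0, 0], fields, biases, K=8, rng=random.Random(0)))
```

In a parallel device the inner loop over synapses would run simultaneously; only the K repeats would remain sequential.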

Analysis of the stochastic learning rule
We now sketch the derivation of the learning rule that we apply to the synapses of the neural network. Each synapse $w_{ij}^{(k)}$ is an independent sample from a Bernoulli distribution, and therefore the sum of these samples will follow a Poisson-Binomial distribution. The mean, $\mu_i$, and variance, $\sigma_i^2$, for each output neuron (calculated by eq. 3) are given by:

$$ \mu_i = \sum_j f(h_{ij}) \, x_j, \qquad (4) $$

$$ \sigma_i^2 = \frac{1}{K} \sum_j f(h_{ij}) \left(1 - f(h_{ij})\right) x_j^2. \qquad (5) $$

For a detailed calculation of these values see the supplementary material.
Since the number of inputs and the sampling process means this sum will be over a large number of events, the Poisson-Binomial distribution can be approximated as a Gaussian [40]. Using this approximation, the neuronal output can be re-parameterised so that the stochasticity is only in a term with no dependence on the trainable parameters. In this way, we write

$$ \tilde{y}_i = \mu_i + \sigma_i \xi_i, \qquad (6) $$

where $\tilde{y}_i$ denotes the approximation of neuronal output $y_i$ and $\xi_i$ is a sample from a Gaussian distribution with zero mean and unit variance.
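A quick numerical check can make the moment calculation concrete. The sketch below assumes the mean and variance of the neuronal output take the standard form for a sum of independent Bernoulli samples, $\mu = \sum_j f_j x_j$ and $\sigma^2 = \frac{1}{K}\sum_j f_j (1 - f_j) x_j^2$ (the detailed calculation is in the supplementary material, so this form is an assumption here), and compares them with empirical values from simulated sampling. The passing probabilities and input pattern are randomly generated placeholders, not data from the paper.

```python
import random
import statistics


def analytic_moments(probs, x, K):
    """Mean and variance of y = sum_j x_j * (1/K) * sum_k w_j^(k)."""
    mu = sum(p * xj for p, xj in zip(probs, x))
    var = sum(p * (1 - p) * xj * xj for p, xj in zip(probs, x)) / K
    return mu, var


def sample_output(probs, x, K, rng):
    """One stochastic evaluation of the neuron with K samples per synapse."""
    y = 0.0
    for p, xj in zip(probs, x):
        if xj == 0:
            continue  # inactive inputs contribute nothing
        hits = sum(rng.random() < p for _ in range(K))
        y += xj * hits / K
    return y


rng = random.Random(1)
probs = [rng.random() for _ in range(50)]    # hypothetical values of f(h_ij)
x = [rng.choice([0, 1]) for _ in range(50)]  # hypothetical binary input pattern
K = 4

samples = [sample_output(probs, x, K, rng) for _ in range(5000)]
mu, var = analytic_moments(probs, x, K)
print("mean:     analytic", mu, "empirical", statistics.mean(samples))
print("variance: analytic", var, "empirical", statistics.variance(samples))
```

The two columns should agree to within sampling error, which is the property the Gaussian re-parameterisation relies on.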
If we assume that we are in a supervised learning framework and that $E$ is the error function we would like to minimise (e.g. mean squared error or cross-entropy), then $E$ is a function of the pattern $p$ we present to the network, which defines the desirable output target. $E$ is also a function of the output neurons, represented by vector $\mathbf{y}$, which also depends on $p$. The learning rule will update the values of the applied field to each synapse $h_{ij}$ by $\Delta h_{ij}$ according to the following "online" gradient rule:

$$ \Delta h_{ij} = -\eta \frac{\partial E}{\partial h_{ij}} = -\eta \frac{\partial E}{\partial \tilde{y}_i} \frac{\partial \tilde{y}_i}{\partial h_{ij}}, \qquad (7) $$

where $\eta$ is a small positive number representing the learning rate. We calculate the derivative $\partial \tilde{y}_i / \partial h_{ij}$ from eqs. 4, 5 and 6. We also calculate the value $\xi_i$ using eq. 6, computing $\mu_i$ and $\sigma_i$ from eqs. 4 and 5 and $\tilde{y}_i$ from eq. 3 (setting $\tilde{y}_i = y_i$). It follows that for $K \to \infty$, $\sigma \to 0$ and we obtain a "mean-field" gradient rule that takes into account the mean but not the variance of the output neurons.
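As a worked step, the derivative used in the gradient rule can be expanded by the chain rule. This expansion is a sketch under two assumptions: that the mean and variance take the forms $\mu_i = \sum_j f(h_{ij}) x_j$ and $\sigma_i^2 = \frac{1}{K}\sum_j f(h_{ij})(1 - f(h_{ij})) x_j^2$, and that the re-parameterised output is $\tilde{y}_i = \mu_i + \sigma_i \xi_i$ with $\xi_i$ held fixed. Here $f'$ denotes the derivative of the passing probability with respect to the field.

$$ \frac{\partial \mu_i}{\partial h_{ij}} = f'(h_{ij}) \, x_j, \qquad \frac{\partial \sigma_i}{\partial h_{ij}} = \frac{1}{2 \sigma_i} \frac{\partial \sigma_i^2}{\partial h_{ij}} = \frac{f'(h_{ij}) \left(1 - 2 f(h_{ij})\right) x_j^2}{2 K \sigma_i} $$

$$ \frac{\partial \tilde{y}_i}{\partial h_{ij}} = f'(h_{ij}) \, x_j \left[ 1 + \xi_i \, \frac{\left(1 - 2 f(h_{ij})\right) x_j}{2 K \sigma_i} \right], \qquad \xi_i = \frac{y_i - \mu_i}{\sigma_i} $$

The second term in the bracket is the variance contribution. Because $\sigma_i$ scales as $K^{-1/2}$, this term decays as $K^{-1/2}$ and vanishes for $K \to \infty$, leaving only the mean-field part $f'(h_{ij}) x_j$.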
We have tested the performance of this rule on the downsampled MNIST dataset. During training, the number of repeats (samples) $K$ is set as a parameter of the network, which we define as $K_{\text{train}}$, and as such modifies how the training progresses. The variance of the output has an important effect on the classification procedure; if the variance is high then misclassification will be more likely, especially in classes that have similar mean values for each neuron. Therefore, during supervised training the network aims to minimise this variance. When $K$ is low, this happens through changing the weights, controlled through the magnetic fields $h_{ij}$, so that the probabilities are close to either 1 or 0 (high or low applied field), as this minimises the single sample variance in eq. 5. This leads to a solution that is almost a deterministic binary network. However, if $K$ is large then the variance is reduced by the factor $1/K$ and therefore the system can tolerate higher synaptic variance than in the case of $K = 1$. Thus, a pseudo-analogue solution can be found.
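To make the role of $K$ in the update concrete, the sketch below evaluates the gradient of the re-parameterised output for a single neuron and applies one online step. It is an illustration under assumptions rather than the training code used for the results: it assumes a logistic $f$ with placeholder parameters `H0` and `WIDTH`, the Bernoulli-sum forms of the mean and variance, and a squared-error loss $E = \frac{1}{2}(y - t)^2$; all function names are hypothetical.

```python
import math

H0, WIDTH = 8.0, 0.25  # placeholder sigmoid parameters in mT (assumed, not fitted)


def f(h):
    """Assumed logistic passing probability."""
    return 1.0 / (1.0 + math.exp(-(h - H0) / WIDTH))


def df(h):
    """Derivative of the logistic f with respect to the field."""
    p = f(h)
    return p * (1.0 - p) / WIDTH


def moments(h_row, x, K):
    """Mean and standard deviation of the neuron output for K samples."""
    mu = sum(f(h) * xj for h, xj in zip(h_row, x))
    var = sum(f(h) * (1.0 - f(h)) * xj * xj for h, xj in zip(h_row, x)) / K
    return mu, max(math.sqrt(var), 1e-12)  # guard against division by zero


def output_gradient(h_row, x, xi, K):
    """d(mu + sigma * xi) / dh_j for every synapse j, with xi held fixed."""
    _, sigma = moments(h_row, x, K)
    grads = []
    for h, xj in zip(h_row, x):
        mean_term = df(h) * xj
        var_term = xi * df(h) * (1.0 - 2.0 * f(h)) * xj * xj / (2.0 * K * sigma)
        grads.append(mean_term + var_term)
    return grads


def update_fields(h_row, x, y_obs, target, K, eta=0.01):
    """One online gradient step for E = 0.5 * (y - target)**2."""
    mu, sigma = moments(h_row, x, K)
    xi = (y_obs - mu) / sigma  # noise sample implied by the observed output
    grads = output_gradient(h_row, x, xi, K)
    return [h - eta * (y_obs - target) * g for h, g in zip(h_row, grads)]


h_row, x = [7.9, 8.1, 8.3], [1, 1, 0]
for K in (1, 128):
    # the variance term shrinks as K grows, approaching the mean-field gradient
    print(K, output_gradient(h_row, x, xi=1.0, K=K))
```

For small K the variance term is a sizeable part of the gradient, which is what pushes the passing probabilities away from 0.5; for large K the gradient approaches `df(h) * x_j`.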
Figure 3 describes the effect of the learning rule on the network synapses. We plot the distribution of the propagation fields, $h_{ij}$, over all the neurons from 5 independent models before training (figure 3 […] In figure 3.e-f we show the distributions of the neuronal output when presented with the same image repeatedly for the three training cases above and find that the neuronal distribution reflects the synaptic distribution. We now consider the case where during testing a different number of samples are drawn when calculating eq. 2, […] As we will show, this leads to higher performance when test sampling ($K_{\text{test}}$) is small, but a capped performance when test sampling is allowed to rise, in contrast to the large $K_{\text{train}}$ case. Figure 3.g compares the average variance during testing with $K_{\text{test}} = 1$ samples (circles) and $K_{\text{test}} = K_{\text{train}}$ samples (squares) as a function of the number of samples used during training, $K_{\text{train}}$. As discussed before, when training with 1 sample the variance is kept low by having passing probabilities close to 0 or 1. However, when more samples are used during training, the variances for a single sample can increase as the variance of the averaged samples decreases.
This behaviour of minimising the variance to reduce misclassification arises due to the variance term in the "stochastic" learning rule. Other rules that only consider the mean term [14,19] cannot find these deterministic solutions when using a stochastic network. Figure 3.h shows the test accuracy with $K_{\text{test}} = K_{\text{train}}$ as a function of $K_{\text{train}}$ samples for our stochastic learning rule (squares) vs the mean-field rule (circles), averaged over five independently trained models. For both rules, increasing the number of samples leads to an improvement in the test accuracy as more levels are possible for the synapse averages (see fig. 1.d). However, in all cases, the stochastic learning rule outperforms the mean-field rule, with convergence when a large number of samples ($K \geq 8$) is used for training and testing. The dashed line shows the performance for a fully mean-field network, where effectively an infinite number of samples are taken (i.e. continuous but bounded synapses), and represents the best possible accuracy for such a network given the task.
Hardware and operational principles.
We now proceed to demonstrate our neural networks working on physical hardware and not only within simulation. Figures 4.a and b show the test accuracy computed when the synaptic operation has been simulated (lines) and processed using the hardware (points) for models trained with either a, $K_{\text{train}} = 1$ or b, $K_{\text{train}} = 128$ and tested with increasing numbers of samples ($K_{\text{test}}$). Due to the throughput of our prototype device, we only demonstrate experimental results up to $K_{\text{test}} = 8$.
The simulation and hardware results show excellent agreement and highlight different behaviours in models trained with different sampling levels. In the case trained on one repeat, the network is deterministic and as such the accuracy does not significantly improve when we average over more samples during testing. On the other hand, while the model trained with 128 repeats shows a lower performance with only one testing repeat, the accuracy improves as we increase the number of samples during testing. This arises from the increased stochasticity at low sampling levels and the resultant increased precision at high sampling levels. This behaviour is corroborated by the corresponding neuronal distributions (figures 3.e and 3.f), which show that the neuronal variance when training with one sample and testing with one sample is much lower than in the case of training with $K = 128$ samples and testing with one sample. It is akin to majority voting, where classifiers have to be diverse to improve performance (see [41] and references therein). Here, performance increases with increasing $K_{\text{test}}$ (number of voters) when the neuronal distribution has a high variance.
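The dependence on $K_{\text{test}}$ can be illustrated with a toy calculation under the Gaussian approximation. The sketch below is not taken from the paper: it considers only two competing neurons, assumes their outputs are independent Gaussians whose variance scales as $1/K_{\text{test}}$, and uses hypothetical values for the mean gap and the single-sample variance. It estimates the probability that the correct neuron has the higher output.

```python
import math


def prob_correct(mu_gap, var_single, K_test):
    """P(correct neuron output > rival output) for Gaussian outputs.

    mu_gap:     mean output of the correct neuron minus that of the rival.
    var_single: sum of the two neurons' single-sample (K = 1) variances.
    Both inputs are illustrative quantities, not measured values.
    """
    if var_single == 0:
        return 1.0 if mu_gap > 0 else 0.0
    z = mu_gap / math.sqrt(var_single / K_test)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF


# Hypothetical "broad" synapses: large single-sample variance
for K_test in (1, 8, 128):
    print("broad", K_test, round(prob_correct(1.0, 8.0, K_test), 4))

# Hypothetical near-deterministic synapses: small single-sample variance
for K_test in (1, 8, 128):
    print("near-binary", K_test, round(prob_correct(1.0, 0.05, K_test), 4))
```

In this toy setting the high-variance case gains substantially from extra test samples, while the low-variance case is already saturated at a single sample, mirroring the qualitative voting argument above. It does not capture the resolution limit of near-binary weights, which is what caps their accuracy in the full network.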
Figure 4.c allows further interrogation of the majority voting behaviour. It presents (using the now verified simulation model) a colour plot of the test accuracy as a function of the number of training and test samples. This variation in performance when testing using a different number of repeats raises an essential trade-off in speed vs accuracy. To a first approximation, the results follow the behaviour of stochastic computing: fast approximation with increasing accuracy over time if required. This trend is matched on average with the extra repeats, implying an extra time and energy cost to accumulate the samples, but providing a boost in accuracy. However, by utilising our learning rule's ability to enable low-sampling deterministic solutions, we can outperform the naive stochastic computing reasoning in the low sampling limit (as also seen in figure 3.h). If fixing $K_{\text{train}}$, this leads to a competition between low repeat performance and ultimate high repeat accuracy, i.e. if a model is trained on a high number of repeats, but uses a low number of repeats during testing (inference) time, then the accuracy will be sub-optimal. Similarly, if the model is trained on a low number of repeats, but tested on many, the ultimate accuracy suffers. One possibility is to always tie $K_{\text{test}} = K_{\text{train}}$, but this requires multiple trained weights. It is, therefore, constructive to utilise data such as figure 4.c as a guide on training and testing the synapse depending on the desired accuracies and operational times. Whilst maintaining the simplicity of a single set of trained weights (fixing $K_{\text{train}}$), a horizontal range of testing values can be chosen to achieve the desired accuracies and energy cost envelope.
Analysing the performance over the space of training vs test repeats for the MNIST task, we find that in most cases, testing with a similar number of repeats to the training performs well. A significant outlier to this was that training with two samples consistently outperformed training with one across all levels of testing, including testing with one sample. We attribute this to the smaller step sizes in the parameter space with two samples compared to one, which allows for a better solution while the variance is still very low and remains small when testing with one sample.

DISCUSSION
Neuromorphic devices are a promising route to developing low-energy-cost machine learning systems, seeking to overcome one of the chief drawbacks of traditional neural networks. Stochastic, binary neural networks have shown promise in this regard due to their reduced energy cost and simple implementation [9-13]. Multiple sampling of these networks allows their performance to rival analogue networks [14-19]. Outstanding problems, however, have been providing training rules to achieve high performance even at low sampling rates (where calculations can be performed faster and at less energy cost) and identifying hardware implementations that can natively provide the stochasticity required. We have developed a learning methodology for stochastic binary neural networks that we verify experimentally, using the behaviour of magnetic domain walls in nanowires as stochastic synapses. Stochasticity has traditionally been considered a limiting factor in nanomagnetic logic devices [32,33], but here is a functional aspect that drives learning. We have shown performance of the hardware network comparable to a standard neural network and demonstrated high performance at low sampling thanks to the novel learning rule.
Experimentally, we have observed that a DW injected into a nanowire with an artificial pinning site can be stochastically pinned and tuned by using an applied magnetic field. We have then demonstrated that this tunable stochastic pinning can create synapses for a neural network device. Due to the nature of the physical system, these synapses behave as binary stochastic synapses. Our fundamental ingredient for training such a network is a learning rule that considers the variance of the stochastic output of the network. This training method considers taking multiple samples ($K_{\text{train}}$/$K_{\text{test}}$) of the network output to compute a sample average and deviation. A low number of samples leads toward a predominantly deterministic binary solution and is fast to compute but has lower performance than a high number of samples, which approximates a standard "analogue" network and requires more time (and energy). This trade-off allows flexibility in designing the network based on the required performance or operating speed.
Key is that the learning rule developed here has allowed us to find a range of operating regimes because the stochastic part of the output is considered. Other binary stochastic computing approaches, such as Hirtzlin et al. [19], train using the expectation of the network (which we call mean-field and is equivalent to $K \to \infty$), which leads to a reduced accuracy when fewer samples are used during inference (testing). The Gaussian approximation was also used by Esser et al. [42] to train a network with binary stochastic synapses on the IBM TrueNorth neurosynaptic system, but the contribution from the variance term is considered to be negligible. The contribution from the variance term in our rule allows for weights to be trained that operate better in the low sampling regime compared to the mean-field versions.
Other learning methods where the variance is taken into account stem from the Likelihood-Ratio framework [43-45], which is related to policy gradient methods in reinforcement learning [46]. While these methods consider the stochasticity of the neurons and synapse, they depend heavily on the choice of baseline values for the loss, which require complex approximation methods. Additionally, the reparameterisation method applied here allows for a direct feedback of the error signal to the synaptic field parameters and fits within existing backpropagation-based learning methodologies.
Overall, the stochastic learning rule presented in this paper has shown tunability in both high and low sampling regimes and can be implemented simply within backpropagation-style codes. The ability, due to consideration of the variance of the output, to tune between low-sampling deterministic binary and high-sampling stochastic "analogue-like" behaviour lends itself to the flexibility of our system between operational speed/energy cost and test accuracy.
The magnetic DW synapse that we have demonstrated here is a proof-of-principle component and as such it is important to look towards changes that would be necessary for a more "production ready" neuromorphic device. Optimised devices would likely look towards spin-torque driven domain wall motion [31] alongside the use of local nanomagnetic elements to encode the weights. It is also possible to envisage our learning methodology applied to networks built of alternative magnetic elements with similar stochastic properties, such as magnetic tunnel junctions, amongst others [12,14,47-49]. Elsewhere, DW devices have been used as neurons [50] or activation functions [51], and magnetic elements in general have been demonstrated in a range of alternative low energy computation schemes [16,47,52-54] that exploit the stochasticity of magnetic devices. Our fundamental element, the magnetic stochastic synapse, could fit within such paradigms where efficient production of random bits is key. It is important, however, to state that the key result here is demonstrating performance as run on experimental hardware, enabled by our stochastic learning rule. Further optimisation is a matter of future research and engineering development.
Whilst the single layer network demonstrated here can only solve linearly separable problems, it can be extended in a number of ways. Retaining the single layer simplicity and looking towards an all-magnetic architecture, it has potential applications in the field of reservoir computing (RC). In reservoir computing, a fixed reservoir performs a non-linear spatial-temporal transformation of an input sequence such that the output representation is linearly separable. The advantage of RC is that the reservoir transform can be offloaded to a physical system with appropriate properties, and there has been considerable recent interest in developing magnetic (spintronics) based physical reservoir computing [7,55-62]. There is potential to connect our magnetic DW based neural network to these reservoirs to create a complete hardware reservoir computing system. There is also the more traditional route of scaling our current approach towards multi-layer networks, as the learning rule is compatible with backpropagation. An open research question in this avenue is whether the sampling procedure should apply at a local or global scale of the network. One approach is implementation of multi-layers using nanowire interconnects and logic gates, but if we look away from the limitation of all-magnetic architectures, it is also possible to envisage hybrid magnetic-CMOS application specific integrated circuits (as in Ref. [63]) that might provide a route to larger scale network hardware. However, details of these implementations are beyond the scope of this current work.
In conclusion, we have developed a training methodology for binary stochastic synapses that considers the network's stochasticity during learning, and in which resampling of the stochastic output allows for a trade-off between device run time and desired accuracy. This approach has been demonstrated on a proof-of-concept magnetic domain wall-based stochastic synapse with excellent agreement between hardware and model during inference.

Device Fabrication
The devices were fabricated using two-stage electron beam lithography with the CSAR-62 resist. Nanowires were deposited in the first stage using thermal evaporation of permalloy (Ni$_{81}$Fe$_{19}$) to a thickness of 54 nm (base pressure, 7 × 10$^{-7}$ mbar; process pressure, ∼5 × 10$^{-5}$ mbar; rate, 0.5 Å s$^{-1}$). Current lines and connection pads were deposited in the second stage as Ti/Au (nominally 10 nm/200 nm, via thermal evaporation). Samples were electrically connected to PCB devices using silver DAG.

Device operation
The device operation proceeds as in figure 2. An AVTECH pulse generator was used to apply 30 V, 100 ns pulses along the current line (resistance 290 Ω). An electromagnet was used to apply fields along the wire lengths. A National Instruments DAQ card was used to control the timing between these two, with pulses being triggered at particular times during repeated sinusoidal field sequences. The field at which the pulse is triggered is the propagation field. On-the-fly calibration of the timing enabled correction of any drift between the trigger and field sequence (due to heating) to within 0.1 mT.
A focused-MOKE magnetometer (spot size ∼5 µm) was used to measure the NW response. Hysteresis loops were obtained with the laser spot positioned over the notch. A single step in the hysteresis loop indicates the domain wall passing the notch (an output of 1). A double step indicates a two-stage pinning/depinning process (an output of 0). An algorithmic method allowed automated evaluation of each hysteresis loop. The number of peaks in the differentiated Kerr signal was counted; if two peaks were present then the DW had been pinned. To eliminate false positives, the steps in the raw signal corresponding to the peaks were required to be greater than 24% of the total signal change. This threshold was optimised experimentally to allow for peak detection even with a slightly off-centre laser spot (unequal step sizes), while minimising erroneous detections arising from noise.
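The loop-evaluation procedure described above can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names, the `noise_floor` parameter, the grouping of consecutive large differences into one step, and the choice to treat anything other than a single step as an output of 0 are all assumptions made for illustration. Only the 24% step-height criterion comes from the text.

```python
def count_steps(kerr, min_fraction=0.24, noise_floor=0.02):
    """Count switching steps in one branch of a Kerr hysteresis loop.

    kerr         : list of Kerr signal values along a field sweep
    min_fraction : a step must exceed this fraction of the total signal
                   change to be counted (24% in the text)
    noise_floor  : hypothetical cut-off (fraction of total change) below
                   which a point-to-point difference is treated as flat
    """
    total = abs(kerr[-1] - kerr[0])
    if total == 0:
        return 0
    # Differentiated signal: point-to-point changes along the sweep.
    diffs = [abs(b - a) for a, b in zip(kerr, kerr[1:])]
    steps, run = 0, 0.0
    for d in diffs + [0.0]:          # trailing 0.0 closes a final run
        if d > noise_floor * total:
            run += d                 # still inside a peak of the derivative
        else:
            if run > min_fraction * total:
                steps += 1           # raw-signal step is large enough
            run = 0.0
    return steps


def synapse_output(kerr):
    """Return 1 for a single step (DW passed), 0 otherwise (assumed rule)."""
    return 1 if count_steps(kerr) == 1 else 0


passed = [0.0] * 10 + [1.0] * 10                 # one clean step
pinned = [0.0] * 5 + [0.3] * 5 + [1.0] * 5       # two unequal steps
print(synapse_output(passed), synapse_output(pinned))
```

A small spurious jump, for example 10% of the total change, falls below the 24% criterion and is not counted as a pinning event in this sketch.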

Domain passing probability
The probability of a domain wall not being pinned by the artificial defect site was observed to have a sigmoid-like behaviour. A functional form of this probability was used to simulate magnetic stochastic synapses for computational training of the networks. We fitted this probability using

$$f(h) = d + \frac{1 - d}{1 + \exp\left[-\Delta\,(h - h_0)\right]},$$

where $d = 0.0219$ is a finite passing probability at low field, $h_0 = 4.63$ mT is the field centre and $\Delta = 2.73$ mT$^{-1}$ is the sigmoid width. We note that this exact form of the fitting function is not necessary for the stochastic learning rule used to train the network.
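As an illustration of how such a fit can be used to simulate a synapse, the sketch below assumes an offset logistic form, $f(h) = d + (1 - d)/(1 + e^{-\Delta(h - h_0)})$, which is consistent with the quoted parameters (a floor of $d$ at low field, a centre at $h_0$, and $\Delta$ in mT$^{-1}$). The function names and the use of Python's `random` module are illustrative choices, not the authors' simulation code.

```python
import math
import random

# Fit values quoted in the text (fields in mT, DELTA in 1/mT).
D, H0, DELTA = 0.0219, 4.63, 2.73


def passing_probability(h):
    """Assumed offset-logistic passing probability f(h) for a field h in mT."""
    return D + (1.0 - D) / (1.0 + math.exp(-DELTA * (h - H0)))


def sample_synapse(h, x, rng):
    """Simulate one synaptic operation.

    x is the binary input (1 = DW injected, 0 = no DW); with no DW the
    output is 0 by construction. Otherwise the DW passes the notch
    (output 1) with probability f(h).
    """
    if x == 0:
        return 0
    return 1 if rng.random() < passing_probability(h) else 0


rng = random.Random(0)
for field in (2.0, 4.63, 8.0):
    samples = [sample_synapse(field, 1, rng) for _ in range(1000)]
    print(field, passing_probability(field), sum(samples) / len(samples))
```

Averaging many samples at a fixed field should approach $f(h)$, mirroring the way the experimental probabilities are estimated from repeated measurements.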

Stochastic learning rule
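For a network composed of binary stochastic synapses, the value of each neuron can be approximated by a Gaussian given by equation (6), where the mean and variance of each neuron are defined by equations (4) and (5) respectively. Using this approximation, a gradient-based learning rule can be derived, as the random variable no longer has a dependence on the model parameters. The parameters of this network are the magnetic fields, which determine the passing probability of each synapse, so a gradient descent update is given by

$$\Delta h_{ij} = -\eta \frac{\partial E(\tilde{y})}{\partial \tilde{y}_i}\left(\frac{\partial \mu_i}{\partial h_{ij}} + \xi_i \frac{\partial \sigma_i}{\partial h_{ij}}\right).$$

The gradients of the mean and variance with respect to the magnetic fields are

$$\frac{\partial \mu_i}{\partial h_{ij}} = f'(h_{ij})\,x_j, \qquad \frac{\partial \sigma_i^2}{\partial h_{ij}} = \frac{1}{K} f'(h_{ij})\,x_j\left[1 - 2f(h_{ij})\,x_j\right],$$

where $f'(h) = \partial f(h)/\partial h$ is the derivative of the passing probability function. Combining these results with the gradient descent update in equation (10), and using $\partial \sigma_i/\partial h_{ij} = (2\sigma_i)^{-1}\,\partial \sigma_i^2/\partial h_{ij}$, gives the update rule

$$\Delta h_{ij} = -\eta \frac{\partial E(\tilde{y})}{\partial \tilde{y}_i} f'(h_{ij})\,x_j\left[1 + \frac{\xi_i\left(1 - 2f(h_{ij})\,x_j\right)}{2K\sigma_i}\right].$$

In this form the rule contains the mean-field component multiplied by a factor that depends on the variance. While for the derivation of the rule we have specified that $\xi_i$ is a Gaussian random variable with zero mean and unit variance, during training it is calculated exactly from the forward phase using $\xi_i = (y_i - \mu_i)/\sigma_i$, so it is positive if the neuron output is higher than the mean and negative if it is lower. This combines with the sign of $1 - 2f(h_{ij})x_j$ to determine whether the factor increases or reduces the weight update.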
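A minimal sketch of a single-neuron update of this kind is given below. It rests on several assumptions that are not confirmed by this section: the mean is taken as $\mu_i = \sum_j f(h_{ij})x_j$, the variance as $\sigma_i^2 = \frac{1}{K}\sum_j f(h_{ij})x_j\,[1 - f(h_{ij})x_j]$, and the passing probability as an offset logistic built from the fitted values in the text. The function and argument names are hypothetical.

```python
import math

D, H0, DELTA = 0.0219, 4.63, 2.73   # fit values quoted in the text


def passing_probability(h):
    """Assumed offset-logistic passing probability f(h)."""
    return D + (1.0 - D) / (1.0 + math.exp(-DELTA * (h - H0)))


def passing_probability_grad(h):
    """Derivative f'(h) of the assumed passing probability."""
    s = 1.0 / (1.0 + math.exp(-DELTA * (h - H0)))
    return (1.0 - D) * DELTA * s * (1.0 - s)


def field_updates(h, x, y, dE_dy, K, eta):
    """Field updates for the synapses feeding one neuron.

    h, x  : lists of fields h_ij and binary inputs x_j for this neuron
    y     : neuron value observed in the forward pass
    dE_dy : loss gradient with respect to this neuron's value
    K     : number of measurements averaged per synapse
    eta   : learning rate
    """
    p = [passing_probability(hj) * xj for hj, xj in zip(h, x)]
    mu = sum(p)
    sigma = math.sqrt(sum(pj * (1.0 - pj) for pj in p) / K)
    # xi is recovered from the forward pass rather than sampled.
    xi = (y - mu) / sigma if sigma > 0 else 0.0
    updates = []
    for hj, xj, pj in zip(h, x, p):
        # Variance-dependent factor multiplying the mean-field term.
        factor = 1.0
        if sigma > 0:
            factor += xi * (1.0 - 2.0 * pj) / (2.0 * K * sigma)
        updates.append(-eta * dE_dy * passing_probability_grad(hj) * xj * factor)
    return updates


print(field_updates([3.0, 4.63, 6.0], [1, 1, 0], y=2.0, dE_dy=1.0, K=1, eta=0.01))
```

When the observed neuron value equals its mean ($\xi_i = 0$) the factor is 1 and the update reduces to the mean-field term; synapses with $x_j = 0$ receive no update.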

Model training details
As a benchmark we use the MNIST dataset 64 but, to reduce the number of synaptic operations for the experimental hardware, it was downsampled using a max-pool operation with a filter size of 2 × 2. This created a set of 14 × 14 pixel images, which were mapped to a binary input by thresholding the pixel intensity at 0.5.
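The preprocessing can be sketched in a few lines. The sketch assumes pixel intensities already scaled to the range 0 to 1, non-overlapping 2 × 2 pooling windows, and a strict "greater than" comparison at the threshold; the text does not specify these details, and the function names are illustrative.

```python
def max_pool_2x2(image):
    """Downsample a 2D list by taking the maximum of each 2 x 2 block."""
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def binarise(image, threshold=0.5):
    """Map intensities to binary inputs x_j (assumed: 1 if above threshold)."""
    return [[1 if value > threshold else 0 for value in row] for row in image]


example = [
    [0.1, 0.2, 0.0, 0.0],
    [0.3, 0.9, 0.0, 0.4],
    [0.0, 0.0, 0.6, 0.1],
    [0.0, 0.0, 0.2, 0.5],
]
print(binarise(max_pool_2x2(example)))
```

Applied to a 28 × 28 MNIST image, this yields the 14 × 14 binary input (196 pixels) described above.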
The training part of the dataset was randomly split into training and validation subsets of 50,000 and 10,000 images respectively. A real-valued bias was applied to the output of the simulated binary synapses, and these values were converted into probabilities using the softmax function, with the loss against the image labels measured using the cross-entropy loss. Training was performed using mini-batches of 50 images, and iterated until the validation loss had not decreased for 20 epochs. The model with the lowest validation error before the end of training was returned as the trained model. The Adam optimiser was used with a learning rate $\eta = 0.001$ for $K \geq 2$ and $\eta = 0.01$ for $K = 1$, chosen as the values giving the lower validation error.
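The output stage described here (bias, softmax, cross-entropy) can be sketched as follows for a single image. The function names are hypothetical, and the gradient returned is the standard softmax cross-entropy result (probabilities minus the one-hot label), which is the quantity a gradient-based rule would need as $\partial E/\partial \tilde{y}_i$; it is not taken from the authors' code.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of values."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_with_grad(y, bias, label):
    """Loss and its gradient with respect to the neuron values y.

    y     : neuron values (sums of sampled synaptic outputs), one per class
    bias  : real-valued bias added to each neuron value
    label : index of the correct class
    """
    probs = softmax([yi + bi for yi, bi in zip(y, bias)])
    loss = -math.log(probs[label])
    grad = [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]
    return loss, grad


loss, grad = cross_entropy_with_grad([3.0, 1.0, 0.5], [0.0, 0.1, -0.1], label=0)
print(loss, grad)
```

The gradient is negative only for the labelled class, so a descent step raises that neuron's value relative to the others.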

On device machine learning testing
For the demonstration of our stochastic network in materia we used an automated control system to inject a domain wall into the magnetic nanowire at the desired magnetic field given by the synaptic weights. We first optimised the synaptic magnetic fields for our network models in simulation for the cases of $K_{\mathrm{train}} = 1$ and 128, using the method detailed under Model training details. For each $K_{\mathrm{train}}$, we trained 5 models and selected the one with the lowest error on the validation dataset. We then transferred these to the hardware, with the control software loading the binary pixel values ($x_j$) from the test dataset and using the simulation-trained magnetic fields ($h_{ij}$) to control the magnetic synapses. As detailed in figure 2, if the pixel value was 1 the control system would determine whether the domain wall had pinned at or passed the defect site and return a 0 or 1 respectively. The result of this synaptic operation was then passed back to the program running the neural network inference, which computed the neuron values to predict the class of the test data.
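The division of labour between the inference program and the synapse hardware can be sketched as below. The `measure` callback is a hypothetical stand-in for a request to the control system (or to a simulated synapse); it takes a propagation field and returns 0 or 1. The function name `predict`, its arguments, and the optional bias are illustrative assumptions rather than the authors' software interface.

```python
def predict(x, fields, measure, k_test=1, bias=None):
    """Classify one binary image using stochastic synapses.

    x       : list of binary pixel values x_j
    fields  : fields[i][j] is the trained propagation field h_ij
    measure : callable(h) -> 0 or 1, one synaptic measurement at field h
    k_test  : number of measurements averaged per synapse
    bias    : optional per-class bias
    """
    n_classes = len(fields)
    bias = bias or [0.0] * n_classes
    activity = []
    for i in range(n_classes):
        total = 0.0
        for j, xj in enumerate(x):
            if xj == 0:
                continue  # no DW injected: output is 0, so skip the hardware
            hits = sum(measure(fields[i][j]) for _ in range(k_test))
            total += hits / k_test  # mean weight from k_test measurements
        activity.append(total + bias[i])
    # Predicted class is the neuron with the highest activity.
    return max(range(n_classes), key=activity.__getitem__)


# Usage with a deterministic toy "synapse" that passes above 5 mT.
toy_fields = [[10.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
print(predict([1, 1, 0], toy_fields, lambda h: 1 if h > 5.0 else 0, k_test=2))
```

Bypassing synapses whose input is 0 means the number of hardware requests scales with the number of active pixels rather than the full image size.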

FIG. 1. Characterisation of notched permalloy nanowire (NW) as a stochastic synapse. a, False-coloured SEM image of the permalloy NW (blue) and current injection line (orange). The inset shows detail of the artificial notch. The field (green) and current (white) axes are marked. b, Schematic of the operating principle of the stochastic synapse. The current line allows an input ($x_j$) of 1 (current pulse, DW injected) or 0 (no pulse, no DW). A field in line with the NW drives the DW (if present) through the system: high fields pass the DW through the notch and produce an output of 1, low fields result in the notch blocking the DW and an output of 0. Intermediate fields (not shown) provide intermediate probabilities of passing the notch. c, Experimentally measured probability of an injected domain wall passing the notch. Tuning the propagation field can control this probability across the whole range in a logistic sigmoid-like fashion. Points are averages of 1000 samples, x error bars represent the precision in the choice of propagation field, and y error bars are given by $p(1 - p)/\sqrt{1000}$. The logistic sigmoid fit is given in Methods. The nucleation field with no input (no injection, $x_j = 0$) is (10.74 ± 0.07) mT. Therefore, below 10 mT the passing probability for no input is zero. d, Average synaptic weight, as defined in eq. 2. Depending on the number of samples, i.e. repetitions of the operation in b, the effective synaptic weight varies from purely binary (one repetition/sample, $K = 1$) to almost continuous ($K = 128$ samples).

FIG. 2. Stochastic network operation. a, Sketch of the stochastic perceptron. Each input value from an image is fed via a mean weight to the neurons for each class. Here, the weighted inputs are summed to give the neuron's activity (as in eq. 3). In the case of the MNIST task, there is a neuron for each of the 10 classes (numbers "0" to "9"; $y_0$ to $y_9$). When trained, the neuron for the class corresponding to the correct input, here $y_0$, should have the highest activity. Each mean weight in our network ($\bar{w}_{ij}$) is the average of multiple measurements ($K \geq 1$) of the output of a synapse with individual weight $w_{ij}$ set by its trained propagation field (see eqs. 1 & 2). In contrast to traditional neural networks, these weights are stochastic and will vary for each run of the network. The individual weights take the value "1" with probability $f(h_{ij})$ (the DW passing probability, as characterised in figure 1.c) or "0" otherwise. The mean weights, therefore, take values from the distributions shown in figure 1.d. b, The architecture of the hardware network. For the purpose of demonstrating successful performance, only the stochastic synapses are run directly on the hardware. The perceptrons are stored on a computer, which requests results ("1" or "0") from the magnetic stochastic synapses for a given synaptic parameter (trained propagation field, $h_{ij}$). After this is repeated for each synapse, summations are performed to predict the correct class of the input. c, Idealised operation of single synapses in materia for the neuron $y_0$. The data path is shown for two inputs, or pixels, for the case of a correct image for the class ("0") and an incorrect image ("5"). The value of the weight control, the propagation field $h_{ij}$, is expected to be correlated with pixels in images from the correct class: where the pixels are "on" for correct images, high values of the weight control are expected; when "off", low values. If the input pixel value is "0", the synapse is bypassed as the result is "0" by construction. However, if it is "1", a result is requested from the hardware using the corresponding weight control. As shown in the top graph, high propagation fields result in the DW directly passing the notch (only a single step is seen), which is interpreted as an output of "1". As in the lower graph, low propagation fields result in a two-step process where the DW initially pins at the notch before depinning at a higher driving field. This is interpreted as a "0". In practice, the results from the synapses will vary stochastically, reflecting the passing probability $f(h_{ij})$.
.a), after training with $K_{\mathrm{train}} = 1$ sample (figure 3.b) and after training with $K_{\mathrm{train}} = 128$ samples (figure 3.c). The final distributions confirm the theoretical expectation that $K_{\mathrm{train}} = 1$ leads to a binary network (low variance) while $K_{\mathrm{train}} = 128$ approximates a standard perceptron with a continuous distribution of synaptic weights (high variance).

FIG. 3. Analysis of the stochastic learning rule. a-c, Probability density histograms of synaptic magnetic field parameters over 5 independent models. a shows the distribution before training, where all the fields are initialised so that the passing probability (shown in orange, right-hand axis) is 0.5. b and c show the distributions when trained using 1 or 128 samples respectively. With 1 sample, the distribution is bimodal with peaks at fields with probabilities close to 0 or 1, whereas when training with 128 samples the distribution is focused on the central region of the passing probability function. d-f, Distribution of the neuron values $y$ when an image of a zero is shown 10,000 times independently, for neurons identifying either the correct (output 0 in this example) or incorrect (outputs 1-9) classification. In d the model is untrained so all outputs have the same distribution, while in e and f the distribution is split into the correct output neuron and the incorrect output neurons when training with 1 and 128 samples respectively. The top row shows $K_{\mathrm{test}} = 1$ while the bottom row shows $K_{\mathrm{test}} = 128$. Using more samples during testing reduces the variance and therefore the chance of misclassification. g, The standard deviation averaged over all the neurons as an increasing number of samples is used in training, with $K_{\mathrm{test}} = 1$ (circles) and $K_{\mathrm{test}} = K_{\mathrm{train}}$ (squares). This summarises the conclusions from the distribution plots in d-f. More samples during training allow the standard deviation for a single sample to increase, as the standard deviation over all samples is reduced. However, testing with $K_{\mathrm{test}} < K_{\mathrm{train}}$ results in an increased overall standard deviation. h, Accuracy on the test set against the number of samples during training when using the stochastic (dark green circles) or the mean-field (light green squares) learning rules. The points show the accuracy averaged over 5 independently trained models, while the shaded region indicates 1 standard deviation. The stochastic learning rule maintains a higher test accuracy when the number of samples is low.

FIG. 4. Hardware verification and choice of sampling. a,b, Comparison of the testing accuracy computed using either the physical hardware (points) or from simulation (curves). The test data set is restricted to the first 600 images, with an approximately equal balance of digits. In both, training was done using the model, and hardware testing was limited to eight samples due to throughput limitations of the prototype device. a shows the accuracy when the network was trained with only 1 sample, while b was trained with 128 samples. As before, training with 1 sample reaches an almost deterministic solution, so repeated sampling during testing does not improve the accuracy. Training with 128 samples shows an increase in accuracy as more samples are used during testing, reaching higher peak performance (albeit with a lower initial base). The dashed line shows the performance of a standard neural network and the dotted-dashed line is for a full mean-field binary stochastic network. In both cases the hardware performance shows excellent agreement with the model calculations. The model accuracy is averaged over 5 independent tests with the same trained weights, with the shaded area showing 1 standard deviation. This can be taken to represent the variability in performance for a given task due to the inherent stochasticity of the network. Naturally, it decreases as the number of test samples increases and is lower for the more deterministic $K_{\mathrm{train}} = 1$ case. The hardware accuracy is from a single run over the 600 images, so the error bars show the standard error of the estimation of the accuracy over the mini-batches. c, Test accuracy (as measured with the model) over different combinations of training and testing sampling for the sub-sampled MNIST task. The data are bilinearly interpolated, which can be considered as averaging over fractions of the data set with different sampling rates. In general, testing with more samples increases accuracy, but this is limited when $K_{\mathrm{test}} > K_{\mathrm{train}}$. In particular, in the $K_{\mathrm{train}} = 1$ case, further sampling provides little improvement due to the deterministic weight distributions. Training with 2 samples is better in all test cases than training with 1, but the best overall accuracy is obtained when 128 samples are used in both training and testing. Data such as these provide a guide to choosing training and testing samples depending on the desired accuracy and operation times for a given task.
which we define as $K_{\mathrm{test}}$. The top row shows the distribution when $K_{\mathrm{test}} = 1$, while the bottom row shows $K_{\mathrm{test}} = 128$. The untrained neuron values exhibit a Gaussian distribution across all data samples, with fields initialised to give the largest possible variance; see figure 3.d. After training, with $K_{\mathrm{train}} = 1$ or $K_{\mathrm{train}} = 128$ and $K_{\mathrm{test}} = K_{\mathrm{train}}$, the distributions of the correct and incorrect class neurons minimally overlap. However, if we test the network with $K_{\mathrm{test}} = 1$ after training it with $K_{\mathrm{train}} = 128$, there is a significant overlap (3.f, upper panel), suggesting a high probability of misclassification. In all cases, when testing with $K_{\mathrm{test}} = 128$ (bottom row) the variance is reduced by a factor of 1/128, as given in eqn. 5, which allows for better resolution of the mean values. In the case $K_{\mathrm{train}} = 128$, the learning rule has exploited this additional sampling and variance reduction by better utilising a continuous range of weights to boost performance. However, when $K_{\mathrm{test}} < 128$ (as in 3.f, upper panel), the increase in variance decreases the probability of correct classification. In the other training case ($K_{\mathrm{train}} = 1$, 3.e, upper panel), the learning rule adapts the weights to find a low-variance, almost deterministic binary solution. Further sampling during testing (3.e, lower panel) reduces this variance further, as expected, but does not significantly change the overlap, as it has already been optimised for the lower sampling regime.
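The 1/K variance reduction discussed here can be checked numerically with a small sketch. It assumes independent Bernoulli synapses with all inputs switched on, so that the neuron variance is $\frac{1}{K}\sum_j p_j(1 - p_j)$; the passing probabilities, number of synapses and number of trials below are arbitrary illustrative values, not figures from the experiment.

```python
import random


def neuron_variance(p, k):
    """Analytic variance of a neuron whose synapses are each averaged over k samples."""
    return sum(pj * (1.0 - pj) for pj in p) / k


def sample_neuron(p, k, rng):
    """One stochastic neuron value: every synapse is measured k times and averaged."""
    return sum(sum(rng.random() < pj for _ in range(k)) / k for pj in p)


def empirical_variance(p, k, trials, rng):
    """Unbiased sample variance of the neuron value over repeated presentations."""
    values = [sample_neuron(p, k, rng) for _ in range(trials)]
    mean = sum(values) / trials
    return sum((v - mean) ** 2 for v in values) / (trials - 1)


rng = random.Random(0)
p = [0.5] * 50                     # 50 active synapses at maximum variance
for k in (1, 4):
    print(k, neuron_variance(p, k), empirical_variance(p, k, 2000, rng))
```

With passing probabilities pushed towards 0 or 1, as after training with a single sample, each $p_j(1 - p_j)$ term is already small, which is why further sampling changes the overlap only slightly in that regime.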