Parallel learning by multitasking neural networks

Parallel learning, namely the simultaneous learning of multiple patterns, constitutes a modern challenge for neural networks. While this cannot be accomplished by standard Hebbian associative neural networks, in this paper we show how the multitasking Hebbian network (a variation on the theme of the Hopfield model, working on sparse datasets) is naturally able to perform this complex task. We focus on systems processing in parallel a finite (up to logarithmic growth in the size of the network) number of patterns, mirroring the low-storage setting of standard associative neural networks. When the patterns to be reconstructed are mildly diluted, the network handles them hierarchically, distributing the amplitudes of their signals as power laws w.r.t. the pattern information content (hierarchical regime), while, for strong dilution, the signals pertaining to all the patterns are simultaneously raised with the same strength (parallel regime). Further, we prove that the training protocol (either supervised or unsupervised) neither alters the multitasking performance nor changes the thresholds for learning. We also highlight (analytically and by Monte Carlo simulations) that a standard cost function (i.e., the Hamiltonian) used in statistical mechanics exhibits the same minima as a standard loss function (i.e., the sum of squared errors) used in machine learning.


Introduction
Typically, Artificial Intelligence has to deal with several inputs occurring at the same time: think, for instance, of automatic driving, where the system has to distinguish and react to different objects (e.g., pedestrians, traffic lights, riders, crosswalks) that may appear simultaneously. Likewise, when a biological neural network learns, it rarely deals with one single input at a time1: for instance, while trained at school to learn each single letter, we are also learning about the composition of our alphabet. In this perspective, when stating that neural networks operate in parallel, some caution about a potential ambiguity is in order. To fix ideas, let us focus on the Hopfield model [34], the harmonic oscillator of associative neural networks accomplishing pattern recognition [8,25]: its neurons do operate synergistically in parallel, but with the purpose of retrieving one single pattern at a time, not several simultaneously [8,10,36]. A parallel processing where multiple patterns are simultaneously retrieved is not accessible to standard Hopfield networks as long as each pattern is fully informative, namely as long as its vectorial binary representation is devoid of blank entries. On the other hand, when a fraction of the entries can be blank [14], multiple-pattern retrieval becomes potentially achievable by the network. Intuitively, this can be explained by noticing that the overall number of neurons making up the network (and thus available for information processing) equals the length of the binary vectors codifying the patterns to be retrieved; hence, as long as these vectors carry information in all their entries, there is no free room for dealing with multiple patterns. Conversely, the multitasking neural networks introduced in [2] are able to overcome this limitation and have been shown to succeed in retrieving multiple patterns simultaneously, just by leveraging the presence of lacunae in the patterns stored by the network. The emerging
pattern-recognition properties have been extensively investigated at medium storage (i.e., on random graphs above the percolation threshold) [23], at high storage (i.e., on random graphs below the percolation threshold) [24] as well as on scale-free [44] and hierarchical [3] topologies.
However, while the study of the parallel retrieval capabilities of these multitasking networks is by now well established, the comprehension of their parallel learning capabilities has only just begun, and it is the main focus of the present paper.
In this regard, it is important to stress that the Hebbian prescription has recently been revised so as to turn it from a storing rule (built on a set of already definite patterns, as in the original Amit-Gutfreund-Sompolinsky (AGS) theory) into a genuine learning rule (where unknown patterns have to be inferred by experiencing solely a sample of their corrupted copies), see e.g., [5,13,27]2. In this work we merge these extensions of the bare AGS theory and use definite patterns (equipped with blank entries) to generate a sparse data-set of corrupted examples, which is the sole information experienced by the network: we aim to highlight the role of the lacunae density and of the data-set size and quality on the network performance, in particular deepening the way the network simultaneously learns the patterns hidden behind the supplied examples. In this investigation we focus on the low-storage scenario (where the number of definite patterns grows sub-linearly with the volume of the network), addressing both the supervised and the unsupervised setting.
The paper is structured as follows: the main text has three Sections. Beyond this Introduction, provided in Section 1, in Section 2 we revise the multitasking associative network; once its parallel retrieval capabilities are briefly summarized (Sec. 2.1), we introduce a simple data-set the network has to cope with in order to move from the simpler storing of patterns to their learning from examples (Sec. 2.2). Next, in Section 3 we provide an exhaustive statistical-mechanical picture of the network's emergent information-processing capabilities by taking advantage of Guerra's interpolation techniques: in particular, focusing on the Cost function (Sec. 3.1), we face the big-data limit (Sec. 3.1.1) and we deepen the nature of the phase transition the network undergoes as ergodicity breaking spontaneously takes place (Sec. 3.1.2). Sec. 3.2 is entirely dedicated to providing phase diagrams (namely, plots in the space of the control parameters where different regions depict different global computational capabilities). Further, before reaching conclusions and outlooks, as reported in Sec. 4, in Sec. 3.3 we show how the network's Cost function (typically used in Statistical Mechanics) can be sharply related to standard Loss functions (typically used in Machine Learning), so as to appreciate how parallel learning effectively lowers several Loss functions at once. In the Appendices we fix a number of subtleties: in Appendix A we provide a more general setting for the sparse data-sets considered in this research3, while in Appendix B we inspect the relative entropies of these data-sets and, finally, in Appendix C we provide a revised version of the Signal-to-Noise technique (which, beyond providing an alternative route to obtain the phase diagrams, allows to evaluate computational shortcuts). Appendices D and E give details on calculations, plots and proofs of the main theorems.

A preliminary glance at the emergent parallel retrieval capabilities
Hereafter, for the sake of completeness, we briefly review the retrieval properties of the multitasking Hebbian network in the low-storage regime, while we refer to [2,4] for an extensive treatment.

Definition 1. Given N Ising neurons σ_i = ±1 (i = 1, ..., N) and K random patterns ξ^µ (µ = 1, ..., K), each of length N, whose entries are i.i.d. from

P(ξ^µ_i = x) = ((1−d)/2) δ_{x,+1} + ((1−d)/2) δ_{x,−1} + d δ_{x,0},   (2.1)

where δ_{i,j} is the Kronecker delta and d ∈ [0, 1], the Hamiltonian (or cost function) of the system reads as

H_N(σ|ξ) = −(1/N) Σ_{µ=1}^K Σ_{i<j} ξ^µ_i ξ^µ_j σ_i σ_j.   (2.2)

The parameter d tunes the "dilution" of the pattern entries: if d = 0 the standard Rademacher setting of the AGS theory is recovered, while for d = 1 no information is retained in these patterns; otherwise stated, these vectors display, on average, a fraction d of blank entries.

Definition 2. In order to assess the network retrieval performance we introduce the K Mattis magnetizations

m_µ(σ) := (1/N) Σ_{i=1}^N ξ^µ_i σ_i,   µ = 1, ..., K,   (2.3)

which quantify the overlap between the generic neural configuration σ and the µ-th pattern.
Note that the cost function (2.2) can be recast as a quadratic form in the m_µ's, namely

H_N(σ|ξ) = −(N/2) Σ_{µ=1}^K m_µ(σ)² + K/2,   (2.4)

where the term K/2 in the r.h.s. stems from the diagonal terms (i = j) excluded from the sum in the r.h.s. of eq. (2.2) and, in the low-load scenario (i.e., K grows sub-linearly with N), can be neglected in the thermodynamic limit (N → ∞).
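To make the definitions concrete, the following sketch samples diluted patterns as in eq. (2.1), evaluates the pairwise cost function and checks numerically its rewriting as a quadratic form in the Mattis magnetizations (here the exact diagonal offset is kept instead of the K/2 approximation; all names are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, d = 200, 3, 0.3

# Diluted patterns: entries are +1/-1 with prob (1-d)/2 each, 0 with prob d (eq. 2.1)
xi = rng.choice([-1, 0, 1], size=(K, N), p=[(1 - d) / 2, d, (1 - d) / 2])

def hamiltonian(sigma, xi):
    """H = -(1/N) sum_mu sum_{i<j} xi^mu_i xi^mu_j s_i s_j (cf. eq. 2.2)."""
    N = sigma.size
    H = 0.0
    for mu in range(xi.shape[0]):
        h = xi[mu] * sigma
        s = h.sum()
        H += -(s * s - np.dot(h, h)) / (2 * N)  # off-diagonal part only
    return H

def mattis(sigma, xi):
    """Mattis magnetizations m_mu = (1/N) sum_i xi^mu_i s_i (eq. 2.3)."""
    return xi @ sigma / sigma.size

sigma = rng.choice([-1, 1], size=N)
m = mattis(sigma, xi)
# quadratic-form rewriting: H = -(N/2) sum_mu m_mu^2 + diagonal offset
diag = sum((xi[mu] ** 2).sum() for mu in range(K)) / (2 * N)
H_quad = -(N / 2) * np.sum(m ** 2) + diag
assert np.isclose(hamiltonian(sigma, xi), H_quad)
```

The offset concentrates around K(1−d)/2 and is negligible at low load, consistently with the remark above.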
As we are going to explain, the dilution ruled by d is pivotal for the network to perform parallel processing. It is instructive to first consider a toy model handling just K = 2 patterns: let us assume, for simplicity, that the first pattern ξ¹ contains information (i.e., no blank entries) solely in its first half of entries and the second pattern ξ² solely in its second half, that is

ξ¹ = (ξ¹_1, ..., ξ¹_{N/2}, 0, ..., 0),   ξ² = (0, ..., 0, ξ²_{N/2+1}, ..., ξ²_N).

Unlike the standard Hopfield reference (d = 0), where the retrieval of one pattern employs all the resources and there is no chance to retrieve any other pattern, not even partially (i.e., as m¹ → 1 then m² ≈ 0, because patterns are orthogonal for large N in the standard random setting), here neither m¹ nor m² can reach the value 1, and therefore the complete retrieval of one of the two still leaves resources for the retrieval of the other. In this particular case, the minimization of the cost function (2.2) is optimal when both magnetizations are equal to one half, that is, when they both saturate their upper bound. In general, for an arbitrary dilution level d, the minimization of the cost function requires the network to be in one of the following regimes:

• hierarchical scenario: for values of dilution not too high (i.e., d < d_c, vide infra), one of the two patterns is fully retrieved (say, m¹ ≈ 1 − d) and the other is retrieved to the largest extent given the available resources, these being constituted by, approximately, the N d neurons corresponding to the blank entries of ξ¹ (thus, m² ≈ d(1 − d)), and so on if further patterns are considered;
• parallel scenario: for large values of dilution (i.e., above a critical threshold d_c), the magnetizations related to all the patterns are raised and the signals they convey share the same amplitude.
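The hierarchical amplitudes described in the first bullet form a geometric sequence whose sum saturates the available resources; a minimal numerical check (our illustration) is:

```python
import numpy as np

def hierarchical_magnetizations(d, K):
    """Hierarchical-ansatz amplitudes m_l = d**(l-1) * (1-d), l = 1..K."""
    l = np.arange(1, K + 1)
    return d ** (l - 1) * (1 - d)

d, K = 0.2, 5
m = hierarchical_magnetizations(d, K)
# the geometric amplitudes saturate all resources as K grows: sum = 1 - d**K
assert np.isclose(m.sum(), 1 - d ** K)
# crossover condition toward the parallel regime (see below): m_1 ~ sum_{k>1} m_k,
# i.e. (1 - d) ~ d - d**K, equivalently 1 - 2d + d**K ~ 0
```

Note that the crossover condition rewrites as 1 − 2d + d^K = 0, the same factor that reappears later in the learning threshold of Section 3.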
In general, in this type of neural network, the pure-state ansatz4 m = (1, 0, 0, ..., 0), that is σ_i = ξ¹_i for i = 1, ..., N, barely works and parallel retrieval is often favored. In fact, for K ≥ 2, at relatively low values of the pattern dilution (d ≪ 1) and in the zero-noise limit β → ∞, one can prove the validity of the so-called hierarchical ansatz [2], as we briefly discuss: one pattern, say ξ¹, is perfectly retrieved and displays a Mattis magnetization m¹ ≈ (1 − d); a fraction d of the neurons is not involved and is therefore available for further retrieval, with any remaining pattern, say ξ², yielding m² ∼ (1 − d)d; proceeding iteratively, one finds m^ℓ = d^{ℓ−1}(1 − d) for ℓ = 1, ..., K, and the overall number K of patterns simultaneously retrieved corresponds to the employment of all the resources. Specifically, K can be estimated by requiring that the weakest signal m^K = d^{K−1}(1 − d) does not fall below the minimal magnetization resolution allowed by discreteness: for any fixed and finite d, this implies K ≲ log N, which can be thought of as a "parallel low-storage" regime of neural networks. It is worth stressing that, in the above-mentioned regime of low dilution, the configuration leading to m = (1 − d)(1, d, d², ..., d^{K−1}) is the one favored by the minimization of the cost function. This organization is stable until a critical dilution level d_c is reached, where m¹ ∼ Σ_{k>1} m^k [2]; beyond that level the network undergoes a rearrangement and a new organization, called parallel ansatz, supplants the previous one. Indeed, for high values of dilution (i.e., d → 1) it is immediate to check that the ratio among the intensities of the various magnetizations stabilizes to one; hence, in this regime all the magnetizations are raised with the same strength and the network is operationally set in a fully parallel retrieval mode: the parallel retrieval state simply reads m = m̄ (1, 1, ..., 1). This picture is confirmed by the plots shown in Fig. 1, obtained by solving the self-consistency equations for the Mattis magnetizations of the multitasking Hebbian network equipped with K = 2 patterns, which read as [2]

m¹ = (1 − d) [ d tanh(β m¹) + ((1 − d)/2) ( tanh(β(m¹ + m²)) + tanh(β(m¹ − m²)) ) ],
m² = (1 − d) [ d tanh(β m²) + ((1 − d)/2) ( tanh(β(m¹ + m²)) − tanh(β(m¹ − m²)) ) ],   (2.6)

where β ∈ R⁺ denotes the level of noise. We remark that these hierarchical or parallel organizations of the retrieval, beyond emerging naturally within the equilibrium description provided by Statistical Mechanics, are actually the real stationary states of the dynamics of these networks at work with diluted patterns, as shown in Figure 2.

(Caption of Fig. 2: these plots confirm that the picture provided by statistical mechanics is actually dynamically reached by the network. We initialize the network sharply in a pattern as a Cauchy condition (represented by the dotted blue Dirac delta peaked at the pattern in the second column) and, in the first column, we show the stationary values of the Mattis magnetizations pertaining to the different patterns, while in the second column we report their histograms obtained by sampling 1000 independent Monte Carlo simulations: starting from a sequential retrieval regime, the network ends up in a multiple-retrieval mode, hierarchical vs parallel depending on the level of dilution in the patterns.)
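To make the self-consistency structure concrete, the sketch below iterates one plausible explicit form of the K = 2 equations, m_µ = E_ξ[ ξ^µ tanh(β Σ_ν m_ν ξ^ν) ] averaged over the diluted pattern entries (our rendering; the precise form in [2] may differ in details). At low noise and mild dilution, the iteration lands on the hierarchical state (m¹, m²) ≈ ((1−d), d(1−d)):

```python
import numpy as np

def rhs_K2(m, beta, d):
    """m_mu = E_xi[ xi^mu tanh(beta (m1 xi^1 + m2 xi^2)) ] with diluted entries:
    xi = +-1 w.p. (1-d)/2 each, 0 w.p. d (illustrative explicit rendering)."""
    m1, m2 = m
    t = np.tanh
    new1 = (1 - d) * (d * t(beta * m1)
                      + (1 - d) / 2 * (t(beta * (m1 + m2)) + t(beta * (m1 - m2))))
    new2 = (1 - d) * (d * t(beta * m2)
                      + (1 - d) / 2 * (t(beta * (m1 + m2)) - t(beta * (m1 - m2))))
    return np.array([new1, new2])

def solve(beta, d, m0=(0.9, 0.1), iters=500):
    """Plain fixed-point iteration of the self-consistency equations."""
    m = np.array(m0, dtype=float)
    for _ in range(iters):
        m = rhs_K2(m, beta, d)
    return m

beta, d = 50.0, 0.2
m = solve(beta, d)
# hierarchical fixed point at low noise: m ~ ((1-d), d(1-d)) = (0.8, 0.16)
```

Scanning (β, d) with this kind of iteration is how plots like Fig. 1 are typically produced.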

From parallel storing to parallel learning
In this section we revise the multitasking Hebbian network [2,4] in such a way that it can undergo a learning process, instead of a simple storing of patterns. In fact, in the typical learning setting, the set of definite patterns to be reconstructed by the network, hereafter promoted to play as "archetypes", is not available; rather, the network is exposed to examples, namely noisy versions of these archetypes.
As long as enough examples are provided, the network is expected to correctly form its own representation of each archetype such that, upon further exposure to a new example related to a certain archetype, it will be able to retrieve it and, from then on, suitably generalize it.
Generalized Hebbian kernels of this kind have recently been introduced to encode unsupervised [5] and supervised [13] learning processes and, in the present paper, these learning rules are modified so as to deal with diluted patterns.
First, let us define the data-set these networks have to cope with: the archetypes are randomly drawn from the distribution (2.1). Each archetype ξ^µ is then used to generate a set of M_µ perturbed versions, denoted as η^{µ,a}, with a = 1, ..., M_µ and η^{µ,a} ∈ {−1, 0, +1}^N. Thus, the overall set of examples to be supplied to the network is η = {η^{µ,a}}_{µ=1,...,K; a=1,...,M_µ}. Of course, different ways to sample the examples are conceivable: for instance, one can require that the position of the blank entries appearing in ξ^µ is preserved over all the examples {η^{µ,a}}, or one can require that only the number of blank entries Σ_{i=1}^N δ_{ξ^µ_i,0} is preserved (either strictly or on average). Here we address the first case, as it requires a simpler notation, and we refer to Appendix A for a more general treatment.

Definition 3. The entries of each example are drawn according to

η^{µ,a}_i = ξ^µ_i χ^{µ,a}_i,   with P(χ^{µ,a}_i = ±1) = (1 ± r_µ)/2,   (2.9)

for i = 1, ..., N and µ = 1, ..., K. Notice that r_µ tunes the data-set quality: as r_µ → 1, the examples belonging to the µ-th set collapse onto the archetype ξ^µ, while as r_µ → 0 they turn out to be uncorrelated with the related archetype ξ^µ.
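A data-set of this kind is easy to generate numerically; the sketch below (illustrative names) builds examples by flipping each informative entry of the archetype independently and checks that blank positions are preserved and that the empirical example-archetype correlation approaches r:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, M, d, r = 1000, 2, 50, 0.25, 0.6

# archetypes from eq. (2.1)
xi = rng.choice([-1, 0, 1], size=(K, N), p=[(1 - d) / 2, d, (1 - d) / 2])
# chi flips each informative entry independently: P(chi = +1) = (1 + r) / 2
chi = rng.choice([1, -1], size=(K, M, N), p=[(1 + r) / 2, (1 - r) / 2])
eta = xi[:, None, :] * chi          # eta^{mu,a}_i = xi^mu_i chi^{mu,a}_i (eq. 2.9)

# blank entries of xi are preserved in every example
assert np.all((eta == 0) == np.repeat(xi[:, None, :] == 0, M, axis=1))
# the empirical correlation on informative entries approaches r
mask = xi != 0
corr = (eta.mean(axis=1) * xi)[mask].mean()
```

Here `corr` estimates E[χ] = r, which is the sense in which r tunes the data-set quality.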
As we will show in the next sections, the behavior of the system depends on the parameters M_µ and r_µ only through the combination

ρ_µ := (1 − r_µ²)/(M_µ r_µ²);

therefore, as long as ρ_µ is µ-independent, the theory is not affected by the specific choice of the archetype. Thus, for the sake of simplicity, hereafter we consider r and M independent of µ and we pose ρ := (1 − r²)/(M r²). Remarkably, ρ plays as an information-content control parameter [13]: to see this, let us focus on the µ-th pattern and the i-th digit, whose related block is η^µ_i := (η^{µ,1}_i, ..., η^{µ,M}_i). The error probability for any single entry is P(η^{µ,a}_i ≠ ξ^µ_i) = (1 − r)/2 and, by applying the majority rule on the block, a standard CLT argument gives

P( sign(Σ_{a=1}^M η^{µ,a}_i) ≠ ξ^µ_i ) ≈ (1/2) [ 1 − erf(1/√(2ρ)) ].

Computing the conditional entropy H(ξ^µ_i | η^µ_i), which quantifies the amount of information needed to describe the original message ξ^µ_i given the related block η^µ_i, one finds that it is monotonically increasing with ρ. Therefore, with a slight abuse of language, in the following ρ shall be referred to as the data-set entropy. The available information is allocated directly in the synaptic coupling among neurons (as in the standard Hebbian storing), as specified by the following supervised and unsupervised generalizations of the multitasking Hebbian network.

Definition 4. Given N binary neurons σ_i = ±1, with i ∈ (1, ..., N), the cost function (or Hamiltonian) of the multitasking Hebbian neural network in the supervised regime is

H^(sup)_N(σ|η) = −(1/(2N)) Σ_{µ=1}^K (1/(R M²)) Σ_{i,j=1}^{N} Σ_{a,b=1}^{M} η^{µ,a}_i η^{µ,b}_j σ_i σ_j,   (2.11)

where R := r² + (1 − r²)/M.

Definition 5. Given N binary neurons σ_i = ±1, with i ∈ (1, ..., N), the cost function (or Hamiltonian) of the multitasking Hebbian neural network in the unsupervised regime is

H^(unsup)_N(σ|η) = −(1/(2N)) Σ_{µ=1}^K (1/(R M)) Σ_{i,j=1}^{N} Σ_{a=1}^{M} η^{µ,a}_i η^{µ,a}_j σ_i σ_j   (2.12)

(note the cross-example terms a ≠ b appearing in the double sum of eq. (2.11), that are missing in eq. (2.12)).
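The interpretation of ρ as an error/entropy knob can be checked directly: the snippet below compares the CLT estimate of the majority-rule error quoted above (our reconstruction of the formula) against a brute-force simulation of example blocks.

```python
import numpy as np
from math import erf, sqrt

r, M = 0.3, 41                      # M odd, so the majority vote has no ties
rho = (1 - r ** 2) / (M * r ** 2)   # data-set "entropy" rho

# majority-rule error per entry, CLT estimate
p_err_clt = 0.5 * (1 - erf(1 / sqrt(2 * rho)))

# empirical check: each of the M copies keeps the archetype sign w.p. (1+r)/2
rng = np.random.default_rng(2)
chi = rng.choice([1, -1], size=(200_000, M), p=[(1 + r) / 2, (1 - r) / 2])
p_err_emp = np.mean(chi.sum(axis=1) < 0)
```

For these parameters the two estimates agree to a few parts in a thousand, and both vanish as ρ → 0 (high-quality or big data-set).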
We investigate the model within a canonical framework: we introduce the Boltzmann-Gibbs measure

P^(sup,unsup)_{N,β}(σ) = exp( −β H^(sup,unsup)_N(σ|η) ) / Z^(sup,unsup)_{N,β}(η),   (2.13)

with

Z^(sup,unsup)_{N,β}(η) = Σ_{σ} exp( −β H^(sup,unsup)_N(σ|η) ),   (2.14)

where Z^(sup,unsup) is the normalization factor, also referred to as the partition function, and the parameter β ∈ R⁺ rules the broadness of the distribution in such a way that for β → 0 (infinite-noise limit) all the 2^N neural configurations are equally likely, while for β → ∞ the distribution is delta-peaked at the configurations corresponding to the minima of the cost function.
The average performed over the Boltzmann-Gibbs measure is denoted as

ω(·) := Σ_{σ} (·) P^(sup,unsup)_{N,β}(σ).   (2.15)

Beyond this average, we shall also take the so-called quenched average, that is, the average over the realizations of archetypes and examples, namely over the distributions (2.1) and (2.9); this is denoted as

E(·) := E_ξ E_{(η|ξ)} (·).   (2.16)

Definition 6. The quenched statistical pressure of the network at finite network size N reads as

A^(sup,unsup)_{N,K,β,d,M,r} := (1/N) E ln Z^(sup,unsup)_{N,β}(η).   (2.17)

In the thermodynamic limit we pose

A^(sup,unsup)_{K,β,d,M,r} := lim_{N→∞} A^(sup,unsup)_{N,K,β,d,M,r}.   (2.18)

We recall that the statistical pressure equals the free energy times −β (hence they convey the same information content).
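For very small systems the partition function and the Boltzmann-Gibbs measure can be computed exactly by enumeration, which makes the two limiting behaviors of β tangible; the toy sketch below (using, for concreteness, the storing cost function in its quadratic form) is illustrative only:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
N, K, d = 10, 2, 0.3
xi = rng.choice([-1, 0, 1], size=(K, N), p=[(1 - d) / 2, d, (1 - d) / 2])

def energy(sigma):
    m = xi @ sigma / N
    return -(N / 2) * np.sum(m ** 2)       # quadratic form, constant offset dropped

configs = np.array(list(product([-1, 1], repeat=N)))   # all 2^N configurations
E = np.array([energy(s) for s in configs])

def gibbs(beta):
    w = np.exp(-beta * (E - E.min()))      # shift by E.min() for numerical stability
    return w / w.sum()                     # Boltzmann-Gibbs measure; Z = w.sum()

p_hot, p_cold = gibbs(0.1), gibbs(10.0)
# small beta: nearly uniform measure; large beta: mass concentrates on minima of E
```

The same exact enumeration is a handy sanity check for the Monte Carlo samplers used at larger N.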

Definition 7.
The network capabilities can be quantified by introducing the following order parameters, for µ = 1, ..., K:

m_µ(σ) := (1/N) Σ_{i=1}^N ξ^µ_i σ_i,
n_µ(σ) := (1/(r M N)) Σ_{i=1}^N Σ_{a=1}^M η^{µ,a}_i σ_i,
n_{µ,a}(σ) := (1/(r N)) Σ_{i=1}^N η^{µ,a}_i σ_i.   (2.19)

We stress that, beyond the fairly standard K Mattis magnetizations m_µ, which assess the alignment of the neural configuration σ with the archetype ξ^µ, we need to introduce also the K empirical Mattis magnetizations n_µ, which compare the alignment of the neural configuration with the average of the examples labelled by µ, as well as the K × M single-example Mattis magnetizations n_{µ,a}, which measure the proximity between the neural configuration and a specific example. An intuitive way to see the suitability of the n_µ's and of the n_{µ,a}'s is to notice that the cost functions H^(sup) and H^(unsup) can be written as quadratic forms in, respectively, the n_µ's and the n_{µ,a}'s; on the other hand, the m_µ's do not appear therein explicitly, as the archetypes are in principle unknown. Finally, notice that no spin-glass order parameter is needed here (since we are working in the low-storage regime [8,25]).
3 Parallel Learning: the picture by statistical mechanics

Study of the Cost function and its related Statistical Pressure
To inspect the emergent capabilities of these networks, we need to estimate the order parameters introduced in eq. (2.19) and analyze their behavior versus the control parameters K, β, d, M, r. To this task we need an explicit expression of the statistical pressure in terms of these order parameters, so as to extremize the former over the latter. In this Section we carry out this investigation in the thermodynamic limit and in the low-storage scenario by relying upon Guerra's interpolating techniques (see e.g., [17,30-32]): the underlying idea is to introduce an interpolating statistical pressure whose extrema are the original model (which is the target of our investigation, but which we may be unable to address directly) and a simple one (usually a one-body model that we can solve exactly). We then start by evaluating the solution of the latter and next we propagate the obtained solution back to the original model by the fundamental theorem of calculus, integrating over the interpolating variable. Usually, in this last passage, one assumes replica symmetry, namely that the order-parameter fluctuations are negligible in the thermodynamic limit, as this makes the integral propagating the solution analytical. In the low-load scenario replica symmetry holds exactly, making the following calculation rigorous. In fact, as long as K/N → 0 while N → ∞, the order parameters self-average around their means [19,46], which will be denoted by a bar, that is

lim_{N→∞} P_{N,K,β,d,M,r}(m_µ) = δ(m_µ − m̄_µ),   lim_{N→∞} P_{N,K,β,d,M,r}(n_µ) = δ(n_µ − n̄_µ),   (3.2)

where P_{N,K,β,d,M,r} denotes the Boltzmann-Gibbs probability distribution of the observables considered. We anticipate that the mean values of these distributions are independent of the training protocol (either supervised or unsupervised) underlying the Hebbian kernel. Before proceeding, we slightly revise the partition functions (2.14) by inserting an extra term in their exponents, as this allows us to apply the functional-generator technique to evaluate the Mattis magnetizations. This implies the following modification,
respectively in the supervised and unsupervised settings, of the partition function (2.14): the J-dependent partition functions Z^(sup,unsup)_{K,β,d,M,r}(η; J) are obtained by adding the source term J Σ_{µ,i} ξ^µ_i σ_i to the exponents.

Definition 8. Given the interpolating parameter t ∈ [0, 1], the auxiliary field J and the constants {ψ_µ}_{µ=1,...,K} ∈ R to be set a posteriori, Guerra's interpolating partition function for the supervised and unsupervised multitasking Hebbian networks is given, respectively, by

Z^(sup)_t(η; J) = Σ_σ exp[ J Σ_{µ=1}^K Σ_{i=1}^N ξ^µ_i σ_i − t β H^(sup)_N(σ|η) + (1 − t) β N Σ_{µ=1}^K ψ_µ n_µ(σ) ],   (3.3)

Z^(unsup)_t(η; J) = Σ_σ exp[ J Σ_{µ=1}^K Σ_{i=1}^N ξ^µ_i σ_i − t β H^(unsup)_N(σ|η) + (1 − t) (β N/M) Σ_{µ=1}^K Σ_{a=1}^M ψ_µ n_{µ,a}(σ) ].   (3.4)

More precisely, we added the term J Σ_µ Σ_i ξ^µ_i σ_i, which allows us to "generate" the expectation of the Mattis magnetization m_µ by evaluating the derivative w.r.t. J of the quenched statistical pressure at J = 0. This operation is not necessary for Hebbian storing, where the Mattis magnetization is a natural order parameter (the Hopfield Hamiltonian can be written as a quadratic form in m_µ, as standard in the AGS theory [8]), while for Hebbian learning (whose cost function can be written as a quadratic form in n_µ, not in m_µ, as the network does not experience the archetypes directly) we need such a term, for otherwise the expectation of the Mattis magnetization would not be accessible. This operation becomes redundant in the M → ∞ limit, where m_µ and n_µ become proportional by a standard Central Limit Theorem (CLT) argument (see also Sec. 3.1.1 and [13]). Clearly, the generalized interpolating partition functions provided in eq.s (3.3) and (3.4) recover the (J-modified) original models Z^(sup,unsup)_{K,β,d,M,r}(η; J) when t = 1, while they return a simple one-body model at t = 0. The role of the ψ_µ's is instead that of mimicking, as closely as possible, the true post-synaptic field perceived by the neurons.
These partition functions can be used to define a generalized measure and a generalized Boltzmann-Gibbs average, which we denote by ω_t(·). Of course, when t = 1 the standard Boltzmann-Gibbs measure and the related averages are recovered.
Analogously, we can also introduce a generalized interpolating quenched statistical pressure.

Definition 9. The interpolating statistical pressure for the multitasking Hebbian neural network is introduced as

A^(sup,unsup)_{N,t}(J) := (1/N) E ln Z^(sup,unsup)_t(η; J),   (3.5)

and, in the thermodynamic limit,

A^(sup,unsup)_t(J) := lim_{N→∞} A^(sup,unsup)_{N,t}(J).   (3.6)

Obviously, by setting t = 1 in the interpolating pressure we recover the original one, namely A^(sup,unsup)_{K,β,d,M,r}(J) = A^(sup,unsup)_{t=1}(J), which we finally evaluate at J = 0.
We are now ready to state the next

Theorem 1. In the thermodynamic limit (N → ∞) and in the low-storage regime (K/N → 0), the quenched statistical pressure of the multitasking Hebbian network, trained under supervised or unsupervised learning, reads as

A^(sup,unsup)_{K,β,d,M,r}(J) = −(β/(2(1+ρ))) Σ_{µ=1}^K n̄_µ² + E ln 2 cosh[ (β/(1+ρ)) Σ_{µ=1}^K n̄_µ η̄^µ + J Σ_{µ=1}^K ξ^µ ],   (3.8)

where η̄^µ_i := (1/(r M)) Σ_{a=1}^M η^{µ,a}_i (the site index being dropped inside the quenched average, the entries being i.i.d.), and the values n̄_µ must fulfill the following self-consistency equations

n̄_µ = E[ η̄^µ tanh( (β/(1+ρ)) Σ_{ν=1}^K n̄_ν η̄^ν ) ],   µ = 1, ..., K,   (3.9)

as these values of the order parameters are extremal for the statistical pressure A^(sup,unsup)_{K,β,d,M,r}(J = 0).
Corollary 1. By considering the auxiliary field J coupled to m_µ and recalling that lim_{N→∞} m_µ = m̄_µ, we can write down a self-consistency equation also for the Mattis magnetization as

m̄_µ = ∂_J A^(sup,unsup)_{K,β,d,M,r}(J) |_{J=0}.

For the proof of Theorem 1 and of Corollary 1 we refer to Appendix E.1.
We highlight that the expressions of the quenched statistical pressure for a network trained with or without the supervision of a teacher do actually coincide: intuitively, this happens because we are considering only a few archetypes (i.e., we work at low load); consequently, the minima of the cost function are well separated and the teacher plays only a negligible role in shaping the landscape so as to avoid overlaps among their basins of attraction. Clearly, this is expected to no longer hold in the high-load setting and, indeed, it is proven not to hold for non-diluted patterns, where supervised and unsupervised protocols give rise to different outcomes [5,13]. From a mathematical perspective, the fact that, whatever the learning procedure, the expression of the quenched statistical pressure is always the same is a consequence of standard concentration-of-measure arguments [17,45] as, in the N → ∞ limit, beyond eq. (3.2), the remaining order parameters concentrate as well. The self-consistency equations (3.9) have been solved numerically for several values of the parameters and the results for K = 2 and K = 3 are shown in Fig. 3 (where the values of the cost function are also reported) and Fig. 4, respectively. We also checked the validity of these results by comparing them with the outcomes of Monte Carlo simulations, finding an excellent asymptotic agreement; further, in the large-M limit, the magnetizations eventually converge to the values predicted by the theory developed in the storing framework, see eq. (2.6). Therefore, in both scenarios, the hierarchical or parallel organization of the magnetization amplitudes is recovered: beyond the numerical evidence just mentioned, an analytical proof is provided in Appendix D.

Low-entropy data-sets: the Big-Data limit
As discussed in Sec. 2.2, the parameter ρ quantifies the amount of information needed to describe the original message ξ^µ given the set of related examples {η^{µ,a}}_{a=1,...,M}. In this section we focus on the case ρ ≪ 1, which corresponds to a highly-informative data-set; we recall that in the limit ρ → 0 we get a data-set where either the items (r → 1) or their empirical average (M → ∞, r finite) coincide with the archetypes, in such a way that the theory collapses to the standard Hopfield reference. As explained in Appendix E.2, we start from the self-consistency equations (3.8)-(3.9) and we exploit the Central Limit Theorem to write η̄^µ ∼ ξ^µ (1 + λ_µ √ρ), where λ_µ ∼ N(0, 1). In this way we reach the simpler expressions given by the next

Proposition 1. In the low-entropy data-set scenario, preserving the low-storage and thermodynamic-limit assumptions, the two sets of order parameters of the theory, m̄_µ and n̄_µ, become related by the following equations

n̄_µ = (1 + ρ) m̄_µ,   m̄_µ = E_{ξ,Z} [ ξ^µ tanh( β Σ_{ν=1}^K m̄_ν ξ^ν + β Z √( ρ Σ_{ν=1}^K m̄_ν² (ξ^ν)² ) ) ],   (3.10)

where Z ∼ N(0, 1) is a standard Gaussian variable. Furthermore, to lighten the notation and assuming d ≠ 1 with no loss of generality, we posed E_{ξ,Z} for the joint average over the pattern entries and the Gaussian noise induced by the data-set.

(Figure caption: in the parallel regime all the magnetizations acquire the same value; note how, by increasing the entropy in the data-set (e.g., for ρ = 0.1 and ρ = 0.4), the domain of validity of the parallel regime enlarges (much as increasing β in the network, see Fig. 1). The vertical blue lines mark the transitions between these two regimes as captured by Statistical Mechanics: they correspond to switching from the white to the green regions of the phase diagrams of Fig. 6.)

The regime ρ ≪ 1, beyond being an interesting one per se (e.g., it can be seen as the big-data, M → ∞, limit of the theory), offers a crucial advantage because of the above-emerging proportionality relation between n̄ and m̄ (see eq. 3.10). In fact, the model is supplied only with examples (upon which the n_µ's are defined), while it is not aware of the archetypes (upon which the m_µ's are defined); yet we can use this relation to recast the self-consistency equations for n̄ into self-consistency equations for m̄, such that their numerical solution in the space of the control parameters allows us to draw the phase diagram of the network more straightforwardly. Further, we can find out explicitly the thresholds for learning, namely the minimal amount of examples (given the level of noise r, the number of archetypes K to handle, etc.) that guarantees that the network can safely infer the archetypes from the supplied data-set. To obtain these thresholds we have to deepen the ground-state structure of the network, that is, we now handle eqs. (3.10) in the zero-noise limit β → ∞.
Having reached a relatively simple expression for m̄_µ, we can further manipulate it and try to get information about the existence of a lower-bound value for M, denoted by M_⊗, which ensures that the network has been supplied with sufficient information to learn and retrieve the archetypes.
Next, we introduce a confidence interval, ruled by Θ, and we require that the signal sustaining the archetype exceeds the data-set noise within such an interval, namely

m̄_1 − Σ_{ν>1} m̄_ν ≥ √2 Θ √( ρ Σ_{ν=1}^K m̄_ν² ).   (3.16)

In order to quantify the critical number of examples M^µ_⊗ needed for a successful learning of the archetype µ we can exploit the relation P(|Z| ≤ √2 Θ) = erf(Θ), with Z ∼ N(0, 1), where in our case the hierarchical values m̄_ℓ = d^{ℓ−1}(1 − d) are to be inserted. Thus, using the previous relation in (3.16), the following inequality must hold

M ≥ 2 Θ² ((1 − r²)/r²) (Σ_{ν=1}^K m̄_ν²) / (m̄_1 − Σ_{ν>1} m̄_ν)²,   (3.19)

and we can write the next

Proposition 2. In the noiseless limit β → ∞, the critical threshold for learning M_⊗ (in the number of required examples) depends on the data-set noise r, the dilution d, the number of archetypes K to handle (and, of course, on the amplitude of the chosen confidence interval Θ) and reads as

M_⊗ = 2 Θ² ((1 − r²)/r²) (1 − d)(1 − d^{2K}) / ( (1 + d)(1 − 2d + d^K)² ),   (3.20)

and in the plots (see Figure 5) we use Θ = 1/√2, as this choice corresponds to the fairly standard 68% confidence level of a Gaussian distribution. To quantify these thresholds for learning, in Fig. 5 we report the required number of examples to learn the first archetype (out of K = 2, 3, 4, 50, as shown in the various panels) as a function of the dilution of the network. Note the divergence of M_⊗ when approaching the critical dilution level d_c(K) = d_1, as predicted by the parallel Hebbian storage limit [2,4]: this is the crossover between the two multitasking regimes, hierarchical vs parallel; hence, solely at the dilution value d_1, there is no sharp behavior to infer and, correctly, the network cannot accomplish learning. This shines through by looking at (3.20), where the critical amount of examples to correctly infer the archetype is reported: the factor 1 − 2d + d^K appearing in its denominator vanishes, for µ = 1, when d → d_1.
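The divergence of the learning threshold is controlled by the vanishing of the signal margin 1 − 2d + d^K quoted above; the small sketch below (illustrative helper names) locates its root d_1 by bisection, e.g. for K = 3, where the analytic root is (√5 − 1)/2 ≈ 0.618:

```python
def margin(d, K):
    """Signal margin of the first archetype in the hierarchical state:
    m_1 - sum_{k>1} m_k = 1 - 2d + d**K (the factor in the denominator of M_x)."""
    return 1 - 2 * d + d ** K

def critical_dilution(K, lo=1e-9, hi=0.99, tol=1e-12):
    """Bisection for the root d_1 of the margin in (0, 1)."""
    sign_lo = margin(lo, K)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if margin(mid, K) * sign_lo > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

d1 = critical_dilution(K=3)   # analytic root of 1 - 2d + d^3: (sqrt(5) - 1) / 2
# M_x diverges as d -> d1, since the margin in the denominator vanishes
```

The upper end of the bracket is kept below 1 because d = 1 is always a trivial root of the margin.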

Ergodicity breaking: the critical phase transition
The main interest in the statistical-mechanical approach to neural networks lies in inspecting their emergent capabilities, which typically appear once ergodicity gets broken: as a consequence, finding the boundaries of validity of ergodicity is a classical starting point to deepen these aspects. To this task, hereafter we provide a systematic fluctuation analysis of the order parameters: the underlying idea is to check when, starting from the high-noise limit (β → 0, where everything is uncorrelated and simple probabilistic arguments apply straightforwardly), these fluctuations diverge, as that defines the onset of ergodicity breaking, as stated in the next

Theorem 2. The ergodic region, in the space of the control parameters (β, d, ρ), is confined to the half-plane defined by the critical line

β_c = 1/(1 − d),

whatever the entropy of the data-set ρ.
Proof. The idea of the proof is the same we used so far, namely Guerra interpolation, but applied to the rescaled fluctuations rather than directly to the statistical pressure. The rescaled fluctuations ñ_µ of the magnetizations are defined as

ñ_µ := √N (n_µ − n̄_µ).

Working within the interpolating framework defined by eq.s (3.3)-(3.4), for t ∈ (0, 1), it is a trivial exercise to show that, for any smooth function F(σ), the t-derivative of the generalized average ⟨F⟩_t can be expressed in terms of correlation functions between F and the order parameters; choosing F = ñ_µ², one thus obtains a closed differential equation for ⟨ñ_µ²⟩_t, whose Cauchy condition ⟨ñ_µ²⟩_{t=0} is computed directly on the one-body model. Evaluating ⟨ñ_µ²⟩_t at t = 1, that is, when the interpolation scheme collapses onto the original Statistical Mechanics framework, we finally get

⟨ñ_µ²⟩_{t=1} ∝ [ 1 − β(1 − d) ]^{−1},

namely the rescaled fluctuations are described by a meromorphic function whose pole lies at β(1 − d) = 1, which is the critical line reported in the statement of the theorem.

Stability analysis via standard Hessian: the phase diagram
The set of solutions of the self-consistency equations for the order parameters (3.10) provides a plethora of candidate states whose stability must be investigated to understand which solution is preferred as the control parameters are made to vary: this procedure results in picturing the phase diagrams of the network, namely plots in the space of the control parameters where different regions pertain to different macroscopic computational capabilities.
Remembering that A_{K,β,d,M,r}(n) = −β f_{K,β,d,M,r}(n) (where f_{K,β,d,M,r}(n) is the free energy of the model), in order to evaluate the stability of these solutions we need to check the sign of the second derivatives of the free energy. More precisely, we need to build the Hessian, i.e. the matrix A of these second derivatives. Then, we evaluate and diagonalize A at a point ñ representing a particular solution of the self-consistency equation (3.10): the numerical results are reported in the phase diagrams provided in Fig. 6. The matrix elements follow straightforwardly; in order to lighten the notation we set T_{K,β,ρ}(n, z) = T_K. We can now inspect the domain of stability of each possible solution of the self-consistency equation by plugging the structure of the candidate solution into (3.31).
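As a quick numerical illustration of this stability criterion, one can build the Hessian by finite differences and test for positive definiteness; the free-energy landscapes below are illustrative toys, not the network's actual free energy:

```python
import numpy as np

def numerical_hessian(f, x, eps=1e-4):
    """Central-difference Hessian of a scalar function f at point x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

def is_stable(f, x):
    # a candidate extremum is stable when all Hessian eigenvalues are positive
    return bool(np.all(np.linalg.eigvalsh(numerical_hessian(f, x)) > 0))

# toy free-energy landscapes (illustrative only)
f_min = lambda n: 0.5 * n @ n + 0.1 * np.sum(n**4)   # minimum at the origin
f_max = lambda n: -0.5 * n @ n                        # maximum at the origin

print(is_stable(f_min, np.zeros(2)), is_stable(f_max, np.zeros(2)))  # True False
```

The same recipe (evaluate the Hessian at a candidate solution, diagonalize, check the sign of the eigenvalues) is applied analytically to each candidate state in the following subsections.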

3.2.1
Ergodic state: n = n_{d,ρ,β}(0, . . ., 0). In this case the structure of the solution has the form n = m = 0, thus the Hessian matrix is diagonal and all its eigenvalues are equal; requiring the matrix to be positive definite yields a condition on (β, d). Therefore, for d > 1 − β^{-1}, the ergodic solution is stable: this scenario is reported in the phase diagrams provided in Fig. 6 as the yellow region.
We stress that this result on the ergodic region is in plain agreement with the inspection on ergodicity breaking provided in Theorem 2.

3.2.2
Pure state: n = n_{d,ρ,β}(1, 0, . . ., 0). In this case the structure of the solution has m_µ = n_µ = 0 for µ > 1, thus the only non-vanishing self-consistency equation is the one for the first order parameter, with T = tanh[β n ξ^µ (1 + z√ρ)]. It is easy to check that A becomes diagonal. Notice that these eigenvalues do not depend on K, since T does not depend on K. Requiring positivity of all the eigenvalues, we get the region in the plane (d, β^{-1}) where the pure state is stable: this corresponds to the blue region in the phase diagrams reported in Fig. 6. We stress that these pure-state solutions, namely the standard Hopfield-type ones, are never stable in the ground state (β^{-1} → 0) whenever d ≠ 0, as the multitasking setting prevails. Solely at positive values of the noise β^{-1} is this single-pattern retrieval state possible, as the role of the noise is to destabilize the weakest magnetizations of the hierarchical arrangement (vide infra).
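The pure-state self-consistency equation can be solved by plain fixed-point iteration. The sketch below assumes a simplified form n = (1 − d) E_z tanh[β n (1 + z√ρ)], which mirrors the structure of the equation in the text but is not taken verbatim from it:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(200_000)  # Monte Carlo samples for the Gaussian average

def rhs(n, beta, d, rho):
    # assumed (illustrative) form of the pure-state self-consistency equation
    return (1 - d) * np.mean(np.tanh(beta * n * (1 + np.sqrt(rho) * z)))

def solve(beta, d, rho, n0=0.9, iters=500):
    """Fixed-point iteration n <- rhs(n); converges for this contraction-like map."""
    n = n0
    for _ in range(iters):
        n = rhs(n, beta, d, rho)
    return n

print(solve(beta=10.0, d=0.1, rho=0.0))  # low noise: n close to 1 - d
print(solve(beta=0.5, d=0.1, rho=0.0))   # high noise: only the trivial solution n = 0
</```

The qualitative outcome matches the stability analysis: below a critical noise level the pure-state magnetization is sustained near 1 − d, above it the solution collapses to zero.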

Parallel state
In this case the structure of the solution is a symmetric mixture state, corresponding to a unique self-consistency equation for all µ = 1, . . ., K. The diagonal terms of A take one value, say a, while the off-diagonal ones take another, say b, and a general relationship holds between them. Such a matrix always has only two kinds of eigenvalues, namely a − b (with multiplicity K − 1) and a + (K − 1)b; thus, for the stability of the parallel state, after computing (3.39) and (3.40), we only have to check for which points in the (d, β^{-1}) plane both a − b and a + (K − 1)b are positive. The region of the phase diagrams of Fig. 6 where the parallel regime is stable is depicted in green.
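The two eigenvalue families quoted above are a standard property of any symmetric matrix with constant diagonal a and constant off-diagonal b, and are easy to verify numerically:

```python
import numpy as np

K, a, b = 5, 2.0, 0.3
# diagonal entries equal to a, off-diagonal entries equal to b
A = b * np.ones((K, K)) + (a - b) * np.eye(K)

eigvals = np.sort(np.linalg.eigvalsh(A))
# K-1 eigenvalues equal a - b, one eigenvalue equals a + (K-1)*b
print(np.allclose(eigvals[:-1], a - b))          # True
print(np.isclose(eigvals[-1], a + (K - 1) * b))  # True
```

This is why the stability check of the parallel state reduces to two scalar conditions regardless of K.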

Hierarchical state
In this case the structure of the solution has the hierarchical form n = n_{d,ρ,β}(1, d, d², . . .), and the region left untreated so far in the phase diagram, namely the white region in the plots of Fig. 6, is the room left for such a hierarchical regime.

From the Cost function to the Loss function
We finally comment on the persistence, in the present approach, of the quantifiers used to evaluate the pattern-recognition capabilities of neural networks, i.e. the Mattis magnetizations, also as quantifiers of a good learning process. The standard Cost functions used in the Statistical Mechanics of neural networks (e.g., the Hamiltonians) can be related one-to-one to the standard Loss functions used in Machine Learning (i.e. the sum-of-squared-errors functions): once the two Loss functions are introduced, minimizing the former implies minimizing the latter. Hence, if we extremize w.r.t. the neurons we perform machine retrieval (i.e. pattern recognition), while if we extremize w.r.t. the weights we perform machine learning: indeed, at least in this setting, learning and retrieval are two faces of the same coin (clearly the task here, from a machine learning perspective, is rather simple, as the network is just asked to correctly classify the examples and possibly generalize).
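For a single dense pattern the correspondence can be made fully explicit. With the normalizations L_± = ||ξ ± σ||²/(4N) one finds the identity H = 2N·L₊·L₋ − N/2, so minimizing the product of the two Loss functions minimizes the Hamiltonian; this is a toy single-pattern check (our own normalization, not the full multi-pattern expression of the text):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
xi = rng.choice([-1, 1], size=N)     # one fully informative (undiluted) pattern
sigma = rng.choice([-1, 1], size=N)  # a generic neural configuration

def hamiltonian(sigma, xi):
    # single-pattern Hebbian Cost function: H = -(xi . sigma)^2 / (2N)
    return -np.dot(xi, sigma) ** 2 / (2 * N)

def loss(sigma, xi, sign):
    # normalized squared-error Loss: L_± = ||xi ± sigma||^2 / (4N)
    return np.linalg.norm(xi + sign * sigma) ** 2 / (4 * N)

Lp, Lm = loss(sigma, xi, +1), loss(sigma, xi, -1)
# identity linking Cost and Loss: H = 2N * L+ * L-  -  N/2
print(np.isclose(hamiltonian(sigma, xi), 2 * N * Lp * Lm - N / 2))  # True
```

With L_± = (1 ± m)/2 in terms of the Mattis magnetization m, the identity follows by direct expansion, making explicit why lowering the energy and lowering the (product of) losses are the same operation.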
In Fig. 7 we inspect what happens to these Loss functions, each pertaining to one of the various archetypes, as the Cost function gets minimized: we see that, unlike in the standard Hopfield model (where solely one Loss function at a time diminishes its value), in this parallel-learning setting several Loss functions (related to different archetypes) get simultaneously lowered, as expected of a parallel learning machine.

Conclusions
Since the AGS milestones on Hebbian learning dated 1985 [9,10], namely the first comprehensive statistical mechanical theory of the Hopfield model for pattern recognition and associative memory, attractor neural networks have experienced an unprecedented growth, and the bulk of techniques developed for spin glasses in these four decades (e.g. the replica trick, the cavity method, message passing, interpolation) now acts as a prosperous cornucopia for explaining the emergent information-processing capabilities that these networks show as their control parameters are made to vary. In this regard, it is important to stress that nowadays it is mandatory to optimize AI protocols (as machine learning for complex structured data-sets is still prohibitively expensive in terms of energy consumption [38]) and, en route toward a Sustainable AI (SAI), statistical mechanics may still pave a main theoretical strand. In particular, we highlight how the knowledge of the phase diagrams related to a given neural architecture (that is, the ultimate output of the statistical mechanical approach) allows one to set "a priori" the machine in the optimal working regime for a given task, thus unveiling a pivotal role of such a methodology even for a conscious usage of AI (e.g. it is useless to force a standard Hopfield model beyond its critical storage capacity)6. Focusing on Hebbian learning, however, while the original AGS theory remains a solid pillar and a paradigmatic reference in the field, several extensions are required to keep it up to date with modern challenges. The first generalization needed is to move from a setting where the machine stores already-defined patterns (as in the standard Hopfield model) toward a more realistic learning procedure where these patterns are unknown and have to be inferred from examples: the Hebbian storage rule of AGS theory quite naturally generalizes toward both supervised and unsupervised learning prescriptions [5,13]. This enlarges the space of the control parameters from α, β (or K, N, β) of the standard Hopfield model to α, β, ρ (or K, N, β, M, r), as we now also deal with a data-set where we have M examples of mean quality r for each pattern (archetype) or, equivalently, we speak of a data-set produced at given entropy ρ. Once this is accomplished, the second strong limitation of the original AGS theory that must be relaxed is that patterns share the same length which, in particular, equals the size of the network (namely, in the standard Hopfield model there are N neurons to handle patterns whose length is exactly N for all of them): a more general scenario deals with patterns that contain different amounts of information, that is, patterns are diluted. The retrieval capabilities of the Hebbian setting at work with diluted patterns have been extensively investigated in the last decade [2,3,23,24,28,35,42,44,47] and it has been understood that, dealing with patterns containing sparse entries, the network is automatically able to handle several of them in parallel (a key property of neural networks that is not captured by standard AGS theory). However, the parallel learning of diluted patterns had not been addressed in the literature, and in this paper we
face this problem, confining this first study to the low-storage regime, that is, when the number of patterns scales at most logarithmically with the size of the network. Note that this further enlarges the space of the control parameters by introducing the dilution d: we have several control parameters because the network's information-processing capabilities are enriched w.r.t. the bare Hopfield reference7. We have shown here that if we supply to the network a data-set equipped with dilution, namely a sparse data-set whose patterns contain, on average, a fraction d of blank entries (whose value is 0) and thus a fraction (1 − d) of informative entries (whose values can be ±1), then the network spontaneously undergoes parallel learning and behaves as a multitasking associative memory able to learn, store and retrieve multiple patterns in parallel. Further, focusing on neurons, the Hamiltonian of the model plays the role of the Cost function for the neural dynamics; moving the attention to synapses, however, we have shown how the latter is one-to-one related to the standard (mean-square-error) Loss function of Machine Learning, and this proved crucial to show that, by experiencing a diluted data-set, the network lowers in parallel several Loss functions (one for each pattern it is learning from the experienced examples). For mild values of dilution, the favoured arrangement of the Mattis magnetizations is a hierarchical ordering, namely the intensities of these signals scale as power laws w.r.t. their information content, m_K ∼ d^K (1 − d), while at high values of dilution a parallel ordering prevails, where all these amplitudes collapse to the same value: the phase diagrams of these networks properly capture these different working regions. Remarkably, confined to the low-storage regime (where glassy phenomena can be neglected), the presence (or lack) of a teacher does not alter the above scenario, and the threshold for secure learning, namely the minimal required amount of examples M_⊗ (given the constraints, that is, the noise in the data-set r, the number of different archetypes K to cope with, etc.) that guarantees that the network is able to infer the archetype and thus generalize, is the same for supervised and unsupervised protocols and its value has been explicitly calculated: this is another key point toward sustainable AI. Clearly there is still a long way to go before a full statistical mechanical theory of extensive parallel processing is ready, yet this paper acts as a first step in this direction and we plan to report further developments in the near future.

A A more general sampling scenario
The way in which we add noise over the archetypes to generate the data-set in the main text (see eq. (2.9)) is a rather peculiar one as, in each example, it preserves not only the number but also the positions of the lacunae already present in the related archetype. This implies that the noise cannot affect the amplitude of the original signal, i.e. Σ_i (η^{µ,a}_i)² = Σ_i (ξ^µ_i)² holds for any a and µ, while we do expect that, for more general kinds of noise, this property is not preserved sharply.
Here we consider the case where the number of blank entries present in ξ^µ is preserved only on average in the related sample {η^{µ,a}}_{a=1,...,M}, so that the lacunae can move across the examples: this more realistic kind of noise gives rise to cumbersome (yet still analytically treatable) calculations, but it should not heavily affect the learning, storage and retrieval capabilities of these networks (as we now prove).
Specifically, here we define the new kind of examples η̃^{µ,a}_i (distinguished from the previous η^{µ,a}_i by a tilde) in the following way. Definition 10. Given K random patterns ξ^µ (µ = 1, ..., K), each of length N, with i.i.d. entries, we use these archetypes to generate M × K different examples {η̃^{µ,a}_i}_{a=1,...,M} whose entries are drawn, for i = 1, . . ., N and µ = 1, . . ., K, with coefficients depending on r, s ∈ [0, 1] (whose meaning we specify soon, vide infra).
Equation (A.2) codes for the new noise; the values of the coefficients presented in (A.3) have been chosen so that all the examples contain, on average, the same fraction d of null entries as the original archetypes. To see this, it is enough to check that the corresponding relation holds for each a = 1, . . ., M, i = 1, . . ., N and µ = 1, . . ., K. Once the data-set is defined, the cost function follows straightforwardly in the Hebbian setting as

Definition 11. Once introduced N Ising neurons σ_i = ±1 (i = 1, ..., N) and the data-set considered in the definition above, the Cost function of the multitasking Hebbian network equipped with non-dilution-preserving noise reads as in (A.5), where ρ̃ is the generalization of the data-set entropy.

Definition 12. The suitably re-normalized example magnetizations are denoted ñ_µ.

En route toward the statistical pressure, still relying on Guerra's interpolation as the underlying technique, we give the next

Definition 13. Once introduced the noise β ∈ R⁺, an interpolating parameter t ∈ (0, 1), and the K + 1 auxiliary fields J and ψ_µ (µ ∈ (1, ..., K)), the interpolating partition function related to the model defined by the Cost function (A.5) reads as in (A.9), and the interpolating statistical pressure induced by the partition function (A.9) reads as A_{β,K,M,r,s,d} = lim_{N→∞} N^{-1} E ln Z_{β,K,M,r,s,d}(t), where E = E_ξ E_{(η|ξ)}.
Remark 3. Of course, as in the model studied in the main text still with Guerra's interpolation technique, we aim to find an explicit expression (in terms of the control and order parameters of the theory) of the interpolating statistical pressure evaluated at t = 1 and J = 0.
We thus perform the computations following the same steps as in the previous investigation: the t-derivative of the interpolating pressure is computed, together with the one-body term, and the final expression is obtained as N → ∞, such that we can state the next Theorem 3. In the thermodynamic limit (N → ∞) and in the low-load regime (K/N → 0), the quenched statistical pressure of the multitasking Hebbian network equipped with non-dilution-preserving noise, whatever the presence of a teacher, reads as above, where the values ñ_µ must fulfill the self-consistency equations obtained by extremizing the statistical pressure A_{β,K,M,r,s,d}(J = 0) w.r.t. them.
Furthermore, the simplest path to obtain a self-consistency equation also for the Mattis magnetization m_µ is to consider the auxiliary field J coupled to m_µ. We do not plot these new self-consistency equations as, in the large-M limit, there are no differences w.r.t. those obtained in the main text.
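The defining property of this second noise model, that the fraction d of blanks is preserved only on average while the lacunae move around, can be checked with a quick numerical sketch. The blank/revive probabilities p and q below are illustrative choices of our own, not the coefficients (A.3) of the text:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, p = 100_000, 0.3, 0.1
q = (1 - d) * p / d   # chosen so that the mean fraction of blanks stays d

# archetype with a fraction d of blank (zero) entries
xi = rng.choice([-1, 0, 1], size=N, p=[(1 - d) / 2, d, (1 - d) / 2])

def make_example(xi):
    """Noise that moves lacunae: blank an informative entry with prob p,
    revive a blank one (with a random sign) with prob q."""
    eta = xi.copy()
    informative = xi != 0
    to_blank = informative & (rng.random(N) < p)
    to_revive = ~informative & (rng.random(N) < q)
    eta[to_blank] = 0
    eta[to_revive] = rng.choice([-1, 1], size=N)[to_revive]
    return eta

eta = make_example(xi)
print(abs(np.mean(eta == 0) - d) < 0.01)       # blank fraction preserved on average
print(bool(np.any((eta == 0) != (xi == 0))))   # but blank positions have moved
```

The balance condition (1 − d)·p = d·q makes the expected number of newly created blanks equal the expected number of revived ones, which is exactly the on-average preservation stated in (A.4).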

B On the data-set entropy ρ
In this appendix, focusing on a single generic bit, we deepen the relation between the conditional entropy H(ξ^µ_i | η^µ_i) of a given pixel i of archetype µ and the information provided by the data-set about such a pixel, namely the block (η^{µ,1}_i, η^{µ,2}_i, . . ., η^{µ,M}_i), in order to justify why we called ρ the data-set entropy in the main text. As the calculations differ slightly between the two analyzed models (the one preserving the positions of the lacunae, provided in the main text, and the generalized one given in the previous appendix), we repeat them model by model for the sake of transparency.

B.1 I: multitasking Hebbian network equipped with not-affecting-dilution noise
Let us focus on the µ-th pattern and the i-th digit, with its related block: the error probability for any single entry can be written down explicitly and, by applying the majority rule on the block, it is reduced accordingly.
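The error-probability reduction achieved by the majority rule can be checked by direct Monte Carlo sampling. The sketch below assumes a per-entry quality P(η_i = ξ_i) = (1 + r)/2, which mimics the structure of the noise but does not reproduce the exact expressions above:

```python
import numpy as np

rng = np.random.default_rng(3)
r = 0.6            # example quality: each entry is correct with prob (1 + r)/2
M = 25             # examples per archetype (odd, so no ties in the vote)
trials = 200_000

p_correct = (1 + r) / 2
# each row: M noisy copies of a +1 archetype bit
copies = np.where(rng.random((trials, M)) < p_correct, 1, -1)
majority = np.sign(copies.sum(axis=1))

single_error = 1 - p_correct
block_error = np.mean(majority != 1)
print(block_error < single_error)  # True: the majority rule suppresses the error
```

For p_correct = 0.8 and M = 25 the block error is orders of magnitude smaller than the single-entry error of 0.2, which is the mechanism behind the archetype inference threshold M_⊗.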

B.2 II: multitasking Hebbian network equipped with not-preserving-dilution noise
Let us focus on the µ-th pattern and the i-th digit, with its related block: the error probability for any single entry can again be written down explicitly. By applying the majority rule on the block, it is reduced accordingly.

C.1 Stability analysis via signal-to-noise technique
The standard signal-to-noise technique [8] is a powerful method to investigate the stability of a given neural configuration in the noiseless limit β → ∞: by requiring that each neuron is aligned with its field (the post-synaptic potential it experiences, i.e. h_i σ_i ≥ 0 ∀ i ∈ (1, ..., N)), this analysis allows one to correctly classify which solution (stemming from the self-consistency equations for the order parameters) is preferred as the control parameters are made to vary, and thus it can serve as an alternative route w.r.t. the standard study of the Hessian of the statistical pressure reported in the main text (see Sec. 3.2). In particular, a revised version of the signal-to-noise technique has recently been developed [11,12]; in this new formulation it is possible to obtain the self-consistency equations for the order parameters explicitly, so we can directly compare outcomes from the signal-to-noise analysis with outcomes from statistical mechanics. By comparing these two routes that lead to the same picture, that is, statistical mechanics and the revised signal-to-noise technique, we can better understand the working criteria of these neural networks.
We suppose that the network is in the hierarchical configuration prescribed by eq. (2.6), which we denote as σ = σ*, and we must evaluate the local field h_i(σ*) acting on the generic neuron σ_i in this configuration to check that h_i(σ*) σ*_i > 0 is satisfied for any i = 1, . . ., N: should this be the case, the configuration is stable, otherwise unstable. Focusing on the supervised setting with no loss of generality (as we already discussed that the teacher essentially plays no role in the low-storage regime) and selecting (arbitrarily) the hierarchical ordering as a test case, we start by rewriting the Hamiltonian (2.11) in a form where the local fields h_i appear explicitly. The updating rule for the neural dynamics, in the zero fast-noise limit β → +∞, reduces to a deterministic alignment of each neuron with its field. To inspect the stability of the hierarchical parallel configuration, we initialize the network in such a configuration, i.e. σ^(1) = σ*; then, following Hinton's prescription [21,48]8, the one-step iteration σ^(2) leads to an expression for the magnetization. Next, using the explicit expression of the hierarchical parallel configuration (2.6) in (C.6) and applying the central limit theorem to estimate the sums appearing in the definition of the fields, Eq. (C.5) can be recast in closed form. For large values of N, the arithmetic mean coincides with the theoretical expectation, hence we can rewrite Eq. (C.9) accordingly. We carry out the computations of the signal and noise contributions κ in Appendix C.2. As shown in Fig. 9, once a critical amount of perceived examples is collected, this expression is in very good agreement with the estimate stemming from the numerical solution of the self-consistency equations, and we can finally state the last Theorem 4.
In the zero fast-noise limit (β → +∞), if the neural configuration σ is a fixed point of the dynamics described by the sequential spin-update rule, then the order parameters must satisfy the following self-consistency equations, where we set η̄^µ = (Mr)^{-1} Σ_a η^{µ,a}.
Remark 4. The empirical evidence that, via an early-stopping criterion, we still obtain the correct solution proves a posteriori the validity of Hinton's recipe in the present setting, and it tacitly candidates statistical mechanics as a reference also for inspecting computational shortcuts.
Proof. The local fields h_i can be rewritten using the definition of n_µ; in this way the update rule can be recast accordingly. Computing the value of the n_µ order parameters at the (n + 1)-th step of the update process, if σ(ξ) is a fixed point of the dynamics we must have σ^(n+1) ≡ σ^(n). For large values of N, the arithmetic mean coincides with the theoretical expectation; therefore (C.21) reads as stated, where we used E_η = E_ξ E_{(η|ξ)}.
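The fixed-point property in the noiseless limit can also be probed numerically. The sketch below uses a simplified storage setting (patterns in place of examples, so it only illustrates the mechanism, not the supervised/unsupervised setting of the theorem): starting on a diluted pattern, the informative entries do not flip under the β → ∞ update rule at low load:

```python
import numpy as np

rng = np.random.default_rng(4)
N, K, d = 2000, 3, 0.2

# K diluted patterns (entries in {-1, 0, +1}) and the Hebbian couplings
xi = rng.choice([-1, 0, 1], size=(K, N), p=[(1 - d) / 2, d, (1 - d) / 2])
J = (xi.T @ xi) / N
np.fill_diagonal(J, 0)

def zero_noise_update(sigma):
    # beta -> +infinity rule: each neuron aligns with its local field
    h = J @ sigma
    return np.where(h != 0, np.sign(h), sigma).astype(int)

sigma = np.where(xi[0] != 0, xi[0], 1)  # start on pattern 1, blanks set to +1
sigma_new = zero_noise_update(sigma)

informative = xi[0] != 0
print(bool(np.all(sigma_new[informative] == sigma[informative])))  # True at low load
```

On the informative sites the signal contribution is of order (1 − d) while the cross-pattern noise is of order sqrt(K/N), so at low load the alignment condition h_i σ_i > 0 holds there; the blank sites, whose fields are dominated by the other patterns, are exactly the room the network uses for parallel retrieval.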

Corollary 2. Under the hypotheses of the previous theorem, if the neural configuration coincides with the parallel configuration, then the order parameters n_µ(σ) must satisfy the following self-consistency equation. Proof. We only have to insert into (C.17) the explicit form of σ*, which yields the proof.
Solving numerically this set of equations (C.28), we construct the plots presented in Fig. 10.

D.2 K = 3
Moving on to the case K = 3, by following the same steps of the previous subsection we get the analogous set of equations. In order to lighten the presentation, we report only the expressions of m̃₁ and ñ₁; the related expressions of m̃₂ (m̃₃) and ñ₂ (ñ₃) can be obtained by the simple substitutions m̃₁ ↔ m̃₂ (m̃₃) and ñ₁ ↔ ñ₂ (ñ₃) in (D.3). The numerical solution of this set of equations is depicted in Fig. 11.

E.1 Proof of Theorem 1
In this subsection we show the proof of Theorem 1. In order to prove it, we first state the following Lemma 1, which gives the t-derivative of the interpolating pressure; since the computation is lengthy but not cumbersome, we omit it.
Proposition 3. In the low-load regime and in the thermodynamic limit, the distribution of a generic order parameter X is centred at its expectation value ⟨X⟩ with vanishing fluctuations. Remark 5. We stress that in the following we use relations computed by brute force via Newton's binomial theorem. Using these relations and fixing the constants appropriately, where E = E_ξ E_{(η|ξ)}, and finally inserting (E.8) and (E.5) into (E.6), we reach the thesis.

E.2 Proof of Proposition 1
In this subsection we show the proof of Proposition 1.
Proof. For large data-sets, using the Central Limit Theorem we have Eq. (E.9), where Z_µ is a standard Gaussian variable, Z_µ ∼ N(0, 1). Replacing Eq. (E.9) in the self-consistency equation for n, namely Eq. (3.8), and applying Stein's lemma9 to recover the expression for m̃_µ, we get the large data-set equation for ñ_µ, i.e. Eq. (3.10).
We will use the relation (E.17). 9 This lemma, also known as Wick's theorem, applies to standard Gaussian variables, say J ∼ N(0, 1), and states that, for a generic function f(J) for which the two expectations E(J f(J)) and E(∂_J f(J)) both exist, E(J f(J)) = E(∂_J f(J)). To reach this result, we have also used the corresponding relation for a Gaussian variable z ∼ N(0, 1) and the truncated expression ñ_µ = m̃_µ/(1 + ρ) for the first equation in (E.17).
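Stein's lemma, as invoked in the footnote, is straightforward to verify numerically for a smooth bounded function such as f = tanh:

```python
import numpy as np

rng = np.random.default_rng(5)
Z = rng.standard_normal(2_000_000)  # samples of a standard Gaussian

f = np.tanh
f_prime = lambda z: 1 - np.tanh(z) ** 2  # derivative of tanh

# Stein's lemma: E[Z f(Z)] = E[f'(Z)] for Z ~ N(0, 1)
lhs = np.mean(Z * f(Z))
rhs = np.mean(f_prime(Z))
print(abs(lhs - rhs) < 1e-2)  # True up to Monte Carlo error
```

This is exactly the manipulation used above to trade the Gaussian factor Z_µ for a derivative of the hyperbolic tangent inside the quenched average.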

Figure 1 .
Figure 1. Numerical solutions of the two self-consistency equations (2.7) and (2.8) obtained for K = 2 (see [2]), as a function of d and for different choices of β: in the d → 0 limit the Hopfield serial retrieval is recovered (one magnetization with intensity one and the other locked at zero); for d → 1 the network ends up in the parallel regime (where all the magnetizations acquire the same value); while for intermediate values of dilution the hierarchical ordering prevails (both magnetizations are raised, but with different amplitudes).
is the one which minimizes the cost function. The hierarchical retrieval state m = (1 − d)(1, d, d², d³, . . .) can also be specified in terms of the neural configuration, as in [2].

Figure 2 .
Figure 2. We report two examples of Monte Carlo dynamics until thermalization, within the hierarchical (upper plots, dilution level d = 0.2) and parallel (lower plots, dilution level d = 0.8) scenarios respectively. These plots confirm that the picture provided by statistical mechanics is actually reached dynamically by the network. We initialize the network sharply in a pattern as a Cauchy condition (represented as the dotted blue Dirac delta peaked at the pattern in the second column) and, in the first column, we show the stationary values of the various Mattis magnetizations pertaining to different patterns, while in the second column we report their histograms obtained by sampling 1000 independent Monte Carlo simulations: starting from a sequential retrieval regime, the network ends up in a multiple-retrieval mode, hierarchical vs. parallel depending on the level of dilution in the patterns.

Figure 3 .
Figure 3. Snapshots of the cost function (upper plots, where we use the label E for energy) and of the magnetizations (lower plots) for data-sets generated by K = 2 archetypes at different entropies, in the noiseless limit β → ∞. Starting from ρ = 0.0, we see that the hierarchical regime (black lines) dominates at relatively mild dilution values (i.e., the energy pertaining to this configuration is lower w.r.t. the parallel regime), while for d → 1 the hierarchical ordering naturally collapses onto the parallel regime (red lines), where all the magnetizations acquire the same values. Further note how, by increasing the entropy of the data-set (e.g. for ρ = 0.1 and ρ = 0.4), the domain of validity of the parallel regime enlarges (much as when increasing β in the network, see Fig. 1). The vertical blue lines mark the transition between these two regimes as captured by Statistical Mechanics: it corresponds to switching from the white to the green regions of the phase diagrams of Fig. 6.

Figure 4 .
Figure 4. Behaviour of the Mattis magnetizations as more and more examples are supplied to the network. Monte Carlo numerical checks (colored dots, N = 6000) for a diluted network with r = 0.1 and K = 3 are in plain agreement with the theory: solutions of the self-consistency equations for the Mattis magnetizations reported in Corollary 1 are shown as solid lines. As dilution increases, the network behavior departs from Hopfield-like retrieval (d = 0.1), where just the blue magnetization is raised (serial pattern recognition), to the hierarchical regime (d = 0.25 and d = 0.55), where multiple patterns are simultaneously retrieved with different amplitudes, while for higher values of dilution the network naturally evolves toward the parallel regime (d = 0.75), where all the magnetizations are raised with the same strength. Note also the asymptotic agreement with the dotted lines, whose values are those predicted by the multitasking Hebbian storage [2].

Figure 5 .
Figure 5. We plot the logarithm of the critical number of examples M_⊗^1 (required to raise the first magnetization) at different loads K = 2, 3, 4, 5, as a function of the dilution of the network and for different noise values of the data-set (as shown in the legend). Note the divergent behavior of M_⊗^1 when approaching the critical dilution level d_c(K) = d₁, as predicted by the parallel Hebbian storage limit [2,4]: this is the crossover between the two multitasking regimes, hierarchical vs. parallel; hence, solely at the dilution value d₁, there is no sharp behavior to infer and, correctly, the network cannot accomplish learning. This is evident from (3.20), where the critical amount of examples needed to correctly infer the archetype is reported: its denominator reduces to 1 − 2d + d^K and, for µ = 1, it vanishes as d → d₁.

Figure 6.
Figure 6. Phase diagram in the dilution-noise (d, β^{-1}) plane for different values of K and ρ. We highlight that different regions (marked with different colors) represent different operational behaviors of the network: in yellow the ergodic solution; in light blue the pure-state solution (that is, solely one magnetization different from zero); in white the hierarchical regime (that is, several magnetizations differ from zero and they all assume different values); and in light green the parallel regime (several magnetizations differ from zero but their amplitude is the same for all).

Figure 7 .
Figure 7. Left: parallel minimization of several (mean-square-error) Loss functions L_± = ||ξ^µ ± σ||² (each pertaining to a different archetype) as the noise in the data-set r is varied. Here: M = 25, N = 10000. The horizontal gray dashed lines are the saturation levels of the Loss functions, namely 1 − d² − (1 − d)d^{µ−1}. We get r_⊗ (the vertical black line) by inversion of (3.20). Right: parallel minimization of several (mean-square-error) Loss functions L_± = ||ξ^µ ± σ||² (each pertaining to a different archetype) as the data-set size M is varied: as M grows, the simultaneous minimization of more than one Loss function takes place, in contrast with learning via standard Hebbian mechanisms, where one Loss function (dedicated to a single archetype) is minimized at a time. The orange and blue lines pertain to Loss functions of other patterns that, at these levels of dilution and noise, cannot be minimized at once with the previous ones.

Figure 8 .
Figure 8. Comparison of the numerical solutions of the self-consistency equations related to the Mattis magnetization in the two models: the upper panel refers to the first model (reported in the main text), the lower panel to the second model (deepened here). Beyond a different transient at small M, the two models behave essentially in the same way.

For i = 1, . . ., N, we are able to split, mimicking the standard signal-to-noise technique, a signal contribution κ from a noise contribution.

Figure 9 .
Figure 9. Signal-to-noise numerical inspection of the Mattis magnetizations for a diluted network with r = 0.1 and K = 3 in the hierarchical regime (at levels of pattern dilution d < d_c, as reported in the titles): we highlight the agreement, as the saturation level is reached, between the signal-to-noise analysis (orange dots) and the value of the magnetization of the first pattern found by the statistical mechanical approach (reported as a solid red line). The dashed lines represent the Hebbian storing prescriptions toward which the values of the magnetizations converge. The vertical black line depicts the critical amount of examples M_⊗ that must be experienced by the network to properly infer the archetypes: note that this value is systematically above, in M, the point where all the bifurcations have happened, hence all the magnetizations have stabilized on their hierarchical displacements.
in Appendix C.2; here we report only their values κ. As in Sec. C.1, we will present only the case µ = 1.

Figure 10 .
Figure 10. Numerical resolution of the system of equations (D.1) for K = 2: we plot the behaviour of the magnetization m versus the degree of dilution d for fixed r = 0.2 and different values of β (from right to left β = 1000, 6.66, 3.33) and ρ (from top to bottom ρ = 0.8, 0.2, 0.0). We stress that for ρ = 0.0 we recover the standard diluted model presented in Fig. 1.

Figure 11 .
Figure 11. Numerical solution of the system of equations (D.3) for K = 3: we plot the behavior of the magnetization m versus the degree of dilution d for fixed r = 0.2 and different values of β (from left to right β = 1000, 6.66, 3.33) and ρ (from top to bottom ρ = 0.8, 0.2, 0.0).