Ising model selection using ℓ 1-regularized linear regression: a statistical mechanics analysis*

Xiangming Meng; Tomoyuki Obuchi; Yoshiyuki Kabashima

doi:10.1088/1742-5468/ac9831

1. Introduction

The advent of massive data across various scientific disciplines has led to the widespread use of undirected graphical models, also known as Markov random fields (MRFs), as a tool for discovering and visualizing dependencies among covariates in multivariate data [1]. The Ising model, originally proposed in statistical physics, is one special class of binary MRFs with pairwise potentials and has been widely used in different domains such as image analysis, social networking, gene network analysis [2–7]. Among various applications, one fundamental problem of interest is called Ising model selection, which refers to recovering the underlying graph structure of the original Ising model from independent, identically distributed (i.i.d.) samples. A variety of methods have been proposed [8–18], demonstrating the possibility of successful Ising model selection even when the number of samples is smaller than that of the variables. Notably, it has been demonstrated that for the ℓ₁-regularized logistic regression (ℓ₁-LogR) [10, 16] and interaction screening (IS) [14, 15] estimators, $M=\mathcal{O}\left(\mathrm{log}\,N\right)$ samples suffice for an Ising model with N spins under certain assumptions, which is consistent with respect to (w.r.t.) previously established information-theoretic lower-bound [11]. Both ℓ₁-LogR and IS are ℓ₁-regularized M-estimators [19] with logistic and IS objective (ISO) loss functions, respectively.

In this paper, we focus on one simpler linear estimator called ℓ₁-regularized linear regression (ℓ₁-LinR) and theoretically investigate its typical learning performance using the powerful replica method [20–23] from statistical mechanics. The ℓ₁-LinR estimator, also more widely known as least absolute shrinkage and selection operator (LASSO) [24] in statistics and machine learning, is considered here mainly for two reasons. On the one hand, it is one representative example of model misspecification since the quadratic loss of ℓ₁-LinR does not match the true log-conditional-likelihood as ℓ₁-LogR, nor does it have the interaction screening property as IS. On the other hand, as one of the most popular linear estimator, ℓ₁-LinR is more computationally efficient than ℓ₁-LogR and IS, and thus it is of practical importance to investigate its learning performance for Ising model selection. Since it is difficult to obtain results for general graphs, as a first step we consider the random regular (RR) graphs ${\mathcal{G}}_{N,d,{K}_{0}}$ in the paramagnetic phase [23], where ${\mathcal{G}}_{N,d,{K}_{0}}$ denotes the ensemble of RR graphs with constant node degree d and uniform coupling strength K₀ on the edges.

1.1. Contributions

The main contributions are summarized as follows. First, we obtain an accurate estimate of the typical sample complexity of ℓ₁-LinR for Ising model selection for typical RR graphs in the paramagnetic phase, which, remarkably, has the same order as ℓ₁-LogR. Specifically, for a typical RR graph $G\in {\mathcal{G}}_{N,d,{K}_{0}}$ , using ℓ₁-LinR with a regularization parameter $0< \lambda < \mathrm{tanh}\left({K}_{0}\right)$ , one can consistently reconstruct the structure with $M > \frac{c\left(\lambda ,{K}_{0}\right)\mathrm{log}\,N}{{\lambda }^{2}}$ samples, where $c\left(\lambda ,{K}_{0}\right)=\frac{2\left(1-{\mathrm{tanh}}^{2}\left({K}_{0}\right)+d{\lambda }^{2}\right)}{1+\left(d-1\right){\mathrm{tanh}}^{2}\left({K}_{0}\right)}$ . The accuracy of our typical sample complexity prediction is verified by its excellent agreement with experimental results. To the best of our knowledge, this is the first result that provides an accurate typical sample complexity for Ising model selection. Interestingly, as $\lambda \to \mathrm{tanh}\left({K}_{0}\right)$ , a lower bound $M > \frac{2\,\mathrm{log}\,N}{{\mathrm{tanh}}^{2}\left({K}_{0}\right)}$ of the typical sample complexity is obtained, which has the same scaling as the information-theoretic lower bound $M > \frac{{c}^{\prime }\,\mathrm{log}\,N}{{K}_{0}^{2}}$ [11] for some constant c' at high temperatures (i.e. small K₀) since $\mathrm{tanh}\left({K}_{0}\right)=\mathcal{O}\left({K}_{0}\right)$ as K₀ → 0.

Second, we provide a computationally efficient method to precisely predict the typical learning performance of ℓ₁-LinR in the non-asymptotic case with moderate M, N, such as precision, recall, and residual sum of square (RSS). Such precise non-asymptotic predictions of ℓ₁-LinR for Ising model selection have been unavailable even for ℓ₁-LogR [10, 16] and IS [14, 15], nor are they the same as previous asymptotic results of ℓ₁-LinR assuming fixed α ≡ M/N [25–28]. Moreover, although our theoretical analysis is based on a tree-like structure assumption, experimental results on two dimensional (2D) grid graphs also show a fairly good agreement, indicating that our theoretical result can be a good approximation even for graphs with many loops.

Third, while this paper mainly focuses on ℓ₁-LinR, our method is readily applicable to a wide class of ℓ₁-regularized M-estimators [19], including ℓ₁-LogR [10] and IS [14, 15]. Thus, an additional technical contribution is providing a generic approach for precisely characterizing the typical learning performances of various ℓ₁-regularized M-estimators for Ising model selection. Although the replica method from statistical mechanics is non-rigorous, our results are conjectured to be correct, which is supported by their excellent agreement with the experimental results. Additionally, several technical advances we propose in this paper, e.g. the entropy term computation by averaging over the Haar measure and the modification of EOS to address the finite-size effect, might be of general interest to those who use the replica method as a tool for performance analysis.

1.2. Related works

There have been some earlier works on the analysis of Ising model selection (also known as the inverse Ising problem) using the replica method [4–7, 29] from statistical mechanics. For example, in [6], the performance of the pseudo-likelihood (PL) method [30] is studied. However, instead of graph structure learning, [6] focuses on the problem of parameter learning. Then, [7] extends the analysis to the Ising model with sparse couplings using logistic regression without regularization. The recent work [29] analyzes the performance of ℓ₂-regularized linear regression but the techniques invented there are not applicable to ℓ₁-LinR since the ℓ₁-norm breaks the rotational invariance property.

Regarding the study of ℓ₁-LinR (LASSO) under model misspecification, the past few years have seen a line of research in the field of signal processing with a specific focus on the single-index model [27, 31–34]. These studies are closely related to ours but there are several important differences. First, in our study, the covariates are generated from an Ising model rather than a Gaussian distribution. Second, we focus on model selection consistency of ℓ₁-LinR while most previous studies consider estimation consistency except [33]. However, [33] only considers the classical asymptotic regime while our analysis includes the high-dimensional setting where M ≪ N.

As far as we have searched, there is no earlier study of ℓ₁-LinR estimator for Ising model selection, though some are found for Gaussian graphical models [35, 36]. One closely related work [15] states that at high temperatures when the coupling magnitude is approaching zero, both logistic and ISO losses can be approximated by a quadratic loss. However, their claim is only restricted to the very small magnitude near zero while our analysis extends the validity range to the whole paramagnetic phase. Moreover, they evaluate the minimum number of samples necessary for consistently reconstructing 'arbitrary' Ising models, which, however, seems much larger than that actually needed. By contrast, we provide the first accurate assessment of typical sample complexity for consistently reconstructing typical samples of Ising models defined over the RR graphs. Furthermore, [15] does not provide precise predictions of the non-asymptotic learning performance as we do.

2. Background and problem setup

2.1. Ising model

Ising model is one special class of MRFs with pairwise potentials and each variable takes binary values [22, 23], which is one classical model from statistical physics. The joint probability distribution of an Ising model with N variables (spins) $\boldsymbol{s}={\left({s}_{i}\right)}_{i=0}^{N-1}\in {\left\{-1,+1\right\}}^{N}$ has the form

$\begin{equation}{P}_{\text{Ising}}\left(\boldsymbol{s}\vert {\boldsymbol{J}}^{\,\ast }\right)=\frac{1}{{Z}_{\text{Ising}}\left({\boldsymbol{J}}^{\,\ast }\right)}\mathrm{exp}\left\{\sum\limits _{i< j}{J}_{ij}^{\ast }{s}_{i}{s}_{j}\right\},\end{equation} \tag{ 1 }$

where ${Z}_{\text{Ising}}\left({\boldsymbol{J}}^{\ast }\right)={\sum }_{\boldsymbol{s}}\mathrm{exp}\left\{{\sum }_{i< j}{J}_{ij}^{\ast }{s}_{i}{s}_{j}\right\}$ is the partition function and ${\boldsymbol{J}}^{\ast }={\left({J}_{ij}^{\ast }\right)}_{i,j}$ are the original couplings, respectively. In general, there are also external fields but here they are assumed to be zero for simplicity. The structure of Ising model can be described by an undirected graph $G=\left(\mathtt{V},\mathtt{E}\right)$ , where $\mathtt{V}=\left\{0,1,\dots ,N-1\right\}$ is a collection of vertices at which the spins are assigned, and $\mathtt{E}=\left\{\left(i,j\right)\vert {J}_{ij}^{\ast }\ne 0\right\}$ is a collection of undirected edges, i.e. ${J}_{ij}^{\ast }=0$ for all pairs of $\left(i,j\right)\notin \mathtt{E}$ . For each vertex $i\in \mathtt{V}$ , its neighborhood is defined as the subset $\mathcal{N}\left(i\right)\equiv \left\{j\in \mathtt{V}\vert \left(i,j\right)\in \mathtt{E}\right\}$ .

2.2. Neighborhood-based ℓ₁-regularized linear regression (ℓ₁-LinR)

The problem of Ising model selection refers to recovering the graph G (edge set $\mathtt{E}$ ), given M i.i.d. samples ${\mathcal{D}}^{M}=\left\{{\boldsymbol{s}}^{\left(1\right)},\dots ,{\boldsymbol{s}}^{\left(M\right)}\right\}$ from the Ising model. While the maximum likelihood method has nice properties of consistency and asymptotic efficiency, it suffers from high computational complexity. To tackle this difficulty, several local learning algorithms have been proposed, notably the ℓ₁-LogR estimator [10] and IS estimator [14]. Both ℓ₁-LogR and IS optimize a regularized local cost function $\ell \left(\cdot \right)$ for each spin i.e. $\forall \enspace i\in \mathtt{V}$ ,

$\begin{equation}{\hat{\boldsymbol{J}}}_{{\backslash}i}=\underset{{\boldsymbol{J}}_{{\backslash}i}}{\mathrm{arg min}}\left[\frac{1}{M}\sum\limits _{\mu =1}^{M}\ell \left({s}_{i}^{\left(\mu \right)}{h}_{{\backslash}i}^{\left(\mu \right)}\right)+\lambda {{\Vert}{\boldsymbol{J}}_{{\backslash}i}{\Vert}}_{1}\right],\end{equation} \tag{ 2 }$

where ${h}_{{\backslash}i}^{\left(\mu \right)}\equiv {\sum }_{j\ne i}{J}_{ij}{s}_{j}^{\left(\mu \right)}$ , ${\boldsymbol{J}}_{{\backslash}i}\equiv {\left({J}_{ij}\right)}_{j(\ne i)}$ , and ${{\Vert}\cdot {\Vert}}_{1}$ denotes the ℓ₁ norm. Specifically, $\ell \left(x\right)=\mathrm{log}\left(1+{e}^{-2x}\right)$ for ℓ₁-LogR and $\ell \left(x\right)={e}^{-x}$ for IS, which correspond to the minus log conditional distribution [10] and the ISO [14], respectively. Consequently, the problem of recovering the edge set $\mathtt{E}$ is equivalently reduced to local neighborhood selection, i.e. recovering the neighborhood set $\mathcal{N}\left(i\right)$ for each vertex $i\in \mathtt{V}$ . In particular, given the estimates ${\hat{\boldsymbol{J}}}_{{\backslash}i}$ in (2), the neighborhood set of vertex i can be estimated via the nonzero coefficients, i.e.

$\begin{equation}\hat{\mathcal{N}}\left(i\right)=\left\{j\vert {\hat{J}}_{ij}\ne 0,j\in \mathtt{V}{\backslash}i\right\},\quad \forall \enspace i\in \mathtt{V}.\end{equation} \tag{ 3 }$

In this paper, we focus on one simple linear estimator, termed as the ℓ₁-LinR estimator, i.e. $\forall \enspace i\in \mathtt{V}$ ,

$\begin{equation}{\hat{\boldsymbol{J}}}_{{\backslash}i}=\underset{{\boldsymbol{J}}_{{\backslash}i}}{\mathrm{arg min}}\left[\frac{1}{2M}\sum\limits _{\mu =1}^{M}{\left({s}_{i}^{\left(\mu \right)}-{h}_{{\backslash}i}^{\left(\mu \right)}\right)}^{2}+\lambda {{\Vert}{\boldsymbol{J}}_{{\backslash}i}{\Vert}}_{1}\right],\end{equation} \tag{ 4 }$

which, recalling that ${s}_{i}^{\left(\mu \right)}\in \left\{-1,+1\right\}$ , corresponds to a quadratic loss $\ell \left(x\right)=\frac{1}{2}{\left(x-1\right)}^{2}$ in (2). The neighborhood set for each vertex $i\in \mathtt{V}$ is estimated in the same way as (3). Interestingly, the quadratic loss used in (4) implies that the postulated conditional distribution is Gaussian and thus inconsistent with the true one, which is one typical case of model misspecification. Furthermore, compared with nonlinear estimators ℓ₁-LogR and IS, the ℓ₁-LinR estimator is more efficient to implement.

3. Statistical mechanics analysis

In this section, a statistical mechanics analysis of the ℓ₁-LinR estimator is presented for typical RR graphs in the paramagnetic phase. Our analysis is applicable to any M-estimator of the form (2) and please refer to appendix A for a unified analysis, including detailed results for the ℓ₁-LogR estimator.

To characterize the structure learning performance, the precision and recall are considered:

$\begin{equation}Precision=\frac{TP}{TP+FP},\qquad Recall=\frac{TP}{TP+FN},\end{equation} \tag{ 5 }$

where TP, FP, FN denote the number of true positives, false positives, and false negatives in the estimated couplings, respectively. The concept of model selection consistent for an estimator is defined in definition 1, which is also known as the sparsistency property [10].

Definition 1. An estimator is called model selection consistent if both the associated precision and recall satisfy Precision → 1 and Recall → 1 as M → ∞.

Additionally, if one is further interested in the specific values of the estimated couplings, our analysis can also yield the residual sum of squares (RSS) for the estimated couplings. Our theoretical analysis of the learning performance builds on the statistical mechanics framework. Contrary to the probably almost correct (PAC) learning theory [37] in mathematical statistics, statistical mechanics tries to describe the typical (as defined in definition 2) behavior exactly rather than to bound the worst case which is likely to be over-pessimistic [38].

Definition 2. 'Typical' means not just most probable but in addition the probability for situations different from the typical one can be made arbitrarily small as N → ∞ [38].

Similarly, when referring to typical RR graphs, we mean tree-like RR graphs, i.e. when seen from a random node, they look like part of an infinite tree, which are typical realizations from the uniform probability distribution on the ensemble of RR graphs.

3.1. Problem formulation

For simplicity and without loss of generality, we focus on spin s₀. With a slight abuse of notation, we will drop certain subscripts in following descriptions, e.g. J _\i will be denoted as J hereafter which represents a vector rather than a matrix. The basic idea of the statistical mechanical approach is to introduce the following Hamiltonian and Boltzmann distribution induced by the loss function $\ell \left(\cdot \right)$

$\begin{equation}\mathcal{H}\left(\boldsymbol{J}\vert {\mathcal{D}}^{M}\right)=\sum\limits _{\mu =1}^{M}\ell \left({s}_{0}^{\left(\mu \right)}{h}^{\left(\mu \right)}\right)+\lambda M{{\Vert}\boldsymbol{J}{\Vert}}_{1},\end{equation} \tag{ 6 }$

$\begin{equation}P\left(\boldsymbol{J}\vert {\mathcal{D}}^{M}\right)=\frac{1}{Z}{e}^{-\beta \mathcal{H}\left(\boldsymbol{J}\vert {\mathcal{D}}^{M}\right)},\end{equation} \tag{ 7 }$

where $Z=\int d\boldsymbol{J}\,{e}^{-\beta \mathcal{H}\left(\boldsymbol{J}\vert {\mathcal{D}}^{M}\right)}$ is the partition function, and $\beta \left( > 0\right)$ is the inverse temperature. In the zero-temperature limit β → +∞, the Boltzmann distribution (7) converges to a point-wise measure on the estimator (2). The macroscopic properties of (7) can be analyzed by assessing the free energy density $f({\mathcal{D}}^{M})=-\frac{1}{N\beta }\,\mathrm{log}\,Z$ , from which, once obtained, we can evaluate averages of various quantities simply by taking its derivatives w.r.t. external fields [21]. In current case, $f({\mathcal{D}}^{M})$ depends on the predetermined randomness of ${\mathcal{D}}^{M}$ , which plays the role of quenched disorder. As N, M → ∞, $f({\mathcal{D}}^{M})$ is expected to show the self averaging property [21]: for typical datasets ${\mathcal{D}}^{M}$ , $f({\mathcal{D}}^{M})$ converges to its average over the random data ${\mathcal{D}}^{M}$ :

$\begin{equation}f=-\frac{1}{N\beta }{\left[\mathrm{log}\,Z\right]}_{{\mathcal{D}}^{M}},\end{equation} \tag{ 8 }$

where ${\left[\cdot \right]}_{{\mathcal{D}}^{M}}$ denotes expectation over ${\mathcal{D}}^{M}$ , i.e. ${\left[\cdot \right]}_{{\mathcal{D}}^{M}}={\sum }_{{\boldsymbol{s}}^{\left(1\right)},\dots ,{\boldsymbol{s}}^{\left(M\right)}}\left(\cdot \right){\prod }_{\mu =1}^{M}{P}_{\text{Ising}}\left({\boldsymbol{s}}^{\left(\mu \right)}\vert {\boldsymbol{J}}^{\,\ast }\right)$ . Consequently, one can analyze the typical performance of any ℓ₁-regularized M-estimator of the form (2) via the assessment of (8), with ℓ₁-LinR in (4) being a special case with $\ell \left(x\right)=\frac{1}{2}{\left(x-1\right)}^{2}$ .

3.2. Replica computation of the free energy density

Unfortunately, computing (8) rigorously is difficult. For practically overcoming this difficulty, we resort to the powerful replica method [20–23] from statistical mechanics, which is symbolized using the following identity

$\begin{equation}f=-\frac{1}{N\beta }{\left[\mathrm{log}\,Z\right]}_{{\mathcal{D}}^{M}}=-\underset{n\to 0}{\mathrm{lim}}\,\frac{1}{N\beta }\frac{\partial \mathrm{log}{\left[{Z}^{n}\right]}_{{\mathcal{D}}^{M}}}{\partial n}.\end{equation} \tag{ 9 }$

The basic idea is as follows. One replaces the average of log Z by that of the nth power Zⁿ which is analytically tractable for $n\in \mathbb{N}$ in the large N limit, and constructs an analytically continuable expression from $\mathbb{N}$ to $\mathbb{R}$ , then takes the limit n → 0 by using the expression. Although the replica method is not rigorous, it has been empirically verified from extensive studies in disorder systems [22, 23] and also found useful in the study of high-dimensional models in machine learning [39, 40]. For more details of the replica method, please refer to [20–23].

Specifically, with the Hamiltonian $\mathcal{H}\left(\boldsymbol{J}\vert {\mathcal{D}}^{M}\right)$ , assuming $n\in \mathbb{N}$ is a positive integer, the replicated partition function ${\left[{Z}^{n}\right]}_{{\mathcal{D}}^{M}}$ in (9) can be written as

$\begin{equation}{\left[{Z}^{n}\right]}_{{\mathcal{D}}^{M}}=\int \prod\limits _{a=1}^{n}d{\boldsymbol{J}}^{\,a}{e}^{-\beta \lambda M\sum\limits _{a=1}^{n}{{\Vert}{\boldsymbol{J}}^{\,a}{\Vert}}_{1}}{\left\{\sum\limits _{\boldsymbol{s}}{P}_{\text{Ising}}\left(\boldsymbol{s}\vert {\boldsymbol{J}}^{\,\ast }\right)\mathrm{exp}\left[-\beta \sum\limits _{a=1}^{n}\ell \left({s}_{0}{h}^{a}\right)\right]\right\}}^{M},\end{equation} \tag{ 10 }$

where ${h}^{a}={\sum }_{j}{J}_{j}^{a}{s}_{j}$ will be termed as local field hereafter, and a (and b in the following) is index variable of the replicas. The analysis below essentially depends on the distribution of h^a but it is nontrivial. To resolve it, we take a similar approach as [7, 29] and introduce the following ansatz.

Ansatz 1 (A1): $\mathrm{D}\mathrm{e}\mathrm{n}\mathrm{o}\mathrm{t}\mathrm{e}\enspace {\Psi}=\left\{j\vert j\in \mathcal{N}\left(0\right)\right\}$ and $\bar{{\Psi}}=\left\{j\vert j=1,\dots ,N-1,j\notin \mathcal{N}\left(0\right)\right\}$ as the active and inactive sets of spin s₀, respectively, then for a typical RR graph $G\in {\mathcal{G}}_{N,d,{K}_{0}}$ in the paramagnetic phase, i.e. $\left(d-1\right){\mathrm{tanh}}^{2}\left({K}_{0}\right)< 1$ , the ℓ₁-LinR estimator in (4) is a random vector determined by random realizations of ${\mathcal{D}}^{M}$ and obeys the following form

$\begin{equation}\hat{{J}_{j}}=\begin{cases}{\bar{J}}_{j}+\frac{1}{\sqrt{N}}{w}_{j},\quad \hfill & j\in {\Psi}\hfill \\ \frac{1}{\sqrt{N}}{w}_{j},\quad \hfill & j\in \bar{{\Psi}}\hfill \end{cases}\end{equation} \tag{ 11 }$

where ${\bar{J}}_{j}$ is the mean value of the estimator and w_j is a random variable which is asymptotically zero mean with variance scaled as $\mathcal{O}\left(1\right)$ .

The consistency of ansatz 1 is checked in appendix B. Under ansatz 1, the local fields h^a can be decomposed as ${h}^{a}={\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}+{h}_{w}^{a}$ where ${h}_{w}^{a}\equiv {\sum }_{j}\frac{1}{\sqrt{N}}{w}_{j}^{a}{s}_{j}$ is the 'noise' part. According to the central limit theorem (CLT), ${h}_{w}^{a}$ can be approximated as multivariate Gaussian variables, which, under the replica symmetric (RS) ansatz [21], can be fully described by two order parameters

$\begin{equation}Q\equiv \frac{1}{N}\sum\limits _{i,j}{w}_{i}^{a}{C}_{ij}^{{\backslash}0}{w}_{j}^{a},\qquad q\equiv \frac{1}{N}\sum\limits _{i,j}{w}_{i}^{a}{C}_{ij}^{{\backslash}0}{w}_{j}^{b},(a\ne b),\end{equation} \tag{ 12 }$

where ${C}^{{\backslash}0}\equiv \left\{{C}_{ij}^{{\backslash}0}\right\}$ is the covariance matrix of the original Ising model without the spin s₀. Since the difference between C^\0 and that with s₀ is not essential in the limit N → ∞, hereafter the superscript \0 will be discarded.

As shown in appendix A, for quadratic loss $\ell \left(x\right)=\frac{1}{2}{(1-x)}^{2}$ of ℓ₁-LinR, the average free energy density (9) in the limit β → ∞ can be computed as

$\begin{equation}f\left(\beta \to \infty \right)=-\mathtt{E}\mathtt{x}\mathtt{t}\mathtt{r}\left\{-\mathcal{\xi }+S\right\},\end{equation} \tag{ 13 }$

where $\mathtt{E}\mathtt{x}\mathtt{t}\mathtt{r}\left\{\cdot \right\}$ denotes the extremum operation w.r.t. relevant variables and ξ, S denote the energy and entropy terms:

$\begin{equation}S=\underset{\beta \to \infty }{\mathrm{lim}}\underset{n\to 0}{\mathrm{lim}}\,\frac{1}{N\beta }\frac{\partial }{\partial n}\mathrm{log}\,I,\end{equation} \tag{ 14 }$

$\begin{equation}I=\int \prod\limits _{a=1}^{n}d{w}^{a}\prod\limits _{a=1}^{n}{e}^{-\lambda \beta {{\Vert}{w}^{a}{\Vert}}_{1}}\delta \left(\sum\limits _{i,j}{w}_{i}^{a}{C}_{ij}{w}_{j}^{a}-NQ\right)\times \prod\limits _{a< b}\delta \left(\sum\limits _{i,j}{w}_{i}^{a}{C}_{ij}{w}_{j}^{b}-Nq\right),\end{equation} \tag{ 15 }$

$\begin{equation}\mathcal{\xi }=\frac{\alpha {\mathbb{E}}_{s,z}{\left({s}_{0}-{\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}-\sqrt{Q}z\right)}^{2}}{2\left(1+\chi \right)}+\alpha \lambda {\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert ,\end{equation} \tag{ 16 }$

where $\alpha \equiv M/N,\chi \equiv {\mathrm{lim}}_{\beta \to \infty }\,\beta \left(Q-q\right)$ , ${\mathbb{E}}_{s,z}(\cdot )$ denotes the expectation operation w.r.t. $z\sim \mathcal{N}(0,1)$ and $({s}_{0},{\boldsymbol{s}}_{{\Psi}})\sim {P}_{\text{Ising}}({s}_{0},{\boldsymbol{s}}_{{\Psi}}\vert {\boldsymbol{J}}^{\,\ast })\propto {e}^{{s}_{0}{\sum }_{j\in {\Psi}}{J}_{j}^{\ast }{s}_{j}}$ [7]. For different losses $\ell \left(\cdot \right)$ , the free energy results (13) only differ in the energy term ξ, which in general is non-analytical (e.g. logistic loss for ℓ₁-LogR) but can be solved numerically. Please refer to appendix A.3 for more details.

In contrast to the case of ℓ₂-norm in [29], the ℓ₁-norm in (15) breaks the rotational invariance property, i.e. ${{\Vert}{w}^{a}{\Vert}}_{1}\ne {{\Vert}O{w}^{a}{\Vert}}_{1}$ for general orthogonal matrix O, making it difficult to compute the entropy term S. To circumvent this difficulty, we employ an observation that, when considering the RR graph ensemble ${\mathcal{G}}_{N,d,{K}_{0}}$ as the coupling network of the Ising model, the orthogonal matrix O diagonalizing the covariance matrix C appears to be distributed from the Haar orthogonal measure [41, 42]. Thus, it is assumed that I in (15) can be replaced by its average ${\left[I\right]}_{O}$ over the Haar-distributed O:

Ansatz 2 (A2): $\mathrm{D}\mathrm{e}\mathrm{n}\mathrm{o}\mathrm{t}\mathrm{e}\enspace C\equiv {\mathbb{E}}_{\boldsymbol{s}}[\boldsymbol{s}{\boldsymbol{s}}^{T}]$ , where ${\mathbb{E}}_{\boldsymbol{s}}[\cdot ]={\sum }_{\boldsymbol{s}}{P}_{\text{Ising}}(\boldsymbol{s}\vert {\boldsymbol{J}}^{\,\ast })(\cdot ),$ as the covariance matrix of spin configurations s . Suppose that the eigendecomposition of C is C = OΛO^T, where O is the orthogonal matrix, then O can be seen as a random sample generated from the Haar orthogonal measure and thus for typical graph realizations from ${\mathcal{G}}_{N,d,{K}_{0}}$ , I in (15) is equal to the average [I]_O.

The consistency of ansatz (A2) is numerically checked in appendix C. Under ansatz (A2), the entropy term S in (14) can be alternatively computed as $\underset{n\to 0}{\mathrm{lim}}\,\frac{1}{N\beta }\frac{\partial }{\partial n}\,\mathrm{log}{\left[I\right]}_{O}$ , as shown in appendix A. Finally, under the RS ansatz, the average free energy density (9) in the limit β → ∞ reads

$\begin{equation}f\left(\beta \to \infty \right)=-\underset{\mathtt{\Theta }}{\mathrm{E}\mathrm{x}\mathrm{t}\mathrm{r}}\left\{\begin{array}{c}\hfill -\frac{\alpha }{2\left(1+\chi \right)}{\mathbb{E}}_{s,z}\left({\left({s}_{0}-{\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}-\sqrt{Q}z\right)}^{2}\right)-\lambda \alpha {\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert \hfill \\ \hfill +\left(-ER+F\eta \right){G}^{\prime }\left(-E\eta \right)+\frac{1}{2}EQ-\frac{1}{2}F\chi +\frac{1}{2}KR-\frac{1}{2}H\eta \hfill \\ \hfill -{\mathbb{E}}_{z}\underset{w}{\mathrm{min}}\left\{\frac{K}{2}{w}^{2}-\sqrt{H}zw+\frac{\lambda M}{\sqrt{N}}\left\vert w\right\vert \right\}\hfill \end{array}\right\},\end{equation} \tag{ 17 }$

where $z\sim \mathcal{N}\left(0,1\right)$ , and $G\left(x\right)$ is a function defined as

$\begin{equation}G\left(x\right)=-\frac{1}{2}\,\mathrm{log}\,x-\frac{1}{2}+\underset{{\Lambda}}{\mathtt{E}\mathtt{x}\mathtt{t}\mathtt{r}}\left\{-\frac{1}{2}\int \mathrm{log}\left({\Lambda}-\gamma \right)\rho \left(\gamma \right)d\gamma +\frac{{\Lambda}}{2}x\right\},\end{equation} \tag{ 18 }$

and $\rho \left(\gamma \right)$ is the eigenvalue distribution (EVD) of the covariance matrix C, and $\mathtt{\Theta }$ is a collection of macroscopic parameters $\mathtt{\Theta }=\left\{\chi ,Q,E,R,F,\eta ,K,H,{\left\{{\bar{J}}_{j}\right\}}_{j\in {\Psi}}\right\}$ . For details of these macroscopic parameters and $\rho \left(\gamma \right)$ , please refer to appendices A and F, respectively. Note that in (17), apart from the ratio α ≡ M/N, N and M also appear as $\lambda M/\sqrt{N}$ in the free energy result, which is different from previous results [7, 29, 39]. The reason is that, thanks to the ℓ₁-regularization term $\lambda M{{\Vert}\boldsymbol{J}{\Vert}}_{1}$ , the mean estimates ${\left\{{\bar{J}}_{j}\right\}}_{j\in {\Psi}}$ in the active set Ψ and the noise w in the inactive set $\bar{{\Psi}}$ essentially give different scaling contributions to the free energy density.

Although there are no analytic solutions, these macroscopic parameters in (17) can be obtained by numerically solving the corresponding equations of state (EOS) employing the physics terminology. Specifically, for the ℓ₁-LinR estimator, the EOS can be obtained from the extremization condition in (17) as follows (for EOS of a general M-esimator and ℓ₁-LogR, please refer to appendix A.3):

$\begin{equation}\begin{cases}E=\frac{\alpha }{\left(1+\chi \right)},\quad \hfill \\ F=\frac{\alpha }{{\left(1+\chi \right)}^{2}}\left[{\mathbb{E}}_{s}{\left({s}_{0}-{\sum }_{j\in {\Psi}}{s}_{j}{\bar{J}}_{j}\right)}^{2}+Q\right],\quad \hfill \\ R=\frac{1}{{K}^{2}}\left[\left(H+\frac{{\lambda }^{2}{M}^{2}}{N}\right)\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left(\frac{\lambda M}{\sqrt{2HN}}\right)-2\lambda M\sqrt{\frac{H}{N}}\frac{1}{\sqrt{2\pi }}{e}^{-\frac{{\lambda }^{2}{M}^{2}}{2HN}}\right],\quad \hfill \\ E\eta =-\int \frac{\rho \left(\gamma \right)}{\tilde{\Lambda }-\gamma }d\gamma ,\quad \hfill \\ Q=\frac{F}{{E}^{2}}+R\tilde{\Lambda }-\frac{\left(-ER+F\eta \right)\eta }{\int \frac{\rho \left(\gamma \right)}{{\left(\tilde{\Lambda }-\gamma \right)}^{2}}d\gamma },\quad \hfill \\ K=E\tilde{\Lambda }+\frac{1}{\eta },\quad \hfill \\ \chi =\frac{1}{E}+\eta \tilde{\Lambda },\quad \hfill \\ H=\frac{R}{{\eta }^{2}}+F\tilde{\Lambda }+\frac{\left(-ER+F\eta \right)E}{\int \frac{\rho \left(\gamma \right)}{{\left(\tilde{\Lambda }-\gamma \right)}^{2}}d\gamma },\quad \hfill \\ \eta =\frac{1}{K}\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left(\frac{\lambda M}{\sqrt{2HN}}\right),\quad \hfill \\ {\bar{J}}_{j}=\frac{\mathtt{s}\mathtt{o}\mathtt{f}\mathtt{t}\left(\mathrm{tanh}\left({K}_{0}\right),\lambda \left(1+\chi \right)\right)}{1+\left(d-1\right){\mathrm{tanh}}^{2}\left({K}_{0}\right)},\quad j\in {\Psi},\quad \hfill \end{cases}\end{equation} \tag{ 19 }$

where $\tilde{\Lambda }$ satisfying $E\eta =-\int \frac{\rho \left(\gamma \right)}{\tilde{\Lambda }-\gamma }d\gamma$ is determined by the extremization condition in (18) and $\mathtt{s}\mathtt{o}\mathtt{f}\mathtt{t}\left(z,\tau \right)=\mathtt{sign}\left(z\right){\left(\left\vert z\right\vert -\tau \right)}_{+}$ is the soft-thresholding function. Once the EOS is solved, the free energy density defined in (8) is readily obtained.

3.3. High-dimensional asymptotic result

One important result of our replica analysis is that, as derived (see appendix A.3) from the free energy result (17), the original high dimensional ℓ₁-LinR estimator (4) is decoupled into a pair of scalar estimators, one for the active set and one for the inactive set, i.e.

where ${z}_{j}\sim \mathcal{N}\left(0,1\right),j\in \bar{{\Psi}}$ are i.i.d. standard Gaussian random variables. The decoupling property asserts that, once the EOS (19) is solved, the asymptotic behavior of ℓ₁-LinR can be statistically described by a pair of simple scalar soft-thresholding estimators (see figures 1(a) and (b)).

**Figure 1.** Equivalent low-dimensional estimators for high-dimensional ℓ₁-LinR obtained from the statistical mechanics analysis. (a), (b) Diagrams of the pair of scalar estimators in equations (20) and (21). (c) A schematic description of the modified estimator in equation (25) which takes into account the finite-size effect.
Download figure:
Standard image High-resolution image

In the high-dimensional setting where N is allowed to grow as a function of M, one important question is that what is the minimum number of samples M required to achieve model selection consistency as N → ∞. Though we obtain a pair of scalar estimators (20) and (21), there are no analytical solutions to EOS (19), making it difficult to derive an explicit condition. To overcome this difficulty, as shown in appendix D, we perform a perturbation analysis of EOS (17) and obtain an asymptotic relation H ≃ F. Then, we obtain that for a RR graph $G\in {\mathcal{G}}_{N,d,{K}_{0}}$ , given M i.i.d. samples ${\mathcal{D}}^{M}$ , the ℓ₁-LinR estimator (4) can consistently recover the graph structure G as N → ∞ if

$\begin{equation}M > \frac{c\left(\lambda ,{K}_{0}\right)\mathrm{log}\,N}{{\lambda }^{2}},\quad 0< \lambda < \mathrm{tanh}\left({K}_{0}\right),\end{equation} \tag{ 22 }$

where c(λ, K₀) is a constant value dependent on the regularization parameter λ and coupling strength K₀ and a sharp prediction (as verified in section 5) is obtained as

$\begin{equation}c\left(\lambda ,{K}_{0}\right)=\frac{2\left(1-{\mathrm{tanh}}^{2}\left({K}_{0}\right)+d{\lambda }^{2}\right)}{1+\left(d-1\right){\mathrm{tanh}}^{2}\left({K}_{0}\right)}.\end{equation} \tag{ 23 }$

For details of the analysis, including the counterpart of ℓ₁-LogR, see appendix D. Consequently, we obtain the typical sample complexity of ℓ₁-LinR for Ising model selection for typical RR graphs in the paramagnetic phase.

The result in (22) is derived for ℓ₁-LinR with a fixed regularization parameter λ. Since the value of λ is upper bounded by $\mathrm{tanh}\left({K}_{0}\right)$ (otherwise false negatives occur as discussed in appendix D), a lower bound of the typical sample complexity for ℓ₁-LinR is obtained as

$\begin{equation}M > \frac{2\,\mathrm{log}\,N}{{\mathrm{tanh}}^{2}\left({K}_{0}\right)}.\end{equation} \tag{ 24 }$

Interestingly, the scaling in (24) is the same as the information-theoretic worst-case result $M > \frac{c\,\mathrm{log}\,N}{{K}_{0}^{2}}$ obtained in [11] at high temperatures (i.e. small K₀) since $\mathrm{tanh}\left({K}_{0}\right)=\mathcal{O}\left({K}_{0}\right)$ as K₀ → 0.

3.4. Non-asymptotic result for moderate M, N

In practice, it is desirable to predict the non-asymptotic performance of the ℓ₁-LinR estimator for moderate M, N. However, the scalar estimator (20) for the active set (see figure 1(a)) fails to capture the fluctuations around the mean estimates. This is because in obtaining the energy term ξ (16) of the free energy density (17), the fluctuations around the mean estimates ${\left\{{\bar{J}}_{j}\right\}}_{j\in {\Psi}}$ are averaged out by the expectation ${\mathbb{E}}_{s,z}\left(\cdot \right)$ . To address this problem, we replace ${\mathbb{E}}_{s,z}\left(\cdot \right)$ in (17) with a sample average by accounting for the finite-size effect, thus obtaining a modified estimator for the active set as follows

$\begin{equation}{\left\{{\hat{J}}_{j}\right\}}_{j\in {\Psi}}=\underset{{J}_{j,j\in {\Psi}}}{\mathrm{arg min}}\left[\frac{{\sum }_{\mu =1}^{M}{\left({s}_{0}^{\mu }-{\sum }_{j\in {\Psi}}{s}_{j}^{\mu }{J}_{j}-\sqrt{Q}{z}^{\mu }\right)}^{2}}{2\left(1+\chi \right)M}+\lambda {\sum }_{j\in {\Psi}}\left\vert {J}_{j}\right\vert \right],\end{equation} \tag{ 25 }$

where ${s}_{0}^{\mu },{s}_{j,j\in {\Psi}}^{\mu }\sim P\left({s}_{0},{\boldsymbol{s}}_{{\Psi}}\vert {\boldsymbol{J}}^{\ast }\right),\,{z}^{\mu }\sim \mathcal{N}\left(0,1\right),\mu =1\dots M$ . The modified d-dimensional estimator (25) (see figure 1(c) for a schematic) is equivalent to the scalar one (20) (figure 1(a)) as M → ∞ but it enables us to capture the fluctuations of ${\left\{{\hat{J}}_{j}\right\}}_{j\in {\Psi}}$ for moderate M. Note that due to the replacement of expectation with sample average in the free energy density (17), the EOS (19) also needs to be modified and it can be solved iteratively as sketched in algorithm 1. The details are shown in appendix E.1.

Algorithm 1. Method to solve EOS (19) together with (25).

Consequently, for moderate M, N, the non-asymptotic statistical properties of the ℓ₁-LinR estimator can be characterized by the reduced d-dimensional ℓ₁-LinR estimator (25) (figure 1(c)) and scalar estimator (21) (figure 1(b)) using MC simulations. Denote $\left\{{\hat{J}}_{j}^{t}\right\},t=1,\dots ,{T}_{\text{MC}}$ as the estimates in tth MC simulation, where ${\left\{{\hat{J}}_{j}^{t}\right\}}_{j\in {\Psi}}$ and ${\left\{{\hat{J}}_{j}^{t}\right\}}_{j\in \bar{{\Psi}}}$ are solutions of (25) and (21), and T_MC is the total number of MC simulations. Then, the Precision and Recall are computed as

$\begin{equation}Precision=\frac{1}{{T}_{\text{MC}}}\sum\limits _{t=1}^{{T}_{\text{MC}}}\frac{{{\Vert}{\hat{J}}_{j,j\in {\Psi}}^{t}{\Vert}}_{0}}{{{\Vert}{\hat{J}}_{j,j\in {\Psi}}^{t}{\Vert}}_{0}+{{\Vert}{\hat{J}}_{j,j\in \bar{{\Psi}}}^{t}{\Vert}}_{0}},\qquad Recall=\frac{1}{{T}_{\text{MC}}}\sum\limits _{t=1}^{{T}_{\text{MC}}}\frac{{{\Vert}{\hat{J}}_{j,j\in {\Psi}}^{t}{\Vert}}_{0}}{d},\end{equation} \tag{ 26 }$

where ${{\Vert}\cdot {\Vert}}_{0}$ is the ℓ₀-norm indicating the number of nonzero elements. In addition, the RSS can be computed as $\mathrm{R}\mathrm{S}\mathrm{S}={\sum }_{j}{\left\vert {\hat{J}}_{j}-{J}_{j}^{\ast }\right\vert }^{2}=\frac{1}{{T}_{\text{MC}}}{\sum }_{t=1}^{{T}_{\text{MC}}}{\sum }_{j\in {\Psi}}{\left\vert {\hat{J}}_{j}^{t}-{K}_{0}\right\vert }^{2}+R$ .

4. Discussions

It might seem surprising that, despite apparent model misspecification due to the use of quadratic loss, the ℓ₁-LinR estimator can still correctly infer the structure with the same order of sample complexity as ℓ₁-LogR. Also, our theoretical analysis implies that the idea of using linear regression for binary data is not as outrageous as one might imagine. Here we provide an intuitive explanation of its success with a discussion of its limitations.

On the average, from (4), the condition for the ℓ₁-LinR estimator is given as

$\begin{equation}\langle {s}_{0}{s}_{k}\rangle -\sum\limits _{j\ne 0}\langle {s}_{j}{s}_{k}\rangle {J}_{j}=\lambda \partial \vert {J}_{k}\vert ,\quad k=1,\dots ,N,\end{equation} \tag{ 27 }$

where ⟨⋅⟩ and ∂|J_k| represent average w.r.t. the Boltzmann distribution (7) and the sub-gradient of |J_k|, respectively. In the paramagnetic phase, ⟨s_i s_j⟩ decays in its magnitude exponentially w.r.t. the distance of sites i and j. This guarantees that once the connections J_k of sites in the first nearest neighbor set Ψ are given so that

$\begin{equation}\langle {s}_{0}{s}_{k}\rangle ={\sum }_{j\in {\Psi}}\langle {s}_{j}{s}_{k}\rangle {J}_{j}+\lambda \,\mathrm{sign}({J}_{k}),\quad \forall \enspace k\in {\Psi}\end{equation} \tag{ 28 }$

holds, the other conditions are automatically satisfied by setting all the other connections that are not from Ψ to zero. For appropriate choice of λ, (28) has solutions of $\mathrm{sign}({J}_{k}^{\ast }){J}_{k} > 0,\forall \enspace k\in {\Psi}$ . Namely ∀ k ∈ Ψ, the estimate of J_k has the same sign as the true value ${J}_{k}^{\ast }$ . This implies that on average the ℓ₁-LinR estimator can successfully recover the network structure up to the connection signs if λ is chosen appropriately.

The key of the above argument is that ⟨s_i s_j⟩ decays exponentially fast w.r.t. the distance of two sites, which does not hold after the phase transition. Thus, it is conjectured that the ℓ₁-LinR estimator will start to fail in the network recovery just at the phase transition point. However, it is worth noting that this is in fact not limited to ℓ₁-LinR: ℓ₁-LogR also exhibits similar behavior unless post-thresholding is used, as reported in [43].

5. Experimental results

In this section, we conduct numerical experiments to verify the accuracy of the theoretical analysis. The experimental procedures are as follows. First, a random graph $G\in {\mathcal{G}}_{N,d,{K}_{0}}$ is generated and the Ising model is defined on it. Then, the spin snapshots are obtained using the Metropolis–Hastings algorithm [44–46] in the same way as [7], yielding the dataset ${\mathcal{D}}^{M}$ . We randomly choose a center spin s₀ and infer its neighborhood using the ℓ₁-LinR (4) and ℓ₁-LogR [10] estimators. To obtain standard error bars, we repeat the sequence of operations 1000 times. The RR graph $G\in {\mathcal{G}}_{N,d,{K}_{0}}$ with node degree d = 3 and coupling strength K₀ = 0.4 is considered, which satisfies the paramagnetic condition $\left(d-1\right){\mathrm{tanh}}^{2}\left({K}_{0}\right)< 1$ . The active couplings ${\left\{{J}_{ij}\right\}}_{\left(i,j\right)\in \mathtt{E}}$ have the same probability of taking both signs of +1 or −1.³

We first verify the precise non-asymptotic predictions of our method described in section 3.4. Figure 2 (top) shows the replica and experimental results of RSS, Precision, Recall for N = 200 with different values of α ≡ M/N. It can be seen that for both ℓ₁-LinR and ℓ₁-LogR, there is a fairly good agreement between the theoretical predictions and experimental results, even for small N = 200 and small α (equivalently small M), verifying the correctness of the replica analysis. Interestingly, a quantitatively similar behavior between ℓ₁-LinR and ℓ₁-LogR is observed in terms of precision and recall. Regarding RSS, the two estimators actually behave differently, which can be clearly seen in figure 7 in appendix G: the RSS is much smaller for ℓ₁-LogR, which is natural since the estimates of ℓ₁-LogR are closer to the true ones due to the model match. As our theoretical analysis is based on the typical tree-like structure assumption, it is interesting to see if it is applicable to graphs with loops. To this end, we consider the 2D four-nearest neighbor grid with periodic boundary condition, which is one common loopy graph. Figure 2 (bottom) shows the results for a 15 × 15 2D grid with uniform constant coupling K₀ = 0.2. The agreement between the theoretical and numerical results is fairly good, indicating that our theoretical result can be a good approximation even for loopy graphs. More results for different values of N and λ are shown in figures 7 and 8 in appendix G.

Subsequently, the asymptotic result and sharpness of the critical scaling value ${c}_{0}\left(\lambda ,{K}_{0}\right)\equiv \frac{c\left(\lambda ,{K}_{0}\right)}{{\lambda }^{2}}$ in (22) are evaluated. First, figure 3 (left) shows comparison of ${c}_{0}\left(\lambda ,{K}_{0}\right)$ between ℓ₁-LinR and ℓ₁-LogR for the RR graph $G\in {\mathcal{G}}_{N,d,{K}_{0}}$ when d = 3, K₀ = 0.4, indicating similar behavior of ℓ₁-LogR and ℓ₁-LinR. Then, we conduct experiments for M = c log N with different values of c around ${c}_{0}\left(\lambda ,{K}_{0}\right)$ , and investigate the trend of Precision and Recall as N increases. When λ = 0.3, figure 3 (middle and right) show the results of Precision and Recall, respectively. As expected, the Precision increases consistently with N when $c > {c}_{0}\left(\lambda ,{K}_{0}\right)$ and decreases consistently with N when $c< {c}_{0}\left(\lambda ,{K}_{0}\right)$ while the Recall increases consistently and approaches to 1 as N → ∞, which verifies the sharpness of the critical scaling value prediction. The results for ℓ₁-LogR, including the case of λ = 0.1 for both ℓ₁-LinR and ℓ₁-LogR, are shown in figures 9 and 10 in appendix G.

**Figure 3.** (Left) Critical scaling value ${c}_{0}\left(\lambda ,{K}_{0}\right)\equiv \frac{c\left(\lambda ,{K}_{0}\right)}{{\lambda }^{2}}$ of ℓ₁-LinR and ℓ₁-LogR for the RR graph $G\in {\mathcal{G}}_{N,d,{K}_{0}}$ with d = 3, K₀ = 0.4. (Middle, right) Precision and recall for RR graph using ℓ₁-LinR with λ = 0.3. Experimental results are shown for N = 200, 400, 800, 1600, 3200. When $c > {c}_{0}\left(\lambda ,{K}_{0}\right)$ ( ${c}_{0}\left(\lambda =0.3,{K}_{0}\right)\approx 19.41$ in this case), the *Precision* increases consistently with N and approaches 1 as N → ∞ while it decreases consistently with N when $c< {c}_{0}\left(\lambda ,{K}_{0}\right)$ . The *Recall* increases consistently and approach to 1 as N → ∞. For more results, please refer to appendix G.
Download figure:
Standard image High-resolution image

**Figure 3.** (Left) Critical scaling value ${c}_{0}\left(\lambda ,{K}_{0}\right)\equiv \frac{c\left(\lambda ,{K}_{0}\right)}{{\lambda }^{2}}$ of ℓ₁-LinR and ℓ₁-LogR for the RR graph $G\in {\mathcal{G}}_{N,d,{K}_{0}}$ with d = 3, K₀ = 0.4. (Middle, right) Precision and recall for RR graph using ℓ₁-LinR with λ = 0.3. Experimental results are shown for N = 200, 400, 800, 1600, 3200. When $c > {c}_{0}\left(\lambda ,{K}_{0}\right)$ ( ${c}_{0}\left(\lambda =0.3,{K}_{0}\right)\approx 19.41$ in this case), the *Precision* increases consistently with N and approaches 1 as N → ∞ while it decreases consistently with N when $c< {c}_{0}\left(\lambda ,{K}_{0}\right)$ . The *Recall* increases consistently and approach to 1 as N → ∞. For more results, please refer to appendix G.
Download figure:
Standard image High-resolution image

6. Conclusion

In this paper, we provide a unified statistical mechanics framework for the analysis of typical learning performances of ℓ₁-regularizedM-estimators, ℓ₁-LinR in particular, for Ising model selection on typical paramagnetic RR graphs. Using the powerful replica method, the high-dimensional ℓ₁-regularized M-estimator is decoupled into a pair of scalar estimators, by which we obtain an accurate estimate of the typical sample complexity. It is revealed that, perhaps surprisingly, the misspecified ℓ₁-LinR estimator is model selection consistent using $M=\mathcal{O}\left(\mathrm{log}\,N\right)$ samples, which is of the same order as ℓ₁-LogR. Moreover, with a slight modification of the scalar estimator for the active set to account for the finite-size effect, we further obtain sharp predictions of the non-asymptotic behavior of ℓ₁-LinR (also ℓ₁-LogR) for moderate M, N. There is an excellent agreement between theoretical predictions and experimental results, even for graphs with many loops, which supports our findings. Several key assumptions are made in our theoretical analysis, such as the paramagnetic assumption which implies that the coupling strength should not be too large. It is worth noting that the restrictive paramagnetic assumption is not only limited to ℓ₁-LinR, but also to other low-complexity estimators like ℓ₁-LogR unless post-thresholding is used [43]. These assumptions restrict the applicability of the presented result, and thus overcoming such limitations will be an important direction for future work.

Acknowledgments

This work was supported by JSPS KAKENHI Nos. 17H00764, 18K11463, and 19H01812, and JST CREST Grant No. JPMJCR1912, Japan.

Appendix A.: Free energy density f computation

In this appendix, the detailed derivation of the average free energy density $f=-\frac{1}{N\beta }{\left[\mathrm{log}\,Z\right]}_{{\mathcal{D}}^{M}}$ in (9) using the replica method is illustrated. Our method provides a unified framework for the statistical mechanics analysis of any ℓ₁-regularized M-estimator of the form (2). As a result, for generality, in the following derivations, we first focus on a generic ℓ₁-regularized M-estimator (2) with a generic loss function $\ell \left(\cdot \right)$ . After obtaining the generic results, specific results for both the ℓ₁-LinR estimator (4) with square loss $\ell \left(x\right)=\frac{1}{2}{\left(x-1\right)}^{2}$ and the ℓ₁-LogR estimator with logistic loss $\ell \left(x\right)=\mathrm{log}\left(1+{e}^{-2x}\right)$ are provided. For the IS estimator, the results can be easily obtained by substituting $\ell \left(x\right)={e}^{-x}$ , though the specific results are not shown.

A.1. Energy term ξ of f

The key of replica method is to compute the replicated partition function ${\left[{Z}^{n}\right]}_{{\mathcal{D}}^{M}}$ . According to the definition in (10) and ansatz (A1) in section 3.2, the average replicated partition function ${\left[{Z}^{n}\right]}_{{\mathcal{D}}^{M}}$ can be re-written as

$\begin{align}\hfill {\left[{Z}^{n}\right]}_{{\mathcal{D}}^{M}}& =\int \prod\limits _{a=1}^{n}d{\boldsymbol{J}}^{\,a}{e}^{-\beta \lambda M\sum\limits _{a=1}^{n}\sum\limits _{j}\left\vert {J}_{j}^{a}\right\vert }\times {\left\{\sum\limits _{s}{P}_{\text{Ising}}\left(s\vert {\boldsymbol{J}}^{\,\ast }\right)\mathrm{exp}\left[-\beta \sum\limits _{a=1}^{n}\ell \left({s}_{0}{h}^{a}\right)\right]\right\}}^{M},\hfill \\ \hfill & \approx \int \prod\limits _{a=1}^{n}d{w}^{a}{e}^{-\beta \lambda M\left(\sum\limits _{a=1}^{n}{\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert +\sum\limits _{a=1}^{n}\frac{1}{\sqrt{N}}{{\Vert}{w}^{a}{\Vert}}_{1}\right)}\hfill \\ \hfill & \quad \times \left\{\sum\limits _{s}{P}_{\text{Ising}}\left(s\vert {\boldsymbol{J}}^{\,\ast }\right)\prod\limits _{a}\int d{h}_{w}^{a}\delta \left({h}_{w}^{a}-\frac{1}{\sqrt{N}}\sum\limits _{j\in \bar{{\Psi}}}{w}_{j}^{a}{s}_{j}\right)\right.{\left.{\ e}^{-\beta \sum\limits _{a=1}^{n}\ell \left({s}_{0}\left({\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}+{h}_{w}^{a}\right)\right)}\right\}}^{\alpha N}\hfill \\ \hfill & =\int \prod\limits _{a=1}^{n}d{w}^{a}{e}^{-\beta \lambda M\left(n{\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert +\sum\limits _{a=1}^{n}\frac{{{\Vert}{w}^{a}{\Vert}}_{1}}{\sqrt{N}}\right)}\hfill \\ \hfill & \quad \times \left\{\sum\limits _{{s}_{0},{s}_{{\Psi}}}\int \prod\limits _{a=1}^{n}d{h}_{w}^{a}P\left({s}_{0},{s}_{{\Psi}},{\left\{{h}_{w}^{a}\right\}}_{a}\vert {\boldsymbol{J}}^{\,\ast },{\left\{{w}^{a}\right\}}_{a}\right)\right.{\left.{\ e}^{-\beta \sum\limits _{a=1}^{n}\ell \left({s}_{0}\left({\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}+{h}_{w}^{a}\right)\right)}\right\}}^{\alpha N}\hfill \\ \hfill & \approx \int \prod\limits _{a=1}^{n}d{w}^{a}{e}^{-\beta \lambda M\left(n{\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert +\sum\limits _{a=1}^{n}\frac{{{\Vert}{w}^{a}{\Vert}}_{1}}{\sqrt{N}}\right)}\hfill \\ \hfill & \quad \times \left\{\right.\sum\limits _{{s}_{0},{s}_{{\Psi}}}P\left({s}_{0},{\boldsymbol{s}}_{{\Psi}}\vert {\boldsymbol{J}}^{\,\ast }\right)\int \prod\limits _{a=1}^{n}d{h}_{w}^{a}{P}_{\text{noise}}\left({\left\{{h}_{w}^{a}\right\}}_{a}\vert {\left\{{w}^{a}\right\}}_{a}\right){{\ e}^{-\beta \sum\limits _{a=1}^{n}\ell \left({s}_{0}\left({\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}+{h}_{w}^{a}\right)\right)}\left.\right\}}^{\alpha N},\hfill \end{align} \tag{ 29 }$

where $\left\{\frac{1}{\sqrt{N}}{w}_{j}^{a},j\in {\Psi}\right\}$ in the finite active set Ψ are neglected in the second line when N is large, $P\left({s}_{0},{\boldsymbol{s}}_{{\Psi}}\vert {\boldsymbol{J}}^{\,\ast }\right)={\sum }_{{\boldsymbol{s}}_{\bar{{\Psi}}}}{P}_{\text{Ising}}\left(s\vert {\boldsymbol{J}}^{\,\ast }\right)$ is the marginal distribution of s₀, s _Ψ that can be computed as [7], ${P}_{\text{noise}}\left({\left\{{h}_{w}^{a}\right\}}_{a}\vert {\left\{{w}^{a}\right\}}_{a}\right)$ is the distribution of the 'noise' part ${h}_{w}^{a}\equiv \frac{1}{\sqrt{N}}{\sum }_{j\in \bar{{\Psi}}}{w}_{j}^{a}{s}_{j}$ of the local field. In the last line, the asymptotic independence between ${h}_{w}^{a}$ and (s₀, s _Ψ) are applied as discussed in [7]. Regarding the marginal distribution $P\left({s}_{0},{\boldsymbol{s}}_{{\Psi}}\vert {\boldsymbol{J}}^{\,\ast }\right)$ , in general we have to take into account the cavity fields in the marginal distribution. In the case considered in this paper, however, the paramagnetic assumption simplifies the marginal distribution and finally it is proportional to ${e}^{{s}_{0}{\sum }_{j\in {\Psi}}{J}_{j}^{\enspace \ast }{s}_{j}}$ [7]. When Ψ has a small cardinality d, we can compute the expectation w.r.t. (s₀, s_Ψ) exactly by exhaustive enumeration. For large d, MC methods like the Metropolis–Hastings algorithm [44–46] might be used.

To proceed with the calculation, according to the central limit theorem (CLT), the noise part ${\left\{{h}_{w}^{a}\right\}}_{a=1}^{n}$ can be regarded as Gaussian variables so that ${P}_{\text{noise}}\left({\left\{{h}_{w}^{a}\right\}}_{a}\vert {\left\{{w}^{a}\right\}}_{a}\right)$ can be approximated as a multivariate Gaussian distribution. Under the RS ansatz, two auxiliary order parameters are introduced, i.e.

$\begin{equation}Q\equiv \frac{1}{N}\sum\limits _{i,j\in \bar{{\Psi}}}{w}_{i}^{a}{C}_{ij}^{{\backslash}0}{w}_{j}^{a},\end{equation} \tag{ 30 }$

$\begin{equation}q\equiv \frac{1}{N}\sum\limits _{i,j\in \bar{{\Psi}}}{w}_{i}^{a}{C}_{ij}^{{\backslash}0}{w}_{j}^{b},\quad \left(a\ne b\right),\end{equation} \tag{ 31 }$

where ${C}^{{\backslash}0}=\left\{{C}_{ij}^{{\backslash}0}\right\}$ is the covariance matrix of the original Ising model without s₀. To write the integration in terms of the order parameters Q, q, we introduce the following trivial identities

$\begin{equation}1=N\int dQ\delta \left(\sum\limits _{i,j\ne 0}{w}_{i}^{a}{C}_{ij}^{{\backslash}0}{w}_{j}^{a}-NQ\right),\quad a=1,\dots ,n\end{equation} \tag{ 32 }$

$\begin{equation}1=N\int dq\delta \left(\sum\limits _{i,j\ne 0}{w}_{i}^{a}{C}_{ij}^{{\backslash}0}{w}_{j}^{b}-Nq\right),\quad a< b,\end{equation} \tag{ 33 }$

so that ${\left[{Z}^{n}\right]}_{{\mathcal{D}}^{M}}$ in (29) can be rewritten as

$\begin{align}\hfill {\left[{Z}^{n}\right]}_{{\mathcal{D}}^{M}}& ={e}^{-\beta \lambda Mn{\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert }\int dQdq\int \prod\limits _{a=1}^{n}d{w}^{a}{e}^{-\lambda \beta \frac{M}{\sqrt{N}}\sum\limits _{a=1}^{n}{{\Vert}{w}^{a}{\Vert}}_{1}}\hfill \\ \hfill & \quad \times \prod\limits _{a=1}^{n}\delta \left(\sum\limits _{i,j}{w}_{i}^{a}{C}_{ij}^{{\backslash}0}{w}_{j}^{a}-NQ\right)\prod\limits _{a< b}\delta \left(\sum\limits _{i,j}{w}_{i}^{a}{C}_{ij}^{{\backslash}0}{w}_{j}^{b}-Nq\right)\hfill \\ \hfill & \quad \times \left\{\sum\limits _{{s}_{0},{s}_{{\Psi}}}P\left({s}_{0},{\boldsymbol{s}}_{{\Psi}}\vert {\boldsymbol{J}}^{\,\ast }\right)\int \prod\limits _{a=1}^{n}d{h}_{w}^{a}{P}_{\text{noise}}\left({\left\{{h}_{w}^{a}\right\}}_{a}\vert {\left\{{w}^{a}\right\}}_{a}\right)\right.{\left.{\ e}^{-\beta \sum\limits _{a=1}^{n}\ell \left({s}_{0}\left({\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}+{h}_{w}^{a}\right)\right)}\right\}}^{\alpha N}\hfill \end{align} \tag{ 34 }$

$\begin{equation} =\int dQdqI{e}^{M\,\mathrm{log}\,L},\end{equation} \tag{ 35 }$

where

$\begin{align}\hfill I& \equiv \int \prod\limits _{a=1}^{n}d{w}^{a}{e}^{-\lambda \beta \frac{M}{\sqrt{N}}\sum\limits _{a=1}^{n}{{\Vert}{w}^{a}{\Vert}}_{1}}\prod\limits _{a=1}^{n}\delta \left(\sum\limits _{i,j}{w}_{i}^{a}{C}_{ij}^{{\backslash}0}{w}_{j}^{a}-NQ\right)\prod\limits _{a< b}\delta \left(\sum\limits _{i,j}{w}_{i}^{a}{C}_{ij}^{{\backslash}0}{w}_{j}^{b}-Nq\right),\hfill \end{align} \tag{ 36 }$

$\begin{align}\hfill L& \equiv {e}^{-\beta \lambda n{\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert }\sum\limits _{{s}_{0},{\boldsymbol{s}}_{{\Psi}}}P\left({s}_{0},{\boldsymbol{s}}_{{\Psi}}\vert {\boldsymbol{J}}^{\,\ast }\right)\int \prod\limits _{a=1}^{n}d{h}_{w}^{a}{P}_{\text{noise}}\left({\left\{{h}_{w}^{a}\right\}}_{a}\vert {\left\{{w}^{a}\right\}}_{a}\right){e}^{-\beta \sum\limits _{a=1}^{n}\ell ({s}_{0}({\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}+{h}_{w}^{a}))}.\hfill \end{align} \tag{ 37 }$

According to CLT and (30) and (31), the noise parts ${h}_{w}^{a},a=1,\dots ,n$ follow a multivariate Gaussian distribution with zero mean (paramagnetic assumption) and covariances

$\begin{equation}{\left\langle {h}_{w}^{a}{h}_{w}^{b}\right\rangle }^{{\backslash}0}=Q{\delta }_{ab}+\left(1-{\delta }_{ab}\right)q.\end{equation} \tag{ 38 }$

Consequently, by introducing two auxiliary i.i.d. standard Gaussian random variables ${v}_{a}\sim \mathcal{N}\left(0,1\right),z\sim \mathcal{N}\left(0,1\right)$ , the noise parts ${h}_{w}^{a},a=1,\dots ,n$ can be written in a compact form

$\begin{equation}{h}_{w}^{a}=\sqrt{Q-q}{v}_{a}+\sqrt{q}z,\quad a=1,\dots ,n\end{equation} \tag{ 39 }$

so that L in (37) could be written as

$\begin{align}\hfill L& ={e}^{-\beta \lambda n{\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert }\sum\limits _{{s}_{0},{\boldsymbol{s}}_{{\Psi}}}P\left({s}_{0},{\boldsymbol{s}}_{{\Psi}}\vert {\boldsymbol{J}}^{\,\ast }\right)\int \prod\limits _{a=1}^{n}d{h}_{w}^{a}{P}_{\text{noise}}\left({\left\{{h}_{w}^{a}\right\}}_{a}\vert {\left\{{w}^{a}\right\}}_{a}\right){e}^{-\beta \sum\limits _{a=1}^{n}\ell ({s}_{0}({\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}+{h}_{w}^{a}))}\hfill \\ \hfill & ={e}^{-\beta \lambda n{\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert }\sum\limits _{{s}_{0},{\boldsymbol{s}}_{{\Psi}}}P\left({s}_{0},{\boldsymbol{s}}_{{\Psi}}\vert {\boldsymbol{J}}^{\,\ast }\right)\int \mathcal{D}z\prod\limits _{a}\mathcal{D}{v}_{a}\times {e}^{-\beta \sum\limits _{a=1}^{n}\ell \left({s}_{0}\left({\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}+\sqrt{Q-q}{v}_{a}+\sqrt{q}z\right)\right)}\hfill \\ \hfill & ={e}^{-\beta \lambda n{\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert }\sum\limits _{{s}_{0},{\boldsymbol{s}}_{{\Psi}}}P\left({s}_{0},{\boldsymbol{s}}_{{\Psi}}\vert {\boldsymbol{J}}^{\,\ast }\right)\int \mathcal{D}z\times {\left[\underset{A}{\underbrace{\int \mathcal{D}v{e}^{-\beta \ell \left({s}_{0}\left({\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}+\sqrt{Q-q}v+\sqrt{q}z\right)\right)}}}\right]}^{n}\hfill \\ \hfill & ={e}^{-\beta \lambda n{\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert }\sum\limits _{{s}_{0},{\boldsymbol{s}}_{{\Psi}}}P\left({s}_{0},{\boldsymbol{s}}_{{\Psi}}\vert {\boldsymbol{J}}^{\,\ast }\right){\mathbb{E}}_{z}\left({A}^{n}\right),\hfill \end{align} \tag{ 40 }$

where $\mathcal{D}z=\frac{dz}{\sqrt{2\pi }}{e}^{-\frac{{z}^{2}}{2}}$ . As a result, using the replica formula, we have

$\begin{align}\hfill \underset{n\to 0}{\mathrm{lim}}\,\frac{1}{n}\,\mathrm{log}\,L& =-\beta \lambda {\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert +\underset{n\to 0}{\mathrm{lim}}\,\frac{\mathrm{log}{\sum }_{{s}_{0},{\boldsymbol{s}}_{{\Psi}}}P\left({s}_{0},{\boldsymbol{s}}_{{\Psi}}\vert {\boldsymbol{J}}^{\,\ast }\right){E}_{z}\left({A}^{n}\right)}{n}\hfill \\ \hfill & =-\beta \lambda {\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert +{\mathbb{E}}_{z}\left[\sum\limits _{{s}_{0},{\boldsymbol{s}}_{{\Psi}}}P\left({s}_{0},{\boldsymbol{s}}_{{\Psi}}\vert {\boldsymbol{J}}^{\,\ast }\right)\mathrm{log}\,A\right]\hfill \\ \hfill & =-\beta \lambda {\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert +\sum\limits _{{s}_{0},{\boldsymbol{s}}_{{\Psi}}}P\left({s}_{0},{\boldsymbol{s}}_{{\Psi}}\vert {\boldsymbol{J}}^{\,\ast }\right)\int \mathcal{D}z\,\mathrm{log}\int \mathcal{D}v{e}^{-\beta \ell \left({s}_{0}\left({\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}+\sqrt{Q-q}v+\sqrt{q}z\right)\right)}\hfill \\ \hfill & =-\beta \lambda {\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert +\sum\limits _{{s}_{0},{\boldsymbol{s}}_{{\Psi}}}P\left({s}_{0},{\boldsymbol{s}}_{{\Psi}}\vert {\boldsymbol{J}}^{\,\ast }\right)\int \mathcal{D}z\,\mathrm{log}\hfill \\ \hfill & \quad \times \int \frac{dy}{\sqrt{2\pi \left(Q-q\right)}}{e}^{-\frac{{\left[y-{s}_{0}\left({\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}+\sqrt{q}z\right)\right]}^{2}}{2\left(Q-q\right)}}{e}^{-\beta \ell \left(y\right)},\hfill \end{align} \tag{ 41 }$

where in the last line, a change of variable $y={s}_{0}\left({\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}+\sqrt{Q-q}v+\sqrt{q}z\right)$ is used.

As a result, from (9), the average free energy density in the limit β → ∞ reads

$\begin{align}\hfill f\left(\beta \to \infty \right)& =\underset{\beta \to \infty }{\mathrm{lim}}\,-\frac{1}{N\beta }\left\{\underset{n\to 0}{\mathrm{lim}}\,\frac{\partial }{\partial n}\mathrm{log}\,I+M\,\underset{n\to 0}{\mathrm{lim}}\,\frac{\partial }{\partial n}\mathrm{log}\,L\right\}\hfill \\ \hfill & =-\mathtt{E}\mathtt{x}\mathtt{t}\mathtt{r}\left\{-\mathcal{\xi }+S\right\},\hfill \end{align} \tag{ 42 }$

where $\mathrm{E}\mathrm{x}\mathrm{t}\mathrm{r}\left\{\cdot \right\}$ denotes extremization w.r.t. some relevant variables, and ξ, S are the corresponding energy and entropy terms of f, respectively:

$\begin{equation}S=\underset{n\to 0}{\mathrm{lim}}\,\frac{1}{N\beta }\frac{\partial }{\partial n}\,\mathrm{log}\,I,\end{equation} \tag{ 43 }$

$\begin{align}\hfill I& =\int \prod\limits _{a=1}^{n}d{w}^{a}{e}^{-\lambda \beta \sum\limits _{a=1}^{n}{{\Vert}{w}^{a}{\Vert}}_{1}}\prod\limits _{a=1}^{n}\delta \left(\sum\limits _{i,j}{w}_{i}^{a}{C}_{ij}{w}_{j}^{a}-NQ\right)\prod\limits _{a< b}\delta \hfill \\ \hfill & \quad \times \left(\sum\limits _{i,j}{w}_{i}^{a}{C}_{ij}{w}_{j}^{b}-Nq\right),\hfill \end{align} \tag{ 44 }$

$\begin{equation}\mathcal{\xi }=\alpha \lambda {\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert +\alpha {\mathbb{E}}_{s,z}\left(\underset{y}{\mathrm{min}}\left[\frac{{\left(y-{s}_{0}\left(\sqrt{Q}z+{\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}\right)\right)}^{2}}{2\chi }+\ell \left(y\right)\right]\right),\end{equation} \tag{ 45 }$

and the relation ${\mathrm{lim}}_{\beta \to \infty }\,\beta \left(Q-q\right)\equiv \chi$ is used [6, 7]. The extremization in the free energy result (42) comes from saddle point method in the large N limit.

A.2. Entropy term S of f

To obtain the final result of free energy density, there is still one remaining entropy term S to compute, which requires the result of I (44). However, unlike the ℓ₂-norm, the ℓ₁-norm in (44) breaks the rotational invariance property, which makes the computation of I difficult and the methods in [7, 29] are no longer applicable. To address this problem, applying the Haar orthogonal ansatz (A2) in section 3.2, we employ a method to replace I with an average ${\left[I\right]}_{O}$ over the orthogonal matrix O generated from the Haar orthogonal measure.

Specifically, also under the RS ansatz, two auxiliary order parameters are introduced, i.e.

$\begin{equation}R\equiv \frac{1}{N}\sum\limits _{j}{w}_{j}^{a}{w}_{j}^{a},\end{equation} \tag{ 46 }$

$\begin{equation}r\equiv \frac{1}{N}\sum\limits _{j}{w}_{j}^{a}{w}_{j}^{b},\quad \left(a\ne b\right).\end{equation} \tag{ 47 }$

Then, by inserting the delta functions ${\prod }_{a}\delta \left({\left({w}^{a}\right)}^{T}{w}^{a}-NR\right){\prod }_{a< b}\delta \left({\left({w}^{a}\right)}^{T}{w}^{b}-Nr\right)$ , we obtain

$\begin{align}\hfill I& =\int \prod\limits _{a=1}^{n}d{w}^{a}{e}^{-\frac{\lambda \beta M}{\sqrt{N}}\sum\limits _{a=1}^{n}{{\Vert}{w}^{a}{\Vert}}_{1}}\prod\limits _{a=1}^{n}\delta \left({\left({w}^{a}\right)}^{T}C{w}^{a}-NQ\right)\prod\limits _{a< b}\delta \left({\left({w}^{a}\right)}^{T}C{w}^{b}-Nq\right)\hfill \\ \hfill & \quad \times \int dRdr\prod\limits _{a}\delta \left({\left({w}^{a}\right)}^{T}{w}^{a}-NR\right)\prod\limits _{a< b}\delta \left({\left({w}^{a}\right)}^{T}{w}^{b}-Nr\right).\hfill \end{align} \tag{ 48 }$

Moreover, replacing the original delta functions in (48) as the following identities

$\begin{equation*}\begin{cases}\delta \left({\left({w}^{a}\right)}^{T}C{w}^{a}-NQ\right)=\int d\hat{Q}{e}^{-\frac{\hat{Q}}{2}\left({\left({w}^{a}\right)}^{T}C{w}^{a}-NQ\right)},\quad \hfill \\ \delta \left({\left({w}^{a}\right)}^{T}C{w}^{b}-Nq\right)=\int d\hat{q}{e}^{\hat{q}\left({\left({w}^{a}\right)}^{T}C{w}^{b}-Nq\right)},\quad \hfill \end{cases}\end{equation*}$

and taking average over the orthogonal matrix O, after some algebra, the I is replaced with the following average ${\left[I\right]}_{O}$

$\begin{align}\hfill {\left[I\right]}_{O}& =\int dRdrd\hat{Q}d\hat{q}\prod\limits _{a=1}^{n}d{w}^{a}{e}^{-\frac{\lambda \beta M}{\sqrt{N}}\sum\limits _{a=1}^{n}{{\Vert}{w}^{a}{\Vert}}_{1}}\prod\limits _{a}\delta \left({\left({w}^{a}\right)}^{T}{w}^{a}-NR\right)\hfill \\ \hfill & \quad \times \prod\limits _{a< b}\delta \left({\left({w}^{a}\right)}^{T}{w}^{b}-Nr\right)\mathrm{exp}\left\{\frac{Nn}{2}\hat{Q}Q-\frac{Nn}{2}\left(n-1\right)\hat{q}q\right\}\times {\left[{e}^{\frac{1}{2}\,\mathtt{Tr}\left(C{L}_{n}\right)}\right]}_{O},\hfill \end{align} \tag{ 49 }$

$\begin{equation}{L}_{n}=-\left(\hat{Q}+\hat{q}\right)\sum\limits _{a=1}^{n}{w}^{a}{\left({w}^{a}\right)}^{T}+\hat{q}\left(\sum\limits _{a=1}^{n}{w}^{a}\right){\left(\sum\limits _{b=1}^{n}{w}^{b}\right)}^{T}.\end{equation} \tag{ 50 }$

To proceed with the computation, the eigendecompostion of the matrix L_n is performed. After some algebra, for the configuration of w^a that satisfies both ${\left({w}^{a}\right)}^{T}{w}^{a}=NR$ and ${\left({w}^{a}\right)}^{T}{w}^{b}=Nr$ , the eigenvalues and associated eigenvectors of matrix L_n can be calculated as follows

$\begin{equation}\begin{cases}{\lambda }_{1}=-N\left(\hat{Q}+\hat{q}-n\hat{q}\right)\left(R-r+nr\right),\quad \hfill \\ {u}_{1}=\sum\limits _{a=1}^{n}{w}^{a},\quad \hfill \\ {\lambda }_{2}=-N\left(\hat{Q}+\hat{q}\right)\left(R-r\right),\quad \hfill \\ {u}_{a}={w}^{a}-\frac{1}{n}\sum\limits _{b=1}^{n}{w}^{b},\quad a=2,\dots ,n,\quad \hfill \end{cases}\end{equation} \tag{ 51 }$

where λ₁ is the eigenvalue corresponding to the eigenvector u₁ while λ₂ is the degenerate eigenvalue corresponding to eigenvectors u_a, a = 2, ..., n. To compute ${\left[{e}^{\frac{1}{2}\,\mathtt{Tr}\left(C{L}_{n}\right)}\right]}_{O}$ , we define a function $G\left(x\right)$ as

$\begin{align}\hfill G\left(x\right)& \equiv \frac{1}{N}\,\mathrm{log}{\left[\mathrm{exp}\left(\frac{x}{2}\mathtt{Tr}\,C\left(\mathbf{1}{\mathbf{1}}^{T}\right)\right)\right]}_{O}\hfill \\ \hfill & =\underset{{\Lambda}}{\mathtt{E}\mathtt{x}\mathtt{t}\mathtt{r}}\left\{-\frac{1}{2}\int \mathrm{log}\left({\Lambda}-\gamma \right)\rho \left(\gamma \right)d\gamma +\frac{{\Lambda}}{2}x\right\}-\frac{1}{2}\,\mathrm{log}\,x-\frac{1}{2},\hfill \end{align} \tag{ 52 }$

and $\rho \left(\gamma \right)$ is the eigenvalue distribution (EVD) of C. Then, combined with (51), after some algebra, we obtain that

$\begin{align}\hfill \frac{1}{N}\,\mathrm{log}{\left[{e}^{\frac{1}{2}\,\mathtt{Tr}\left(C{L}_{n}\right)}\right]}_{O}& =G\left(-\left(\hat{Q}+\hat{q}-n\hat{q}\right)\left(R-r+nr\right)\right)\hfill \\ \hfill & \quad +\left(n-1\right)G\left(-\left(\hat{Q}+\hat{q}\right)\left(R-r\right)\right).\hfill \end{align} \tag{ 53 }$

Furthermore, replacing the original delta functions in (48) as

$\begin{equation*}\begin{cases}\delta \left({\left({w}^{a}\right)}^{T}{w}^{a}-NR\right)=\int d\hat{R}{e}^{-\frac{\hat{R}}{2}\left({\left({w}^{a}\right)}^{T}{w}^{a}-NR\right)},\quad \hfill \\ \delta \left({\left({w}^{a}\right)}^{T}{w}^{b}-Nr\right)=\int d\hat{r}{e}^{\hat{r}\left({\left({w}^{a}\right)}^{T}{w}^{b}-Nr\right)},\quad \hfill \end{cases}\end{equation*}$

we obtain

$\begin{align}\hfill {\left[I\right]}_{0}& =\int dRdrd\hat{Q}d\hat{q}d\hat{R}d\hat{r}\prod\limits _{a=1}^{n}d{w}^{a}\mathrm{exp}\left\{-\sum\limits _{a=1}^{n}\frac{\lambda \beta M}{\sqrt{N}}{{\Vert}{w}^{a}{\Vert}}_{1}-\frac{\hat{R}+\hat{r}}{2}\right.\hfill \\ \hfill & \quad \times \left.\sum\limits _{a=1}^{n}{\left({w}^{a}\right)}^{T}{w}^{a}+\frac{\hat{r}}{2}\sum\limits _{a,b}{\left({w}^{a}\right)}^{T}{w}^{b}\right\}\mathrm{exp}\left\{\frac{Nn}{2}\hat{R}R-\frac{Nn}{2}\left(n-1\right)\hat{r}r\right.\hfill \\ \hfill & \quad \left.\times +\frac{Nn}{2}\hat{Q}Q-\frac{Nn}{2}\left(n-1\right)\hat{q}q\right\}\times {\left[{e}^{\frac{1}{2}\,\mathtt{Tr}\left(C{L}_{n}\right)}\right]}_{O}.\hfill \end{align} \tag{ 54 }$

In addition, using a Gaussian integral, the following result can be linearized as

$\begin{align*}\hfill & \int \prod\limits _{a=1}^{n}d{w}^{a}\,\mathrm{exp}\left\{-\sum\limits _{a=1}^{n}\frac{\lambda \beta M}{\sqrt{N}}{{\Vert}{w}^{a}{\Vert}}_{1}-\frac{\hat{R}+\hat{r}}{2}\sum\limits _{a=1}^{n}{\left({w}^{a}\right)}^{T}{w}^{a}+\frac{\hat{r}}{2}\sum\limits _{a,b}{\left({w}^{a}\right)}^{T}{w}^{b}\right\}\hfill \\ \hfill & \qquad =\int \prod\limits _{a=1}^{n}d{w}^{a}\,\mathrm{exp}\left\{-\sum\limits _{a=1}^{n}\sum\limits _{i=1}^{N}\frac{\lambda \beta M}{\sqrt{N}}\left\vert {w}_{i}^{a}\right\vert -\frac{\hat{R}+\hat{r}}{2}\sum\limits _{a=1}^{n}\sum\limits _{i=1}^{N}{\left({w}_{i}^{a}\right)}^{2}+\frac{\hat{r}}{2}\sum\limits _{i=1}^{N}{\left(\sum\limits _{a=1}^{n}{w}_{i}^{a}\right)}^{2}\right\}\hfill \\ \hfill & \qquad =\prod\limits _{i}\int \mathcal{D}{z}_{i}\int \prod\limits _{a=1}^{n}d{w}^{a}\,\mathrm{exp}\left\{-\sum\limits _{a=1}^{n}\frac{\lambda \beta M}{\sqrt{N}}\left\vert {w}_{i}^{a}\right\vert -\frac{\hat{R}+\hat{r}}{2}\sum\limits _{a=1}^{n}{\left({w}_{i}^{a}\right)}^{2}+\sqrt{\hat{r}}{z}_{i}\sum\limits _{a}{w}_{i}^{a}\right\}\hfill \\ \hfill & \qquad =\prod\limits _{i}\int \mathcal{D}{z}_{i}{\left\{\int dw\,\mathrm{exp}\left[-\frac{\hat{R}+\hat{r}}{2}{w}_{i}^{2}+\left(\sqrt{\hat{r}}z-\frac{\lambda \beta M}{\sqrt{N}}\mathrm{sign}\left({w}_{i}\right)\right){w}_{i}\right]\right\}}^{n},\hfill \end{align*}$

where $\mathcal{D}{z}_{i}=\frac{d{z}_{i}}{\sqrt{2\pi }}{e}^{-\frac{{{z}_{i}}^{2}}{2}}$ . Consequently, the entropy term S of the free energy density f is computed as

$\begin{align*}\hfill & \underset{n\to 0}{\mathrm{lim}}\,\frac{1}{N}\frac{\partial }{\partial n}\mathrm{log}{\left[I\right]}_{O}=\left(\hat{q}\left(R-r\right)-\left(\hat{Q}+\hat{q}\right)r\right){G}^{\prime }\left(-\left(\hat{Q}+\hat{q}\right)\left(R-r\right)\right)\hfill \\ \hfill & \qquad \qquad +G\left(-\left(\hat{Q}+\hat{q}\right)\left(R-r\right)\right)+\frac{\hat{R}R}{2}+\frac{\hat{r}r}{2}+\frac{\hat{Q}Q}{2}+\frac{\hat{q}q}{2}\hfill \\ \hfill & \quad \qquad +\int Dz\,\mathrm{log}\int dw\,\mathrm{exp}\left[-\frac{\hat{R}+\hat{r}}{2}{w}^{2}+\left(\sqrt{\hat{r}}z-\frac{\lambda \beta M}{\sqrt{N}}\mathrm{sign}\left(w\right)\right)w\right].\hfill \end{align*}$

For β → ∞, according to the characteristic of the Boltzmann distribution, the following scaling relations are assumed to hold, i.e.

$\begin{equation}\begin{cases}\hat{Q}+\hat{q}\quad \hfill & \equiv \beta E\hfill \\ \hat{q}\quad \hfill & \equiv {\beta }^{2}F\hfill \\ \hat{R}+\hat{r}\quad \hfill & \equiv \beta K\hfill \\ \hat{r}\quad \hfill & \equiv {\beta }^{2}H\hfill \\ \beta \left(Q-q\right)\quad \hfill & \equiv \chi \hfill \\ \beta \left(R-r\right)\quad \hfill & \equiv \eta .\hfill \end{cases}\end{equation} \tag{ 55 }$

Finally, the entropy term is computed as

$\begin{equation}S=\left(-ER+F\eta \right){G}^{\prime }\left(-E\eta \right)+\frac{1}{2}EQ-\frac{1}{2}F\chi +\frac{1}{2}KR-\frac{1}{2}H\eta -\end{equation} \tag{ 56 }$

$\begin{equation}\int \underset{w}{\mathrm{min}}\left\{\frac{K}{2}{w}^{2}-\left(\sqrt{H}z-\frac{\lambda M}{\sqrt{N}}\mathrm{sign}\left(w\right)\right)w\right\}Dz.\end{equation} \tag{ 57 }$

A.3. Free energy density result

Combining the results (45) and (57) together, the free energy density for general loss function $\ell \left(\cdot \right)$ in the limit β → ∞ is obtained as

$\begin{equation}f\left(\beta \to \infty \right)=-\underset{\mathtt{\Theta }}{\mathrm{E}\mathrm{x}\mathrm{t}\mathrm{r}}\left\{\begin{array}{c}\hfill -\alpha {\mathbb{E}}_{s,z}\left(\underset{y}{\mathrm{min}}\left[\frac{{\left(y-{s}_{0}\left(\sqrt{Q}z+{\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}\right)\right)}^{2}}{2\chi }+\ell \left(y\right)\right]\right)-\alpha \lambda {\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert \hfill \\ \hfill +\left(-ER+F\eta \right){G}^{\prime }\left(-E\eta \right)+\frac{1}{2}EQ-\frac{1}{2}F\chi \hfill \\ \hfill +\frac{1}{2}KR-\frac{1}{2}H\eta -{\mathbb{E}}_{z}\left(\underset{w}{\mathrm{min}}\left\{\frac{K}{2}{w}^{2}-\left(\sqrt{H}z-\frac{\lambda M}{\sqrt{N}}\mathrm{sign}\left(w\right)\right)w\right\}\right)\hfill \end{array}\right\},\end{equation} \tag{ 58 }$

where the values of the parameters $\Theta =\left\{\chi ,Q,E,R,F,\eta ,K,H,{\left\{{\bar{J}}_{j}\right\}}_{j\in {\Psi}}\right\}$ can be calculated by the extremization condition, i.e. solving the equations of state (EOS). For general loss function $\ell \left(y\right)$ , the EOS for (58) is as follows

$\begin{equation}\begin{cases}\hat{y}\left(s,z,\chi ,Q,J\right)=\underset{y}{\mathrm{arg}\,\mathrm{max}}\left\{-\frac{{\left(y-{s}_{0}\left(\sqrt{Q}z+{\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}\right)\right)}^{2}}{2\chi }-\ell \left(y\right)\right\}\quad \hfill \\ E=\frac{\alpha }{\sqrt{Q}}{\mathbb{E}}_{s,z}\left({s}_{0}z\frac{d\ell \left(y\right)}{dy}{\vert }_{y=\hat{y}\left(s,z,\chi ,Q,J\right)}\right)\quad \hfill \\ F=\alpha {\mathbb{E}}_{s,z}\left({\left(\frac{d\ell \left(y\right)}{dy}{\vert }_{y=\hat{y}\left(s,z,\chi ,Q,J\right)}\right)}^{2}\right)\quad \hfill \\ R=\frac{1}{{K}^{2}}\left[\left(H+\frac{{\lambda }^{2}{M}^{2}}{N}\right)\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left(\frac{\lambda M}{\sqrt{2HN}}\right)-2\lambda M\sqrt{\frac{H}{N}}\frac{1}{\sqrt{2\pi }}{e}^{-\frac{{\lambda }^{2}{M}^{2}}{2HN}}\right]\quad \hfill \\ E\eta =-\int \frac{\rho \left(\gamma \right)}{\tilde{\Lambda }-\gamma }d\gamma \quad \hfill \\ Q=\frac{F}{{E}^{2}}+R\tilde{\Lambda }-\left(-ER+F\eta \right)\eta \frac{1}{\int \frac{\rho \left(\lambda \right)}{{\left(\tilde{\Lambda }-\lambda \right)}^{2}}d\lambda }\quad \hfill \\ K=E\tilde{\Lambda }+\frac{1}{\eta }\quad \hfill \\ \chi =\frac{1}{E}+\eta \tilde{\Lambda }\quad \hfill \\ H=\frac{R}{{\eta }^{2}}+F\tilde{\Lambda }+\left(-ER+F\eta \right)E\frac{1}{\int \frac{\rho \left(\lambda \right)}{{\left(\tilde{\Lambda }-\lambda \right)}^{2}}d\lambda }\quad \hfill \\ \eta =\frac{1}{K}\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left(\frac{\lambda M}{\sqrt{2HN}}\right)\quad \hfill \\ {\bar{J}}_{j,j\in {\Psi}}=\underset{{J}_{j,j\in {\Psi}}}{\mathrm{arg}\,\mathrm{min}}\enspace \,\left\{{\mathbb{E}}_{s,z}\left(\left[\frac{{\left(\hat{y}\left(s,z,\chi ,Q,J\right)-{s}_{0}\left(\sqrt{Q}z+{\sum }_{j\in {\Psi}}{J}_{j}{s}_{j}\right)\right)}^{2}}{2\chi }+\ell \left(\hat{y}\left(s,z,\chi ,Q,J\right)\right)\right]\right)+\lambda {\sum }_{j\in {\Psi}}\left\vert {J}_{j}\right\vert \right\}\quad \hfill \end{cases}\end{equation} \tag{ 59 }$

where $\tilde{\Lambda }$ satisfying $E\eta =-\int \frac{\rho \left(\gamma \right)}{\tilde{\Lambda }-\gamma }d\gamma$ is determined by the extremization condition in (52) combined with the free energy result (58). In general, there are no analytic solutions for the EOS (59) but it can be solved numerically.

A.3.1. Quadratic loss $\ell \left(y\right)={\left(y-1\right)}^{2}/2$

In the case of square lass $\ell \left(y\right)={\left(y-1\right)}^{2}/2$ for the ℓ₁-LinR estimator, there is an analytic solution to y in $\underset{y}{\mathrm{min}}\left[\frac{{\left(y-{s}_{0}\left(\sqrt{Q}z+{\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}\right)\right)}^{2}}{2\chi }+\ell \left(y\right)\right]$ and thus the results can be further simplified. Specifically, the free energy can be written as follows

$\begin{equation}f\left(\beta \to \infty \right)=-\underset{\mathtt{\Theta }}{\mathrm{E}\mathrm{x}\mathrm{t}\mathrm{r}}\left\{\begin{array}{c}\hfill -\frac{\alpha }{2\left(1+\chi \right)}{\mathbb{E}}_{s,z}\left[{\left({s}_{0}-{\sum }_{j\in {\Psi}}{s}_{j}{\bar{J}}_{j}-\sqrt{Q}z\right)}^{2}\right]-\alpha \lambda {\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert \hfill \\ \hfill +\left(-ER+F\eta \right){G}^{\prime }\left(-E\eta \right)+\frac{1}{2}EQ-\frac{1}{2}F\chi \hfill \\ \hfill +\frac{1}{2}KR-\frac{1}{2}H\eta -{\mathbb{E}}_{z}\left[\underset{w}{\mathrm{min}}\left\{\frac{K}{2}{w}^{2}-\left(\sqrt{H}z-\frac{\lambda M}{\sqrt{N}}\mathrm{sign}\left(w\right)\right)w\right\}\right]\hfill \end{array}\right\},\end{equation} \tag{ 60 }$

and the corresponding EOS can be written as

$\begin{equation}\begin{cases}E=\frac{\alpha }{\left(1+\chi \right)},\quad \hfill & \left(\mathrm{a}\right)\hfill \\ F=\frac{\alpha }{{\left(1+\chi \right)}^{2}}\left[{\mathbb{E}}_{s}{\left({s}_{i}-{\sum }_{j\in {\Psi}}{s}_{j}{\bar{J}}_{j}\right)}^{2}+Q\right],\quad \hfill & \left(\mathrm{b}\right)\hfill \\ R=\frac{1}{{K}^{2}}\left[\left(H+\frac{{\lambda }^{2}{M}^{2}}{N}\right)\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left(\frac{\lambda M}{\sqrt{2HN}}\right)-2\lambda M\sqrt{\frac{H}{N}}\frac{1}{\sqrt{2\pi }}{e}^{-\frac{{\lambda }^{2}{M}^{2}}{2HN}}\right],\quad \hfill & \left(\mathrm{c}\right)\hfill \\ E\eta =-\int \frac{\rho \left(\gamma \right)}{\tilde{\Lambda }-\gamma }d\gamma ,\quad \hfill & \left(\mathrm{d}\right)\hfill \\ Q=\frac{F}{{E}^{2}}+R\tilde{\Lambda }-\left(-ER+F\eta \right)\frac{\eta }{\int \frac{\rho \left(\gamma \right)}{{\left(\tilde{\Lambda }-\gamma \right)}^{2}}d\gamma },\quad \hfill & \left(\mathrm{e}\right)\hfill \\ K=E\tilde{\Lambda }+\frac{1}{\eta },\quad \hfill & \left(\mathrm{f}\right)\hfill \\ \chi =\frac{1}{E}+\eta \tilde{\Lambda },\quad \hfill & \left(\mathrm{g}\right)\hfill \\ H=\frac{R}{{\eta }^{2}}+F\tilde{\Lambda }+\left(-ER+F\eta \right)\frac{E}{\int \frac{\rho \left(\gamma \right)}{{\left(\tilde{\Lambda }-\gamma \right)}^{2}}d\gamma },\quad \hfill & \left(\mathrm{h}\right)\hfill \\ \eta =\frac{1}{K}\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left(\frac{\lambda M}{\sqrt{2HN}}\right),\quad \hfill & \left(\mathrm{i}\right)\hfill \\ {\bar{J}}_{j}=\frac{\mathtt{s}\mathtt{o}\mathtt{f}\mathtt{t}\left(\mathrm{tanh}\left({K}_{0}\right),\lambda \left(1+\chi \right)\right)}{1+\left(d-1\right){\mathrm{tanh}}^{2}\left({K}_{0}\right)},\quad j\in {\Psi},\quad \hfill & \left(\mathrm{j}\right)\hfill \end{cases}\end{equation} \tag{ 61 }$

Note that the mean estimates $\left\{{\bar{J}}_{j},j\in {\Psi}\right\}$ in (61) is obtained by solving the following reduced optimization problem

$\begin{equation}\underset{\left\{{\bar{J}}_{j}\right\}}{\mathrm{arg}\,\mathrm{min}}\left\{\frac{1}{2\left(1+\chi \right)}{\mathbb{E}}_{s,z}\left[{\left({s}_{0}-{\sum }_{j\in {\Psi}}{s}_{j}{\bar{J}}_{j}-\sqrt{Q}z\right)}^{2}\right]-\lambda {\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert \right\},\end{equation} \tag{ 62 }$

where the corresponding fixed-point equation associated with any ${\bar{J}}_{k},k\in {\Psi}$ can be written as follows

$\begin{equation}\frac{1}{1+\chi }{\mathbb{E}}_{s}\left[{s}_{k}\left({s}_{0}-{\sum }_{j\in {\Psi}}{s}_{j}{\bar{J}}_{j}\right)\right]-\lambda \,\mathtt{sign}\left({\bar{J}}_{k}\right)=0,\quad \forall \enspace k\in {\Psi},\end{equation} \tag{ 63 }$

where the $\mathtt{sign}\enspace (\cdot )$ denotes an element-wise application of the standard sign function. For a RR graph $G\in {\mathcal{G}}_{N,d,{K}_{0}}$ with degree d and coupling strength K₀, without loss of generality, assuming that all the active couplings are positive, we have ${\mathbb{E}}_{s}\left({s}_{0}{s}_{k}\right)=\mathrm{tanh}\left({K}_{0}\right),\forall \enspace k\in {\Psi}$ , and ${\mathbb{E}}_{s}\left({s}_{k}{s}_{j}\right)={\mathrm{tanh}}^{2}\left({K}_{0}\right),\,\forall \enspace k,j\in {\Psi},k\ne j$ . Given these results and thanks to the symmetry, we obtain

$\begin{equation}{\bar{J}}_{j}=\frac{\mathtt{s}\mathtt{o}\mathtt{f}\mathtt{t}\left(\mathrm{tanh}\left({K}_{0}\right),\lambda \left(1+\chi \right)\right)}{1+\left(d-1\right){\mathrm{tanh}}^{2}\left({K}_{0}\right)},\quad j\in {\Psi},\end{equation} \tag{ 64 }$

where $\mathtt{s}\mathtt{o}\mathtt{f}\mathtt{t}\left(z,\tau \right)=\mathtt{sign}\left(z\right){\left(\left\vert z\right\vert -\tau \right)}_{+}$ is the soft-thresholding function, i.e.

$\begin{equation}\mathtt{s}\mathtt{o}\mathtt{f}\mathtt{t}\left(z,\tau \right)\equiv \mathtt{sign}\left(z\right){\left(\left\vert z\right\vert -\tau \right)}_{+}\equiv \begin{cases}z-\tau ,\quad \hfill & z > \tau \hfill \\ 0,\quad \hfill & \left\vert z\right\vert \leqslant \tau \hfill \\ z+\tau ,\quad \hfill & z< -\tau .\hfill \end{cases}\end{equation} \tag{ 65 }$

On the other hand, in the inactive set $\bar{{\Psi}}$ , each component of the scaled noise estimates can be statistically described as the solution to the scalar estimator $\underset{w}{\mathrm{min}}\left\{\frac{K}{2}{w}^{2}-\left(\sqrt{H}z-\frac{\lambda M}{\sqrt{N}}\,\mathrm{sign}\left(w\right)\right)w\right\}$ in (58). Consequently, recalling the definition of w in (11), the estimates $\left\{{\hat{J}}_{j},j\in \bar{{\Psi}}\right\}$ in the inactive set $\bar{{\Psi}}$ are

$\begin{align}\hfill {\hat{J}}_{j}& =\frac{\sqrt{H}}{K\sqrt{N}}\mathtt{s}\mathtt{o}\mathtt{f}\mathtt{t}\left({z}_{j},\frac{\lambda M}{\sqrt{HN}}\right)\hfill \\ \hfill & =\underset{{J}_{j}}{\mathrm{arg}\,\mathrm{min}}\left[\frac{1}{2}{\left({J}_{j}-\frac{1}{K}\sqrt{\frac{H}{N}}{z}_{j}\right)}^{2}+\frac{\lambda M}{KN}\left\vert {J}_{j}\right\vert \right],\quad j\in \bar{{\Psi}},\hfill \end{align} \tag{ 66 }$

which ${z}_{j}\sim \mathcal{N}\left(0,1\right),j\in \bar{{\Psi}}$ are i.i.d. random Gaussian noise.

Consequently, it can be seen that from (64) and (66), statistically, the ℓ₁-LinR estimator is decoupled into two scalar thresholding estimators for the active set Ψ and inactive set $\bar{{\Psi}}$ , respectively.

A.3.2. Logistic loss $\ell \left(y\right)=\mathrm{log}\left(1+{e}^{-2y}\right)$

In the case of logistic lass $\ell \left(y\right)=\mathrm{log}\left(1+{e}^{-2y}\right)$ for the ℓ₁-LogR estimator, however, there is no analytic solution to y in $\underset{y}{\mathrm{min}}\left[\frac{{\left(y-{s}_{0}\left(\sqrt{Q}z+{\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}\right)\right)}^{2}}{2\chi }+\ell \left(y\right)\right]$ and we have to solve it together iteratively with other parameters Θ. After some algebra, we obtain the EOS for the ℓ₁-LogR estimator:

$\begin{equation}\begin{cases}\frac{\hat{y}\left(s,z,\chi ,Q,J\right)-{s}_{0}\left(\sqrt{Q}z+{\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}\right)}{\chi }=1-\mathrm{tanh}\left(\hat{y}\left(s,z,\chi ,Q,J\right)\right),\quad \hfill \\ E=\alpha {\mathbb{E}}_{s,z}\left(\frac{{s}_{0}z}{\sqrt{Q}}\mathrm{tanh}\left(\hat{y}\left(S,z,\chi ,Q,J\right)\right)\right),\quad \hfill \\ F=\alpha {\mathbb{E}}_{s,z}\left({\left(1-\mathrm{tanh}\left(\hat{y}\left(S,z,\chi ,Q,J\right)\right)\right)}^{2}\right),\quad \hfill \\ R=\frac{1}{{K}^{2}}\left[\left(H+\frac{{\lambda }^{2}{M}^{2}}{N}\right)\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left(\frac{\lambda M}{\sqrt{2HN}}\right)-2\lambda M\sqrt{\frac{H}{N}}\frac{1}{\sqrt{2\pi }}{e}^{-\frac{{\lambda }^{2}{M}^{2}}{2HN}}\right],\quad \hfill \\ E\eta =-\int \frac{\rho \left(\gamma \right)}{\tilde{\Lambda }-\gamma }d\gamma ,\quad \hfill \\ Q=\frac{F}{{E}^{2}}+R\tilde{\Lambda }-\left(-ER+F\eta \right)\eta \frac{1}{\int \frac{\rho \left(\lambda \right)}{{\left(\tilde{\Lambda }-\lambda \right)}^{2}}d\lambda },\quad \hfill \\ K=E\tilde{\Lambda }+\frac{1}{\eta },\quad \hfill \\ \chi =\frac{1}{E}+\eta \tilde{\Lambda },\quad \hfill \\ H=\frac{R}{{\eta }^{2}}+F\tilde{\Lambda }+\left(-ER+F\eta \right)E\frac{1}{\int \frac{\rho \left(\lambda \right)}{{\left(\tilde{\Lambda }-\lambda \right)}^{2}}d\lambda },\quad \hfill \\ \eta =\frac{1}{K}\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left(\frac{\lambda M}{\sqrt{2HN}}\right),\quad \hfill \\ {\bar{J}}_{j}=\frac{\mathtt{s}\mathtt{o}\mathtt{f}\mathtt{t}\left({\mathbb{E}}_{s,z}\left(\hat{y}\left(s,z,\chi ,Q,J\right){s}_{0}{\sum }_{j\in {\Psi}}{s}_{j}\right),\lambda d\chi \right)}{d\left(1+\left(d-1\right){\mathrm{tanh}}^{2}\left({K}_{0}\right)\right)},\quad j\in {\Psi}.\quad \hfill \end{cases}\end{equation} \tag{ 67 }$

In the active set Ψ, the mean estimates $\left\{{\bar{J}}_{j},j\in {\Psi}\right\}$ can be obtained by solving a reduced ℓ₁-regularized optimization problem

$\begin{align}\hfill & \underset{{\left\{{\bar{J}}_{j}\right\}}_{j\in {\Psi}}}{\mathrm{min}}\left\{{\mathbb{E}}_{s,z}\left(\underset{y}{\mathrm{min}}\left[\frac{{\left(y-{s}_{0}\left(\sqrt{Q}z+{\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}\right)\right)}^{2}}{2\chi }+\mathrm{log}\left(1+{e}^{-2y}\right)\right]\right)\right.\left.+\lambda {\sum }_{j\in {\Psi}}\left\vert {\bar{J}}_{j}\right\vert \right\}.\hfill \end{align} \tag{ 68 }$

In contrast to the ℓ₁-LinR estimator, the mean estimates $\left\{{\bar{J}}_{j},j\in {\Psi}\right\}$ in (68) for the ℓ₁-LogR estimator do not have analytic solutions and also have to be solved numerically. For a RR graph $G\in {\mathcal{G}}_{N,d,{K}_{0}}$ with degree d and coupling strength K₀, after some algebra, the corresponding fixed-point equations for $\left\{{\bar{J}}_{j}=J,j\in {\Psi}\right\}$ are obtained as follows

$\begin{equation}J=\frac{\mathtt{s}\mathtt{o}\mathtt{f}\mathtt{t}\left({\mathbb{E}}_{s,z}\left(\hat{y}\left(s,z,\chi ,Q,J\right){s}_{0}{\sum }_{j\in {\Psi}}{s}_{j}\right),\lambda d\chi \right)}{d\left(1+\left(d-1\right){\mathrm{tanh}}^{2}\left({K}_{0}\right)\right)},\end{equation} \tag{ 69 }$

which can be solved iteratively.

The estimates in the inactive set $\bar{{\Psi}}$ are the same as (66) that of ℓ₁-LinR, which can be described by a scalar theresholding estimator once the EOS is solved.

Appendix B.: Check the consistency of ansatz (A1)

To check the consistency of ansatz (A1), first we categorize the estimators based on the distance or generation from the focused spin s₀. Considering the original Ising model whose coupling network is a tree-like graph, we can naturally define generations of the spins according to the distance from the focused spin s₀. We categorize the spins directly connected to s₀ as the first generation and denote the corresponding index set as ${{\Omega}}_{1}=\left\{i\vert {J}_{i}^{\ast }\ne 0,i\in \left\{1,\dots ,N-1\right\}\right\}$ . Each spin in Ω₁ is connected to some other spins except for s₀, and those spins constitute the second generation and we denote its index set as Ω₂. This recursive construction of generations can be unambiguously continued on the tree-like graph, and we denote the index set of the gth generation from spin s₀ as Ω_g. The overall construction of generations is graphically represented in figure 4. Generally, assume that the set of nonzero values of the ℓ₁-LinR estimator is denoted as ${\Psi}=\left\{{{\Omega}}_{1},\dots ,{{\Omega}}_{g}\right\}$ . Then, ansatz (A1) means that the correct active set of the mean estimates is ${\Psi}=\left\{{{\Omega}}_{1}\right\}$ .

**Figure 4.** Schematic of generations of spins. In general, the gth generation of spin s₀ is denoted as Ω_g, whose distance from spin s₀ is g.
Download figure:
Standard image High-resolution image

To verify this, we examine the values of mean estimates based on (60). Due to the symmetry, it is expected that for each a = 1, ..., g, the values of the mean estimates ${\bar{J}}_{j\in {{\Omega}}_{a}}={J}_{a}$ are identical to each other within the same set Ω_a, a = 1...g. In addition, if the solutions satisfy ansatz (A1) in (11), i.e. J₁ = J, J_a = 0, a ⩾ 2, from (60) we obtain

$\begin{equation}\begin{cases}\frac{1}{1+\chi }\left[\mathrm{tanh}\left({K}_{0}\right)-\left(1+\left(d-1\right){\mathrm{tanh}}^{2}\left({K}_{0}\right)\right)J\right]-\lambda =0,\quad \hfill & j\in {{\Omega}}_{1};\hfill \\ \left\vert \frac{1}{1+\chi }\left[{\mathrm{tanh}}^{a}\left({K}_{0}\right)-{\mathrm{tanh}}^{a-1}\left({K}_{0}\right)\left(1+\left(d-1\right){\mathrm{tanh}}^{2}\left({K}_{0}\right)\right)J\right]\right\vert \leqslant \lambda ,\quad \hfill & j\in {{\Omega}}_{a},a\geqslant 2,\hfill \end{cases}\end{equation} \tag{ 70 }$

where the result ${\mathbb{E}}_{s}\left({s}_{i}{s}_{j}\right)={\mathrm{tanh}}^{{d}_{0}}\left({K}_{0}\right)$ is used for any two spins s_i, s_j whose distance is d₀ in the RR graph $G\in {\mathcal{G}}_{N,d,{K}_{0}}$ . Note that the solution of the first equation in (70) automatically satisfies the second equation (sub-gradient condition) since $\left\vert \mathrm{tanh}\left({K}_{0}\right)\right\vert \leqslant 1$ , which indicates that J₁ = J, J_a = 0, a ⩾ 2 is one valid solution. Moreover, the convexity of the quadratic loss function indicates that this is the unique and correct solution, which checks the ansatz (A1).

Appendix C.: Check the consistency of ansatz (A2)

We here check the consistency of a part of the ansatz (A2) in section 3.2, the orthogonal matrix O diagonalizing the covariance matrix C is distributed from the Haar orthogonal measure. To achieve this, we compare certain properties of the orthogonal matrix generated from the diagonalization of the covariance matrix C with the orthogonal matrix which is actually generated from the Haar orthogonal measure. Specifically, we compute the cumulants of the trace of the power k of the orthogonal matrix. All cumulants with degree r ⩾ 3 are shown to disappear in the large N limit [41, 42]. The nontrivial cumulants are only second order cumulant with the same power k. We have computed these cumulants about the orthogonal matrix from the covariance matrix C and found that they exhibit the same behavior as the ones generated from the true Haar measure, as shown in figure 5.

**Figure 5.** The RR graph $G\in {\mathcal{G}}_{N,d,{K}_{0}}$ with N = 1000, d = 3, K₀ = 0.4 is generated and we compute the associated covariance matrix C and then diagonalize it as C = OΛO^T, obtaining the orthogonal matrix O. Then the $\mathtt{Tr}\left({O}^{k}\right),\mathtt{Tr}\left({O}^{-k}\right)$ for several k (k = 1 ∼ 8) are computed, where $\mathtt{Tr}\left(\cdot \right)$ is the trace operation. This procedure is repeated 200 times with different random numbers, from which we obtain the ensemble of $\mathtt{Tr}\left({O}^{k}\right)$ and $\mathtt{Tr}\left({O}^{-k}\right)$ . Consequently, the cumulants of 1st, 2nd, and 3rd orders are computed. All of them exhibit the expected theoretical behavior.
Download figure:
Standard image High-resolution image

**Figure 5.** The RR graph $G\in {\mathcal{G}}_{N,d,{K}_{0}}$ with N = 1000, d = 3, K₀ = 0.4 is generated and we compute the associated covariance matrix C and then diagonalize it as C = OΛO^T, obtaining the orthogonal matrix O. Then the $\mathtt{Tr}\left({O}^{k}\right),\mathtt{Tr}\left({O}^{-k}\right)$ for several k (k = 1 ∼ 8) are computed, where $\mathtt{Tr}\left(\cdot \right)$ is the trace operation. This procedure is repeated 200 times with different random numbers, from which we obtain the ensemble of $\mathtt{Tr}\left({O}^{k}\right)$ and $\mathtt{Tr}\left({O}^{-k}\right)$ . Consequently, the cumulants of 1st, 2nd, and 3rd orders are computed. All of them exhibit the expected theoretical behavior.
Download figure:
Standard image High-resolution image

Appendix D.: Details of the high-dimensional asymptotic result

Here the asymptotic performance of Precision and Recall are considered for both the ℓ₁-LinR estimator and the ℓ₁-LogR estimator. Recall that perfect Ising model selection is achieved if and only if Precision = 1 and Recall = 1.

D.1. Recall rate

According to the definition in (5), the recall rate is only related to the statistical properties of estimates in the active set Ψ and thus the mean estimates ${\left\{{\bar{J}}_{j}\right\}}_{j\in {\Psi}}$ in the limit M → ∞ are considered.

D.1.1. Quadratic loss

In this case, in the limit M → ∞, the mean estimates ${\left\{{\bar{J}}_{j}=J\right\}}_{j\in {\Psi}}$ in the active set Ψ are shown in (64) and rewritten as follows for ease of reference

$\begin{equation}J=\frac{\mathtt{s}\mathtt{o}\mathtt{f}\mathtt{t}\left(\mathrm{tanh}\left({K}_{0}\right),\lambda \left(1+\chi \right)\right)}{1+\left(d-1\right){\mathrm{tanh}}^{2}\left({K}_{0}\right)}.\end{equation} \tag{ 71 }$

As a result, as long as $\lambda \left(1+\chi \right)< \mathrm{tanh}\left({K}_{0}\right)$ , J > 0 and thus we can successfully recover the active set so that Recall = 1. In addition, when $M=\mathcal{O}\left(\mathrm{log}\,N\right)$ , χ → 0 as N → ∞, as demonstrated later by the relation in (81). As a result, the regularization parameter needs to satisfy $0< \lambda < \mathrm{tanh}\left({K}_{0}\right)$ .

D.1.2. Logistic loss

In this case, in the limit M → ∞, the mean estimates ${\left\{{\bar{J}}_{j}=J\right\}}_{j\in {\Psi}}$ in the active set Ψ are shown in (69) and rewritten as follows for ease of reference

$\begin{equation}J=\frac{\mathtt{s}\mathtt{o}\mathtt{f}\mathtt{t}\left({\mathbb{E}}_{s,z}\left(\hat{y}\left(s,z,\chi ,Q,J\right){s}_{0}{\sum }_{j\in {\Psi}}{s}_{j}\right),\lambda d\chi \right)}{d\left(1+\left(d-1\right){\mathrm{tanh}}^{2}\left({K}_{0}\right)\right)}.\end{equation} \tag{ 72 }$

There is no analytic solution for $\hat{y}\left(s,z,\chi ,Q,J\right)$ and the following fixed-point equation has to be solved numerically

$\begin{equation}\frac{\hat{y}\left(s,z,\chi ,Q,J\right)-{s}_{0}\left(\sqrt{Q}z+J{\sum }_{j\in {\Psi}}{s}_{j}\right)}{\chi }=1-\mathrm{tanh}\left(\hat{y}\left(s,z,\chi ,Q,J\right)\right).\end{equation} \tag{ 73 }$

Then one can determine the valid choice of λ to enable J > 0. Numerical results show that the choice of λ is similar to that of the quadratic loss.

D.2. Precision rate

According to the definition in (5), to compute the Precision, the number of true positives TP and false positives FP are needed, respectively. On the one hand, as discussed in appendix D.1, in the limit M → ∞, the recall rate approach to one and thus we have TP = d for a RR graph $G\in {\mathcal{G}}_{N,d,{K}_{0}}$ . On the other hand, the number of false positives FP can be computed as FP = FPR ⋅ N, where FPR is the false positive rate.

As shown in appendix A.3, the estimator in the inactive set $\bar{{\Psi}}$ can be statistically described by a scalar estimator (66) and thus the FPR can be computed as

$\begin{equation}FPR=\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left(\frac{\lambda M}{\sqrt{2HN}}\right),\end{equation} \tag{ 74 }$

which depends on λ, M, N, H. However, for both the quadratic loss and logistic loss, there is no analytic result for H in (59). Nevertheless, we can obtain some asymptotic result using perturbative analysis.

Specifically, we focus on the asymptotic behavior of the macroscopic parameters, e.g. χ, Q, K, E, H, F, in the regime FPR → 0, which is necessary for successful Ising model selection. From $\eta =\frac{1}{K}\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left(\frac{\lambda M}{\sqrt{2HN}}\right)$ in EOS (59) and the FPR in (74), there is FPR = Kη. Moreover, by combining $E\eta =-\int \frac{\rho \left(\gamma \right)}{\tilde{\Lambda }-\gamma }d\gamma$ and $K=E\tilde{\Lambda }+\frac{1}{\eta }$ , the following relation can be obtained

$\begin{equation}\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left(\frac{\lambda M}{\sqrt{2HN}}\right)=1-\int \frac{\rho \left(\gamma \right)}{1-\frac{\gamma }{\tilde{\Lambda }}}d\gamma .\end{equation} \tag{ 75 }$

Thus as $FPR=\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left(\frac{\lambda M}{\sqrt{2HN}}\right)\to 0$ , there is $\int \frac{\rho \left(\gamma \right)}{1-\frac{\gamma }{\tilde{\Lambda }}}d\gamma \to 1$ , implying that the magnitude of $\tilde{\Lambda }\to \infty$ . Consequently, using the truncated series expansion, we obtain

$\begin{align}\hfill E\eta & =-\int \frac{\rho \left(\gamma \right)}{\tilde{\Lambda }-\gamma }d\gamma =-\frac{1}{\tilde{\Lambda }}\sum\limits _{k=0}^{\infty }\frac{\left\langle {\gamma }^{k}\right\rangle }{{\tilde{\Lambda }}^{k}}\hfill \\ \hfill & \simeq -\frac{1}{\tilde{\Lambda }}-\frac{\left\langle \gamma \right\rangle }{{\tilde{\Lambda }}^{2}},\hfill \end{align} \tag{ 76 }$

where $\left\langle {\gamma }^{k}\right\rangle =\int \rho \left(\gamma \right){\gamma }^{k}d\gamma$ . Then, solving the quadratic equation (76), we obtain the solution (the other solution is not considered since it is a smaller value) of $\tilde{\Lambda }$ as

$\begin{equation}\tilde{\Lambda }=\frac{-1-\sqrt{1-4E\eta \left\langle \gamma \right\rangle }}{2E\eta }\simeq \left\langle \gamma \right\rangle -\frac{1}{E\eta }.\end{equation} \tag{ 77 }$

To compute $\int \frac{\rho \left(\gamma \right)}{{\left(\tilde{\Lambda }-\gamma \right)}^{2}}d\gamma$ , we use the following relation

$\begin{equation}f\left(\tilde{\Lambda }\right)=-\int \frac{\rho \left(\gamma \right)}{\tilde{\Lambda }-\gamma }d\gamma \simeq -\frac{1}{\tilde{\Lambda }}-\frac{\left\langle \gamma \right\rangle }{{\tilde{\Lambda }}^{2}},\end{equation} \tag{ 78 }$

$\begin{equation}\frac{df\left(\tilde{\Lambda }\right)}{d\tilde{\Lambda }}=\int \frac{\rho \left(\gamma \right)}{{\left(\tilde{\Lambda }-\gamma \right)}^{2}}d\gamma \simeq \frac{1}{{\tilde{\Lambda }}^{2}}+2\frac{\left\langle \gamma \right\rangle }{{\tilde{\Lambda }}^{3}}.\end{equation} \tag{ 79 }$

Substituting the results (77)–(79) into (59), after some algebra, we obtain

$\begin{equation}K\simeq E\left\langle \gamma \right\rangle ,\end{equation} \tag{ 80 }$

$\begin{equation}\chi \simeq \eta \left\langle \gamma \right\rangle ,\end{equation} \tag{ 81 }$

$\begin{equation}Q\simeq \frac{{\left\langle \gamma \right\rangle }^{3}{E}^{2}{\eta }^{2}R-{\left\langle \gamma \right\rangle }^{3}EF{\eta }^{3}+3{\left\langle \gamma \right\rangle }^{2}F{\eta }^{2}-R\left\langle \gamma \right\rangle }{3E\eta \left\langle \gamma \right\rangle -1},\end{equation} \tag{ 82 }$

$\begin{equation}H\simeq \frac{{\left\langle \gamma \right\rangle }^{3}{E}^{2}{\eta }^{2}F-{\left\langle \gamma \right\rangle }^{3}R\eta {E}^{3}+3{\left\langle \gamma \right\rangle }^{2}R{E}^{2}-F\left\langle \gamma \right\rangle }{3E\eta \left\langle \gamma \right\rangle -1}.\end{equation} \tag{ 83 }$

In addition, as $FPR=\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left(\frac{\lambda M}{\sqrt{2HN}}\right)\to 0$ , from (59) we obtain

$\begin{align}\hfill R& =\frac{1}{{K}^{2}}\left[\left(H+\frac{{\lambda }^{2}{M}^{2}}{N}\right)\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left(\frac{\lambda M}{\sqrt{2HN}}\right)-2\lambda M\sqrt{\frac{H}{N}}\frac{1}{\sqrt{2\pi }}{e}^{-\frac{{\lambda }^{2}{M}^{2}}{2HN}}\right]\hfill \\ \hfill & \simeq \frac{H}{{K}^{2}}\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left(\frac{\lambda M}{\sqrt{2HN}}\right)\simeq \frac{H}{K}\eta \simeq \frac{H}{E\left\langle \gamma \right\rangle }\eta ,\hfill \end{align} \tag{ 84 }$

where the first result in ≃ uses the asymptotic relation $\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left(x\right)\simeq \frac{1}{x\sqrt{\pi }}{e}^{-{x}^{2}}$ as x → ∞ and the last result in ≃ results from the asymptotic relation in (80). Then, substituting (84) into (83) leads to the following relation

$\begin{equation}\left(3E\eta \left\langle \gamma \right\rangle -1\right)H\simeq {\left\langle \gamma \right\rangle }^{3}{E}^{2}{\eta }^{2}F-{\left\langle \gamma \right\rangle }^{2}{\eta }^{2}{E}^{2}H+3E\eta \left\langle \gamma \right\rangle H-F\left\langle \gamma \right\rangle .\end{equation} \tag{ 85 }$

Interestingly, the common terms $3E\eta \left\langle \gamma \right\rangle H$ in both sides of (85) cancel with each other. Therefore, the key result for H is obtained as follows

$\begin{equation}H\simeq F\left\langle \gamma \right\rangle .\end{equation} \tag{ 86 }$

In addition, from (86) and (82), Q can be simplified as

$\begin{equation}Q\simeq R\left\langle \gamma \right\rangle .\end{equation} \tag{ 87 }$

As shown in (59), $F=\alpha {\mathbb{E}}_{s,z}{\left(\frac{d\ell \left(y\right)}{dy}{\vert }_{y=\hat{y}\left(s,z,\chi ,Q,J\right)}\right)}^{2}$ , thus the result $H\simeq F\left\langle \gamma \right\rangle$ in (86) implies that there is a linear relation between H and α ≡ M/N. The relation between E, F, H and α are also verified numerically in figure 6 when M = 50(log N) for N = 10² ∼ 10¹² using the ℓ₁-LinR estimator.

**Figure 6.** E, F, H versus α when α = 50(log N)/N for N = 10² ∼ 10¹² for RR graph $G\in {\mathcal{G}}_{N,d,{K}_{0}}$ with d = 3, K₀ = 0.4. Note that in this case, there is $\left\langle \gamma \right\rangle =1$ .
Download figure:
Standard image High-resolution image

In the paramagnetic phase, it can be obtained that the mean value of the eigenvalue $\left\langle \gamma \right\rangle$ . Specifically, we have $\left\langle \gamma \right\rangle =\frac{1}{N}{\sum }_{i=1}^{N}{\gamma }_{i}=\frac{1}{N}\,\mathrm{Tr}\,C=(1/N)\times N=1$ . Denote by $H\simeq F\left\langle \gamma \right\rangle \equiv \alpha {\triangle}$ , where ${\triangle}={\mathbb{E}}_{s,z}{\left(\frac{d\ell \left(y\right)}{dy}{\vert }_{y=\hat{y}\left(s,z,\chi ,Q,J\right)}\right)}^{2}=\mathcal{O}\left(1\right)$ , then the FPR in (74) can be rewritten as follows

$\begin{align}\hfill FPR& =\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left(\frac{\lambda M}{\sqrt{2\alpha {\triangle}N}}\right)=\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left(\lambda \sqrt{\frac{M}{2{\triangle}}}\right)\leqslant \frac{1}{\sqrt{\pi }}{e}^{-\frac{{\lambda }^{2}M}{2{\triangle}}-\frac{1}{2}\,\mathrm{log}\left(\frac{{\lambda }^{2}M}{2{\triangle}}\right)},\hfill \end{align} \tag{ 88 }$

where the last inequality uses the upper bound of erfc function, i.e. $\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left(x\right)\leqslant \frac{1}{x\sqrt{\pi }}{e}^{-{x}^{2}}$ . Consequently, the number of false positives FP satisfies

$\begin{align}\hfill FP& \leqslant \frac{N}{\sqrt{\pi }}{e}^{-\frac{{\lambda }^{2}M}{2{\triangle}}-\frac{1}{2}\,\mathrm{log}\left(\frac{{\lambda }^{2}M}{2{\triangle}}\right)}=\frac{1}{\sqrt{\pi }}{e}^{-\frac{{\lambda }^{2}M}{2{\triangle}}-\frac{1}{2}\,\mathrm{log}\left(\frac{{\lambda }^{2}M}{2{\triangle}}\right)+\mathrm{log}\,N}< \frac{1}{\sqrt{\pi }}{e}^{-\frac{{\lambda }^{2}M}{2{\triangle}}+\mathrm{log}\,N},\hfill \end{align} \tag{ 89 }$

where the last inequality holds when $\frac{{\lambda }^{2}M}{2{\triangle}} > 1$ , which is necessary when FP → 0 as N → ∞. Consequently, to ensure FP → 0 as N → ∞, from (89), the term $\frac{{\lambda }^{2}M}{2{\triangle}}$ should grow at least faster than log N, i.e.

$\begin{equation}M > \frac{2{\triangle}\mathrm{log}\,N}{{\lambda }^{2}}.\end{equation} \tag{ 90 }$

Meanwhile, the number of false positives FP will decay as $\mathcal{O}\left({e}^{-c\,\mathrm{log}\,N}\right)$ for some constant $c\left( > 0\right)$ .

D.2.1. Quadratic loss

In this case, when $0< \lambda < \mathrm{tanh}\left({K}_{0}\right)$ , from (61), we can obtain an analytic result for △ as follows

$\begin{equation}{\triangle}\simeq {\mathbb{E}}_{{s}_{0}}{\left(s-{\sum }_{j\in {\Psi}}{s}_{j}{\bar{J}}_{j}\right)}^{2}\end{equation} \tag{ 91 }$

$\begin{equation}=\frac{1-{\mathrm{tanh}}^{2}{K}_{0}+d{\lambda }^{2}}{1+\left(d-1\right){\mathrm{tanh}}^{2}\,{K}_{0}}.\end{equation} \tag{ 92 }$

On the other hand, from the discussion in appendix D.1, the recall rate Recall → 1 as M → ∞ when 0 < λ < tanh K₀. Overall, for a RR graph $G\in {\mathcal{G}}_{N,d,{K}_{0}}$ with degree d and coupling strength K₀, given M i.i.d. samples ${\mathcal{D}}^{M}=\left\{{\boldsymbol{s}}^{\left(1\right)},\dots ,{\boldsymbol{s}}^{\left(M\right)}\right\}$ , using ℓ₁-LinR estimator (4) with regularization parameter λ, perfect recovery of the graph structure G can be achieved as N → ∞ if the number of samples M satisfies

$\begin{equation}M > \frac{c\left(\lambda ,{K}_{0}\right)\mathrm{log}\,N}{{\lambda }^{2}},\quad \lambda \in \left(0,\mathrm{tanh}\left({K}_{0}\right)\right)\end{equation} \tag{ 93 }$

where $c\left(\lambda ,{K}_{0}\right)$ is a value dependent on the regularization parameter λ and coupling strength K₀, which can be approximated in the limit N → ∞ as:

$\begin{equation}c\left(\lambda ,{K}_{0}\right)=\frac{2\left(1-{\mathrm{tanh}}^{2}\left({K}_{0}\right)+d{\lambda }^{2}\right)}{1+\left(d-1\right){\mathrm{tanh}}^{2}\left({K}_{0}\right)}.\end{equation} \tag{ 94 }$

D.2.2. Logistic loss

In this case, from (67), the value of △ can be computed as

$\begin{equation}{\triangle}\simeq {\mathbb{E}}_{s,z}\left({\left(1-\mathrm{tanh}\left(\hat{y}\left(S,z,\chi ,Q,J\right)\right)\right)}^{2}\right).\end{equation} \tag{ 95 }$

However, different from the case of ℓ₁-LinR estimator, there is no analytic solution but it can be calculated numerically. It can be seen that the ℓ₁-LinR estimator only differs in the value of scaling factor △ with the ℓ₁-LogR estimator for Ising model selection.

Appendix E.: Details of the non-asymptotic result for moderate M, N

As demonstrated in appendix A.3, from the replica analysis, both ℓ₁-LinR and ℓ₁-LogR estimators are decoupled and their asymptotic behavior can be described by two scalar estimators for the active set and inactive set, respectively. It is desirable to obtain the non-asymptotic result for moderate M, N. However, it is found that the behavior of the two scalar estimators by simply inserting the finite values of M, N into the EOS does not always lead to good consistency with the experimental results, especially for the Recall when M is small. This can be explained by the derivation of the free energy density. In calculating the energy term ξ, the limit M → ∞ is taken implicitly when assuming the limit N → ∞ with α ≡ M/N. As a result, the scalar estimator associated with the active set can only describe the asymptotic performance in the limit M → ∞. Thus, one cannot describe the fluctuating behavior of the estimator in the active set such as the recall rate for finite M. To characterize the non-asymptotic behavior of the estimates in the active set Ψ, we replace the expectation ${\mathbb{E}}_{s}(\cdot )$ in (58) by the sample average over M samples, and the corresponding estimates are obtained as

$\begin{align}\hfill {\left\{{\hat{J}}_{j}\right\}}_{j\in {\Psi}}& =\underset{{J}_{j,j\in {\Psi}}}{\mathrm{arg}\,\mathrm{min}}\left\{\frac{1}{M}\sum\limits _{\mu =1}^{M}\underset{{y}^{\mu }}{\mathrm{min}}\left[\frac{{\left({y}^{\mu }-{s}_{0}^{\mu }\left(\sqrt{Q}{z}^{\mu }+{\sum }_{j\in {\Psi}}{J}_{j}{s}_{j}^{\mu }\right)\right)}^{2}}{2\chi }+\ell \left({y}^{\mu }\right)\right]\right.\left.+\lambda {\sum }_{j\in {\Psi}}\left\vert {J}_{j}\right\vert \right\},\hfill \end{align} \tag{ 96 }$

where ${z}^{\mu }\sim \mathcal{N}\left(0,1\right)$ and ${s}_{0}^{\mu },{s}_{j,j\in {\Psi}}^{\mu }\sim P\left({s}_{0},{\boldsymbol{s}}_{{\Psi}}\vert {\boldsymbol{J}}^{\ast }\right)$ are random samples μ = 1, ..., M. Note that the mean estimates ${\left\{{\bar{J}}_{j}\right\}}_{j\in {\Psi}}$ are replaced by ${\left\{{\hat{J}}_{j}\right\}}_{j\in {\Psi}}$ in (96) as we now focus on its fluctuating behavior due to the finite size effect. In the limit M → ∞, the sample average will converge to the expectation and thus (96) is equivalent to (68) when M → ∞.

E.1. Quadratic loss $\ell \left(y\right)={\left(y-1\right)}^{2}/2$

In the case of quadratic loss $\ell \left(y\right)={\left(y-1\right)}^{2}/2$ , there is an analytic solution to y in $\underset{y}{\mathrm{min}}\left[\frac{{\left(y-{s}_{0}\left(\sqrt{Q}z+{\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}\right)\right)}^{2}}{2\chi }+\ell \left(y\right)\right]$ . Consequently, similar to (62), the result of (96) for the ℓ₁-LinR estimator becomes

$\begin{equation}{{{J}}_{j}}_{j\in {\Psi}}=\underset{{J}_{j,j\in {\Psi}}}{\mathrm{arg}\,\mathrm{min}}\left[\frac{1}{2\left(1+\chi \right)M}\sum\limits _{\mu =1}^{M}{\left({s}_{i}^{\mu }-{\sum }_{j\in {\Psi}}{s}_{j}^{\mu }{J}_{j}-\sqrt{Q}{z}^{\mu }\right)}^{2}+\lambda {\sum }_{j\in {\Psi}}\left\vert {J}_{j}\right\vert \right].\end{equation} \tag{ 97 }$

As the mean estimates ${\left\{{\bar{J}}_{j}\right\}}_{j\in {\Psi}}$ are modified as in (97), the corresponding solution to the EOS in (61) also needs to be modified, and this can be solved iteratively as sketched in algorithm 1. For a practical implementation of algorithm 1, the details are described in the following.

First, in the EOS (19), we need to obtain $\tilde{\Lambda }$ satisfying the following relation

$\begin{equation}E\eta =-\int \frac{\rho \left(\gamma \right)}{\tilde{\Lambda }-\gamma }d\gamma ,\end{equation} \tag{ 98 }$

which is difficult to solve directly. To obtain $\tilde{\Lambda }$ , we introduce an auxiliary variable $\Gamma \equiv -\frac{1}{\tilde{\Lambda }}$ , by which (98) can be rewritten as

$\begin{equation}\Gamma =\frac{E\eta }{\int \frac{\rho \left(\gamma \right)}{1+\Gamma \gamma }d\gamma },\end{equation} \tag{ 99 }$

which can be solved iteratively. Accordingly, the χ, Q, K, H in EOS (19) can be equivalently written in terms of Γ.

Second, when solving the EOS (19) iteratively using numerical methods, it is helpful to improve the convergence of the solution by introducing a small amount of damping factor $\mathtt{d}\mathtt{a}\mathtt{m}\mathtt{p}\in [0,1)$ for χ, Q, E, R, F, η, K, H, Γ in each iteration.

The detailed implementation of algorithm 1 is shown in algorithm 2.

Algorithm 2. Detailed implementation of algorithm 1 for the ℓ₁-LinR estimator with moderate M, N.

E.2. Logistic loss $\ell \left(y\right)=\mathrm{log}\left(1+{e}^{-2y}\right)$

In the case of square lass $\ell \left(y\right)=\mathrm{log}\left(1+{e}^{-2y}\right)$ , since there is no analytic solution to y in $\underset{y}{\mathrm{min}}\left[\frac{{\left(y-{s}_{0}\left(\sqrt{Q}z+{\sum }_{j\in {\Psi}}{\bar{J}}_{j}{s}_{j}\right)\right)}^{2}}{2\chi }+\ell \left(y\right)\right]$ , the result of (96) for the ℓ₁-LogR estimator becomes

$\begin{align}\hfill {\hat{J}}_{j,j\in {\Psi}}& =\underset{{J}_{j,j\in {\Psi}}}{\mathrm{arg}\,\mathrm{min}}\left[\frac{1}{M}\sum\limits _{\mu =1}^{M}\underset{{y}^{\mu }}{\mathrm{min}}\left[\frac{{\left({y}^{\mu }-{s}_{0}^{\mu }\left(\sqrt{Q}{z}^{\mu }+{\sum }_{j\in {\Psi}}{J}_{j}{s}_{j}^{\mu }\right)\right)}^{2}}{2\chi }\right.\right.\hfill \\ \hfill & \quad \left.\left.+\mathrm{log}(1+{e}^{-2y})\right]+\lambda {\sum }_{j\in {\Psi}}\left\vert {J}_{j}^{\mu }\right\vert \right],\hfill \end{align} \tag{ 100 }$

Similarly as the case for quadratic loss, as the mean estimates ${\left\{{\bar{J}}_{j}\right\}}_{j\in {\Psi}}$ are modified as in (100), the corresponding solutions to the EOS in (67) also need to be modified, which can be solved iteratively as shown in algorithm 3.

Algorithm 3. Detailed implementation of solving the EOS (67) together with (100) for ℓ₁-LogR with moderate M, N.

Appendix F.: Eigenvalue distribution $\rho \left(\gamma \right)$

From the replica analysis presented, the learning performance will depend on the eigenvalue distribution (EVD) $\rho \left(\gamma \right)$ of the covariance matrix C of the original Ising model.

There are two issues to be noted. One is about the formula connecting the performance of the estimator and the spectral density, and the other is the numeric values of quantities which are computed from the formula. For the first point, no assumption about the spectral density is needed to obtain the formula itself and this formula is valid when the graph structure is tree-like and the Ising model defined on the graph is in the paramagnetic phase. For the second point, we need the specific form of the spectral density to obtain numeric solutions in general. As a demonstration, we assume the random regular graph with constant coupling strength for which the spectral density can be obtained analytically as has already been known before in [7].

In general, it is difficult to obtain this EVD; however, for sparse tree-like graphs such as RR graph $G\in {\mathcal{G}}_{N,d,{K}_{0}}$ with constant node degree d and sufficiently small coupling strength K₀ that yields the paramagnetic state $({\mathbb{E}}_{\boldsymbol{s}}(\boldsymbol{s})=\mathbf{0})$ , it can be computed analytically. For this, we express the covariances as

$\begin{equation}{C}_{ij}={\mathbb{E}}_{\boldsymbol{s}}({s}_{i}{s}_{j})-{\mathbb{E}}_{\boldsymbol{s}}({s}_{i}){\mathbb{E}}_{\boldsymbol{s}}({s}_{j})=\frac{{\partial }^{2}\mathrm{log}\,Z(\boldsymbol{\theta })}{\partial {\theta }_{i}\partial {\theta }_{j}},\end{equation} \tag{ 101 }$

where $Z(\boldsymbol{\theta })=\int d\boldsymbol{s}{P}_{\text{Ising}}(\boldsymbol{s}\vert {J}^{\ast })\mathrm{exp}({\sum }_{i=0}^{N-1}{\theta }_{i}{s}_{i})$ and the assessment is carried out at θ = 0.

In addition, for technical convenience we introduce the Gibbs free energy as

$\begin{equation}A\left(\boldsymbol{m}\right)=\underset{\boldsymbol{\theta }}{\mathrm{max}}\left\{{\boldsymbol{\theta }}^{T}\boldsymbol{m}-\mathrm{log}\,Z\left(\boldsymbol{\theta }\right)\right\}.\end{equation} \tag{ 102 }$

The definition of (102) indicates that following two relations hold:

$\begin{align}\hfill \frac{\partial {m}_{i}}{\partial {\theta }_{j}}=\frac{{\partial }^{2}\mathrm{log}\,Z(\boldsymbol{\theta })}{\partial {\theta }_{i}\partial {\theta }_{j}}={C}_{ij},\\ \hfill \frac{\partial {\theta }_{i}}{\partial {m}_{j}}={[{C}^{-1}]}_{ij}=\frac{{\partial }^{2}A(\boldsymbol{m})}{\partial {m}_{i}\partial {m}_{j}},\end{align} \tag{ 103 }$

where the evaluations are performed at θ = 0 and m = arg min_m A( m ) (=0 under the paramagnetic assumption).

Consequently, we can focus on the computation of $A\left(\boldsymbol{m}\right)$ to obtain the EVD of C⁻¹. The inverse covariance matrix of a RR graph $G\in {\mathcal{G}}_{N,d,{K}_{0}}$ can be computed from the Hessian of the Gibbs free energy [7, 47, 48] as

$\begin{align}\hfill {\left[{C}^{-1}\right]}_{ij}& =\frac{\partial A\left(\boldsymbol{m}\right)}{\partial {m}_{i}\partial {m}_{j}}\hfill \\ \hfill & =\left(\frac{d}{1-{\mathrm{tanh}}^{2}\,{K}_{0}}-d+1\right){\delta }_{ij}-\frac{\mathrm{tanh}\left({J}_{ij}\right)}{1-{\mathrm{tanh}}^{2}\left({J}_{ij}\right)}\left(1-{\delta }_{ij}\right),\hfill \end{align} \tag{ 104 }$

and in matrix form, we have

$\begin{equation}{C}^{-1}=\left(\frac{d}{1-{\mathrm{tanh}}^{2}\,{K}_{0}}-d+1\right)\mathbf{I}-\frac{\mathrm{tanh}\left(\boldsymbol{J}\right)}{1-{\mathrm{tanh}}^{2}\left(\boldsymbol{J}\right)},\end{equation} \tag{ 105 }$

where I is an identity matrix of proper size, and the operations $\mathrm{tanh}\left(\cdot \right),{\mathrm{tanh}}^{2}\left(\cdot \right)$ on matrix J are defined in the component-wise manner. For RR graph $G\in {\mathcal{G}}_{N,d,{K}_{0}}$ , J is a sparse matrix, therefore the matrix $\frac{\mathrm{tanh}\left(\boldsymbol{J}\right)}{1-{\mathrm{tanh}}^{2}\left(\boldsymbol{J}\right)}$ also corresponds to a sparse coupling matrix (whose nonzero coupling positions are the same as J ) with constant coupling strength ${K}_{1}=\frac{\mathrm{tanh}\left({K}_{0}\right)}{1-{\mathrm{tanh}}^{2}\left({K}_{0}\right)}$ and fixed connectivity d, the corresponding eigenvalue (denoted as ζ) distribution can be calculated as [49]

$\begin{equation}{\rho }_{\zeta }\left(\zeta \right)=\frac{d\sqrt{4{K}_{1}^{2}\left(d-1\right)-{\zeta }^{2}}}{2\pi \left({K}_{1}^{2}{d}^{2}-{\zeta }^{2}\right)},\quad \left\vert \zeta \right\vert \leqslant 2{K}_{1}\sqrt{d-1}.\end{equation} \tag{ 106 }$

From (105), the eigenvalue η of C⁻¹ is

$\begin{equation}{\eta }_{i}=\frac{d}{1-{\mathrm{tanh}}^{2}\,{K}_{0}}-d+1-{\zeta }_{i},\end{equation} \tag{ 107 }$

which, when combined with (106), readily yields the EVD of η as N → ∞ as follows:

$\begin{align}\hfill {\rho }_{\eta }\left(\eta \right)& ={\rho }_{\zeta }\left(\frac{d}{1-{\mathrm{tanh}}^{2}\,{K}_{0}}-d+1-\eta \right)\hfill \\ \hfill & =\frac{d\sqrt{4{\left(\frac{\mathrm{tanh}\left({K}_{0}\right)}{1-{\mathrm{tanh}}^{2}\left({K}_{0}\right)}\right)}^{2}\left(d-1\right)-{\left(\frac{d}{1-{\mathrm{tanh}}^{2}\,{K}_{0}}-d+1-\eta \right)}^{2}}}{2\pi \left({\left(\frac{\mathrm{tanh}\left({K}_{0}\right)}{1-{\mathrm{tanh}}^{2}\left({K}_{0}\right)}\right)}^{2}{d}^{2}-{\left(\frac{d}{1-{\mathrm{tanh}}^{2}\,{K}_{0}}-d+1-\eta \right)}^{2}\right)},\hfill \end{align} \tag{ 108 }$

where $\eta \in \left[\frac{d}{1-{\mathrm{tanh}}^{2}{K}_{0}}-d+1-\frac{2\,\mathrm{tanh}\left({K}_{0}\right)\sqrt{d-1}}{1-{\mathrm{tanh}}^{2}\left({K}_{0}\right)},\frac{d}{1-{\mathrm{tanh}}^{2}\,{K}_{0}}-d+1+\frac{2\,\mathrm{tanh}\left({K}_{0}\right)\sqrt{d-1}}{1-{\mathrm{tanh}}^{2}\left({K}_{0}\right)}\right]$ .

Consequently, since γ = 1/η, we obtain the EVD of $\rho \left(\gamma \right)$ as follows

$\begin{align}\hfill \rho \left(\gamma \right)& =\frac{1}{{\gamma }^{2}}{\rho }_{\eta }\left(\eta =\frac{1}{\gamma }\right)\hfill \\ \hfill & =\frac{d\sqrt{4{\left(\frac{\mathrm{tanh}\left({K}_{0}\right)}{1-{\mathrm{tanh}}^{2}\left({K}_{0}\right)}\right)}^{2}\left(d-1\right)-{\left(\frac{d}{1-{\mathrm{tanh}}^{2}\,{K}_{0}}-d+1-\frac{1}{\gamma }\right)}^{2}}}{2\pi {\gamma }^{2}\left({\left(\frac{\mathrm{tanh}\left({K}_{0}\right)}{1-{\mathrm{tanh}}^{2}\left({K}_{0}\right)}\right)}^{2}{d}^{2}-{\left(\frac{d}{1-{\mathrm{tanh}}^{2}\,{K}_{0}}-d+1-\frac{1}{\gamma }\right)}^{2}\right)}\hfill \end{align} \tag{ 109 }$

where $\gamma \in \left[1/\left(\frac{d}{1-{\mathrm{tanh}}^{2}{K}_{0}}-d+1+\frac{2\,\mathrm{tanh}\left({K}_{0}\right)\sqrt{d-1}}{1-{\mathrm{tanh}}^{2}\left({K}_{0}\right)}\right),1/\left(\frac{d}{1-{\mathrm{tanh}}^{2}\,{K}_{0}}-d+1-\frac{2\enspace \mathrm{tanh}\left({K}_{0}\right)\sqrt{d-1}}{1-{\mathrm{tanh}}^{2}\left({K}_{0}\right)}\right)\right]$ .

Appendix G.: Additional experimental results

Figures 7 and 8 show the full results of non-asymptotic learning performance prediction when λ = 0.1 and λ = 0.3, respectively. Good agreements between replica results and experimental results are achieved in all cases. As can be seen, there is negligible difference in Precision and Recall between ℓ₁-LinR and ℓ₁-LogR. Meanwhile, compared to figure 7 when λ = 0.1, the difference in RSS between ℓ₁-LinR and ℓ₁-LogR is reduced when λ = 0.3. In addition, by comparing figures 7 and 8, it can be seen that under the same setting, when λ increases, the Precision becomes larger while the Recall becomes smaller, implying a tradeoff in choosing λ in practice for Ising model selection with finite M, N.

**Figure 7.** Theoretical and experimental results of *RSS*, *Precision* and *Recall* for both ℓ₁-LinR and ℓ₁-LogR when λ = 0.1, N = 200, 400, 800 with different values of α ≡ M/N. The standard error bars are obtained from 1000 random runs. An excellent agreement between theory and experiment is achieved, even for small N = 200 and small α (small M).
Download figure:
Standard image High-resolution image

**Figure 8.** Theoretical and experimental results of RSS, *Precision* and *Recall* for both ℓ₁-LinR and ℓ₁-LogR when λ = 0.3, N = 200, 400, 800 with different values of α ≡ M/N. The standard error bars are obtained from 1000 random runs. An excellent agreement between theory and experiment is achieved, even for small N = 200 and small α (small M).
Download figure:
Standard image High-resolution image

Figures 9 and 10 show the full results of critical scaling prediction when λ = 0.1 and λ = 0.3, respectively. For comparison, both the results of ℓ₁-LinR and ℓ₁-LogR are shown. It can be seen that apart from the good agreements between replica results and experimental results, the prediction of the scaling value ${c}_{0}\left(\lambda ,{K}_{0}\right)\equiv \frac{c\left(\lambda ,{K}_{0}\right)}{{\lambda }^{2}}$ is very accurate.

**Figure 9.** *Precision* and *Recall* versus N when M = c log N and K₀ = 0.4 for ℓ₁-LinR and ℓ₁-LogR when λ = 0.1, where ${c}_{0}\left(\lambda ,{K}_{0}\right)\equiv \frac{c\left(\lambda ,{K}_{0}\right)}{{\lambda }^{2}}\approx 137$ . When $c > {c}_{0}\left(\lambda ,{K}_{0}\right)$ , the *Precision* increases consistently with N and approaches 1 as N → ∞ while it decreases consistently with N when $c< {c}_{0}\left(\lambda ,{K}_{0}\right)$ .
Download figure:
Standard image High-resolution image

**Figure 9.** *Precision* and *Recall* versus N when M = c log N and K₀ = 0.4 for ℓ₁-LinR and ℓ₁-LogR when λ = 0.1, where ${c}_{0}\left(\lambda ,{K}_{0}\right)\equiv \frac{c\left(\lambda ,{K}_{0}\right)}{{\lambda }^{2}}\approx 137$ . When $c > {c}_{0}\left(\lambda ,{K}_{0}\right)$ , the *Precision* increases consistently with N and approaches 1 as N → ∞ while it decreases consistently with N when $c< {c}_{0}\left(\lambda ,{K}_{0}\right)$ .
Download figure:
Standard image High-resolution image

**Figure 10.** *Precision* and *Recall* versus N when M = c log N and K₀ = 0.4 for ℓ₁-LinR and ℓ₁-LogR when λ = 0.3, where ${c}_{0}\left(\lambda ,{K}_{0}\right)\equiv \frac{c\left(\lambda ,{K}_{0}\right)}{{\lambda }^{2}}\approx 19.4$ . When $c > {c}_{0}\left(\lambda ,{K}_{0}\right)$ , the *Precision* increases consistently with N and approaches 1 as N → ∞ while it decreases consistently with N when $c< {c}_{0}\left(\lambda ,{K}_{0}\right)$ . The *Recall* increases consistently and approach to 1 as N → ∞.
Download figure:
Standard image High-resolution image

**Figure 10.** *Precision* and *Recall* versus N when M = c log N and K₀ = 0.4 for ℓ₁-LinR and ℓ₁-LogR when λ = 0.3, where ${c}_{0}\left(\lambda ,{K}_{0}\right)\equiv \frac{c\left(\lambda ,{K}_{0}\right)}{{\lambda }^{2}}\approx 19.4$ . When $c > {c}_{0}\left(\lambda ,{K}_{0}\right)$ , the *Precision* increases consistently with N and approaches 1 as N → ∞ while it decreases consistently with N when $c< {c}_{0}\left(\lambda ,{K}_{0}\right)$ . The *Recall* increases consistently and approach to 1 as N → ∞.
Download figure:
Standard image High-resolution image

Ising model selection using ℓ1-regularized linear regression: a statistical mechanics analysis*

Article metrics

Share this article

Author e-mails

Author affiliations

Author notes

Dates

Abstract