$2^{1296}$ Exponentially Complex Quantum Many-Body Simulation via Scalable Deep Learning Method

For decades, researchers have been developing efficient numerical methods for solving the challenging quantum many-body problem, whose Hilbert space grows exponentially with the size of the problem. However, this journey is far from over, as previous methods all have serious limitations. The recently developed deep learning methods provide a promising new route to solve long-standing quantum many-body problems. We report that a deep-learning-based simulation protocol can achieve the solution with state-of-the-art precision in a Hilbert space as large as $2^{1296}$ for the spin system and $3^{144}$ for the fermion system, using an HPC-AI hybrid framework on the new Sunway supercomputer. With high scalability up to 40 million heterogeneous cores, our applications achieve a measured 94% weak-scaling efficiency and 72% strong-scaling efficiency. This work opens the door to simulating spin models and fermion models on unprecedented lattice sizes with extremely high precision.


JUSTIFICATION FOR THE GORDON BELL PRIZE
Deep-learning-based quantum many-body simulations with state-of-the-art lattice size and energy precision, using nearly 40 million heterogeneous cores with 94% weak- and 72% strong-scaling efficiency. The Hilbert space is as unprecedentedly large as $2^{1296}$ and $3^{144}$ for the spin systems (e.g., the $J_1$-$J_2$ model) and the fermion systems (e.g., the $t$-$J$ model), respectively.

OVERVIEW OF THE PROBLEM
3.1 The quantum many-body problem

Strongly correlated quantum many-body physics is one of the most fascinating research fields in condensed matter physics. When a large number of microscopic particles interact with each other, quantum entanglement is built among them and exotic physical phenomena emerge, such as (high-temperature) superconductivity [50], recognized by Nobel prizes several times [2], quantum Hall effects [48], recognized by a Nobel prize in 1998 [1], quantum spin liquids [6], reviewed in Science in 2020, etc. Solving these problems will greatly deepen our understanding of the basic laws of the natural world and guide us to novel physical phenomena and new quantum materials, which may have great potential applications in energy, information, etc. Although the basic rules of quantum mechanics are known, solving strongly correlated quantum many-particle problems remains extremely challenging, because the Hilbert space of the solution grows exponentially with the size of the problem. Furthermore, perturbative methods, despite their great success in simulating weakly correlated material systems, fail completely for strongly correlated systems. Revealing the fascinating physics of quantum many-body systems therefore mainly relies on non-perturbative numerical methods, such as exact diagonalization (ED), the quantum Monte Carlo method (QMC) [3] and the density matrix renormalization group (DMRG) [49]. However, these methods all have serious limitations: the computational cost of ED grows exponentially with the system size, so ED is limited to fewer than 50 sites [34]; QMC suffers from the notorious sign problem for fermionic and frustrated systems [44]; and DMRG is limited to 1D or quasi-1D systems and does not work well in higher dimensions [36].
To keep the same accuracy, the computational cost of DMRG grows exponentially with the width of the lattice; so far, the width of DMRG simulations is limited to ≈12 [47]. Very recently, the so-called tensor network methods, e.g., the projected entangled-pair state (PEPS) method [46], have been developed, which can simulate fermionic and frustrated systems. Some of the authors of this paper have developed the large-scale PEPS code PEPS++ and implemented it on Sunway TaihuLight [18]. We have demonstrated solving the $J_1$-$J_2$ model on a 24 × 24 lattice with open boundary conditions (OBC), with bond dimension $D = 16$. Periodic boundary conditions (PBC) usually have weaker boundary effects than OBC and are therefore more suitable for simulating quantum many-body systems. However, due to the high scaling in the bond dimension ($O(D^{18})$), there is no report of a PEPS simulation with PBC and $D \ge 6$, which is still too small to capture the essential physics of quantum many-body models.

Deep learning methods for quantum many-body problems

Recently, the rapid progress of artificial intelligence powered by deep learning [37,38] has attracted the science community to apply deep learning to scientific problems: for example, using neural networks to solve differential equations [17], accelerate molecular dynamics [21], predict proteins' 3D structures [22], control nuclear fusion [13], and so on. There are also great efforts in applying deep learning methods to the quantum many-body problems.
Take the quantum spin model as an example: a quantum many-body state is represented by a superposition of spin configurations, where $|S\rangle = |s_1, s_2, \cdots, s_N\rangle$ is a set of spins with one on each site of the lattice, and $N$ is the total number of sites. If the degree of freedom on each site is $d$, the Hilbert space dimension (i.e., the number of $|S\rangle$) of the problem is $d^N$. The weight $W(S)$ of the spin configuration is represented by a neural network. The network parameters are optimized to determine the ground state of the system, i.e., the state of the lowest energy $E_{\rm g.s.} = \min \langle\Psi|\hat{H}|\Psi\rangle / \langle\Psi|\Psi\rangle$, where $\hat{H}$ is the Hamiltonian of the system.
The self-learning procedure for optimizing the neural network is depicted in Fig. 1(c). Firstly, Markov-chain Monte Carlo (MCMC) sampling is performed according to $|W(S)|^2$. When a sample is collected, the forward pass of the neural network gives the local energy $E_{\rm loc}$ and the backward pass gives the local derivatives $O_k$. Secondly, the gradients and the covariance matrix needed for the so-called Stochastic Reconfiguration (SR) optimization method [39] are calculated. Finally, the network parameters are updated by the SR update values $\delta\theta$. Owing to the fast computation speed of neural networks, the wave-functions can be further optimized by a Lanczos step [19].

Table 1. Comparison of current state-of-the-art neural network methods for solving spin models and fermion models. The frustration-free Heisenberg model has been solved by an RBM to rather high precision [8]. For the $J_1$-$J_2$ model, three kinds of methods have been used: 1, a single CNN [27]; 2, a deep CNN [11,24,26,43]; 3, a Gutzwiller-projected wave-function with an RBM [16,33]. Because of the high computational complexity of the Gutzwiller-projected wave-function, the investigated lattice size is limited to $L=18$ [33]. The deep CNN method has lower computational complexity; however, the energy precision depends on the representation ability of the deep CNN [11,26,43]. With a proper deep CNN structure and a parameter number increased to the magnitude of $10^5$, the energy precision reaches the state of the art [24]. In this work, the parameter number is further increased to the magnitude of $4 \times 10^5$, and the investigated lattice size is increased to $L=36$. For the fermion models, because of the high computational complexity of Slater determinants, the investigated lattice size is limited [31,32,41]. In this work, the $t$-$J$ model is solved on the square lattice with size up to $L=12$, using determinant-free deep CNN wave-functions.
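The SR step at the heart of this self-learning loop can be sketched in NumPy. This is a minimal dense-matrix sketch under the usual SR conventions; the symbol names (`O`, `E_loc`) are illustrative, and the production code distributes these matrices across processes:

```python
import numpy as np

def sr_update(O, E_loc, lr=0.01, eps=1e-4):
    """One Stochastic Reconfiguration step.

    O     : (n_samples, n_params) per-sample log-derivatives O_k = d ln W / d theta_k
    E_loc : (n_samples,) local energies
    Returns the parameter update delta_theta = -lr * S^{-1} g.
    """
    n = len(O)
    O_mean = O.mean(axis=0)
    # Covariance matrix S_kk' = <O_k O_k'> - <O_k><O_k'>
    S = O.T @ O / n - np.outer(O_mean, O_mean)
    # Energy gradient g_k = <E_loc O_k> - <E_loc><O_k>
    g = O.T @ E_loc / n - E_loc.mean() * O_mean
    # Diagonal shift regularizes the (often ill-conditioned) S matrix
    S += eps * np.eye(len(S))
    return -lr * np.linalg.solve(S, g)
```

The diagonal shift `eps` is a common stabilization choice; in the distributed setting, the `S @ x = g` solve corresponds to the parallel factorization discussed later.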
Compared to traditional deep learning tasks such as classification, there are some major challenges in solving the quantum many-body problems via neural networks, because one has to obtain an extremely accurate ground state in the exponentially large Hilbert space:

• The neural network's generalization ability should be high enough to represent the quantum state in the exponentially large Hilbert space.
• Double precision of the network parameters is mandatory.
• The ground energy should be the global minimum of the energies. First-order-gradient-based optimizers like Adam, SGD, etc. are not efficient, as they can easily be trapped in local minima. Here, a second-order natural-gradient-like method, the Stochastic Reconfiguration method, is used.
• Finding the solution in the exponentially large Hilbert space requires an extremely large number of MCMC samples, which is also required for precise SR optimization.

CURRENT STATE OF THE ART
Great efforts have been made to solve the quantum many-body problems via deep learning methods [7,9]. Comparisons of current state-of-the-art results are listed in Table 1. A Restricted Boltzmann Machine (RBM) with 323200 network parameters obtained rather high-precision results on the 10 × 10 square lattice [8]. However, this network cannot handle frustrated systems, such as the famous $J_1$-$J_2$ model, which is depicted in Fig. 1(a). Liang and He [27] solved the frustrated $J_1$-$J_2$ model using a single-layered CNN with 11009 parameters with satisfactory precision. To further improve the energy precision on the 10 × 10 lattice, a Gutzwiller-projected fermionic wave-function with an RBM [16] and a deep CNN with a sign rule [11] were used. In 2021, on the "Fugaku" supercomputer, pair-product states with an RBM achieved high accuracy on a lattice as large as 18 × 18 [33]. In 2022, by increasing the deep CNN's parameter number to 106529, the energy precision was greatly improved and the lattice size was increased to 24 × 24 [24]. In this work, the deep CNN's parameter number is further increased to 421953, and the square lattice size is increased to $L=36$. The fermion models are even more challenging. In 2019, by employing a CNN-based Jastrow factor and sign corrections, the spinless fermion model was investigated on a 10 × 10 lattice [41]. There are several investigations of the Fermi-Hubbard model. In 2019, by employing a fully connected neural network (FNN) in the Slater determinant, the model was studied on a 4 × 4 lattice [31]. In 2021, an FNN was used as a determinant-free wave-function to study the model on a 6 × 6 lattice [20]. In 2022, by employing the FNN in the Slater determinants, the lattice size was increased to 4 × 16 [32]. In this work, by using an efficient CNN with 113815 network parameters, we investigate the $t$-$J$ model on lattices up to 12 × 12. The $t$-$J$ model is schematically depicted in Fig. 1(b).

Summary of contributions
We develop a highly efficient and highly scalable method for solving quantum many-body problems to extremely high precision. This is achieved by combining the unprecedented representation ability of deep CNN structures (Fig. 2(a)-(d)) with a highly scalable (Fig. 2(e)(f)) and fine-tuned implementation on heterogeneous Sunway architectures (Fig. 3). With the optimized HPC-AI framework and up to 40 million heterogeneous sw26010pro cores, we solve the ground state of the $J_1$-$J_2$ model on the 36 × 36 lattice, corresponding to a $2^{1296}$-dimensional Hilbert space, and the $t$-$J$ model on the 12 × 12 lattice, corresponding to a $3^{144}$-dimensional Hilbert space.

Neural network innovations
To meet these demanding requirements, we develop two deep CNN structures, for the spin models and the fermion models respectively, denoted CNN1 and CNN2. The computational complexity of a CNN is much lower compared to a fully connected structure. By taking advantage of translational invariance, the CNN can scale to very large lattices.
The structure of CNN1 is depicted in Fig. 2(a)(b); the network is built by stacking the building block denoted in Fig. 2(a) six times [26]. A building block consists of a two-dimensional convolution, a one-dimensional maxpooling and a transposed convolution. To maintain the dimensions, paddings are employed on the input spin lattice and between two building blocks. The padding scheme is based on the periodic boundary conditions. The final output, the wave-function coefficient for the input spin configuration, is the product of the output neurons.
The structure of CNN2 is depicted in Fig. 2(c)(d). CNN2 originates from CNN1, with two major modifications: 1, the input spin lattice is processed by an embedding layer; 2, the one-dimensional maxpooling is performed with stride one. To maintain the dimensions, padding is performed by copying the first several neurons to the end. The deep structure is built by stacking the building block in Fig. 2(c) six times. Finally, a convolution operation decreases the channel number to one, and the final output is the product of the neurons.
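The stride-one maxpooling with "copy the first neurons to the end" padding can be sketched as follows; this is a minimal 1-D NumPy illustration of the dimension-preserving scheme, not the swFLOW operator itself:

```python
import numpy as np

def maxpool1d_circular(x, k=2):
    """1-D max-pooling with stride one and periodic padding: the first k-1
    entries are copied to the end, so the output length equals the input length."""
    xp = np.concatenate([x, x[:k - 1]])                      # circular padding
    return np.array([xp[i:i + k].max() for i in range(len(x))])
```

For example, `maxpool1d_circular([1, 3, 2, 0])` slides a window of two over `[1, 3, 2, 0, 1]`, so the output keeps the input length of four.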
The deep CNN structures used in this work have several important differences compared to other CNN structures for quantum many-body problems [8,11,41]. First, the nonlinearity in our deep CNN is induced by the maxpooling, instead of traditional activation functions. The maxpooling picks up the most important degree of freedom in a convolution filter, which is similar to the coarse-graining process in renormalization group theory [35]. Secondly, the wave-function coefficient is generated by the product of neurons, which differs from the exponential function in RBM-based structures.

Algorithm innovations
The training process of deep learning usually starts from random parameters. A good initial state may greatly accelerate the training process, especially for the quantum many-body problems with their exponentially large Hilbert spaces. A good initial state also helps avoid local minima and therefore drastically reduces the computational cost, especially for large-scale lattices. We develop several techniques to select the initial states.

Transfer learning for large lattices.
Because of the exponentially large Hilbert space, directly training the model on a large lattice starting from random parameters is very difficult. The convolution operation is intrinsically scalable across lattice sizes. Furthermore, the energy convergence on a small lattice is fast, and as the lattice size increases, the ground energy converges with respect to the lattice size. Fig. 2(e) presents the transfer learning technique applied in this work. Initially, the network is trained on the 6 × 6 lattice from randomly initialized parameters. The network parameters are then checkpointed and used as the pre-trained model for optimizing the larger 10 × 10 system. With multiple stages of transfer learning, we finally obtain a good initial state for solving the ground state on the 36 × 36 lattice. Thanks to transfer learning, local minima are avoided and the training steps for large lattices are significantly reduced.
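The property that makes this transfer possible is that convolution weights are independent of the lattice size. A minimal NumPy sketch (a toy single filter with periodic boundaries, not the actual CNN1/CNN2 layers):

```python
import numpy as np

def periodic_conv2d(lattice, kernel):
    """One convolution filter with periodic (PBC) padding. The kernel size is
    independent of the lattice size, so the same trained weights apply to any
    L x L lattice -- the property exploited by transfer learning."""
    L, k = lattice.shape[0], kernel.shape[0]
    out = np.zeros((L, L))
    for i in range(L):
        for j in range(L):
            for a in range(k):
                for b in range(k):
                    out[i, j] += kernel[a, b] * lattice[(i + a) % L, (j + b) % L]
    return out

rng = np.random.default_rng(1)
kernel = rng.standard_normal((3, 3))                        # "trained" on a small lattice
small = periodic_conv2d(rng.choice([-1, 1], (6, 6)).astype(float), kernel)
large = periodic_conv2d(rng.choice([-1, 1], (10, 10)).astype(float), kernel)  # same weights reused
```

The same 3 × 3 kernel is applied unchanged to the 6 × 6 and 10 × 10 lattices; only the output shape follows the input.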

Parallel initial state selection.
Because of the exponentially large Hilbert space, proper initial model parameters and initial spin configurations are beneficial for avoiding local minima, and thus crucial for fast energy convergence. Here we calculate the energy of a randomly initialized CNN by performing MCMC sampling, assuming that a CNN with lower initial energy will converge better after SR optimization. The initial state selection for the $t$-$J$ model is shown in Fig. 2(f). To adapt to the heterogeneous architecture, the paradigm is two-level: different processes have different initial CNN parameters, and the independent chains in each process have different initial spin lattices. After the MCMC process, the CNN parameters and the initial spin configuration with the lowest energy are selected. With these initial state selections, the number of optimization steps is significantly reduced, especially on large lattices.
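The two-level search can be sketched serially as follows; `estimate_energy` stands in for the MCMC energy estimate, and the mapping (processes ↔ parameter sets, chains ↔ `chains_per_set`) is an assumption made for illustration:

```python
import numpy as np

def select_initial_state(param_sets, chains_per_set, estimate_energy, rng, n_sites=36):
    """Initial-state search: each candidate parameter set is paired with several
    random spin lattices, and the pair with the lowest estimated energy is kept."""
    best = None
    for params in param_sets:
        for _ in range(chains_per_set):
            config = rng.choice([-1, 1], size=n_sites)   # random initial spin lattice
            e = estimate_energy(params, config)
            if best is None or e < best[0]:
                best = (e, params, config)
    return best
```

In the distributed version each process evaluates its own candidates and the global argmin is taken with a reduction.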

HPC-AI hybrid framework innovation
Although the quantum many-body problem resides in the realm of dimensionality reduction and feature extraction in a much broader context, current deep learning frameworks and model optimizations are inefficient for it. There are two major difficulties. Firstly, the input size and the network size applied in physics are usually smaller than those of mainstream deep learning tasks, and the insufficient computation leads to low computing efficiency and hardware utilization. Secondly, the global minimum energy is difficult to reach with the prevalent first-order gradient optimizers (i.e., Adam, SGD) and mini-batch-based training. To address these difficulties, we propose a highly optimized HPC-AI hybrid framework on the new Sunway supercomputer and achieve multifaceted improvements.

MCMC-SR parallel framework.
The whole application can be divided into two components: the front-end MCMC sampling and the back-end model optimization. The first stage can be seen as data parallelism, where different processes hold the same model parameters but compute different input data. The second stage handles the MCMC samples and computes the update values $\delta\theta$ of the network parameters in SR by constructing a covariance matrix.
Process-level parallel MCMC. Fig. 3(a) depicts the parallel MCMC sampling among multiple processes. The importance sampling in the Hilbert space is achieved by multiple independent Markov chains. Since the MC sampling tasks are intrinsically parallel, they can be distributed across all participating processes. Inside each process, the model forward and backward passes generate $E_{\rm loc}$ and $O_k$, respectively. The computation of the deep CNN is divided layer by layer in the deep learning framework swFLOW. On sw26010pro, the major computation on the MPE is accelerated by swHMAE with 64 CPEs. A detailed discussion of swFLOW and swHMAE is presented in Sec. 5.4.2. The distributed MCMC sampling of four processes generates $4 \times N_s$ samples, where each process owns $N_s$ samples. Each sample contains $N_p$ values, where $N_p$ is the number of network parameters.
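A single Markov chain of this importance sampling can be sketched as a single-spin-flip Metropolis walk; this is a generic sketch (the paper's proposal moves flip spin pairs), with `log_w` standing in for the network's log-amplitude:

```python
import numpy as np

def metropolis_chain(log_w, n_sites, n_steps, rng):
    """One Markov chain sampling spin configurations with probability |W(S)|^2
    via single-spin-flip Metropolis updates; log_w(S) returns ln|W(S)|."""
    s = rng.choice([-1, 1], size=n_sites)
    samples = []
    for _ in range(n_steps):
        i = rng.integers(n_sites)
        s_new = s.copy()
        s_new[i] = -s_new[i]
        # acceptance ratio |W(S')|^2 / |W(S)|^2
        if rng.random() < np.exp(2.0 * (log_w(s_new) - log_w(s))):
            s = s_new
        samples.append(s.copy())
    return np.array(samples)
```

Because chains are independent, running one chain per process (or per thread-level batch slot) requires no communication during sampling.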
Thread-level parallel MCMC. By increasing the batch size in the model forward and backward passes, the hardware utilization can be further increased. One process maintains multiple independent chains and obtains multiple sampled states in one step, which greatly increases the number of independent chains and improves the overall performance. Meanwhile, multiple repeated CNN forward passes with varying lengths can be replaced by a fixed-length execution, as shown in Fig. 3(b). With the same batch size, the repeated model forward executions can reuse the existing tensor memory and benefit from the efficient static-graph execution mode. This implementation increases the sample number and leads to faster energy convergence.
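The equivalence of per-sample and fixed-shape batched forward passes, which underlies the thread-level scheme, can be shown on a toy network (a single linear layer with a product-of-neurons output; the real CNNs are of course deeper):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 4))   # toy single-layer "network" weights

def forward_one(s):
    """Per-sample forward: product of output neurons, as in the CNN output."""
    return np.tanh(s @ W).prod()

def forward_batch(batch):
    """One fixed-shape batched execution replacing many per-sample calls,
    which is what the static-graph execution mode can exploit."""
    return np.tanh(batch @ W).prod(axis=1)
```

Both paths give identical amplitudes; the batched path simply keeps the tensor shapes fixed so memory can be reused across iterations.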
Hotspot energy calculation. The most time-consuming step in each iteration is collecting the local energies $E_{\rm loc}$, which is of $O(L^2)$ complexity for an $L \times L$ lattice. Specifically, obtaining the correct energy of each lattice requires tremendous computation for sprime_batch ($\sim 2L^2$ forward passes for the Hamiltonian terms) and the sampling gap ($\sim 2L^2$ for the random walking). Therefore the model forward execution is roughly $4L^2$ times that of the backward, which differs from mainstream tasks (where each forward execution corresponds to one backward). For the $J_1$-$J_2$ model on the $L=36$ square lattice, one optimization step with millions of independent samples requires billions of CNN forward executions; thus the CNN forward calculation comprises over 95% of the total execution time.
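This hotspot, the local-energy collection $E_{\rm loc}(S) = \sum_{S'} \frac{W(S')}{W(S)} \langle S'|\hat{H}|S\rangle$, can be sketched for Heisenberg bond terms as follows (a minimal serial sketch with an illustrative `log_w` log-amplitude; the production code batches all $S'$ into one forward pass):

```python
import numpy as np

def local_energy(s, log_w, bonds, J=1.0):
    """E_loc(S) = sum_{S'} W(S')/W(S) <S'|H|S> for Heisenberg bond terms
    J * S_i.S_j; s holds spins +-1 (i.e. 2 S^z), log_w(S) returns ln W(S)."""
    e = 0.0
    for i, j in bonds:
        e += J * 0.25 * s[i] * s[j]        # diagonal S^z_i S^z_j contribution
        if s[i] != s[j]:                   # off-diagonal (spin-exchange) term
            s2 = s.copy()
            s2[i], s2[j] = s[j], s[i]      # proposed configuration S'
            e += J * 0.5 * np.exp(log_w(s2) - log_w(s))
    return e
```

Each anti-aligned bond spawns one flipped configuration $S'$, which is why the number of forward passes per sample scales with the number of Hamiltonian terms, i.e., $O(L^2)$.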
Taking the spin-1/2 $J_1$-$J_2$ model for instance, the lattice sampling in Fig. 3(c) presents an example of a 4 × 4 lattice. The spin lattice is wrapped as a 2D tensor (the input image) for the model input. Each pixel in the image represents the spin value on the corresponding lattice site, where the value of each pixel is $s_i = \pm 1$. The whole image represents a spin configuration, which is a basis state of the wave-function: $|s_1, s_2, \cdots, s_{16}\rangle$. The local energy is closely related to the Hamiltonian; the process for calculating it is denoted in Fig. 3(c). For example, considering the nearest-neighbor interactions, when a spin pair on nearest-neighbor sites ($\langle i,j \rangle$) has different spin values ([-1, 1] for the $J_1$-$J_2$ model or [-1, 0, 1] for the $t$-$J$ model), a new spin configuration with the spin pair flipped is collected. In the figure, three possible spin configurations according to the proposed spin pairs are marked in red. Considering all the terms in the Hamiltonian, the batch size of collected spin configurations is of the magnitude of $O(L^2)$. These spin configurations are then fed into the CNN model, which generates the amplitudes $W(S)$. The local energy is calculated according to $E_{\rm loc}(S) = \sum_{S'} \frac{W(S')}{W(S)} \langle S'|\hat{H}|S\rangle$.

Distributed SR computation. With sufficient samples from the parallel Markov chains, the update values $\delta\theta$ for the network parameters are calculated. In Fig. 3(d), the procedure for computing $\delta\theta$ includes three steps: adjusting the data format, constructing the covariance matrix, and solving a system of linear equations. After MCMC sampling, each process holds a series of samples, and the samples are then organized in a 2D-mesh distributed style. For a system with $P$ processes, the collected samples are divided into $\sqrt{P}$ blocks, and the processes are split into $\sqrt{P}$ groups, where each group contains $\sqrt{P}$ processes. Each data block is scattered to the corresponding processes within its group.
For an instance of four processes (Proc-0 and Proc-1 in group 1, Proc-2 and Proc-3 in group 2), the block exchange between two processes can be illustrated by swapping blocks ② and ③ in group 1 (⑥ and ⑦ in group 2). From the 2D-mesh distributed sample matrix of shape $[4N_s, N_p]$, the covariance matrix of shape $[N_p, N_p]$ is then constructed with parallel matrix multiplication and inverted. The final $\delta\theta$ is obtained by combining this matrix with the energy's first-order gradient. Similar to data-parallel training, all processes share the same $\delta\theta$ via MPI broadcast.

Deep learning software stack.
Fig. 3(e) presents the overall software stack of swFLOW, a deep learning framework supporting the efficient execution of AI applications on the Sunway platform. With a standard computational-graph design, various neural network models can be easily realized. To adapt swFLOW to the sw26010pro processors, the neural network operator executions are accelerated by the Sunway heterogeneous many-core accelerator engine (swHMAE). Typical operators such as convolution and matrix multiplication are supported by libraries such as swDNN and swBLAS. In addition, customized many-core algorithm optimizations, such as Tile and Slice or other customized exclusive OPs, are implemented directly in swHMAE.

Besides the common operators in deep learning, some operators in the CNNs used in this work, such as the PBC padding, need special treatment. The padding scheme is depicted in Fig. 3(e): a 2 × 2 tensor is padded into a 4 × 4 tensor (marked in red). The original tensor is tiled along the height and width dimensions, and we then extract the central part of the extended tensor. As depicted in Fig. 3(e), the input tensor is first split into many sub-tensors, where each sub-tensor has three elements, such as [a,b,c]. Then we use DMA with broadcast to copy each sub-tensor from the main memory to the LDM space. Finally, each CPE uses DMA to move the LDM data to the region of main memory corresponding to the output tensor. In this way, the memory-access bandwidth and the parallelism of the heterogeneous cores can be fully utilized.
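The tile-then-crop periodic padding can be sketched in NumPy; this is a minimal host-side reference of the scheme, not the DMA-based CPE implementation:

```python
import numpy as np

def pbc_padding(x, pad):
    """Periodic padding via the tile-then-crop scheme: tile the 2D tensor 3x3
    along height/width and extract the central region enlarged by `pad`."""
    h, w = x.shape
    tiled = np.tile(x, (3, 3))
    return tiled[h - pad:2 * h + pad, w - pad:2 * w + pad]
```

For a 2 × 2 input with `pad=1` this produces the 4 × 4 periodically wrapped tensor, matching `np.pad(x, 1, mode='wrap')`.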

Physical systems
In this work, we benchmark our methods on two important and representative models, namely the $J_1$-$J_2$ frustrated spin model [Fig. 1(a)] and the $t$-$J$ fermion model [Fig. 1(b)]. The $J_1$-$J_2$ model is a candidate model for quantum spin liquids [29,33], whereas the $t$-$J$ model is the basic model for high-temperature superconductivity. Both models are investigated on $L \times L$ square lattices with periodic boundary conditions.
The Hamiltonian of the $J_1$-$J_2$ model reads:
$$H = J_1 \sum_{\langle i,j \rangle} \mathbf{s}_i \cdot \mathbf{s}_j + J_2 \sum_{\langle\langle i,j \rangle\rangle} \mathbf{s}_i \cdot \mathbf{s}_j,$$
where $\langle i,j \rangle$ and $\langle\langle i,j \rangle\rangle$ indicate the nearest and next-nearest neighbouring spin pairs, and $\mathbf{s}_i$ is the spin operator on the $i$-th site. We set $J_1 = 1$ and $J_2 = 0.5$ throughout the investigations.
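As a small concreteness check of this Hamiltonian, the following sketch exactly diagonalizes a four-site $J_1$-$J_2$ ring (a 1D toy used for illustration, not the 2D lattice studied here); at $J_2 = 0.5 J_1$ the exact ground energy of the ring is $-2J_1 + J_2/2 = -1.75$:

```python
import numpy as np

# spin-1/2 operators (Pauli matrices / 2)
sx = np.array([[0, 1], [1, 0]], dtype=complex) / 2
sy = np.array([[0, -1j], [1j, 0]]) / 2
sz = np.array([[1, 0], [0, -1]], dtype=complex) / 2

def two_site(op, i, j, n):
    """Embed op (x) op acting on sites i and j into the full 2^n space."""
    mats = [np.eye(2, dtype=complex)] * n
    mats[i] = op
    mats[j] = op
    full = mats[0]
    for m in mats[1:]:
        full = np.kron(full, m)
    return full

def heisenberg_bond(i, j, n):
    return sum(two_site(op, i, j, n) for op in (sx, sy, sz))

# four-site J1-J2 ring at J2/J1 = 0.5
n, J1, J2 = 4, 1.0, 0.5
H = J1 * sum(heisenberg_bond(i, (i + 1) % n, n) for i in range(n)) \
  + J2 * (heisenberg_bond(0, 2, n) + heisenberg_bond(1, 3, n))
e0 = np.linalg.eigvalsh(H)[0]
```

The $2^n$ growth of `H` is exactly the exponential wall that makes ED infeasible beyond ~50 sites and motivates the neural-network approach.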
The Hamiltonian of the $t$-$J$ model is:
$$H = -t \sum_{\langle i,j \rangle, \sigma} \left( c^{\dagger}_{i,\sigma} c_{j,\sigma} + {\rm h.c.} \right) + J \sum_{\langle i,j \rangle} \left( \mathbf{S}_i \cdot \mathbf{S}_j - \frac{1}{4} n_i n_j \right),$$
where $\langle i,j \rangle$ indicates the nearest neighbouring pairs. The operator $c^{\dagger}_{i,\sigma}$ creates an electron of spin $\sigma$ on site $i$; $\mathbf{S}_i$ and $n_i$ are the electron spin moment and charge density operators on site $i$. We encode the fermions into spins via the Jordan-Wigner transformation [4]. In the simulations, $t=1$, $J=0.4$ and the hole doping $\delta_h = 0.125$ are used.
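In its standard one-dimensional form (written here for a single fermionic mode index $j$ along the chosen site ordering; for the $t$-$J$ model each site-spin pair is mapped to one such mode, with the precise convention given in Ref. [4]), the Jordan-Wigner encoding reads

```latex
c_{j} \;\longrightarrow\; \Big(\prod_{k < j} \sigma^{z}_{k}\Big)\, \sigma^{-}_{j},
\qquad
c^{\dagger}_{j} \;\longrightarrow\; \Big(\prod_{k < j} \sigma^{z}_{k}\Big)\, \sigma^{+}_{j},
```

where the string of $\sigma^z$ operators restores the fermionic anticommutation relations between different sites, so the fermion problem becomes a spin problem amenable to the same CNN wave-functions.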

HPC System and Environment
The sw26010pro processor is an upgraded version of the sw26010 processor used in the Sunway TaihuLight supercomputer. Fig. 4 presents the architecture of the sw26010pro, which consists of six core groups (CGs), each attached to a ring network. Each CG contains one management process element (MPE, control core) and a cluster of 64 computing process elements (CPEs, computing cores) arranged in an 8 × 8 mesh.
Generally, the MPE is responsible for managing all computing resources and global memory, while the CPEs execute the tasks assigned by the MPE. Besides, a local directive memory (LDM) is attached to each CPE, similar to the shared memory in a GPU.

Figure 4. Architecture of the sw26010pro many-core processor.
The software environment has been greatly improved on the new Sunway supercomputer, especially for AI applications [15,25,28]. Meanwhile, the traditional high-performance algebra libraries, such as BLAS, LAPACK, and the distributed ScaLAPACK [10], are also optimized. The new generation of the Sunway supercomputer delivers great potential to attack various challenges in both HPC and AI.

Measurement Methodology
To achieve high-accuracy ground-state energies, double-precision floating-point numbers are required for faithful quantum system simulations. Since the overall software stack is very complicated, ranging from deep learning (swFLOW, swHMAE and swDNN) to the classic HPC environment (swBLAS, swLAPACK, swScaLAPACK), manually counting all double-precision arithmetic instructions is impractical. Instead, we collect the number of double-precision arithmetic operations using the hardware performance monitor provided by the vendor of the new Sunway supercomputer. The evaluation is conducted on the new Sunway supercomputer, and the specific experimental parameters are listed in Table 2. Evaluations from the aspects of both computing performance and the physical systems are elaborated. The many-core CPE optimization for a single process and the scalability analysis on the Sunway supercomputer are presented below, as well as the detailed analysis of the parallel CNN-based MCMC and the distributed SR computation. Meanwhile, we further evaluate the model training for achieving the globally lowest energy, as well as the comparison with state-of-the-art results.

Table 3. Elapsed time (in seconds) of the MCMC and SR kernels for varying batch sizes and process counts.

                      MCMC time (s)                  SR time (s)
Params   Batch    10k   40k   90k  160k        10k   40k   90k  160k
106529      8      54    56    56    57         58    41    38    39
           16     106   106   108   110         66    48    46    47
           32     202   206   210   208         76    63    59    63
           64     408   408   410   413         96    95   115   121
          128     705   702   708   711        135   158   167   213
421953      8     102   100   104   108        666   450   331   300
           16     212   212   216   216        746   489   421   433
           32     384   386   384   388        909   604   515   540
           64     748   750   754   756       1087   843   766   867
          128    1288  1280  1292  1300       1459  1338  1259  1521

Many-core acceleration on SW26010pro
On the new Sunway supercomputer, most of the computing performance of each CG is provided by its 64 CPEs; to achieve high performance, it is crucial to use all of them effectively. For the common compute-intensive OPs, the swDNN library gains 538x, 211x and 505x speedups for Conv2d, Conv2dBackInput and Conv2dBackFilter, respectively. As previously introduced, the neural networks used in this paper involve many memory-intensive OPs (i.e., PBC_padding, Conv1d). Therefore, the special models proposed for physics are much restricted by the gap between the computing capability and the memory-access bandwidth. Here we separately optimize these operators on the new Sunway heterogeneous processors, especially several memory-intensive operators. Table 4 presents the test cases and results for the typical memory-intensive operators. Compared to the MPE baseline version, swHMAE achieves great performance improvements: the many-core speedup ratio reaches 34.45 for Tile, 39.34 for Slice, 32 for PBC, and 52.77 for GradPBC. The GradPBC operator obtains a better speedup, which originates from its computing pattern: it is not a pure memory-access operator but includes a certain amount of accumulation, which increases the utilization of the CPEs. With these highly optimized kernels, the hotspot energy calculation achieves a considerable overall speedup over the original MPE version.

Scaling
For the whole-application test, we record the total execution time of a complete training iteration, which includes four kernels: the MCMC stage includes the CNN-based random walking and importance sampling, and the SR stage includes building the covariance matrix and the matrix inversion. The CNN-based sampling (Fig. 3) is essentially independent among all participating processes, so there is little communication overhead as processes and chains increase. In detail, the CNN computation is related to the lattice size ($\sim 2L^2$ forward passes for the proposed spin pairs and $\sim 2L^2$ for the model forward calculation). On the contrary, the SR computation is determined by the number of network parameters and processes, but is not related to the lattice size (Table 3). The MCMC can be perfectly parallelized and exhibits excellent scalability. There are two hotspots in the SR optimization: matrix multiplication (gemm) and matrix factorization (potrf). The computation of gemm increases with the sizes of the rows and columns, which are the total number of Markov chains and the number of parameters. Meanwhile, the computation of potrf is only related to the number of parameters, completely independent of the system scale or the Markov chain number. For the model with 106529 parameters and a small batch size, the computation amount of potrf is much higher than that of gemm, thus dominating the execution time of SR. The SR time decreases as the processes increase, thanks to the parallel potrf (Fig. 5). However, the gemm cost increases with the batch size (Fig. 5) and gradually dominates the execution time. Since the parallel potrf saturates for fixed model parameters, the SR time eventually increases with increasing processes. The 421953-parameter model shows a similar phenomenon, except that the computation amount of potrf increases dramatically (potrf is sensitive to the parameter number), and more processes are required to reach the performance saturation point.

Strong scaling.
To examine the strong scalability, the Markov chains (batch size) per process are decreased in proportion to the process number, so that the total number of Markov chains is kept at around 6 million.
The detailed test results are shown in Fig. 6(a), where the elapsed times of the four kernels are presented, together with the parallel efficiencies of two test cases: the 10 × 10 $J_1$-$J_2$ model and the 12 × 12 $t$-$J$ model. As shown in the columns, the elapsed time of MCMC-energy decreases from 141s to 38s, achieving a parallel efficiency of 97% when scaling to 39,546,000 cores. The MCMC-walking time decreases from 70s to 24s, achieving 77% parallel efficiency. The SR part involves distributed, parallel ScaLapack computation; its performance relies strongly on communication, and the elapsed time decreases from 76s to 49s with a 40% parallel efficiency. Overall, the 10 × 10 case obtains a 2.6x throughput improvement and a parallel efficiency of 69% when scaling to 39,546,000 cores, while the 12 × 12 case shows slightly better strong scalability than the 10 × 10 case (benefiting from the fixed SR time and the increasing proportion of MCMC time), obtaining a 2.8x throughput improvement and 72% parallel efficiency.

Weak scaling.
The detailed weak-scalability results are drawn in Fig. 6(b), including the elapsed times of MCMC (random walking, energy computation) and SR (covariance matrix and matrix inversion), as well as the parallel efficiencies of two test cases with the 12 × 12 and 16 × 16 $t$-$J$ models.

Table 5 presents the average energies of the best initial states found via different numbers of Markov chains. The energies are achieved by CNN2 with 113815 parameters. Clearly, the larger the number of Markov chains involved in the parallel search, the higher the quality of the initial states found. Because of the large Hilbert space, finding high-quality initial states for a large lattice is significantly harder than for a small lattice; however, the difficulty can be overcome by increasing the chain number. For small-lattice quantum systems, the quality of the initial state has relatively little influence on energy convergence. With the selected initial states, after 100 SR steps, the energy decreases from -0.3517 to -0.4523 for the 12 × 12 lattice, from -0.3483 to -0.4599 for the 16 × 16 lattice, and from -0.2900 to -0.4173 for the 24 × 24 lattice.

Optimization Results
The optimized energies of the $J_1$-$J_2$ model and the $t$-$J$ model are presented. For the $J_1$-$J_2$ model in the maximally frustrated region $J_2 = 0.5 J_1$, the convergence of the energy with respect to lattice size is depicted in Fig. 7. In the figure, the energies are obtained by the CNN1 with 106,529 parameters. The solid line is fitted with the function $E(L) = a/L^3 + E_\infty$. Other state-of-the-art results are also shown in the figure for comparison. We compare the energies in this work to the results obtained by the PP+RBM method on the "Fugaku" supercomputer [33]. When $L=10$, the energy in this work is -0.497468, which is $3.2 \times 10^{-4}$ higher than the -0.497629 reported in [33], and $5.8 \times 10^{-4}$ lower than the -0.49717 reported in [24]. For $L=18$, the ground state energy obtained in this work is -0.496500, which is $4.5 \times 10^{-4}$ lower than the -0.496275 reported in [33]. Thanks to the excellent scalability of the CNN1, we are able to study the model on sizes significantly larger than previous works. The ground state energies of the $t$-$J$ model are presented in Table 6. Unfortunately, so far there are no reference ground state energies for the $t$-$J$ model with PBC; we therefore compare the results to those with OBC calculated via PEPS in Ref. [14]. In the table, the energies are obtained by the CNN2 with 113,815 parameters. When $L=8$, the energy achieved by CNN2 is -0.60965, which is $3 \times 10^{-3}$ higher than that reported in [14]. When $L=12$, the energy achieved by CNN2 is -0.63217, which is $3.8 \times 10^{-3}$ lower than that reported in [14]. We note that energies under different boundary conditions cannot be compared directly, as they may differ slightly; the results shown here are only a demonstration of the effectiveness of the CNN2 for fermion systems.
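The finite-size extrapolation is a linear least-squares problem once the per-site energy is assumed to follow the standard form $E(L) = a/L^3 + E_\infty$ in the variable $1/L^3$. A minimal sketch with synthetic placeholder data (the $(L, E)$ pairs below are illustrative, not the paper's measurements):

```python
import numpy as np

# Synthetic per-site energies obeying E(L) = a / L**3 + E_inf exactly,
# with a = 0.45 and E_inf = -0.4968 chosen as illustrative values.
L = np.array([10.0, 12.0, 16.0, 18.0])
E = -0.4968 + 0.45 / L**3

# Linear least squares in x = 1 / L**3: E = a * x + E_inf.
x = 1.0 / L**3
A = np.vstack([x, np.ones_like(x)]).T
(a, E_inf), *_ = np.linalg.lstsq(A, E, rcond=None)
print(f"a = {a:.3f}, E_inf = {E_inf:.4f}")
```

The intercept $E_\infty$ is the extrapolated thermodynamic-limit energy, which is what the solid line in Fig. 7 conveys.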

IMPLICATIONS
In this work, we report simulations of challenging quantum many-body problems that combine a state-of-the-art deep-learning method with an efficient implementation on a next-generation computational platform. We have obtained the ground state energies of the $J_1$-$J_2$ model and the $t$-$J$ model with accuracy that surpasses current state-of-the-art results. With this ability, we are able to simulate a large class of important physical models with high enough accuracy to draw confident conclusions about the physics. The method therefore opens a promising new path to solving the extremely important and challenging problems that have resisted solution for more than half a century, and would help us understand exotic physical phenomena such as quantum spin liquids, high-temperature superconductivity [12], supersolids [5], heavy fermions [40], fractional quantum Hall effects [23,42,45], and many more, which is not only essential for the understanding of fundamental physics but also has many important applications.
On the other hand, this work demonstrates the effectiveness of solving the quantum many-body problem via deep learning methods on a supercomputer platform, a setting completely different from traditional machine learning tasks. These applications represent a large class of scientific problems that are more demanding in both the representation ability of the network and the accuracy requirements of the results, and they put forward higher requirements for efficient optimization algorithms and implementations. This framework may also apply to other problems, e.g., the simulation of quantum circuits on classical supercomputers. Currently, tensor network methods have shown high efficiency for quantum circuit simulations on supercomputers [30] on lattices as large as $10 \times 10$. The scalable neural-network-based quantum state representation may be capable of performing quantum circuit simulations for an even larger number of qubits.
Our framework is not limited to Sunway; it can be easily ported to other supercomputing platforms. However, as computational power increases, the overall application performance will be limited by the gap between computing capability and memory access bandwidth. Here we take the computational power vs. memory bandwidth ratio (FLOP/Byte) to estimate the potential performance. Because single precision does not meet the requirements for achieving faithful ground state energies, we focus on the double precision FLOP/Byte ratio. With the double precision support of tensor cores, the FLOP/Byte ratio on the A100 GPU is 9.56 (19.5 TFLOPS / 2039 GB/s = 9.56 FLOP/Byte). The double precision FLOP/Byte ratio on the Fujitsu A64FX CPU is 3.3 (3.3792 TFLOPS / 1024 GB/s = 3.3 FLOP/Byte). Considering that the performance of the LDM used in sw26010pro is not competitive with the HBM2 used in the A100 and A64FX, the performance of our framework could be boosted further on the Summit or Fugaku systems.
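The arithmetic-intensity ratios quoted above are straightforward to reproduce; the helper name below is ours, and the peak figures are the ones stated in the text:

```python
def flop_per_byte(tflops, gb_per_s):
    """Double-precision arithmetic intensity: peak FLOP rate / memory bandwidth."""
    return tflops * 1e12 / (gb_per_s * 1e9)

a100 = flop_per_byte(19.5, 2039)     # NVIDIA A100, FP64 tensor cores + HBM2
a64fx = flop_per_byte(3.3792, 1024)  # Fujitsu A64FX CPU
print(f"A100: {a100:.2f} FLOP/Byte, A64FX: {a64fx:.2f} FLOP/Byte")
```

A lower FLOP/Byte ratio means the memory system can feed the compute units more easily, which is why a bandwidth-bound workload benefits from faster memory such as HBM2.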