The MPI + CUDA Gaia AVU–GSR Parallel Solver Toward Next-generation Exascale Infrastructures

We ported to the GPU with CUDA the Astrometric Verification Unit–Global Sphere Reconstruction (AVU–GSR) Parallel Solver developed for the ESA Gaia mission, by optimizing a previous OpenACC porting of this application. The code aims to find, with a [10, 100] μarcsec precision, the astrometric parameters of ∼108 stars, the attitude and instrumental settings of the Gaia satellite, and the global parameter γ of the parametrized Post-Newtonian formalism, by solving a system of linear equations, A × x = b , with the LSQR iterative algorithm. The coefficient matrix A of the final Gaia data set is large, with ∼1011 × 108 elements, and sparse, reaching a size of ∼10–100 TB, typical for the Big Data analysis, which requires an efficient parallelization to obtain scientific results in reasonable timescales. The speedup of the CUDA code over the original AVU–GSR solver, parallelized on the CPU with MPI + OpenMP, increases with the system size and the number of resources, reaching a maximum of ∼14×, >9× over the OpenACC application. This result is obtained by comparing the two codes on the CINECA cluster Marconi100, with 4 V100 GPUs per node. After verifying the agreement between the solutions of a set of systems with different sizes computed with the CUDA and the OpenMP codes and that the solutions showed the required precision, the CUDA code was put in production on Marconi100, essential for an optimal AVU–GSR pipeline and the successive Gaia Data Releases. This analysis represents a first step to understand the (pre-)Exascale behavior of a class of applications that follow the same structure of this code. In the next months, we plan to run this code on the pre-Exascale platform Leonardo of CINECA, with 4 next-generation A200 GPUs per node, toward a porting on this infrastructure, where we expect to obtain even higher performances.


Introduction
In this epoch of technological evolution, the size of the problems to solve in several contexts is rapidly increasing and can also require up to ∼10-100 PB of storage.To allow the analysis of these Big Data, novel parallelization techniques have to be continuously defined to find solutions in human-size timescales.The architecture of the infrastructures is also consequently changing, becoming increasingly heterogeneous (Carpenter et al., 2022), to accomplish the necessity of optimally computing data of these sizes, going toward the (pre-)Exascale era.The supercomputers will require an increasing number of computational nodes, which will have, in turn, a RAM memory organized in a multi-levels hierarchy of nonvolatile memories and hosts less performant than the accelerators (such as GPUs or FPGAs) that will be increasingly employed for calculations and will have an increasing memory and number of streaming multiprocessors.This configuration will also be likely to achieve the target of Green Computing, namely to process this amount of Big Data without excessively increasing the energy consumption while obtaining a high performance.Moreover, the infrastructures will need increasingly faster bridges between the CPU and the accelerators, to reduce the host-to-device (H2D) and the deviceto-host (D2H) data transfers bottleneck in the applications, and storage areas defined with parallel filesystems to guarantee a faster access to the data (Carpenter et al., 2022).Computer clusters such as Marconi100 (M100) of CINECA1 and JUWELS of Forschungszentrum Jülich2 are already going in this direction but a turning point will be provided by next-generation pre-Exascale infrastructures, such as the CINECA platform Leonardo3 , which started to be operative last November.
A typical science case that might involve a large amount of data is the inverse problem, that consists in estimating the parameters of a model from a set of observational measurements.Two possible approaches for this task are the frequentist and the bayesian ones.Concerning the former approach, one of the exploited computational techniques is the LSQR iterative algorithm, to solve large, ill-posed, overdetermined, and possibly sparse systems of equations (Paige and Saunders, 1982a,b).This algorithm is employed in several contexts, such as medicine (Bin et al., 2020;Guo et al., 2021), geophysics (Joulidehsar et al., 2018;Liang et al., 2019a,b), geodesy (Baur and Austen, 2005), industry (Jaffri et al., 2020), and astronomy (Borriello et al., 1986;Van der Marel, 1988;Naghibzadeh and van der Veen, 2017;Becciani et al., 2014;Cesare et al., 2021Cesare et al., , 2022c,a,b),a,b).For a more in-depth discussion about the LSQR algorithm and other LSQR-based applications and libraries, see Section 2 of Cesare et al. (2022b).
As an example, in the astronomy context, this algorithm is employed by the Gaia Astrometric Verification Unit-Global Sphere Reconstruction (AVU-GSR) Parallel Solver.This code was developed for the ESA Gaia mission (Gaia Collaboration et al., 2023) under the Data Processing and Analysis Consortium (DPAC) (Mignard and Drimmel, 2007), i.e., the scientific community of the mission, funded by the national space agencies, in charge of the definition of the data reduction pipelines (Vecchiato et al., 2018).The code has been in production since 2014 on M100 cluster according to an agreement between Istituto Nazionale di Astrofisica (INAF) and CINECA, with the support of the Italian Space Agency (ASI).
The Gaia AVU-GSR code solves with the LSQR algorithm an overdetermined system of linear equations (Becciani et al., 2014;Cesare et al., 2022b), where A is the coefficient matrix, and b and x are the arrays of the known terms and of the solution, respectively.The matrix A is sparse and it might contain ∼10 11 × 10 8 elements for the expected final dataset of Gaia.Even only considering its non-zero coefficients, it will occupy a large amount of memory (∼10-100 TB).By solving this system, the AVU-GSR code finds, with an accuracy in the range of 10-100 µarcsec and of 10-100 µarcsec year −1 , the astrometric parameters (parallaxes, right ascension, declination, and proper motions along these two directions) of ∼10 8 stars in the Milky Way, the so-called primary stars (Vecchiato et al., 2018).The Gaia AVU-GSR code is a verification module of the same solution found with the software Astrometric Global Iterative Solution (AGIS; O' Mullane et al. 2011;Lindegren et al. 2012) adopting a different algorithm, to make the determination of the astrometric parameters more robust.Besides the astrometric parameters, the Gaia AVU-GSR solver finds the attitude and instrumental specifications of the Gaia satellite, and the global parameter γ of the Parametrized Post-Newtonian (PPN) formalism, with the same precision around [10,100] µarcsec.The high accuracy of these parameters is essential to properly investigate the formation and the evolution of the Milky Way (e.g., Giammaria et al., 2021;Krolikowski et al., 2021) and to test Einstein's theory of General Relativity (e.g., Vecchiato et al., 2003;Hees et al., 2018;Crosta et al., 2020;Butkevich et al., 2022).
The LSQR algorithm is the bulk of the AVU-GSR solver and it works by calculating, at each iteration, the iterative estimates of the known terms and of the solution arrays with the aprod 1 and aprod 2 functions: and which are the most computational demanding parts of the LSQR procedure, representing more than 90% of the entire calculation.
The last official in-production version of the Gaia AVU-GSR code was entirely parallelized on the CPU with a hybrid MPI + OpenMP approach.In Cesare et al. (2022b), we explored the feasibility of a GPU porting of the application by adopting a preliminary approach, where we replaced the OpenMP directives with the OpenACC ones.With this porting, the speedup of the OpenACC code over the OpenMP code, both run on M100, was of ∼1.5.In this paper, we present an optimization of the GPU parallelization of the code starting from the results of our first porting, where we replace the high-level parallelization approach, using OpenACC, with a low-level one, using CUDA (Cesare et al., 2022a).This implied a reorganization of several parts of the code but a substantial performance boost of ∼14x over the OpenMP version, as tested on M100.This speedup might further improve on Leonardo, with GPUs having a larger memory and number of streaming multiprocessors than on M100, which is an optimistic estimate in perspective of a future porting on this platform.The CUDA code also showed to achieve a great numerical stability and to obtain parameters with the required accuracy, reasons for which it was put in production on M100.
The paper develops across the following sections.Section 2 summarizes the general structure of the Gaia AVU-GSR code, Section 3 describes the previous versions of the code, i.e., the MPI + OpenMP (Section 3.1) and the MPI + OpenACC (Section 3.2) ones, and Section 4 details the CUDA porting of the Gaia AVU-GSR code.A performance comparison with the OpenACC porting is presented throughout Section 4. Section 5 compares the performance of the MPI + CUDA and the MPI + OpenMP codes on M100 for a set of systems with increasing size and Section 6 compares the solutions of these systems to verify their consistency and quantify the accuracy of the obtained solutions.At last, Section 7 discuss the main results of the paper and presents the future analyses to be developed.

The structure of the Gaia AVU-GSR code
The black part of Algorithm 1 summarizes the general structure of the Gaia AVU-GSR code, which is common to the OpenMP, OpenACC, and CUDA versions.The preparatory phase consists in importing from binary files the quantities necessary to solve the system (e.g., the coefficient matrix and the known terms; line 1).To accelerate the convergence speed of the iterative procedure, the system is preconditioned before the starting of the LSQR algorithm.Specifically, we normalized the parameters of each column by the norm of the column itself (lines 2-3).The normalization factors of all the columns are stored in a 1D array, p.The solution is re-multiplied by p after the end of the LSQR algorithm (line 27).Then, the initial guess of the solution to be iteratively found with the LSQR algorithm is computed through the aprod 2 function (see Eq. (3); line 13).Each MPI process calculates a part of the solution that is then reduced among all the MPI processes (line 15).
After these passages, the LSQR procedure starts.The LSQR algorithm is a while loop (lines 17-26) that iterates the solution up to a convergence condition or until a maximum number of iterations set at runtime is reached.At each iteration, the two main steps are the execution of the aprod function in the modes 1 and 2 (lines 18 and 22).The aprod 1 (Eq.( 2)) provides the iterative estimate of the known terms b for each equation of the system and for a set of constraints equations (Vecchiato et al., 2018), required since the system is overdetermined.After the calculation of the aprod 1, the b array is reduced among the MPI processes.Then, the aprod 2 (Eq.( 3)) provides the iterative estimate of the solution array x.Also for this step, the constraints equations are defined and the solution is reduced among the MPI processes.At the end of each iteration, the errors (variances) on the unknowns and the covariances between the different couples of unknowns are calculated (line 26).The convergence condition is achieved in the least-squares sense, when the residuals r i = b i −A×x i , estimated at the i-th iteration, go below a given tolerance, set to the machine precision (∼10 −16 on M100).
The coefficient matrix of the system A is large and has a high sparsity degree.Specifically, for the expected final dataset of Gaia, the matrix might contain ∼10 11 × 10 8 elements (see Sect. 1).The rows of A, i.e., the equations of the system, represent the observations of the Milky Way stars, where each star is observed N Obsperstar ∼ 10 3 times, besides the constraints equations.The number of the columns of A is instead the number of unknowns to solve.
For each row, the coefficients are divided in their astrometric, attitude, instrumental, and global sections.The astrometric part of A contains N Astro × N Stars coefficients per row.N Stars is the number of stars considered in the system, in the range of [10 6 , 10 8 ], and 0 ≤ N Astro ≤ 5 is the number of astrometric coefficients per star and the number of non-zero astrometric coefficients per row.The total number of non-zero astrometric parameters is of N Obsperstar × N Astro × N Stars ∈ [10 9 , 10 12 ] and they represent the ∼90% of the coefficient matrix A, where they follow a blockdiagonal structure of N Stars blocks.The N Obsperstar rows of each block are the astrometric parameters observed for the same star and the number of columns of each block is equal to N Astro .In our current modelization, the attitude part has N Att = 12 nonzero coefficients per row, organized in N Axes = 3 blocks of N ParAxis = 4 elements separated by N DFA zeros, where N Axes = 3 is the number of axes of the satellite attitude, N ParAxis = 4 is the number of nonzero coefficients per axis, and N DFA is the number of degrees of freedom of each axis.In the instrumental part, we have 0 ≤ N Instr ≤ 6 nonzero coefficients per row, distributed without a particular scheme.So far, we have considered in the global part only N Glob = 1 coefficient, the γ parameter of the PPN formalism, or we have run without computing a global part.
To operate in human-size timescales, the calculations are performed with a dense coefficient matrix A d that only contains the nonzero coefficients of A for each section.Therefore, the number of coefficients per row passes from ∼10 8 to a maximum of N par = 24, in our current modelization, and the total number of elements of A d is of ∼10 11 × 10 1 .The indexes that the astrometric, the attitude, and the instrumental coefficients of A d had in the original matrix A are stored in two one-dimensional integer arrays, M i (for the astrometric and attitude parts) and I c (for the instrumental part), to map the correct positions of these parameters in the matrix A. For further details about the structure of the coefficient matrix, see Sections 3 and 4 of Cesare et al. (2022b).
The system of equations is parallelized over the MPI processes such that different subsets of the total number of observations, n, are assigned to each MPI process.The one-dimensional integer array N [nproc] stores the number of observations assigned to each MPI process, where nproc is the number of MPI processes defined at runtime. Figure 1 represents the system of equations parallelized on four MPI processes in one node of a computer cluster, where different colors refer to diverse MPI processes.
Concerning the rows of A, whereas the observation equations are distributed among the MPI processes throught the N array, the constraints equations, placed at the bot- Different colors (yellow, light green, dark green, and blue) refer to different MPI processes or processing elements (PE).The block-diagonal part in the left side of the coefficient matrix illustrates its nonzero astrometric section.In the middle panel, the four square blocks diagonally placed, and labeled as "Astrometric" represent the astrometric part of the solution array, distributed among the MPI processes.Instead, the four dark gray aligned blocks, labeled as "Att+Instr+Glob", represent the attitude, instrumental, and global portions of the solution array, replicated on each MPI process, as written above.At the end of each iteration i, a reduction of the replicated portions of x is performed.tom of the system, are replicated on each MPI process.For this reason, after the execution of the aprod 1 function, the only part of the known terms array b that has to be reduced among the MPI processes is the one related to the constraints equations.Given that the constraints equations represent a negligible fraction of the total number of equations, their replica was more convenient compared to their distribution among the MPI processes, which would have implied a rearrangement of the code.Concerning the columns of A, the astrometric part is distributed among the MPI processes whereas the other three parts are replicated on them.Therefore, after the execution of the aprod 2 function, only the attitude + instrumental + global parts of the solution array x are reduced among the MPI processes.The regular blockdiagonal structure of the astrometric parameters made their distribution among the MPI processes more intuitive.Instead, the other three sections do not follow a regular pattern, which would have made their distribution on the MPI processes less trivial.Since, the attitude + instrumental + global parts only represent the 10% of the total system, their replica does not imply a substantial slowdown of the code.

The OpenMP parallelization
In the in-production code, the observations assigned to each MPI process are further parallelized over the OpenMP threads.The left panels of Algorithms 2 and 3 highlights in boldface the regions of the code parallelized with OpenMP, namely the aprod 1 and 2 functions.In the aprod 1, we parallelized the for loop that iterates on the number of observations in each MPI process, N [pid], with the #pragma omp for directive, where pid identifies the rank of the MPI process.Instead, in the aprod 2 the most external for loop iterates from where tid is the ID number of the OpenMP thread, that goes from 0 to nth, the total number of threads, and N t is a one-dimensional integer array that contains the observations computed by each thread tid.Specifically, N t [tid][0] and N t [tid][1] are the first and the last observation computed by the thread tid.

The OpenACC parallelization
In our preliminary porting to a GPU environment, the OpenMP parallelization model is replaced by Ope-nACC (Cesare et al., 2021(Cesare et al., , 2022c,b),b).The middle panels of Algorithms 2 and 3 highlights in boldface the correspondent parts of the left panels, parallelized with OpenACC instead of OpenMP.For reasons of optimization, we divided the aprod 1 function in four parallel regions, one for each section of the system, and we organized the aprod 2 in a single parallel region.Each parallel region is enclosed within a #pragma acc parallel directive, which starts a parallel execution on the current device.In the aprod 1, the variable sum is defined within the private clause (lines 4, 14, 27, and 37 of Algorithm 2), which ensures each GPU thread to have a local copy of it.In both aprod 1 and aprod 2, we parallelized the most external for loop in each parallel region with the #pragma acc loop directive (lines 6, 16, 29, and 39, in Algorithm 2, and line 6, in Algorithm 3).These for loops iterate on the observations related to each MPI process from 0 to N [pid], also in the aprod 2 function, where the array N t [ is no more needed since we do not use any longer the OpenMP threads.In the aprod 2, the #pragma acc atomic directive (lines 10, 17, 21, and 25, of Algorithm 3) prevents the GPU threads to simultaneously overwrite the same elements of the array x, i.e., it avoids a data race condition.
We tested the performance of the MPI + OpenACC code on M100, with 4 NVIDIA Volta V100 GPUs per node having 16 GB of memory each.Figure 2a shows the output of the NVIDIA Nsight System profiler 4 correspondent to one iteration of the LSQR algorithm, highlighted with a large transparent light green box, for a system occupying 50 GB of memory.The system was run on four MPI processes in one node of M100 and the shown profiler output refers to one of the four processes.The time scale at the top of the panel shows the absolute time from the beginning of the program execution and the small yellow rectangle at the bottom-right corner of the light green box shows the iteration time, equal to 1.347 s.Within the light green box, the profiler shows the code regions parallelized on the GPU (blue), dedicated to data transfers (green and purple, for the H2D and D2H directions), and to calculation on the CPU (blank spaces between different regions).The blue regions labeled as "b plus..." and as "x plus..." show the aprod 1 and 2 functions, respecitvely.The time fractions of one iteration due to GPU computation is ∼70%, whereas the ones due to data transfers and CPU computation are of ∼15%.This shows that the code is compute bound, namely data copies are subdominant compared to computation.This is essential for a GPU-ported application that, if data movements are not properly managed, can result in an even worse performance than the correspondent CPU version.
With this parallelization, the OpenACC code accelerates of ∼1.5x over the OpenMP version.Specifically, the speedup is due to the porting of the aprod 2 region, that accelerates of ∼3.6x, whereas the aprod 1 region looses in performance of ∼0.8x (Cesare et al., 2022b).Further opti-4 https://developer.nvidia.com/nsight-systemsmizations are possible, as better detailed in the following sections.

The CUDA porting of the Gaia AVU-GSR code
To further improve the performance of the Gaia AVU-GSR solver, we decided to change the parallelization approach.Instead of optimizing the high-level OpenACC parallelization, which might be possible, for example by manually defining the grid of the GPU threads on which each parallel region is run rather than leaving this task to the compiler, we decided to adopt a low-level model, that is the NVDIA native language CUDA.This choice was driven by the fact that, in future further optimizations, this approach will better lead us to manually tune some parameters directly related to the device.In the CUDA code, we manually allocated the GPU threads where to run each parallel region in a grid of threads blocks, each one customized to match the GPU architecture and the topology of the problem to solve.
Whereas the high-level OpenACC porting implied a minimal re-design of the application, ideal for beginner users (Cesare et al., 2020;Aldinucci et al., 2021), at the expense of a possible performance loss, the low-level parallelization with CUDA required a substantial re-engineering of the code structure.This can be seen from the left, middle, and right columns of Algorithms 2 and 3, that represent the parallelization of the aprod 1 and 2 functions with OpenMP, OpenACC, and CUDA, respectively.Comparing the left and the middle columns, we can see that the structure of the OpenMP and of the OpenACC aprod 1 and 2 functions, both parallelized through high-level directives, are very similar, whereas the right columns of the same algorithms show that the structure of the CUDA aprod functions is different.
In the below sections, we detail the parallelization of the CUDA code on multiple GPUs (Section 4.1), the definition of the CUDA kernels for the aprod 1 and 2 functions (Section 4.2), the GPU porting of regions that in the Ope-nACC code were still running on the CPU (Section 4.3), the management of the data-transfers between the host and the device (Section 4.4), and the compilation of the application (Section 4.5).

Multi-GPU parallelization
As the OpenACC code (Cesare et al., 2022c(Cesare et al., , 2021(Cesare et al., , 2022b)), the CUDA code runs on multiple GPUs, according to the number of MPI processes set at runtime.Specifically, the MPI processes are scheduled on the GPUs of the node in a round-robin fashion.This operation is performed with the commands at lines 4 and 5 of Algorithm 1 highlighted in gray.The optimal configuration to run the code is to set the number of MPI processes per node to the number of the GPUs of the node (4 on M100), since it allows to obtain the best performance by exploiting all the   2a) and of the CUDA code (Figure 2b) parallelized on 4 MPI processes in one node of M100, for a system that occupies 50 GB of memory.The two outputs show a zoom-in of a single iteration of the LSQR algorithm, better highlighted with a large transparent light green box, and refer to one of the 4 MPI processes.Within the light green box of each panel, the blue regions represent the sections of the code parallelized on the GPU, the green and the purple/red regions illustrate the H2D and D2H data movements (very small in panel 2b), and the blank gaps between different regions are related to the sections of code still running on the CPU.The time scale above each panel shows the absolute time from the beginning of the programs execution and the small yellow rectangles at the bottom-right corner of each transparent light green box shows the iteration time (1.347s and 0.289359 s for the OpenACC and the CUDA codes, respectively).
GPUs of the node and simultaneously employing the minimal number of MPI resources, as also shown in Section 7.1 of Cesare et al. 2022b.

CUDA kernels definition in aprod 1 and aprod 2 functions
The right columns of Algorithms 2 and 3 show, in boldface, the definition of the CUDA kernels for the astrometric, the attitude, the instrumental, and the global sections of the aprod 1 and 2 functions (lines c.1-55, in Algorithm 2, and lines c.1-41 , in Algorithm 3) and their call in the main scope of the program (lines c.63-68, in Algorithm 2, and lines c.51-59, in Algorithm 3).The index i = blockIdx.x*blockDim.x+ threadIdx.xdefined within the kernels, is the global index of the GPU thread within the grid of blocks of threads, where blockIdx.x is the index of the block inside the grid, blockDim.x is the size of the block in threads unit, and threadIdx.xis the thread index local to each block.The arrays involved in the calculations in the GPU kernels, such as the dense system matrix A d , the solution array x, and the known terms array b, have to be first allocated and then copied on the GPU.In Algorithm 1, only the device allocation of A d is highlighted gray, as an example (line 6), whereas all the H2D and D2H copies are reported in gray.The arrays allocated on the device are identified with the "dev" subscript, as we can see in the kernels in the left columns of Algorithms 2 and 3.
Comparing the right and the middle columns of Algorithms 2 and 3, we can see that the content of the CUDA kernels that compute the different sections of the system, except for the global part of the aprod 2 function, are equivalent to the correspondent parts in the OpenACC code, which are directly defined in the main scope of the program within the for loops iterating on the number of observations assigned to each MPI process pid, parallelized with the #pragma acc loop directive.
To parallelize the for loops iterating on the observations assigned to each MPI process, from observation 0 to observation N [pid], in each section of the system, the index of the GPU thread was directly mapped to the index of the observation.In this way, each thread can independently perform the product b = A d × x, in the aprod 1 kernels, and x = A T d × b, in the aprod 2 kernels.In these CUDA kernels, the for loop syntax disappears and it is replaced by the if-condition i < N [pid] (lines c.4, c.17, c.34, and c.46 of Algorithm 2, and lines c.4, c.15, and c.30 of Algorithm 3), to avoid the thread index i to point to a memory address beyond the size of the product arrays.
It is essential to define, in each kernel, the hierarchy of the GPU threads to best match the GPU architecture and the topology of the problem to solve, which implies an efficient exploitation of the hardware of the device and results in an optimal performance.In our case, the topology of the problem is one-dimensional, since the product arrays are 1D.The grid of threads can be defined in a Cartesian coordinate space set by the x, y, and z axes.The three directions are not equivalent to each other: specifically, along the x direction it is possible to allocate more threads than along the y and the z directions.In particular, on a V100 GPU we can allocate ∼2×10 9 threads along the x direction and ∼6.5 × 10 4 threads along the y and z directions.We thus defined the grids of threads along the x direction, as identified by the .xspecification (lines c.3, c.16, c.33, and c.45 of Algorithm 2, and lines c.3, c.14, and c.29 of Algorithm 3).
As in the OpenACC code, in the aprod 2 kernels the operations x += A T d ×b are performed atomically.In CUDA, the atomic operation is performed with the atomicAdd function, that takes as first argument the memory address of the element of the x array where the result is cumulated and as second argument the quantity that has to be cumulated (lines c.8, c.20, and c.34 of Algorithm 3).
Performing different tests, we verified that defining more kernels allows to save the ∼10-30% of the computation time compared to perform more operations in the same kernel.For this reason, we parallelized the four sections of the system both in the aprod 1 and in the aprod 2 on more kernels, differently from the aprod 2 in the OpenACC code, which was defined within a single parallel region.Moreover, we also split the calculation of the attitude section, both in the aprod 1 and 2 regions, in three kernels, one for each attitude axis.In Algorithms 2 and 3, we only report the attitude kernel for axis 0 since the kernels for the other two axes are equivalent.
The parallelization of the global part of the aprod 2 was less trivial than the other three parts.Looking at the OpenACC column of Algorithm 3, at line a.26, we can see that the index of the element of the x array where the result of the atomic operation is cumulated does not depend on the index of the observation i, differently from the other three sections (lines a.11,a.18,a.22,c.8,c.20,and c.34 of Algorithm 3).This might cause a bottleneck in this point of the code since there is no parallelism over the GPU threads, whereas, in the astrometric, attitude, and instrumental parts of the aprod 2, the access to the element of the x array where the result is cumulated occurs in parallel for each thread i matched to the observation i.
So far, for scientific purposes of the current production, we did not derive the γ PPN parameter and, thus, this section does not represent a bottleneck.However, the γ parameter will be calculated in upcoming runs to test General Relativity, and we cannot exclude that future astrometric models will have more global parameters, which makes necessary to properly parallelize this region of code.To verify how leaving the atomic operation in the aprod 2 global part would affect the performance of the code, we ran a 6 GB and a 50 GB system with 1 and 5 global parameters.Even with one global parameter, the computation of the aprod 2 global section dominates over the other sections.We, thus, decided to rearrange the parallelization of this part.We removed the atomic operation and we implemented in CUDA the sum at line a.26 of Algorithm 3 with a parallel reduction, which also exploits the GPU shared memory.This reduce sum is organized in two kernels, the former parallelized on a certain number of blocks, each of which computes a partial sum saved in the array x dev,sum (line c.37, Algorithm 3), and the latter parallelized on a single block of threads, which combines all the partial results in x dev (line c.39, Algorithm 3).Adopting this new implementation, we obtain a ∼20x speedup for the global part of the aprod 2 function and a speedup of [1.5, 3]x for the entire aprod 2 region.A deeper description of this implementation for the aprod 2 global part will be object of a future work.
The grid of threads on which the kernels are parallelized, except for the aprod 2 global kernels, are set through the gridDim and blockDim vectors, defined with the dim3 integer vector type.The gridDim vector contains three elements, i.e., the number of blocks of threads in the grid along the x, y, and z directions.Since we defined a 1D grid, we have 1 block along the y and z directions.Instead, the blockDim vector contains the number of threads in each block again along the x, y, and z directions.As in gridDim, the second and the third element of blockDim are set to 1.
On a V100 GPU, the maximum number of threads per block is 1024.We set the number of threads per block, T W , to 1024, since it allows to obtain the best performance.Since the regions that we parallelized with the CUDA kernels correspond to the for loops that iterate from observation 0 to observation N [pid] and that the index i of the observation is directly mapped to the index i of the GPU thread, the number of blocks should be N [pid]/TW to properly fit the problem.If N [pid] were an exact multiple of TW, this would result in a grid of N [pid] threads.Yet, this is not necessary the case, and to avoid defining a grid with less threads than required, the number of blocks is set to N bl = (N [pid] − 1)/TW + 1.Since the total number of threads N th = N bl × TW = N [pid] + TW − 1 > N [pid], the control condition ifi < n defined in the kernels (lines c.4, c.17, c.34, and c.46 of Algorithm 2 and lines c.4, c.15, and c.30 of Algorithm 3) is necessary to avoid memory overflows.The scalar n coincides with N [pid], as specified by the parameter N [pid] passed to the kernel at lines c.63-c.68 of Algorithm 2 and at lines c.51-c.55 of Algorithm 3.For the aprod 2 global kernels, we defined a different grid (lines c.43 and c.44 of Algorithm 3), where the number of threads per block is always set to 1024 and the number of blocks employed for the reduction operation is 24 × 16, empirically obtained such that it provided the best performance.
In the kernels call within the main scope of the program, the parameters passed in the angle brackets are the number of blocks and of threads per block.In the aprod 2, two additional arguments are passed within the angle brackets of the astrometric, the attitude, and the instrumental kernels.The third parameter sets the amount of the GPU shared memory to be used by the kernel, in this case 0 bytes, and the fourth argument is the identifier of the execution queue, called stream, on the GPU.This allows to execute these kernels asynchronously, which is possible without imparing the code correctness due to the atomic operations.The overlapping execution regions between the different aprod 2 kernels represent a very minor fraction of the kernels execution, since the number of threads that can concurrently run on the GPU is limited to the number of GPU cores, much smaller than the number of threads in the grid.Yet, the asyncronous computation is useful to reduce the latencies between the successive kernels calls, essential when a large number of iterations is required to reach the convergence of the LSQR algorithm.
After the call of all the CUDA kernels of the aprod 1 and aprod 2 regions, a cudaDeviceSynchronize() barrier is set to wait all the kernels to end their computation (line c.70 of Algorithm 2 and line c.60 of Algorithm 3).This is necessary since, as soon as a CUDA kernel is called, the control returns immediately to the host and the CPU operations defined immediately after the kernels calls, i.e., the MPI reduction operations that combine the partial results obtained from the aprod 1 and aprod 2 (line c.71 of Algorithm 2 and line c.61 of Algorithm 3), would run concurrently to the calculations performed by the kernels.
Figure 2b shows the output of the NVIDIA Nsight System profiler for a 50 GB run of the CUDA code correspondent to the one of Figure 2a for the OpenACC code, i.e., parallelized on 4 MPI processes in one node of M100.As for Figure 2a, the top time scale refers to the absolute time from the start of the program execution and the value in the yellow rectangle shows the time for the illustrated iteration (0.289359 s).Comparing the outputs of Figures 2a and 2b, the aprod 1 and 2 regions computed with the CUDA code present a 6.4x and 1.6x speedup over the same sections computed with the OpenACC application, respectively, correspondent to a speedup of 5.1x and 5.8x over the OpenMP code for an analogous run (Cesare et al., 2022c,b).Considering one entire iteration, the CUDA code accelerates of ∼5x over the OpenACC code and of ∼7x over the OpenMP code.However, the speedup increases with the memory occupied by the system and with the resources employed for the parallelization, as better illustrated in Section 5.

GPU porting of the CPU regions of the code
Besides optimizing the parallelization of the aprod 1 and aprod 2 regions, we also ported to the GPU other sections of code that in the OpenACC version were running on the CPU.Looking at Figure 2a, we can clearly see that many code regions were still running on the CPU (blank gaps).
First of all, we ported to the GPU the computation of the constraints equations both for the aprod 1 and the aprod 2 functions.With this porting, the computation time of these regions remains basically unaffected, since their calculation was already very fast (∼10 −4 s) on the CPU.However, porting these regions is essential to reduce the H2D and D2H data copies, since it avoids to entirely copy the b and the x arrays back to the host to perform the same calculations on the CPU.
Second of all, we ported to the GPU the calculation of the quantity to be compared with the tolerance of 10 −16 that determines the convergence of the LSQR algorithm.This quantity depends on the norms, β and α, of the b and the x arrays.In the AVU-GSR code, the norm of these two arrays is calculated, in each MPI process, by square summing the array elements divided by the array maximum local to the MPI process and by multiplying this normalized squared sum by this maximum: where a is either b or x.Then, the global norm of the array, |a|, is reduced among the MPI processes and sent back to each MPI process.This method to compute the norm is adopted not to loose in numerical precision.In the MPI + OpenMP and MPI + OpenACC codes, this norm is computed on the CPU with the cblas dnrm2 function of the cblas libraries (Galassi et al., 2018) 5 .In the MPI + CUDA version, we ported to the GPU the calculation of this norm by computing the local maximum and the squared and scaled sum of the array elements with a parallel reduce operation.With this porting, the computation of β and α accelerates of ∼35x over the CPU implementation.
By porting these calculations to the GPU, the time fraction of one iteration due to CPU computation reduces from ∼15% (see Section 3.2), to ∼3%, and the time fraction due to GPU computation increases from ∼70% (see Section 3.2) to ∼90%.This can be visually seen from Figure 2b, where the blank regions of the profiler are drastically reduced compared to Figure 2a.

H2D and D2H data transfers
In the OpenACC code, we copied, at each iteration, both the entire b and x arrays, from the host to the device before the beginning of the aprod 1 and aprod 2 functions, and from the device to the host, after the end of the same functions.The b and the x arrays only represent the 5% of the total memory occupied by the system of equations (Cesare et al., 2022b) and consequently the time fraction of one iteration due to data copies was anyway subdominant compared to the time fraction due to GPU computation (see Section 3.2 and Figure 2a).However, some of these copies were unnecessary and they could be further reduced.
In the CUDA code, the first copy of the entire b and x arrays on the device is performed before the beginning of the LSQR cycle (lines 11 and 16 of Algorithm 1).Since the aprod 1 only modifies the b array, we only copy back to the host this array after the execution of the aprod 1, necessary operation since the b array has to be reduced among the MPI processes.In fact, only the constraints part of the b array has to be reduced among the MPI processes (see Section 2).For this reason, we only copy back to the host the final portion of the b array correspondent to the constraints part (see "length(b Constraints )" in the cudaMemcpy commands at lines 19 and 21 of Algorithm 1), which represents a minor fraction of the entire array.The same portion of the b array is again copied to the device after the reduction operation.
For the same reason, since the aprod 2 only modifies the x array, the x array alone is copied back to the host after the execution of the aprod 2. After the copy, the x array is reduced among the MPI processed on the host and then is again copied to the device.
Rearranging the data copies in this way, the time fraction of one iteration due to data copies reduces from ∼15%, in the OpenACC code, to ∼3%, in the CUDA code (see Section 3.2 and Figure 2).The data copies in the CUDA code are highlighted in bold gray in Algorithm 1.

Compilation
To compile the MPI + CUDA code, written both in C and C++, we wrote a Makefile and we employed the nvcc CUDA compiler driver for the CUDA release 11.3 and the version 21.5-0 of the nvc++ compiler.We compile with the -arch=sm 70 option to target the Volta architecture of the GPU, present on M100.

Performance tests
The MPI + OpenMP code has been in production since 2014 and it has run on all the Tier0 systems of CINECA.It is currently running on M100, which has 980 compute nodes having the following features: 1. 2 sockets of 16 physical cores each, of the type IBM POWER9 AC922, with a processor speed of 3.1 GHz.Each physical core corresponds to 4 virtual cores, with a total of 128 (2 × 16 × 4) virtual cores per node; 2. 4 GPUs of the type NVIDIA Volta V100, with a memory of 16 GB each, connected with Nvlink 2.0; 3. 256 GB of RAM.
As shown by different runs, such as the performance tests illustrated in Cesare et al. (2022b), the MPI + OpenMP code runs in its optimal configuration when parallelized on 16 MPI processes per node and 2 OpenMP threads per MPI process.For a typical run for the production occupying a memory of 340 GB, parallelized on 2 nodes on 16 MPI processes + 2 OpenMP threads per node, and with a number of observations and of stars equal to N obs = 1.8 × 10 9 and N stars = 8.4 × 10 6 , respectively, we achieve a convergence after ∼141000 iterations with an iteration time of ∼4.23 s, which results in a total elapsed time of t e,OMP ≃ 166 hours, namely about one week.However, this elapsed time is obtained for a system having a number of observations ∼2 orders of magnitude smaller than the number of observations expected for the final Gaia dataset (∼10 11 ).When we will have to deal with such a large dataset, that will occupy ∼10-100 TB of memory, the time-to-solution would become ∼30-300 times larger, which will result in a far from optimal production.To manage these data sizes, a properly accelerated code is needed.For this purpose, we compared the performance of the MPI + CUDA and of the MPI + OpenMP codes on M100 for a set of systems with increasing size, measuring the acceleration factor of the CUDA code over the OpenMP code to verify whether the CUDA code was worth to be put in production.
We ran the OpenMP and the CUDA applications for different input datasets provided by the Data Processing Center of Turin (DPCT), which is supervised by the Aerospace Logistics Technology Engineering Company (AL-TEC) in collaboration with the Astrophysics Observatory of Turin (INAF-OATO).These inputs are real Gaia datasets and they are employed for the production of the OpenMP code.The datasets have different sizes, occupying a memory of 40, 100, 300, and 350 GB, and each of them only computes some sections of the complete model.The 40 GB and 300 GB systems solve the attitude and the instrumental parts, the 100 GB system solves the astrometric part, and the 350 GB system solves the astrometric, the attitude, and the instrumental parts.As anticipated in Section 4, no system solves the global part.
Figure 3 shows the ratio between the average times of one LSQR iteration of the OpenMP and the CUDA codes, as a function of the system size.We ran the OpenMP and the CUDA codes in their optimal configurations (16 MPI processes + 2 OpenMP threads per node, for the OpenMP code, and 4 MPI processes per node for the CUDA code, see Section 4.1).We report below each point in Figure 3 the number of GPUs employed by the CUDA code, coincident with the number of MPI tasks, and of physical cores employed by the OpenMP code.Each code runs on the minimum number of nodes needed.For the same amount of memory, the CUDA code might require more nodes than the OpenMP code, since the memory of the four GPUs in each node is smaller than the RAM memory of the node (64 GB vs 256 GB).For example, the 100 GB system is paralleliized on 8 GPUs, i.e., 2 nodes, for the CUDA code, and on 32 cores, i.e., 1 node, for the OpenMP code.#pragma omp for o.7 for j 2 ← 0 to N ParAxis do o.17 #pragma acc loop a.7 a.14 #pragma acc parallel private(sum) a.15 { a.16 #pragma acc loop a.17 a.27 #pragma acc parallel private(sum) a.28 { a.29 #pragma acc loop a.30  for j ← 0 to N Astro do a.10 #pragma acc atomic for j 1 ← 0 to N Axes do a.14 for j 2 ← 0 to N ParAxis do a.17 #pragma acc atomic The speedup of the CUDA code over the OpenMP code increases with both the system size and the number of employed GPU resources.From the first to the second point, correspondent to the 40 GB and the 100 GB systems, the speedup doubles, passing from ∼5 to ∼10.This might be explained by the fact that whereas the OpenMP code is always parallelized on the same amount of resources, in the CUDA code the number of resources doubles in the second run compared to the first run.Instead, considering the last two points, correspondent to the 300 GB and the 350 GB systems, the speedup does not substantially increase, passing from ∼12 to ∼14.In this case, the two systems are always parallelized on the same amount of resources.The slighly increase of the speedup might be justified by an increase of the GPU occupancy in the 350 GB system compared to the 300 GB system.In the 300 GB system, the memory assigned per MPI process, and thus per GPU, is of ∼9.5 GB, which implies a GPU occupancy of ∼60%.Instead, in the 350 GB system, the memory assigned per MPI process is of ∼13 GB, which implies a GPU occupancy of ∼80%.Cesare et al. (2022b) show in Figures 4b and 5a a strong scaling test for the OpenMP code up to 16 nodes: the strong scaling curve already departs from the ideal linear speedup, tending to a plateau, after ∼3 nodes (96 physical cores).This means that, for a fixed amount of memory, the performance of the OpenMP code does not substantially improve if we continue to increase the number of physical cores on which it runs.The same figures show that the OpenACC strong scaling behavior is similar to the OpenMP one and that the ratio between the OpenMP and OpenACC mean iteration times is nearly constant and around 1.4.Given that in the CUDA code, as in the Ope-nACC code, the MPI tasks are assigned to the GPUs of the node in a round-robin fashion, we expect the strong scaling curve to be also similar for the CUDA code.Furthermore, the CUDA code is much more performant than the OpenACC code and, thus, we expect it to accelerate over the OpenMP code even if the latter is run on a larger amount of physical cores.
The maximum speedup of ∼14x is obtained for the 350 GB system.With this speedup, the CUDA code is more than 9x faster than the OpenACC code.Given the trend observed in Figure 3, we expect the speedup to continue to increase for systems of larger sizes.This is a remarkable result in perspective of the final dataset of Gaia which makes this implementation of the CUDA code a good candidate to be put in production.However, before proceeding, we performed a further test, detailed in the following section, to verify whether the rearrangement of the code required for the CUDA parallelization had impaired the correctness of the application.

Numerical stability
To check if the CUDA parallelization was correctly implemented, we compared the solutions of the systems of equations considered in the previous section and the errors on the solutions, obtained with the OpenMP and the CUDA codes.Figure 4a plots the solution of the astrometric section of the 350 GB system found with the CUDA code against the solution of the same system found with the OpenMP code. Figure 4b shows the same for the errors on the solutions.The one-to-one relation (black dashed line) is shown as a reference.We do not illustrate the analogous plots for the other sections of the same system and for the other systems, since they show equivalent outputs.
The scatter plots in Figure 4 show that the CUDA and the OpenMP solutions and errors tightly distribute along the one-to-one relation, which suggests an agreement between the two couples of quantities.However, the figures only show a qualitative result, deduced by a visual inspection.To better quantify the consistency between the solutions and the errors found from the two codes, we calculate the average and the standard deviation of their differences, as reported in Table 1.Table 1 also reports the same quantities for the attitude and instrumental sections of the 350 GB system and for the other systems.We can see that the average differences, both for the solutions (d x ) and for the errors (d σ ), are very close to 0, spanning a range, in absolute value, from 3.1 × 10 −23 rad6 to 1.5 × 10 −20 rad, for the solutions, and from 3.1 × 10 −14 rad to 3.2 × 10 −10 rad, for the standard errors.The standard deviations of the differences are, in every case, larger than the averages, which implies the agreement of the differences with zero.Moreover, the average differences, for both the solutions and the errors, are sometimes positive and sometimes negative, which suggests the absence of systematic errors.
To better evaluate the agreement between the CUDA and the OpenMP solutions of every system, we also compared their differences with their errors.For each solution and error point, we computed the ratio: where x i,CUDA and x i,OpenMP are the solution points found from the CUDA and the OpenMP codes, and σ i,CUDA and σ i,OpenM P are their errors.For every section of all the systems, the ratio q is smaller than 1, which means that the solutions found from the two codes are always consistent within 1σ.These results show that the CUDA and the OpenMP solutions and errors are in agreement with each other for systems of increasing size and prove the numerical stability of the CUDA code.
Besides checking the consistency between the results obtained with the two codes, we also wanted to verify if the solutions were obtained with the accuracy required by the Gaia mission (∼[10, 100] µarcsec for the parallaxes and  the positions and ∼[10, 100] µarcsec year −1 for the proper motions, see Section 1) to achieve a high precision astrometry, in order to properly investigate, e.g., the kinematics and the dynamics of the Galaxy.We converted the uncertainties on the solutions (σ) from radians to arcseconds with the relation: In the 100 GB and 350 GB runs, which compute the astrometric section of the system, the average uncertainties on the astrometric parameters along with their standard deviations are of σ = (4.0× 10 −5 ± 5.5 × 10 −3 ) arcsec and σ = (2.1 × 10 −5 ± 2.1 × 10 −4 ) arcsec, both for the OpenMP and the CUDA codes, in agreement with the needed precision.For the 100 GB run, nearly 80% of the astrometric solution points have uncertainties below 10 µarcsec, and more than 97% and 99% of the astrometric solution points have uncertainties below 100 µarcsec and 500 µarcsec.For the astrometric part of the 350 GB run, the ∼99% of the solution points already have uncertainties below 100 µarcsec.Also the attitude and instrumental parameters are generally obtained with a compatible accuracy.
Given these results, we put the CUDA code in production in Q2 2022.The CUDA solver was also put on a proprietary GitLab repository of CINECA and its copyright is held by INAF.

Conclusions and future works
We ported to a GPU environment with the CUDA programming language the AVU-GSR parallel solver, developed for the ESA Gaia mission and originally parallelized on the CPU with a hybrid MPI + OpenMP model.The code solves a system of linear equations with the iterative LSQR algorithm to find the astrometric parameters of ∼10 8 stars in the Milky Way, the attitude and the instrumental settings of the Gaia spacecraft, and the global parameter γ of the PPN formalism.To iteratively find the solution up to the convergence of the algorithm defined in the least square sense, the LSQR calls, at each step, the aprod function in its modes 1 and 2, which provide an iterative estimate for the known terms and the solution arrays, respectively.
The porting presented in this paper is the result of an optimization of a previous GPU porting of this application, performed with the high-level language OpenACC.The OpenACC code showed a moderate speedup of ∼1.5x over the OpenMP code.As already pointed out at the beginning of Section 4, further speedups might as well have been obtained with a better optimization of the usage of the OpenACC language.Indeed, the speedup of ∼1.5x refers to a quite basic parallelization with Ope-nACC, where the OpenMP directives were basically replaced by the OpenACC correspondent ones.However, we preferred to adopt the low-level language CUDA for the new porting since it allows to better match the architecture of the device and, thus, to possibly achieve larger performances.On the other hand, this reduces the code portability, since the CUDA parallelization is architecturedependent.However, since the Gaia mission is expected to end in the following years and only a further porting of this code on Leonardo supercomputer is expected, we aimed to improve the performance rather than to obtain a larger code portability.
With the CUDA porting, we reorganized the structure of the Gaia AVU-GSR solver, by defining the kernels to parallelize different regions of the code, such as the aprod 1 and 2 functions.In each of the kernels, we manually  4a) of the astrometric section of the 350 GB system and its error (Figure 4b) computed with the CUDA code against the solution and the error of the same system computed with the OpenMP code.The one-to-one relation is shown as a black dashed line, for comparison.
Table 1: Comparison between the CUDA and OpenMP solutions and errors on the solutions for the four considered systems of equations.Column 1: Memory occupied by the system of equations; column 2: section solved for the considered system; column 3: mean and standard deviation of the differences between the solutions of the systems of equations found from the CUDA and the OpenMP codes; column 4: mean and standard deviation of the differences between the errors on the solutions found from the CUDA and the OpenMP codes.The quantities d x and d σ refer to the differences between a CUDA and an OpenMP quantity.

Memory Section
defined the hierarchy of the grid of threads to match as better as possible the GPU architecture and the topology of the problem to solve.We also ported with CUDA other regions of code that in the OpenACC application were still running on the CPU and we reduced the H2D and D2H data copies with respect to the OpenACC code.With these optimizations, the time fraction of one LSQR iteration due to GPU computation rises from ∼70% to ∼90%, and the time fractions due to CPU calculations and data transfers reduces from ∼15% to ∼3%.
Running the CUDA and the OpenMP applications on M100, the CUDA code presents a speedup over the OpenMP code increasing with the system size and with the employed GPU resources.The speedup reaches a maximum of ∼14 for a system occupying 350 GB of memory and is expected to increase for systems of larger sizes and by run-ning the codes on next-generation platforms with GPUs having more memory and streaming multiprocessors, such as the CINECA supercomputer Leonardo.Indeed, the A200 GPUs of Leonardo have 4x more memory and more streaming multiprocessors than the V100 GPUs of M100, which allows to execute more concurrent threads.Since both M100 and Leonardo have 4 GPUs per node, the GPU memory per node on Leonardo is quadrupled with respect to M100.We plan to perform the first tests of the AVU-GSR code on Leonardo in the first half of 2023.
The CUDA code showed great numerical stability, since it provided solutions and uncertainties on the solutions fully consistent within 1σ with the correspondent ones found with the OpenMP code for a set of systems.Moreover, the solutions are obtained with the accuracy of [10,100] µarcsec, as required by the Gaia mission.Given these results, the MPI + CUDA AVU-GSR solver was put in production on M100.This is a fundamental achievement since it provides an optimal production for the AVU-GSR pipeline, allowing to obtain important data for scientific analyses, such as the study of the Milky Way formation and evolution, in reduced timescales.
The increasing trend of the speedup with the system size is a very important result toward the scientific purposes of the upcoming Data Releases of the Gaia mission, from which TBs of data will be produced up to an expected final dataset of ∼10-100 TB.In perspective of these pre-Exascale data products, we will continue the optimization process of the AVU-GSR code, for example by porting to the GPU further sections of code, and the consequent investigation of the performance, scaling, and numerical stability of the code, for systems with an increasing size, up to the sizes expected for the final Gaia dataset.These are some of the targets of a two-years project already underway and funded by INAF, an INAF Mini Grant, of which the author VC is the PI and which is performed in collaboration with Prof. Marco Aldinucci of the University of Turin.For this future analysis, we will use Leonardo, to better investigate the behavior of the AVU-GSR code on a next-generation pre-Exascale infrastructure and in perspective of a final porting of this code on Leonardo.
Besides allowing a better performance, this novel arrangement of the hardware, with hosts less performant than the accelerators and GPUs with a larger memory and number of streaming multiprocessors, such as on Leonardo, will imply a low energy consumption for the size of the problems that will need to be computed by HPC GPUoriented applications.When a code such as the AVU-GSR solver is ported to the GPU resulting in a ≳ 14x speedup over the CPU version, besides obtaining results in a minor time, we also expect to save a substantial amount of energy.This is not obvious, since H2D and D2H data transfers, not present in the CPU application, might be rather energy consuming, but this might be compensated by the high speedup.In a future work, we aim to compare the energy consumption of the CUDA and the OpenMP codes by running systems of increasing size, up to ∼10-100 TB, both on M100 and on Leonardo, to verify whether the CUDA code is in fact "greener" than the OpenMP code and whether running on Leonardo allows to save even more energy than on M100.Since the GPU memory per node on Leonardo is 4x the GPU memory per node on M100, a quarter of the resources could be required on Leonardo with respect to M100 to run a system of equal size.This might imply a minor energy consumption on Leonardo compared to M100.Also this analysis is a target of the Mini Grant project.
This research activity has important repercussions in the developement, toward a (pre-)Exascale calculation, of other LSQR-based applications involving the solutions of systems with a high sparsity degree, similarly to the Gaia AVU-GSR solver.The parallelization techniques employed in this code could be adapted and exploited in different contexts that adopt the LSQR, such as the reconstruction of images in radioastronomy (Naghibzadeh and van der Veen, 2017), geophysics (Joulidehsar et al., 2018;Liang et al., 2019a,b), geodesy (Baur and Austen, 2005), medicine (Bin et al., 2020;Guo et al., 2021), and industry (Jaffri et al., 2020) (see Section 1).In conclusion, the continue developement of efficient parallelization techiques is essential to face the increasingly faster production of data in contexts of different nature, going toward the Big Data era.

Algorithm 1 :Figure 1 :
Figure1: Parallelization scheme of the system of equations (Eq. 1) on four MPI processes in a single node of a computer cluster.Left panel: coefficient matrix A. Middle panel: unknowns array x.Right panel: known terms array b.Different colors (yellow, light green, dark green, and blue) refer to different MPI processes or processing elements (PE).The block-diagonal part in the left side of the coefficient matrix illustrates its nonzero astrometric section.In the middle panel, the four square blocks diagonally placed, and labeled as "Astrometric" represent the astrometric part of the solution array, distributed among the MPI processes.Instead, the four dark gray aligned blocks, labeled as "Att+Instr+Glob", represent the attitude, instrumental, and global portions of the solution array, replicated on each MPI process, as written above.At the end of each iteration i, a reduction of the replicated portions of x is performed.

Figure 2 :
Figure 2: Result of the NVIDIA Nsight Systems profiler for a run of the OpenACC code (Figure2a) and of the CUDA code (Figure2b) parallelized on 4 MPI processes in one node of M100, for a system that occupies 50 GB of memory.The two outputs show a zoom-in of a single iteration of the LSQR algorithm, better highlighted with a large transparent light green box, and refer to one of the 4 MPI processes.Within the light green box of each panel, the blue regions represent the sections of the code parallelized on the GPU, the green and the purple/red regions illustrate the H2D and D2H data movements (very small in panel 2b), and the blank gaps between different regions are related to the sections of code still running on the CPU.The time scale above each panel shows the absolute time from the beginning of the programs execution and the small yellow rectangles at the bottom-right corner of each transparent light green box shows the iteration time (1.347s and 0.289359 s for the OpenACC and the CUDA codes, respectively).

Figure 3 :
Figure3: Speedup of the CUDA code over the OpenMP code as a function of the memory occupied by the system.For every point, the number of GPUs employed by the CUDA code and the number of physical cores employed by the OpenMP code is indicated.

Figure 4 :
Figure 4: Solution (Figure4a) of the astrometric section of the 350 GB system and its error (Figure4b) computed with the CUDA code against the solution and the error of the same system computed with the OpenMP code.The one-to-one relation is shown as a black dashed line, for comparison.