Comparison of four different methods for massive MIMO detection and VLSI design

Multiple-input multiple-output (MIMO) technology has proven highly successful. To make massive MIMO practical, the detection processing at the receiver must be simplified. This paper examines four approaches to this goal: symmetric successive over-relaxation (SSOR), Gauss-Seidel (GS), Neumann series-based approximate matrix inversion, and conjugate gradients (CG). The detection performance and VLSI implementations of these methods are compared to assess their effectiveness.


Introduction
MIMO technology has been used effectively in a variety of communication systems, including 4G cellular systems and the wireless LAN standard IEEE 802.11n [1,2]. The massive MIMO system, which differs from traditional MIMO by installing hundreds of antennas at the base station to serve a specific group of users, has great potential for the future of wireless communication [1,3].
However, massive MIMO systems still have drawbacks. The increase in the number of antennas adds enormous computational complexity: data detection is far more demanding than in small-scale MIMO systems. In particular, high-dimensional matrix inversion poses a significant challenge, especially for computationally intensive detection methods such as ML and MMSE. Furthermore, for large-scale matrices, exact inversion techniques such as Cholesky decomposition and LDL decomposition are impractical because of their complexity [3]. Despite the great advantages of massive MIMO systems for future communications, this computational burden limits their wide use. This article covers four methods: symmetric successive over-relaxation (SSOR), Gauss-Seidel (GS), Neumann series-based approximate matrix inversion, and conjugate gradients. All of them significantly reduce the computational complexity while preserving detection accuracy. The SSOR method solves the detection problem by direct iterative calculation, avoiding an explicit matrix inverse [2]. The GS approach uses a diagonal-approximation initial solution and decomposes the matrix without computing an exact inverse; its algorithm is parallelizable, its latency is much lower than that of plain GS, and it achieves higher throughput [1,4]. The Neumann series-based approach offers several large-scale matrix inversion approximations, but is only efficient when the number of series terms is kept to one or two; the design is realized with a fixed-point algorithm [3,5], from which an efficient Gram-and-inverse module can be built.
The conjugate gradients implementation uses far fewer hardware resources, although its throughput is lower than that of conventional methods and its latency is relatively large. It is realized as a reconfigurable VLSI architecture [6].
The remainder of the paper is organized as follows: Sections 2-6 introduce the algorithms and VLSI architectures of the four methods. Finally, conclusions are drawn in Section 7.

System model
A massive MIMO system is considered, with N antennas at the base station (BS) serving K single-antenna users. Typically, N is much greater than K, e.g. N=256 and K=32 [2]. The signal vector y received at the BS can be expressed as

y = Hs + n,

where H is the N x K channel matrix, s is the transmitted signal vector, and n is the additive white Gaussian noise (AWGN) vector whose entries follow CN(0, σ²).
After the channel matrix H is obtained, the estimate of the transmitted signal vector produced by the zero-forcing (ZF) detector can be expressed as

ŝ = (H^H H)^{-1} H^H y = W^{-1} ŷ,

where we define W = H^H H and ŷ = H^H y.
It has been demonstrated that the ZF signal detector can attain nearly optimal results [2]. Nevertheless, it requires the inversion of the matrix W, denoted W^{-1}, which is not a simple task to implement in hardware. The four methods for avoiding or approximating this inversion are introduced in detail below.
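As a point of reference for the four low-complexity methods, the exact ZF detector can be sketched in a few lines of NumPy. This is an illustrative toy with small dimensions and an arbitrary noise level of our choosing; the paper's examples use e.g. N=256, K=32.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 64, 8                       # toy dimensions (the paper uses e.g. N=256, K=32)

# Rayleigh channel, QPSK transmit vector, AWGN
H = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2)
s = rng.choice(np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]), size=K)
n = 0.05 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
y = H @ s + n                      # system model: y = H s + n

W = H.conj().T @ H                 # Gram matrix       W = H^H H
y_mf = H.conj().T @ y              # matched filter    ŷ = H^H y
s_hat = np.linalg.solve(W, y_mf)   # exact ZF estimate ŝ = W^{-1} ŷ

print(np.max(np.abs(s_hat - s)))   # small at this noise level
```

The explicit solve of the K x K system is exactly the step the remaining sections try to avoid or approximate.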

SSOR method
The algorithm part
Many articles describe the SSOR method; the approach of [2] is summarized below. First, the Hermitian positive definite matrix W is decomposed as

W = D + L + L^H,

where D, L, and L^H denote the diagonal component of W, its strictly lower triangular part, and its strictly upper triangular part, respectively.
Then each SSOR iteration consists of two half-iterations. The first half-iteration is identical to the iterative method known as successive over-relaxation (SOR):

ŝ^(i+1/2) = (D + ωL)^{-1} [ω ŷ + ((1 − ω)D − ωL^H) ŝ^(i)],

and the second half-iteration sweeps in the reverse order:

ŝ^(i+1) = (D + ωL^H)^{-1} [ω ŷ + ((1 − ω)D − ωL) ŝ^(i+1/2)],

where i is the iteration index, ŝ^(0) is the initial solution, and ω is the relaxation parameter.
The relaxation parameter ω can be pre-computed from the system dimensions N and K [2]. Regarding the bit error rate (BER) performance in the simulation results, the SSOR method is close to the exact ZF detector when i equals 3 or 4, and even with i equal to 2 it outperforms the Neumann-based detector.
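The two half-iterations above can be sketched as follows; ω = 1.2, the zero initial vector, and the toy dimensions are our choices, not values from [2], and a hardware design would replace the dense solves with forward/backward substitution.

```python
import numpy as np

def ssor_detect(W, y_mf, omega=1.2, iters=3):
    """Solve W s = y_mf with SSOR half-iterations (toy, dense version)."""
    D = np.diag(np.diag(W))
    L = np.tril(W, -1)             # strictly lower triangular part of W
    U = np.triu(W, 1)              # = L^H, since W is Hermitian
    s = np.zeros_like(y_mf)
    for _ in range(iters):
        # forward (SOR) half-iteration
        s = np.linalg.solve(D + omega * L,
                            omega * y_mf + ((1 - omega) * D - omega * U) @ s)
        # backward half-iteration
        s = np.linalg.solve(D + omega * U,
                            omega * y_mf + ((1 - omega) * D - omega * L) @ s)
    return s

# toy problem: W = H^H H is Hermitian positive definite
rng = np.random.default_rng(1)
H = (rng.standard_normal((64, 8)) + 1j * rng.standard_normal((64, 8))) / np.sqrt(2)
W = H.conj().T @ H
y_mf = rng.standard_normal(8) + 1j * rng.standard_normal(8)

s_ssor = ssor_detect(W, y_mf)
s_exact = np.linalg.solve(W, y_mf)
print(np.linalg.norm(s_ssor - s_exact) / np.linalg.norm(s_exact))
```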

GS method
The algorithm part
There are many articles about the GS method; the approach of [5] is described below.
The first step is the same as in the SSOR method: the Hermitian positive definite matrix W is decomposed as

W = D + L + L^H,

where D, L, and L^H denote the diagonal component of W, its strictly lower triangular part, and its strictly upper triangular part, respectively.
Then the Neumann series is used to compute an approximate initial solution ŝ^(0). The inverse W^{-1} can be expanded as

W^{-1} = Σ_{n=0}^{∞} (X^{-1}(X − W))^n X^{-1}.

Letting X = D and keeping only the first two terms gives an approximation of W^{-1}:

W^{-1} ≈ D^{-1} − D^{-1} E D^{-1},
where E is the off-diagonal part of W (E = L + L^H). The initial solution is therefore

ŝ^(0) = (D^{-1} − D^{-1} E D^{-1}) ŷ.

Finally, s is solved through the Gauss-Seidel iteration

ŝ^(i+1) = (D + L)^{-1} (ŷ − L^H ŝ^(i)).

The simulation results show that with only one GS iteration, the performance is close to that of the Cholesky method and better than the Neumann method with 3, 4, or 5 iterations. Even with two GS iterations, the performance is clearly better than the Neumann method with three terms. In the FPGA implementation results, the detection throughput of the Gauss-Seidel method was observed to be lower than that of the Neumann method, but it can be improved by adding more processing units. Hardware efficiency was evaluated using the area-delay product; by this metric, the GS-based detector offers superior hardware efficiency compared to the Neumann-based detector, and it also exhibits better error-rate performance.
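A sketch of the GS detector described above: a two-term Neumann initial solution followed by dense GS sweeps. The dimensions and iteration count are our toy choices, and the dense triangular solve stands in for the substitution a real design would use.

```python
import numpy as np

def gs_detect(W, y_mf, iters=2):
    """Gauss-Seidel detection with a two-term Neumann initial solution (toy)."""
    D = np.diag(np.diag(W))
    Dinv = np.diag(1.0 / np.diag(W))
    E = W - D                      # off-diagonal part of W
    L = np.tril(W, -1)
    U = np.triu(W, 1)              # = L^H for Hermitian W

    # initial solution: s^(0) = (D^{-1} - D^{-1} E D^{-1}) ŷ
    s = (Dinv - Dinv @ E @ Dinv) @ y_mf

    # GS iterations: s^(i+1) = (D + L)^{-1} (ŷ - L^H s^(i))
    for _ in range(iters):
        s = np.linalg.solve(D + L, y_mf - U @ s)
    return s

rng = np.random.default_rng(2)
H = (rng.standard_normal((64, 8)) + 1j * rng.standard_normal((64, 8))) / np.sqrt(2)
W = H.conj().T @ H
y_mf = rng.standard_normal(8) + 1j * rng.standard_normal(8)

s_gs = gs_detect(W, y_mf)
s_exact = np.linalg.solve(W, y_mf)
print(np.linalg.norm(s_gs - s_exact) / np.linalg.norm(s_exact))
```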

Neumann series-based approximate matrix inversion
The algorithm part
Numerous scholarly works have tackled the Neumann series approach to matrix inversion; the following section focuses on the method of [3].
First, the regularized matrix A is formed as

A = H^H H + N0 I,

where the noise entries are independent zero-mean Gaussian variables with variance N0, and d_i = |h_i|² (for 1 ≤ i ≤ N).
The estimate of the vector s can then be calculated as ŝ = A^{-1} ŷ with ŷ = H^H y. The objective is to find a matrix X that approximates the inverse of A through the Neumann series. Specifically, if X satisfies

lim_{n→∞} (I − X^{-1} A)^n = 0,

then A^{-1} can be expanded as

A^{-1} = Σ_{n=0}^{∞} (X^{-1}(X − A))^n X^{-1}.

Truncating the series at k terms yields a k-term approximation of A^{-1}:

A_k^{-1} = Σ_{n=0}^{k-1} (X^{-1}(X − A))^n X^{-1}.

Since the matrices W and A are both diagonally dominant, A can be decomposed into a diagonal matrix D and an off-diagonal matrix E, i.e. A = D + E. Choosing X = D in the expression above gives

A_k^{-1} = Σ_{n=0}^{k-1} (−D^{-1} E)^n D^{-1}.

The simulation results show that for k equal to 2, the performance of the approximate method is close to that of Cholesky decomposition, and for k equal to 3 or 4 it is essentially the same. The figure displays the linear detector structure of the approximate matrix inversion method proposed in this study, which comprises two key components: the pre-processing module, which generates the initial matrices M and N, and the iterative computation module, which computes the approximate inverse A_k^{-1}. In contrast to the Cholesky-based structure, this design has benefits in both area and frequency. Moreover, the hardware efficiency of this architecture increases as the size of M increases.
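The truncated series with X = D can be sketched directly; the regularization constant N0 and the dimensions are our toy values, chosen only so that the series converges visibly.

```python
import numpy as np

def neumann_inverse(A, k):
    """k-term Neumann approximation of A^{-1} with X = D (toy, dense version)."""
    Dinv = np.diag(1.0 / np.diag(A))
    E = A - np.diag(np.diag(A))    # off-diagonal part of A
    M = -Dinv @ E                  # iteration matrix -D^{-1} E
    term = Dinv.copy()
    Ainv_k = Dinv.copy()
    for _ in range(k - 1):         # accumulate sum_{n=0}^{k-1} (-D^{-1}E)^n D^{-1}
        term = M @ term
        Ainv_k = Ainv_k + term
    return Ainv_k

rng = np.random.default_rng(3)
N, K, N0 = 64, 4, 1.0
H = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2)
A = H.conj().T @ H + N0 * np.eye(K)

err = [np.linalg.norm(neumann_inverse(A, k) - np.linalg.inv(A)) for k in (1, 2, 4)]
print(err)   # approximation error shrinks as more series terms are kept
```

This also illustrates why the hardware design limits itself to one or two terms: each extra term costs a full K x K matrix multiplication.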

Conjugate Gradients method
The algorithm part
There are many articles about the conjugate gradients method; the approach of [4] is detailed below.
The following approach is founded on the conjugate gradient method for solving W ŝ = ŷ. First, the initial approximation ŝ^(0) is set (e.g. to the zero vector), and the residual and direction vectors are initialized as

r^(0) = ŷ − W ŝ^(0),   p^(0) = r^(0),

where p, r, and t are vectors and β is a scalar coefficient. In each iteration k, the vectors t and r and the coefficient β are updated as

t^(k) = W p^(k),
α_k = ||r^(k)||² / ((p^(k))^H t^(k)),
r^(k+1) = r^(k) − α_k t^(k),
β_k = ||r^(k+1)||² / ||r^(k)||²,
p^(k+1) = r^(k+1) + β_k p^(k).

Finally, the estimate is updated as

ŝ^(k+1) = ŝ^(k) + α_k p^(k).

According to the simulation results, the CGLS method performs worse than the Cholesky method for k equal to 1 and 2; however, for k = 3 it performs significantly better than the Neumann method and approaches the performance of the Cholesky method. An architecture with low complexity is introduced in this study, consisting of a reconfigurable array of processing elements, as depicted in the figure. The array can be dynamically reconfigured to perform operations such as matrix-vector multiplication, vector inner products, scaled vector addition/subtraction, and scalar division. To implement the proposed soft-input detection algorithm, a global finite-state machine controls the sequence of operations. Examining the resource utilization of the CGLS-based detector and the Neumann-series detector shows that, although the CGLS detector has lower throughput than the Neumann detector, its hardware efficiency, quantified by the area-delay product, is considerably better.
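The recursions above in compact form; this is the standard complex-valued CG iteration, with the zero initial vector and iteration count being our choices. In exact arithmetic CG converges in at most K steps.

```python
import numpy as np

def cg_detect(W, y_mf, iters):
    """Conjugate gradient solution of W s = ŷ (toy, complex-valued)."""
    s = np.zeros_like(y_mf)
    r = y_mf.copy()                # r^(0) = ŷ - W s^(0), with s^(0) = 0
    p = r.copy()
    rs = np.real(r.conj() @ r)     # ||r^(k)||²
    for _ in range(iters):
        t = W @ p                  # t^(k) = W p^(k)
        alpha = rs / np.real(p.conj() @ t)
        s = s + alpha * p          # ŝ^(k+1) = ŝ^(k) + α_k p^(k)
        r = r - alpha * t          # r^(k+1) = r^(k) - α_k t^(k)
        rs_new = np.real(r.conj() @ r)
        beta = rs_new / rs         # β_k = ||r^(k+1)||² / ||r^(k)||²
        p = r + beta * p
        rs = rs_new
    return s

rng = np.random.default_rng(4)
H = (rng.standard_normal((64, 8)) + 1j * rng.standard_normal((64, 8))) / np.sqrt(2)
W = H.conj().T @ H
y_mf = rng.standard_normal(8) + 1j * rng.standard_normal(8)

s_cg = cg_detect(W, y_mf, iters=8)   # K = 8 steps: essentially exact
s_exact = np.linalg.solve(W, y_mf)
print(np.max(np.abs(s_cg - s_exact)))
```

Note that the loop body uses only matrix-vector products, inner products, and scaled vector additions, which is exactly the small operation set the reconfigurable processing-element array needs to support.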

Conclusion
In summary, this paper has introduced four methods, SSOR, GS, Neumann series-based approximate matrix inversion, and conjugate gradients, all of which can significantly reduce the computational complexity of massive MIMO detection. The SSOR method avoids computing the inverse matrix through direct iterative calculation. The GS method decomposes the original matrix, and its algorithm is parallelizable, giving smaller delay and larger throughput. Although the Neumann series-based approximate matrix inversion can effectively reduce the computational complexity, it has limitations in use, since only a small number of series terms is practical. The conjugate gradients method saves hardware resources thanks to its reconfigurable VLSI structure.

Figure 1. A comparison of the bit error rate (BER) performance between the signal detector based on the symmetric successive over-relaxation (SSOR) method and the detector based on the Neumann series approximation method [2].

Figure 3. The structure of the soft-output data detection method based on the Gauss-Seidel (GS) algorithm [7]. The figure illustrates the efficient design comprising five units: (1) the Preprocessing Unit, (2) the Neumann Series Expansion Unit, (3) the Gauss-Seidel Method Unit, (4) the SINR Computation Unit, and (5) the LLR Computation Unit. The Preprocessing Unit is