Attention-based deep learning approach for CSI feedback under 5G TDL channel

In 5G communication systems, accurate channel state information (CSI) is indispensable for signal detection and modulation at the base station. However, frequent CSI feedback from users incurs excessive system overhead. To tackle this challenge, this paper proposes HCNet, a novel deep learning framework that combines an attention mechanism with an autoencoder to compress and reconstruct high-dimensional time-varying CSI matrices in an end-to-end fashion. The encoder incorporates a self-attention module to explicitly capture global dependencies within the CSI matrix, while the decoder adopts gated recurrent unit (GRU) networks to fully exploit inter-feature correlations and redundancy. Performance is evaluated through simulations on datasets conforming to current 5G time-varying TDL channel models. Results demonstrate superior performance over existing deep learning based feedback networks: the proposed framework reduces the normalized mean square error of CSI reconstruction by at least 4 dB under various compression ratios, confirming the effectiveness of the attention-enhanced autoencoder structure for compressive CSI sensing and feedback in practical dynamic communication systems.


Introduction
With the advent of 5th generation (5G) mobile communication technology, transmission rate and efficiency have improved significantly. Among the enabling technologies, massive multiple-input multiple-output (MIMO) and orthogonal frequency division multiplexing (OFDM) are two key enablers for high-throughput, reliable wireless connectivity [1]. By deploying a large number of antennas and forming narrow beams, massive MIMO realizes spatial multiplexing and boosts system capacity and service quality [2]. OFDM modulates data over multiple orthogonal sub-carriers and thus increases the transmission rate via frequency multiplexing.
In current 5G frequency division duplex systems, the base station transmits downlink signals while receiving channel state information (CSI) feedback from users. With the obtained frequency-domain CSI matrices, the base station acquires accurate channel state information for each antenna element and sub-carrier, enabling precise signal detection and modulation. However, as the numbers of MIMO antennas and OFDM sub-carriers grow, CSI feedback incurs excessive overhead in the uplink. Conventional codebook-based compression methods cannot meet the real-time requirements posed by high mobility and massive MIMO. Designing an efficient CSI feedback scheme is therefore an urgent problem.
Compressive sensing provides a new perspective for significantly reducing CSI feedback by exploiting channel sparsity. With greedy algorithms, the downlink channel can be reconstructed from a small number of measurements [3]. However, such linear compressive sensing relies on hand-crafted priors and provides limited reconstruction accuracy.
In recent years, deep learning has opened new possibilities for CSI compression and feedback [4][5][6]. Through end-to-end training, deep neural networks can learn the sophisticated non-linear patterns underlying CSI matrices. Existing works mainly focus on improved autoencoder architectures and the integration of recurrent neural networks. However, most ignore the inherent time-frequency correlations and complex structures of practical wireless channels. How to use such prior information to guide training and further enhance CSI reconstruction remains an open research question.
Motivated by the analysis above, this paper introduces an attention mechanism and designs an end-to-end deep learning framework for real-time CSI acquisition and feedback. The main contributions are: 1) introducing attention modules in the encoder to capture global dependencies within CSI matrices; 2) constructing high-fidelity dynamic channel models to generate datasets for performance evaluation.

System model
In the study of CSI compression feedback networks, modeling the end-to-end communication system is a necessary prerequisite that lays the mathematical foundation of the problem.

Communication model
Consider a downlink multi-user MIMO-OFDM system where the base station is equipped with N_t transmit antennas and K OFDM sub-carriers. To simplify the analysis, this paper assumes a single single-antenna user and focuses on the neural network architecture design. At time slot t and sub-carrier n, the received signal can be expressed as formula (1): y_{n,t} = h_{n,t}^H v_{n,t} s_{n,t} + z_{n,t}, where h_{n,t} ∈ ℂ^{N_t×1}, v_{n,t} ∈ ℂ^{N_t×1}, s_{n,t} ∈ ℂ and z_{n,t} ∈ ℂ denote the downlink channel vector, precoding vector, modulated symbol, and noise for the nth sub-carrier, respectively. The user estimates the CSI from the received downlink pilots and feeds back the matrix H̃ = [h̃_1, ⋯, h̃_K]^H ∈ ℂ^{K×N_t} to the base station. With H̃, the base station can design the precoding matrix V = [v_1, ⋯, v_K], obtaining accurate CSI for each sub-carrier for signal detection and modulation.

CSI feedback matrix
To reduce the feedback overhead, the user typically applies a two-dimensional discrete Fourier transform (DFT) to the estimated frequency-domain CSI matrix H as in formula (2), where F_d and F_a are the DFT matrices for the delay and angle domains, respectively. The transformed 2D matrix contains the delay and angle information for each sub-carrier. Since the delay spread is limited in OFDM systems, only the first K rows need to be fed back. Upon receiving the compressed CSI matrix in the angle-delay domain, the base station can restore the frequency-domain channel vector for the nth sub-carrier via the 2D inverse DFT in formula (3), where (F_d^H H̃)_n denotes the nth row of the transformed matrix.
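As a sketch of why this truncation is nearly lossless, the following NumPy example builds a channel that is sparse in the delay domain, transforms it to the angle-delay domain, and checks the energy retained after keeping only the first rows. The array sizes, tap count, and unitary DFT convention are our illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, Nt, Kd, taps = 256, 32, 32, 8   # sub-carriers, antennas, retained rows, delay taps (assumed)

# Channel that is sparse in the delay domain: only the first few taps are non-zero.
H_delay = np.zeros((K, Nt), dtype=complex)
H_delay[:taps] = rng.standard_normal((taps, Nt)) + 1j * rng.standard_normal((taps, Nt))

Fd = np.fft.fft(np.eye(K)) / np.sqrt(K)    # unitary DFT matrix, delay domain
Fa = np.fft.fft(np.eye(Nt)) / np.sqrt(Nt)  # unitary DFT matrix, angle domain

H_freq = Fd.conj().T @ H_delay             # frequency-domain CSI seen by the user
H_ad = Fd @ H_freq @ Fa.conj().T           # 2D DFT into the angle-delay domain

# The limited delay spread concentrates the energy in the first rows, so only
# H_ad[:Kd] needs to be fed back.
energy_kept = np.linalg.norm(H_ad[:Kd]) ** 2 / np.linalg.norm(H_ad) ** 2
print(f"energy retained after truncation: {energy_kept:.6f}")
```

With the delay taps confined to the retained rows, truncation discards essentially no channel energy.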
In this paper, we generate datasets based on the Tapped Delay Line (TDL) channel model from 3GPP for training and testing [7].

System workflow
In the studied end-to-end MIMO-OFDM system, the complete CSI compression and feedback workflow is as follows. First, the base station transmits downlink pilot signals to the user equipment, from which the user equipment estimates the frequency-domain CSI matrix. The user equipment then applies a two-dimensional discrete Fourier transform to the estimated CSI matrix to obtain a compact representation in the angle-delay domain. This compressed CSI matrix is fed into the encoder module of the designed end-to-end deep neural network for compression encoding. The resulting codeword is fed back to the base station via the uplink (ideal feedback conditions are assumed here). Upon receiving the codeword, the base station passes it through the decoder module of the end-to-end network to recover the CSI matrix. Finally, the base station designs signal detection and modulation schemes based on the recovered CSI matrix and transmits precoded signals in the next time slot.
This paper focuses on the design of the encoder and decoder modules of this end-to-end CSI compression and feedback network.

Evaluation method
To evaluate the CSI matrix reconstruction performance of the autoencoder, we adopt the normalized mean squared error (nMSE) and cosine similarity as evaluation metrics, defined as formula (4) and formula (5): nMSE = E{‖H − Ĥ‖_F² / ‖H‖_F²} and ρ = E{|⟨ĥ_n, h_n⟩| / (‖ĥ_n‖₂ ‖h_n‖₂)}, where ‖·‖_F denotes the Frobenius norm and ⟨·⟩ denotes the inner product. nMSE measures the normalized squared difference between the reconstructed matrix Ĥ and the original matrix H, while the cosine similarity ρ reflects the cosine of the angle between the two matrices viewed as vectors. Both metrics can evaluate reconstruction performance; we mainly use nMSE in the following research.
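A minimal NumPy sketch of the two metrics as just defined (function names are ours; the dB conversion matches the 10·log10 form used later for Table 1):

```python
import numpy as np

def nmse(H, H_hat):
    """Normalized MSE between true and reconstructed CSI matrices."""
    return np.linalg.norm(H - H_hat) ** 2 / np.linalg.norm(H) ** 2

def nmse_db(H, H_hat):
    """nMSE expressed in dB, as reported in the result tables."""
    return 10 * np.log10(nmse(H, H_hat))

def cosine_similarity(h, h_hat):
    """Cosine of the angle between two (flattened, possibly complex) CSI vectors."""
    h, h_hat = h.ravel(), h_hat.ravel()
    return np.abs(np.vdot(h_hat, h)) / (np.linalg.norm(h) * np.linalg.norm(h_hat))

H = np.array([[1.0, 0.0], [0.0, 1.0]])
print(nmse(H, H), cosine_similarity(H, 2 * H))
```

A perfect reconstruction gives nMSE = 0, and cosine similarity is invariant to scaling of the estimate.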

HCNet and training scheme
CSINet first adopted deep neural networks for CSI matrix compression and pioneered this area. Its encoder uses fully-connected layers to extract features and generate a compressed codeword, and its decoder reconstructs the channel matrix with two RefineNet layers [4]. This architecture demonstrated excellent compression and reconstruction capability, but relying only on fully-connected layers limits the encoder's ability to extract information from the matrix. The follow-up CSINet-LSTM introduced LSTM modules on top of CSINet to explore dependencies between CSI matrix elements [5]; this paper instead adopts the GRU network, which has fewer parameters. CRNet designed the encoder as convolutional residual blocks to enhance feature learning [6]. However, these methods overlook the fact that only a small portion of the spectral information in CSI matrices deviates significantly from the mean. To attend to these key elements, we propose HCNet with an attention mechanism, whose overall architecture is shown in Figure 1 [8].

Encoder
For the encoder, we followed the CRNet design pattern and constructed the backbone in residual form, but replaced the right branch with a module designed around self-attention. The CSI matrix first passes through three 1×1 convolutions, whose outputs serve as Q, K, and V. Q and K are then matrix-multiplied along the channel dimension to produce the channel-wise attention probability matrix; this operation computes a self-similarity map over channels. The probability matrix is finally multiplied with V to produce the attentive result for the CSI matrix.
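The channel-wise Q/K/V computation described above can be sketched in plain NumPy. Tensor sizes and weight scales here are illustrative assumptions; a 1×1 convolution acting on a (C, H, W) tensor is equivalent to a (C, C) matrix applied across the flattened spatial positions, which is what the sketch uses.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 16, 32                       # channels and CSI matrix size (illustrative)
x = rng.standard_normal((C, H, W))

# Three 1x1 convolutions = three per-position linear maps over channels.
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
flat = x.reshape(C, H * W)
Q, K_, V = Wq @ flat, Wk @ flat, Wv @ flat

# Channel-wise self-similarity, softmax-normalized into an attention matrix.
scores = Q @ K_.T                         # (C, C) similarity between channels
scores -= scores.max(axis=-1, keepdims=True)
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

out = (attn @ V).reshape(C, H, W)         # attentive result, same shape as the input
print(out.shape)
```

Each output channel is a probability-weighted mixture of the V channels, so globally correlated structure in the CSI matrix can influence every position.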
The left branch consists of three convolutional layers with kernel sizes 3×3, 1×9, and 9×1. Each convolution is followed by a batch normalization layer, which helps prevent gradient explosion, gradient vanishing, and overfitting while also accelerating convergence. The features from the two branches are then concatenated along the channel dimension and passed through a 1×1 convolution to extract spatial features again. They subsequently go through a GRU network for inter-element information extraction and finally into a fully-connected layer that produces the M-dimensional encoded vector.

Decoder
For the decoder, we added a GRU network after the final CRBlock of the CRNet decoder; see Figure 1 for the specific structure. The received encoded vector first passes through a fully-connected layer that upsamples the M-dimensional codeword back to the CSI matrix size. It then goes through a 5×5 convolution, followed by two CRBlocks that construct a richer mapping structure. Each CRBlock has two branches: the left branch places 3×3, 1×9, and 9×1 convolutions in series with 64 channels to help reconstruct the CSI matrix; using 1×9 and 9×1 kernels instead of a single 9×9 kernel reduces parameters and speeds up convergence. The right branch places a 1×5 and a 5×1 convolution in series. The two branches extract multi-scale features with different receptive fields. Their outputs are concatenated along the channel dimension and fused by a 1×1 convolution, and a residual connection between the block's input and output reduces information loss during propagation. The activated output is fed into the next CRBlock. After the two CRBlocks, a convolutional layer integrates the features and adjusts the output CSI matrix shape, and a GRU module completes the final decoding.
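The parameter saving from factorizing the 9×9 kernel can be made concrete with a quick count (channel width 64 as in the CRBlock left branch; bias terms omitted for simplicity):

```python
# Weight count of one 9x9 convolution vs. a 1x9 convolution followed by a 9x1
# convolution, both mapping 64 channels to 64 channels (biases omitted).
c_in = c_out = 64

params_9x9 = c_in * c_out * 9 * 9                          # single square kernel
params_factorized = c_in * c_out * 1 * 9 + c_out * c_out * 9 * 1

print(params_9x9, params_factorized)
```

The factorized pair keeps the same 9×9 receptive field while using 4.5× fewer weights, which is the convergence and size benefit the text refers to.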

Train scheme
The proposed HCNet is trained in an end-to-end manner. During training, we use nMSE as the loss function so that the loss directly reflects the difference between the network prediction and the ground truth. Adam is chosen as the optimizer, as it combines the merits of AdaGrad and RMSProp and enables fast convergence with stable parameter updates. The model is trained for 500 epochs with a batch size of 256. For the learning rate we use a dynamic schedule: starting from 10^-3, reducing to 2.5×10^-4 at the 15th epoch, recovering to 10^-3 at the 250th epoch, and finally increasing to 2×10^-3 at the 375th epoch. This scheduling helps prevent the network from getting stuck in local optima. As the activation function we chose LeakyReLU, expressed as formula (6), to increase the non-linear fitting ability of the model.
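The piecewise schedule above can be written as a small function. Treating epochs as 1-indexed and switching exactly at the stated epochs is our reading of the text:

```python
def learning_rate(epoch):
    """Piecewise-constant learning rate schedule described in the text.

    Epochs are assumed 1-indexed; the rate changes at epochs 15, 250, and 375.
    """
    if epoch < 15:
        return 1e-3       # initial rate
    if epoch < 250:
        return 2.5e-4     # reduced at the 15th epoch
    if epoch < 375:
        return 1e-3       # recovered at the 250th epoch
    return 2e-3           # raised at the 375th epoch

print([learning_rate(e) for e in (1, 15, 250, 375)])
```

Such a schedule would typically be passed to the optimizer once per epoch, e.g. by updating the optimizer's learning rate before each training pass.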
For model initialization, we adopt different strategies for fully-connected and convolutional layers: Kaiming uniform initialization for fully-connected layers, which helps prevent gradient vanishing and explosion, and Kaiming normal initialization for convolutional layers, to match the distribution characteristics of the non-linear activations. Overall, this initialization accelerates model convergence and reduces manual tuning.
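A minimal sketch of the two Kaiming (He) initializers, using the standard fan-in formulas (std = sqrt(2/fan_in) for the normal variant, bound = sqrt(6/fan_in) for the uniform variant, assuming a ReLU-family gain); shapes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def kaiming_normal(fan_in, shape):
    """Kaiming normal init: zero-mean Gaussian with std sqrt(2 / fan_in)."""
    return rng.standard_normal(shape) * np.sqrt(2.0 / fan_in)

def kaiming_uniform(fan_in, shape):
    """Kaiming uniform init: U(-b, b) with bound b = sqrt(6 / fan_in)."""
    bound = np.sqrt(6.0 / fan_in)
    return rng.uniform(-bound, bound, shape)

# Convolutional layer: fan_in = in_channels * kernel_h * kernel_w.
w_conv = kaiming_normal(fan_in=64 * 3 * 3, shape=(64, 64, 3, 3))
# Fully-connected layer: fan_in = input features.
w_fc = kaiming_uniform(fan_in=1024, shape=(1024, 256))
print(w_conv.std(), np.abs(w_fc).max())
```

Scaling the variance by fan-in keeps activation magnitudes roughly constant across layers, which is what speeds up early convergence.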

Simulation results
For the dataset, we generated 320,000 time points using the 3GPP TDL 5G channel model, which accurately simulates the multipath effects in mobile communications and reflects the dynamic characteristics of wireless channels [7]. Each time point corresponds to a CSI matrix of dimensions 2×16×32. The first dimension (2) holds the real and imaginary parts: standard network layers operate on real numbers, so the complex matrix is split into real and imaginary channels and the network learns the inherent relationship between them. The 16 and 32 dimensions represent the angles and delays of the CSI matrix after the DFT, keeping only the first K rows. During training, we use batches of 256 samples and split the dataset into training and test sets at a ratio of 4:1.
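The reshaping and split can be sketched as follows. Random data stands in for the TDL channel output, and N is scaled down from the paper's 320,000 samples to keep the example light:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000                                  # stand-in for the 320,000 generated time points

# Complex angle-delay CSI of shape (16, 32) becomes a real-valued sample of
# shape (2, 16, 32): channel 0 = real part, channel 1 = imaginary part.
csi_complex = rng.standard_normal((N, 16, 32)) + 1j * rng.standard_normal((N, 16, 32))
data = np.stack([csi_complex.real, csi_complex.imag], axis=1)

# 4:1 train/test split; training then proceeds in batches of 256.
split = N * 4 // 5
train_set, test_set = data[:split], data[split:]
print(data.shape, len(train_set), len(test_set))
```

At the paper's full scale, the same 4:1 split yields 256,000 training and 64,000 test samples.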
In addition to training and testing HCNet, we used three classic CSI compression networks as benchmarks: CSINet, CSINet-LSTM, and CRNet [4][5][6]. We set three compression ratios to explore their impact on reconstruction performance, evaluated by ρ and nMSE. The final experimental results are shown in Table 1, where nMSE is presented in dB via formula (7): nMSE_dB = 10 log₁₀(nMSE). The results show that HCNet achieves better reconstruction performance under all compression ratios, owing to its unique structure combining feature extraction, self-attention, and inter-element information extraction. The impact of the compression ratio is as expected: a lower CR, which means fewer channel resources, leads to poorer reconstruction performance.

During training, nMSE drops sharply in the initial epochs and then declines slowly, with fluctuations, over a long period. We also observed that suddenly increasing the learning rate during training causes a sudden increase in nMSE. Our initial analysis is that the overall optimization problem is non-convex, so abruptly increasing the learning rate makes parameter updates change drastically and prevents stable convergence towards the optimum. However, keeping the learning rate constant leads to low efficiency in later training, and using CRNet's training strategy also did not improve reconstruction performance [6]. In future work we will further explore the design of the learning strategy.

Moreover, we tried placing the attention module we designed at different locations in the network. Placing it at the very beginning did not work well, likely because information integrity was lost. Using a residual structure significantly improved performance but introduced much redundancy. If, instead of a self-attention structure, we use the GRU output as the K and Q matrices to mimic a seq2seq structure, the training effect is very poor. Our initial analysis is that the CSI matrix is better suited to image-style attention operations than to those used in NLP; our attention module is closer to the non-local model for image processing [9]. In future work, we will explore more advanced non-local attention networks, such as non-local sparse attention and spatially and temporally efficient non-local attention [10][11].

Model complexity can be measured by the number of trainable parameters. We analyzed the complexity of CSINet, CSINet-LSTM, CRNet, and HCNet; the results are shown in Table 2. Although HCNet has the most parameters, the model size is still acceptable. The extra parameters mainly come from the GRU module and the attention module, and this increase buys a substantial improvement in reconstruction performance, which is worthwhile.

Table 1 .
nMSE between predicted and true CSI for different neural network architectures and compression ratios.

Table 2 .
Comparison of model trainable parameters.

Conclusion
This paper proposes HCNet, an end-to-end deep learning network with an attention mechanism for real-time, efficient CSI compression and reconstruction in 5G massive MIMO systems under high-dynamic TDL channels. The main innovations are: 1) introducing a self-attention module in the encoder to capture global dependencies within the CSI matrix; 2) adding GRU modules in the encoder and decoder to exploit inter-feature correlations. We generate datasets based on the 3GPP TDL time-varying multipath model for training and validation. Results show that HCNet outperforms existing networks such as CSINet, CSINet-LSTM and CRNet under various compression ratios, reducing the normalized MSE by at least 4 dB. This is attributed to the self-attention and GRU modules: the self-attention module captures global dependencies and critical information within the CSI matrix via self-similarity computation, significantly enhancing reconstruction performance, while the GRU module exploits inter-feature correlations and reduces information loss during encoding and decoding. Although HCNet increases model complexity, the design greatly improves reconstruction quality at an acceptable cost and remains feasible and valuable for engineering practice. This work provides an innovative and effective end-to-end deep learning solution for real-time, efficient CSI compression and feedback in highly dynamic environments, as well as a useful reference for future research. Future work can proceed along the following lines: 1) introduce other, more advanced non-local attention modules, such as non-local sparse attention and spatially and temporally efficient non-local attention; 2) design multi-scale attention mechanisms to capture features of different dimensions; 3) optimize learning rate scheduling strategies to accelerate network convergence.