Lightweight Transformer with Transposed-attention for Hyperspectral Unmixing

Hyperspectral unmixing primarily involves accurately estimating the endmembers and abundances of mixed pixels in hyperspectral images. In this work, we design a transposed-attention mechanism within a lightweight Transformer to capture informative inter-channel relationships, providing better performance and faster computation. We compare the proposed model with the CyCU, FCLS, and DHTN methods. The quantitative and qualitative results indicate that our method produces higher-quality endmember spectra and abundance maps while remaining competitive in computation speed.


INTRODUCTION
The technique of hyperspectral unmixing (HU) has become critical in addressing the challenge of mixed pixels in hyperspectral images (HSIs). Unmixing algorithms are normally classified into the linear mixing model (LMM) [1] and the nonlinear mixing model (NLMM) [2]. The former posits that light interacts with a single pure material, while the latter considers the interaction of light with multiple pure substances. The NLMM accounts for the nonlinear interactions of materials in real-world scenarios, but it requires a great deal of prior knowledge. Over the past decades, numerous unmixing methods have been proposed to enhance unmixing performance, such as the geometric method of vertex component analysis [3], the augmented Lagrangian algorithm SUnSAL [4], the fully constrained least squares method [5], and so on.
Nowadays, deep learning (DL) has found widespread application in HU. CyCU-Net [6] employs two cascaded autoencoders (AEs) and has demonstrated good unmixing performance. Most AE-based unmixing methods rely on convolutional neural network (CNN) architectures [7, 8], in which the convolution operation is constrained by the kernel size and the limited dimensionality of the latent space captures only local features, leading to the loss of contextual information.
The Transformer has attracted the interest of researchers in HSI unmixing for its ability to capture global contextual feature dependencies and recover lost information [9]. However, applying the Transformer to unmixing tasks has some limitations. First, spectral information is spatially sparse. Without modeling inter-channel correlations, it is difficult to identify the spectral bands that contribute most to the predictions during unmixing [10], which degrades the estimated abundances and endmembers. Furthermore, the computational complexity of the Transformer grows quadratically with the spatial size, so the computational cost increases rapidly as the length of the input sequence grows.
In this paper, we introduce a lightweight Transformer architecture into an AE-based unmixing network to capture global contextual dependencies and map the original spectra to the latent space, which reduces the loss of boundary pixels and improves the estimated quality of abundances and endmembers. Furthermore, considering the different contributions of each spectral band to the HU task, we design a transposed-attention mechanism in the lightweight Transformer to provide better performance and faster computing speed.

Overall Architecture
Figure 1 displays the architecture of our network. The HSI is fed into the encoder to obtain the abundances, and the HSI is then reconstructed by the decoder. Specifically, the encoder consists of two stages, each containing one Conv Block and one lightweight Transformer Block. The Conv Block consists of a 1×1 convolution, a GELU activation function, and batch normalization (BN), which together extract salient image features. The Transformer Block is composed of transposed-attention, layer normalization (LN), and a multi-layer perceptron (MLP). The transposed-attention modules capture global contextual feature dependencies by determining long-range relationships between image patches. To reconstruct the HSI, the decoder consists of a ReLU activation and a 1×1 convolution.
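To make the data flow concrete, a minimal PyTorch sketch of this pipeline is given below. The TransformerBlock it refers to is sketched in the next subsection, and the layer widths (dims), band count (bands), and number of endmembers (num_end) are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Conv Block: 1x1 convolution + batch normalization + GELU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.body(x)


class UnmixingAE(nn.Module):
    """Two-stage encoder (Conv Block + lightweight Transformer Block per stage)
    followed by a decoder built from a ReLU and a 1x1 convolution.
    TransformerBlock is defined in the sketch in the next subsection."""
    def __init__(self, bands, num_end, dims=(128, 64)):
        super().__init__()
        self.stage1 = nn.Sequential(ConvBlock(bands, dims[0]), TransformerBlock(dims[0]))
        self.stage2 = nn.Sequential(ConvBlock(dims[0], dims[1]), TransformerBlock(dims[1]))
        self.to_abundance = nn.Conv2d(dims[1], num_end, kernel_size=1)
        self.decoder = nn.Sequential(nn.ReLU(), nn.Conv2d(num_end, bands, kernel_size=1, bias=False))

    def forward(self, x):                                   # x: (1, bands, H, W)
        a = self.to_abundance(self.stage2(self.stage1(x)))  # abundance maps (1, num_end, H, W)
        return self.decoder(a), a                           # reconstructed HSI and abundances
```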

Lightweight Transformer Unmixing with Transposed-attention
For a feature map $X \in \mathbb{R}^{N \times C}$, linear layers are used by the transposed-attention to generate the key K, query Q, and value V, where $Q, K, V \in \mathbb{R}^{N \times C}$, $N = H \times W$, and H and W denote the vertical and horizontal dimensions of the input HSI. The transposed-attention mechanism is defined as:

$X_{l+1} = \mathrm{Atten}(Q, K, V) + X_l$,  (1)

where $X_l$ denotes the feature entering the attention layer and $X_{l+1}$ denotes the output feature map obtained after passing through the Transformer Block. The dot-product interaction of $K^{T}$ and $Q$ is scaled by a learnable parameter $\alpha$ to generate a transposed-attention matrix of dimension $\mathbb{R}^{C \times C}$, which is distinct from the regular attention matrix of size $\mathbb{R}^{N \times N}$, and is then passed through the softmax function.
Here, the global feature is captured by computing the pairwise similarity between $K^{T}$ and $Q$.
After the transposed-attention module, the MLP processes the feature vector at each position to capture long-range dependencies, which can be expressed as:

$I_{n+1} = W_n I_n + U_n$,  (2)

where $I_n$ denotes the input features, $I_{n+1}$ denotes the output features, and $W_n$ and $U_n$ denote the weights and biases of the MLP layer, respectively.
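As a concrete illustration, a possible PyTorch realization of the lightweight Transformer Block is sketched below. The channel-wise (C×C) attention map follows Eq. (1); the pre-normalization placement, MLP expansion ratio, and GELU activation inside the MLP are assumptions made for this sketch rather than details stated in the paper.

```python
import torch
import torch.nn as nn


class TransposedAttention(nn.Module):
    """Channel-wise ("transposed") attention: the C x C attention map is built
    from K^T Q instead of the N x N map of ordinary spatial attention."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.alpha = nn.Parameter(torch.ones(1))        # learnable scaling of K^T Q
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                               # x: (B, N, C) with N = H*W
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = (k.transpose(-2, -1) @ q) * self.alpha   # (B, C, C) channel attention map
        attn = attn.softmax(dim=-1)
        return self.proj(v @ attn)                      # (B, N, C)


class TransformerBlock(nn.Module):
    """Transposed-attention + LN + MLP with residual connections, as in Eqs. (1)-(2)."""
    def __init__(self, dim, mlp_ratio=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = TransposedAttention(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                               # x: (B, C, H, W) feature map
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)                # (B, N, C), N = H*W
        t = t + self.attn(self.norm1(t))                # Eq. (1): X_{l+1} = Atten(Q, K, V) + X_l
        t = t + self.mlp(self.norm2(t))                 # Eq. (2): MLP applied at each position
        return t.transpose(1, 2).reshape(b, c, h, w)
```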
Figure 1. The overall architecture of our network.

Loss function
The loss function that covers the entire network can be described as:

$L = \beta L_{RE} + \gamma L_{SAD} + \eta L_{reg}$,  (4)

where $L$ is the overall loss function and $L_{RE}$ is the reconstruction error (RE) loss, which compares the input HSI with its reconstructed counterpart. $L_{SAD}$ is the spectral angle distance (SAD) loss, which measures the spectral similarity between each mixed pixel and its reconstruction. $L_{reg}$ is the regularization loss that encourages the abundance vectors to satisfy the abundance sum-to-one constraint (ASC). $\beta$, $\gamma$, and $\eta$ are the trade-off parameters of the three losses. $L_{RE}$, $L_{SAD}$, and $L_{reg}$ are defined in (5) to (7), respectively:

$L_{RE} = \frac{1}{P}\sum_{i=1}^{P}\lVert x_i - \hat{x}_i \rVert_2^2$,  (5)

$L_{SAD} = \frac{1}{P}\sum_{i=1}^{P}\arccos\!\left(\frac{x_i^{T}\hat{x}_i}{\lVert x_i \rVert_2\,\lVert \hat{x}_i \rVert_2}\right)$,  (6)

$L_{reg} = \frac{1}{P}\sum_{i=1}^{P}\left\lvert \sum_{j} a_{ij} - 1 \right\rvert$,  (7)

where $P$ is the number of pixels, $x_i$ denotes the $i$-th input pixel, $\hat{x}_i$ is the corresponding reconstructed pixel, and $a_{ij}$, the entry in the $i$-th row and $j$-th column of the abundance matrix, is the abundance value of the $j$-th component for the $i$-th pixel.
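A minimal PyTorch sketch of this composite loss, under the standard forms assumed in (4)-(7), could look as follows; the function and argument names, and the default trade-off values, are illustrative only.

```python
import torch


def unmixing_loss(x, x_hat, a, beta=1.0, gamma=1.0, eta=1.0):
    """Composite loss of Eqs. (4)-(7); beta/gamma/eta are placeholder trade-offs.

    x, x_hat: (P, B) input and reconstructed pixel matrices.
    a: (P, E) abundance matrix (P pixels, E endmembers).
    """
    # (5) reconstruction error between input and reconstructed pixels
    l_re = ((x - x_hat) ** 2).sum(dim=1).mean()
    # (6) spectral angle distance between each pixel and its reconstruction
    cos = (x * x_hat).sum(dim=1) / (x.norm(dim=1) * x_hat.norm(dim=1) + 1e-8)
    l_sad = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7)).mean()
    # (7) sum-to-one (ASC) regularization on the abundance vectors
    l_reg = (a.sum(dim=1) - 1.0).abs().mean()
    # (4) weighted combination of the three terms
    return beta * l_re + gamma * l_sad + eta * l_reg
```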

EXPERIMENT
We measure the effectiveness of our architecture on the Samson dataset and compare its performance with that of CyCU, FCLS, and DHTN [11]. The Samson dataset contains a 95×95-pixel scene with 156 bands and three endmembers. The experiments are implemented in the PyTorch framework (Python 3.8). The network's parameters are optimized with the Adamax optimizer for 200 iterations, starting from a learning rate of 0.01. During training, the learning rate smoothly decays to 10^{-6} according to a cosine annealing schedule. The qualitative and quantitative results are discussed as follows.
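As a sketch of that training configuration, the optimizer and learning-rate schedule could be set up as follows; the names model, hsi, and unmixing_loss are the assumed names from the earlier sketches, not the paper's code.

```python
import torch

# Assumed names: `model` (the unmixing network), `hsi` (a (1, bands, H, W) tensor),
# and `unmixing_loss` from the sketch in the loss-function section.
optimizer = torch.optim.Adamax(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200, eta_min=1e-6)


def to_pixels(t):
    """Reshape a (1, C, H, W) tensor into a (H*W, C) pixel matrix."""
    return t.squeeze(0).flatten(1).transpose(0, 1)


for it in range(200):
    optimizer.zero_grad()
    x_hat, a = model(hsi)                     # reconstruction and abundance maps
    loss = unmixing_loss(to_pixels(hsi), to_pixels(x_hat), to_pixels(a))
    loss.backward()
    optimizer.step()
    scheduler.step()                          # cosine annealing toward 1e-6
```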

Qualitative analysis for different unmixing methods
The visual abundance map comparisons on the Samson dataset are shown in Figure 2. The abundance maps of the DL-based methods, namely our method, CyCU, and DHTN, are of better quality than those of FCLS. Our method's abundance maps closely approximate the ground truth (GT), especially for the pure material water. The corresponding endmember comparisons are shown in Figure 3, in which the blue curves correspond to the ground-truth endmembers and the orange curves to the estimated endmembers. The proposed method acquires endmembers that are nearly identical to those of the GT.

Quantitative analysis for different unmixing methods
We give the quantitative analysis of the SAD and RMSE values evaluated on the Samson dataset and compare them with those of CyCU, FCLS, and DHTN, as indicated in Table 1, where the most favorable RMSE, mean RMSE, and mean SAD results are highlighted in bold. A smaller SAD value indicates a closer match between the endmembers obtained from the network and the GT, suggesting higher-quality endmember extraction in the context of hyperspectral image analysis. RMSE is a metric used to evaluate the dissimilarity between the estimated abundance maps and the corresponding GT.
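For reference, the two metrics can be computed per material as in the short NumPy sketch below; the mean RMSE and mean SAD reported in Table 1 would then be averages of these per-material values (the function names are illustrative).

```python
import numpy as np


def rmse(a_true, a_est):
    """Root-mean-square error between a ground-truth and an estimated abundance map."""
    return float(np.sqrt(np.mean((a_true - a_est) ** 2)))


def sad(e_true, e_est):
    """Spectral angle (in radians) between a ground-truth and an estimated endmember."""
    cos = np.dot(e_true, e_est) / (np.linalg.norm(e_true) * np.linalg.norm(e_est) + 1e-12)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```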
In terms of RMSE, our method achieves the lowest values for the materials Soil, Tree, and Water. For the mean RMSE and mean SAD, the proposed method obtains 0.0928 and 0.03179, respectively, decreases of 47.7% and 2.8% compared with the second-best method, DHTN. In addition, we record the average computation time of the DL-based methods over five runs on the Samson dataset. The computation time of our proposed method is 5.32 s, which is the shortest and validates the low computational complexity of our method.

CONCLUSION
The present study introduces a network architecture for hyperspectral unmixing that incorporates the transposed-attention mechanism into a lightweight Transformer. In the proposed architecture, CNNs are utilized to extract salient features from HSIs, and the Transformer block with transposed-attention is designed to capture long-range global information while keeping the network competitive in computing speed. We compare the qualitative and quantitative performance of the proposed method with that of other advanced methods on the Samson dataset. According to the experimental results, our method is competitive in both the qualitative and quantitative analyses.

Figure 2. Visual abundance map comparison on the Samson dataset.

Table 1. Quantitative analysis of SADs and RMSEs evaluated on the Samson dataset.