FCA-Net: A Fast Inference and Channel Attention Based Network for Hyperspectral Image Classification

Hyperspectral imaging (HSI) is a competitive remote sensing technique used in fields such as land cover mapping and environmental monitoring. Each HSI scene comprises numerous narrow, contiguous spectral bands, which makes extracting information from HSI data cubes a challenging and computationally demanding task. Convolutional neural networks (CNNs) have been widely adopted for HSI classification due to their impressive performance. Nevertheless, the substantial number of internal parameters within CNNs imposes high computational and memory requirements, resulting in low effective floating-point operations per second (FLOPS), particularly under frequent memory access and an abundance of operators. To address this issue, this paper proposes a novel framework named Fast Inference and Channel Attention-based Network (FCA-Net). Specifically, the framework introduces a lightweight convolutional layer and a channel attention mechanism (CAM) to enhance the extraction of spatial and spectral information within the network. The proposed FCA-Net significantly reduces computational costs while maintaining reliable classification results, and it runs quickly on a GPU or even a CPU, making it a promising option for embedded systems. Furthermore, the optimized global computational cost, including reduced demand for compute power and memory, results in lower energy consumption, which has previously been shown to benefit deep model performance.


Introduction
Hyperspectral Imaging (HSI) technology, a distinctive form of spectral imaging, is a non-invasive method in remote sensing that enables the observation of Earth while preserving the integrity of the observed sample [1]. The essence of an HSI system lies in its sensor, which acquires the electromagnetic radiation emitted or reflected by the observed scene through narrow continuous or discontinuous bands. The information derived from the HSI data cube surpasses that obtained from conventional imaging systems [2]. HSI technology exhibits substantial potential for substance identification, thereby establishing itself as a formidable instrument in diverse scientific and technical domains. Notably, the applications of HSI span fields such as geology, landscape characterization, and vegetation monitoring, among others [3]. Nonetheless, extracting the desired information from the raw HSI data cube necessitates intricate data processing steps, encompassing calibration, atmospheric correction, and decomposition.
Deep convolutional networks have gained extensive utilization in remote sensing owing to their inherent capacity for automated local feature extraction from raw data [4]. Their ability to extract features from high-dimensional data holds particular appeal for the remote sensing community. Various approaches based on convolutional neural networks (CNNs) have been employed for hyperspectral image classification (HSIC), with diverse levels of success. For instance, Zhong et al. [5] devised an end-to-end spectral-spatial residual network (SSRN) for HSIC. Zhang et al. [6] introduced an effective CNN-based spectral partitioning (SP) residual network, in which the SP operation and parallel convolutions are executed using grouped convolutions, alongside the enhancement of residual blocks through additional branches. Ye et al. [7] introduced a multiscale spatial-spectral feature-extraction network that operates in a more granular manner.
While CNN-based methodologies have shown continuous advancements in classification performance by increasing network complexity, this progression inevitably escalates the computational power and memory required for network training. This presents a formidable hurdle when deploying such methods on embedded systems and devices, including Unmanned Aerial Vehicles (UAVs), which require real-time data processing. To tackle this challenge, it becomes imperative to devise lightweight networks capable of efficiently executing hyperspectral image classification (HSIC) tasks. Previous studies have predominantly focused on reducing network parameters and floating-point operations (FLOPs); however, this does not necessarily reduce a network's inference time and latency. Consequently, it becomes crucial to reevaluate this challenge, drawing inspiration from prior models that have endeavored to minimize the number of parameters and computational costs associated with convolutional layers [8][9]. This paper explores the potential of these approaches in the realm of three-dimensional hyperspectral imagery.
This paper introduces a novel lightweight and hardware-friendly model for highly accurate hyperspectral data classification. The proposed architecture integrates a lightweight convolutional layer to decrease the number of filters required for spatial information extraction. Furthermore, a channel attention mechanism (CAM) is incorporated into the network to address the inherent challenges of spectral information extraction in neural networks. The presented model achieves a significant reduction in computational costs while maintaining high accuracy, making it a suitable choice for embedded systems. Additionally, the optimized global computational cost, which encompasses reduced compute power and memory requirements, contributes to lower energy consumption, offering potential benefits in enhancing the performance of deep models.

Overview of Architecture
The architecture includes a spatial feature extraction (SFE) module and a channel attention mechanism (CAM) module to address the limited ability of neural networks to extract spectral information. Specifically, the network uses partial convolution to extract spatial features effectively by reducing redundant computation and memory access. The CAM module then extracts channel features from the spatially extracted feature map, involving only a few parameters while providing significant performance gains. Finally, the obtained features are sent to the classifier to produce the final classification result. The overall architecture of the fast inference and channel attention based network (FCA-Net) is shown in Figure 1.

Spatial Feature Extraction Module
When performing feature extraction, convolutional neural networks exhibit high similarity between feature maps in different channels. This computational redundancy has been addressed in previous studies [10]; however, only a few have arrived at a simple solution. This study proposes Partial Convolution (PConv) as a means of effectively decreasing computational redundancy and optimizing memory access, thereby lowering computational cost.
The upper section of Figure 2 illustrates the functionality of PConv, which applies a regular convolution to only a subset of the input channels for spatial feature extraction while leaving the remaining channels unaffected. To enable efficient memory access, we take the first or last c_p consecutive channels as representative of the complete feature map during computation. This assumption is made without loss of generality, as the input and output feature maps have a similar number of channels. As a result, the floating-point operations (FLOPs) required by PConv are reduced to

h × w × k² × c_p²,

where h and w represent the height and width of the feature map, k denotes the size of the convolution kernel, and c_p signifies the number of channels on which PConv operates. With a conventional partial ratio of r = c_p/c = 1/4, where c is the channel count of the original convolution, the FLOPs required by PConv are only 1/16 of those of a regular convolution. Additionally, PConv necessitates less memory access: when r = 1/4, the memory access is only about 1/4 of that of a regular convolution.
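The 1/16 ratio follows directly from the quadratic dependence of convolution FLOPs on channel count; a quick sanity check in plain Python (the feature-map and channel sizes below are illustrative values, not taken from the paper):

```python
def conv_flops(h, w, k, c_in, c_out):
    """Multiply-accumulate count of a dense k x k convolution layer."""
    return h * w * k * k * c_in * c_out

# Illustrative sizes: 7x7 feature map, 3x3 kernel, 64 channels.
h, w, k, c = 7, 7, 3, 64
cp = c // 4                             # partial ratio r = c_p / c = 1/4
regular = conv_flops(h, w, k, c, c)     # regular convolution over all channels
partial = conv_flops(h, w, k, cp, cp)   # PConv touches only c_p channels
print(partial / regular)                # -> 0.0625, i.e. 1/16
```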
If the remaining (c − c_p) channels were simply removed, PConv would extract spatial features from only c_p channels, amounting to a regular convolution with fewer channels, which deviates from our objective of redundancy reduction. We therefore do not eliminate them from the feature map but preserve them as they are, since they are useful for the subsequent Conv 1×1 layers. This approach enables feature information to propagate through all channels, keeps the design simple without excess weight, and makes the overall architecture hardware friendly. To integrate the extracted features comprehensively, additional processing is required to facilitate classification tasks. As depicted in the lower portion of Figure 2, the PConv layer of the SFE module is followed by two Conv 1×1 layers, a configuration akin to inverted residual blocks, in which the intermediate layer has a higher number of channels.
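As a concrete illustration, the PConv operation and the SFE block described above can be sketched in PyTorch. This is a minimal sketch based on the description in the text: the class names, the expansion factor, and the use of BatchNorm/ReLU between the two 1×1 convolutions are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: apply a k x k convolution to the first c_p
    channels only; the remaining channels pass through untouched."""
    def __init__(self, channels, partial_ratio=0.25, kernel_size=3):
        super().__init__()
        self.cp = int(channels * partial_ratio)
        self.conv = nn.Conv2d(self.cp, self.cp, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        xp, rest = x[:, :self.cp], x[:, self.cp:]
        return torch.cat([self.conv(xp), rest], dim=1)

class SFEBlock(nn.Module):
    """PConv followed by two Conv 1x1 layers (inverted-residual style),
    with a wider intermediate layer, as in the lower part of Figure 2."""
    def __init__(self, channels, expand=2):
        super().__init__()
        mid = channels * expand
        self.pconv = PConv(channels)
        self.pw = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False),
        )

    def forward(self, x):
        return x + self.pw(self.pconv(x))
```

Because the untouched channels are concatenated back rather than dropped, the subsequent 1×1 convolutions can still mix information across all channels.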

Channel Attention Mechanism
To further enhance the classification performance of the network, it has become imperative to incorporate attention mechanisms that capture the nonlinear relationships between channels in feature maps. Among conventional attention mechanisms, SENet [11], which employs a squeeze-and-excitation channel attention approach, has garnered significant attention. In SENet, the squeeze component utilizes Global Average Pooling (GAP) to transform feature maps into one-dimensional vectors, while the excitation component employs two Fully Connected (FC) layers to determine the weight of each channel. However, the inclusion of FC layers increases the complexity of SENet. To address this issue, ECANet [12] replaces the two FC layers in the excitation component of SENet with 1D convolutions, resulting in an effective attention mechanism. This substitution prevents the information loss caused by dimensionality reduction, simplifies the model's complexity, and facilitates effective interaction of cross-channel information.
Inspired by [12], we introduce a method that adaptively selects the kernel size for the 1D convolution and provides an efficient way of capturing channel attention. Our Channel Attention Mechanism (CAM) module requires only a limited number of parameters, yet delivers substantial performance improvements. The CAM module uses a small convolution kernel to extract inter-channel relationships of the features, as illustrated in Figure 3. This enhances computational speed and may produce superior outcomes in situations that require real-time video processing or human-robot interaction.
The feature map first passes through Global Average Pooling (GAP) to capture the global context, which can be expressed as

g(X) = (1 / (w · h)) Σ_{i=1}^{w} Σ_{j=1}^{h} X_{ij},

where X is the input feature map, and w and h represent its width and height, respectively. The channel weights are then obtained as

ω = σ(Conv1D(Unfold(g(X)))),   (4)

where Unfold(·) performs sliding-window operations on the input tensor, splitting it into smaller segments and flattening them into a one-dimensional tensor, Conv1D(·) denotes one-dimensional convolution, and σ(·) denotes the Sigmoid activation function.
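Under these definitions, an ECA-style CAM module can be sketched in PyTorch as follows. This is a hedged sketch: the kernel-size adaptation rule (borrowed from the ECANet heuristic), the class name, and the hyperparameters gamma and b are assumptions rather than the paper's exact settings.

```python
import math
import torch
import torch.nn as nn

class CAM(nn.Module):
    """ECA-style channel attention: GAP over the spatial dimensions,
    a 1D convolution along the channel vector, then a Sigmoid gate
    that rescales each channel of the input."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Adaptive odd kernel size, following the ECANet heuristic.
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (B, C, H, W) -> GAP -> (B, 1, C)
        y = x.mean(dim=(2, 3)).unsqueeze(1)
        w = self.sigmoid(self.conv(y))            # channel weights in (0, 1)
        return x * w.transpose(1, 2).unsqueeze(-1)
```

The 1D convolution spans only a few neighboring channels, so the module adds a handful of parameters regardless of the channel count.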

Data Description and Training Details
We employed three publicly available hyperspectral imaging (HSI) datasets for our experiments: Indian Pines (IP), University of Pavia (UP), and Pavia Centre (PC). The IP dataset comprises images with a spatial dimension of 145×145 and 224 spectral bands spanning the wavelength range of 400-2500 nm; 24 spectral bands associated with water absorption were excluded from analysis. The ground truth categorizes vegetation into 16 distinct classes. The UP dataset encompasses pixels with a spatial dimension of 610×340 and 103 spectral bands within the wavelength range of 430-860 nm; the ground truth labels correspond to 9 urban land-cover classes. Lastly, the PC dataset contains images with a spatial dimension of 1096×492 and 115 spectral bands within the wavelength range of 430-860 nm; 13 spectral bands sensitive to water absorption were disregarded, and the dataset consists of 9 distinct classes. For our experiments, we randomly selected 10% of the labeled samples in IP and 1% in UP and PC for training, using the remaining samples for testing.
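The networks compared below all operate on 3-D patches cut around labeled pixels. A hypothetical sketch of this preprocessing step (the function name, the reflect padding, and the zero-based label shift are illustrative assumptions, not the paper's stated procedure):

```python
import numpy as np

def extract_patches(cube, labels, patch=7):
    """Cut a patch x patch spatial window around every labeled pixel
    of an (H, W, B) hyperspectral cube; labels use 0 for 'unlabeled'."""
    pad = patch // 2
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    xs, ys = [], []
    for i, j in zip(*np.nonzero(labels)):
        xs.append(padded[i:i + patch, j:j + patch, :])
        ys.append(labels[i, j] - 1)  # shift to zero-based class indices
    return np.stack(xs), np.array(ys)
```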
The experiments were carried out on a system equipped with an Intel CPU i9 9900k, a single NVIDIA GeForce RTX 2080Ti GPU, and 64-GB DDR5 memory.The PyTorch framework was employed for conducting the experiments.The optimal learning rate of 0.001 was determined based on the classification results.

Comparison of Classification Accuracies
The performance evaluation of hyperspectral image classification employed several measures: overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (Kappa). OA is the ratio of correctly classified samples to the total number of test samples. AA is the average of the per-class classification accuracies. Kappa is a statistical metric that measures the agreement between the ground truth map and the classification map. Let C ∈ R^(n×n) denote the confusion matrix derived from the classification results, and let N be the total number of test samples. The OA, AA, and Kappa values are computed as

OA = (Σ_{i=1}^{n} C_ii) / N,

AA = (1/n) Σ_{i=1}^{n} C_ii / C_{i+},

Kappa = (N Σ_{i=1}^{n} C_ii − Σ_{i=1}^{n} C_{i+} C_{+i}) / (N² − Σ_{i=1}^{n} C_{i+} C_{+i}),

where C_{+i} represents the sum of the i-th column of C and C_{i+} is the sum of the i-th row of C.
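These three metrics can be computed directly from the confusion matrix; a minimal sketch, assuming rows index the ground-truth classes and columns the predictions:

```python
import numpy as np

def classification_metrics(C):
    """OA, AA and Kappa from an n x n confusion matrix C
    (rows = ground truth, columns = prediction)."""
    C = np.asarray(C, dtype=np.float64)
    N = C.sum()                            # total number of test samples
    diag = np.diag(C)
    oa = diag.sum() / N
    aa = np.mean(diag / C.sum(axis=1))     # mean of per-class accuracies
    chance = (C.sum(axis=1) * C.sum(axis=0)).sum()
    kappa = (N * diag.sum() - chance) / (N**2 - chance)
    return oa, aa, kappa
```

For example, a 2-class matrix [[4, 1], [1, 4]] gives OA = AA = 0.8 and Kappa = 0.6.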
The performance of the proposed FCA-Net model is compared with widely adopted supervised methods, including SSRN [5], SPRN [6], GhostNet [9], DFFN [13], and DHCNet [14]. Among these methods, DFFN and DHCNet utilize 2-D convolutional neural networks (CNNs), while SSRN is based on a 3-D CNN. To ensure a fair comparison, we extracted 3-D patches of the same spatial dimension from the input volume across the different datasets. In DFFN and DHCNet, the max-pooling layers were removed and the convolution stride was set to 1 to accommodate the small patch size. Table 1 presents the classification outcomes of the various methods in terms of OA, AA, and Kappa. It is evident from the table that FCA-Net outperforms all the compared methods on each dataset. Compared with 2-D CNN-based methods such as DFFN and DHCNet, the proposed approach exhibits notable improvements; this suggests that the CAM strategy can effectively enhance performance, given that DFFN also utilizes a residual network. Among 3-D CNN-based approaches, FCA-Net achieves higher overall accuracy on the Indian Pines dataset, surpassing SPRN and SSRN by 0.31% and 1.85%, respectively. Although FCA-Net does not achieve the highest average accuracy on the Indian Pines dataset, this can be attributed to the dataset's imbalanced class distribution, particularly the seventh class, which has a limited number of samples of grass and oat crops. For the University of Pavia dataset, FCA-Net outperforms SPRN and GhostNet by 1.03% and 3.29%, respectively. Similarly, for the Pavia Centre dataset, although SPRN shows results comparable to SSRN, FCA-Net consistently achieves higher accuracy than the other networks.
Table 2 indicates that FCA-Net is an attractive low-parameter and low-FLOPs option; its FLOPs are only one seventh of those of SSRN. While most networks increase the number of channels to improve performance, this also increases their latency. In contrast, our method shows performance comparable to GhostNet and SPRN while exhibiting the lowest latency on both GPU and CPU. We therefore conducted a trade-off study between model complexity and network latency to optimize the proposed method.
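Latency figures of this kind can be measured with a simple wall-clock harness; a sketch follows, where the warm-up and iteration counts are arbitrary choices, and GPU timing would additionally require `torch.cuda.synchronize()` around the timed region:

```python
import time
import torch

@torch.no_grad()
def cpu_latency_ms(model, x, warmup=5, iters=20):
    """Average single-batch CPU inference time in milliseconds."""
    model.eval()
    for _ in range(warmup):          # warm-up runs excluded from timing
        model(x)
    t0 = time.perf_counter()
    for _ in range(iters):
        model(x)
    return (time.perf_counter() - t0) / iters * 1000.0
```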

The Impact of Input Patch Size
To optimize classification accuracy, careful selection of the input patch size is crucial. If the patch size is too small, the network may not adequately capture features from the image, while a larger patch size provides more local spatial information at the cost of increased computational workload. Additionally, larger patches may contain more interfering pixels, potentially impairing feature extraction and diminishing the model's classification performance. To determine the optimal patch size, we conducted experiments on the three datasets, evaluating classification overall accuracy (OA) for patch sizes of 7×7, 9×9, and 11×11. The results in Table 3 demonstrate that the proposed method achieves the highest performance with a patch size of 7×7 on all three datasets.

Ablation Experiments
The FCA-Net is designed based on past experience and acquired expertise. To assess the influence of its components, we conducted ablation experiments by modifying the FCA-Net structure. Specifically, we examined the following networks: Network A, trained with traditional convolution without the SFE module or the CAM module; Network B, in which the SFE module is removed from FCA-Net to evaluate its impact on HSI classification accuracy; and Network C, which omits the CAM module from the bottleneck to analyze its influence on classification performance. As shown in Table 4, Network C consistently outperforms Network B, and FCA-Net is always superior to Network C. This indicates that the SFE module not only effectively reduces the network's latency but also yields feature maps with richer information. Comparing FCA-Net and Network C, we found that adding the CAM module generally improves classification performance by a few percentage points, particularly on the Indian Pines dataset; the attention module effectively enhances the weight of useful features and suppresses irrelevant ones. In summary, FCA-Net, which incorporates both spatial and channel feature extraction through the SFE module and the CAM module, can effectively improve the performance of hyperspectral image classification.

Conclusion
This paper introduces FCA-Net, an efficient hyperspectral image classification network that incorporates both the SFE module and the CAM module. The SFE module uses the PConv operation to act on only part of the input channels, reducing redundant computation in spatial feature extraction. Additionally, the network employs a channel attention mechanism to maximize the extraction of features from the SFE module's spatial feature maps, improving the feature extraction and generalization ability of deep neural networks. Comparison with other state-of-the-art networks confirms that FCA-Net is superior in terms of parameters and latency, achieving superior classification performance while reducing the required computational complexity.

Figure 1. Graphical overview of the proposed network.

Table 1. Classification accuracies (%) on the IP, UP, and PC datasets using the proposed and state-of-the-art methods. The best result is highlighted in bold font.

Table 2. Comparison of parameters, computational complexity, and inference time.

Table 3. Impact of the spatial patch size on the performance of the proposed method.

Table 4. Comparative experimental results of networks with and without the SFE module and the CAM module.