A Lightweight Semi-supervised Eye Semantic Segmentation Method Based on Iterative Learning

Eye semantic segmentation plays a significant role in biometric recognition. Most existing works improve accuracy by designing complicated feature-learning modules, which usually incur high computation costs and are prone to overfitting. In this work, we propose a lightweight eye semantic segmentation method based on iterative learning, which requires only a small amount of labeled eye image data under visible light conditions. Specifically, we first propose a pseudo-label filtering strategy for eye semantic segmentation, which leverages a semi-supervised incremental learning technique to alleviate label scarcity. In addition, an attention-guided lightweight high-representation module is designed to boost precision while preserving efficiency. Experiments on the Ubiris dataset show that the proposed method achieves state-of-the-art performance, with an mIoU of 86.18%.


Introduction
The eye semantic segmentation task aims to identify the pupil, iris, sclera, etc., and has important applications in biometric identification, liveness detection, and other fields. Over the past few decades, researchers have explored both infrared-based and visible light-based methods. Compared with visible light-based methods, infrared-based methods usually incur higher costs. This work focuses on eye semantic segmentation under visible light conditions, where existing approaches can be roughly divided into traditional methods and deep learning methods.
Traditional methods are non-learning methods that generally locate iris boundaries from low-level features, such as the Hough transform [1,2,3] and gray intensity [4], and then refine the edges iteratively [5,6]. These methods depend heavily on the quality of image preprocessing. With the development of deep learning, eye segmentation has shifted toward learning-based techniques. Liu et al. [7] first applied a convolutional neural network to iris segmentation and verified that its performance was superior to various non-learning methods. Based on the observation that predicting iris peripheral information helps iris segmentation accuracy, Wang et al. [8] proposed a multi-task segmentation network to extract the iris, pupil, and iris outer boundary simultaneously. Arslan et al. [9,10,11] studied iris segmentation under unconstrained conditions, and their models were well validated on infrared and visible-wavelength datasets. Lian et al. [12] and Zhang et al. [13] modified U-Net [14], introducing an attention mechanism or dilated convolution to improve iris segmentation performance. Although the above methods achieve acceptable segmentation results, they generally focus on single-element eye segmentation.
Currently, there is little research on multi-element eye semantic segmentation, mainly because multi-element segmentation datasets are limited. Rot et al. [16] first transferred SegNet [17] to eye segmentation under visible light conditions. Further, ORED-Net, proposed by Naqvi et al. [15], refines this line of work through an encoder-decoder strategy. However, these methods rely on high computation costs, and their models easily overfit because of the fully supervised learning mode and the small amount of data. To address these problems, we propose a lightweight multi-element eye semantic segmentation algorithm and construct a semi-supervised incremental learning strategy based on iterative learning. Specifically, a lightweight eye multi-element semantic segmentation network is presented, and a pre-trained model is obtained from a small number of labeled visible-light eye images. Then, a pseudo-label filtering strategy is developed to acquire and filter pseudo-labels of unlabeled data, which are added to the training set for iterative data-incremental learning.
The main contributions of this work are as follows:
- A lightweight semi-supervised eye semantic segmentation network leveraging the attention mechanism, dubbed SemiESNet, is proposed. It comprehensively utilizes spatial attention and channel attention to explore position cues and recalibrate features of different dimensions, enhancing effective features with a small number of parameters.
- A semi-supervised incremental learning strategy is developed, which constructs a pseudo-label filter based on the Hu moments of the eye image and continuously adds high-confidence trainable data to improve model reliability. It provides a novel solution for eye semantic segmentation under visible light conditions using only a small amount of annotated data.
- The proposed SemiESNet is validated through comparison and ablation experiments on the Ubiris dataset, and the results demonstrate that SemiESNet achieves state-of-the-art accuracy with a small number of parameters.

Related work
Rot et al. [16] first constructed a network based on SegNet [17] to realize eye segmentation in the visible light band, but the model's parameters are huge and the evaluation is limited to a single metric. Naqvi et al. [18] proposed the residual ScleraNet based on an encoder-decoder structure, which uses skip connections to retain high-frequency information and performs well on the SBVPI dataset [19]. Building on ScleraNet, ORED-Net [15] adds convolution and activation functions to the skip connections linking the two ends of the encoder-decoder network, aiming to enhance feature transfer, and obtains the best results on the SBVPI [19] and Ubiris [20] datasets. However, the parameters of these models all exceed 1.0 MB, far from the lightweight model standard (<1 MB).
Attention mechanisms focus on essential features and suppress unnecessary ones; they can be divided into channel attention and spatial attention mechanisms according to their attention targets. Channel attention mechanisms, such as SE [21], model inter-channel relationships and work well in object classification tasks. Spatial attention mechanisms, such as NL [22], model relationships between spatial positions and improve object detection, classification, and pose estimation. This work embraces both spatial attention and channel attention, in the branches and in the aggregation, for the multi-element eye semantic segmentation task.
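As an illustration of the channel attention this work builds on, the following is a minimal PyTorch sketch of an SE-style block [21]; the module structure and hyperparameters (e.g. the reduction ratio) are our assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention (illustrative hyperparameters)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # squeeze: global spatial context
        self.fc = nn.Sequential(                   # excitation: per-channel weights in (0, 1)
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                               # recalibrate each channel

x = torch.randn(2, 16, 8, 8)
y = SEBlock(16)(x)
```

The output keeps the input shape; only the relative channel magnitudes change.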

Method
To achieve lightweight, high-precision multi-element eye semantic segmentation, we propose a novel framework based on iterative learning. The proposed SemiESNet pipeline is shown in Figure 1 and mainly includes data processing, pre-trained model generation and pseudo-label filtering, and iterative learning. We elaborate on each part below.
(a) Data processing.
According to whether labels are available, the data is divided into labeled, unlabeled, and pseudo-labeled datasets. The annotations of each image in the labeled eye dataset are uniformly merged into a single label file, which is convenient for counting the categories. At the same time, the subjects are divided equally into two groups for cross-validation: one group serves as the training set and the other as the validation set. The pseudo-label dataset is initially empty.

Training Network: SemiESNet
Figure 2 shows the structure of the proposed SemiESNet, which adopts an asymmetric encoder-decoder to extract features efficiently. The encoder-decoder structure is a common framework for semantic segmentation and is generally built in a strictly symmetric manner, as in U-Net [14] and SegNet [17]. In recent years, asymmetric encoder-decoder networks such as ESPNet [23] have emerged for lightweight segmentation. The ESP module has few parameters and a large receptive field; its construction simulates a pyramid and assigns the same contribution to all pyramid features. In fact, the pyramid features should not contribute equally. Therefore, we embrace an attention mechanism to adjust the contributions adaptively and propose the ATTESP module.
The ATTESP module is shown in Figure 3. Each branch feature of the module has its own representation. Hence, a spatial attention mechanism is introduced into the five branches, as shown in Figure 4. Then, the information of the five branches is integrated, and the weight of each layer of the feature pyramid is learned with a channel attention mechanism, as shown in Figure 5.
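A hedged PyTorch sketch of such a module, assuming five dilated-convolution branches, a simple convolutional spatial-attention map per branch, and SE-style channel attention over the fused pyramid; the exact attention designs in Figures 4 and 5 may differ.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Single-channel sigmoid map over spatial positions (illustrative design)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=7, padding=3)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))     # emphasize informative positions

class AttESPSketch(nn.Module):
    """ESP-style dilated pyramid with per-branch spatial attention and
    SE-style channel attention over the fused pyramid features."""
    def __init__(self, in_ch: int, out_ch: int, branches: int = 5):
        super().__init__()
        d = out_ch // branches                     # channels per pyramid branch
        self.reduce = nn.Conv2d(in_ch, d, 1)
        self.branches = nn.ModuleList(
            [nn.Conv2d(d, d, 3, padding=2 ** k, dilation=2 ** k) for k in range(branches)]
        )
        self.spatial = nn.ModuleList([SpatialAttention(d) for _ in range(branches)])
        total = d * branches
        self.channel = nn.Sequential(              # learn per-level contributions
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(total, total, 1), nn.Sigmoid()
        )

    def forward(self, x):
        x = self.reduce(x)
        feats = [sa(br(x)) for br, sa in zip(self.branches, self.spatial)]
        y = torch.cat(feats, dim=1)                # stack the feature pyramid
        return y * self.channel(y)                 # reweight pyramid levels

y = AttESPSketch(16, 20)(torch.randn(1, 16, 32, 32))
```

The channel attention replaces the equal-contribution fusion of the original ESP module with learned per-level weights.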

Pseudo label filtering strategy
The pseudo-label filtering strategy determines the quality of pseudo-labels in incremental learning and has an important impact on the final segmentation results. The overall filtering process is shown in Figure 6. First, the training portion (data_l, label_l) of the labeled dataset is sent to the filter parameter formulation module, and the standard parameters are obtained from statistics of the data characteristics. At the same time, the candidate pseudo-label data (candidate data_c, candidate pseudo-label_c) are sent to the pseudo-label parameter extraction module, where the pseudo-label parameters are obtained by statistically summarizing the characteristics of each single sample and stored. The standard parameters and the pseudo-label parameters are then fed into a data filter, which judges whether a pseudo-label meets the condition by comparing the difference between its pseudo-label parameters and the standard parameters. Finally, all qualifying pseudo-label data (data_p, pseudo-label_p) are output as supplementary training data for the next round of training.
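The filtering loop above can be sketched as follows; the parameter representation and the acceptance criterion are placeholders (the paper's actual parameters are the Hu-moment statistics of the Hu moment-based strategy, and the 10% fraction follows its top-10% selection rule).

```python
import numpy as np

def filter_pseudo_labels(standard_params, candidates, top_frac=0.10):
    """candidates: list of (data, pseudo_label, params) triples.
    Returns the accepted (data, pseudo_label) pairs."""
    scored = []
    for data, plabel, params in candidates:
        # total error: summed absolute deviation from the standard parameters
        err = float(np.sum(np.abs(np.asarray(params) - np.asarray(standard_params))))
        scored.append((err, data, plabel))
    scored.sort(key=lambda t: t[0])               # smallest error = most reliable
    keep = max(1, int(len(scored) * top_frac))    # keep the top 10% by default
    return [(d, p) for _, d, p in scored[:keep]]

# toy usage: candidate 0 matches the standard parameters exactly
std = [1.0, 2.0]
cands = [(i, 10 + i, [1.0 + 0.1 * i, 2.0]) for i in range(10)]
kept = filter_pseudo_labels(std, cands)
```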

Hu moment-based pseudo-label filtering strategy
The Hu moments reflect the shape features of an image by describing its gray-value distribution. Inspired by this, we design a Hu moment-based pseudo-label filtering strategy, which mainly uses the seven Hu invariant moments to describe the image gray distribution and achieve pseudo-label filtering. It consists of a filter parameter formulation module and a filter parameter extraction module, detailed as follows.
Figure 7 shows the Hu moment-based filter parameter formulation module. To avoid the influence of the rich colors of eye images, the foreground object is first located and the element-wise masks of the eye are extracted. The central Hu moments of each element c of the training set are then computed and output as the standard parameters, where c ∈ {eye, iris, pupil}. The pseudo-label data meeting the conditions are then sorted by total error value, and the top 10% are selected and added to the pseudo-label dataset (data_p, pseudo-label_p).
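For concreteness, the seven Hu invariant moments of an element mask can be computed as below; this pure-NumPy sketch follows the standard Hu formulas and is not the paper's code.

```python
import numpy as np

def hu_moments(mask: np.ndarray) -> np.ndarray:
    """Seven Hu invariant moments of a binary element mask (standard formulas)."""
    ys, xs = np.mgrid[:mask.shape[0], :mask.shape[1]]

    def m(p, q):                                   # raw image moment
        return float(np.sum((xs ** p) * (ys ** q) * mask))

    m00 = m(0, 0)
    xc, yc = m(1, 0) / m00, m(0, 1) / m00

    def eta(p, q):                                 # normalized central moment
        mu = float(np.sum(((xs - xc) ** p) * ((ys - yc) ** q) * mask))
        return mu / (m00 ** (1 + (p + q) / 2))

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    h1 = n20 + n02
    h2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    h3 = (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2
    h4 = (n30 + n12) ** 2 + (n21 + n03) ** 2
    h5 = ((n30 - 3 * n12) * (n30 + n12) * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
          + (3 * n21 - n03) * (n21 + n03) * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2))
    h6 = ((n20 - n02) * ((n30 + n12) ** 2 - (n21 + n03) ** 2)
          + 4 * n11 * (n30 + n12) * (n21 + n03))
    h7 = ((3 * n21 - n03) * (n30 + n12) * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
          - (n30 - 3 * n12) * (n21 + n03) * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2))
    return np.array([h1, h2, h3, h4, h5, h6, h7])

# the invariants are unchanged when an element mask is translated
a = np.zeros((20, 20)); a[2:6, 3:7] = 1.0
b = np.zeros((20, 20)); b[10:14, 9:13] = 1.0
```

The translation invariance of the central moments is what makes these parameters robust to where the eye sits in the frame.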

Dataset and loss function
The proposed SemiESNet is implemented in the PyTorch framework and trained on an NVIDIA Quadro P5000. Two-fold cross-validation is used in the experiments: the labeled data are divided into two groups according to the subjects in the dataset, with one group used for training and the other for validation.
The experiments in this work are conducted on UbirisV2 [20]. Following the ORED-Net experimental setup [15], the pupil, iris, and sclera labels of 300 images are used as the labeled dataset. For the experiments on the semi-supervised incremental learning strategy, a total of 3414 images of 108 subjects in UbirisV2 [20] are used as the unlabeled dataset.
The loss function in this work is a combination of the cross-entropy loss (CEL) L_cel and the generalized Dice loss (GDL) L_gdl [24]. CEL maximizes the probability that each pixel of the image belongs to its class and is generally used when the class distribution is uniform. In the eye segmentation task the category distribution is uneven, so GDL is introduced to address the class-imbalance problem.
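A minimal sketch of such a combined objective in PyTorch; the equal weighting between the two terms is our assumption, since the paper does not state its exact trade-off here.

```python
import torch
import torch.nn.functional as F

def generalized_dice_loss(logits, target, eps=1e-6):
    """GDL [24]: per-class weights are inverse squared label volumes."""
    probs = torch.softmax(logits, dim=1)                         # (B, C, H, W)
    onehot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
    w = 1.0 / (onehot.sum(dim=(0, 2, 3)) ** 2 + eps)             # favors rare classes
    inter = (w * (probs * onehot).sum(dim=(0, 2, 3))).sum()
    union = (w * (probs + onehot).sum(dim=(0, 2, 3))).sum()
    return 1.0 - 2.0 * inter / (union + eps)

def total_loss(logits, target, alpha=1.0):
    # equal-weight combination; alpha is an assumption, not from the paper
    return F.cross_entropy(logits, target) + alpha * generalized_dice_loss(logits, target)

logits = torch.randn(2, 4, 8, 8)     # e.g. background / sclera / iris / pupil
target = torch.randint(0, 4, (2, 8, 8))
loss = total_loss(logits, target)
```

The inverse-squared-volume weights make small structures such as the pupil contribute on a par with large ones such as the background.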

Experimental results and analysis
To verify the effectiveness of the proposed network and the semi-supervised incremental learning method, comparison and ablation experiments are conducted on the UbirisV2 dataset.
In the comparison experiments, the proposed method is compared with SegNet [17], ScleraNet [18], ORED-Net [15], and ESPNet [23]. ESPNet is composed of ESP modules and also serves as our base structure. SegNet is the classical model used for multi-element eye semantic segmentation in the visible light environment. ScleraNet and ORED-Net are state-of-the-art networks for the visible-light multi-element eye semantic segmentation task on the UbirisV2 dataset. In addition, the parameter counts of these models are reported and analyzed.
Further, ablation experiments are conducted to verify the effectiveness of the proposed iterative semi-supervised incremental learning method. Considering that the model needs sufficient learning between filtering steps, pseudo-labels are filtered once every 5 epochs. For fairness, following ORED-Net, we use mIoU, Err, precision (P), recall (R), and F1 (F) as evaluation metrics.
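For reference, the mIoU metric can be computed as in the following sketch; the choice to skip classes absent from both maps is our assumption, as conventions vary.

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Per-class intersection-over-union, averaged over the classes present."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union:                         # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# toy usage on flattened label maps with 2 classes
pred = np.array([0, 0, 1, 1])
gt = np.array([0, 1, 1, 1])
score = mean_iou(pred, gt, 2)
```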

Comparison experiments with other methods
Table 1 reports the quantitative comparison on the UbirisV2 dataset; the key figures are summarized below.

Method           mIoU    Err    P       R       F       Params
SegNet [17]      …       1.24   84.93   94.95   88.82   >1MB
ScleraNet [18]   82.95   1.15   86.17   95.69   90.06   >1MB
ORED-Net [15]    85.12   1.07   88.12   95.49   91.18   >1MB
ESPNet [23]      85.05   0.96   …       …       91.36   …
Ours             86.18   0.89   …       …       92.13   …

Compared with ESPNet, our method improves mIoU by 1.13 percentage points and F by 0.77 percentage points, and reduces Err by 0.07 percentage points. In general, recall (R) and precision (P) trade off against each other, so their harmonic mean F1 (F) is used to judge whether the overall performance of the model improves. Compared with ORED-Net, mIoU and F improve by 1.06 and 0.95 percentage points respectively, and Err decreases by 0.18 percentage points. Although the recall is not optimal, probably due to the limited number of parameters, the introduction of the attention mechanism makes the method attend to effective features, which accounts for the increased precision. Overall, the proposed method outperforms the current best method while requiring fewer parameters.

Ablation experiments
This section presents ablation experiments that test the effectiveness of semi-supervised incremental learning and the Hu moment-based pseudo-label filtering method. Two-fold cross-validation was used, with the training and validation sets of each fold consistent with those of the comparison experiments. Table 2 reports the detailed results. Compared with the model without pseudo-labels, introducing the Hu moment-based pseudo-label filtering method improves mIoU, R, and F by 0.66, 0.88, and 0.43 percentage points respectively, and reduces Err by 0.03 percentage points. Although P decreases by 0.07 percentage points, the overall segmentation performance is still improved thanks to the large increase in recall. This illustrates the effectiveness of the proposed iterative learning strategy for eye semantic segmentation under visible light conditions.
Figure 9 shows two visualization examples, indicating that our semi-supervised pseudo-label method achieves better segmentation performance than the base model without the pseudo-label.

Conclusions
This paper proposes a lightweight semi-supervised eye semantic segmentation method based on iterative learning. First, supervised training is performed with a lightweight network on a small amount of labeled data. Then, based on the resulting pre-trained model, initial pseudo-labels are generated and filtered by the pseudo-label filtering module, and the selected pseudo-label data are mixed with the training data for incremental training. Finally, the above process is iterated until the stopping conditions are satisfied, yielding the final model. Comparative experiments on the Ubiris dataset demonstrate the superiority of the method, and the ablation results show that the iterative semi-supervised incremental learning strategy improves model performance.
(b) Pre-trained model generation and pseudo-label filtering. Based on the training set of the labeled dataset, we train the proposed network to obtain the pre-trained model. Pseudo-labels label_u of the unlabeled eye dataset data_u are then produced by forward inference with the pre-trained model. According to the data characteristics of the training set (data_l, label_l), the samples meeting specific conditions and their pseudo-labels are selected from the unlabeled eye dataset (data_u, label_u) and added to the pseudo-label dataset as a supplementary training set. (c) Iterative learning. The labeled training set (data_l, label_l) and the pseudo-labeled dataset (data_p, label_p) are mixed to update the proposed network. If the stopping condition is reached, training ends; otherwise, the process returns to (b).
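The iterative procedure in (a)-(c) can be summarized by the following skeleton, in which all callables are placeholders for the components described above and the stopping condition is simplified to a fixed round count.

```python
def iterative_training(labeled, unlabeled, train, infer, filter_fn, max_rounds=10):
    """train/infer/filter_fn stand in for network training, forward
    inference, and the Hu moment-based filtering described above."""
    model = train(labeled)                                       # (b) pre-trained model
    for _ in range(max_rounds):                                  # (c) iterate
        candidates = [(x, infer(model, x)) for x in unlabeled]   # pseudo-labels
        pseudo = filter_fn(labeled, candidates)                  # keep high-confidence pairs
        model = train(labeled + pseudo)                          # mix and retrain
    return model

# toy run with stubs: the "model" is just the size of its training set
model = iterative_training(
    labeled=[1, 2], unlabeled=[3, 4],
    train=len, infer=lambda m, x: x,
    filter_fn=lambda lab, cands: cands[:1], max_rounds=2,
)
```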

Figure 1. The pipeline of the semi-supervised eye semantic segmentation method based on iterative learning.

Figure 6. The pipeline of pseudo-label filtering.

Figure 8. Hu moment-based filter parameter extraction module. In the Hu moment-based data filter, the element-wise error e_c of a single pseudo-label sample is obtained from the standard parameters and the pseudo-label parameters.

Table 1. Quantitative comparative experiments on the UbirisV2 dataset. Parameters are labeled only for lightweight models, i.e., models smaller than 1 MB; larger models are labeled ">1MB". Bold indicates the optimal results.

Table 2. Ablation experiment results for semi-supervised incremental learning and the Hu moment-based pseudo-label filtering method. Ours (w/o pseudo-label) denotes the proposed network without the pseudo-label filtering strategy; Ours denotes the method with the Hu moment-based pseudo-label filtering strategy. Bold indicates the optimal results, and underline indicates the sub-optimal results.