
LUNet: deep learning for the segmentation of arterioles and venules in high resolution fundus images


Published 3 May 2024 © 2024 The Author(s). Published on behalf of Institute of Physics and Engineering in Medicine by IOP Publishing Ltd
Citation: Jonathan Fhima et al 2024 Physiol. Meas. 45 055002. DOI 10.1088/1361-6579/ad3d28


Abstract

Objective. This study aims to automate the segmentation of retinal arterioles and venules (A/V) from digital fundus images (DFI), as changes in the spatial distribution of retinal microvasculature are indicative of cardiovascular diseases, positioning the eyes as windows to cardiovascular health. Approach. We utilized active learning to create a new DFI dataset with 240 crowd-sourced manual A/V segmentations performed by 15 medical students and reviewed by an ophthalmologist. We then developed LUNet, a novel deep learning architecture optimized for high-resolution A/V segmentation. The LUNet model features a double dilated convolutional block to widen the receptive field and reduce parameter count, alongside a high-resolution tail to refine segmentation details. A custom loss function was designed to prioritize the continuity of blood vessel segmentation. Main Results. LUNet significantly outperformed three benchmark A/V segmentation algorithms both on a local test set and on four external test sets that simulated variations in ethnicity, comorbidities and annotators. Significance. The release of the new datasets and the LUNet model (www.aimlab-technion.com/lirot-ai) provides a valuable resource for the advancement of retinal microvasculature analysis. The improvements in A/V segmentation accuracy highlight LUNet's potential as a robust tool for diagnosing and understanding cardiovascular diseases through retinal imaging.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

The eye provides a direct view of the retinal microvasculature and thereby constitutes a unique non-invasive window to the cardiovascular system. Several studies have shown that abnormalities in the retinal microvasculature can reflect the cardiovascular health of a patient (Gunn 1892, Keith 1939, Scheie 1953, Sharrett et al 1999, Witt et al 2006). Digital fundus imaging uses a specialized low-power microscope with a fundus camera to capture high-resolution red-green-blue digital fundus images (DFI) of the eye's interior surface. Analysis of the retinal microvasculature using DFI may allow the study of cardiovascular diseases. However, manual segmentation of the vascular tree in a single DFI by an experienced annotator typically requires 1–2 h. This limits large-scale quantitative analysis of retinal vasculature changes in specific diseases, the discovery of disease-specific digital vasculature biomarkers, and the translation of such findings to clinical practice. Modern advances in computer vision have increased the precision of segmentation in various domains. Several studies focused on DFI-based blood vessel segmentation achieved accurate results, with dice scores in the range 80–89 (Tolias and Panas 1998, Hoover et al 2000, Walter and Klein 2001, Jiang and Mojon 2003, Staal et al 2004, Ronneberger et al 2015, Liskowski and Krawiec 2016, Dasgupta and Singh 2017, Guo et al 2021, Kamran et al 2021). However, the test sets in these studies were generally small, typically 10–20 DFIs, and the authors did not distinguish between arterioles and venules (A/V). Despite numerous studies illustrating the link between certain medical findings and vasculature biomarkers computed on arterioles and venules independently (Gunn 1892, Sharrett et al 1999, Witt et al 2006, Sabanayagam et al 2009, Hanssen et al 2011), retinal A/V segmentation remains a challenging task due to the modest size of public DFI datasets with reference A/V segmentations. For instance, the Digital Retinal Images for Vessel Extraction (DRIVE) dataset, the largest available, contains only 40 A/V-segmented DFIs (Staal et al 2004).

1.1. Prior works

Several attempts have been made to perform automatic A/V segmentation based on DFIs. Grisan and Ruggeri (2003) used proprietary data to extract vessels and classify A/V based on differences in luminosity and contrast in the DFIs, coupled with a vessel-tracking algorithm. The first use of convolutional neural networks for the A/V classification task was made by Welikala et al in 2017, who used 100 proprietary annotated DFIs from the UK Biobank dataset for training and the DRIVE (Staal et al 2004) dataset for testing (Welikala et al 2017). In 2018, Hemelings et al reported the first use of a fully convolutional network with an auto-encoder architecture, similar to a U-Net without skip connections, for the A/V segmentation task (Hemelings et al 2019). Two models were trained independently using the DRIVE (Staal et al 2004) and high-resolution fundus (HRF) (Budai et al 2013) datasets. In 2021, Hu et al developed VC-Net, another U-Net variant incorporating a vessel constraint module to improve the segmentation (Hu et al 2021). They trained their model on the DRIVE (Staal et al 2004), HRF (Budai et al 2013), and LES-AV (Orlando et al 2018) datasets and used Tongren and Kailuan, two private datasets, for external validation (Hu et al 2021). In 2022, Galdran et al (2022) used a concatenation of two U-Nets, named Little W-Net, to train two models for the A/V segmentation task; the first was trained on the DRIVE dataset (Staal et al 2004) and the second on the HRF dataset (Budai et al 2013). They used the LES-AV dataset (Orlando et al 2018) as external validation of their model trained on the DRIVE dataset. Finally, Zhou et al developed BFN, which uses three U-Net models trained within an adversarial framework: one artery segmenter, one vein segmenter, and one multi-class segmenter (Zhou et al 2021). They trained three independent BFNs on the DRIVE (Staal et al 2004), HRF (Budai et al 2013), and LES-AV (Orlando et al 2018) datasets before releasing a final BFN (Zhou et al 2022) trained on all three. They evaluated their final model on the IOSTAR (Abbasi-Sureshjani et al 2015, 2016) dataset.

1.2. Research gaps and objectives

While previous work has focused mainly on DRIVE, HRF and LES-AV, three public datasets that contain 40, 45 and 22 DFIs with manual reference segmentations, respectively, these datasets have different resolutions, framings, fields of view (FOV) and population samples. More specifically, the DRIVE dataset contains DFIs centered on the macula, with a FOV of 45° and a resolution of 584 × 565 pixels, acquired during a diabetic retinopathy screening in the Netherlands. The HRF dataset contains DFIs centered on the macula, with a resolution of 2336 × 3504 pixels, acquired in Germany from patients with glaucoma or diabetic retinopathy and from healthy individuals. LES-AV contains DFIs centered on the optic disc, with a FOV of 30° and a resolution of 1444 × 1620 pixels. Patient age distribution is available for the LES-AV dataset only. The heterogeneity of these datasets led most researchers to train an independent model for each dataset, and poor generalization performance was observed on external datasets. Thus, there is a need for a robust, high-performing A/V segmentation model that can generalize across external test sets with varying distribution shifts.

In this research, we focused on optic disc-centered DFIs with a FOV of 30° and a high resolution of 1444 × 1444 pixels. In addition, a new DFI dataset, named UZLF, consisting of 240 crowd-sourced A/V segmentations performed by 15 medical students and subsequently corrected by a senior annotator (JVE), was created. LUNet, a novel robust deep learning (DL) algorithm tailored to the A/V segmentation task, is introduced. The generalization performance of LUNet was evaluated using 30 newly manually segmented A/V DFIs from UNAF and INSPIRE-AVR (Niemeijer et al 2011, Benítez et al 2021), as well as on the publicly available LES-AV (Orlando et al 2018) and HRF (Budai et al 2013) datasets, which include reference A/V segmentations. LUNet was benchmarked against Little W-Net (Galdran et al 2022), VC-Net (Hu et al 2021) and BFN (Zhou et al 2021), three open-source state-of-the-art (SOTA) algorithms.

2. Methods

DFIs provided by the University Hospitals of Leuven (UZ) were manually segmented using Lirot.ai (Fhima et al 2022a). These segmentations were used to train LUNet. Figure 1 provides an overview of the experiments.


Figure 1. Overview of the experiments. DFIs were manually segmented using Lirot.ai (Fhima et al 2022a) and used to train LUNet, a novel DL model for automatic arteriole/venule segmentation.


2.1. Datasets

A total of five datasets were used in our experiments. The University Hospital UZ Leuven Fundus (UZLF) dataset was used for developing the models while the four other datasets were used as external test sets to evaluate the generalization performance of LUNet and benchmark algorithms. The datasets are summarized in table 1.

Table 1. Summary of the optic disc-centered DFI datasets with A/V segmentations, including the average percentage of unknown blood vessels per image. Age is reported as median (Q1–Q3).

Name | No. DFIs | Country | Age | Original FOV | Purpose | Unknown
UZLF | 240 | Belgium | 62 (50–73) | 30° | Train & test | 0.2%
UNAF | 15 | Paraguay | — | 45° | External test | 0%
INSPIRE-AVR | 15 | United States | — | 30° | External test | 0%
LES-AV | 20 | Belgium | 71 (62–80) | 30° | External test | 3%
HRF (centered version) | 28 | Germany | — | 60° | External test | 13.5%

2.1.1. University Hospital UZ Leuven Fundus (UZLF) dataset

Human data were obtained within the context of the study "Automatic glaucoma detection, a retrospective database analysis" (study number S60649). The Ethics Committee Research UZ/KU Leuven approved this study in November 2017 and waived the need for informed consent. A total of 115 237 optic disc-centered DFIs from 13 185 unique patients, captured between 2010 and 2019, were provided by the UZL in Belgium. These DFIs were taken with a Visucam Pro NM camera with a 30° FOV (Zeiss). The resolution of these DFIs was 1444 × 1444 pixels, which is higher than most public DFI datasets and enables the visualization of smaller blood vessels. The median (Q1–Q3) age was 64 (52–75) years, and 52% of the patients were female. The exclusion criteria were patients under 18 years of age and low-quality DFIs (FundusQ-Net score <6) (Abramovich et al 2023) (figure 2). Active learning was used to proactively select DFIs for manual segmentation. Specifically, among the extracted DFIs, an exploration-exploitation strategy was applied to select 240 DFIs for A/V segmentation. The exploration step consisted of randomly stratified sampling of DFIs to annotate according to the patient's sex and the imaged eye (right/left). The exploitation steps, or active learning steps, consisted of selecting sets of DFIs with low LUNet A/V segmentation performance, defined as a lack of vessel continuity. A subset of 240 DFIs, denoted UZLF, from 232 unique patients was selected and manually segmented. The patients included in UZLF were between 18 and 90 years of age (median (Q1–Q3): 62 (50–73) years) and 58% were female. Left-eye DFIs made up 57% of UZLF. Patients in the UZLF dataset were separated into the following classes: (1) normal ophthalmic findings, (2) normal tension glaucoma (NTG), (3) primary open angle glaucoma (POAG), and (4) other conditions.


Figure 2. UZLF dataset construction. Patients under 18 years of age and low-quality DFIs were excluded. Among the remaining DFIs, a total of 240 DFIs from 232 unique patients were selected by active learning and manually segmented.


2.1.2. Universidad Nacional de Asunción Fundus (UNAF) external test set

The dataset of fundus images for the study of diabetic retinopathy contains 757 DFIs of adult patients, acquired in the Department of Ophthalmology of the Hospital de Clínicas of San Lorenzo, Paraguay (Benítez et al 2021). The DFIs were acquired using a Visucam 500 camera with a 45° FOV (Zeiss). The eyes included in this study were classified as: (1) no signs of diabetic retinopathy, (2) mild or early non-proliferative diabetic retinopathy (NPDR), (3) moderate NPDR, (4) severe NPDR, (5) very severe NPDR, (6) proliferative diabetic retinopathy (PDR), and (7) advanced PDR. The resolution of the original DFIs is 2124 × 2056 pixels. In order to benchmark LUNet on this dataset, the DFIs were zero-padded to a square resolution of 2124 × 2124 pixels and then cropped to a 1444 × 1444 resolution. From the resulting DFIs, 15 optic disc-centered DFIs were randomly selected to form the UNAF external test set. The UNAF set included four eyes with no signs of diabetic retinopathy, two with moderate NPDR, seven with severe NPDR, and two with very severe NPDR. No additional clinical data were available for this dataset.
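
A minimal sketch of the pad-then-crop step described above, assuming NumPy/Pillow and a centered crop; the file name is hypothetical and the released code may position the crop around the optic disc instead.

```python
# Hedged sketch: zero-pad a DFI to a square and crop it to 1444 x 1444 pixels.
# The centered crop and the file name are assumptions, not the authors' code.
import numpy as np
from PIL import Image

def pad_and_center_crop(img: np.ndarray, crop_size: int = 1444) -> np.ndarray:
    """Zero-pad an H x W x 3 DFI to a square, then center-crop to crop_size."""
    h, w, _ = img.shape
    side = max(h, w)
    padded = np.zeros((side, side, 3), dtype=img.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    padded[top:top + h, left:left + w] = img          # image centered on a black square
    start = (side - crop_size) // 2
    return padded[start:start + crop_size, start:start + crop_size]

# Example with a hypothetical UNAF image (2056 x 2124 -> 2124 x 2124 -> 1444 x 1444)
dfi = np.asarray(Image.open("unaf_example.png").convert("RGB"))
roi = pad_and_center_crop(dfi, 1444)
```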

2.1.3. University of Iowa Hospitals and Clinics (INSPIRE-AVR) external test set

This dataset contains 65 DFIs acquired from patients with POAG at the University of Iowa Hospitals and Clinics. DFIs were acquired using a 30° Zeiss fundus camera (Niemeijer et al 2011). The images were centered on the optic disc. The original DFI resolution was 2392 × 2048 pixels. In order to benchmark LUNet on this dataset, the black borders of the DFIs were cropped to a square resolution of 2048 × 2048 pixels and the images were then resized to a 1444 × 1444 pixel resolution. From the resulting DFIs, 15 optic disc-centered DFIs were randomly selected to form the second external test set. No additional metadata were provided in the open source dataset.

2.1.4. LES-AV dataset

This dataset contains 22 optic disc-centered DFIs with A/V segmentations acquired at the UZL in Belgium (Orlando et al 2018). Of these, 21 were captured with a resolution of 1444 × 1620 pixels and a field of view of 30°. In order to benchmark LUNet on this dataset, the black borders of these 21 DFIs were cropped to a square resolution of 1444 × 1444 pixels, and the resulting images were used as an external test set. The DFI numbered 275 was excluded because it is also included in the UZLF dataset. This resulted in 20 images in LES-AV, including DFIs from 10 healthy participants, 6 NTG patients and 4 POAG patients. Among the patients, 45% were women and the median (Q1–Q3) age was 71 (62–80) years.

2.1.5. HRF dataset

This dataset contains 45 DFIs with A/V segmentations that are not centered on the optic disc (Budai et al 2013). The original DFI resolution was 3504 × 2336 pixels with a field of view of 60°. For benchmarking LUNet on the HRF dataset, we extracted a region of interest of 1444 × 1444 pixels centered around the optic disc whenever feasible, which was the case for 28 of the 45 available images. These 28 DFIs were used as an additional external test set. The eyes included in this study were classified as: (1) healthy, (2) glaucoma, and (3) diabetic retinopathy. No additional clinical data were available for this dataset.

2.2. Reference segmentations

The INSPIRE-AVR, UZLF and UNAF datasets were manually segmented by the retinal experts of the UZ Leuven Hospital using the Lirot.ai app developed by Fhima et al (2022a) and following the protocol described in Fhima et al (2022b).

A total of sixteen annotators experienced in microvascular research worked between July 2021 and January 2023 to build the UZLF dataset. Annotators were divided into two groups: (1) one experienced senior annotator, an ophthalmology resident with a PhD in retinal vascular biomarkers, and (2) 15 junior annotators, all graduate students in medicine who had completed a research internship of >1 month in the Research Group of Ophthalmology and were trained by the senior annotator. The UNAF and INSPIRE-AVR datasets were annotated by the senior annotator. For the UZLF dataset, the junior annotators performed the first segmentation of the 240 DFIs, of which 174 were later corrected by the senior annotator to avoid mistakes and to correct for different annotation styles.

2.3. Data preparation

A DFI was modeled by a pair (X, y), where:

  • $X \in \mathbb{R}^{w \times h \times 3}$ represents a DFI with a width of w pixels, a height of h pixels and 3 channels, i.e. red, green and blue;
  • $y \in \mathbb{R}^{w \times h \times 3}$, where $y = [y_a, y_v, y_{ukn}]$ with $y_a, y_v, y_{ukn} \in \mathbb{R}^{w \times h \times 1}$, i.e. $y_a$, $y_v$ and $y_{ukn}$ are binary images in which $y_a$ contains the segmented arterioles, $y_v$ the segmented venules and $y_{ukn}$ the blood vessels that could not be distinguished, with white pixels where part of a blood vessel and black pixels otherwise;

and where h = w = 1444.
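
A minimal sketch of how such a (X, y) pair could be assembled from a DFI and its three binary masks; the file names and mask encoding are assumptions for illustration.

```python
# Hedged sketch of the (X, y) representation: X is the RGB DFI, y stacks the
# arteriole, venule and unknown-vessel binary masks. File names are hypothetical.
import numpy as np
from PIL import Image

def load_pair(dfi_path: str, a_path: str, v_path: str, ukn_path: str):
    X = np.asarray(Image.open(dfi_path).convert("RGB"), dtype=np.float32) / 255.0  # h x w x 3
    masks = [np.asarray(Image.open(p).convert("L")) > 0 for p in (a_path, v_path, ukn_path)]
    y = np.stack(masks, axis=-1).astype(np.float32)   # h x w x 3: [y_a, y_v, y_ukn]
    assert X.shape[:2] == y.shape[:2] == (1444, 1444) # UZLF resolution
    return X, y
```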

2.3.1. Train-validation-test set elaboration

The test set was constructed by randomly selecting 14 patients from each of the POAG, NTG and normal subgroups among the 174 expert-reviewed segmentations, with 50% women in each subgroup. This resulted in a test set of 50 DFIs from 42 patients. In addition, 6 DFIs randomly selected from patients with other ophthalmic diseases were manually segmented and added to the test set to evaluate the generalization performance of LUNet on other diseases. The remaining 184 images were split into 85% for training and 15% for validation; the split was stratified by patient, thus ensuring no information leakage. A sketch of such a patient-level split is given below.
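
A sketch of a patient-level split so that all DFIs of a given patient fall on the same side of the split; the use of scikit-learn's GroupShuffleSplit and the variable names are illustrative, not the authors' implementation.

```python
# Hedged sketch of an 85% / 15% train-validation split grouped by patient,
# which prevents information leakage between the two subsets.
from sklearn.model_selection import GroupShuffleSplit

def split_by_patient(image_ids, patient_ids, seed=0):
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=seed)
    train_idx, val_idx = next(splitter.split(image_ids, groups=patient_ids))
    return [image_ids[i] for i in train_idx], [image_ids[i] for i in val_idx]
```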

2.3.2. Preprocessing

LUNet is based on convolutions with 'same' padding and 2 × 2 max-pooling. LUNet uses six levels of depth, which means that the height and width of the input are divided by 2 six times. Thus, LUNet's input size needs to be divisible by $2^6 = 64$. To meet this requirement, zero-padding is applied to the inputs and outputs to obtain an input tensor of size 1472 × 1472 × 3 and an output tensor of size 1472 × 1472 × 2. Furthermore, the input and output images were normalized.
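
A minimal sketch of this zero-padding step (symmetric padding is an assumption): 1444 is padded up to 1472, the nearest multiple of 64, so that six pooling stages divide the spatial size evenly.

```python
# Hedged sketch: pad the spatial dimensions up to the nearest multiple of 2^6 = 64.
import numpy as np

def pad_to_multiple(x: np.ndarray, multiple: int = 64) -> np.ndarray:
    h, w = x.shape[:2]
    H = int(np.ceil(h / multiple)) * multiple
    W = int(np.ceil(w / multiple)) * multiple
    pad_h, pad_w = H - h, W - w
    pads = ((pad_h // 2, pad_h - pad_h // 2), (pad_w // 2, pad_w - pad_w // 2))
    pads += ((0, 0),) * (x.ndim - 2)                  # leave channel axes untouched
    return np.pad(x, pads)

x = np.zeros((1444, 1444, 3), dtype=np.float32)
print(pad_to_multiple(x).shape)                       # (1472, 1472, 3)
```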

2.4. DL architecture

The attention U-Net architecture was used as the backbone for our DL algorithm (Oktay et al 2018). The A/V segmentation task is challenging due to the long-range dependencies of the blood vessels and their small width. To achieve accurate A/V segmentation, a model must capture both local and global dependencies: local dependencies help detect the smaller blood vessels more accurately, while global dependencies enable the reconstruction of the full blood vessel tree. In convolutional neural networks, local dependencies are captured by the convolution operation, while longer-range dependencies depend on the receptive field of the network. The receptive field can be increased by increasing the depth of the model or the kernel size of the convolutional layers, by using dilated convolutions, or by using max-pooling layers. The first two solutions increase the number of model parameters for a small gain in receptive field, while max pooling adds no parameters and yields a much larger receptive field. Nevertheless, because blood vessels are small relative to a typical DFI, max pooling can be an excessively aggressive strategy and result in the loss of small-blood-vessel visibility after several iterations. To tackle this problem, LUNet was designed to include several improvements: (1) a new double dilated convolution block with an increased receptive field, (2) a long tail that operates at the full pixel resolution, (3) an increased depth, and (4) an over-representation of the features extracted by the encoder compared to those reconstructed by the decoder at each level of depth of the LUNet auto-encoder. The LUNet architecture is shown in figure 3.
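
To make the parameter/receptive-field trade-off concrete, the short check below (illustrative only, not the authors' code; the channel count of 64 is arbitrary) compares a plain 7 × 7 convolution, a dilated 7 × 7 convolution and a 13 × 13 convolution in PyTorch.

```python
# Illustration of the trade-off discussed above: a dilated 7x7 convolution covers a
# 13x13 receptive field with the same parameter count as a plain 7x7 convolution,
# whereas a 13x13 kernel needs far more parameters.
import torch.nn as nn

n_params = lambda m: sum(p.numel() for p in m.parameters())
plain7   = nn.Conv2d(64, 64, kernel_size=7, padding=3)
dilated7 = nn.Conv2d(64, 64, kernel_size=7, padding=6, dilation=2)   # effective 13x13 field
plain13  = nn.Conv2d(64, 64, kernel_size=13, padding=6)

print(n_params(plain7), n_params(dilated7), n_params(plain13))
# 200768 200768 692288  (dilated conv: larger field, no extra parameters)
```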


Figure 3. LUNet architecture, based on an attention U-Net backbone with double dilated convolution blocks, an increased depth, a long tail, and an over-representation of the features extracted by the encoder compared to those extracted by the decoder at each level of depth.


2.4.1. Double dilated convolution block

Dilated convolution enables a larger receptive field but leads to a less accurate feature representation of small details (due to the dilation rate). To tackle this limitation, both classical and dilated convolutions with a kernel size of 7 were used in the model. The double dilated convolution block also incorporates spatial dropout 2D regularization (Tompson et al 2015) and a batch normalization layer with the ReLU activation function. A diagram of the double dilated convolution block is presented in figure 4.


Figure 4. LUNet double dilated convolution block with parameter K.

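A hedged PyTorch sketch of a double dilated convolution block with K filters: the paper specifies 7 × 7 kernels, batch normalization, ReLU and spatial dropout, but the parallel classical/dilated wiring, the concatenation of the two branches, the dilation rate of 2 and the dropout rate used here are assumptions, not the released architecture.

```python
# Hedged sketch of a "double dilated convolution" block (parameter K = number of
# filters per branch). The exact wiring is an assumption; see figure 4 for the
# authors' diagram.
import torch
import torch.nn as nn

class DoubleDilatedConvBlock(nn.Module):
    def __init__(self, in_channels: int, k: int, dilation: int = 2, p_drop: float = 0.1):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, k, kernel_size=7, padding=3)
        self.dil_conv = nn.Conv2d(in_channels, k, kernel_size=7,
                                  padding=3 * dilation, dilation=dilation)
        self.bn = nn.BatchNorm2d(2 * k)
        self.act = nn.ReLU(inplace=True)
        self.drop = nn.Dropout2d(p_drop)          # spatial dropout over whole channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.cat([self.conv(x), self.dil_conv(x)], dim=1)
        return self.drop(self.act(self.bn(out)))

block = DoubleDilatedConvBlock(in_channels=3, k=16)
print(block(torch.randn(1, 3, 64, 64)).shape)     # torch.Size([1, 32, 64, 64])
```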

2.4.2. Long tail

Classical convolutional neural network architectures use max pooling to increase the receptive field and capture longer-range dependencies. Nevertheless, some blood vessels may be too small to remain detectable after one or more max-pooling operations. To tackle this limitation, four double dilated convolution blocks were added at the end of LUNet to: (1) perform several computations at the full image resolution, increasing the possibility of detecting small blood vessels that may not be detectable at lower resolutions; and (2) increase the receptive field of the full-resolution features. Overall, the long tail helps to refine the segmentation.

2.5. Loss

Other approaches have described the A/V segmentation problem as a multiclass segmentation problem, leading to a multiclass cross-entropy loss minimization (Hu et al 2021, Hemelings et al 2019). We formulated the problem differently, as a binary multi-label segmentation, because it is common to see superimposition of arterioles and venules on several pixels of a DFI. Furthermore, some manual segmentations may contain pixels labeled as unknown. These pixels were not penalized for being classified as A or V. However, LUNet was still trained to detect these pixels and eventually classify them. A regularization term was added to the loss function to learn to detect blood vessels without A/V distinction. Accordingly, the LUNet loss function, $L_{\rm LUNet}$, was defined as the sum of three losses computed independently on the arterioles, the venules, and the overall blood vessels

Equation (1)

with

Equation (2)

where y is the ground truth segmentation, $\hat{y}$ is the predicted probability map, $L_{\rm BCE}$ is the binary cross-entropy loss, $L_{\rm dice}$ is the dice loss, $L_{\rm clDice}$ is the centerline dice loss (Shit et al 2021), and where the gradient of $\hat{y}$ is minimized to favor the continuity of the blood vessels by constraining the change of value in the probability map to be smoother.
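
A hedged PyTorch sketch in the spirit of this composite loss: it combines binary cross-entropy, a soft Dice term and a penalty on the spatial gradient of the probability map, and builds the merged blood-vessel map by clamping the sum of the two channels. The centerline dice term of Shit et al (2021) is omitted for brevity, and the per-term weighting and merged-map construction are assumptions, not the authors' released implementation.

```python
# Hedged sketch of a composite segmentation loss: BCE + soft Dice + a smoothness
# penalty on the gradient of the predicted probability map (vessel continuity).
import torch
import torch.nn.functional as F

def soft_dice_loss(y_hat: torch.Tensor, y: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    inter = (y_hat * y).sum(dim=(-2, -1))
    union = y_hat.sum(dim=(-2, -1)) + y.sum(dim=(-2, -1))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

def gradient_penalty(y_hat: torch.Tensor) -> torch.Tensor:
    # Mean absolute spatial gradient of the probability map (smoothness prior).
    dy = (y_hat[..., 1:, :] - y_hat[..., :-1, :]).abs().mean()
    dx = (y_hat[..., :, 1:] - y_hat[..., :, :-1]).abs().mean()
    return dx + dy

def vessel_loss(y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Per-class loss L(y, y_hat); y_hat are probabilities in [0, 1]."""
    return (F.binary_cross_entropy(y_hat, y)
            + soft_dice_loss(y_hat, y)
            + gradient_penalty(y_hat))

def lunet_style_loss(a_hat, v_hat, y_a, y_v, l1=1.0, l2=1.0, l3=0.3):
    """Weighted sum over arterioles, venules and the merged blood-vessel map."""
    bv = torch.clamp(y_a + y_v, max=1.0)              # assumption: union of A and V labels
    bv_hat = torch.clamp(a_hat + v_hat, max=1.0)
    return l1 * vessel_loss(a_hat, y_a) + l2 * vessel_loss(v_hat, y_v) + l3 * vessel_loss(bv_hat, bv)
```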

2.6. Training protocol

LUNet hyperparameters were manually fine-tuned using the validation set. LUNet was trained for up to 1300 epochs, retaining the model that obtained the minimum validation loss. A batch size of 8 and an Adam optimizer with a learning rate of 1e−4 were used. The $L_{\rm LUNet}$ loss function was used with λ1 = λ2 = 1 and λ3 = 0.3.
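
A sketch of this optimization setup (Adam, learning rate 1e−4, up to 1300 epochs, keeping the checkpoint with the lowest validation loss); the model, data loaders, loss function and checkpoint file name are placeholders, not the authors' training script.

```python
# Hedged sketch of the training loop described above.
import math
import torch

def train(model, train_loader, val_loader, loss_fn, epochs=1300, device="cuda"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    best_val = math.inf
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:                       # batches of size 8
            optimizer.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(x.to(device)), y.to(device)).item()
                      for x, y in val_loader) / len(val_loader)
        if val < best_val:                              # retain the best checkpoint
            best_val = val
            torch.save(model.state_dict(), "lunet_best.pt")
```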

For each training DFI loaded, random online data augmentation was performed with (1) horizontal and vertical flips, (2) transposition, (3) rescaling of the input and output to a lower resolution uniformly sampled between 800 × 800 and 1472 × 1472, and (4) color jittering. Test-time data augmentation was used for the test set predictions (Wang et al 2018) with rotation angles of [0, 30, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330] degrees, as well as transposition of the DFI, which resulted in 24 predicted segmentations for a single original DFI. The inverse transform was applied to each prediction. The final predicted probability map for the segmentation was the pixel-wise average of the 24 segmentations.
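
A hedged sketch of this test-time augmentation: the DFI is rotated by 12 angles, with and without transposition, each view is segmented, the inverse transform is applied and the 24 probability maps are averaged. `predict` stands for a single forward pass of the trained model; the interpolation settings are assumptions.

```python
# Hedged sketch of test-time augmentation with rotations and transposition.
import numpy as np
from scipy.ndimage import rotate

def tta_predict(predict, image: np.ndarray) -> np.ndarray:
    angles = range(0, 360, 30)
    maps = []
    for transpose in (False, True):
        view0 = image.transpose(1, 0, 2) if transpose else image
        for angle in angles:
            view = rotate(view0, angle, axes=(0, 1), reshape=False, order=1)
            prob = predict(view)                                   # H x W x C probability map
            prob = rotate(prob, -angle, axes=(0, 1), reshape=False, order=1)
            maps.append(prob.transpose(1, 0, 2) if transpose else prob)
    return np.mean(maps, axis=0)                                   # pixel-wise average of 24 maps
```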

2.7. Benchmark

Benchmark against SOTA algorithms: LUNet's performance on the UZLF dataset was compared to that of Little W-Net (Galdran et al 2022), BFN (Zhou et al 2021) and VC-Net (Hu et al 2021), three SOTA algorithms that were trained on the UZLF-train dataset. On the external datasets, LUNet was compared to Little W-Net (Galdran et al 2022), VC-Net (Hu et al 2021) and BFN (Zhou et al 2021) trained on the UZLF dataset, as well as to the publicly trained versions, Little W-Net* (Galdran et al 2022) and BFN* (Zhou et al 2022).

Benchmark against junior performance: the DFIs of the UZLF-test dataset were annotated by a junior annotator and then reviewed by the senior annotator. This enables a comparison of the dice score between an individual junior annotator and the senior annotator, allowing LUNet to be benchmarked against the performance of a human junior annotator.

2.8. Performance measures

A dice score was computed for the arteriole segmentation (dice_a) and for the venule segmentation (dice_v). To estimate the 95% confidence interval (CI), a bootstrap method was employed by repeatedly sampling 80% of the test set with replacement and computing the mean score of each sample. This procedure was carried out 1000 times, and the resulting distribution of means was used to determine the lower and upper bounds of the 95% CI. To monitor the performance of LUNet with respect to the number of annotated DFIs, two learning curves, one for dice_a and one for dice_v, were computed; the reported performance was computed on the UZLF-test set as the training set size increased. For the LES-AV and HRF datasets, some of the ground truth blood vessels were annotated as unknown; these unknown pixels were not taken into account in the computation of the two dice scores. Furthermore, due to the imbalanced nature of blood vessels in DFIs, we also report the Matthews correlation coefficient (Chicco and Jurman 2020) for LUNet and the benchmark models on both the local and external test sets.
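
A sketch of this evaluation protocol: per-image Dice scores for one class, followed by a bootstrap 95% CI obtained by resampling 80% of the test set with replacement 1000 times and taking percentiles of the resulting means. Function and variable names are illustrative.

```python
# Hedged sketch of the Dice score and bootstrap confidence interval computation.
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    return (2.0 * np.logical_and(pred, gt).sum() + eps) / (pred.sum() + gt.sum() + eps)

def bootstrap_ci(scores, n_boot=1000, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n = max(1, int(frac * len(scores)))
    means = [np.mean(rng.choice(scores, size=n, replace=True)) for _ in range(n_boot)]
    return np.percentile(means, 2.5), np.percentile(means, 97.5)
```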

Finally, LUNet was evaluated on the local test set with respect to its ability to correctly estimate vasculature biomarkers (VBMs). VBMs were computed using the open source PVBM toolbox (https://pvbm.readthedocs.io/) (Fhima et al 2022b). The VBMs include: the area (AREA), the tortuosity index (TI), the median arc-chord tortuosity (TOR), the length (LEN), the median branching angle (BA), the number of blood vessels that intersect with the optic disc (START), the number of endpoints (END), the number of intersection points (INTER), and the fractal dimensions (D0, D1 and D2). Each biomarker was computed independently on the arterioles and the venules, and the Pearson correlation between the ground truth and the estimated VBMs was computed. The average Pearson correlation across all VBMs was also computed to provide an overall performance measure.

3. Results

3.1. Performance on the UZLF-test

LUNet achieved dice scores of 81.99/84.54 for A/V segmentation on the local test set. These were higher than those of the SOTA benchmark algorithms Little W-Net (Galdran et al 2022), VC-Net (Hu et al 2021) and BFN (Zhou et al 2021), which achieved dice scores of 79.76/82.62, 78.31/81.89 and 78.14/80.79, respectively. The performance of LUNet was better for A (81.99 versus 81.50) and for V (84.54 versus 83.74) than that of an average junior annotator. Table 2 and figure 5 summarize the quantitative results on the UZLF-test. Figure 6 shows examples of segmentations performed by LUNet, a junior annotator and an expert annotator, alongside those from the top-performing generalizable benchmark algorithm, VC-Net (Hu et al 2021). Specifically, the DFIs with the best, intermediate and worst LUNet performance were selected. Furthermore, LUNet outperforms Little W-Net, VC-Net and BFN in estimating most VBMs (table 3). LUNet obtained the highest Pearson correlation for 8 out of 22 VBMs and the second highest for 12 out of 22 VBMs. Meanwhile, the junior annotators had the best performance for 9 VBMs and the second best for 2 VBMs. The learning curves of LUNet are shown in figure 7.


Figure 5. Performance of models and humans for A/V segmentation (dice_a and dice_v) on the UZLF-test. The results for LUNet and the benchmark SOTA algorithms Little W-Net (Galdran et al 2022), BFN (Zhou et al 2021) and VC-Net (Hu et al 2021) are shown. Performance results of junior annotators who segmented at least 5 DFIs of the UZLF-test are also shown; for these annotators, the marker size is proportional to the number of DFIs they annotated in the UZLF-test. LUNet performs better than the other DL algorithms and has comparable performance to an average junior annotator.


Figure 6. Examples of segmentations on the UZLF dataset, performed by the junior annotator, the expert annotator, LUNet and VC-Net (Hu et al 2021). A1/A2: DFI with the best LUNet dice scores, B1/B2: DFI with intermediate LUNet dice scores, C1/C2: DFI with the worst LUNet dice scores. Arterioles are represented in red, venules in blue and unknown blood vessels in green.


Figure 7. (A): LUNet's learning curves. Dice scores are shown for the UZLF-test set. The junior dice scores correspond to the average dice score of the junior annotators for the test set.


Table 2. Quantitative A/V segmentation results on the local test set (UZLF-test), including the average performance and 95% CI for dice_a and dice_v. The table includes the performance of the junior annotators, LUNet, Little W-Net (Galdran et al 2022), BFN (Zhou et al 2021) and VC-Net (Hu et al 2021) trained on the UZLF-train dataset and evaluated on the UZLF-test dataset.

Model | dice_a | dice_v
BFN (Zhou et al 2021) | 78.14 (76.31–79.66) | 80.79 (79.49–81.90)
VC-Net (Hu et al 2021) | 78.31 (76.79–79.74) | 81.89 (80.95–82.93)
Little W-Net (Galdran et al 2022) | 79.76 (78.21–81.02) | 82.62 (81.61–83.54)
Junior | 81.50 (78.00–84.41) | 83.74 (80.91–86.28)
LUNet (our work) | 81.99 (80.63–83.11) | 84.54 (83.59–85.42)

3.2. Generalization performance

Figure 8 presents the segmentations produced by LUNet alongside those from the top-performing generalizable benchmark algorithm, VC-Net (Hu et al 2021), as well as the ground truth, for one example DFI from each of the external test sets. Table 4 summarizes the quantitative results on the external test sets.


Figure 8. Examples of segmentations on the external datasets. Segmentations performed by the expert annotator, LUNet and VC-Net (Hu et al 2021) trained on the UZLF dataset. A1/A2: DFI in the INSPIRE-AVR dataset, B1/B2: DFI in the UNAF dataset, C1/C2: DFI in the LES-AV dataset and D1/D2: DFI in the cropped HRF dataset. Arterioles are represented in red, venules in blue and unknown blood vessels in green.


Table 3. Pearson correlation between the ground truth and estimated VBM based on a given segmentation algorithm. VBMs are computed independently for arterioles (a) and for venules (v). For a given VBM, the best performance is bold and the second best is underlined.

VBM | LUNet | VC-Net | Little W-Net | BFN | Junior
AREA_a | 0.823 | 0.837 | 0.845 | 0.833 | 0.744
TI_a | 0.922 | 0.888 | 0.888 | 0.932 | 0.935
TOR_a | 0.796 | 0.713 | 0.792 | 0.799 | 0.752
LEN_a | 0.811 | 0.770 | 66.22 | 0.757 | 0.796
BA_a | 0.711 | 0.512 | 0.420 | 0.640 | 0.798
START_a | 0.685 | 0.561 | 0.535 | 0.581 | 0.677
END_a | 0.806 | 0.690 | 0.643 | 0.731 | 0.869
INTER_a | 0.811 | 0.644 | 0.752 | 0.732 | 0.887
D0_a | 0.867 | 0.850 | 0.881 | 0.863 | 0.805
D1_a | 0.875 | 0.867 | 0.871 | 0.856 | 0.810
D2_a | 0.867 | 0.865 | 0.844 | 0.856 | 0.808
AREA_v | 0.883 | 0.878 | 0.855 | 0.869 | 0.773
TI_v | 0.884 | 0.812 | 0.877 | 0.888 | 0.847
TOR_v | 0.861 | 0.840 | 0.844 | 0.803 | 0.761
LEN_v | 0.771 | 0.557 | 0.520 | 0.758 | 0.822
BA_v | 0.370 | 0.245 | 0.336 | 0.442 | 0.480
START_v | 0.671 | 0.486 | 0.427 | 0.442 | 0.741
END_v | 0.785 | 0.728 | 0.620 | 0.731 | 0.882
INTER_v | 0.731 | 0.627 | 0.645 | 0.662 | 0.838
D0_v | 0.864 | 0.822 | 0.769 | 0.871 | 0.783
D1_v | 0.862 | 0.816 | 0.755 | 0.813 | 0.730
D2_v | 0.848 | 0.805 | 0.740 | 0.794 | 0.700

Table 4. Performance in A/V segmentation for the external test sets, including the average performance and 95% CI for dice_a and dice_v. The table includes LUNet, Little W-Net (Galdran et al 2022), BFN (Zhou et al 2021) and VC-Net (Hu et al 2021) trained on the UZLF-train dataset. It also includes the performance of the publicly available versions Little W-Net* (Galdran et al 2022) and BFN* (Zhou et al 2022), i.e. pretrained by the original authors. Results for BFN* are not provided for LES-AV because the model was originally trained on this dataset.

Model | LES-AV dice_a | LES-AV dice_v | UNAF dice_a | UNAF dice_v | INSPIRE-AVR dice_a | INSPIRE-AVR dice_v | Cropped HRF dice_a | Cropped HRF dice_v
BFN* (Zhou et al 2021) | — | — | 64.13 (59.93–68.07) | 72.78 (70.24–74.98) | 62.98 (57.41–66.88) | 68.10 (63.06–71.30) | 70.95 (68.73–73.26) | 75.47 (73.75–77.12)
Little W-Net* (Galdran et al 2022) | 65.32 (60.12–69.45) | 70.27 (65.74–73.76) | 52.58 (45.97–59.43) | 51.32 (41.33–59.69) | 58.08 (55.50–60.87) | 66.37 (64.10–69.01) | 57.93 (53.72–61.86) | 61.42 (55.71–65.73)
BFN (Zhou et al 2021) | 78.68 (76.34–80.98) | 80.34 (78.85–81.75) | 68.01 (63.45–71.92) | 72.67 (69.67–75.47) | 68.61 (66.23–70.84) | 71.92 (67.74–75.04) | 72.90 (70.52–75.37) | 76.99 (75.74–78.38)
Little W-Net (Galdran et al 2022) | 80.19 (77.90–82.32) | 82.97 (81.63–84.23) | 66.71 (61.72–70.96) | 68.61 (64.16–72.76) | 70.56 (68.27–72.88) | 74.05 (70.99–76.37) | 59.01 (55.74–62.01) | 47.44 (41.94–53.10)
VC-Net (Hu et al 2021) | 78.75 (76.19–81.11) | 82.49 (80.82–84.17) | 68.71 (64.91–71.86) | 74.43 (72.72–76.19) | 71.28 (69.50–72.93) | 75.34 (74.13–76.60) | 75.78 (73.77–77.55) | 79.99 (79.03–81.04)
LUNet (our work) | 82.30 (80.26–84.43) | 84.75 (83.02–86.28) | 73.31 (70.69–76.41) | 79.04 (77.84–80.39) | 73.58 (70.94–75.69) | 77.53 (74.99–79.84) | 78.12 (76.10–80.11) | 80.39 (79.20–81.72)

LUNet achieved average dice scores of 82.30/84.75 for A/V segmentation on the LES-AV dataset. This performance was superior to that of the SOTA benchmark algorithms Little W-Net (Galdran et al 2022), VC-Net (Hu et al 2021) and BFN (Zhou et al 2021), which achieved dice scores of 80.19/82.97, 78.75/82.49 and 78.68/80.34, respectively. Its performance was also higher than that of the public version of Little W-Net* (Galdran et al 2022), which achieved A/V dice scores of 65.32/70.27. The public version of BFN* (Zhou et al 2022) was not evaluated on LES-AV because it was originally trained on it. LUNet achieved dice scores of 73.31/79.04 for A/V segmentation on the UNAF dataset. This performance was superior to that of the SOTA benchmark algorithms Little W-Net (Galdran et al 2022), VC-Net (Hu et al 2021) and BFN (Zhou et al 2021), which achieved dice scores of 66.71/68.61, 68.71/74.43 and 68.01/72.67, respectively. Its performance was also higher than that of the public versions of Little W-Net* (Galdran et al 2022) and BFN* (Zhou et al 2022), which achieved A/V dice scores of 52.58/51.32 and 64.13/72.78, respectively. LUNet achieved dice scores of 73.58/77.53 for A/V segmentation on the INSPIRE-AVR dataset. These were higher than those of the SOTA benchmark algorithms Little W-Net (Galdran et al 2022), VC-Net (Hu et al 2021) and BFN (Zhou et al 2021), which achieved dice scores of 68.61/70.56, 71.28/75.34 and 68.61/71.92, respectively. They were also higher than those of the public versions of Little W-Net* (Galdran et al 2022) and BFN* (Zhou et al 2022), which achieved A/V dice scores of 58.08/66.37 and 62.98/68.10, respectively. LUNet achieved dice scores of 78.12/80.39 for A/V segmentation on the cropped HRF dataset. This performance was superior to that of the SOTA benchmark algorithms Little W-Net (Galdran et al 2022), VC-Net (Hu et al 2021) and BFN (Zhou et al 2021), which achieved dice scores of 59.01/47.44, 72.90/76.99 and 75.78/79.99, respectively. Its performance was also higher than that of the public versions of Little W-Net* (Galdran et al 2022) and BFN* (Zhou et al 2022), which achieved A/V dice scores of 57.93/61.42 and 70.95/75.47, respectively. Performance in terms of the Matthews correlation coefficient is reported in table A1 and is consistent with our observations in terms of dice score.

3.3. Ablation study

An ablation study was conducted on the validation set by removing the following architecture components: LT (long tail), CL (custom loss) and DDCB (double dilated convolution block) (figure 9). This ablation study demonstrates the importance of the individual components.


Figure 9. Ablation study. Performance is reported for arterioles (blue) and venules (red) on the validation set. Numbers of parameters are expressed in millions.


4. Discussion and future work

The first main contribution of this work is the creation of UZLF, a new dataset of high-resolution DFIs with A/V segmentations, which is six times larger than existing open datasets. Despite having access to a large DFI dataset, the considerable manual effort required for accurate A/V segmentation led us to annotate only a subset of those DFIs. This was done using Lirot.ai, with an active learning pipeline, to optimize the quality-time trade-off in developing the UZLF dataset. Additionally, 30 DFIs from the UNAF and INSPIRE-AVR public datasets were A/V segmented. The new DFI dataset with reference A/V segmentations is made open access.

Our second main contribution is the development of a novel and robust DL model, denoted LUNet, for the automated segmentation of A/V in optic disc-centered, high-resolution DFIs. LUNet outperformed two open source SOTA algorithms, namely VC-Net (Hu et al 2021) and BFN (Zhou et al 2021, 2022). LUNet's performance (81.99/84.54 for A/V) was comparable to that of an average trained junior annotator (81.50/83.74 for A/V) on the local test set. There was no significant difference in performance on the local test set for male versus female DFIs (A/V p-values of 0.83/0.34), left versus right eye DFIs (A/V p-values of 0.55/0.62) or POAG versus non-POAG DFIs (A/V p-values of 0.68/0.59). LUNet consistently generalized better than the benchmarked algorithms on all four external test sets. However, performance dropped on the external test sets compared to the local test set. Several factors can affect the rendering of a DFI and thus the performance of LUNet on an external test set. One factor that can contribute to the drop in performance is the lower quality of the DFIs: FundusQ-Net (Abramovich et al 2023) indicated a lower quality for the DFIs from the UNAF dataset, with a median (Q1–Q3) score of 7.24 (6.74–7.60), whereas the quality observed for the UZLF test set was 7.86 (7.51–8.38). Additionally, the original framing of the images in the UNAF dataset differs from that of the UZLF dataset, which can affect model performance. The quality of the INSPIRE-AVR dataset was slightly higher than that of UZLF, with a median (Q1–Q3) FundusQ-Net score of 8.30 (7.77–8.45), and thus cannot explain the drop in performance observed for this external test set. Another possible reason is the difference in the populations studied in the UNAF and INSPIRE-AVR datasets compared to the UZLF dataset. The UNAF dataset includes a larger number of patients with diabetic retinopathy, which can lead to distribution shifts related to the presence of hemorrhages and/or exudates. Furthermore, the UNAF and INSPIRE-AVR datasets come from different countries: Paraguay for UNAF and the United States for INSPIRE-AVR. Ethnic differences may also affect the performance of the model, as different ethnicities are associated with variations in the retinal vasculature and in overall retinal structure and pigmentation, to which the model has to adapt (Li et al 2013). The acquisition protocol used to capture the images can also play an important role in the performance drop: factors such as the amount of light in the room, the quality of the equipment used, and the skill of the operator capturing the image can all affect the appearance of the images and, thus, the model's generalization performance. Overall, the drop in performance on the UNAF dataset may come from a different original framing, a lower quality of the DFIs and a different population sample in terms of pathology and ethnicity.

There are some limitations to this research and opportunities for improvement. Although LUNet performed better than other DL models on the external datasets, there was a significant drop in performance compared to the local test set. This drop in performance could be reduced by training LUNet on a more diverse dataset using either supervised or self-supervised learning. Despite the relatively high number of manually segmented DFIs, performance continues to improve with the training set size, as shown in figure 7. This suggests that increasing the training set size further may lead to improved performance for LUNet.

The manual A/V segmentation process is time-consuming, which limited our methodology to a single junior annotator for the initial segmentations and a single senior annotator for the corrections. In future studies, it would be beneficial to engage multiple senior annotators to reannotate a subset of our dataset, which would allow us to assess inter-rater agreement as demonstrated by Jin et al (2022). This step would strengthen the reliability of our segmentation benchmarks. In this research we focused on 30° FOV images capturing a uniform physical region of the retina. However, to enhance the model's generalization capabilities across various FOVs and image centerings, future developments will be needed to train a model that is robust to these distribution shifts. Finally, while recent studies such as those by Lin et al (2023) and Jiang et al (2022) have explored the potential of transformer architectures for blood vessel segmentation, these works did not address the task of A/V distinction.

Conclusion

In conclusion, we developed LUNet, a robust, i.e. high-performing and generalizable, deep learning model for the segmentation of venules and arterioles in fundus images. We demonstrated how LUNet can be used to estimate vasculature biomarkers, allowing large-scale research on the effect of cardiovascular diseases on the eye vasculature.

Data availability statement

The data that support the findings of this study are openly available at the following URL/DOI: https://www.nature.com/articles/s41597-024-03086-6

Appendix A.: Additional performance measures

Table A1. Matthews correlation coefficient performance in A/V segmentation. The table includes LUNet, BFN (Zhou et al 2021), Little W-Net (Galdran et al 2022) and VC-Net (Hu et al 2021) trained on the UZLF-train dataset. It also includes the performance of the publicly available versions BFN* (Zhou et al 2022) and Little W-Net* (Galdran et al 2022), i.e. pretrained by the original authors. Results for BFN* are not provided for LES-AV because the model was originally trained on this dataset.

Model | UZLF MCC_a | UZLF MCC_v | LES-AV MCC_a | LES-AV MCC_v | UNAF MCC_a | UNAF MCC_v | INSPIRE-AVR MCC_a | INSPIRE-AVR MCC_v | Cropped HRF MCC_a | Cropped HRF MCC_v
BFN* (Zhou et al 2021) | 74.21 | 78.64 | — | — | 64.70 | 72.73 | 62.00 | 66.97 | 66.63 | 70.31
Little W-Net* (Galdran et al 2022) | 60.30 | 66.92 | 64.54 | 69.45 | 51.60 | 54.15 | 57.17 | 65.19 | 56.62 | 60.35
BFN (Zhou et al 2021) | 77.51 | 80.10 | 77.57 | 79.27 | 67.00 | 72.41 | 67.53 | 70.77 | 68.89 | 71.52
Little W-Net (Galdran et al 2022) | 79.18 | 82.08 | 79.74 | 82.43 | 65.86 | 69.91 | 69.73 | 73.42 | 59.17 | 49.17
VC-Net (Hu et al 2021) | 77.66 | 81.32 | 77.39 | 81.37 | 67.97 | 74.27 | 70.43 | 74.53 | 70.58 | 74.98
LUNet (our work) | 81.46 | 83.97 | 81.10 | 83.49 | 73.01 | 78.50 | 72.47 | 76.68 | 72.66 | 74.73