All-in-SAM: from Weak Annotation to Pixel-wise Nuclei Segmentation with Prompt-based Finetuning

The Segment Anything Model (SAM) is a recently proposed prompt-based segmentation model with a generic zero-shot segmentation capability. With this zero-shot capacity, SAM achieves impressive flexibility and precision on various segmentation tasks. However, the current pipeline requires manual prompts during the inference stage, which remains resource-intensive for biomedical image segmentation. In this paper, instead of using prompts during the inference stage, we introduce a pipeline, called All-in-SAM, that utilizes SAM through the entire AI development workflow (from annotation generation to model finetuning) without requiring manual prompts at inference. Specifically, SAM is first employed to generate pixel-level annotations from weak prompts (e.g., points, bounding boxes). Then, the pixel-level annotations are used to finetune the SAM segmentation model rather than training it from scratch. Our experimental results reveal two key findings: 1) the proposed pipeline surpasses state-of-the-art methods on a nuclei segmentation task on the public MoNuSeg dataset, and 2) using weak and few annotations for SAM finetuning achieves competitive performance compared to using strong pixel-wise annotated data.


I. INTRODUCTION
Foundation models have recently been proposed as powerful segmentation models [1], [2]. The Segment Anything Model (SAM), as an example, was trained on millions of images to achieve a generic segmentation capability [3]. SAM can automatically segment a new image, and it also accepts prompt inputs of foreground/background points or box regions for better segmentation [4]-[7]. However, recent studies have revealed SAM's limited performance on specific domain tasks, such as medical image segmentation, particularly when an insufficient number of prompts is available [4]. The main reason is that medical data are rare in SAM's training set, while medical segmentation tasks generally require more professional knowledge than natural image segmentation [8]. (This research was supported by NIH R01DK135597 (Huo), NSF CAREER 1452485, NSF 2040462, NCRR Grant UL1-01 (now at NCATS Grant 2 UL1 TR000445-06), an NVIDIA hardware grant, and resources of ACCRE at Vanderbilt University.)
The finetuning strategy utilizes the power of a generic model in detecting low-level, general image patterns but adjusts the final segmentation based on the characteristics and high-level understanding of downstream tasks, which provides a promising solution for adapting a generic segmentation model to downstream tasks. Previous approaches [6], [9] have proposed finetuning methods to improve SAM's performance on downstream tasks. However, these methods mostly require complete data annotation for finetuning and do not explore the impact of weak annotation and few training data on finetuning the pretrained SAM model.
Nuclei segmentation is a crucial task in biomedical research and clinical applications, but manual annotation of nuclei in whole slide images (WSIs) is time-consuming and labor-intensive. Previous works attempted to segment nuclei automatically with supervised learning [10], [11]. More recently, some methods used self-supervised learning to further improve model performance [12], [13]. SAM has great potential to benefit nuclei segmentation if it can be adapted appropriately. This paper investigates the performance of transferring SAM to nuclei segmentation. A previous study indicated that SAM performed poorly on nuclei segmentation without box/point information as manual prompts, but achieved promising segmentation when the bounding box of every nucleus was provided as the prompt in the inference stage [4]. However, manually annotating all the boxes during inference remains time-consuming. To address this issue, we introduce a pipeline for label-efficient finetuning of SAM, with no requirement for annotation prompts during inference. Also, instead of relying on complete annotations for finetuning, we leverage weak annotations to further reduce annotation costs while achieving segmentation performance comparable to SOTA methods.
In this work, we propose the All-in-SAM pipeline, utilizing the pretrained SAM for annotation generation and model finetuning. No manual prompts are required during the inference stage (Fig. 1). The contribution of this work can be summarized in two points: 1) Utilization of weak annotations for cost reduction: rather than relying exclusively on fully annotated data for finetuning, we demonstrate the effectiveness of leveraging weak annotations and the pretrained SAM. This approach helps to minimize annotation costs while achieving segmentation performance comparable to current state-of-the-art methods.
2) Development of a pipeline for label-efficient finetuning: we propose a method that allows SAM to be finetuned for nuclei segmentation without requiring manual prompts during inference. This significantly reduces the time and effort involved in manual annotation.
Overall, this work aims to enhance the application of SAM in nuclei segmentation by addressing the annotation burden and cost issues through label-efficient finetuning and the utilization of weak annotations.

II. METHOD

A. Overview
Motivated by the promising performance of the SAM model in interactive segmentation tasks with sparse prompts and its potential for finetuning, we propose a segmentation pipeline that leverages weak and limited annotations, and we apply this pipeline to the nuclei segmentation task.
The proposed pipeline consists of two main stages: SAM-empowered annotation and SAM finetuning. In the first stage, we utilize the pretrained SAM model to generate high-quality approximate nuclei masks for pathology images. This is achieved by providing the bounding boxes of nuclei as input to the pretrained SAM model. These approximate masks serve as initial segmentation masks for the nuclei. In the second stage, the generated approximate masks are employed to finetune the SAM model, which allows the model to adapt and refine its segmentation capabilities specifically for nuclei segmentation. The proposed pipeline is displayed in Fig. 2. The two stages are introduced in detail in II-B and II-C. Furthermore, we evaluate the performance of the model when only a small amount of annotated data is available for downstream tasks. By decreasing the number of annotated samples, we aim to reduce annotation labor while still achieving satisfactory segmentation results.
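The two-stage flow above can be sketched as follows. This is an illustrative structural sketch, not the actual implementation: `generate_pseudo_masks`, `predict_fn`, and `toy_predict` are hypothetical names, and the toy predictor simply fills the prompted box, standing in for the pretrained SAM predictor that would return the segmented nucleus inside each box.

```python
import numpy as np

def generate_pseudo_masks(predict_fn, image, boxes):
    """Stage 1 sketch: turn weak box annotations into one pixel-level
    pseudo label by unioning per-box predictions.

    predict_fn(image, box) -> boolean mask; here it stands in for the
    pretrained SAM predictor prompted with a single bounding box.
    """
    pseudo = np.zeros(image.shape[:2], dtype=bool)
    for box in boxes:
        pseudo |= predict_fn(image, box)  # union of per-nucleus masks
    return pseudo

def toy_predict(image, box):
    """Toy stand-in for SAM: marks the whole prompted box as foreground."""
    x0, y0, x1, y1 = box
    mask = np.zeros(image.shape[:2], dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask

image = np.zeros((8, 8, 3))
pseudo = generate_pseudo_masks(toy_predict, image, [(0, 0, 3, 3), (5, 5, 8, 8)])
```

In stage 2, `pseudo` would then serve as the supervision target for finetuning, replacing a manually drawn pixel-wise mask.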

B. SAM-empowered annotation
The SAM model consists of three key components: the prompt encoder, the image encoder, and the mask decoder. The image encoder utilizes the Vision Transformer (ViT) as its backbone. The prompt encoder can take two forms: sparse or dense. In the sparse form, prompts can be points, boxes, or text, whereas in the dense form, prompts are represented as a grid or mask. The encoded prompts are then added to the image representation for the subsequent mask decoding process. In a previous study [4], it was observed that when only automatically generated dense prompts were used, nuclei segmentation sometimes failed to produce satisfactory results. However, significant improvement was achieved when weak annotations such as points or boxes were provided during segmentation inference. Notably, when the bounding box of each nucleus was available as a weak annotation, the segmentation achieved a Dice score of 0.883 on the public MoNuSeg dataset [14], significantly surpassing the results obtained from supervised learning methods. This indicates that SAM has strong capabilities in edge detection, enabling clear detection of nuclei boundaries within focus regions. This makes it a potential tool for generating precise approximate masks, which can enhance supervised learning approaches at lower annotation costs.

C. SAM-finetuning
SAM has been trained on a large dataset for generic segmentation tasks, giving it the ability to perform well in general segmentation. However, when applied to specific tasks, SAM may exhibit suboptimal performance or even fail. Nonetheless, if the knowledge accumulated by SAM can be transferred to these specific tasks, it holds great potential for achieving better performance compared to training the model from scratch using only downstream task data, especially when the available annotated data for the downstream task is limited.
To optimize the transfer of knowledge, rather than finetuning the entire large pretrained model, a more effective and efficient approach is to selectively unfreeze only the last few layers. However, in our experiments, this approach still yielded inferior results compared to some baselines. Recently, there has been growing attention in the natural language processing community toward the use of adapters as an effective tool for finetuning models on different downstream tasks by leveraging task-specific knowledge [15]. In line with this, Chen et al. [9] successfully adapted prompt adapters [16] in the finetuning process of SAM. Specifically, they automatically extracted and encoded the texture information of each image as handcrafted features, which were then added to multiple layers in the encoder. Additionally, the proposed prompt encoder, along with the unfrozen lightweight decoder, became learnable during the finetuning process. Following their work, we implement this finetuning strategy for the nucleus segmentation task, but we further explore its performance under different amounts of training data.
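The freeze-the-backbone, train-the-adapter idea can be illustrated with a minimal PyTorch sketch. This is not the actual SAM architecture or the authors' code: `TinySegModel` and its layers are hypothetical stand-ins chosen only to show the parameter-freezing pattern in which the large pretrained encoder stays fixed while a small adapter and a lightweight decoder remain learnable.

```python
import torch.nn as nn

class TinySegModel(nn.Module):
    """Minimal stand-in: a 'pretrained' encoder, a small task-specific
    adapter, and a lightweight decoder (mirroring the finetuning setup)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(3, 8, 3, padding=1)  # pretrained, to be frozen
        self.adapter = nn.Conv2d(8, 8, 1)             # task-specific, trainable
        self.decoder = nn.Conv2d(8, 1, 1)             # lightweight, trainable

    def forward(self, x):
        return self.decoder(self.adapter(self.encoder(x)))

model = TinySegModel()
# Freeze the large pretrained encoder; only adapter + decoder are updated.
for p in model.encoder.parameters():
    p.requires_grad = False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

An optimizer would then be built only over the parameters with `requires_grad=True`, so gradient updates never touch the frozen encoder weights.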

III. EXPERIMENTS

A. Data and Task
In this work, we employ the MICCAI 2018 MoNuSeg dataset [17]. It consists of 30 training images and 14 testing images, all with dimensions of 1000×1000 pixels. Each image is accompanied by corresponding nuclei masks. To ensure a fair and comparable evaluation, we use the same data split as a recent study [18]. The 30 training images are divided into two subsets, with 24 images assigned to the training set and the remaining 6 images forming the validation set. To evaluate model performance on nucleus segmentation, Dice, AUC, Recall, Precision, best F1 (maximized F1 score at the optimal threshold), IoU (Intersection over Union), and ADJ (Adjusted Rand Index) are calculated.

B. Experiment Setting
In this work, we designed three sets of experiments to explore the performance of finetuned SAM on the nucleus segmentation task.
1) Finetuning with complete or weak annotation. For complete annotation, pixel-wise complete annotations were provided for the training data to finetune the pretrained SAM model. For weak annotation, only the bounding boxes of nuclei were provided. In this work, the bounding boxes were automatically derived from the complete masks. These bounding boxes were then used as prompts to the pretrained SAM to generate pixel-level pseudo labels for finetuning.
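Deriving bounding boxes from complete masks, as described above, amounts to taking the extent of each labeled nucleus. A minimal sketch, assuming an instance-labeled mask (0 = background, 1..N = nucleus IDs) and a hypothetical helper name `masks_to_boxes`:

```python
import numpy as np

def masks_to_boxes(instance_mask):
    """Derive per-nucleus bounding boxes (x0, y0, x1, y1), with exclusive
    upper bounds, from an instance-labeled mask."""
    boxes = []
    for label in np.unique(instance_mask):
        if label == 0:
            continue  # skip background
        ys, xs = np.where(instance_mask == label)
        boxes.append((xs.min(), ys.min(), xs.max() + 1, ys.max() + 1))
    return boxes

mask = np.zeros((6, 6), dtype=int)
mask[1:3, 1:4] = 1   # nucleus 1
mask[4:6, 0:2] = 2   # nucleus 2
boxes = masks_to_boxes(mask)
```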
2) Finetuning with different amounts of annotated data. To evaluate the performance of the proposed pipeline finetuned with different amounts of data, we adjusted the number of annotated images and the area of the annotated regions. The complete training set contains 24 image patches of 1000×1000 pixels with corresponding annotations. In the 4% training-data setting, only one random 200×200 patch was selected from each large patch for annotation. To keep the parameters unchanged for a fair comparison, the remaining unannotated area was set to intensity 0. In the extreme case, only 3 patches (1 from each image) of size 200×200, taking up 0.5% of the original complete dataset, were randomly selected.
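The 4% setting described above keeps one random 200×200 window of each 1000×1000 annotation and zeroes out the rest. A hedged sketch of that reduction (the function name `keep_random_patch` is illustrative, not from the paper):

```python
import numpy as np

def keep_random_patch(annotation, patch=200, rng=None):
    """Zero out the annotation everywhere except one randomly placed
    patch x patch window, mirroring the 4% training-data setting."""
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = annotation.shape
    y = rng.integers(0, h - patch + 1)
    x = rng.integers(0, w - patch + 1)
    reduced = np.zeros_like(annotation)  # unannotated area set to intensity 0
    reduced[y:y + patch, x:x + patch] = annotation[y:y + patch, x:x + patch]
    return reduced

full = np.ones((1000, 1000), dtype=np.uint8)  # fully annotated patch
reduced = keep_random_patch(full)
```

Here a 200×200 window covers 200·200 / 1000·1000 = 4% of each patch, matching the fraction stated above.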
3) Comparison with other SOTA methods. In this study, we conducted a performance comparison between our proposed pipeline and other SOTA methods. LViT [18] is a recently

Fig. 1. This figure shows the overall idea of the proposed All-in-SAM pipeline. First, the AI foundation model SAM is used in the annotation phase to convert weak annotations (bounding boxes) into strong annotations (pixel-wise labels), which reduces the time consumed during the labeling process. Then, the SAM model is finetuned with fewer strong annotations. The ultimate goal of the All-in-SAM pipeline is to enable efficient few-shot and weak annotation for AI model adaptation.

Fig. 2. The proposed pipeline using weak and few annotations for nuclei segmentation. In the training stage, only the bounding boxes (in green) of nuclei were provided as weak annotation labels to generate the approximate segmentation masks. Then, with the supervision of the approximate segmentation masks, prompt-based finetuning was applied to the pretrained SAM model. In the inference stage, nuclei can be segmented directly from images without box prompts.