Extraction of building footprints using Mask R-CNN for high-resolution aerial imagery

Extracting individual buildings from satellite images is crucial for various urban applications, including population estimation, urban planning, and other related fields. However, extracting building footprints from remote sensing data is a challenging task because of scale differences, complex structures and different building types. To address these issues, this paper proposes an approach that efficiently detects buildings in images by generating a segmentation mask for each instance. The approach employs Mask R-CNN (Mask Region-based Convolutional Neural Network), which extends Faster R-CNN with a parallel branch for object mask prediction alongside bounding box recognition, and was evaluated against other models such as YOLOv5, YOLOv7 and YOLOv8 in a comparative study to assess its effectiveness. The findings of this study reveal that our proposed method achieved the highest accuracy in building extraction. Furthermore, we performed experiments on the well-established WHU and INRIA datasets, and our method consistently outperformed other existing methods, producing reliable results.


Introduction
Building extraction from satellite imagery is a complex and significant task in remote sensing. It involves automatically detecting and delineating building footprints or outlines from high-resolution satellite imagery, and plays a crucial role in applications such as urban planning, disaster management, and environmental monitoring.
Satellite imagery provides valuable information about the Earth's surface, allowing for the study of various environmental aspects. Effective urban planning supported by precise building footprint data can lead to better infrastructure development, improved route planning and navigation for transportation, and a more resilient economy. Accurately extracting building footprints from high-resolution aerial imagery is crucial for providing insights into the spatial distribution of buildings and facilitating the development of sustainable urban planning strategies.
Building extraction faces challenges due to the variability in building types, sizes, shapes, architectural styles, occlusions, shadows, and illumination. Overcoming these challenges requires sophisticated methodologies and techniques. Researchers have devised various approaches, including pixel-based and object-based methods.
Pixel-based approaches operate at the pixel level, employing image processing techniques such as thresholding, edge detection, and mathematical morphology. However, they often struggle with noise, clutter, and variability in building appearances. Object-based methods consider spatial relationships between pixels and employ advanced image segmentation algorithms to extract potential building candidates. Classification and refinement steps then differentiate buildings from other objects using shape, texture, and contextual features. Object-based approaches for building classification commonly employ machine learning algorithms such as convolutional neural networks (CNNs), support vector machines (SVMs) and random forests (RFs).
Promising results in building extraction have been demonstrated through the application of advanced deep learning methodologies, particularly CNNs. CNNs can learn hierarchical representations of satellite images, enabling automatic feature extraction and classification, and they capture complex patterns and contextual information, improving accuracy in building extraction.
Traditionally, manual extraction by human experts was time-consuming and labour-intensive. Advances in remote sensing technology and machine learning have led to automated methods for building extraction, leveraging computer algorithms and artificial intelligence. These methods analyze satellite images to identify building footprints accurately and efficiently.
Building extraction from satellite images provides valuable information for urban planners and city officials. It helps identify areas of urban development, monitor building construction and changes over time, and plan future infrastructure and development projects. Although manual techniques are precise, they are impractical for extensive use due to their time-consuming nature. Traditional image processing methods face challenges with intricate situations such as fluctuating lighting conditions or extracting irregularly shaped structures [1], resulting in inaccuracies when extracting building footprints. Deep learning models such as CNNs have shown promising results in addressing these challenges. CNNs, especially those designed for instance segmentation tasks like Mask R-CNN, can automatically learn to identify and delineate objects in images, including buildings. This presents a promising solution for overcoming the limitations of traditional approaches, enabling scalable, accurate, and efficient building extraction from satellite imagery.
The domain of building extraction from satellite imagery faces several challenges, including diverse datasets, false positives and negatives, small and irregularly-shaped buildings, computational resource demands, data scarcity, and sensitivity to image quality. Existing methods, including CNNs, exhibit strengths but also encounter limitations in addressing these challenges comprehensively. The research gap lies in the need for an improved, scalable, and robust approach that overcomes these challenges, providing a more effective solution for building extraction tasks. Over decades of exploration in this field, researchers have grappled with numerous difficulties [1], including variability in building appearances, obstructions, computational resource demands, and sensitivity to image quality.
Extracting building footprints from satellite imagery has been a focus of remote sensing research for decades, employing various methodologies from traditional pixel-based and object-based approaches to modern machine learning techniques. In the present study, we propose an efficient method using Mask R-CNN, a powerful deep learning model that extends Faster R-CNN for precise building footprint delineation. This study leverages the strengths of modern deep learning to overcome traditional challenges such as noise, clutter, and variability in building appearances.
The significance of this research in urban environments is profound, as accurate building extraction informs urban planning, disaster management, and environmental monitoring. By rigorously comparing Mask R-CNN to models such as YOLOv5, YOLOv7 and YOLOv8, this study highlights unique advancements, identifies challenges, and offers opportunities for future research, ultimately supporting sustainable urban development and resilient urban environments.
The main objective is to establish a strong basis for building extraction using Mask R-CNN, accurately identifying small and irregularly-shaped buildings and thereby improving performance. This approach seeks to open new avenues for exploration and advancement within the dynamic realm of building extraction, and contributes to the ongoing discourse in remote sensing by introducing a new approach to building extraction from satellite imagery.

Related work
CNNs have become instrumental in extracting buildings from both high-resolution and low-resolution images in recent times. However, existing pixel-wise prediction methods suffer from accuracy issues in building delineation. A topography-aware multi-resolution fusion learning technique was proposed to learn building boundaries in remote sensing imagery [2]. Nonetheless, this strategy does not account for incorrect annotations in the building footprint dataset, potentially leading to inaccurate evaluation.
The proposed Adaptive Polygon Generation algorithm (APGA) addresses this by directly generating polygons to accurately outline building instances, enhancing both geometric consistency and practical applicability [3]. The study [4] introduces a multiscale building extraction method using refined attention pyramid networks (RAPNets), integrating atrous and deformable convolutions, attention mechanisms, and pyramid pooling to enhance feature extraction and fusion, demonstrating superior performance on the Inria and xBD datasets. The Refine-UNet [5] architecture was introduced to enhance building extraction accuracy; it integrates an encoder-decoder module with a refined skip connection method, incorporating advanced components such as atrous spatial convolutional pyramid pooling and depth-wise separable convolution. This integration of innovative components refines the performance of the system, leading to improved accuracy in building extraction. To classify objects of multiple shapes and at multiple spatial scales, a USPP module was proposed to maintain global contextual information and extract features at various scales [6].
Despite their impressive predictive capabilities, CNNs often demand a substantial number of pixel-level annotations, requiring labour-intensive effort. To address this, PiCoCo was proposed, a method capable of learning building footprints with just 1 percent of the pixel labels [7]. MHA-Net was developed to enhance the precision and resilience of building footprint extraction by combining multiple attention mechanisms with CNNs [8]. Additionally, SiU-Net was introduced to enhance scale invariance for extracting buildings of different sizes, particularly by segmenting large buildings at a coarser scale [9].
The U-Net architecture [10,11] is widely recognized as effective for semantic segmentation in various domains, and is particularly useful for tasks with limited training data, such as biomedical image segmentation. The architecture introduced in [12] utilized satellite images acquired from the TripleSat sensor, which includes both panchromatic and multispectral bands; the panchromatic band provides superior spatial resolution compared to the multispectral bands. Researchers introduced MAP-Net [13], a specialized network aimed at acquiring multi-scale contextual information and extracting precise building footprints. The enhancement module of [14] improves boundary information and overall model performance. The selective spatial pyramid dilated (SSPD) network [15], with its novel encoder-decoder structure and L-shape weighting loss, significantly improves multiscale feature extraction and outperforms existing methods on SAR datasets. Extraction of building footprints from low-resolution satellite images [16] involved the use of instance segmentation. CG-Net, a conditional GIS-aware network, leverages GIS data to enhance the segmentation process, effectively integrating spatial context and conditional information to improve the precision and accuracy of individual building segmentation in VHR SAR images [17].
To reduce dependence on manual labour for accurate footprint delineation, an automated building segmentation technique employing a concentric loop convolutional neural network (CLPCNN) was applied to remote sensing images [18]. The challenge of building extraction at scale was addressed using a deep learning-based method with a CNN architecture specifically designed for satellite imagery [19]. LRAD-Net uses an efficient RegNet backbone, multiscale depthwise separable atrous spatial pyramid pooling, and innovative attention mechanisms, achieving superior performance and faster computation on various high-resolution datasets [20]. Another proposed architecture utilizes multi-subgraph matching to achieve accurate building footprint extraction from high-resolution imagery [21]. Additionally, a bi-channel bi-spatial (B2-CS) feature extraction method combined with a fully convolutional network (FCN) architecture was utilized [22]. Lastly, a multitask-driven deep neural network (MD-Net), comprising a shared feature extraction unit, a building segmentation module, and a building boundary refinement module, proved effective for extracting building footprints from high-resolution imagery [23]. FMAM-Net includes a feature refine compensation module (FRCM) to enhance boundary clarity and attention modules (TAM and PAM) to improve feature distinction and generalization; it outperforms existing methods on the Inria and WHU datasets, achieving higher IoU scores and better visualization results [24].
Existing prediction methods often suffer from accuracy issues in building delineation. Various strategies have been proposed to enhance extraction results, such as topography-aware multi-resolution fusion learning, which improves building boundary detection but can be hindered by incorrect annotations. Techniques like the Refine-UNet architecture integrate advanced components to refine performance, yet still require substantial pixel-level annotations, which are labour-intensive. Methods such as pixel-wise contrast and consistency learning (PiCoCo) aim to reduce this burden by learning with minimal pixel labels, while the multipath hybrid attention network (MHA-Net) and SiU-Net introduce attention mechanisms and scale invariance to enhance precision and robustness. Despite these advancements, challenges remain in accurately delineating building boundaries, handling diverse scales and shapes, and reducing dependency on extensive manual labelling. Our proposed approach using Mask R-CNN addresses these gaps by offering a scalable solution with robust object detection capabilities, generating precise segmentation masks for building instances, and automating feature extraction to improve accuracy and efficiency in building footprint delineation from satellite imagery. This positions our research to significantly advance the state of the art in urban analytics and remote sensing applications.

Methodology
The proposed work focuses on utilizing Mask R-CNN to identify and segment individual buildings in satellite images. This framework offers flexibility and can be adapted to various building extraction tasks, including different building shapes, sizes, and terrains. Transfer learning is employed by fine-tuning a pre-trained model, which enhances accuracy and reduces the required training data.
The main contributions of our project include:
• Mask R-CNN refines Faster R-CNN by adding instance segmentation, identifying individual object instances and assigning each a unique segmentation mask.
• The backbone network of Mask R-CNN captures semantic details using a deep CNN, generating a high-dimensional feature map that encodes spatial and semantic information. The RPN identifies potential object-containing regions by sliding a small network across the spatial dimensions, estimating objectness scores, and refining anchor box coordinates. Anchor boxes serve as bounding box proposals, utilized by the mask prediction network for precise instance segmentation at the pixel level.
• Comparison of different object detection models, namely R-CNN, Fast R-CNN, Faster R-CNN, YOLOv5, YOLOv7 and YOLOv8, with Mask R-CNN.
• YOLOv5 specializes in real-time object detection, excelling at detecting small objects with high accuracy even with limited training examples. YOLOv7, featuring higher-resolution processing, enhances the detection of smaller objects but falls short of surpassing other YOLO models in our study. YOLOv8, incorporating innovative transformer architectures, is also evaluated, contributing to a comprehensive analysis of building extraction models.
• Mask R-CNN accurately identifies and extracts individual building instances using instance segmentation. Through transfer learning, the backbone network is fine-tuned, enhancing accuracy while reducing the need for extensive training data and thus increasing practicality. A multitask loss function is incorporated, optimizing performance and ensuring precise localization and high-quality extraction results. It excels at capturing fine-grained details even under challenging conditions such as noise and clutter.
The data preparation for training the model involves collecting a large dataset of images and annotating them with bounding boxes and masks. To enhance dataset diversity, pre-processing steps are implemented, including resizing the images while maintaining the aspect ratio, normalization, and data augmentation methods such as rotation, flipping and color distortion. The model is then trained using this annotated dataset; training typically involves multiple iterations over the dataset, with model parameters adjusted after each iteration to improve performance. The model comprises three key modules: the backbone network (ResNet), the Region Proposal Network (RPN), and Region of Interest (RoI) pooling. The system architecture of Mask R-CNN is illustrated in figure 1.
The backbone network functions as a feature extractor, capturing semantic details from the input image. Typically, a deep convolutional neural network (CNN) such as ResNet or VGG is used for this task. By processing the input image, the backbone network generates a high-dimensional feature map that encodes spatial and semantic details about the objects present in the image. The region proposal network (RPN) operates on this feature map to identify potential regions that may contain significant objects. This involves sliding a small network, typically convolutional, across the spatial dimensions of the feature map. For every sliding window position, the RPN estimates the likelihood of an object being present (the objectness score) and refines the coordinates of a set of predefined anchor boxes. These anchor boxes serve as potential bounding box proposals for enclosing objects, and the RPN uses them to generate a collection of region proposals.

The mask prediction network utilizes these region proposals to achieve precise instance segmentation, making predictions at the pixel level and refining the bounding box coordinates obtained from the RPN. A fixed-size feature map, called a Region of Interest (RoI), is extracted for each proposed region. The RoI feature maps pass through convolutional layers, producing binary masks for each class; these masks precisely segment object instances within the proposed bounding boxes. The training process optimizes the model using a multitask loss function with three crucial components: bounding box regression loss, object classification loss, and mask segmentation loss. The bounding box regression loss assesses the accuracy of the predicted bounding box coordinates against the ground truth, ensuring the model learns to localize objects precisely. The object classification loss penalizes misclassifications of objects and background. Lastly, the mask segmentation loss evaluates the pixel-wise accuracy of the predicted masks against the ground truth.

During the inference phase, the trained Mask R-CNN model is deployed to carry out object detection and instance segmentation on new images, generating predictions for bounding box coordinates, objectness scores, and pixel-level masks for each object instance present. Mask R-CNN has achieved outstanding performance on various benchmark datasets and has been widely adopted in numerous computer vision applications; its capability to simultaneously detect objects and provide pixel-level segmentation masks makes it a powerful tool for tasks such as object recognition, semantic segmentation, and scene understanding. The pipeline begins with a convolutional backbone network that extracts features from the input image, leveraging residual blocks with skip connections for efficient gradient propagation. The RPN then identifies potential building locations by operating on multiple feature map levels. Subsequently, RoI Align refines proposals for accurate mask prediction, employing bilinear interpolation to improve feature accuracy. Finally, binary segmentation masks for each building are generated, with spatial resolution progressively increased through deconvolutional layers.

Backbone network (ResNet)
In Mask R-CNN, the backbone network performs feature extraction from the input image, passing the result to the region proposal network. The backbone typically consists of a convolutional neural network (CNN), such as ResNet, trained on a large dataset to learn generic features applicable to various tasks, and can be instantiated at different depths, such as ResNet-50 or ResNet-101. The ResNet-50 backbone in Mask R-CNN captures salient features from the input image, playing a crucial role in discovering the discriminative features that significantly impact the performance of Mask R-CNN. The usual practice involves pre-training the ResNet-50 backbone on an extensive image classification task, such as ImageNet, to obtain robust and versatile feature representations. ResNet-50, which belongs to the ResNet (Residual Network) family of architectures, is a popular convolutional neural network designed to address challenges in training deep networks; it refers specifically to a ResNet model with 50 layers. The core innovation of the ResNet architecture is the introduction of residual blocks, which incorporate skip connections to enable the direct flow of information through the network. These skip connections alleviate the vanishing gradient problem, enabling smoother training of deep networks.
The basic residual block comprises two 3 × 3 convolutional layers, each accompanied by batch normalization and a ReLU activation (ResNet-50 itself stacks a deeper bottleneck variant of this block, using 1 × 1 convolutions to reduce and restore channel depth around the 3 × 3 convolution). The input passes through the first convolutional layer, and its output is fed into the second convolutional layer. The block input is then added directly to the output of the second convolutional layer via the skip connection; this residual connection lets gradients flow directly through the block, improving information propagation. The ResNet-50 architecture consists of multiple stages, each containing several residual blocks. The initial stage performs an initial convolution, max pooling, and a first series of residual blocks. In subsequent stages, the feature maps are further downsampled in their spatial dimensions while their depth increases, leading to the final stage, which applies global average pooling and a fully connected layer for classification. This progressive process allows the network to capture hierarchical patterns and representations in the data, contributing to more effective feature extraction and classification.
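For concreteness, a minimal Keras sketch of the two-convolution residual block described above follows (Keras being the framework used in our experimental setup); the filter counts, strides and input shape are illustrative assumptions rather than values from this work.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    """Basic two-convolution residual block with a skip connection."""
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    # Project the shortcut when the spatial size or channel depth changes
    if stride != 1 or x.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride, use_bias=False)(x)
        shortcut = layers.BatchNormalization()(shortcut)
    y = layers.Add()([y, shortcut])  # skip connection: output = F(x) + x
    return layers.ReLU()(y)

# Initial stage: convolution and max pooling, then a first stack of blocks
inputs = tf.keras.Input(shape=(512, 512, 3))
x = layers.Conv2D(64, 7, strides=2, padding="same")(inputs)
x = layers.MaxPooling2D(3, strides=2, padding="same")(x)
x = residual_block(x, 64)
```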
By leveraging the ResNet-50 backbone together with a Feature Pyramid Network, features are extracted from input images at multiple scales, facilitating subsequent processing by components such as the region proposal network and the mask head. This multi-scale feature extraction enhances the model's ability to capture object details and context at different levels of granularity, leading to accurate instance segmentation and robust object detection.
Algorithm for backbone network (ResNet)

(i) Input: let X_i^(0) be the input image (feature map) of size H_i × W_i × C_i.

(ii) Convolutional layer: apply an initial convolutional layer to the input image, using learnable filters of size F1 × F1 × C1 and producing an output feature map X_i^(1) of size H1_i × W1_i × C1_i, where the output dimensions depend on the specific convolution parameters (stride and padding):
X_i^(1) = Conv(X_i^(0), W_i^(1))

(iii) Residual blocks: the ResNet-50 architecture consists of several residual blocks. Let X_i^(l) represent the feature map output of the l-th residual block. Within each residual block l, multiple convolutional layers are interconnected with shortcut connections:
(a) Convolutional layers: apply a series of convolutional operations with learnable filters of size F2 × F2 × C2 to the input feature map X_i^(l−1), producing an intermediate feature map H_i^(l).
(b) Shortcut connection: add the input feature map X_i^(l−1) to the output of the convolutional layers: Y_i^(l) = H_i^(l) + X_i^(l−1).
(c) Activation function: apply an activation function, such as ReLU, element-wise to the result: X_i^(l) = ReLU(Y_i^(l)).
(d) Repeat: a stack of residual blocks is created by repeating steps (a) to (c) multiple times.

(iv) Global average pooling: apply global average pooling to the feature map X_i^(L) after the residual blocks, where L denotes the index of the last residual block:
Z = GlobalAvgPool(X_i^(L))
Here, Z is a vector of size C2_i representing the average value of each feature map channel.

(v) Fully connected layer: connect the pooled features Z to a fully connected layer with weights W^(fc) of size C2_i × P, where P is the number of output classes:
A = Z · W^(fc)
Here, A is a vector of size P representing the pre-softmax logits.

(vi) Softmax activation: apply the softmax activation function to the logits A to obtain the class probabilities:
Y_pred = Softmax(A)
Here, Y_pred is a vector of size P representing the predicted class probabilities.

(vii) Output: the final output is the predicted class label or the class probabilities Y_pred.
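The closing steps (iv)-(vii) can be illustrated with a small NumPy sketch; the feature map shape, the random weight values and the two-class setting are arbitrary assumptions for demonstration only.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(0)
X_L = rng.random((7, 7, 2048))   # feature map after the last residual block
Z = X_L.mean(axis=(0, 1))        # (iv) global average pooling, one value per channel
W_fc = rng.random((2048, 2))     # (v) fully connected weights for P = 2 classes
A = Z @ W_fc                     # pre-softmax logits
Y_pred = softmax(A)              # (vi)-(vii) predicted class probabilities
```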
The ResNet backbone in Mask R-CNN can be summarized as follows: the input image or feature map is processed through the ResNet's residual blocks, which involve convolution operations, skip connections, and element-wise activation. These steps enable the transmission of information within the network and enhance the model's capability to extract features. Figure 2 illustrates the workflow diagram of Mask R-CNN.

Region proposal network
The Region Proposal Network plays a crucial role in both instance segmentation and object detection within the Mask R-CNN model. Operating on the backbone network's feature maps, it generates object proposals by identifying regions likely to contain objects of interest, seamlessly integrating this process into the overall architecture.
The RPN is a critical component of the Mask R-CNN architecture, generating region proposals that represent potential bounding box locations for objects, with the aim of identifying candidate regions of interest (ROIs) for accurate object detection and instance segmentation. The RPN leverages the feature maps from the backbone network and functions as a fully convolutional network (FCN). It consists of classification and regression branches, which work collaboratively to generate object proposals by predicting objectness scores and refining bounding box coordinates, contributing to the model's effective region proposal mechanism.
In the Mask R-CNN framework, predefined bounding boxes called anchors, with various aspect ratios and scales, are densely tiled over the spatial dimensions of the feature maps. These anchors serve as reference boxes, enabling the RPN to predict object presence and bounding box adjustments. Utilizing a sliding window mechanism, the RPN processes a fixed-size window centered on each anchor position on the feature maps, generating two outputs: the objectness score (indicating object presence within the anchor) and the bounding box regression offsets (fine-tuning the anchor's coordinates).
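A minimal NumPy sketch of anchor generation for a single spatial location is shown below; the base size, aspect ratios and scales are illustrative assumptions rather than values from this work.

```python
import numpy as np

def generate_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Return K = len(ratios) * len(scales) anchors (x1, y1, x2, y2)
    centred at the origin; in the RPN these are shifted to every
    spatial position of the feature map."""
    anchors = []
    for ratio in ratios:          # ratio = height / width
        for scale in scales:
            area = float(base_size * scale) ** 2
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)
```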
During RPN training, ground-truth annotations are employed to label the anchors. Anchors with high Intersection over Union (IoU) overlap with a ground-truth bounding box are assigned a positive label, indicating the presence of an object, while those with low IoU receive a negative label, indicating background. The RPN loss function is then computed from the predicted objectness scores and regression offsets, compared against the ground-truth labels. After obtaining objectness scores and bounding box regressions for all anchors, non-maximum suppression (NMS) is used to remove overlapping and redundant proposals. NMS retains proposals with the highest objectness scores while suppressing others that significantly overlap with them. The remaining proposals are ranked by objectness score, and a fixed number of top-scoring proposals are selected as region of interest (ROI) candidates for further processing in subsequent stages of the Mask R-CNN pipeline, such as ROI pooling or ROI Align.
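The non-maximum suppression step can be sketched in plain NumPy as follows; the IoU threshold of 0.7 is an assumed value for illustration.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap
    it beyond the threshold, and repeat. Boxes are (x1, y1, x2, y2)."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of the top box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_thresh]
    return keep
```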
After the RPN generates the proposals, they are passed on to subsequent stages, such as ROI pooling or ROI Align. During these stages, fixed-size feature maps are extracted from the ROIs, which are subsequently used for bounding box refinement, object classification and mask generation tasks.
Algorithm for Region Proposal Network (RPN)

(i) Input: the feature maps produced by the backbone network.

(ii) Convolutional layer: the input is fed to a 3 × 3 convolutional layer. Let F be the set of intermediate feature maps obtained from this layer: F = Conv3×3(Input).

(iii) Anchor generation: generate a collection of predefined anchor boxes at every spatial location on the intermediate feature maps. Let A be the set of anchor boxes generated at every spatial location: A = GenerateAnchors(F).

(iv) Convolutional layers for classification and regression: apply separate 1 × 1 convolutional layers on the intermediate feature maps to predict two sets of outputs for each anchor box.
(a) Classification: a 1 × 1 convolutional layer with 2K filters predicts the foreground/background class probabilities for each anchor box. Let P be the predicted probability of each anchor box belonging to the foreground class: P = Conv1×1(F).
(b) Regression: a 1 × 1 convolutional layer with 4K filters predicts the bounding box adjustments (offsets) required to refine the anchor boxes. Let T be the predicted bounding box adjustments for each anchor box: T = Conv1×1(F).
Here, K denotes the number of anchor boxes generated at each spatial location.

(v) Non-maximum suppression (NMS): apply NMS on the predicted bounding boxes to remove duplicate and overlapping proposals. NMS selects the most confident proposals while suppressing others based on a predefined threshold: R = NMS(P, T). Here, R represents the set of final region proposals after applying non-maximum suppression.
(vi) Output: the output of the Region Proposal Network algorithm is the set of region proposals R, which serves as input to the subsequent stages of object detection.
This algorithm presents a high-level overview of the Region Proposal Network. It outlines the process of generating anchor boxes, classifying them by objectness, regressing bounding box adjustments, and applying non-maximum suppression to choose the final region proposals.

Region of interest (RoI) align
ROI Align is a technique employed in the Mask R-CNN methodology to extract accurate and detailed features from region proposals. It effectively addresses the challenge of misalignment that arises with traditional ROI pooling methods, which can result in information loss and diminished performance in tasks like instance segmentation.
In Mask R-CNN, the Region Proposal Network (RPN) generates region proposals that identify potential objects within an image. These proposals are then processed through a network to extract features. Traditional ROI pooling methods discretize the region proposals into a fixed-size grid on the feature map, resulting in misalignments between the extracted features and the original image. ROI Align was introduced to overcome this issue and enhance the precision of feature extraction.
The ROI Align technique involves subdividing the region proposals into smaller spatial bins and applying bilinear interpolation within each bin to compute feature values at fractional positions. These interpolated feature values are then extracted and concatenated to form aligned features for each bin, which are subsequently utilized for tasks such as object classification, bounding box regression, and mask prediction.
ROI Align offers several advantages over conventional ROI pooling. By employing bilinear interpolation, it preserves precise spatial information, mitigates misalignment, and enables feature extraction at fractional positions within each spatial bin, providing subpixel accuracy in localization. The fine-grained features obtained through ROI Align contribute significantly to improved mask generation and enhanced performance in object detection and instance segmentation tasks.
Algorithm for Region of Interest (RoI) Align

(i) For each RoI:
(a) Extract the RoI coordinates (x, y, w, h).
(b) Divide the RoI into a grid of fixed spatial bins (e.g., 2 × 2, 3 × 3).
(c) Determine the size of each spatial bin (s = RoI width / grid width).
(d) Initialize an empty tensor to store the RoI-aligned features.
(e) For each bin (i, j) in the grid:
(A) Compute the bin's sub-region coordinates (xmin, ymin, xmax, ymax) within the RoI.
(B) For every channel c of the feature map, extract the features corresponding to the four corners of the bin and compute the bilinear interpolation weights for each corner:
alpha = (xmin − floor(xmin)) / s
beta = (ymin − floor(ymin)) / s
gamma = 1 − alpha
delta = 1 − beta
(C) Interpolate the features using these weights.
(D) Append the interpolated feature to the RoI-aligned features tensor.
(ii) Flatten the RoI-aligned features tensor.
(iii) Return the RoI-aligned feature representations for all RoIs.

The ROI Align operation is an integral part of the Mask R-CNN framework, specifically designed to extract accurate and fine-grained features from region proposals. It plays a crucial role in enhancing the feature extraction process, leading to notable improvements in instance segmentation, mask generation and object detection.
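The bilinear sampling at the heart of RoI Align can be sketched for a single channel and a single fractional position as follows; looping over bins, sampling points and channels is omitted for brevity.

```python
import numpy as np

def bilinear_sample(feature, x, y):
    """Sample a 2-D feature map (indexed [row, col]) at fractional (x, y)
    by blending its four integer-grid neighbours."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, feature.shape[1] - 1)
    y1 = min(y0 + 1, feature.shape[0] - 1)
    ax, ay = x - x0, y - y0  # fractional offsets act as interpolation weights
    top = (1 - ax) * feature[y0, x0] + ax * feature[y0, x1]
    bottom = (1 - ax) * feature[y1, x0] + ax * feature[y1, x1]
    return (1 - ay) * top + ay * bottom
```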
ROI Align seamlessly integrates into the Mask R-CNN architecture during the phase where region proposals are processed. By replacing the quantization step used in traditional ROI pooling methods with bilinear interpolation, ROI Align effectively addresses the challenge of misalignment and significantly improves the accuracy of feature extraction.
In the ROI Align process, region proposals are divided into smaller spatial bins, and within each bin, bilinear interpolation is applied to compute feature values at fractional positions. These interpolated feature values are then extracted and combined to create aligned features for each bin, which are subsequently employed for object classification, bounding box regression, and mask prediction.
The effectiveness of ROI Align has been demonstrated through empirical evaluations, showcasing its superiority over traditional ROI pooling methods, especially in scenarios where precise localization and fine-grained feature extraction are critical. It offers improved accuracy by precisely aligning spatial information and enables subpixel accuracy in feature extraction within each spatial bin, contributing to more accurate and detailed instance mask predictions.
Although ROI Align comes with an additional interpolation step, making it computationally more expensive than ROI pooling, the benefits it provides in terms of performance justify this computational cost. After training and evaluation, the trained model can generate object detection and segmentation outputs, which can be applied in various domains such as autonomous driving, robotics, or medical imaging.

Experimental setup
The experimentation for building extraction using Mask R-CNN on the Wuhan University (WHU) dataset involves a systematic series of steps: data preparation, model selection and configuration, model training, and performance evaluation. The aim is to accurately identify buildings in satellite or aerial imagery.
The initial step involves acquiring the dataset, which is then split into distinct subsets for training and testing. It is crucial to ensure that both subsets represent a wide range of building types, sizes, and orientations. A Mask R-CNN model pre-trained on a large dataset such as COCO or ImageNet is selected for the task, and its weights and configuration files are downloaded for further use.
To set up the framework, the Python environment is configured with the necessary libraries, such as TensorFlow, Keras, NumPy, and OpenCV. An implementation of the Mask R-CNN framework, such as the Matterport Mask R-CNN repository, is installed to leverage its functionality. Data pre-processing, such as resizing the images while maintaining the aspect ratio, and image augmentation methods such as rotation, flipping and color distortion were performed. Ground truth annotations are converted into binary masks, labeling pixels within building boundaries as foreground and the remainder as background. The dataset is partitioned into training and validation subsets to support model training and evaluation.
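A sketch of the pre-processing just described, using the OpenCV and NumPy libraries listed above; the target size, padding scheme and augmentation probabilities are assumptions for illustration.

```python
import cv2
import numpy as np

def preprocess(image, mask, size=512):
    """Aspect-preserving resize with padding, normalization, and simple
    flip/rotate augmentation for an image and its binary building mask."""
    h, w = image.shape[:2]
    scale = size / max(h, w)
    image = cv2.resize(image, (int(w * scale), int(h * scale)))
    mask = cv2.resize(mask, (int(w * scale), int(h * scale)),
                      interpolation=cv2.INTER_NEAREST)
    pad_h, pad_w = size - image.shape[0], size - image.shape[1]
    image = cv2.copyMakeBorder(image, 0, pad_h, 0, pad_w, cv2.BORDER_CONSTANT)
    mask = cv2.copyMakeBorder(mask, 0, pad_h, 0, pad_w, cv2.BORDER_CONSTANT)
    image = image.astype(np.float32) / 255.0       # normalization
    if np.random.rand() < 0.5:                     # random horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    k = np.random.randint(4)                       # random 90-degree rotation
    return np.rot90(image, k).copy(), np.rot90(mask, k).copy()
```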
During the model configuration stage, the Mask R-CNN configuration file is modified to accommodate the specific requirements of the WHU dataset.The file is updated with the number of classes (buildings) and the image dimensions of the WHU dataset.Hyperparameters are fine-tuned based on experimentation and validation.
Model training begins by initializing the Mask R-CNN model with the pre-trained weights obtained earlier. The training dataset is loaded, and the model is trained using the images and their corresponding annotations. The training progress is monitored, and loss values are tracked. The model's performance is analyzed on the validation subset to refine hyperparameters and configurations if necessary.
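Assuming the Matterport Mask R-CNN implementation mentioned above, the configuration and fine-tuning steps can be sketched as follows; the weight file path, epoch count, image dimensions and the dataset_train / dataset_val objects (user-defined subclasses of mrcnn.utils.Dataset) are illustrative assumptions, not values prescribed here.

```python
from mrcnn.config import Config
from mrcnn import model as modellib

class BuildingConfig(Config):
    """Configuration for single-class (building) extraction on WHU-style tiles."""
    NAME = "building"
    NUM_CLASSES = 1 + 1        # background + building
    IMAGE_MIN_DIM = 512
    IMAGE_MAX_DIM = 512
    IMAGES_PER_GPU = 2

config = BuildingConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs/")

# Transfer learning: start from COCO weights, skipping the head layers
# whose shapes differ because the number of classes has changed
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# dataset_train / dataset_val: mrcnn.utils.Dataset subclasses loaded elsewhere
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE, epochs=30, layers="heads")
```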
By applying 10-fold cross-validation to the training data, the model's performance is assessed on the testing subset, and key metrics such as mean average precision (mAP), precision, recall, and F1-score are evaluated to gauge the accuracy and effectiveness of building extraction. The model's results are compared against existing benchmarks and manually generated ground truth annotations.
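Once predictions are matched to ground truth at a chosen IoU threshold, the detection metrics reduce to simple count arithmetic; a minimal sketch:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 from matched-detection counts. A prediction
    counts as a true positive when its IoU with an unmatched ground-truth
    box exceeds the chosen threshold (e.g., 0.5)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```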
In the result analysis phase, the model's output is visualized by overlaying predicted masks on input images and comparing them with the ground truth annotations. The model's performance in terms of correctly identified buildings, false positives, and false negatives is assessed. The impact of different hyperparameters and training configurations on the model's performance is analyzed. Areas for improvement are identified, and potential refinements to the training process or model architecture are considered.
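Overlaying predicted masks for visual inspection takes only a few NumPy operations; here masks is assumed to be the (H, W, N) boolean array produced by the Matterport model's detect() call, and the colour and opacity are arbitrary choices.

```python
import numpy as np

def overlay_masks(image, masks, color=(0, 255, 0), alpha=0.4):
    """Blend the union of N instance masks onto an RGB image."""
    out = image.astype(np.float32).copy()
    union = masks.any(axis=-1)   # pixels covered by any predicted instance
    out[union] = (1 - alpha) * out[union] + alpha * np.array(color, np.float32)
    return out.astype(np.uint8)
```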
By employing this systematic experimental setup, accurate and efficient building footprint extraction using Mask R-CNN on the WHU dataset is achievable. The documented methodology, experimental settings, and results contribute significantly to advancing building extraction techniques, providing valuable insights for future research. Moreover, future research may explore integrating additional data sources like LiDAR for enhanced accuracy. Extending the proposed work to analyze various urban elements and combining Mask R-CNN with techniques like 3D modeling holds promise for advancing precise building extraction and uncovering new prospects in this dynamic domain.

Dataset description
To train and evaluate the building extraction model, we utilize the WHU and INRIA building datasets. The Wuhan University (WHU) building dataset has gained recognition within the building extraction domain and comprises high-resolution satellite images along with corresponding binary masks that align with the images and indicate the precise location of building pixels. The WHU dataset encompasses over 8,188 individual buildings extracted from high-resolution aerial images of size 512 × 512 pixels with a spatial resolution of 0.3 meters, covering an area of 450 square kilometers in Christchurch, New Zealand. The training, validation and test sets contain 4,736, 1,036, and 2,416 images, respectively.
Similarly, the INRIA building dataset serves as another popular dataset for building footprint extraction from satellite data. It contains high-resolution images captured over the city of Austin, Texas, USA, accompanied by pixel-level annotations indicating the presence or absence of buildings. The dataset comprises a total of 360 images, with 180 images in the training set and 180 in the test set, each of 1500 × 1500 pixels at 0.3 meters spatial resolution. These images encompass diverse urban areas with varying building densities and sizes. The pixel-level annotations are binary masks with the same dimensions as the corresponding images, distinguishing between building and non-building regions.
The dataset is divided in a 60:20:20 ratio, allocating 60% of the data for training, 20% for validation, and 20% for testing. During model training, progress and tracking metrics such as loss and performance are monitored. Upon observing fluctuations in both training and validation loss, we applied dropout regularization, effectively resolving the overfitting issue. Figure 7 shows the loss incurred by Mask R-CNN after applying regularization.
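The 60:20:20 partition described above can be sketched as a shuffled index split; the random seed is arbitrary.

```python
import numpy as np

def split_dataset(paths, seed=42):
    """Shuffle image paths and split them 60:20:20 into train/val/test."""
    paths = np.random.default_rng(seed).permutation(paths)
    n = len(paths)
    return (paths[: int(0.6 * n)],
            paths[int(0.6 * n): int(0.8 * n)],
            paths[int(0.8 * n):])
```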
Both the WHU and INRIA datasets have been widely utilized in building extraction and segmentation research, and are commonly employed to evaluate deep learning models' performance. Input images of the WHU dataset are illustrated in figure 3, providing visual representations of the dataset samples. Figure 4 showcases the ground truth images of the WHU dataset, and figure 5 displays the predicted images generated by Mask R-CNN. The comparative analysis of the various models is presented in figure 6.

Evaluation metrics
Mask R-CNN's performance is assessed by calculating precision, recall and mean average precision (mAP). Equation (1) computes precision from the true positive (TP) and false positive (FP) counts, and equation (2) computes recall from the TP and false negative (FN) counts:

Precision = TP / (TP + FP)    (1)
Recall = TP / (TP + FN)    (2)

The mean average precision is calculated using equation (3), where m represents the total count of object classes in detection:

mAP = (1/m) Σ_{i=1}^{m} AP_i    (3)

mAP serves as a widely used metric for evaluating a model's performance across different levels of overlap between the ground-truth and predicted values in detecting objects. It computes the average precision (AP) for each class (AP_i), which involves calculating the area under the precision-recall curve specific to that class. mAP is typically calculated at various IoU thresholds, such as 0.5 and 0.75, and ranges from 0 to 1, with higher values indicating better performance.

Loss functions

Classification loss: the classification loss plays an important role in training the network to accurately assign class labels to each region of interest (RoI). The network's output logits are passed through a softmax function, which yields class probabilities, and the widely utilized cross-entropy loss quantifies the disparity between the predicted and actual class probabilities:

L_cls = − Σ_c g_c log(p_c)    (4)

In equation (4), g represents the ground-truth class label (a binary indicator) for a given RoI, and p represents the predicted class probability for the corresponding class; the summation is taken over all classes.

Bounding box regression loss: the precision of the predicted bounding box coordinates for each RoI is enhanced by the bounding box regression loss. By predicting offsets (deltas) that adjust the bounding box coordinates relative to default anchor boxes, the network refines its predictions. The loss quantifies the disparity between the ground-truth bounding box coordinates and the predicted values, enabling the network to learn accurate object localization:

L_box = Σ_RoI SmoothL1(Δp − Δg)    (5)

In equation (5), Δp represents the predicted bounding box regression offsets for a given RoI, while Δg denotes the corresponding ground-truth regression targets; the summation is taken over all RoIs. Equation (6) describes the smooth L1 loss, which is quadratic for small errors and linear for large ones:

SmoothL1(x) = 0.5 x² if |x| < 1, |x| − 0.5 otherwise    (6)

Mask prediction loss: the mask prediction loss is crucial for training the network to generate accurate pixel-level segmentation masks for every RoI. For each RoI, the network generates a binary mask indicating the presence or absence of an object at each pixel location, and the loss is the per-pixel binary cross-entropy between the predicted and ground-truth masks:

L_mask = − Σ_pixels [m_k log(p_m) + (1 − m_k) log(1 − p_m)]    (7)

In equation (7), m_k represents the (binary) ground-truth mask for a given RoI, and p_m represents the predicted mask probability at each pixel location; the summation is taken over all pixels in the mask.

Total loss: in Mask R-CNN, the total loss encompasses the classification loss, the bounding box regression loss and the mask prediction loss, serving as a comprehensive measure of the network's performance. The individual contributions are combined using suitable weighting factors that reflect their respective significance, providing a unified objective function that guides learning towards accurate object classification, precise bounding box refinement, and high-quality mask generation.
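The loss components above can be sketched in TensorFlow as follows; the equal weighting factors are an assumption, since the text only states that suitable weights combine the three terms.

```python
import tensorflow as tf

def smooth_l1(x):
    """Smooth L1 of equation (6): quadratic near zero, linear elsewhere."""
    absx = tf.abs(x)
    return tf.where(absx < 1.0, 0.5 * tf.square(x), absx - 0.5)

def total_loss(cls_logits, cls_gt, box_pred, box_gt, mask_pred, mask_gt,
               w_cls=1.0, w_box=1.0, w_mask=1.0):
    """Combined multitask loss: classification (4), box regression (5)-(6)
    and per-pixel mask cross-entropy (7)."""
    l_cls = tf.keras.losses.sparse_categorical_crossentropy(
        cls_gt, cls_logits, from_logits=True)
    l_box = tf.reduce_sum(smooth_l1(box_pred - box_gt), axis=-1)
    l_mask = tf.keras.losses.binary_crossentropy(mask_gt, mask_pred)
    return (w_cls * tf.reduce_mean(l_cls)
            + w_box * tf.reduce_mean(l_box)
            + w_mask * tf.reduce_mean(l_mask))
```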

Comparative analysis
In conducting our research, we undertook a thorough comparative analysis of various instance segmentation models, with a particular focus on our proposed system.Among the models scrutinized, YOLOv5 stood out as a specialized solution tailored for real-time object detection in both video streams and images.Renowned for its proficiency in detecting small objects and achieving high accuracy, even with limited training examples, YOLOv5 proved to be a robust contender in our evaluation.
Another model we assessed was YOLOv7, distinguished by its higher-resolution processing capabilities, handling images at 608 × 608 pixels. This elevated resolution enables effective detection of smaller objects, leading to higher overall accuracy. However, it is noteworthy that YOLOv7, while incorporating innovative transformer architectures for computer vision tasks, did not surpass the other YOLO models in terms of accuracy in our study.
In our research work, we trained and tested Mask R-CNN on satellite images, allowing us to compare and analyze its output against other deep learning models, including R-CNN, Fast R-CNN, Faster R-CNN, YOLOv5, YOLOv7, and YOLOv8. To gauge performance, we monitored the total loss of Mask R-CNN and conducted a comprehensive comparative analysis of the building extraction results obtained from the various models, summarized in table 1. While the YOLO models achieve competitive mAP, Mask R-CNN surpasses them in both detection and segmentation. Its backbone with skip connections addresses vanishing gradients, allowing deeper networks; RoI Align tackles quantization errors with bilinear interpolation, extracting features from high-dimensional feature maps; and deconvolutional layers progressively increase feature map resolution. Mask R-CNN excels in both object localization and pixel-level segmentation, making it the optimal choice for applications requiring detailed building footprint extraction. It outperformed R-CNN, Fast R-CNN, Faster R-CNN, YOLOv5, YOLOv7, and YOLOv8, achieving the highest F1 score of 81.2 and mAP50 of 79.4 on the WHU dataset, and an F1 score of 79.4 and mAP50 of 78.6 on the INRIA dataset. Compared to manual digitization and traditional image processing algorithms, Mask R-CNN offers a more scalable and precise solution. Our findings highlight Mask R-CNN's superior effectiveness in building footprint extraction and its significant contribution to the remote sensing and computer vision literature.
Mask R-CNN, an extension of Faster R-CNN, outperforms its predecessors R-CNN [25], Fast R-CNN [26], and Faster R-CNN [27,28] in both architecture and performance. Unlike R-CNN's sequential processing of region proposals or Fast R-CNN's reliance on external methods for proposal generation, Mask R-CNN integrates a Region Proposal Network (RPN) and a parallel segmentation mask prediction branch, enabling end-to-end training and simultaneous instance segmentation. This architectural enhancement allows Mask R-CNN to achieve state-of-the-art performance in object detection and instance segmentation tasks, making it well suited for building footprint extraction. Figure 7 shows the total loss incurred by Mask R-CNN.

Conclusion
Highlighting the positive aspects of our study, we have successfully employed Mask R-CNN as an innovative solution for extracting buildings from satellite imagery. This approach effectively addresses the issue of delineating building boundaries by utilizing instance segmentation. Moving forward, several challenges in the building extraction domain require focused attention. One involves enhancing Mask R-CNN's performance on large-scale datasets characterized by diverse image resolutions, lighting conditions, and environmental factors. Additionally, addressing false positives and false negatives in building extraction is crucial; this can be achieved by exploring different network architectures and optimizing hyperparameters. Another significant challenge is effectively detecting and segmenting small and irregularly-shaped buildings, which often necessitates complex networks and higher-resolution images to capture their intricate details accurately.
However, despite its promising performance, the Mask R-CNN model encountered some challenges and limitations during evaluation. The main challenge was the computational resource demand associated with training and testing the model on large-scale datasets with high-resolution imagery. Training deep learning models like Mask R-CNN requires substantial computational resources and time, which may limit its scalability and accessibility, particularly on devices with limited computational power. Moreover, the model's performance depends heavily on the availability of ample training data, posing challenges in scenarios where data scarcity or acquisition difficulties exist. Furthermore, Mask R-CNN exhibits sensitivity to image quality, potentially struggling with low-resolution or noisy images, complicating tasks where image quality varies.
Nevertheless, despite these challenges, Mask R-CNN has exhibited promising results for identifying building footprints from aerial and satellite imagery. Its utilization of region-based convolutional neural networks and instance segmentation has facilitated accurate building detection and pixel-level segmentation, and the challenges identified provide valuable insights for further refinement and optimization in the application of Mask R-CNN to building extraction tasks.

In conclusion, the successful implementation of Mask R-CNN marks a significant stride in the field of building extraction, showcasing its potential for accurate and detailed identification of structures from satellite imagery. The challenges identified present opportunities for continued research and improvement, emphasizing the need for innovative solutions to enhance the model's robustness and applicability in various scenarios. The proposed work has demonstrated the potential of Mask R-CNN in diverse applications, including urban planning, disaster management, and environmental monitoring. Mask R-CNN can automate building extraction from imagery, aiding urban planning and disaster management with accurate building footprints; it supports environmental monitoring by tracking urbanization impacts and helps in creating precise cadastral maps. The model also enhances smart city initiatives with real-time infrastructure monitoring and benefits the real estate and insurance sectors by offering accurate property data for valuation and risk assessment.

Future research could explore the integration of additional data sources such as LiDAR to improve accuracy. Moreover, extending this work to other objects of interest, such as roads, bridges, and vegetation, could enable comprehensive analyses of the urban environment, and combining Mask R-CNN with techniques such as 3D modeling and point cloud analysis holds promise for achieving even more precise building extraction. Overall, the current work lays a solid foundation for building extraction using Mask R-CNN, successfully addressing key challenges such as boundary delineation and small, irregular structures, and unveiling new prospects for continued exploration and advancement within this dynamic domain.

Figure 2. Workflow diagram of Mask R-CNN.

Figure 6. Comparative analysis of results from different models.

Table 1. Comparative analysis of building extraction results using various models.