Tackling the small data problem in medical image classification with artificial intelligence: a systematic review

Though medical imaging has seen growing interest in AI research, training models requires a large amount of data. In this domain, only limited datasets are available, as collecting new data is either not feasible or requires burdensome resources. Researchers therefore face the problem of small datasets and have to apply techniques to fight overfitting. 147 peer-reviewed articles published in English up until 31 July 2022 were retrieved from PubMed and assessed by two independent reviewers. We followed the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) guidelines for paper selection, and 77 studies were regarded as eligible for the scope of this review. Adherence to reporting standards was assessed by using the TRIPOD statement (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis). To solve the small data issue, transfer learning, basic data augmentation and generative adversarial networks were applied in 75%, 69% and 14% of cases, respectively. More than 60% of the authors performed a binary classification, given the data scarcity and the difficulty of the tasks. Concerning generalizability, only four studies explicitly stated that an external validation of the developed model was carried out. Full access to all datasets and code was severely limited (unavailable in more than 80% of studies). Adherence to reporting standards was suboptimal (<50% adherence for 13 of 37 TRIPOD items). The goal of this review is to provide a comprehensive survey of recent advancements in dealing with small medical image sample sizes. Greater transparency and quality in publications, as well as adherence to existing reporting standards, are also advocated.


Introduction
Data-driven intelligent models have gained immense popularity in recent years, achieving satisfactory performance. The essence behind these achievements is that the behavior in unknown domains can be accurately estimated by quantitatively learning the latent patterns behind the data from sufficient training samples [1,2].
Researchers nowadays are capable of designing and developing network structures with even more and wider layers than before, thanks also to the availability of much more powerful computational resources. The trend in artificial neural networks points towards the idea that deeper or more complicated networks perform better. However, these techniques are built on the assumption of sufficiently large data samples for appropriate model training, i.e. Big Data. Usually, the term Big Data indicates a massive volume of data that is too large or complex to be effectively analyzed using traditional software [3,4].
In numerous real-world applications, the number of samples in a dataset can be relatively limited, constrained by complexity, ethics, high cost or the difficulty of obtaining data in practice, leading to a sharp decrease in the performance of deep learning models. This is the main restriction of deep learning models: they need tens of thousands of well-labeled samples for training. This Small Data challenge calls for a completely different approach from the existing Big Data one, and the axiom 'the deeper and wider we go, the better the performance' is no longer as robust [3]. The limited quantity of available data prevents the use of large models: indeed, training smaller models is a safer choice since they are less prone to overfit the data. Very large models, if not properly regularized, tend to memorize the whole dataset, causing serious overfitting and poor generalization ability of the model [5]. In fact, the small data challenge is not only about the size of the training database in absolute terms: when the training data are deficient, the learned feature representations are limited and the model fits well only on the training data. It is also essential to contemplate the small data issue in relative terms, with respect to the complexity of the model to be trained. A large, deep and complex learning algorithm with millions of free parameters to optimize can obtain an effective knowledge of the available dataset, achieving good training performance, albeit at the expense of heavily parameterizing the available data and losing model generalizability.
Another aspect that needs to be brought into view concerns the quality of the data. In the clinical context, only expert physicians can give high-quality sample annotations, and producing such large amounts of annotated data will inevitably be laborious, costly and time-consuming. This prevents the creation of sufficiently large samples in most cases [4,6]. In this perspective, the small sample size issue is of particular interest when neural networks are applied to medical images, including MRI, CT, dose distributions, ultrasounds, and histopathological images, which often have limited sample size restricted by the availability of the patient population, the scarcity of annotated datasets and experts' labeling. In general, for medical images, high-quality annotated datasets are scarce and require specialized medical knowledge, standardized protocols and considerable time and effort. In this regard, labeling of data by domain experts is still one of the key issues, and it may often take more time and effort than the algorithm development itself. Moreover, the intrinsic heterogeneity of retrospective data accumulated in daily clinical practice creates a trade-off between the quality and the size of datasets, which range from a few dozen to a few hundred patients [7,8].
Moreover, constructing sufficiently large datasets in the field of medical imaging is difficult due to patient privacy and regulations. For this reason, starting multicenter studies is often a difficult path to take, and individual clinical centers try to train, validate and test artificial intelligence algorithms with the few available data. But a small sample size from a single study database imposes fundamental limits. Deep learning techniques generally require more than a million samples to train without overfitting. However, another important aspect present in clinical studies must also be emphasized. In this context, rare diseases are often studied and therefore lack data per se, or studies have to deal with classes or categories that are numerically very unbalanced [9]. Consequently, many deep learning researchers agree that a small sample size is insufficient to test the effectiveness of a proposed method. In recent years, some international competitions have released richly labeled medical images, which provide a potential data source to train models specific to medical applications.
The small data issue can mainly be addressed with two approaches: data augmentation and transfer learning/domain adaptation. These methods try to expand the data volume, but in different fashions. The first is based on the generation of new synthetic data from the available data, while the second relies on knowledge learned from other domains. These methods can effectively improve the results and reduce the data size requirement in order to overcome the Small Data challenge. They are illustrated in detail below.

Data augmentation.
The data augmentation-based strategy aims to synthetically and artificially increase the number of available samples for training deep learning models by mimicking the distribution of the original dataset, providing more general information from the dataset to solve the small data problem. It is a data preprocessing method and a type of regularization which can effectively improve the performance of the model, reducing the possibility of overfitting [10,11].
Two very simple augmentation processes are generally employed: gray level disturbance and shape disturbance. In the first case, Gaussian noise or a similar perturbation is added to the original images. In the second, the data are increased by oversampling images with translations, rotations, brightness modification, rescaling, flipping, shearing or stretching and other affine transformations. In general, the idea behind these operations is that they will assist the learning algorithm to acquire more comprehensive and robust features, which will then be useful in conditions where the data could be incomplete and/or noisy, favoring generalization.
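As an illustration of these basic augmentation processes, the following is a minimal sketch assuming a PyTorch/torchvision pipeline; the specific transforms and parameter values are illustrative assumptions, not taken from any of the reviewed studies.

```python
# Minimal sketch of basic gray level and shape disturbances, assuming a
# PyTorch/torchvision pipeline (parameter values are illustrative only).
import torch
from torchvision import transforms

basic_augmentation = transforms.Compose([
    transforms.RandomRotation(degrees=15),            # shape disturbance: rotation
    transforms.RandomAffine(degrees=0,
                            translate=(0.05, 0.05),   # translation
                            scale=(0.9, 1.1),         # rescaling
                            shear=5),                 # shearing
    transforms.RandomHorizontalFlip(p=0.5),           # flipping
    transforms.ColorJitter(brightness=0.2),           # brightness modification
    transforms.ToTensor(),
    # gray level disturbance: additive Gaussian noise on the tensor image
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),
])
```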
A more objective and promising technology that has recently been introduced for data augmentation is the generative adversarial network (GAN), which involves generative models and adversarial learning [12,13]. The GAN attempts to approximate the true data distribution through a minimax game between two subnetworks in competition with each other, called the discriminator and the generator. The generator attempts to create data samples as similar as possible to the true data, while the discriminator seeks to distinguish true from fake-generated samples. The two subnetworks evolve together during training; the generator tries to deceive the discriminator by improving its output more and more, in other words, it learns to approximate the distribution of the original data better and better. Thereby, completely new synthetic data samples can be generated and used for training in the main task. In general, as a generative model, a well-trained GAN is used to provide additional fake, synthetic samples that have the same distribution as the original training data [14][15][16][17].
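To make the adversarial game concrete, the following is a minimal, self-contained sketch of one GAN training step in PyTorch; the fully-connected generator and discriminator, their sizes and the hyperparameters are purely illustrative assumptions and are not drawn from the reviewed studies.

```python
# Minimal sketch of the GAN minimax game; architectures and sizes are assumptions.
import torch
import torch.nn as nn

latent_dim, img_dim = 64, 28 * 28
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, img_dim), nn.Tanh())            # generator
D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())               # discriminator

bce = nn.BCELoss()
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real_images):
    batch = real_images.size(0)
    real_labels, fake_labels = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator: learn to distinguish true samples from generated ones
    fake_images = G(torch.randn(batch, latent_dim)).detach()
    loss_D = bce(D(real_images), real_labels) + bce(D(fake_images), fake_labels)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator: try to make the discriminator label fakes as real
    loss_G = bce(D(G(torch.randn(batch, latent_dim))), real_labels)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```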
Transfer learning. Another possible way to solve the small sample size problem is transfer learning, that is, to use a pre-trained network, which cleverly applies knowledge gained from a source domain to facilitate the learning problem in a partially related or unrelated target domain. Transfer learning provides an effective framework for deep learning with small datasets; it pretrains a model by using an existing massive dataset and then uses the trained model either as an initialization or as-is for a new task [18][19][20].
The idea is to initialize the neural network with the weights trained on some previous task and to fine-tune the parameters within the current task when the current task has insufficient training data. This approach provides a reasonable initial state and may speed up training, slightly differing from the traditional learning process, which tries to learn each task from scratch. There are three different approaches to reuse the parameters (weights and biases) of a pre-trained network: (1) reusing the parameters of the pre-trained deep neural network directly to initialize the new network and fixing them without retraining, called freezing; (2) reusing the parameters of the pre-trained deep neural network directly to initialize the new network and fine-tuning the parameters using target domain data, called fine-tuning; (3) initializing network parameters randomly and tuning the parameters using target domain data, called random initialization and training [1].
The source domain can pertain to a sphere connected with the target task as well as to a completely different one. As a matter of fact, most studies have made use of models pretrained on the large-scale ImageNet database [21], containing 1.2 million natural images. Models trained on ImageNet have a strong capability for feature extraction. Thus, they are suitable to be transferred to other contexts having a small number of image data and can produce significantly better performance than shallow algorithms. Such a strategy reduces the need and effort to collect a large training dataset, saving data resources and training time. Transfer learning could be very effective in the field of medical images, where pretraining can mitigate the drawback of needing very large labeled datasets and can prove very useful in building complex and robust models. In general, the use of deep neural networks even with small data samples is made possible by pre-training on data-rich domains that share affinities in statistical properties with the target dataset [22][23][24].
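As a concrete illustration of the freezing and fine-tuning strategies with an ImageNet-pretrained backbone, the following is a minimal sketch assuming a PyTorch/torchvision setup; the choice of ResNet-50 and of a two-class head is an assumption for demonstration only.

```python
# Minimal sketch of "freezing" vs "fine-tuning" with an ImageNet-pretrained
# backbone (ResNet-50 and the binary head are assumptions, not a prescription).
import torch.nn as nn
from torchvision import models

def build_transfer_model(num_classes: int, strategy: str = "fine-tuning"):
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

    if strategy == "freezing":
        # reuse pretrained weights as a fixed feature extractor
        for param in model.parameters():
            param.requires_grad = False

    # replace the ImageNet head with a task-specific classifier;
    # the new layer is always trainable on the (small) target dataset
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

# e.g. binary classification, the most common setting among the reviewed studies
model = build_transfer_model(num_classes=2, strategy="fine-tuning")
```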
The aim of this work, and the related research question, is to present a systematic review providing an overview of the state of the art of deep learning research for clinical applications on small samples and highlighting the different strategies for working in this scenario. Specifically, we sought to describe the study characteristics, and to evaluate the methods and the quality of reporting and transparency of deep learning studies that compare diagnostic algorithm performance with the ground truth.

Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA)
This manuscript has been prepared according to the guidelines and a checklist is available in the supplementary material [25].

Literature search and inclusion criteria
We performed a comprehensive search by using free text terms for various forms of the keywords 'small', 'database' and 'deep learning' to identify eligible studies. The PubMed MEDLINE database was thoroughly searched to identify original research articles that investigated the performance of AI algorithms analyzing small medical image samples. We used the following search query: ('small' OR 'limited') AND ('sample' OR 'samples' OR 'database' OR 'databases' OR 'dataset' OR 'datasets' OR 'data sample' OR 'data samples') AND ('medical images' OR 'medical imaging') AND ('artificial intelligence' OR 'radiomics' OR 'machine learning' OR 'deep learning') AND ('classification' OR 'prediction' OR 'clustering'). The PubMed search engine was queried without imposing time filters (literature search updated until 31 July 2022).
We selected publications for review if they satisfied several inclusion criteria: a peer-reviewed scientific report of original research; written in English; assessing a deep learning algorithm applied to a clinical problem in medical imaging; applying AI techniques to declared small datasets; and comparing algorithm performance with the ground truth.
We included studies when the aim was to use medical imaging for predicting absolute risk of existing disease or classification into diagnostic groups (e.g. disease or non-disease). In machine learning, regression and classification are closely related concepts in that they both involve making predictions from data and they both play crucial roles in medical image analysis. Even though they can be used together in a cascaded or integrated approach, these two procedures differ in terms of their objectives and the nature of the output they produce. Regression aims to predict a continuous numerical value as the output. In the context of medical image analysis, regression models provide quantitative information about various aspects of patient health like tumor size, bone mineral density, blood flow quantification, etc. Classification, on the other hand, focuses on assigning inputs to predefined categories or classes. In medical image analysis, these classes might represent different diseases or conditions (normal or abnormal, malignant or benign). Fundamentally, regression is about predicting a quantity and classification is about predicting a label. Since in the final analysis they can be considered two very distinct tasks, and the choice between them depends on the nature of the assignment and the information required from the analysis, we decided to focus only on the classification task, to narrow the research and obtain a more homogeneous set of results that allows us to reach more rigorous assessments.
We defined medical images as radiologic images and other medical photographs (e.g. endoscopic images, retinal images, pathologic photos, and skin photos) and did not consider any line art graphs that typically plot unidimensional data across time, for example, electrocardiograms and A-mode ultrasound. Case reports, review articles, editorials, letters and comments were left out. Exclusion criteria also included AI algorithms that performed image-related tasks other than direct diagnostic decision-making, such as image segmentation, database description and data preprocessing management.

Screening of collected studies
After removal of clearly irrelevant records, two reviewers independently screened abstracts for potentially eligible studies. Abstracts with any degree of ambiguity or that generated differences in opinion between the two reviewers were re-evaluated at a consensus meeting, to which a third reviewer was invited.
The admissibility of the full-text articles was then assessed by the same reviewers, who then extracted the data from the study reports. After this second screening, articles belonging to one of the following categories were excluded: methodological works, object detection tasks, focus on explainability, and out-of-topic works.

Adherence to reporting standards-transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD)
We evaluated the quality of the studies according to the TRIPOD statement [26]. This statement rates the transparency of the reporting of a prediction model study regardless of the survey methods used and in all medical settings [27]. It is composed of a 22-item checklist (37 total points when all subitems are included), which analyzes the development, validation, or updating of a prediction model, whether for diagnostic or prognostic purposes. The aim was to assess whether studies broadly conformed to the reporting recommendations included in TRIPOD, and not the detailed granularity required for a full assessment of adherence [28].

Data synthesis and analysis
Aware of the heterogeneity of specialties, metrics and outcomes, we reported in table 1 the basic qualitative and quantitative characteristics such as anatomical region, AI technique, sample size, number of classes, best performance, type of images, programming language, and code and database sharing.
A two-sided Mann-Whitney-Wilcoxon statistical test was conducted with Bonferroni correction, and an alpha value of 0.05 was used to determine significance.
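For illustration, a minimal sketch of such a comparison using SciPy is given below; the function name and the example groups are placeholders, not the actual review data.

```python
# Minimal sketch of a two-sided Mann-Whitney-Wilcoxon test with Bonferroni
# correction, assuming SciPy; the group values are placeholders.
from scipy.stats import mannwhitneyu

def compare_groups(group_a, group_b, n_comparisons, alpha=0.05):
    """Two-sided Mann-Whitney-Wilcoxon test with Bonferroni-corrected alpha."""
    stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
    corrected_alpha = alpha / n_comparisons      # Bonferroni correction
    return p_value, p_value < corrected_alpha

# e.g. comparing AUC values of studies with vs. without data augmentation
# p, significant = compare_groups(auc_with_aug, auc_without_aug, n_comparisons=6)
```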

Study selection
Our electronic search, carried out considering only the filter 'titles and abstracts' and last updated on 31 July 2022, retrieved 147 records. Of the 147 initially collected studies, we assessed 105 full-text articles; 28 were excluded, which left 77 works for analysis (figure 1).
Compared to existing reviews, our work is original and contemporary, and faces a very important hot topic of artificial intelligence in the field of medical images. In fact, none of the reviews retrieved by our search query detailed the problem of the small dataset. In particular, most of the reviews did not handle the problem of small datasets at all, and only six named this issue, without, however, entering into detail and discussing the topic thoroughly. In these cases, the authors of the reviews addressed how artificial intelligence is applied to a specific task, simply summarizing in the conclusions that one of the main drawbacks of the analyzed studies concerns the use of small databases for training the AI algorithms.

General characteristics
Table 1 summarizes the basic characteristics of the 77 studies. All of them concern the development and validation of a prediction model; specifically, 75 (97%) publications deal with diagnostic models and only 2 (3%) with prognostic models. Most of the works make use of deep learning techniques (86%), only 6% apply purely machine learning techniques and 8% mix both methodologies.
The top five imaging modalities are x-ray 23/77 (30%), MRI 19/77 (25%), CT 18/77 (23%), histological 9/77 (12%), and ocular images 5/77 (6%). The remaining types concern ultrasound, endoscopic, PET and SPECT images (figure 2(a)). Zooming in on the first three categories, x-ray images cover lungs (12), breast (8), skeleton (2) and adenoid (1); MR images focus on brain (13), prostate (3), knee (2) and liver (1); CT images concern lungs (10), H&N (2), colon (2), liver (2), heart (1) and brain angiography (1). As regards the number of samples in the databases, they present a distribution with an average population of 16 600 ± 45 700 samples (mean ± one standard deviation), a minimum of 16 and a maximum of 299 000 (figure 2(b)). Most of the studies develop AI techniques by exploiting the clinical images of the anatomical regions most investigated in the clinic and therefore with the greatest probability of finding adequate databases: brain, breast, lung (figure 2(c)). Furthermore, as can be expected, given the scarcity of data in small samples and the difficulty of the tasks, more than 60% of the authors perform a binary classification (figure 2(d)). Concerning reproducibility, data are public and available in 47 studies. In 25 analyses the collected data are private, and 7 operate over both types of databases. 50% of the studies managed only one repository, 31% acted on 2, 10% employed 3 databases and the rest of the publications more than three. Additional plots relating the quantity of available data to the anatomical region, the imaging technique and the dataset origin can be found in the supplementary materials (supp. figure 1).
To solve the small data issue, transfer learning techniques, basic data augmentation and GANs are applied in 75%, 69% and 14% of cases, respectively. All three methodologies are exploited simultaneously in only 8 studies, while 26 used none of these techniques. The two main metrics used are accuracy and area under the curve (AUC). The first was used in 65/77 studies to evaluate the performance of the algorithm on the test set, obtaining an average value of 0.90 ± 0.11, while the second was used in 48/77 works with an average value of 0.90 ± 0.10. Relative to transparency and sharing, code (for preprocessing of data, modeling and reproducing the evaluation) is available in only 13 studies (17%). Funding was predominantly academic (45/77, 58%) and mixed with commercial supporters in 3 cases (4%). Ten studies stated they had no funding and another 19 did not report on funding.
In the following analysis, in order to better interpret the results and since most of the works take into consideration a binary classification as mentioned before, we focused only on these studies and investigated a possible increase in the performance of AI algorithms in terms of accuracy and AUC as a function of publication year (figure 3). None of these trends reaches statistical significance, but a growing trend can be visually appreciated. This could be due to the growing use of transfer learning and data augmentation (figure 4). By comparing the performance metrics with respect to the use or not of these techniques, differences can be noted (figure 5). For both accuracy and AUC, if transfer learning, data augmentation or both techniques are exploited, the dispersion of the data is more limited, both in terms of interquartile range and whisker extension. Furthermore, even if for accuracy the median values of the distributions with and without the use of the different techniques are comparable, for the AUC the difference between these values is considerable. In point of fact, the use or not of data augmentation is statistically significant (p = 0.03).
This analysis presents potential biases and confounders, such as different methods, different tasks or different amounts of initial data that could influence the performances, due to the presence of some limitations in the available data. Below are the assumptions under which we considered the data consistent and coherent for comparison. It is very difficult to find a homogeneous set with a specific task, but all the works examined for these plots have binary classification as a common task. Furthermore, as regards models and databases, based on what was declared by the authors, the initial databases can be considered small compared to the number of parameters of the models to be optimized during training, regardless of the type and imaging modalities examined.

Adherence to reporting standards
Adherence to reporting standards of less than 50% is present in 13 of 37 TRIPOD items (figure 6). Overall, publications adhered to between 52% and 88% of the TRIPOD items: median 68%, interquartile range 61%-71%; the 5% and 95% confidence levels are 55% and 81%, respectively, corresponding to two studies below the 5% threshold and three studies above the 95% threshold.
Two items deserve deeper comment: number 1 (identify the study as developing and/or validating a multivariable prediction model, the target population, and the outcome to be predicted) with an adherence of 3%, and number 16 (report performance measures with confidence intervals for the prediction model) with an adherence of 29%. In the first case, such low adherence was found because the authors did not report in the title the words development, validation, incremental/added value (or synonyms). In the second one, the confidence interval (or standard error) of the discrimination measure and/or the measures for model calibration are often not indicated.
The full results of the TRIPOD adherence assessment form for this study are available in the online supplementary materials.
For the moment, quantity and quality have not helped to improve performances (figure 7). On the one hand, perhaps the quality of the data needs to be boosted; moreover, even if a large database is available, excellent performance is not guaranteed, because such a database probably contains greater heterogeneity by representing the real variability in a more objective way. On the other hand, having a high TRIPOD index is not a guarantee of good performances, since it mainly evaluates the reliability and transparency of the studies. Additional plots relating the performances to the quantity (available data) and the quality (TRIPOD) of the data can be found in the supplementary materials.

Discussion
We have conducted an appraisal of the methods and adherence to reporting standards. These studies are constantly increasing and are pushing more and more to introduce AI algorithms into clinical practice as quickly as possible. The potential consequences for patients of immature implementation of these systems without a rigorous evidence base could be catastrophic. For the moment, the efforts should focus on improving design, validation, transparency and sharing [82].
All the selected works declare that the database at their disposal was small and therefore limited for an optimal achievement of their objective. But as can be seen from table 1, certain databases are difficult to classify as small in absolute terms, having more than 100 000 samples. It is therefore essential to define the term 'small' in relative terms, with respect to the number of free parameters to be optimized. In this way it becomes more evident how difficult the task is of training a complex model that is prone to overfitting the data without an appropriate regularization [90].
When working with small databases, there is the risk of creating a bias in the optimized model precisely because of the few samples available, and this negatively affects its generalizability and reliability. Even if the algorithm is tested on a subset of data not used during training, if this is not handled properly, testing the algorithm on an external dataset can reveal poor performance [5,91,92].
The works we encountered are retrospective studies, and only four explicitly stated that they carried out an external validation of the developed model, meaning a completely independent database compared to the previous one, with another patient distribution, coming from a different geographical region or drawn from a real hospital database. For this reason, they should be considered only a proof of concept, and there is still a long way to go before an effective clinical implementation can be reached. There are comparisons of the AI performance with respect to clinicians, but unfortunately they are still minimal, and the very good performances obtained in silico may not lead to an effective clinical benefit, for example because of an unacceptably high false positive rate. Entering in more detail in this area, one should verify, or at least be aware of, how clinical ground truths are defined. First, because there is intra- and inter-expert variability between clinicians, and the most likely value would be that generated by a suitably large sample of experts to ensure reliability. Second, because the inclusion of non-experts is starting to take hold, especially in segmentation tasks. Such a tendency can lower the average human performance and potentially make the AI algorithm perform better than it otherwise might [93]. In this perspective, particular attention should be paid if public databases are used; however useful and sometimes essential, before throwing oneself headlong into training AI algorithms, it is better to inquire in detail about how the database was built and how the ground truths were obtained. In addition to the quantity, the quality and certifiability of the data should also begin to be considered a must.
Developing AI systems employing tens of thousands of training samples leads to onerous investments, since high-level knowledge is required to prepare such data. Therefore, designing AI algorithms that achieve high accuracy with small amounts of quality data is of great significance and an important direction of current artificial intelligence research. To overcome the main drawbacks and pitfalls in this field, reliable and efficient strategies must be considered and applied [31,47,59].
With medical images, the dimensional differences between 2D and 3D medical data present several challenges and aspects, especially when training neural networks for medical image analysis. The most obvious consideration concerns the fact that a single 3D volume can be seen as a stack of several hundreds of 2D images, which can lead to a significant increase in the amount of available data. In addition, other aspects must be taken into consideration which concern the intrinsic distinctions in the quantity of information, spatial relations and complexity.
The choice of patient classification based on 2D images, as opposed to 3D volumes, is a strategy that is taking root and spreading in the literature [2,94,95]. In many medical settings, the acquired and readily available images are typically in the form of 2D slices with a notable slice gap. This practice is prevalent in various imaging modalities such as CT and MRI. Since 3D volumes encompass a stack of consecutive slices, the main strategic advantage of adopting a 2D-based approach is the ability to leverage a larger pool of training samples for deep neural networks. Instead of considering the entire 3D volume for each patient, researchers can extract transversal 2D slices. This extraction process enables the generation of multiple training samples from a single patient, equal to the number of transversal slices available for analysis. When counting the overall number of training samples, it is feasible to go from several tens in the original dataset to thousands after slicing the patients. Consequently, the dataset for training the ML models is significantly enriched, enhancing the ability of the model to generalize and to learn from diverse perspectives within each patient's imaging data. This approach addresses the potential limitations associated with limited datasets, especially in the context of medical imaging where obtaining labeled data for training can be challenging. The increased number of training samples contributes to the robustness and adaptability of complex ML models, such as deep neural networks, contributing to more accurate and clinically relevant outcomes.
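A minimal sketch of this slicing strategy is given below, assuming volumes stored as NumPy arrays with the transversal axis first; the data layout and the patient-level labeling are illustrative assumptions.

```python
# Minimal sketch of expanding a small labelled 3D dataset into many 2D slices,
# assuming volumes of shape (slices, height, width); names are illustrative.
import numpy as np

def volumes_to_slices(volumes, labels):
    """Expand labelled 3D volumes into labelled 2D slices.

    Each transversal slice inherits the patient-level label, so a dataset of a
    few dozen patients can yield thousands of 2D training samples."""
    slice_images, slice_labels = [], []
    for volume, label in zip(volumes, labels):
        for k in range(volume.shape[0]):      # iterate over transversal slices
            slice_images.append(volume[k])
            slice_labels.append(label)
    return np.stack(slice_images), np.array(slice_labels)

# e.g. 40 patients with ~120 slices each -> ~4800 2D training samples
```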
Certainly, the above-mentioned approach is cunning, but other aspects must be kept in mind if someone chooses to disarticulate a 3D volume into 2D slices. The sheer volume of data in 3D is significantly larger than its 2D counterpart, and this can pose challenges in terms of storage, computational resources and complexity, and training time. But there are also intrinsic distinctions in the amount of information and spatial relations associated with 2D and 3D modalities. Volumetric medical images preserve spatial relationships and context that may be lost in 2D representations, which may therefore struggle to capture the continuous and detailed information necessary for certain medical tasks. Neural networks trained on 3D data can potentially catch more comprehensive information about the three-dimensional structure of anatomical features, leading to better performance in tasks requiring spatial understanding. Addressing the dimensional differences between 2D and 3D medical data requires careful consideration of the specific task, the available resources, and the nature of the medical imaging data. Developing effective ML and DL architectures and data augmentation strategies is crucial for achieving optimal performance in medical image analysis tasks.
As the systematic review revealed, researchers rely mostly on data augmentation and transfer learning. Regarding the first solution, enriching the dataset via augmentation strategies, it should be underlined that the use of affine transformations to create new (similar) versions of existing samples without adding any morphological variations cannot fully resolve the overfitting problem. The generated images become highly correlated with each other, offering modest improvement for further generalization over unseen samples. On the other hand, the spread of GANs with their astounding abilities can help to address overfitting, creating morphological variations in augmented samples while preserving the key characteristics. Ahmad et al [81] proposed a framework based on unsupervised deep generative neural networks to address the need for a large amount of medical images. They combined two generative models in the proposed framework: variational autoencoders and GANs. Artificially generated brain tumor images were used to augment the real and available images during the classifier training performed with ResNet50. By using brain tumor images generated artificially, classification average accuracy improved from 72.63%, without classic augmentation and generative images, to 96.25%, with classic augmentation and generative images. Wang et al [71] proposed an automatic classification system for subcentimeter pulmonary adenocarcinoma, combining a homemade convolutional neural network (CNN) and a GAN. For GAN-based image synthesis, the visual Turing test showed that even radiologists could not tell the GAN-synthesized from the raw images (accuracy: primary radiologist 56%, senior radiologist 65%). The experiments indicated that the GAN augmentation method improved the classification accuracy by 23.5% (from 37.0% to 60.5%) and 7.3% (from 53.2% to 60.5%) in comparison with training methods using raw and classically augmented images, respectively. Very similar results were also found by fine-tuning VGG16 under the same conditions, obtaining a classification accuracy of 37.7%, 48.3% and 60.2% for training with the raw dataset, the common augmentation dataset and the GAN-synthesized dataset, respectively. Zebin and Rezvy [16] implemented a transfer learning pipeline for classifying COVID-19 chest x-ray images. The classifier effectively distinguishes inflammation in lungs due to COVID-19 and pneumonia from those with no infection (normal). They used multiple pre-trained (on the ImageNet dataset) convolutional backbones as the feature extractor and achieved an overall detection accuracy of 88%, 94.3%, and 96.8% for VGG16, ResNet50, and EfficientNetB0, respectively, when basic data augmentation was employed. Additionally, they generated synthetic COVID-19 images with a CycleGAN to balance the three classes and then applied classic augmentation to all data. The VGG16 model fine-tuned over this expanded database produced an accuracy of 90%.
With regard to the second method, transfer learning has incredible potential and can be fully applied when researchers have neither a sufficient volume of data nor the computational resources needed to train the algorithm. The resulting models will have an excellent feature extraction capability learned from the large source datasets [31,68]. However, they must be validated, tailored and improved for the specific application to achieve optimal results. For brain tumor classification on MR images, Swati et al [2] used a pre-trained deep CNN VGG-19 model, trained on the ImageNet dataset, and proposed a block-wise fine-tuning strategy based on transfer learning, achieving the best average accuracy of 94.82% under five-fold cross-validation. They stated that thanks to transfer learning and fine-tuning it was possible to reduce overfitting and speed up convergence. Moreover, when fine-tuning only the last few layers, it was difficult for the CNN model to learn relevant medical brain MRI features from natural images. To achieve better performance, deep fine-tuning was required: as the number of layers selected for fine-tuning was gradually increased, the performance increased gradually. Hu et al [52] studied the diagnosis of prostate transition zone cancer (PTZC) versus benign prostatic hyperplasia on MRI. The deep CNN Alex-Net combined with transfer learning showed high efficacy in diagnosing PTZC on medical imaging, overcoming the challenge of limited data. Alex-Net was trained and compared between different transfer learning databases (ImageNet vs. disease-related images) and protocols (from scratch and fine-tuning). Using the model trained from scratch, the authors obtained an AUC of 0.73. The efficacy of transfer learning from natural images was limited (AUC of 0.75) but improved by transferring knowledge from the disease-related images (AUC of 0.86). Chougrad et al [51] aimed to classify mammography mass lesions as benign or malignant. To achieve this goal, they explored the importance of transfer learning and were able to fine-tune some of the most powerful CNNs (VGG16, ResNet50 and Inception v3, pre-trained on ImageNet). They also applied classic data augmentation and 5-fold cross-validation during training. Due to the deep architectures and the small datasets used, they found that fine-tuning too many layers leads to worse results; the best fine-tuning strategy was to freeze all the layers up to the last one or two convolutional blocks. The performance on the dataset used to fine-tune the model gave a test accuracy of 98.64%, 98.77% and 98.94% for VGG16, ResNet50 and Inception v3, respectively. The best performing model was also tested on an independent database and achieved 98.23% accuracy.
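As an illustration of the block-wise strategy reported by these authors (freezing all layers up to the last convolutional blocks), the following is a minimal sketch with an ImageNet-pretrained VGG16 from torchvision; the layer indices and the binary head are assumptions for demonstration, not the exact configurations used in the cited papers.

```python
# Minimal sketch of block-wise fine-tuning: freeze everything except the last
# convolutional block of a pretrained VGG16 (indices are assumptions).
import torch.nn as nn
from torchvision import models

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# VGG16's last convolutional block corresponds to the final layers of `features`;
# freeze all feature layers before it and leave the last block trainable.
for layer in model.features[:24]:
    for param in layer.parameters():
        param.requires_grad = False

# replace the ImageNet classifier head with a binary (e.g. benign/malignant) head
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, 2)
```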
Developing AI models that can learn from limited data is still an open research area; however, these techniques not only tackle the data insufficiency issue but can also provide a viable solution to the class imbalance problem, which is also an important research area.
A central aspect that needs to be further explored is how data augmentation affects bias propagation. When the augmented data does not accurately reflect the real-world distribution, the model becomes biased. Bias refers to systematic errors or prejudices that exist in data, leading to unfair or discriminatory outcomes. When data augmentation techniques are applied, they can inadvertently amplify existing biases or introduce new biases into the augmented data. Data augmentation techniques modify the original data samples, potentially altering the distribution of the training data. Jain et al [96] in a recent study pointed out that, although one expects GANs to replicate the distribution of the original data, in real-world settings with limited data, finite training time and network capacity, the generated distribution can only capture a subset of the original distribution. In this scenario, GANs generate a distribution with significantly less diversity in one or several dimensions compared to the original data, bringing along the side-effect of amplifying the bias. The authors explored how the use of synthetic data generated by GANs, which are currently used in many different fields, is sensitive to this phenomenon. They analyzed how the societal biases, like gender and skin tone, present in a dataset of faces of engineering professors collected from a selection of U.S. universities would be enhanced by using different types of GANs to generate synthetic data. The authors recommend a critical and conscious approach in the use of GANs for data augmentation. In fact, in some situations, even if the data might seem well balanced, they could be affected by some hidden bias, and the augmented data might under-represent some crucial feature of the real-world data. In those cases, the use of more reliable techniques should be considered.
Another important point that needs to be carefully investigated concerns the relationship between data augmentation and explainability. While data augmentation can significantly improve model performance by providing more varied and representative training examples, it can also have an impact on the explainability of machine learning models. Explainability refers to the ability to understand and interpret the decision-making process of a machine learning model. It is crucial in many domains where transparency, accountability, and trust are required, such as in healthcare. The impact of data augmentation on explainability can be examined from two perspectives: model interpretability and feature importance. Regarding the first, data augmentation can affect model interpretability by introducing additional complexity and non-linearity into the training process. When augmented data is used, the model is exposed to a wider range of input variations, making it more challenging to pinpoint the exact reasons for a particular decision. The transformations applied during augmentation can distort or alter the original features, making it harder to understand how the model is leveraging specific input characteristics to make predictions. Regarding the second, data augmentation can also influence feature importance analysis, which aims to identify the input features that have the most significant impact on the model predictions. By augmenting the data, the distribution and relationships between the features can change. This alteration can lead to changes in the perceived importance of certain features, as the model may rely more heavily on augmented features or combinations of features that were not present in the original dataset.
The TRIPOD analysis brought out that most studies neither shared their source code nor included enough information about the model architecture, the hyperparameters used, and the validation and evaluation methods followed to achieve such very good results. This raises questions about the obtained results. Could it be that such exciting results are associated with some methodological bias that overestimates the performance of the resulting model? Moreover, the limited accessibility of datasets and code makes it difficult to assess the reproducibility of AI research. This approach is not constructive: it affects external validity and prevents implementation by other researchers who could improve the model. We strongly recommend more transparent reporting, sharing code and data (if possible) and detailing the hardware used. Only in this way can the replicability and robustness of a study be verified. Further, from the TRIPOD survey it emerged that it would be desirable to improve the drafting of titles and abstracts by inserting more explanatory keywords.
Some limitations in our study can be highlighted. First, although comprehensive and systematic, our search may have missed some studies that could have been included. Second, the guideline we used to assess the quality of the studies (TRIPOD) was not designed for AI studies, so some items and their adherence levels need some degree of interpretation. Third, we focused on studies that used small databases of clinical images; we believe it may not be appropriate to generalize our findings to other databases employed in the field of AI. Taking into account the main limitation that emerged from this review, we feel compelled to underline the importance of the external validation of the developed models. This verification process aims to ensure the credibility, reliability, and accuracy of the results by subjecting them to scrutiny and evaluation by external, unbiased and independent validators. It helps mitigate biases and errors that might have been overlooked by the original researchers or developers. External independent validation enhances the transparency and accountability of the research and development process and helps build trust among stakeholders, decision-makers, and the wider community. Overall, external validation is an important process for ensuring that models are performing as intended, and that the results are accurate and reliable against real-world data. In addition, it provides confidence in the decisions made based on the output of the model, which is essential in the clinical field.
As further suggestions for future directions, since data augmentation can impact bias propagation in machine learning models, caution must be exercised to ensure that biases are not amplified or introduced during the augmentation process. A thoughtful approach that includes diverse and representative data, bias detection and correction can help mitigate bias propagation. Furthermore, although data augmentation can pose challenges to model explainability, the following strategies can help mitigate these challenges: (i) careful consideration of methods specifically designed to improve the interpretability of models trained on augmented data, (ii) awareness of the impact of augmentation on feature importance, and (iii) controlled augmentation strategies to ensure that the augmented data samples preserve the salient characteristics of the original data. In our opinion this topic is not explicitly addressed in the literature and should be developed in future works. Ultimately, balancing the benefits of improved model performance with the need for interpretability is essential, particularly in domains where transparency and accountability are critical. For this purpose, post-hoc interpretability methods should be employed, highlighting relevant features or generating saliency maps.
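As an example of such a post-hoc method, the following is a minimal sketch of a gradient-based saliency map in PyTorch; the model and input conventions are assumptions for illustration only.

```python
# Minimal sketch of a post-hoc, gradient-based saliency map; `model` is assumed
# to be any differentiable image classifier taking (C, H, W) inputs.
import torch

def saliency_map(model, image, target_class):
    """Return |d score / d pixel| as a simple per-pixel saliency map."""
    model.eval()
    image = image.clone().unsqueeze(0).requires_grad_(True)   # (1, C, H, W)
    score = model(image)[0, target_class]
    score.backward()
    # maximum absolute gradient across channels highlights influential pixels
    return image.grad.detach().abs().max(dim=1)[0].squeeze(0)
```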

Qualitative summary
The research question of this systematic review is to highlight the different strategies for working with small data. On the basis of the research question, the following qualitative summary may be extracted from the surveyed papers. Working with AI on a small medical database requires careful consideration of various strategies to ensure effective utilization of the available data and of the AI capabilities. First of all, ensuring the quality of the data is paramount, and cleaning and preprocessing the data to remove noise, errors, and inconsistencies will improve the accuracy of any AI model developed. Moreover, with limited data, to make the most of the available information, feature engineering and selection become particularly important for identifying the most relevant features. As for algorithm selection and training, the combination of regularization, to penalize complex models, and k-fold cross-validation, to assess the generalization performance on unseen data, can mitigate overfitting with limited data (a minimal sketch of this recipe is given below). As emerged from the analysis of the examined publications, the two most widely used approaches to cope with the issues in the use of small databases are data augmentation and transfer learning, thanks also to their simple application and diffusion, reflecting their effectiveness and versatility across various domains. The first one (i) helps introduce variability into the training data, making AI models more robust to variations and noise present in real-world scenarios, (ii) encourages the model to learn more invariant and discriminative features, improving its generalization performance, and (iii) helps prevent the model from focusing too heavily on idiosyncratic patterns in the training data, reducing the risk of overfitting. The second one (i) by initializing the model with weights learned from a pre-trained model and fine-tuning it on the target dataset, enables faster convergence and often results in better performance compared to training from scratch, (ii) can facilitate domain adaptation, where knowledge learned from a source domain is adapted to a target domain with different characteristics, and (iii) can achieve state-of-the-art performance without the need for extensive computational resources or labeled data.
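The sketch below illustrates the regularization plus k-fold cross-validation recipe mentioned above, assuming scikit-learn and a small feature-based dataset; the feature matrix X, labels y and the chosen classifier are illustrative assumptions.

```python
# Minimal sketch of regularization + stratified k-fold cross-validation on a
# small dataset, assuming scikit-learn; X and y are assumed to exist.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# L2-penalized classifier: the penalty discourages overly complex solutions
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(penalty="l2", C=1.0, max_iter=1000))

# stratified 5-fold CV gives a generalization estimate despite the small sample
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
# print(f"AUC: {scores.mean():.2f} +/- {scores.std():.2f}")
```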
The analysis revealed that limiting the definition of 'small' for a database by considering only the number of samples or records it contains is inadequate. Hence, it becomes imperative to recalibrate the understanding of scale, shifting the focus from absolute metrics to a more nuanced perspective. Instead of merely counting entries, the intricate interplay between data volume and the number of free parameters awaiting optimization must be considered. By contextualizing the term 'small' within the framework of relative proportions, it becomes apparent that the size of the database alone does not dictate the complexity of the task. Rather, it is the ratio between the volume of data and the degrees of freedom within the model that truly defines the magnitude of the challenge. This is the fundamental aspect, but the analysis also outlined other common characteristics of small databases. Specifically, they may lack diversity in terms of the range of instances or scenarios they cover, have a limited number of features or attributes for each sample, may suffer from imbalanced classes and may contain more noise or variability compared to larger datasets.
When dealing with machine learning models trained on small datasets, a key concern is the risk of overfitting, which may restrict their ability to generalize beyond the training data. However, employing an explainable artificial intelligence based solution ensures that those assessing the model can acquire the necessary insights to conduct specific evaluations of its reliability and effectiveness [97]. Specifically, by tracking how the importance of features varies across different data segments, it becomes possible to gauge whether the factors driving model decisions are changing.
By incorporating the aforementioned strategies into AI workflows, it is possible to mitigate the challenges associated with small medical databases and develop robust and accurate AI models across a wide range of applications and domains.

Conclusions
Though AI requires a sufficient amount of quality data for training, the results obtained using small databases of medical images are promising but still not mature enough to be implemented in the clinical setting and widely used. Transfer learning and data augmentation could represent the most reasonable choices to fight overfitting. Despite the good performances obtained so far, often too promising, there is still a lot of work to be done. First of all, the external validation of models should be encouraged, using databases that are independent from those used for training. In addition, it is necessary to sensitize researchers to be more transparent, sharing code and data as much as possible. This attitude will help the reproducibility, the generalizability and the development of higher quality research.

Figure 1. Flow-chart of article selection based on PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) guidelines.


Figure 2. General characteristics concerning the imaging modalities (a), the distribution of the number of samples in the databases (b), the most popular anatomical regions (c) and the preferred type of classification (d).

Figure 3. Performance of AI algorithms in terms of accuracy (left) and AUC (right) as a function of publication year for binary classification studies.

Figure 4. Use of transfer learning (a) and data augmentation (b) as a function of publication year.

Figure 5. AI performances (accuracy top, AUC bottom) for binary classification studies: with and without transfer learning (first column), with and without data augmentation (second column), with and without both techniques (third column).

Figure 6. Adherence to reporting standards: completeness of reporting of individual TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) items.

Figure 7. Quantity & quality: AI performances with respect to the quantity (available data) and quality (TRIPOD) of the data.