Automated Quantification of DNA Damage Using Deep Learning and Use of Synthetic Data Generated from Basic Geometric Shapes

Comet assays are used to assess the extent of deoxyribonucleic acid (DNA) damage in human cells caused by substances such as novel drugs or nanomaterials. Deep learning is showing promising results in automating the process of quantifying the percentage of damage from the assay images, but the lack of large datasets and imbalanced data are a challenge. In this study, synthetic comet assay images generated from simple geometric shapes were used to augment the data for training a convolutional neural network. The results from the model trained using the augmented data were compared with the results from a model trained exclusively on real images. The use of synthetic data in training not only gave a significantly better coefficient of determination (R²), but also resulted in a more robust model, i.e., one with less variation in R² than training without synthetic data. This approach can improve training while using a smaller training dataset, saving the cost and effort involved in capturing and annotating additional experimental images. Additional benefits include addressing imbalanced datasets and data privacy concerns. Similar approaches should be explored in other low-data domains to extract the same benefits.

The ability of a substance or material to damage the DNA within a cell is called its genotoxicity. Several genotoxic agents, such as chemicals, viruses and radiation, can impact the DNA. The damage that genotoxic agents cause can take the form of mutations, DNA strand breaks or chromosomal aberrations, which lead to diseases such as cancer and neurodegenerative diseases that have a significant social impact.[2][3] Nanoparticles, given their small size, can interact with the cellular DNA. This interaction can cause strand breaks or modifications. Hence it is essential to test every new nanoparticle for genotoxicity.[1] Single-cell gel electrophoresis, also known as the comet assay, is a popular method for measuring DNA damage at the cellular level. It is simple to perform, sensitive and quick, and can detect both single-strand and double-strand breaks.[6] As a first step, the cells are embedded in an agarose gel, which has a low melting point, on a microscope slide. An alkaline treatment denatures the DNA, leading to the separation of the strands; this separation exposes the breaks or damage sites in the DNA.[5][7] The denatured DNA is then subjected to an electric field, in a process called electrophoresis, which separates the DNA fragments. The damaged DNA fragments migrate towards the anode, and the path of the migration resembles a comet. A typical resultant image of the process is shown in Fig. 1; hence the process is known as a "comet assay."
In the comet assay process, there are three parameters that indicate the extent of DNA damage: tail moment, tail length and tail intensity.[8] Being able to measure them accurately helps quantify the percentage of DNA damage. However, various experimental factors, such as the duration of exposure to the electric field, the voltage applied and the temperature at which the assay was performed, can impact the results; these experimental conditions must be controlled adequately to obtain consistent results.[9] The percentage of DNA damage is directly proportional to the DNA migration in the comet assay. The main scoring approaches are visual scoring and computer-based image analysis.[10] Visual scoring is a subjective, qualitative assessment done on a scale of 0 to 4, where 0 means no damage and 4 means severe damage. A set of reference images is used by the researcher to score the assays visually. This method is not very reliable due to its subjective nature.
The image processing or computer-based approach involves obtaining digital images of the comets and using image analysis software to score them. Image processing approaches involve two steps: first, the head and tail are identified in the images; second, the total intensity of pixels in the regions identified in the first step is estimated. This process is more accurate than visual scoring but is not fully automated.
The capture of comet images digitally allowed researchers to apply machine learning (ML) algorithms for scoring. Several ML algorithms require the user to extract features from the images. This is typically done programmatically, with custom code written for each feature; the process of developing features for ML models is called feature engineering. Traditional machine learning approaches required manual feature engineering. With the advent of deep learning (DL), the feature extraction process can also be automated, and some researchers have demonstrated that high accuracy can be achieved using DL. Hence there is now a possibility of fully automated and accurate scoring of comets using DL.[11] The most common parameters to quantify the DNA damage from a comet assay image are tail length, tail intensity, and tail moment.[12] Tail length is the distance from the edge of the head to the end of the tail. It is an indicator of the extent of DNA migration and the degree of DNA damage. Tail intensity is proportional to the percentage of DNA present in the tail. It can be calculated as the ratio of the intensity (fluorescence) of the tail to the intensity of the entire comet, multiplied by 100. Tail intensity is a measure of the amount of damaged DNA that has migrated from the head, or nucleus.[13] The product of tail intensity and tail length gives the tail moment. It is a measure that considers both the extent of migration of the DNA and the relative amount of DNA in the tail.[14]
E-mail: snamu001@fiu.edu
ECS Sensors Plus, 2024 3 012401
Several factors contribute to the variation in comet assay images and hence, scoring them accurately is a challenge. Visual scoring is very subjective and can lead to inaccurate scores and wrong interpretation of results.
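The three metrics defined above reduce to a few lines of arithmetic. The sketch below is illustrative only: the function name is ours, and it assumes the head and tail regions have already been segmented and their fluorescence summed.

```python
def comet_metrics(head_fluorescence, tail_fluorescence, tail_length):
    """Compute the three DNA-damage metrics for a segmented comet.

    head_fluorescence / tail_fluorescence: summed pixel intensities of
    the head and tail regions; tail_length: distance in pixels from the
    edge of the head to the end of the tail.
    """
    total = head_fluorescence + tail_fluorescence
    # Tail intensity: percentage of the total comet fluorescence in the tail.
    tail_intensity = 100.0 * tail_fluorescence / total
    # Tail moment: product of tail length and tail intensity, as defined above.
    tail_moment = tail_length * tail_intensity
    return tail_length, tail_intensity, tail_moment
```

For example, a comet whose tail holds a quarter of the total fluorescence (head 300, tail 100) has a tail intensity of 25%, and with a 12-pixel tail its tail moment is 300.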
The use of image processing software has addressed these issues by offering a quantitative and objective approach to the scoring process.[15] However, sources of variation such as the voltage applied, the temperature and the length of electrophoresis time can still impact the results.[16] Reproducibility of results is another challenge due to genetic variations that naturally occur between individuals, and results can also vary from one lab to another.[17] Deep learning (DL) has shown promise in biomedical image processing applications, including the scoring of comet assays. Convolutional neural networks (CNNs) can handle complex image data with inherent variation, such as comet assay images. The sources of variation include differences in image resolution, noise levels and image intensity.[18] Figure 2 shows the variation in comet assay images from real-world datasets; notice the variation in image quality, orientation and size. CNNs can learn to recognize patterns in the presence of such variation, leading to more robust scoring, whereas the accuracy of image processing techniques that do not employ DL suffers from that same variation.[19] Traditional machine learning algorithms require manual effort to extract features from images, but DL models learn such features automatically. Hence DL-based approaches are faster and more accurate, making them more suitable for applications where the volume and velocity of the data are higher.[20] Reducing the amount of human intervention also leads to less subjectivity and bias, and makes it possible to incorporate varied sources of training data. As a result, the outcomes can be expected to be robust and reproducible.
Automated scoring of comet images using deep learning can enable several applications.[21][23][24][25][26] In particular, automating the quantification of comet images can enable scoring on the edge; the importance and applications of edge computation for healthcare and biosensor-related applications are discussed in Refs. 27 and 28. Even though DL shows a lot of promise, it has its own set of challenges in the context of comet assays. The most important one is the availability of relevant datasets for training: CNNs require large amounts of annotated data to achieve high accuracy and reliability.[31] Variation in image quality, orientation of the comets and size of images is another challenge.[32] The computational complexity, the need for expensive hardware and the training time are further challenges, especially if real-time analysis is needed.[33] Imbalanced datasets, where some of the classes have very few samples, add to the difficulty of achieving robust results. For example, if most of the samples are from cells with low damage, then the model will not give robust results when scoring samples with a high extent of damage. Such datasets can also lead to overfitting. These challenges can be addressed using techniques such as data augmentation and regularization.[34] Data augmentation is a collection of image transformation techniques that enhance the size of datasets and lead to better models.[41] Image transformations that do not impact the label or annotation of the image are called label-preserving transformations. Data augmentation techniques include rotation, scaling, resizing, translation, flipping, brightness adjustment, noise injection and synthetic image generation. Sometimes data with noisy labels can also be used in deep learning to improve the models.
Figure 1. An image of a comet assay. In this case, the head is brighter and circular, and the tail is lighter. Note that the tail is not always separated from the head.
If comet assay images can be generated artificially, rather than derived from existing images, the training data can be significantly enhanced.[35] Synthetic data, which is artificially generated data, can have several benefits.[42]
Below are some of its advantages, based on the work of Ref. 43, in the context of comet assay scoring.
Data scarcity: Comet assay images are time- and labor-intensive to obtain, there are very few publicly available datasets, and getting enough high-quality images is a challenge. The ability to generate synthetic data can address this scarcity.
Controlled variability: Each variable related to the comet images can be changed one at a time to generate additional images. This ensures that several sources of variation are covered in a controlled fashion.
Data privacy: Real-world comet assay images could be mapped to patient data unless de-identified, and there is always a threat of personal information being compromised by data breaches. With synthetic data this concern does not arise, because the images are not related to actual patients.
Cost reduction: Generating images programmatically is less labor- and time-intensive than a researcher performing assays in the lab and then annotating them.
Reduced class imbalance: Sometimes it may not be possible to get enough samples of a particular class. In such cases, synthetic data generation might be the only option available.
Model performance/generalization: When models are trained on augmented datasets that include all the factors of variation and have reduced class imbalance, performance and generalization improve.
The following are some of the challenges identified in generating synthetic comet assays:
Variability in comet appearance: Variation exists in comet orientation, brightness, shape, extent of damage and several other parameters. It is a challenge to ensure all such sources of variation are mimicked in the generated data.
Computational complexity: Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs) and Large Language Models (LLMs) are computationally intensive. Research groups or labs with limited access to compute infrastructure might not be able to take full advantage of these approaches.
Model overfitting: Most synthetic data generators also require a training dataset. If they rely on a small set of real-world comet images, and the parameters cannot be varied, this can lead to overfitting.
Lack of annotation: Generative models such as GANs, VAEs and LLM-based text-to-image generators can produce images, but they produce them at random and do not produce labels.
Several data generation techniques have been developed in recent years. We now describe some of the most common techniques used to generate synthetic data.
Generative Adversarial Networks (GANs) were proposed in 2014 (Ref. 36) and have continued to grow in popularity. Two neural networks, a generator and a discriminator, work in tandem to train each other. The generator creates images that mimic the statistical properties of the training images, and the discriminator tries to classify whether an image is real or not. The output of the discriminator is used to improve the generator with each run, or epoch. GANs have proven their effectiveness in several domains, including medical images.
Variational Autoencoders (VAEs) consist of an encoder network and a decoder network. They learn a representation of the training data and can generate synthetic data that has the same statistical properties as the input data (Kingma & Welling, 2014, 2019).[44][45] Neural style transfer is an approach where images are generated by combining the contents of two images: the shapes of the larger objects are borrowed from one image, and the subtle or finer aspects, referred to as style, are borrowed from the other.[46][47] Domain adaptation, or transfer learning, is a way of leveraging a model's training in an unrelated domain. Synthetic data can be used to pretrain the DL networks, and real data is then used to fine-tune the network. These techniques have been shown to improve model performance on real-world tasks.[37][38] In recent months, several generative AI tools have been made commercially available. For example, DALL-E is a tool from OpenAI that can generate images from text prompts, using Large Language Models (LLMs) to achieve this task.[39] Figure 3 shows comet-like images generated by DALL-E. The drawback of this approach is that such generative AI tools cannot take specific parameters, such as head size, and generate images to a given requirement; the images are randomly generated, without labels or annotation.
The two questions that guided this research were: 1. Can synthetic comet images be generated from basic geometric shapes? 2. Can synthetic images that are representative of comets, even if oversimplified, help in improving the DL models?
There are no examples in the current literature where synthetic images have been generated from geometric shapes. Hence, if the answer to the second question is yes, this novel method will address some of the challenges in applying DL to comet assay scoring, such as imbalanced datasets, low data availability, the labor and cost involved in capturing experimental data, the annotation of data, and the privacy of patient information. The current literature also lacks examples of applying synthetic data to train DL models in the context of comet assays.
Comet assays have a wide range of applications in toxicology, drug development, cancer research, and personalized medicine.Improving or automating the scoring process benefits these applications.An effective synthetic data generation approach can address challenges and gaps in other applicable domains as well.

Methods
Proposed data generation method.—The typical comet shape can be considered a combination of two basic geometric shapes: a circle and a triangle, as shown in Fig. 4. With this parametrization of the comet shape, we can vary the diameter of the head (the diameter of the circle) and the width and length of the tail (the sides of the triangle), so many variations of the combination of these two basic shapes are possible.
To create the variations of the comet shape, custom code was written in Python using the PIL library. The code generated 10,000 comet images with various head-to-tail ratios, implying various levels of damage.
To generate the comet, first a circle is placed in the left half of the image and then a triangle is added such that the base of the triangle coincides with one of the diameters.This process is repeated for various values of radius and height.The size of the head can be varied by changing the radius of the circle.And changing the height of the triangle would change the length of the tail.For this research, all the comets were generated with the tail being horizontal i.e., the base of the triangle is vertical.Other orientations of the comets were obtained by rotating the comets as part of data augmentation.
The length of the tail is the difference between the height of the triangle and the radius, due to the overlap.The image size needs to be chosen to fit even the largest comet generated.For example, a 500 × 500 pixel image can take a comet with radius 100 and a tail of length 300.If the radius and the tail length are both varied between 1 and 100 pixels, then there are 10,000 unique comets generated.These are the only two parameters that need to be varied while generating the images.Figure 5 shows a sample of images generated from the code.
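The generation step described above can be sketched in a few lines with the PIL (Pillow) library the paper mentions. The function name, fill values and head position below are our illustrative choices, not the authors' actual code:

```python
from PIL import Image, ImageDraw

def make_comet(radius, tail_height, size=500):
    """Draw a synthetic comet on a black background: a triangle (tail)
    whose base is the vertical diameter of a circle (head), with the
    brighter head drawn on top of the overlap region."""
    img = Image.new("L", (size, size), 0)
    draw = ImageDraw.Draw(img)
    cx, cy = size // 4, size // 2  # place the head in the left half
    # Tail: triangle pointing right, its base on the circle's vertical diameter.
    draw.polygon([(cx, cy - radius), (cx, cy + radius),
                  (cx + tail_height, cy)], fill=180)
    # Head: filled circle, drawn last so it overlays the base of the triangle.
    draw.ellipse((cx - radius, cy - radius, cx + radius, cy + radius), fill=255)
    return img

# Varying radius and tail_height over 1..100 yields the 10,000 unique comets.
samples = [make_comet(r, h) for r in (25, 50) for h in (75, 150)]
```

Further orientations are then obtained by rotating these horizontal-tail comets during augmentation, as described above.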
Transformations of the generated data.—The images generated from the Python code were subjected to the various data augmentation transformations described in the previous section. These label-preserving transformations were applied with the help of the Python library Keras. The ImageDataGenerator function allows users to specify the types of transformations to apply to the input images; limits are set on the extent of rotation, translation, zoom and other transformations, and the function randomly applies the operations within the limits specified.
A typical comet image can be used to generate several additional images through transformations, as shown in Fig. 6. The specific transformations used in this research were rotation, horizontal shift/translation, vertical shift/translation, zoom, horizontal flip and vertical flip.
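The study applied these transformations through Keras' ImageDataGenerator; the dependency-light sketch below reproduces the same idea (random rotation, shifts and flips within fixed limits) using plain Pillow, and is our illustration rather than the code used in the study:

```python
import random
from PIL import Image

def augment(img, max_rotation=40, max_shift=0.2, seed=None):
    """Apply random label-preserving transformations within fixed limits."""
    rng = random.Random(seed)
    # Rotation by a random angle within +/- max_rotation degrees.
    out = img.rotate(rng.uniform(-max_rotation, max_rotation))
    # Horizontal and vertical shifts, as fractions of the image size.
    dx = int(rng.uniform(-max_shift, max_shift) * img.width)
    dy = int(rng.uniform(-max_shift, max_shift) * img.height)
    out = out.transform(out.size, Image.AFFINE, (1, 0, dx, 0, 1, dy))
    # Random horizontal and vertical flips.
    if rng.random() < 0.5:
        out = out.transpose(Image.FLIP_LEFT_RIGHT)
    if rng.random() < 0.5:
        out = out.transpose(Image.FLIP_TOP_BOTTOM)
    return out
```

Zoom, which the study also used, could be added with a resize-and-crop step; it is omitted here for brevity.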
Simplifying assumptions.—The following simplifying assumptions were made while generating the synthetic comet images.
1. The images are mostly empty except for the comet.
2. The width of the tail is comparable to the diameter of the head at the intersection, and the tail narrows progressively away from the head. Hence the tail can be modelled as part of a triangle.
3. The head can be modelled as a circle, and the area covered by the head is the area of the circle.
4. The triangle has the diameter of the circle as its base, with part of it covered by the head and the rest by the tail, shown by the shaded portion in Fig. 7.
5. The area covered by the tail can be obtained by subtracting the area of the semicircle from the area of the triangle.
These assumptions make it easy to calculate the percentage damage to obtain labels for training, even though they are approximations.
Assume that the radius of the circle is r.The height of the triangle is h.
The area of the head = πr². The area of the triangle = ½ × 2r × h = rh, since its base is the diameter of the circle. By assumption 5, the area covered by the tail = rh − πr²/2. The percentage of DNA damage can then be approximated as the ratio of the tail area to the total comet area (head plus tail), multiplied by 100.
When statistical testing is intended in only one direction, a one-sided t-test is conducted, and the statistic of interest is a one-sided p-value. For example, if the null hypothesis states that Model A and Model B are the same, a two-sided test is needed. But if the null hypothesis is that Model A is better than B, or that Model B is better than A, then a one-sided test is conducted.
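Under assumptions 3-5, the label for each synthetic comet follows directly from r and h. The sketch below is our reconstruction of that labelling arithmetic, assuming uniform brightness so that areas stand in for fluorescence; it is not the authors' exact code:

```python
import math

def damage_percentage(r, h):
    """Approximate DNA-damage label for a synthetic comet with head radius r
    and triangle height h (h > pi*r/2, so the tail area is positive)."""
    head_area = math.pi * r * r                # assumption 3: head is a circle
    triangle_area = 0.5 * (2 * r) * h          # base = diameter, height = h
    tail_area = triangle_area - head_area / 2  # assumption 5: minus semicircle
    total_area = head_area + tail_area         # union of circle and triangle
    return 100.0 * tail_area / total_area
```

For example, r = 10 and h = 40 gives a label of roughly 44% damage.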
The null hypothesis for this experiment is that "the coefficient of determination (R²) obtained from Model B is not greater than the R² obtained from Model A." Hence a one-sided t-test is used. If the one-sided p-value is less than 0.05, the null hypothesis must be rejected.
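The paired one-sided test can be sketched with the standard library alone; the differences are the R² values of Model B minus Model A over the repeated runs (the numbers in the example are made up for illustration). The p-value is then read from the t-distribution, e.g., via scipy.stats.ttest_rel(b, a, alternative='greater').

```python
import math
from statistics import mean, stdev

def paired_one_sided_t(r2_model_b, r2_model_a):
    """t-statistic for H0: R^2 of Model B is not greater than that of Model A,
    computed over paired results from repeated train/test splits."""
    diffs = [b - a for b, a in zip(r2_model_b, r2_model_a)]
    # t = mean difference / standard error of the differences.
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical R^2 values from three repeats (for illustration only):
t = paired_one_sided_t([0.80, 0.90, 1.00], [0.70, 0.70, 0.70])
```

A large positive t, as obtained in this study, pushes the one-sided p-value below 0.05 and rejects the null hypothesis.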

Data Sources and Models
In this research, we used two data sources for training and evaluation of the model.
We used 260 (20%) of the 1300 images for fine-tuning the model. The rest of the images were used for testing.
Model architecture.—Lathuilière et al. (2020)[48] showed that a general-purpose network (e.g., VGG-19/VGG-16, which are named after the Visual Geometry Group), optimized to the full extent, can yield state-of-the-art results without the need for more complex models. The well-known VGG-16 architecture was therefore chosen for this research, and the primary focus was directed towards fine-tuning the model to achieve best-in-class results. Details of the architecture can be found in Simonyan & Zisserman (2015).[49] A visual representation of the VGG-16 network, generated using the Visualkeras package, can be found in Fig. 8.
Training and testing the model.—Two VGG-16 networks were trained. In the first network (Model A) only real images were used for training, whereas in the second network (Model B) a combination of real and synthetic images was used. A schematic of the training workflow for the two networks is shown in Fig. 9. The training was repeated 30 times, each time with a different selection of training and test data for both networks (Table I).

Results
Table I compares the coefficient of determination (R²) for the model trained only on real data and the model trained on a mix of real and synthetic data.
Figure 10 shows the results from the data in Table I as a scatterplot. Each point in the plot is the coefficient of determination on a test set, with the result of Model A on the X-axis and the result of Model B on the Y-axis. Most of the points sit high on the Y-axis, indicating that very few test sets yield a low R² for Model B.
Table II shows the summary of statistics between the two models.
To establish that the performance of Model B is significantly better than that of Model A, hypothesis testing was performed. The null hypothesis states that "the coefficient of determination obtained on the test data with Model B is not significantly better than the coefficient of determination obtained with Model A." The results were: t-statistic = 6.634; one-sided p-value = 5.99 × 10⁻⁹ (smaller than 0.05). Since the p-value is less than 0.05, the null hypothesis is rejected: the R² value is significantly greater when the model is trained with a combination of synthetic and real images (Model B) than when it is trained only with real images (Model A). One limitation of this result is that it was tested on a single architecture for a fixed number of training epochs; further experimentation with other datasets and other architectures is needed to establish its validity in a more general scenario. The best results were obtained with Model B, trained on both real and synthetic images, when the synthetic images were further augmented by label-preserving transformations. An R² value of 0.89 was obtained, which is comparable to state-of-the-art results, while using only about 10k annotated images for training.
The accuracy of the model can be judged by how well its predictions match the actual data. For this purpose, the error in the predictions made by Model B was calculated for all the test images. Table III shows the error magnitude for a sample of 10 test images.
Figure 11 helps visualize the results.The error is calculated as the difference between the predicted and actual values.For example, if the actual percentage of damage in an image is 30% and the prediction is 35%, then the error is 5.In the plot of the actual vs predicted values, the red line connects the coordinates where the X and Y axis have the same value, which is the ideal.Each blue dot corresponds to a test image.The histogram on the right shows the distribution of the errors.
Average error magnitude = 5.87; standard deviation = 6.03. The range of the absolute error values is (0.01, 36.95). Assuming a normal distribution of error values, the mean plus two standard deviations should contain about 95% of the samples; therefore, only about 5% of the predictions have an error greater than 17.9 (percentage points of DNA damage). Future work should focus on methods to improve the results for high-error samples.

Conclusions
A well-balanced dataset with less bias was obtained for comet assay images. The computational complexity of generating the dataset is the same as or lower than that of existing methods, and, if the computational environment can handle them, an arbitrary number of synthetic images can be generated by this method. Not only was the R² value better when synthetic images were used, but the variation in R², as measured by the standard deviation, was also smaller. Hence it can be concluded that the resultant model not only performed better but is also more robust when synthetic images are used.
While the results are promising, further research is needed to quantify the benefits of this approach. Scaling up the synthetic dataset used for training has the potential to reduce the errors further, and the impact of the number of synthetic images on model performance should be studied as an extension. Further experiments are also needed to understand how well the approach generalizes to other datasets and other domains.
This approach of generating synthetic data from basic geometric shapes can be extended to fields where the training data is sparse and the objects in the images can be approximated to simple well-known shapes.This approach can lower the number of real or experimental images needed to obtain high accuracy but does not eliminate the need for real data completely.Real experimental images are needed for the final fine tuning of the model.
One potential application is in object detection. For example, satellite images of urban areas consist of rooftops that can be approximated by polygons. Another example is the assessment of architectural designs, which are made up of several geometric shapes. Polygons, circles and other geometric shapes can be generated easily in any programming language, taking dimensions such as radius and side length as inputs. Other domains of potential application are medical imaging (microscopic images), astronomy (stars, galaxies, comets), infographics and computer-aided design. On the other hand, there are a couple of limitations to this approach. The first is that geometric shapes may not occur in all domains; in such cases, the use of more complex patterns can be explored. Any mathematically generated shape can be used to produce synthetic images if it resembles a real-world image. Ref. 40 is an example of using synthetic images containing mathematically generated 2-D spectra to train a deep learning model.
The second limitation is that, even in the context of comet assays, the synthetic comets generated are only an approximation and do not mimic the original images in all cases.Some comet assay images do not show a well-formed comet.Some of the future approaches can include varying the shape of the head by using an ellipse in place of a circle, adding noise to the images to better mimic real world images, and training the network to identify images that are not suitable for scoring and skip them.Both the limitations discussed here are opportunities for further research.

Figure 2 .
Figure 2. Real-world comet assay images taken from more than one dataset. Notice the variation in image quality, orientation and size of the images. The images in the top row are taken from the dataset shared by Gladstone Industries. The images in the bottom row are.

Figure 3 .
Figure 3. Comet-like images generated by DALL-E. Notice the random variation in shapes. An example starting prompt was "A comet shape in a microscopic image. Very sparse".

Figure 4 .
Figure 4. Visualizing a comet as a combination of a circle and a triangle.

Figure 5 .
Figure 5. Sample images from the synthetic comet assay images generated by the Python code. Notice the variability of the comet head and tail.

Figure 6 .
Figure 6. Image 1 shows the synthetic image generated. The remaining images in this figure were obtained after application of label-preserving transformations.

Figure 7 .
Figure 7. Overlap region of the geometric shapes of circle and triangle in comet generation.

Figure 8 .
Figure 8. Visualizing the Convolutional portion of the VGG16 network using visualkeras package.The yellow layers are convolutional layers, and the red layers are pooling layers.

Figure 9 .
Figure 9. Training workflow for the models used in the research.Model A was trained with only real images, while Model B was trained with a combination of real and synthetic images.

Figure 10 .
Figure 10. Comparing the results of the two models shown in Table I as a scatter plot.

Figure 11 .
Figure 11. The plot on the left shows the actual vs predicted values for Model B, and the plot on the right shows the distribution of the magnitude of errors. Error is the difference between the predicted and the actual values.

Table I .
Results of 30 repeated experiments using different selections of training and test data, comparing the R² values of the models trained exclusively on real images and on combined image data. The relevant statistics of this data are presented in Table II.

Table II .
Summary statistics for the two models, from the data presented in Table I.

Table III .
Error in predictions made by Model B, for 10 of the test images.