A Deep Learning Approach to Classification Pneumonia in Thorax Images

A machine learning algorithm was developed that is able to classify radiographic images as pneumonia or non-pneumonia. It is based on a data set published on the "Kaggle" platform by the Radiological Society of North America, from which we obtained a set of labeled images. Of these, 70% were used to train a convolutional neural network model consisting of two convolutional layers for feature extraction, followed by a classification stage. The model reached an accuracy of 74% on the test data, which comprised the remaining 30% of the data set.


Introduction
Pneumonia is an acute infection of the lung parenchyma that can cause anything from inflammation to the accumulation of fluid in the lungs. Its etiology may be infection by viruses, bacteria or fungi.
It affects both non-hospitalized and hospitalized patients and is characterized by clinical manifestations such as fever and/or respiratory symptoms (cough, expectoration), among others, together with changes in the chest X-ray. It accounts for more than 15% of all deaths of children under 5 years of age globally [1].
In Colombia, according to Vital Statistics (DANE), in 2016 there were 7,570 deaths whose cause of death (ICD-10) was Pneumonia, unspecified organism, and 24 due to Bacterial Pneumonia, not elsewhere classified [2]. According to the Hospital Clínico Universitario de Santiago de Compostela (Spain), in the journal Neumo Experts in Prevention, it is estimated that each year 150 million children develop the disease and 11 million children are hospitalized due to pneumonia, almost all of them living in developing countries such as Colombia [3][4].
As part of the solutions to this type of problem on a global scale, technology has made a significant contribution to combating disease. Artificial Intelligence (AI) is the protagonist, because it proposes combining algorithms in order to create machines that help people and exhibit capabilities comparable to those of human beings.
1299 (2024) 012002, IOP Publishing, doi:10.1088/1757-899X/1299/1/012002
It is well known in the health sector that radiological imaging, in particular the chest X-ray (CXR), is used for the diagnosis of pneumonia. Currently, however, projects focused on study, prediction and data processing systems can also help in that diagnosis. To this end, advanced technology proposes the use of Artificial Neural Networks (ANN), which belong to the discipline of Artificial Intelligence and are inspired by biology. The type of neural network used to process images is the Convolutional Neural Network (CNN), which is used here to generate a prediction model that helps solve a problem in the health area, specifically pneumonia.
Convolutional Neural Networks are implemented in this research project to obtain significant features from a set of CXR images of patients with and without pneumonia. These features were in turn used to train a mathematical model able to generalize its predictions to different pneumonia images. In the field of artificial intelligence this type of training is known as supervised learning: the training algorithm is provided with a set of data and the corresponding labels, which denote the class each data item belongs to. The aim is to obtain greater precision in the diagnosis and in the results, so as to allow timely and less invasive intervention, improve the quality of care, provide greater comfort to patients and widen access to technology services in medicine.
It is clear that Artificial Intelligence is not going to replace doctors, but it is of great help when making decisions in the diagnosis of pneumonia. It also helps to reinforce the knowledge of general practitioners who do not have as much experience as radiology specialists when interpreting a chest X-ray.
Although there are enormous challenges, the benefits of implementing technological advances in the medical sector are monumental. The challenge for the future is to achieve the full incorporation of technology in clinics, hospitals and laboratories that today lack it for want of economic resources. With it, a significant number of lives would be saved daily, and the care and wellbeing of all citizens would be improved.

Motivation
Health is one of the most deficient sectors in Colombia and, in the opinion of many analysts, "is in crisis" [5]. In this context, technology can provide benefits that would improve the quality of service of healthcare institutions. Pneumonia accounts for more than 15% of all deaths of children under 5 years of age internationally; in 2015, 920,000 children under 5 years of age died from this disease [6]. Given this need, Artificial Intelligence (AI) is becoming more common in our lives and is turning into a more sophisticated tool that performs the same tasks that humans do, but more efficiently, faster and at a lower cost. This is one of the reasons why including AI in medicine is becoming more and more popular worldwide [7]. In the health area in particular, AI techniques increasingly support clinical diagnosis more quickly and at a lower cost [8]. This motivates the development of an algorithm supported by Artificial Intelligence, specifically by neural networks, that helps optimize the diagnostic process for pneumonia; Convolutional Neural Networks (CNN) can contribute greater precision through image recognition. Those who would benefit most from this research are doctors who want to build further research on it and, in turn, provide better service and care to patients.

Material and Methods
For the development of this work we used the "Team Data Science Process" (TDSP) methodology focused on the life cycle of data science projects.
A sample of 10,000 radiological images of pneumonia patients and healthy patients was taken for the training of our convolutional neural network. These data were taken from the National Institutes of Health of the United States.
The selection criterion was defined by the algorithm: the first images of the total set were taken, in which pneumonia and non-pneumonia images were mixed. The sample used for the project consists of 630 training images, 270 test images and 270 images for final evaluation.

Dataset
This section describes, step by step, how the data set is constructed. The flowchart in Figure 1 presents the tasks for data acquisition.

Figure 1 Dataset Construction
DICOM (Digital Imaging and Communications in Medicine) images are taken as input data. The DICOM format stores information in data sets: the file of a chest X-ray image also holds the identification of the patient, so that the image can never be separated from this information. A DICOM data object is built from a series of attributes, which correspond to elements such as name, ID, etc., plus a special attribute containing the pixel data of the image. This information provides the algorithm with the data needed to create the training and test data sets.
The DICOM files had previously been renamed manually in ascending numerical order, from 1 to 10,000, to allow an orderly, sequential reading of each image during the data extraction process. The data set was built with the "create_dataset" algorithm, which obtains the data of interest and persists them. The data of interest in each DICOM image is a matrix of 8-bit pixels with values from 0 to 255, representing an image 1024 pixels wide and 1024 pixels high, as can be seen in illustration 7. In each iteration the images were resized to 256 pixels high and 256 pixels wide. The images were read from the local disk and accumulated in memory iteration by iteration to give them a shape and order (flattened image size x number of images), and were finally serialized to an HDF5 file with h5py, as shown in Figure 1. More specific details are given in the annex "algorithm for the extraction of dependent and independent variables".
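The resizing and stacking step can be sketched as follows. This is a minimal illustration, not the "create_dataset" algorithm itself: the source does not state which resizing method was used, so block averaging over 4 x 4 pixel blocks is an assumption, the function names are hypothetical, and random arrays stand in for decoded DICOM pixel data. Reading the DICOM files and writing the HDF5 file are left out.

```python
import numpy as np

def downsample_block_mean(image, factor=4):
    """Shrink a 2-D image by averaging non-overlapping factor x factor blocks."""
    height, width = image.shape
    if height % factor or width % factor:
        raise ValueError("image sides must be divisible by factor")
    blocks = image.reshape(height // factor, factor, width // factor, factor)
    # Average over the two block axes, then round back to 8-bit pixel values.
    return np.rint(blocks.mean(axis=(1, 3))).astype(np.uint8)

def build_feature_matrix(images, factor=4):
    """Downsample each image and stack the flattened results, one row per image."""
    rows = [downsample_block_mean(img, factor).reshape(-1) for img in images]
    return np.stack(rows)

# Synthetic stand-ins for three decoded 1024 x 1024 pixel arrays (assumption).
rng = np.random.default_rng(0)
fake_images = [rng.integers(0, 256, size=(1024, 1024), dtype=np.uint8)
               for _ in range(3)]
features = build_feature_matrix(fake_images)
print(features.shape, features.dtype)
```

Each 256 x 256 image becomes one row of 65,536 values, which matches the matrix width used later in the normalization stage.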
While the image data (the independent variables) were being extracted, the dependent variables were extracted from the "labels_unicos.csv" file, which contains the diagnosis corresponding to each DICOM image in binary form. Each entry in the file is associated with its image by means of an identifier (Patientid). These values indicate whether or not the image has a pneumonia diagnosis, which makes it possible to build the supervised learning algorithm, in particular to train and evaluate the CNN (Convolutional Neural Network).
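A minimal sketch of how labels might be matched to images through the patient identifier is shown below. The column names "patientId" and "Target", the function names and the sample rows are assumptions made for illustration; the actual layout of "labels_unicos.csv" is not given in the text.

```python
import csv
import io

def load_labels(csv_file, id_column="patientId", label_column="Target"):
    """Return {patient_id: 0 or 1} read from a CSV with one row per image."""
    reader = csv.DictReader(csv_file)
    return {row[id_column]: int(row[label_column]) for row in reader}

def labels_in_image_order(image_ids, labels_by_id):
    """Align the labels with the order in which the images were read."""
    return [labels_by_id[image_id] for image_id in image_ids]

# Tiny in-memory stand-in for the labels file (identifiers and values are made up).
sample_csv = io.StringIO("patientId,Target\np001,0\np002,1\np003,0\n")
labels_by_id = load_labels(sample_csv)
y = labels_in_image_order(["p003", "p001", "p002"], labels_by_id)
print(y)
```

Looking labels up by identifier, rather than relying on row order, keeps each label attached to the right image even if the images are read in a different sequence.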

Modeling CNN
Figure 2 shows the modeling of the convolutional neural network, which ends with the evaluation of the constructed model. The "load_dataset" function loads into memory the files "train.hdf5" (serialized images) and "data_set_Y.hdf5" (serialized labels). Because of hardware limitations, 900 images were used, and a further 270 images were reserved for a final evaluation of the model. Two stages follow: 1. Normalization. 2. Conversion of labels to "one hot encode" format.

Split
The normalization stage received the matrix in rows x columns form (900 x 65536), that is, one row per image and one column per pixel of the 256 x 256 image, and divided it by the maximum value of a pixel (255). The values stored in the array therefore lie between 0 and 1 and are lighter to handle during training and adjustment of the convolutional neural network.
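The normalization can be illustrated with a short sketch. The function name is hypothetical and a small array stands in for the 900 x 65536 matrix; casting to 32-bit floats is an assumption, chosen only because it is a common choice for network inputs.

```python
import numpy as np

def normalize_pixels(x_uint8, max_value=255.0):
    """Scale 8-bit pixel values to floats in the range [0, 1]."""
    return x_uint8.astype(np.float32) / max_value

# Small stand-in for the 900 x 65536 matrix of flattened images.
x = np.array([[0, 51, 255], [102, 204, 153]], dtype=np.uint8)
x_norm = normalize_pixels(x)
print(x_norm.min(), x_norm.max(), x_norm.dtype)
```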
In the conversion of labels to "one hot encode" format, the classes were defined, which for this data set were 2 (non-pneumonia and pneumonia), to obtain a row x column matrix (900 x 2). Each column represents a class and each row holds the result for a specific image; the position of the 1 in the binary row determines the category to which the image belongs, as shown in Figure 3 (where [1 0] = non-pneumonia and [0 1] = pneumonia).
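As an illustration of the encoding, the sketch below turns a list of binary labels into one-hot rows. The function name is hypothetical and the four labels are made up; class 0 is taken to mean non-pneumonia and class 1 pneumonia, following the convention stated above.

```python
import numpy as np

def one_hot(labels, num_classes=2):
    """Return an (n_samples, num_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=np.int64)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    # Row i gets a 1 in the column given by its class index.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

y = [0, 1, 1, 0]  # 0 = non-pneumonia, 1 = pneumonia
y_one_hot = one_hot(y)
print(y_one_hot)
```

Taking the index of the largest value in each row recovers the original class, which is also how a two-output classifier's prediction is usually read.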

Figure 3 One hot encode label
The next step is the division of the data. Figure 4 shows the split technique, which divides the total data loaded in memory into 70% training data and 30% test data. The purpose of this division is to evaluate the trained model on the test data and to check whether it generalizes to unseen data.
The feature extraction process starts with the application of convolution filters, together with the stride and padding parameters. The filters extract information from the image in numerical terms, in the form of a feature map, and at the end of the convolution process the result is flattened into a one-dimensional vector (see Figure 6).
The model was tested with different parameters and the configuration with the best performance was selected. Figure 8 shows the record of the parameter combinations (learning rate, batch_size, epochs and steps) that came closest to a significant result, with the metrics reported for each evaluation on the test data. The parameters are: learning rate (step size of the weight updates), batch size (number of images processed in each step), epochs (number of training passes in which the error is minimized) and steps (steps taken in each epoch).
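The roles of filter, stride and padding in feature extraction can be seen in a small sketch. This is a plain NumPy illustration of a single filter applied to a single image, not the model's actual layers: the filter values, the 6 x 6 input and the function names are all assumptions. As in most CNN libraries, the operation is a cross-correlation (the filter is not flipped).

```python
import numpy as np

def conv_output_size(size, kernel, stride=1, padding=0):
    """Length of one output side: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def conv2d_single(image, kernel, stride=1, padding=0):
    """Slide one 2-D filter over one 2-D image and return the feature map."""
    padded = np.pad(image, padding, mode="constant")  # zero padding on all sides
    k_h, k_w = kernel.shape
    out_h = conv_output_size(image.shape[0], k_h, stride, padding)
    out_w = conv_output_size(image.shape[1], k_w, stride, padding)
    feature_map = np.zeros((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = padded[top:top + k_h, left:left + k_w]
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

image = np.arange(36, dtype=np.float64).reshape(6, 6)   # toy 6 x 6 "image"
vertical_edge = np.array([[1.0, 0.0, -1.0]] * 3)        # illustrative 3 x 3 filter
feature_map = conv2d_single(image, vertical_edge, stride=2, padding=1)
flat = feature_map.reshape(-1)                           # flattening step
print(feature_map.shape, flat.shape)
```

A larger stride shrinks the feature map, while padding preserves border information; with a 3 x 3 filter, stride 1 and padding 1, a 256 x 256 input keeps its size.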

Figure 8 Test and validation Results
The evaluation with the test data used a total of 270 images, whose results were summarized in a confusion matrix; a further 270 normalized images were used for another evaluation, to be compared against the test results. Figure 9 shows the classification report in terms of precision, accuracy, sensitivity and F1. Compared with the other results, these are the highest, obtained with the selected parameters.
For the evaluation we applied the same parameters as in the test, which came closest to the best possible result with this set of images. The best-fit parameters, which produced this confusion matrix (see Figure 10), are: learning rate = 0.0001, batch size = 32, steps = 12, epochs = 150.
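The metrics in a classification report follow directly from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the function name is hypothetical and the counts are made up for illustration only, they are not the values of Figure 9 or Figure 10.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, sensitivity (recall) and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + sensitivity
    f1 = 2 * precision * sensitivity / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
    }

# Made-up counts for 100 images, with pneumonia as the positive class.
metrics = binary_metrics(tn=50, fp=10, fn=15, tp=25)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a diagnostic setting, sensitivity deserves particular attention, because a false negative corresponds to a pneumonia case that the model misses.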

Conclusions
As a result, a Machine Learning algorithm was built whose evaluation metrics, expressed as percentages, show an accuracy of more than 73% and an error rate of less than 21%. During the course of the research some limitations arose, which were mitigated so as to affect the results as little as possible. One example is the shortage of computational equipment (hardware). To take this project a step further, we recommend equipment with more than 56 GB of RAM, 500 GB of storage and any processor of an Azure DSVM (Data Science Virtual Machine) capable of working in parallel. If the project is scaled up in hardware, we recommend analyzing the class balance of the data to decide whether balancing with the "SMOTE" technique of the "imblearn.over_sampling" library is necessary.

Figure 2 Model construction

Figure 4 Split data
As a result of dividing the 900 "x" records, 630 records are obtained for training and 270 for testing. For the creation of the CNN architecture (see Figure 5), a process divided into two sub-processes (feature extraction and classification) was implemented. The method for building the pneumonia classification algorithm thus consists of two major processes: feature extraction, implemented with the CNN technique and consisting of two convolution layers, followed by a neural network for classification.
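The 630/270 division of the 900 records can be reproduced with a short sketch. Whether the original split shuffled the records, and with which tool, is not stated; the shuffling, the fixed seed and the function name below are assumptions.

```python
import random

def split_indices(n_samples, train_fraction=0.7, seed=42):
    """Shuffle the sample indices and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(900)
print(len(train_idx), len(test_idx))
```

The same index lists would be applied to both the image matrix and the label matrix, so that each image stays paired with its label.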

Figure 6 Feature extraction

Figure 7 Train process

Figure 9 Ranking report with test data

Figure 10 Confusion matrix with evaluation data