Prediction of Patients’ Length of Stay at Hospital During COVID-19 Pandemic

Machine learning has been used extensively in diverse healthcare settings since the start of the 21st century. Statistical models have proven powerful in detecting early disease symptoms and could aid decision-making in the healthcare system. To help improve medical resource allocation during the COVID-19 pandemic, we aim to develop machine learning models that predict each patient's length of stay (LOS) in hospital. Three machine learning models, namely the K-nearest Neighbors algorithm, Support Vector Machine, and Random Forest, are implemented and optimized on the same healthcare dataset. Their final accuracies are 0.3442, 0.3524, and 0.3541 respectively, which are not high. Our subsequent correlation analysis on the healthcare dataset shows that the patient features used do not provide sufficient information for accurate LOS prediction. Machine learning approaches could nonetheless yield much better results if data quality were improved by including additional relevant patient features and breaking LOS into more appropriate intervals. More detailed healthcare data should be obtained to make LOS prediction useful for healthcare management.


Introduction
The world has been exposed to the COVID-19 pandemic since December 2019 [1]. The coronavirus disease is highly contagious, causing serious lung infection and potentially respiratory failure [2]. The sudden outbreak of COVID-19 has tremendously overloaded the hospital system, creating a huge excess in demand for hospital beds [3]. The virus is still spreading rapidly, and better management of healthcare and treatments is required to deal with the continuing crisis [4]. Much work has previously been done to improve healthcare management, including ways to reduce health disparities for priority populations [5] and a quality improvement initiative for nursing facilities [6]. Among the many applications of data technology in the health system [7], the length of stay (LOS) in hospital is an important indicator for monitoring and predicting the health management process. Thus, aiming to promote efficient medical resource allocation during the COVID-19 pandemic, we propose a machine learning approach to predict patients' LOS from relevant patient features.
Machine learning models have proven helpful in improving many aspects of the healthcare system [8]. Tanuja et al. used a multi-layer perceptron (MLP), Naive Bayes, K-NN, and decision trees to predict patients' LOS at admission. Their results showed that MLP and Naive Bayes had the highest classification accuracy of around 85%, while K-NN performed poorly on the dataset with only 63.6% accuracy [9]. Several other studies treated LOS prediction as a regression task, using support vector machines [10], classification and regression trees [11], and Poisson regression [12].
In our project, we employed K-NN, SVM, and Random Forest to classify COVID-19 patients' LOS in hospital. In the AV: Healthcare Analytics II dataset [13], stay length is separated into 11 classes: the first 10 classes correspond to 0-10 days, 11-20 days, ..., 91-100 days respectively, and the last class is more than 100 days. For each model, we compared different data preprocessing techniques, such as encoding methods and principal component analysis (PCA). After model optimization and hyperparameter tuning, the three models achieved very similar classification accuracies of 0.3442, 0.3524, and 0.3541 respectively. We then performed correlation analysis on the dataset and found that the patient features used were only weakly correlated with the LOS label. If data quality can be improved by including more relevant patient features and using more reasonable LOS intervals, the classification accuracy of our models is likely to be greatly boosted, enabling better medical resource management.

KNN
KNN (K-nearest neighbors) was used to predict each patient's length of stay in hospital from their features. Several preprocessing steps were applied so that the dataset could be used with KNN. Since the dataset contains categorical data, which KNN cannot handle directly, both one-hot encoding and simple (label) encoding were considered. After normalizing the data with min-max normalization, x' = (x - min(x)) / (max(x) - min(x)) [14], the performance of the two encodings was compared under the same set of hyperparameters. Simple encoding was chosen because it gave the higher accuracy.
Since the full dataset, which contains more than 400,000 samples, was too large to run KNN on, 50,000 samples were chosen at random and divided into a training set (90%) and a test set (10%). Three hyperparameters were considered for this model: the distance measure (the Manhattan distance, d(x, y) = Σᵢ |xᵢ − yᵢ| [15], or the Euclidean metric, d(x, y) = √(Σᵢ (xᵢ − yᵢ)²) [16]), the way features were processed based on their correlation, and K, the number of neighbors. Different combinations of hyperparameters were tried, and the model with the highest accuracy was selected.
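A small grid search over the two distance metrics and several values of K could look like the sketch below. Synthetic data stands in for the 50,000-sample subset, and the candidate K values are illustrative, not the paper's exact grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded, normalized subset of the data.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

best = None
for metric in ("manhattan", "euclidean"):   # p=1 vs p=2 distance
    for k in (5, 15, 25):                   # number of neighbors
        acc = KNeighborsClassifier(n_neighbors=k, metric=metric)\
            .fit(X_tr, y_tr).score(X_te, y_te)
        if best is None or acc > best[0]:
            best = (acc, metric, k)
print(best)  # (accuracy, metric, K) of the best combination
```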

Support Vector Machine
SVM was used to analyze the dataset, which contains both numerical and categorical features. Multicollinearity occurs when two or more independent variables in the dataset are correlated with each other; to avoid it, the first dummy column of each feature was dropped after one-hot encoding [17]. As a result, the training and test sets have 116 dimensions. Based on principal component analysis, 68 features were kept for analysis.
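Dropping the first dummy column per feature and then reducing dimensionality can be sketched as below. The frame and its column names are hypothetical, and the toy frame has far fewer columns than the 116/68 reported above.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical categorical frame; drop_first=True removes one dummy per
# feature, breaking the exact linear dependence among the dummies.
df = pd.DataFrame({"Severity": ["Minor", "Moderate", "Extreme", "Minor"],
                   "Age": ["0-10", "11-20", "21-30", "11-20"]})
X = pd.get_dummies(df, drop_first=True).astype(float)

# Keep only the leading principal components (68 of 116 in the paper's
# setup; 2 here for the toy frame).
pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)
print(X_reduced.shape)
```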
Since the original training set included more than 100,000 samples, we first used linear SVM instead of kernel SVM, following the official documentation. Because the dataset is nonlinear, however, kernel functions were still worth trying, and the original training set was too large for them, so the sample sizes were reduced to 40,000 training samples and 10,000 test samples. Analyzing 68 one-hot features on 50,000 samples was still impractical, so we switched to simple label encoding, which left 11 features to consider. Three kernel functions were compared: the Gaussian radial basis function, the polynomial function, and the sigmoid function [18]. For each kernel, the optimal values of C and gamma must be found, where C controls how heavily the model is penalized for each misclassified point given a curve, and gamma defines how far the influence of a single training example reaches.
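The kernel and C/gamma search can be expressed as a grid search; the sketch below uses synthetic 11-feature data and an illustrative grid of candidate values, not the paper's exact search space.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the simply-encoded 11-feature dataset.
X, y = make_classification(n_samples=600, n_features=11, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Search C (misclassification penalty) and gamma (kernel reach) per kernel.
grid = GridSearchCV(SVC(),
                    {"kernel": ["rbf", "poly", "sigmoid"],
                     "C": [0.1, 1, 10],
                     "gamma": ["scale", 0.01, 0.1]},
                    cv=3)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))
```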

Random Forest Classifier
Random Forest Classifier is an ensemble learning method for classifying samples from different classes [19]. In this model, a number of decision trees are constructed, and the majority class predicted by these trees is taken as the classification result. At each tree node, a random subset of features is selected, and among them the best predictor is chosen according to a loss function, such as Gini impurity.
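The Gini impurity mentioned above is G = 1 − Σₖ pₖ², where pₖ is the fraction of samples of class k at the node; a split is chosen to minimize the impurity of the resulting children. A minimal sketch:

```python
import numpy as np

def gini(labels):
    """Gini impurity G = 1 - sum_k p_k^2 of a node's class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))  # pure node -> 0.0
print(gini([0, 1, 0, 1]))  # 50/50 split -> 0.5
```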
To implement this model, the healthcare dataset is first divided into a training set (80%), a validation set (10%), and a test set (10%). During data preprocessing, different encoding methods, either one-hot encoding or simple encoding, were used to encode the categorical features in each sample. Principal component analysis (PCA) was then applied to reduce the dimensionality of the feature space and discard irrelevant features; only the top n components explaining 85% of the variance were kept. The four preprocessing regimes are therefore one-hot encoding, one-hot encoding with PCA, simple encoding, and simple encoding with PCA. These four methods were evaluated by their corresponding random forest classification accuracy, and the one with the highest accuracy on the validation set was chosen as the final preprocessing regime.
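Keeping the top components that explain 85% of the variance maps directly to scikit-learn's PCA, which accepts a float in (0, 1) as the target explained-variance fraction. The sketch below uses a random correlated matrix standing in for the encoded features.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the encoded feature matrix; the mixing matrix
# makes the columns correlated so PCA has variance to compress.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30)) @ rng.normal(size=(30, 30))

# A float n_components keeps the smallest number of leading components
# whose cumulative explained variance reaches that fraction (85% here).
pca = PCA(n_components=0.85).fit(X)
X_reduced = pca.transform(X)
print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```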
Four hyperparameters of the random forest classifier were then tuned using the training and validation sets. The number of estimators is the number of decision trees in the forest, where each tree gives a classification result and votes for the most likely class. The maximum depth of each tree is another parameter that controls model complexity and the bias-variance tradeoff. The other two parameters are the minimum number of samples at each leaf node and the minimum number of samples required to split a node. For each parameter, a wide range of values was tried, and the best value was selected based on model accuracy.

Results

KNN
Because the accuracy with the Euclidean metric was clearly higher than with the Manhattan distance, the models using the Euclidean metric were taken to outperform those using the Manhattan distance.

Support Vector Machine
Firstly, linear SVM was used. As the dataset has 11 classes, we used the one-versus-one method, which achieved an accuracy of 0.35003. We then tried different kernel functions. According to Figure 6, the Gaussian radial basis function (RBF) gives the highest accuracy and is the best kernel. We then checked for overfitting: according to Figure 7, the model does not overfit, because its performance does not change as the sample size increases. Overall, the best model uses the Gaussian radial basis function with C=1 and gamma='scale', and its final accuracy is 0.3525.
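The one-versus-one scheme trains K(K−1)/2 binary classifiers, one per pair of classes, so the 11 LOS classes yield 11·10/2 = 55 classifiers. A sketch with a linear SVM on synthetic 11-class data (the data and feature count are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

# Toy data with 11 classes, mirroring the 11 LOS intervals.
X, y = make_classification(n_samples=1100, n_features=10, n_informative=8,
                           n_classes=11, n_clusters_per_class=1,
                           random_state=0)

# One-vs-one fits K*(K-1)/2 binary classifiers: 11*10/2 = 55 here.
ovo = OneVsOneClassifier(LinearSVC(dual=False)).fit(X, y)
print(len(ovo.estimators_))  # 55
```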

Random Forest Classifier
The classification accuracies of the random forest classifier under the different preprocessing techniques are very similar. As shown in Figure 8, while the classification accuracy for each preprocessing method increased with the number of decision trees, there was less than a 2% accuracy difference across encoding methods. Simple encoding with PCA was chosen as the preprocessing regime because it yielded the highest classification accuracy even when the number of estimators was low.

Figure 8. Comparison of different preprocessing regimes
During hyperparameter tuning, the model was found to be insensitive to the parameter values. Figure 9 shows that the variation in classification accuracy was below 3% for all four hyperparameters tuned. The final model uses number of estimators = 100, max depth = 10, min samples split = 20, and min samples leaf = 20. The test accuracy is 0.354 and the training accuracy is 0.356, so no overfitting is observed.
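The final configuration above maps directly onto scikit-learn's RandomForestClassifier; the sketch below uses synthetic data in place of the healthcare dataset, so the printed accuracies are not the paper's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in; the hyperparameter values are those selected above.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_depth=10,
                            min_samples_split=20, min_samples_leaf=20,
                            random_state=0).fit(X_tr, y_tr)
# Comparable train/test accuracy is the sign of no overfitting.
print(rf.score(X_tr, y_tr), rf.score(X_te, y_te))
```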

Discussion
According to Table 1, the accuracies of the three models are all about 0.35.
During hyperparameter tuning, we found that adjusting the hyperparameters could not improve the accuracy much; at best, accuracy improved by about four percentage points. We also tried different encodings, one-hot encoding and simple label encoding, and saw no significant change in accuracy. We therefore believe that the achievable accuracy on this dataset is only about 0.35 and cannot be significantly improved, which we attribute to the features of the dataset. According to Figure 10, only one feature has a strong correlation with stay, and the other features have very little relevance to it. The dataset is also unbalanced, as shown in Table 2: class 0 contains a huge amount of data, while classes 6 and 9 contain very few samples. From the above, we believe that the features contained in this dataset are not the main determinants of LOS, and that the division of stay into 11 classes is not reasonable.
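The imbalance and correlation checks described above are straightforward in pandas; the sketch below uses a hypothetical slice of the label column (the real analysis reads the full AV: Healthcare Analytics II CSV).

```python
import pandas as pd

# Hypothetical encoded LOS labels standing in for the "Stay" column.
stay = pd.Series([0, 0, 0, 1, 2, 0, 1, 3, 0, 0])

# Class counts reveal the imbalance (class 0 dominates, as in Table 2).
print(stay.value_counts())

# On the full frame, per-feature correlation with the label (as in
# Figure 10) would be: df.drop(columns="Stay").corrwith(df["Stay"])
```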

Conclusion
In this project, machine learning is applied to a real-world medical resource allocation problem during the COVID-19 pandemic. By predicting each patient's estimated hospital stay as a small 10-day interval, doctors and administrators could plan medical equipment and human resources accordingly. However, the highest classification accuracy achieved by our models, 35.41%, is too low for them to be useful as a practical tool. We propose several ways to improve the prediction model in the future. Firstly, since the classes are highly imbalanced, with most patients spending fewer than 40 days in hospital, the classification labels could be reformatted into fewer classes concentrating on stays under 40 days. Secondly, the feature space of the raw dataset could be expanded to include more relevant patient features, such as body temperature. Extra features could be better predictors of patients' length of stay, since coronavirus infection symptoms are still understudied due to the novelty of the disease.