An automatic grading system for electronic medical records with neural network

Automatically grading electronic medical records written by clinical researchers is an important task in healthcare research. In this paper, we present a convolutional neural network framework that grades medical records against a grading note. We regard the scoring process as a text pattern classification task over sentences mapped from grading rubrics to medical records. The framework involves two stages. The first stage matches key medical concepts between the grading note and the medical record. The second stage is text pattern classification, which predicts whether each key concept in the grading note is correct, missing or incorrect. The results show that our neural network model performs better than traditional machine learning grading methods. Our system makes considerable progress in text pattern classification accuracy and performs much better than baselines in the grading process.


Introduction
Automatically grading electronic medical records (EMRs) for clinical researchers has attracted much research attention in the healthcare community. EMRs help doctors further understand patients' health conditions, identify their best-matching patient category and determine customized treatment plans [1][2]. Unfortunately, it takes a lot of time to train a recorder before he or she can write an excellent EMR. The most important part of this process is grading the EMRs written by recorders.
Each EMR is collected from the conversation between a doctor and a patient. It usually contains various types of information, including the patient's personal information, past medical history, present illness, symptoms, etc. A good EMR should include all the important information, known as key medical concepts. The grading task judges whether a recorder captured these key medical concepts. As shown in figure 1, each medical concept is called a grading rubric. If a grading rubric can be found correctly in the corresponding EMR, it is labelled C; similarly, I represents incorrect and M represents missing.
A quick approach, the word-count based model, first scans all the words in the grading rubric and the EMR and scores according to the proportion of words common to both texts. However, ignoring semantic information reduces the accuracy of scoring. Thus, semantic distance representations between two texts have been developed. The two most common ways to represent texts are as a bag of words (BOW) [3] or by term frequency-inverse document frequency (TF-IDF) [4]. Furthermore, the Word Mover's Distance (WMD) [5] was developed to improve on these approaches. These methods can reflect EMR quality to a certain degree but are interfered with by the large amount of irrelevant information in EMRs. Also, they cannot map a grading rubric exactly to the corresponding sentence in the EMR. In this paper, we build an automatic grading system for EMRs. It first matches each grading rubric to the sentence in the EMR with the closest semantic relatedness. Then, a convolutional neural network (CNN) is established to classify the concatenated text patterns. We score the EMR by evaluating the number of C labels as a percentage of the total number of labels in the grading note.

General framework of grading system
As shown in figure 2, the automatic grading system (auto-grader) treats the grading process as a two-stage pipeline. The first stage is grading rubric matching. We split each grading note into a list of grading rubrics and compute sentence-level vectors for them. Meanwhile, we obtain sentence vectors for the EMR by the same method. Then, we map each sentence embedding of a grading rubric to the EMR by calculating the relatedness score of each pair of sentences. The second stage is text pattern classification. We present a slight variant of the CNN architecture to predict whether each key concept is correct, missing or incorrect. The score is calculated from the prediction results.

Grading rubric matching

Sentence embedding
The most important part of grading rubric matching is computing sentence embeddings from word vectors. Supervised learning methods usually train a neural network and use the last hidden layer as the sentence embedding [6]. However, these methods require additional annotated data. Here, we obtain the sentence embedding v_s by a weighted average of the word embeddings v_w, similar to Sanjeev Arora's work [7].
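The weighted average above can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the smooth-inverse-frequency weight a/(a + p(w)) follows the spirit of [7], and the function names, the weight constant a, and the tiny vocabulary are our own assumptions.

```python
def sentence_embedding(words, word_vecs, word_freq, a=1e-3):
    """Sentence embedding v_s as a weighted average of word embeddings v_w.

    Rarer words receive a higher weight a / (a + p(w)), where p(w) is the
    estimated unigram probability of word w (in the spirit of [7]).
    Illustrative sketch only; names and the constant `a` are assumptions.
    """
    dim = len(next(iter(word_vecs.values())))
    total = sum(word_freq.values())
    vec = [0.0] * dim
    n = 0
    for w in words:
        if w not in word_vecs:
            continue  # out-of-vocabulary words are skipped in this sketch
        p_w = word_freq.get(w, 1) / total   # estimated unigram probability
        weight = a / (a + p_w)              # smooth inverse frequency weight
        for i, x in enumerate(word_vecs[w]):
            vec[i] += weight * x
        n += 1
    return [x / max(n, 1) for x in vec]
```

In practice the word frequencies would come from a large corpus and the word vectors from the pretrained embeddings described later; here both are stand-ins.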

Semantically related sentence extraction and mapping
In order to judge whether a grading rubric is correct, incorrect or missing, we need to find out whether there is a sentence in the EMR with exactly the same meaning as the grading rubric. Sentence similarity is a basic way to compare the semantic relatedness of a pair of sentences. Here, we use cosine similarity to measure the relatedness score between a grading rubric and a candidate sentence in the EMR. Let v_g = [x_1, x_2, x_3, ⋯, x_n] be the embedding of the grading rubric and v_r = [y_1, y_2, y_3, ⋯, y_n] be a sentence embedding in the EMR. The cosine similarity is

cos(v_g, v_r) = (Σ_{i=1}^{n} x_i y_i) / (√(Σ_{i=1}^{n} x_i²) · √(Σ_{i=1}^{n} y_i²))   (1)
For each v g , we select sentence embedding v r in EMR with the highest score and map them together.
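The matching step can be sketched directly from equation (1): score every EMR sentence against the rubric and keep the argmax. The function names are our own; the logic is the standard cosine-similarity nearest match.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors, as in equation (1)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def best_match(rubric_vec, emr_sentence_vecs):
    """Map a grading rubric embedding v_g to the EMR sentence embedding v_r
    with the highest relatedness score; returns (index, score)."""
    scores = [cosine(rubric_vec, v) for v in emr_sentence_vecs]
    idx = max(range(len(scores)), key=scores.__getitem__)
    return idx, scores[idx]
```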

Grading rubric labelling and scoring

Convolutional neural network for label prediction
The model architecture is similar to the CNN architecture of Yoon Kim's work [8]. Let x_i ∈ R^k be the k-dimensional word vector corresponding to the i-th word in the grading rubric and y_i ∈ R^k be the k-dimensional word vector corresponding to the i-th word in the most related sentence in the EMR. Let the maximum sentence length be n (sentences shorter than n are padded); the concatenated text pattern can then be represented as

T_{1:2n} = x_1 ⊕ ⋯ ⊕ x_n ⊕ y_1 ⊕ ⋯ ⊕ y_n   (2)

where ⊕ is the concatenation operator. In this paper, T_{i:i+j} refers to the concatenation of words t_i, t_{i+1}, ⋯, t_{i+j}. The input layer is therefore a matrix of word vectors corresponding to the words in the text pattern. There are 2n words in the text pattern, so the matrix is 2n × k.
This type of matrix can be static or non-static, following [9][10]. We use pretrained embeddings and keep the matrix static. In the convolutional layer, a filter w ∈ R^{hk} is applied to a window of h words to produce a new feature. For example, a feature α_i is generated from a window of words t_{i:i+h−1} by

α_i = f(w · t_{i:i+h−1} + b)   (3)

Here b ∈ R is a bias term and f is a rectified linear unit (ReLU). This filter is applied to each possible window of words in the text pattern {t_{1:h}, t_{2:h+1}, ⋯, t_{2n−h+1:2n}} to produce a feature map α = [α_1, α_2, ⋯, α_i, ⋯, α_{2n−h+1}], with α ∈ R^{2n−h+1}.
We then apply a pooling layer. Max-over-time pooling [8] extracts the maximum value α̂ = max{α} from the one-dimensional feature map α as the feature for this filter. Just as one feature is extracted per filter, the model can use multiple filters with different window sizes to obtain multiple features. The pooling layer outputs these one-dimensional features to a fully connected softmax layer. As in traditional neural networks, the output layer is

y = w · (z ∘ r) + b   (4)

Here z = [α̂_1, α̂_2, ⋯, α̂_m], where m is the number of filters. The softmax layer produces the probability distribution over the final classification labels and can be configured to the needs of the task. We employ dropout to prevent the network from overfitting: ∘ is the element-wise multiplication operator and r ∈ R^m is a vector of Bernoulli random variables, each of which is 1 with a certain probability and 0 otherwise.
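A single forward pass through the layers just described, for one filter width h, might look as follows. This is a NumPy sketch under our own assumptions (shapes, variable names, a single filter width, dropout disabled via p_keep=1.0 at inference), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_forward(T, filters, biases, W_out, b_out, h=3, p_keep=1.0):
    """One forward pass of a Kim-style text CNN over a concatenated pattern.

    T:       (2n, k) matrix of stacked word vectors (the input layer).
    filters: (m, h*k) matrix of m convolution filters over windows of h words.
    Returns the softmax class probabilities. Illustrative sketch only.
    """
    two_n, k = T.shape
    m = filters.shape[0]
    # Convolution (eq. 3): one ReLU feature per window position per filter.
    windows = np.stack([T[i:i + h].reshape(-1) for i in range(two_n - h + 1)])
    feat_map = np.maximum(windows @ filters.T + biases, 0.0)
    # Max-over-time pooling: one value per filter -> z = [a_1, ..., a_m].
    z = feat_map.max(axis=0)
    # Dropout mask r of Bernoulli variables (eq. 4); all ones when p_keep=1.
    r = (rng.random(m) < p_keep).astype(float)
    logits = W_out @ (z * r) + b_out
    # Softmax over the class labels.
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

A full model would use several filter widths (e.g. 3, 4 and 5) and learn the filters, biases and output weights by backpropagation; this sketch shows only the data flow.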

Grading electronic medical records
The percentage of concepts from the conversation between doctor and patient that the recorder captured determines the score of each EMR. For each generated text pattern, we count the occurrences of each label and take the percentage of label C as the final score of the EMR. As equation (8) shows,

Score = num_C / (num_C + num_I + num_M)   (8)

where Score is the final score of the EMR and num_i (i = C, I, M) is the number of grading rubrics with each label.
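The scoring rule reduces to a few lines of code. A minimal sketch, with the function name being our own choice:

```python
def grade(labels):
    """Score an EMR as the fraction of its grading rubrics labelled C,
    i.e. num_C / (num_C + num_I + num_M)."""
    n = len(labels)
    return labels.count("C") / n if n else 0.0
```

For example, a grading note whose rubrics were predicted ["C", "C", "I", "M"] would receive a score of 0.5.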

Dataset details
The EMRs were collected from our partner hospital, and the data permit only noncommercial or academic use. The grading rubrics and their labels were produced by several doctors. The data contain two parts. The first is the EMRs written by recorders. These EMRs are based on conversations between clinicians and patients; they contain many abbreviated messages and incomplete expressions from oral information. The second part is the grading notes written by doctors. A grading note contains the grading rubrics that a recorder must record while listening to the conversation between clinician and patient. We manually went through every grading rubric and labelled it with {C, I, M} for correct, incorrect or missing.
There are 1429 grading note and EMR pairs. Each grading note contains dozens to hundreds of rubrics. First, we match every rubric with its most related sentence in the EMR. This process generates concatenated text patterns of two sentences with labels. We then use machine learning approaches to predict the labels of the concatenated text patterns. In total, about 80000 concatenated text patterns serve as training data for the next stage.

Baseline model
To establish a baseline performance, we use a weighted average of the word vectors to obtain sentence embeddings. We then combine each sentence embedding pair from the grading rubric and the EMR by calculating their cosine similarity. The label prediction problem can then be regarded as a vector classification problem. We train a k-nearest neighbour classifier (KNN), a logistic regression classifier (LR), a decision tree (DT) and a multi-layer neural network.
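As one concrete example of such a baseline, a nearest-neighbour classifier over the feature vectors can be sketched in a few lines. This is our own minimal stand-in for the KNN baseline (in practice one would use a library implementation such as scikit-learn's):

```python
def knn_predict(x, train_X, train_y, k=3):
    """Predict the label of feature vector x by majority vote among the
    k nearest training vectors (squared Euclidean distance).
    Minimal sketch of the KNN baseline; names are our own."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(x, train_X[i])))
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)
```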

Hyperparameters and training
We initialize word embeddings using publicly available vectors trained on 100 billion words from Google News. The word embeddings are 300-dimensional vectors trained with the continuous bag-of-words architecture [11]. Words not present in the set of pre-trained embeddings are initialized randomly. The class labels can be divided into two categories: 0 for C and 1 for M and I. For parameter settings, we follow Yoon Kim's work [10], which uses filter windows (h) of 3, 4 and 5 with 128 feature maps each. We set a dropout rate of 0.5 and a default batch size of 64. We select 80% of the text patterns as the training set and draw the test set from the remaining 20%. Training is done with Adam. Cross-entropy loss measures the performance of our CNN model: we calculate a separate loss for each text pattern and sum the results.
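The summed cross-entropy objective described above can be written out explicitly. A minimal sketch, assuming the model outputs a softmax probability distribution per text pattern and labels are given as class indices:

```python
import math

def cross_entropy(probs, label_idx):
    """Cross-entropy of one predicted distribution against its true label.
    The epsilon guards against log(0); its value is our own choice."""
    return -math.log(max(probs[label_idx], 1e-12))

def total_loss(batch_probs, batch_labels):
    """Sum the per-text-pattern losses, as described in the text."""
    return sum(cross_entropy(p, y)
               for p, y in zip(batch_probs, batch_labels))
```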

Result and discussion
We compare our method with widely used text pattern classification methods. The experimental results in table 1 show that our auto-grader with CNN outperforms the traditional methods in predicting the three labels in the EMR grading task. This demonstrates that the CNN-based approach can effectively compose the semantic representation of texts.
We use the root-mean-square error to measure the bias of the grading results:

RMSE = √( (1/N) Σ_{i=1}^{N} (real_score_i − pre_score_i)² )

where real_score_i is the real score of the i-th EMR, pre_score_i is the score predicted by our auto-grader, and N is the number of EMRs to be graded.
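The RMSE of the grading results is computed directly from the paired real and predicted scores; a minimal sketch:

```python
import math

def rmse(real_scores, pred_scores):
    """Root-mean-square error between real and predicted EMR scores."""
    n = len(real_scores)
    return math.sqrt(sum((r - p) ** 2
                         for r, p in zip(real_scores, pred_scores)) / n)
```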

Conclusion
In the present work we have described an automatic grading system for EMRs with convolutional neural networks. Our method not only grades the EMRs through key medical concepts but also matches each key concept with the corresponding sentence, if it exists, in the EMR. We evaluated our work against other models such as the word-count based model, logistic regression and the k-nearest neighbour classifier. The results show that our automatic grading system performs better than the machine learning baselines. We believe our CNN-based auto-grader will bring more inspiration to real-world machine learning applications.