Analysis of TCM Diagnosis and Treatment of Thyroid Diseases Based on Data Mining

Thyroid disease is the most common endocrine disease. In recent years, its incidence has risen with gradual changes in the ecological environment and the gradual popularization of iodine supplementation for iodine deficiency disorders. At present, the prevalence of hyperthyroidism is 1.3%, that of hypothyroidism is 6.5%, and that of thyroid nodules is 18.6%; 2%-5% of thyroid nodules are thyroid cancer. Thyroid disease has become a common, high-incidence disease in modern society, and in China it ranks second only to hypertension in number of patients. In traditional Chinese medicine (TCM), thyroid disease belongs to the category of gall disease. Compared with Western medicine, TCM has certain advantages in treating gall disease: it can better improve clinical symptoms and reduce toxicity. In this paper, Stata and Python were used to clean and analyze TCM medical records of thyroid diagnosis and treatment. Principal component analysis and factor analysis were applied to the symptoms and pathogenesis recorded in the medical records, and the prescription data were quantified. Principal component analysis was also used for feature selection to reduce the dimensionality of the data set. Because different kinds of prescriptions are used during treatment, an association analysis algorithm from data mining was used to mine association rules between prescription medication and complications of thyroid disease, providing a reference for drug selection in clinical treatment and for disease prevention in TCM. Finally, logistic regression was applied to the clinical TCM data to build a prediction model for TCM prescriptions in thyroid disease.


Introduction
Thyroid diseases are common in endocrinology departments, and their incidence is increasing year by year. They mainly include hyperthyroidism, hypothyroidism, thyroiditis, thyroid nodules and thyroid tumors. According to epidemiological data on thyroid diseases among urban residents in China, the prevalence of hypothyroidism and thyroid nodules is 17.8% and 18.6%, respectively, and that of hyperthyroidism is 1.6% [1][2][3]. More than 200 million patients in China have thyroid dysfunction, and the prevalence is rising rapidly. Modern medicine mainly treats thyroid diseases with anti-thyroid drugs, hormone replacement, radiation therapy or surgery. Anti-thyroid drugs or hormones can quickly bring thyroid function indices into the normal range, but relapse after drug withdrawal is common and the side effects are considerable, so the clinical symptoms of thyroid disease do not improve significantly. According to traditional Chinese medicine, thyroid disease belongs to the category of "gall disease", whose scope is very wide. It is caused by qi stagnation, phlegm obstruction and blood stasis accumulating in front of the laryngeal prominence in the neck. The typical manifestation is swelling or mass formation in the front of the neck [4][5][6]. In recent years, Chinese scholars have reported a large number of research results on the treatment of thyroid diseases with traditional Chinese medicine [5], which indicate that traditional Chinese medicine has advantages in regulating endocrine function, regulating immunity, eliminating antibodies and stabilizing the pituitary gland. Among them, the curative effect achieved by renowned TCM physicians in the diagnosis and treatment of thyroid diseases is particularly significant.
In this paper, medical records of TCM diagnosis and treatment of thyroid disease are cleaned and mined. Principal component analysis and factor analysis are used to analyze the symptoms and pathogenesis in the records, and a disease prediction model is established to guide clinical medication.

Research Object
The data set comprises 998 medical records collected from Guoyitang of Nanjing University of Traditional Chinese Medicine. Each record includes age, gender, TCM diagnosis, Western medicine classification, Western medicine diagnosis, clinical manifestations, tongue coating, pulse condition, physical and chemical examination, pathogenesis, treatment principle, prescription, remarks and diagnosis time.

Research Methods
The medical records were imported into PyCharm, and the frequency of drug use among the 998 patients who met the inclusion criteria was counted with Python 3.7. The data were then imported into Stata 16 for further analysis. The top 15 herbs were selected by frequency of use. Principal component analysis, correlation analysis and factor analysis were used to analyze and clean the symptoms and pathogenesis in the medical record data, and the prescription data were quantified. Feature selection on the data set was performed by principal component analysis to reduce the data dimension [6][7][8].
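The frequency count described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the function name `herb_frequencies`, the list-of-lists record layout, and the toy records (including the placeholder names `herb_c` and `herb_d`) are assumptions made for the example.

```python
from collections import Counter

def herb_frequencies(prescriptions, top_n=15):
    """Count in how many prescriptions each herb appears; return the top_n."""
    counts = Counter()
    for herbs in prescriptions:
        counts.update(set(herbs))  # set(): count a herb once per record
    total = len(prescriptions)
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    return [(herb, n, n / total) for herb, n in ranked]

# Toy records, made up for illustration; these are not the study data.
toy_records = [
    ["Prunella vulgaris", "Pinellia ternata", "herb_c"],
    ["Prunella vulgaris", "Pinellia ternata"],
    ["Prunella vulgaris", "herb_c"],
    ["Pinellia ternata", "herb_d"],
]

for herb, n, share in herb_frequencies(toy_records, top_n=3):
    print(f"{herb}: {n} records ({share:.0%})")
```

Counting each herb once per record makes the share directly comparable to a statement such as "used in about 75% of cases".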

Single Drug Analysis
Rhizoma Pinelliae and Prunella vulgaris are the most commonly used drugs in clinical practice: they were used 748 times, accounting for about 75% of the total cases. Prunella vulgaris can clear liver fire, calm liver Yang, soothe liver depression and disperse phlegm nodules. Used together, the two drugs complement each other, one descending and one dispersing, one cold and one warm. The combination clears heat and resolves phlegm, and it is used to treat insomnia caused by liver qi stagnation that over time transforms into fire with phlegm and heat binding together, as well as phlegm nodules and gall tumors. The drugs that occur most frequently are shown in Table 1.

Drug Association Analysis
Using Python, the original data in the traditional Chinese medicine prescriptions were cleaned and standardized, and the compatibility relationships between drugs were mined with an association rule model based on the Apriori algorithm. The compatibility relationship is as follows:
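For readers unfamiliar with association rule mining, the idea behind Apriori can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the function names `apriori` and `derive_rules`, the thresholds, and the toy prescriptions (with placeholder names `herb_c` and `herb_d`) are all hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    sets = [frozenset(t) for t in transactions]
    n = len(sets)
    current = {frozenset([item]) for t in sets for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep the size-k candidates that meet the support threshold.
        level = {}
        for cand in current:
            support = sum(cand <= t for t in sets) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        # Join step: (k+1)-candidates whose k-subsets are all frequent.
        k += 1
        current = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                current.add(union)
    return frequent

def derive_rules(frequent, min_confidence):
    """List (antecedent, consequent, support, confidence) tuples."""
    found = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                ante = frozenset(ante)
                confidence = support / frequent[ante]
                if confidence >= min_confidence:
                    found.append((ante, itemset - ante, support, confidence))
    return found

# Toy prescriptions, made up for illustration; not the study data.
toy = [
    {"Prunella vulgaris", "Pinellia ternata", "herb_c"},
    {"Prunella vulgaris", "Pinellia ternata"},
    {"Prunella vulgaris", "herb_c"},
    {"Pinellia ternata", "herb_d"},
]
frequent = apriori(toy, min_support=0.5)
found = derive_rules(frequent, min_confidence=0.6)
for ante, cons, support, confidence in found:
    print(sorted(ante), "->", sorted(cons), f"{support:.2f}", f"{confidence:.2f}")
```

Support is the share of prescriptions containing every herb in an itemset, and confidence is the support of the whole itemset divided by the support of the antecedent. The thresholds 0.5 and 0.6 are arbitrary choices for the toy data.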

Principal Component Analysis
The variables are correlated to some degree. When two variables are correlated, the information they carry about the subject overlaps. Principal component analysis removes this redundancy among closely related variables by constructing as few new variables as possible, such that the new variables are mutually uncorrelated while retaining as much of the original information about the subject as possible.
Principal component analysis (PCA) [9][10][11] is a statistical method that combines the original variables into a new group of mutually uncorrelated composite variables; according to practical needs, a small number of these composite variables can then be retained to reflect as much of the information in the original variables as possible. Many kinds of symptoms were encountered during modeling, so PCA was used to reduce the dimension of the 164 symptom variables to facilitate subsequent analysis.
In the output, Eigenvalue is the eigenvalue of each component, Proportion is its contribution rate, and Cumulative is the cumulative contribution rate. The results are shown in Table 2. The prediction results are shown in Figure 1.
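The three quantities named above can be computed as in the following sketch. It assumes NumPy is available and that the correlation matrix is analysed; the function names `pca_summary` and `n_components_for`, the 0.85 threshold, and the randomly generated toy matrix are illustrative assumptions, not the study's data or settings.

```python
import numpy as np

def pca_summary(X):
    """Eigenvalues of the correlation matrix and their contribution rates."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)            # columns are variables
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]   # sorted in descending order
    proportion = eigenvalues / eigenvalues.sum()   # contribution rate
    cumulative = np.cumsum(proportion)             # cumulative contribution rate
    return eigenvalues, proportion, cumulative

def n_components_for(cumulative, threshold=0.85):
    """Smallest number of components whose cumulative rate reaches threshold."""
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))

# Toy 0/1 symptom indicators, randomly generated for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(50, 6))
X_toy[:, 1] = X_toy[:, 0]           # make column 1 largely redundant
X_toy[:5, 1] = 1 - X_toy[:5, 1]     # ...except for a few flipped rows

eigenvalues, proportion, cumulative = pca_summary(X_toy)
print(np.round(eigenvalues, 3))
print(np.round(cumulative, 3))
print(n_components_for(cumulative))
```

Because the correlation matrix has ones on its diagonal, the eigenvalues sum to the number of variables, so each contribution rate is simply the eigenvalue divided by that number.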

Logistic Regression Analysis
Logistic regression is a generalized linear model [12], so it has much in common with multiple linear regression. Both models share the linear form $w^{T}x + b$, in which $w$ and $b$ are parameters to be determined; the difference lies in their dependent variables. Multiple linear regression takes $w^{T}x + b$ directly as the dependent variable, i.e., $y = w^{T}x + b$, whereas logistic regression uses a function $L$ to map $w^{T}x + b$ to a hidden state $p$, $p = L(w^{T}x + b)$, and then determines the value of the dependent variable by comparing $p$ with $1 - p$. If $L$ is the logistic function, the model is logistic regression; if $L$ is a polynomial function, it is polynomial regression. Logistic regression was performed for the 16 most frequently used drugs [13]. The results are shown in Table 3. Using a predicted probability of at least 0.5 as the criterion for drug use, 275 prescriptions were correctly classified as using the drug. In another 19 prescriptions, the predicted probability was less than 0.5 and the drug was indeed not used. The overall classification accuracy is therefore (275 + 19)/343 = 294/343 = 85.71%. The table also reports conditional probabilities such as sensitivity, i.e., the percentage of medicated cases with a predicted probability greater than or equal to 0.5 (275 out of 280, or 98.21%).
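The decision rule and the accuracy arithmetic in this section can be checked with a short sketch. The counts 275, 19, 280 and 343 are taken from the text; the false-negative and false-positive counts are derived from them by subtraction and are not stated in the source. The function names, and the weights and inputs used in the usage line, are hypothetical.

```python
import math

def logistic(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_use(weights, bias, x, threshold=0.5):
    """Return (p, decision): the drug is predicted as used when p >= threshold."""
    p = logistic(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return p, p >= threshold

def classification_summary(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# From the text: 275 correct "used", 19 correct "not used",
# 280 medicated cases, 343 cases in total.
tp, tn, medicated, total = 275, 19, 280, 343
fn = medicated - tp                 # derived, not stated in the text
fp = (total - medicated) - tn       # derived, not stated in the text

summary = classification_summary(tp, fn, tn, fp)
print(f"accuracy    = {summary['accuracy']:.2%}")
print(f"sensitivity = {summary['sensitivity']:.2%}")

# Hypothetical weights and inputs, only to show the 0.5 decision rule.
print(predict_use(weights=[0.8, -0.4], bias=0.1, x=[1, 1]))
```

The high sensitivity alongside a lower overall accuracy is what one would expect when the drug is used in most cases, so accuracy should be read together with the per-class rates.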

Result Discussion
Prunella vulgaris is bitter and pungent in taste and cold in nature. Its function is to clear the liver, purge fire and disperse swelling, and it has been one of the important medicines for treating gall disease since ancient times. "Ben Jing" refers to "dispersing gall and forming Qi", while "Materia Medica Congxin" states that it "governs scrofula, rat fistula and gall tumor". Pinellia is pungent and slightly warm in nature, and its function is to remove dampness and dissipate phlegm. "Bie Lu" says that it can "eliminate carbuncle and swelling", and "Drug Record" says that it "can remove gall". Clinically, Pinellia ternata is favored because it excels at drying dampness while being relatively mild in warmth, which avoids consuming Qi and Yin. In the treatment of hyperthyroidism, Prunella vulgaris is good at clearing heat and purging fire, and Pinellia ternata is good at resolving phlegm and dispersing nodules. Together the two clear heat, resolve phlegm and disperse nodules, which matches the pathogenesis of hyperthyroidism, namely Yin deficiency with hyperactive fire as the root, together with qi stagnation, phlegm and blood stasis; the combination therefore has a good clinical effect. In addition, the cold and cool nature of Prunella vulgaris helps restrain the warm and dry nature of Pinellia ternata, avoiding damage to Qi and Yin, while the pungent, dispersing property of Pinellia ternata prevents the stagnation of Qi and blood that the cold and cool nature of Prunella vulgaris might cause.
The model in this study mainly predicts whether the drugs frequently used in the treatment of thyroid diseases are prescribed, and it offers some answers as to whether the corresponding drugs are associated with one another. It can thus provide a reference and assistance for doctors in the intelligent diagnosis and treatment of thyroid diseases.