The prevalence of Helicobacter pylori in referral population of Turkey

Helicobacter pylori infection is commonly associated with gastroduodenal diseases in humans, such as chronic gastritis, peptic ulcers, gastric mucosa-associated lymphoid tissue lymphoma, and even gastric cancer. These diseases impose a high treatment cost on society and cause many deaths when the prevalence of the infection is not known early. In this work we propose a forecasting model to predict the infection prevalence. Based on our results, society can take simple early preventive measures against the infection, which reduce the cost of treatment and can save many lives worldwide.


1. Introduction
H. pylori causes a persistent infection that produces inflammation and leads to peptic ulcer, chronic gastritis, gastric mucosa-associated lymphoid tissue lymphoma and gastric cancer in the human stomach [1], [2], [3]. According to GLOBOCAN 2018 data, stomach cancer is the third deadliest cancer, with an estimated 783,000 deaths in 2018 [4]. Moreover, the infection affects approximately 50% of the world's population [1]. It imposes a high cost on society and poses a serious risk to human life.
To limit the spread of the infection, it is crucial to estimate its prevalence early by using forecasting models. The literature contains forecasting models for infections such as malaria, scarlet fever and chickenpox that combine big data and neural networks [5]. There are also studies that base predictions on environmental factors, which have a great impact on the prevalence of infections. For instance, [6] built a time series model based on eight climate variables to predict hand, foot and mouth disease. In addition, [1] showed that average daily sunshine time correlates positively with H. pylori infection. Based on these studies, we conclude that climate variables can be used to obtain more accurate and efficient predictions of infection prevalence.
However, the existing literature does not yet contain a prediction model for H. pylori infection prevalence that takes climate variables into account. Therefore, the main purpose of this study is to design a forecasting model for H. pylori (Hp) infection prevalence based on Humidity (H), Dew Point (DP), Temperature (T), Wind Speed (W) and Pressure (P). The forecasting model is a multivariate linear regression (MVLR) model.
We derived a prediction of H. pylori infection prevalence by forecast modelling. Based on the results of our prediction, infectious disease specialists can issue early recommendations about the spread of the infection. This gives a chance to put prevention procedures in place, which not only reduces the prevalence of the infection but also minimizes social costs and saves many lives.
(* Binary attributes take only the value 0 or 1.)
Weather data (WD), including H (%), DP (°F), T (°F), P (Hg) and W (mph), were obtained as daily averages from the historical records on the https://www.wunderground.com/history website and joined with the visitor date attribute. The joined data were transformed to weekly (253 rows) and monthly (
By minimizing the residual sum of squares, we determined the best fit of the 1−16 and 1−5 vector parameters to the data in table 1. The residual for each observed data point is the difference between the observed and the model-predicted value, where the observed value is the total number of H. pylori infected per month. We used the nonlinear least-squares solver of SciPy in Python from scipy.org.
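The least-squares fit described above can be sketched with `scipy.optimize.least_squares`. This is only an illustration under stated assumptions: the data are synthetic, and the linear form in `predict` with four coefficients is a hypothetical stand-in, since the paper's model (1) and its parameter vectors are not reproduced here.

```python
import numpy as np
from scipy.optimize import least_squares

# Synthetic stand-in data (NOT the hospital data): monthly means of
# humidity (H), dew point (DP) and temperature (T) for 42 months.
rng = np.random.default_rng(0)
n_months = 42
H = rng.uniform(50.0, 90.0, n_months)
DP = rng.uniform(30.0, 65.0, n_months)
T = rng.uniform(40.0, 85.0, n_months)
true_beta = np.array([20.0, 0.5, -0.3, 0.8])  # made-up coefficients


def predict(beta, H, DP, T):
    """Hypothetical linear form; an assumption, not the paper's model (1)."""
    return beta[0] + beta[1] * H + beta[2] * DP + beta[3] * T


# Made-up monthly infected counts: model output plus Gaussian noise.
y = predict(true_beta, H, DP, T) + rng.normal(0.0, 1.0, n_months)


def residuals(beta, H, DP, T, y):
    """Observed minus predicted value for every month."""
    return y - predict(beta, H, DP, T)


# least_squares minimizes 0.5 * sum(residuals**2) over beta.
fit = least_squares(residuals, x0=np.zeros(4), args=(H, DP, T, y))
rss = float(np.sum(fit.fun ** 2))  # residual sum of squares at the optimum

print("estimated coefficients:", fit.x)
print("residual sum of squares:", rss)
```

A different model form only requires changing `predict` and the length of `x0`; the residual function and the solver call stay the same.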

Model assumptions
For hypothesis tests (such as the t-test and F-test) to be valid, and for the ordinary least squares (OLS) estimators to be the Best Linear Unbiased Estimators (BLUE), four basic assumptions must hold:
1. The relationship between the dependent variable and the independent variables is linear.
2. No autocorrelation of the residuals. We used the Durbin-Watson (DW) test to check for autocorrelation.
3. Homoscedasticity. Homoscedasticity is the term used for the "constant variance" assumption. The regression model assumes that the residuals have the same variance throughout. When this assumption is violated, the problem is called "heteroscedasticity", or changing variance. We used the Breusch-Pagan and White tests to check for heteroscedasticity.
4. Normality of the residuals, with mean equal to zero. The errors need to follow a normal probability distribution. This makes no difference to the estimates of the coefficients or to the ability of the model to forecast, but it does affect the F-test, the t-test and the confidence intervals.

Model
The fitted formula is obtained by applying the optimal parameters to the proposed model (1). We obtained a multivariate regression model (MVRM) to predict the prevalence of the infection based on climate variables, where H. pylori is the dependent variable and H, DP and T are the independent variables. This formula reaches a coefficient of determination (COD) of 0.77 on the training data (42 months) and 0.75 on the test data (11 months).
By training on a subset of the data, we obtained a multivariate regression model whose coefficient of determination (R²) is 77% on the training data (42 months) and 75% on the test data (11 months) (figure 4). The adjusted R² is 75%, which is comparatively high and indicates a strong correlation between the observed values of the dependent variable and the values forecast by the regression model.
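As a worked check of these figures, the sketch below computes R² from observed and predicted values and applies the standard adjustment for the number of predictors. It assumes three predictors (H, DP, T) and the 42 training months; under that assumption, a training R² of 0.77 gives an adjusted value of about 0.75, in line with the figure reported above. The function names are illustrative.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / dof


# Assumed inputs: R^2 = 0.77, n = 42 training months, p = 3 predictors.
print(round(adjusted_r2(0.77, 42, 3), 3))  # 0.752
```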
The F statistic is 31.39 and, according to the ANOVA, its significance is reported as 0, which is below the critical level (p<0.001). The null hypothesis is therefore rejected, which verifies the significance of the model.

Model assumptions
The first assumption follows from the linearity of the proposed model in its 1−5 coefficients. From the DW test result of 1.717 we conclude that there is no autocorrelation in the residuals (figure 1.b). The third assumption is verified by the Breusch-Pagan test, whose null hypothesis of homoscedasticity was not rejected (p>0.05); the QQ plot in figure 1.a is consistent with this result. The last assumption also holds for the given model: the mean of the residuals is zero, and the Shapiro-Wilk and Anderson-Darling tests do not reject the null hypothesis of normality (p>0.05), which means the residuals can be treated as normally distributed (figure 1.c).
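For reference, the DW statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. It ranges from 0 to 4, and values near 2 (such as the reported 1.717) indicate little first-order autocorrelation. A minimal sketch follows; the residual values are made up for illustration and are not the study's residuals.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("all residuals are zero")
    return numerator / denominator


# Made-up residuals: alternating signs push the statistic above 2
# (negative autocorrelation); identical residuals push it to 0.
print(durbin_watson([1, -1, 1, -1]))  # 3.0
print(durbin_watson([1, 1, 1, 1]))    # 0.0
```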

Prediction Results.
The forecast results for the training and testing data are shown in figure 2, where the two periods are separated by a grey vertical line. The x-axis shows the date (month, year) and the y-axis shows the number of CLO. Actual data are shown in blue, training data in green and testing data in red. The forecast period runs from November 2002 to October 2003, which indicates that the forecasts are highly accurate for almost one year ahead.

4. Conclusions
We proposed a new non-linear multivariate regression model to predict H. pylori infection prevalence based on data from the Samatya hospital. The model is built from patterns in the behaviour of H. pylori infection prevalence in connection with the mean humidity, dew point and temperature. Our research showed that the forecasting model achieves more accurate results only when it uses combinations of the given climate variables. The results of this research can help to minimize social cost by predicting the prevalence of H. pylori infection. The proposed model supports precise predictive analysis of H. pylori infection prevalence for one year, based on the dynamics of climate variables. Given the importance of climate variables in forecasting H. pylori infection prevalence, we found a high correlation between the climate factors and the prevalence. The model gives highly accurate early forecasts, which hospitals or governments can use to take early preventive measures against the infection, since this is critical for saving lives and reducing the cost to society.