Selection of variables in a generalized linear mixed model for smokers in Jambi Province

Smoking is one of the health problems in Indonesia. Many factors lead a person to smoke, some originating from the individual and some from the environment. The statistical question that arises is how to choose the factors that most strongly influence whether a person smokes. These factors are the variables that will be used in modeling. This study aims to select the variables in a generalized linear mixed model using the Lasso penalty and the Boosting function, each fitted with the EM and REML algorithms. Respondents in this study were 160 smokers in Jambi Province. Based on the AIC value, the best model was obtained from variable selection with the Boosting function and the REML algorithm. The analysis shows that occupation, welfare level, and family members who smoke are the factors that influence smoking in Jambi Province.


Introduction
Mixed models, both generalized linear mixed models (GLMM) and linear mixed models, have been used in a wide range of applications. A mixed model combines fixed effects, which are the same for every observation or sample, with random effects, which apply to selected samples or groups of samples. Through the use of random effects, the linear mixed model is designed to handle repeated measures and other complex study designs. In addition, the GLMM can model data that do not follow the normal distribution. The direct relationship between the outcome and the predictors is redefined as a linear predictor that is connected to the expected value of the outcome via a "link" function. This link function, together with the variance function of the outcome, is determined by the chosen member of the exponential family of distributions [1]. Many methods can be used to select predictor variables in a linear mixed model [2]. One of them is the Lasso penalty (Least Absolute Shrinkage and Selection Operator, or L1 penalty) [1,3], which has the advantage of selecting variables by shrinking some coefficients to exactly zero. Variables whose coefficients are shrunk to zero can be removed from the model without affecting its ability to predict the outcome. The Lasso selects variables without requiring a threshold on coefficient size to decide which relationships matter, but it has limitations when dealing with many correlated variables [1].
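As a minimal illustration of how an L1 penalty sets coefficients to exactly zero, the sketch below applies the soft-thresholding operator, which is the closed-form Lasso solution for a linear model with orthonormal predictors. It is not the algorithm used for mixed models in this study; the function name `soft_threshold`, the coefficient values, and the penalty `lam = 1.0` are hypothetical and chosen only for illustration.

```python
def soft_threshold(z: float, lam: float) -> float:
    """Shrink z toward zero by lam; return exactly 0.0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized coefficients, for illustration only.
coefs = [2.5, -0.4, 0.9, -3.0]

# Apply the same penalty to every coefficient.
shrunk = [soft_threshold(b, lam=1.0) for b in coefs]

# Predictors whose coefficient survived the penalty are "selected".
selected = [i for i, b in enumerate(shrunk) if b != 0.0]

print(shrunk)    # [1.5, 0.0, 0.0, -2.0]
print(selected)  # [0, 3]
```

Coefficients smaller in magnitude than the penalty are dropped entirely, while the larger ones are kept but shrunk, which is why no separate cut-off on coefficient size is needed.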
In linear mixed models, several algorithms can be used to select variables with the Lasso penalty, including GLMMLasso, glmmLasso, PIRLS, MM, and Penalized-EM [4][5][6][7]. Another method for selecting variables in linear mixed models is Boosting [8,9], which works iteratively and can produce results that are very similar to those given by the Lasso. Several algorithms can be used to select variables with the Boosting function, including AdaBoost and Greedy [10][11][12].
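The iterative nature of Boosting can be sketched with component-wise L2 boosting for an ordinary linear model, a simpler stand-in for the likelihood-based boosting used in mixed models. At each step only the single predictor that most reduces the residual sum of squares is updated by a small step `nu`, so stopping early leaves some coefficients at exactly zero, which resembles Lasso-type selection. The function name, the toy design matrix, and the step size are assumptions made for this example.

```python
def componentwise_boost(X, y, steps, nu=0.1):
    """Component-wise L2 boosting: update one coefficient per step."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    resid = list(y)
    for _ in range(steps):
        best_j, best_b, best_gain = None, 0.0, 0.0
        for j in range(p):
            sxx = sum(X[i][j] ** 2 for i in range(n))
            sxr = sum(X[i][j] * resid[i] for i in range(n))
            b = sxr / sxx       # least-squares fit of the residual on x_j
            gain = b * sxr      # reduction in residual sum of squares
            if gain > best_gain:
                best_j, best_b, best_gain = j, b, gain
        if best_j is None:      # no predictor improves the fit
            break
        coef[best_j] += nu * best_b
        resid = [resid[i] - nu * best_b * X[i][best_j] for i in range(n)]
    return coef


# Toy data (hypothetical): three orthogonal predictors, y depends on the first two.
x0 = [1, 1, 1, 1, -1, -1, -1, -1]
x1 = [1, 1, -1, -1, 1, 1, -1, -1]
x2 = [1, -1, 1, -1, 1, -1, 1, -1]
X = [list(row) for row in zip(x0, x1, x2)]
y = [2.0 * a + 1.0 * b for a, b in zip(x0, x1)]

early = componentwise_boost(X, y, steps=5)    # stopped early
late = componentwise_boost(X, y, steps=100)   # run much longer

selected_early = [j for j, c in enumerate(early) if c != 0.0]
selected_late = [j for j, c in enumerate(late) if c != 0.0]
print(selected_early, selected_late)
```

In this toy example the number of boosting steps plays the role that lambda plays for the Lasso: fewer steps give a sparser model.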
A comparison of the glmer, GLMMLasso, glmmLasso, and bGLMM algorithms for variable selection shows that the glmmLasso algorithm gives the best results [5]. However, the results of this algorithm have not been compared with those of the Boosting function. Based on the description above, the purpose of this study is to select the factors that influence whether a person smokes, using the Lasso penalty and the Boosting function in a linear mixed model, with smokers in Jambi Province as the case study. The best model obtained from the variable selection process is then determined.

Methods
The research data were sourced from the 2017 Indonesian Demographic and Health Survey (IDHS), which contained 16,363 respondents. Respondents were aged over 17 years and were divided into adolescent, male, and female categories. The data used were 160 respondents from Jambi Province. The response variable in this study was smoking status (smoking or not smoking), and there were 14 predictor variables, with residence treated as a random effect. Smoking status was categorized based on the number of cigarettes smoked per day. The response variable is binary, and the predictor variables are numerical.
All variables are then subjected to selection using two methods, namely the Lasso penalty and the Boosting function. By setting the variable X3 (residence) as a random effect, a generalized linear mixed model is formed [5]:

$$\operatorname{logit}(\mu_{ij}) = \log\frac{\mu_{ij}}{1-\mu_{ij}} = \beta_0 + \sum_{k} \beta_k x_{ijk} + b_i,$$

where $\mu_{ij}$ denotes the expected value of the smoking response for respondent $j$ in residence $i$, and $b_i$ represents the random intercept for residence $i$. The model is fitted with a logistic link function because the response variable is assumed to follow a Bernoulli distribution.
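To make the role of the residence random intercept concrete, the following sketch evaluates the inverse logistic link for a single respondent. The function name `smoking_probability` and all numerical values are hypothetical; they are not estimates from this study.

```python
import math


def smoking_probability(x, beta0, beta, b_residence):
    """Inverse logit of a linear predictor with a residence random intercept."""
    eta = beta0 + sum(bk * xk for bk, xk in zip(beta, x)) + b_residence
    return 1.0 / (1.0 + math.exp(-eta))


# Same respondent profile, two residences with different random intercepts
# (all values hypothetical).
p_base = smoking_probability([1.0, 0.0], beta0=-0.5, beta=[0.8, 0.3], b_residence=0.0)
p_shift = smoking_probability([1.0, 0.0], beta0=-0.5, beta=[0.8, 0.3], b_residence=0.4)
print(p_base, p_shift)
```

Two respondents with identical predictors therefore receive different fitted probabilities when their residences have different random intercepts.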
In this study, the Lasso penalty function, implemented through the glmmLasso algorithm and the glmmLasso function [13], is used to select the predictor variables, estimated as parameters of the generalized linear mixed model, that affect smoking. The randomly chosen residence is used as a random intercept effect, and the Cross-Validation (CV) criterion determines the optimal lambda value. The variables are then selected using the EM and REML algorithms. To assess goodness of fit, we use the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) of the two algorithms. The selection results are then compared with those of the Boosting function, based on the bGLMM algorithm and the GMMBoost function [14], where variable selection is also performed using the EM and REML algorithms.
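Both criteria penalize the maximized log-likelihood by the number of estimated parameters. A minimal sketch of the standard definitions follows; how the parameters of a penalized mixed model are counted differs between implementations, so `n_params` is left to the caller, and the log-likelihood value in the example is hypothetical.

```python
import math


def aic(loglik, n_params):
    """Akaike Information Criterion: -2 log-likelihood + 2 * parameters."""
    return -2.0 * loglik + 2.0 * n_params


def bic(loglik, n_params, n_obs):
    """Bayesian Information Criterion: -2 log-likelihood + parameters * log(n)."""
    return -2.0 * loglik + n_params * math.log(n_obs)


# Hypothetical log-likelihood and parameter count; n_obs matches the 160 respondents.
example_aic = aic(loglik=-100.0, n_params=4)
example_bic = bic(loglik=-100.0, n_params=4, n_obs=160)
print(example_aic, example_bic)
```

For both criteria a smaller value indicates a better trade-off between fit and complexity, and with 160 observations BIC penalizes each parameter more heavily than AIC.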

Sample description
The 2017 Indonesian Demographic and Health Survey (IDHS) data for male respondents comprised 16,363 people. Data were collected from individuals within each household. The sample of individuals was then selected from Jambi Province, totaling 160 people. Among the smokers in this sample, 36.36% are early adult males aged 26-35 years, 72.27% are heads of households, 52.73% have secondary education, 46.36% are farmers working their own land, 95.45% earn income in cash, and 29.09% are in the low-income family category. This condition is reinforced by the fact that 58.18% of smokers do not have health insurance, 69.09% never read newspapers or magazines, 74.55% never listen to the radio, 95.45% watch TV at least once a week, and 68.18% have never used the internet, even though 87.27% own cellphones.

Variable selection with the Lasso penalty function
The sample used in this selection was 160 respondents from Jambi Province. The selection process uses two algorithms, EM and REML. Before selecting the variables, the tuning parameter (lambda) for the Lasso penalty is determined using the Cross-Validation (CV) criterion with 10 folds. The lambda selection plots for the EM and REML algorithms are shown in Figure 1. Figure 1 shows that from λ = 1 to λ = 13 the IC value still fluctuates, whereas from λ = 14 onward it rises monotonically. Accordingly, the optimum lambda value for both algorithms is taken to be λ = 13. With the EM algorithm at λ = 13, three variables are selected, namely X5 (occupation), X7 (welfare level), and X9 (family members who smoke), with a random-effect coefficient for X3 of 0.1151 and an intercept of 2.1045. Figure 2 shows that with the REML algorithm at λ = 13, the same three variables are selected, namely X5 (occupation), X7 (welfare level), and X9 (family members who smoke), with a random-effect coefficient for X3 of 0.1320 and an intercept of 2.1067. Thus the variables selected with the Lasso penalty are the same for the two algorithms.
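The cross-validation step can be sketched as follows: the 160 respondents are split into 10 folds of 16, each candidate lambda is scored on the held-out folds, and the lambda with the smallest error is retained. The helper names and the toy error curve are assumptions for illustration; the curve's minimum is placed at 13 only to echo the text, and in practice the error would come from refitting the penalized model on each training split.

```python
import random


def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def choose_lambda(lambdas, cv_error):
    """Return the candidate lambda with the smallest cross-validated error."""
    return min(lambdas, key=cv_error)


# 160 respondents, 10 folds, as described in the text.
folds = kfold_indices(n=160, k=10)


def toy_cv_error(lam):
    """Hypothetical error curve, NOT the study's values."""
    return (lam - 13) ** 2


best = choose_lambda(range(1, 21), toy_cv_error)
print([len(f) for f in folds], best)
```

Each respondent appears in exactly one fold, so every observation is used once for validation and nine times for fitting.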

Variable selection with the boosting function
Variable selection with the Boosting function uses two algorithms, EM and REML, and the GMMBoost function. The EM algorithm produces three selected variables, namely X5 (occupation), X7 (welfare level), and X9 (family members who smoke), with a random-effect coefficient for X3 of 0.2259 and an intercept of 2.3242. The REML algorithm also produces three selected variables, namely X5 (occupation), X7 (welfare level), and X9 (family members who smoke), with a random-effect coefficient for X3 of 10^-12 and an intercept of 2.2957. Thus the variables selected with the Boosting function are the same for the two algorithms.

Selection of the best model
Model adequacy can be assessed with two values, namely AIC and BIC. However, the Boosting function reports only AIC values. Based on the AIC value, the model produced by the Boosting function is better than the model produced by the Lasso penalty: the two AIC values for the Lasso penalty models are almost twice those of the Boosting function models. In addition, the AIC value for the REML algorithm is smaller than that for the EM algorithm. These results are in line with Bühlmann [15], who proposed that an AIC-based method for tuning, namely for choosing the number of boosting iterations, is more attractive because it does not require running the algorithm multiple times for cross-validation, as is commonly done for the Lasso. Table 1 shows that the best model is produced by variable selection with Boosting using the REML algorithm. These results are consistent with Andreas Mayr et al. [16], who noted that Boosting algorithms have triggered a lot of research during the last decade because they offer various practical advantages, such as automated variable selection and implicit regularization of effect estimates.

Conclusions
The research showed that variable selection with the Boosting function is better than with the Lasso penalty for both the EM and REML algorithms, and that the AIC value produced by the REML algorithm is smaller than that produced by the EM algorithm. Although they follow different strategies for optimization and regularization, both methods impose similar constraints on the estimation problem, leading to comparable performance in prediction accuracy and variable selection in practice [17].