Online Statistical Modeling (Regression Analysis) for Independent Responses

Regression analysis (statistical modelling) is among the statistical methods most frequently needed in analyzing quantitative data, especially to model the relationship between response and explanatory variables. Statistical models have been developed in various directions to handle diverse types of data and complex relationships. A rich variety of advanced and recent statistical models is available in open source software, one example being R. However, these advanced models are not very friendly to novice R users, since they rely on programming scripts or a command line interface. Our research aims to develop a web interface (based on R and shiny) so that the most recent and advanced statistical models are readily available, accessible and applicable on the web. We have previously built interfaces in the form of e-tutorials for several modern and advanced statistical models in R, especially for independent responses (including linear models/LM, generalized linear models/GLM, generalized additive models/GAM and generalized additive models for location, scale and shape/GAMLSS). In this research we unify them in the form of a data analysis tool, including models that use Computer Intensive Statistics (Bootstrap and Markov chain Monte Carlo/MCMC). All are readily accessible in our online Virtual Statistics Laboratory. The web interface makes statistical modelling easier to apply and makes it easier to compare models in order to find the most appropriate one for the data.


Introduction
Regression analysis (statistical modelling) is among the statistical methods most frequently employed in analyzing quantitative data, especially to model the dependence between a response and several explanatory variables. Statistical models have been developed in various directions to handle diverse types of data and complex relationships. A rich variety of advanced and recent statistical models is available in open source software, one example being R. However, these advanced models mostly rely on programming scripts or a command line interface, which means that they are not easily accessed by applied researchers and practitioners. The gap between developed and accessible statistical methods worries statisticians [1], who note that "practitioners continue to use inappropriate or suboptimal methods due to their being restricted to what is made available via GUIs".
It is therefore essential to build interfaces that make advanced and recent statistical methods, especially statistical models in R, more user friendly and easier to access and use. Several GUIs have been developed for various purposes. Explicet is a GUI designed for the management, analysis and visualization of microbiome data [2], and it is claimed to have made the analysis of complex microbiome datasets "much more accessible to the growing number of investigators". Microarray Я US has been developed on top of Bioconductor R packages, mainly so that researchers with little or no knowledge of R can carry out more reliable and accurate microarray data analysis [3]. Interactive websites for learning statistics have also been developed. RwikiStat combines MediaWiki and Rweb [4] to link theory with laboratory practice; however, users still need R scripting skills. Other statistics tutorials that combine theory and data analysis have been developed using R with the shiny package for specific topics [5], [6]. In this type of tutorial, the data analysis is accompanied by a summary of the theory and by step-by-step choices with example interpretations, so that users analyze data with understanding but do not need to master R scripting.
Statistical models with the general form $y_i = \mu_i + \varepsilon_i$, for $i = 1, 2, 3, \ldots, n$, have been extended in various directions. For models with independent errors, the development starts from (i) simple linear models (LM) with independent Gaussian errors, i.e., $\varepsilon_i \sim N(0, \sigma^2)$ and $\mu_i = \sum_{j} x_{ij}\beta_j$, for $j = 0, 1, 2, \ldots, p$ (with $p$ the number of regressors/predictors) [7]. (ii) When outliers exist, several methods based on robust linear models (RLM) are available [8][9]. (iii) Generalized linear models (GLM) extend LM to accommodate independent errors from a wider class of distributions known as the exponential family (covering continuous, count, or binary responses) and a possibly nonlinear relationship between the response mean and the linear predictor, through a continuous and differentiable link function $g$ (such as log, logit, or inverse/reciprocal) such that $g(\mu_i) = \sum_{j} x_{ij}\beta_j$ [10][11]. (iv) Statistical models were later generalized to accommodate additive predictors (GAM) such that $g(\mu_i) = \sum_{j} f_j(x_{ij})$ for smooth functions $f_j$ (parametric or nonparametric). Among the most frequently applied nonparametric smooth functions is the family of spline smoothers [12][13][14]. (v) Perhaps the most recent development for models with independent errors is the extension of GAM to GAMLSS [15][16]. GAMLSS accommodates a wider range of distributions (with one to four parameters, related to the mean, variance, skewness and kurtosis). In addition to modelling the mean, GAMLSS can model all other distribution parameters, each of which may have its own link function. Recently GAMLSS has been extended with variable selection capabilities [17]. In addition to these main statistical models, for small samples the models have also been extended to employ Computer Intensive Statistics (CIS) techniques, such as Bootstrap regression [7] and Markov chain Monte Carlo (MCMC) regression [18].
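The robust (RLM) and bootstrap variants described above can be sketched in R as follows. This is a minimal sketch under stated assumptions: it uses `rlm()` from the MASS package and `boot()` from the boot package, which are one common way to fit such models and are not necessarily the packages behind the interface. The model formula, the helper `coef_fun`, and the number of replicates are illustrative choices.

```r
library(MASS)  # rlm(): robust linear model (assumed package choice)
library(boot)  # boot(), boot.ci(): bootstrap resampling (assumed package choice)

# Illustrative model on the built-in iris data
form <- Sepal.Length ~ Sepal.Width + Species

fit_lm  <- lm(form, data = iris)   # ordinary least squares
fit_rlm <- rlm(form, data = iris)  # M-estimation, less sensitive to outliers

# Case-resampling bootstrap: refit the LM on resampled rows
# and collect the coefficient vector of each refit.
coef_fun <- function(data, idx) coef(lm(form, data = data[idx, ]))

set.seed(1)
boot_fit <- boot(iris, statistic = coef_fun, R = 500)

# Percentile interval for the Sepal.Width coefficient (second coefficient)
boot_ci <- boot.ci(boot_fit, type = "perc", index = 2)

# Compare least-squares and robust estimates side by side
print(cbind(LM = coef(fit_lm), RLM = coef(fit_rlm)))
print(boot_ci)
```

When the robust and least-squares estimates are close, outliers are unlikely to be driving the fit; the bootstrap interval gives an uncertainty statement that does not rely on the Gaussian error assumption.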
All the statistical models mentioned above are already implemented in various R packages. However, they are not easy for novice R users to apply, since they are all based on a command line interface (scripts). Moreover, in addition to the modelling packages, users may need to load and call functions from other R packages to draw graphs or calculate goodness of fit. In this paper we report the development of a web-based GUI that unifies most statistical models for independent responses using R, enriched with various options for data exploration, graphical visualisation and goodness-of-fit measures drawn from several selected R packages.

Methods
We develop an interface for unified online (web-based) statistical models for independent responses, which covers LM, RLM, GLM, GAM, GAMLSS, Bootstrap and MCMC regression, based on the R packages mentioned previously. We mainly use the shiny toolkit [18] to build the interface. Building the interface involves several main steps: (i) selecting the main and related packages, including the primary functions for the models and secondary functions for graphical visualization, such as scatter plot and correlation plot matrices [20,21], scatter plots with various smoothers [22], and other regression visualizations [23]; (ii) identifying the input parameters of the functions; (iii) defining the input functions and their options in the ui.r file and the output functions in the server.r file; (iv) checking the compatibility of the loaded packages and related functions; (v) uploading the files to the web (shiny server) so that they are readily accessible to users.
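Steps (ii) and (iii) can be illustrated with a minimal shiny sketch. It is an illustration only, not the code of the actual interface: the input IDs, the helper `fit_model()`, and the two distribution choices are hypothetical, and the UI and server parts are kept in one script here rather than in separate ui.r and server.r files.

```r
library(shiny)

# UI part: input widgets expose the parameters of the fitting function
ui <- fluidPage(
  titlePanel("Model fitting sketch"),
  sidebarLayout(
    sidebarPanel(
      selectInput("response", "Response", choices = names(iris)[1:4]),
      checkboxGroupInput("predictors", "Explanatory variables",
                         choices = names(iris), selected = "Sepal.Width"),
      selectInput("family", "Distribution and link",
                  choices = c("Gaussian (identity)" = "gaussian",
                              "Gamma (log)" = "gamma"))
    ),
    mainPanel(verbatimTextOutput("summary"))
  )
)

# Hypothetical helper: turn the widget values into a fitted model
fit_model <- function(response, predictors, family_name, data = iris) {
  predictors <- setdiff(predictors, response)  # response cannot predict itself
  rhs <- if (length(predictors) > 0) predictors else "1"
  form <- reformulate(rhs, response = response)
  fam <- switch(family_name,
                gaussian = gaussian(),
                gamma = Gamma(link = "log"))
  glm(form, family = fam, data = data)
}

# Server part: outputs are recomputed whenever an input changes
server <- function(input, output, session) {
  output$summary <- renderPrint({
    summary(fit_model(input$response, input$predictors, input$family))
  })
}

app <- shinyApp(ui, server)
# shiny::runApp(app)  # uncomment to launch the app locally
```

Keeping the model fitting in a plain function such as `fit_model()` makes it possible to check the function outside the web application before deploying it to a shiny server.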

General Features of the Web GUI interface
At this stage, we have developed online statistical model fitting for independent responses, covering the models described previously, with the following general features (see Figure 1). 1) Data input: an internal database (for practice), or import of users' own data in csv or text format (for real data analysis). Users can load all of the chosen data, or only load small number of the   4).
The website can be accessed at http://statslab-rshiny.fmipa.unej.ac.id/RProg/MSI/. The features for each model fitting are summarized in Table 1, and the appearance of the website is shown in Figure 1.

Numerical Illustrations
The following numerical illustrations use the iris data available in the datasets package. The data were first published in 1935 [24]. The main purpose of the illustrations is not to show the accuracy of the computation, since the results are the same as those obtained via scripts, but to show that the analysis (estimation and comparison among the available models) can be done completely and more easily using "point and click" on the web (see Figure 1). The fitting proceeds from data exploration to testing hypotheses about the parameters of various alternative models.

Data Exploration
The summary of the data shows that they consist of one factor and four numeric variables, which means that Species, as a factor, may be worth considering in the model. Graphical exploration can be done by creating scatter plots with various types of smoother (available in the menu). The first graphical exploration uses a scatter plot matrix of the variables to check whether the factor (i.e., Species) is worth including in the model. For the sake of clarity we focus only on Sepal.Length and Sepal.Width (Figure 3). The plots show that including Species in the model substantially changes the direction of the regression lines (the regression coefficients, or slopes), from negative and possibly near zero (Figure 3a) to positive for each Species (Figure 3b). The next exploration, using various smoothers (with the ggplot2 package), suggests that the data may be better fitted by more advanced regression models (such as GLM or GAM). Figure 4 shows that applying other continuous distributions (the Gamma family) with a non-identity link seems to improve the fit of the model. These graphics suggest that Species should be included in the model and that more advanced models (such as GLM, GAM, or GAMLSS) should be considered.
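The change in slope direction described above can also be checked numerically with a short R sketch. It assumes Sepal.Length as the response and Sepal.Width as the single predictor, and it uses only base R; the object names are illustrative.

```r
# Pooled regression, ignoring Species
pooled <- lm(Sepal.Length ~ Sepal.Width, data = iris)
pooled_slope <- coef(pooled)[["Sepal.Width"]]

# Separate regression within each Species
species_slopes <- sapply(split(iris, iris$Species), function(d) {
  coef(lm(Sepal.Length ~ Sepal.Width, data = d))[["Sepal.Width"]]
})

print(pooled_slope)    # negative: pooled line slopes downward
print(species_slopes)  # positive within every species
```

The sign reversal between the pooled and the within-species slopes is the numerical counterpart of the two panels of the scatter plot, and it is the reason the factor should enter the model.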

Alternatives of model fittings
Using our online data analysis, users can easily employ various types of models and various combinations of model parameters. For illustration, we set Sepal.Length as the response and the other variables or the factor as explanatory variables. We fit several models: (i) Gaussian distribution (LM) with and without the factor, (ii) Gamma with log link (GLM), (iii) GAM (by applying a smoother to some variables), (iv) GAMLSS (by modelling the scale parameter), and (v) GLM with natural or B-splines. All the models are easily set up in our interface, and the goodness of fit (GOF) is reported for each model. We describe some of the fittings and summarise the results of all of them.
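For readers who want to reproduce a comparable set of fits by script, the following R sketch fits several of the alternatives listed above and collects their AIC values. It is a sketch under assumptions: the choice of predictors, the smoothed variable (Petal.Length), the spline degrees of freedom, and the use of `gam()` from mgcv and `ns()` from splines are illustrative and may differ from the settings chosen in the interface.

```r
library(mgcv)     # gam() with s() smoothers
library(splines)  # ns(): natural cubic spline basis

fits <- list(
  # (i) Gaussian LM without and with the factor
  lm_no_factor = lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                    data = iris),
  lm_factor    = lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width +
                      Species, data = iris),
  # (ii) Gamma GLM with log link
  glm_gamma    = glm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width +
                       Species, family = Gamma(link = "log"), data = iris),
  # (iii) GAM: smoother on one variable (illustrative choice)
  gam_gamma    = gam(Sepal.Length ~ Sepal.Width + s(Petal.Length) +
                       Petal.Width + Species,
                     family = Gamma(link = "log"), data = iris),
  # (v) GLM with a natural spline term
  glm_ns       = glm(Sepal.Length ~ Sepal.Width + ns(Petal.Length, df = 3) +
                       Petal.Width + Species,
                     family = Gamma(link = "log"), data = iris)
)

# One goodness-of-fit table for all candidates, best (lowest AIC) first
aic_table <- data.frame(model = names(fits),
                        AIC = sapply(fits, AIC),
                        row.names = NULL)
print(aic_table[order(aic_table$AIC), ])
```

AIC is used here only as one convenient GOF measure that is defined for all of these model classes; other criteria can be substituted in the same table.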

(v) Fitting GAMLSS with Two-Parameter Gamma and log link
GAMLSS offers a variety of choices (such as the type of distribution; the formulas for the mean, sigma, nu and tau; and the type of smoothers). We choose the Gamma distribution with two parameters (mu and sigma), so we can only model mu (µ) and sigma (σ), and neither nu nor tau. For the few rough specifications we tried, sigma does not depend significantly on any predictor. Therefore we only report the model with constant sigma, which means that, in terms of model parameters, our GAMLSS does not differ significantly from the GAM.
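The check of whether sigma depends on a predictor can be sketched by script as follows. This is a minimal sketch, assuming the gamlss package, whose `GA` family is a two-parameter Gamma with log links for mu and sigma by default. The mean formula and the choice of Species as the candidate predictor for sigma are illustrative, not the specifications used for the reported results.

```r
library(gamlss)  # gamlss(), GAIC(), LR.test(); GA comes from gamlss.dist

quiet <- gamlss.control(trace = FALSE)  # suppress iteration output

# Constant sigma: only the mean (mu) depends on predictors
fit_const <- gamlss(Sepal.Length ~ Sepal.Width + Petal.Length + Species,
                    sigma.formula = ~ 1,
                    family = GA, data = iris, control = quiet)

# Sigma allowed to differ between species (illustrative alternative)
fit_sigma <- gamlss(Sepal.Length ~ Sepal.Width + Petal.Length + Species,
                    sigma.formula = ~ Species,
                    family = GA, data = iris, control = quiet)

# Compare by generalized AIC and by a likelihood ratio test
print(GAIC(fit_const, fit_sigma))
LR.test(fit_const, fit_sigma)
```

If neither the GAIC nor the likelihood ratio test favours the richer sigma formula, the constant-sigma model is retained, which is the situation described above.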

Comparing the models
The estimates of each model and its GOF are summarized and compared. Several remarks can be drawn from the results. (i) It is worth considering the inclusion of factors (groups) in the model. When the data have no observed groups, users can perform a cluster analysis and take the clusters as groups (building clusters using k-means is also available in our online analysis). (ii) The significance of an individual parameter depends on the combination of other parameters in the model. The parameters of some variables may not be significant, yet removing them from the model can worsen it (increase the AIC). Therefore, such parameters or variables may be retained in the model.
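Remark (i) can be illustrated with a short R sketch that builds a grouping factor by k-means and then uses it in a model, pretending that Species had not been observed. The sketch uses only base R; the number of clusters, the variables used for clustering, and the object names are illustrative assumptions.

```r
set.seed(123)  # k-means starts from random centres

# Cluster on the standardized explanatory variables only
X <- scale(iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")])
km <- kmeans(X, centers = 3, nstart = 25)

# Attach the cluster labels as a factor to be used as a group
dat <- transform(iris, Cluster = factor(km$cluster))

fit_no_group <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                   data = dat)
fit_cluster  <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width +
                     Cluster, data = dat)

# Compare the two models; a lower AIC indicates the better trade-off
print(AIC(fit_no_group, fit_cluster))

# Cross-tabulate clusters against the (here known) species for reference
print(table(dat$Cluster, dat$Species))
```

Whether the cluster factor improves the model depends on the data and on the number of clusters, so the AIC comparison should be inspected rather than assumed.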

Advantages and disadvantages using online data analysis
The methods are hosted on the web as part of the Virtual Statistics Laboratory and can be accessed at http://statslab-rshiny.fmipa.unej.ac.id/RProg/MSI/. Online model fitting offers users several advantages: (i) there is no need to install R; (ii) there is no need to master R scripting; (iii) users can move easily from one model to another, checking the graphical appearance and the GOF of each model; (iv) users can do data analysis on various types of devices (mobile phones, tablets, notebooks, etc.) and carry out simple to advanced statistical modelling with R. The main drawback of online data analysis is its dependence on the speed of the available internet connection and on the number of users accessing the website at the same time. At this stage, web performance (the speed on various devices and web browsers) has not been critically examined. However, in local lectures and laboratory practice, students have experienced no noticeable disruptions.

Future developments
Some features have not yet been implemented, namely (i) the loess smoother for GAM (since it conflicts with mgcv), (ii) nonlinear and multiple-predictor models for the scale, shape and tau parameters in GAMLSS, and (iii) testing for multicollinearity among the predictors and offering alternative models when it occurs. These features will be gradually included and tested in the near future.

Conclusion
Our online statistical modelling for independent responses, covering LM, RLM, GLM, GAM, GAMLSS and CIS, includes all the main features (options) usually accessed through the CLI (script programming), although the CIS types are not illustrated here. It makes it easier for users to fit and compare various types of statistical models and to choose the most appropriate one. In addition, users can carry out various data explorations (scatter plot matrix, correlation plot matrix, and other graphical visualizations). For GAM and GAMLSS, more features are still to be added, and the models may be extended to handle multicollinearity.