VOStat: A Statistical Web Service for Astronomers

VOStat is a Web service providing interactive statistical analysis of astronomical tabular datasets. It is integrated into the suite of analysis and visualization tools associated with the international Virtual Observatory (VO) through the SAMP communication system. A user supplies VOStat with a dataset extracted from the VO, or otherwise acquired, and chooses among $\sim 60$ statistical functions. These include data transformations, plots and summaries, density estimation, one- and two-sample hypothesis tests, global and local regressions, multivariate analysis and clustering, spatial analysis, directional statistics, survival analysis (for censored data like upper limits), and time series analysis. The statistical operations are performed using the public domain {\bf R} statistical software environment, including a small fraction of its $>4000$ {\bf CRAN} add-on packages. The purpose of VOStat is to facilitate a wider range of statistical analyses than are commonly used in astronomy, and to promote use of more advanced methodology in {\bf R} and {\bf CRAN}.


Introduction
Astronomy has always accounted for the production of a significant amount of scientific data. With recent developments in instrumentation, this amount surpassed most fields of science. Most of the astronomers, however, are still using traditional techniques to analyze the available data. While this approach has definitely produced many good results, it is expected to be even more productive if it is coupled with statistical techniques: the time-tested science of data analysis.
Both astronomy and statistics are well developed branches of science, each with its own set of journals, conferences, degrees and jargons that require time and patience to master. So desirable though it is, simultaneous proficiency in both the fields not commonly achieved.
VOStat 1 is a web service that is designed to help in this endeavor. Oriented towards tabular data, rather than images or spectra, it is an effort to bring statistics to the astronomer's desktop. There seems to be two ways to introduce statistics to an astronomer. The first is to ask her to read and digest a selection of standard books on statistics (most of which are written with social scientists in mind), and then expect her to apply the ideas to problems in her own research. Statistics texts designed for the physical scientist are now emerging to assist in this process (James 2006;Wall & Jenkins 2012;Feigelson & Babu 2012). The second approach is to offer her a convenient Web service with a menu of statistical methods to choose from. Data can be uploaded from her local computer or acquired from the Virtual Observatory, a worldwide federation of public astronomical databases and associated analysis services. She can pick methods and immediately apply them to her data without having to download software or read more than a short non-technical description. If she likes the result, she may then delve deeper into the methodology and acquire the underlying software to investigate the initial findings more carefully on her machine.
VOStat takes the latter approach. It uses R, the largest free, public domain statistical software to perform its computations. The output of VOStat is not merely the output of R, but also the R scripts (generated automatically by VOStat based on the user's specification). Indeed, the main goal is not to encourage astronomers to outsource their statistical requirements to VOStat, but to teach them how to be able to carry out independent statistical analysis of their data themselves. VOStat is available to anyone on the World Wide Web, but is particularly oriented towards astronomers drawing upon Virtual Observatory datasets and tools.

The R Statistical Software Environment
For most of the computer age, the primary software implementing statistical analyses were commercial packages. The Statistical Analysis System 2 , providing a vast array of built-in procedures in a Fortran-like programming language, is the most comprehensive, serving industries and governments. During the 1980s, John Cambers and colleagues at AT&T developed S (Chambers 1998), a statistical programming language written in C, that evolved into the commercial package S-Plus 3 . In the 1990s, New Zealand statisticians Ross Ihaka and Robert Gentleman rewrote S in the public domain, calling it R, and released it under GNU General Public License software product under the auspices of a non-profit R Foundation for Statistical Computing led by a growing R Development Core Team (Ihaka & Gentleman 1996).
In the spirit of open source software, the R team set up the Comprehensive R Archive Network (CRAN ), a structure for external experts in statistical computing to contribute specialized packages. About 20 of the early packages were incorporated into the base R package, which is now a reasonably stable product. But CRAN contributions continued to enter the system, growing exponentially since 2001. Today there are > 4000 CRAN packages with a new package entering every day. They cover every field of applied statistics with particularly strong contributions from the biology (particularly genomics) and econometrics communities. The R/ CRAN system now has > 60, 000 statistical functionalities, some simple and others very complex. CRAN packages are easily retrieved on-the-fly during an R session.
As there is no global index to CRAN , it can be tricky to find the appropriate functionality for the task at hand. The CRAN Task Views 4 provide brief but useful summaries of packages in ∼ 30 broad topics. These include Bayesian inference, chemometrics and computational physics, cluster analysis and finite mixture models, graphic displays and visualization, high performance computing, machine learning, medical imaging, multivariate analysis, optimization and mathematical programming, robust statistical methods, spatial data, survival analysis (treating upper limits), and time series analysis. Publications using R/ CRAN should cite R Development Core Team (2012) and the reference given by the function citation{package}.
The capabilities of VOStat are similar in scope to those provided by other R graphical user interfaces (GUIs) like Deducer and is less extensive than some like R Commander 5 . The purpose for creating a new R GUI for astronomy is interoperability with other Virtual Observatory capabilities through its Simple Application Messinging Protocol (SAMP), so the astronomer experiences a flexible and capable environment for interactive scientific analysis of tabular data 6 .

Overview of VOStat
The international Virtual Observatory (VO) system federates the data resources of various astronomical facilities around the world. It provldes tools for the extraction, visualization, and comparison of data extracted from VO databases, as well as some specialized astronomical analyses such as construction and fitting spectral energy distributions and lightcurves for variables objects. This system provides easy web-access to a vast amount of astronomical data, and allows astronomers to communicate findings easily for multi-observatory studies. Its impressive scope notwithstanding, the VO system lacks comprehensive statistical tools suitable for analyzing the data extracted by a user.
The VOStat project addresses this deficiency with two goals: it provides a statistical toolbox well integrated into the VO system to enable extracted astronomical data to be analyzed efficiently and effectively; and it spreads awareness among practicing astronomers about the need of complementing domain knowledge with statistically valid data analysis techniques. VOStat seeks to achieve these aims with a web-based astrostatistical computing portal that links VO-extracted datasets to the statistical analysis capabilities of R, the largest free, open-source statistical computing environment. Automated connectivity between these software environments is provided by the VO's Simple Application Management Proocol (SAMP), a VO astronomer can (for example) interactively acquire tabular datasets from the VO Data Discovery Tool, view and manipulate the tables using TOPCAT or VOTable, and perform statistical analyses with graphics using VOStat. While the analyses provided by VOStat are not highly complex due to the limitations of the Web services interface, R code is provided giving a springboard for the astronomer to develop and implement more advanced statistical methodology geared towards the particular astronomical research problem at hand.
VOStat is accessed by the public at www.vostat.org. Astronomers can upload their data into the VOStat server via a simple Graphical User Interface that operates the chosen R functions on the user-provided dataset. The data files can be on the user's home computer, on a Web site, or obtained from VO-compliant tools via the VO's SAMP. VOStat thus can interact with human users directly (e.g., accepting data, displaying plots on screen), with VO-compliant tools via a SAMP hub, a specialized server running in the local machine of the user that uses SAMP as its communication protocol.
Once a dataset is extracted from the VO or otherwise provided, the user chooses from a list of several dozen statistical techniques, the selected techniques are applied to the data, and the results are returned to the user. Apart from traditional output like on-screen plots and tables, VOStat also provides the following to understand and extend the statistical analysis of their dataset on their home computer: 1. annotated R code for each step of the statistical analysis; 2. workspace file giving internal R files of statistical analysis results; 3. help files from R documentation and from Wikipedia describing the statistical functions; 4. guidance for the statistically uninitiated astronomer to select the right analysis.
VOStat is oriented towards analysis of tabular data. Commonly, the rows of the table represent a collection of astronomical objects and the columns represent measured or inferred properties of the objects. But astronomical lightcurves or spectra can be analyzed, where the first column gives observed times or wavelengths and the other columns give observed values. R has extensive capabilities for analysis of images (see the CRAN Task View on medical image analysis, http://cran.r-project.org/web/views/MedicalImaging.html), but these are not incorporated into VOStat. Therefore, we do not encourage analysis of astronomical images using VOStat.
A common concern among astronomers while choosing a software is the upper bound of data size that can be handled. VOStat imposes no further restriction on this beyond the limitation of R itself. Thanks to the way R is built from diverse packages created by unrelated groups of developers, it is difficult to quantify the data size limitations of R in general. Certain methods allow quite large data sets, while others do not. The primary design goal behind VOStat is to motivate astronomers to use R. So users are encouraged to play with the available methods using moderately sized subsets of their data, and then download and tweak the generated script to work with the entire data sets. VOStat has facilities built into it to choose a subset of data.
Two earlier versions of VOStat were developed, the first produced by a U.S. group involving Penn State and Caltech, and another undertaken by VO-India and Penn State with contributions from Caltech and Calcutta University Statistics Department. The latter is provided for use by VO-India 7 . These versions of VOStat did not have SAMP communication with other VO tools and provided a narrower suite of statistical methods. Other tools for interfacing R to the Web for general statistical analyses without association with the Virtual Observatory include FastRWeb, R2HTML, Rcgi, Rweb and Rwui 8 .

Internal structure of VOStat
VOStat uses Java as the interface between R and the user. The Apache Tomcat web server deploys the Java servlets that comprise the web interface. An R session is then spawned on the server computer to implement the chosen statistical procedures. Figure 1 shows the workings of VOStat, and the three principal modules are described here.

Data loader module
VOStat allows data to be fed into it from different sources: 1. Uploading a file from the user's local machine.
2. Uploading a file from a URL on the Internet.

Getting data from VO-compatible tools via a SAMP hub. These tools include Aladin Sky
Atlas, Datascope, Octet, Open SkyQuery, Topcat, VIM, and VisIVO.
We rely on Java applet technology for the SAMP communication, and on multipart file uploading for accessing a file from the user's local disk. The data file may be in different formats commonly in use among astronomers: VOTable, FITS, or ASCII tab/space/comma separated values. VOStat can guess the file format from the file name extension, or the user may specify the file format from a drop-down menu. The file extensions recognized by VOStat are .txt, .dat, .csv, .fits, .vot and .xml. These formats are read using R utilities.
For datasets available on the Web, the first page of VOStat has a text field where the user can specify the URL of the file to be uploaded. A similar interface allows files to be uploaded from the user's hard disk. Obtaining data via SAMP from VO-compliant tools is slightly more involved, because the SAMP hub must run in the user's local machine while VOStat resides in a remote server. For this we embed a trusted Java applet in the webpage served by the VOStat server. This applet contains two clients in it, a SAMP client and an HTTP client. The SAMP client looks for and connects to any SAMP hub found running in the local machine. Once a file broadcast message (MType: table.load.votable) is received by the SAMP client, the HTTP client establishes connection with the remote VOStat server. The file is read by the applet and sent over the internet to a servlet running in the VOStat server. This indirect process is required because, during 2011-12 when VOStat was developed, a SAMP hub can run in the local machine while VOStat tables are broadcast via local URLs only.

Analysis Module
This is written entirely in R. For the more commonly used statistical functions, the code uses functions in base R. For more specialized functions, relevant CRAN packages are first installed on the server machine and the CRAN functions are run on the dataset. In a few cases, we implement our own analysis techniques designed for astronomers. The VOStat architecture is created with the design goal that new statistical techniques may be added easily without disrupting the existing workflow. The analysis tools currently in VOStat are briefly described in §5.

Output Module
Three types of output from VOStat are provided. First, various plots, estimated parameter values, confidence intervals, correlation coefficients, diagnostic messages, and other ASCII results are for direct use by the user. Numerical output is usually short and appears on the screen. Plots can be downloaded in PDF or Encapsulated PostScript formats. Second, tables meant for consumption by other SAMP-aware VO tools are broadcasted through the local SAMP client.
Third, the R binary workspace (usually called .Rdata) is available for download so that an analysis can be continued seamlessly at a later session. This can assist users who want to continue more varied and advanced using R installed in their own machines.
To further the goal of educating users about statistical tools, VOStat also gives background information about the performed analysis. First, annotated R codes can be 'cut and paste' to reproduce the analysis on a home computer. The VOStat computation is thus not an opaque 'black box' but can be reproduced and extended by the scientist. Second, links are provided to the R documentation for each important R function used. These 'help files' have a standard form defining the function, presenting the inputs and outputs, algorithms, and generic examples of their use. Third, links are given to Wikipedia entries to give a more detailed and mathematical presentation of the statistical method. References to the methodological literature are given in both the R help files and Wikipedia entries. Nearly all of the statistical functions in VOStat are also described, with applications to astronomical datasets, in the textbook Modern Statistical Methods in Astronomy with R Applications (Feigelson & Babu 2012). This text provides 19 astronomical datasets that can be used as exercises for VOStat application and more elaborate R code illustrating statistical analyses of these datasets 9 .
For analyses that produce new data worth feeding into other VO-tools (for example, for advanced visualization) can be routed via a SAMP hub running in the user's machine. Like other web-based VO-tools, VOStat uses trusted applets to connect to a SAMP hub running in the local machine. However, unlike some other tools like Aladin Sky Atlas, VOStat does not start a local hub itself. It only connects to a hub that is already running. This is to avoid disrupting coordination of the local VO-tools in case of a browser failure.

VOStat statistical functions
VOStat (Rev. 2, December 2012) currently provides over 50 statistical functionalities organized into 11 topics. We review these briefly here. R and CRAN functions are given in brackets and italics, respectively. Modules that produce output that can be automatically transferred to another VO-compatible client via SAMP are labeled 'SAMP'. Figures of selected graphical outputs were generated in a separate session of R from the VOStat R codes in order to cosmetically improve them for publication (e.g., axis labels). Intermediate-level descriptions of these statistical functions with applications to astronomical data are given by Feigelson & Babu (2012).

Transforming and subsetting the data
Data summary This gives minimum, quartiles (25%, 50%, 75%), and maximum values for each variable. [summary] Select cases Rows can be selected using operations <, ≤, >, and ≥ applied to specified columns.

Plots and summaries
Boxplot Boxplots show the minimum, maximum, median and quartiles of a univariate distribution with tails and outliers (Tukey 1977).

[boxplot]
Dotplot Dotplots show univariate values of individual objects with labels. This is appropriate (in terms of visual clarity) only for small datasets. [dotchart] Histogram Histograms show the univariate distribution grouped into bins. The user can specify the number of bins or the bin boundaries. If none are given, then the number of bins uses the Freedman-Diaconis algorithm based on the inter-quartile range (Freedman & Diaconis 1981). [hist] Averaged shifted histogram Averaged shifted histograms are univariate density estimators (smoothers) that repeatedly bins the data with shifted origins and convolved with a quartic (biweight) kernel (Scott 1992). The user specifies the number of bins used. This is implemented in the CRAN ash package (Scott et al. 2012). An example is shown in Figure 2b. [ash1]

Scatterplot
This is the common plot of individual data points in two variables specified by the user. Numerous options in formatting this plot (and other VOStat graphical outputs) are available to the user who adjusts the plot parameters in R on their home computer. [plot]

Pairs plot
A pairs plot is a matrix of scatterplots for a multivariate dataset, sometimes called a SPLOM (for 'Scatter PLOt Matrix'). It is often valuable for examining patterns and locating problems in multivariate data. In practice, a limited number of variables (say < 10) can be shown in a single plot. An example is shown in Figure 2d.

[pairs]
3D scatterplot This is a plot of three variables specified by the user implemented in CRAN package scatterplot3d (Ligges & Mächler 2003).

[scatterplot3d]
Summary statistics This gives mean, median, variance, median absolute deviation for each variable specified by the user. It also shows a matrix of Kendall's τ nonparametric coefficient of bivariate correlation with associated probability values obtained with CRAN package SuppDists (Wheeler 2009

Empirical distribution function
This provides a sequence of plots of the normalized cumulative step-function of a single variable with 95% confidence bands based on the Kolmogorov-Smirnov statistic using the sfsmisc CRAN package (Mächler et al. 2011). [ecdf.ksCI] Kernel density estimation: 1-dimensional, fixed window width This function convolves the data with a Gaussian function. Here a constant bandwidth selected by the user is used; if no bandwidth is specified, a default optimized to a Gaussian distribution for the data, h = ( 4 3n ) 1/5 σ where n is the number of data points and σ is the sample standard deviation. A rug plot of the input data and 95% confidence bands around the smooth estimator are shown. This is produced with CRAN package sm (Bowman & Azzalini 1997, 2010. [sm.density, plot, lines, rug] SAMP Kernel density estimation: 1-dimensional, adaptive window width This is a sophisticated robust L-estimator that applies a Gaussian kernel with bandwidth adapted to the local concentration of data points implemented in the CRAN package quantreg (Portnoy & Koenker 1989;Koenker 2012). The user can adjust a parameter that determines the sensitivity of the local bandwidth to local variations. [akj, seq, plot, rug] SAMP Kernel density estimation: 2-dimensional, fixed window width This function convolves the data with a Gaussian function of fixed bandwidth and produces a colorful 3-dimensional surface plot that can be interactively rotated (Bowman & Azzalini 1997). Confidence bands are available but not plotted. [sm.density] SAMP

Fitting distributions to data
This performs a maximum likelihood fit of a statistical distribution to a univariate dataset. Fifteen distribution are provided: beta, cauchy, chi-squared, exponential, F, gamma, geometric, log-normal, logistic, negative binomial, normal, Pareto (power law), Poisson, t and Weibull. The best-fit parameters are obtained numerically except for the normal, log-normal, geometric, exponential and Poisson distributions where analytic solutions are used. Output includes best fit parameters, their estimated standard errors from the Fisher information matrix, and the log-likelihood. The R implementation is described by (Venables & Ripley 2002). The only exception is Pareto distribution, for which VOStat uses its own code. [fitdistr]

Comparing data with distributions
Normal plot This function provides a quantile-quantile (Q-Q) plot of a univariate dataset compared with the normal (Gaussian) distribution. If the distribution agrees well with a normal, One-sample KS test A selected univariate sample is compared to a uniform, exponential or normal distribution (with user-specified parameters). The two cumulative distribution functions are shown with the differences, and the Kolmogorov-Smirnov 1-sample test is applied. [ks.test, seq, punif, pexp, pnorm, ecdf] Anderson-Darling test As in the previous function, a univariate sample is tested against a uniform, exponential or normali distribution using the Anderson-Darling test. This is more sensitive than the Kolmogorov-Smirnov test for multiple small-scale deviations and differences at the edges of the distribution. The function, implemented in CRAN package adk, permits k-sample comparisons, but this is not implemented in VOStat (Scholz 2008

Some standard tests
One-sample Z test This is a hypothesis test that compares the mean of a chosen univariate dataset with a user-specified value. It applies when the dataset is approximately normal (Gaussian) distribution, the variance is known, and the sample is sufficiently large. [nrow, pnorm]

One-sample t test This is a similar hypothesis test for the mean but does not require prior specification of the variance. [t.test]
Paired t test This is a similar hypothesis that the means of two univariate samples have the same mean (or with a specified offset). Note the assumption of Gaussianity. [t.test]

Linear regression
Multivariate linear regression is performed using R's important lm (linear modeling) function. Its use is described in detail by Sheather (2009). The user specifies a single response variable (optionally weighted by measurement errors) and one or more independent variables (without measurement errors). The fitted regression coefficients and their confidence intervals are produced, along with diagnostic residual plots to assist in evaluating the quality of the fit. (lm, length, for, plot) SAMP Nadaraya-Watson local regression This is a nonparametric bivariate kernel regression technique to estimate conditional expectation of the random variable Y on X. It can find nonlinear relationships of unknown functional form. A constant bandwidth for a Gaussian kernel derived by Nadaraya and Watson in 1964 is used. The resulting plot shows the bivariate scatter plot, local regression curve, and 95% confidence intervals obtained from bootstrap resamples. An example is shown in Figure 2c. The algorithm is implemented in CRAN package np (Hayfield & Racine 2008). [npregbw, npplot, points] Major axis/orthogonal regression While ordinary linear regression requires specification of a 'response' or dependent variable, astronomers sometimes seek intrinsic relationships that treat random variables in a symmetrical fashion. VOStat implements three symmetrical bivariate leastsquares linear regression fits using the CRAN package lmodel2 (Legendra 2008). Astronomical applications of symmetric regression lines are described in Feigelson & Babu (1992).

LOESS regression
A popular method for local polynomial regression fitting based on weighted least squares fitting of linear functions locally around each data point (Cleveland 1994). A local parabola is used here, and the user specifies a smoothing bandwidth parameter. [loess, SAMP, plot, order, lines] SAMP

Linear quantile regression
Quantile regression is a sophisticated method that fits, for sufficiently large datasets, linear and nonlinear relationships to the median or other quantile of the response variable conditioned on the independent variable (Koenker 2005). This is valuable if the scatter about the relationship is asymmetrical or otherwise non-Gaussian. The calculation is made by CRAN package quantreg (Koenker 2012).
Robust regression using an M -estimator A method of iterative least squares multivariate regression where the influence of outliers are down-weighted using Huber's ψ function (Huber 1981). The VOStat implementation is restricted to linear fits, but a much wider range of functions is permitted in R. [rlm, plot] SAMP

Robust regression using trimmed least squares
Another linear robust regression method that removes, rather than downweights, outlying points (Rousseeuw & Leroy 2003 Canonical correlation analysis A multivariate procedure that finds two subsets of variables whose linear combinations have maximum correlation with each other. [cancor, no.omit] Model based clustering This is an important method to find structures in multivariate data assuming they follow multivariate normal (Gaussian) distributions. These can be shaped like spheres, ellipsoids, pancakes or cigars in p-dimensions. This is called a 'normal mixture model' (McLachlan & Peel 2000). Parameters are obtained by maximum likelihood estimation using the EM Algorithm with model selection (i.e., choice of number of clusters) based on the Bayesian Information Criterion. This well-known code is in CRAN package mclust (Fraley & Raftery 2006). The VOStat code produces a plot of BIC vs. number of clusters, a pairs plot with symbols reflecting cluster membership, and tabular output giving the multivariate means and variances of each cluster.
[mclutBIC, plot, summary] SAMP Hierarchical clustering Nonparametric hierarchical clustering is based on agglomerating proximate data points into increasingly larger groupings using a chosen metric and cluster criterion. VOStat assumes a Euclidean metric, so standardization of variables may be desirable. Clustering criteria include: Ward's, single linkage, average linkage, complete linkage, McQuitty's criterion, median and centroid. Note that 'single linkage' clustering is commonly called the 'friends-of-friends' algorithm in astronomy, but is not recommended due to its propensity to 'chain' together distinct structures (Everitt et al. 2011). [dist, hclust, plclust, cophenetic, cor, cutree] SAMP

Spatial methods
Variogram A variogram is a function measuring the spatial dependence of a random field that is sampled at data points in two dimensions. Three variables are involved: the variable whose spatial pattern is of interest, and two variables representing location. The plot shown here gives the variance of the difference between the intensity of the process at two locations separated by a distance in the spatial plane. The calculation assumes the point process is stationary (same behavior at all locations) and isotropic. The plots of temperature fluctuations as a function of angular scale for the cosmic microwave background, and of velocity dispersion as a function of physical scale in molecular clouds are astronomical examples of variograms. It is implemented here in the CRAN package gstat (Pebesma 2004). [variogram, plot]

k-nearest neighbors and Moran's I test
The k-nearest neighbors method useful for forming an idea about the topology of a point process. Each point is connected to its k nearest neighbors. Moran's test can check for spatial autocorrelation of a variable with respect to this topology. It gives the probability that the autocorrelation is consistent with complete spatial randomness. [spdep, RANN] Ripley's K and related measures Astronomers commonly examine the 2-point correlation function (known in statistics as the 'pair correlation function') for clustering of a stationary point process, showing the strength of spatial clustering as a function of distance. Statisticians commonly use the integral of the 2-point correlation function to avoid arbitrary binning; this is known as Ripley's K function. The plot provided by VOStat shows the K function with two corrections of edge effects, and a theoretical curve assuming complete spatial randomness. Also shown are Besag's L (the K function rescaled to be horizontal in the null case of random distributions), Baddeley's J (a different function that is less sensitive to edge effects), and the pair correlation function. Computations are performed with the CRAN package spatstat (Baddeley 2005). A pair correlation function is shown in Figure 2e. [as.matrix, owin, as.ppp, Kest, Lest, Jest, pcf, plot] Voronoi tessellation and adaptive 2D kernel density estimation The Voronoi (or Dirichlet) tessellation is a division of the space into distinct regions around each data point defined such that each tile contains the locations closer to that point than any other point. The result is the division of the space into polygons around each point. The inverse of the polygonal area is a measure of the local density of points, and is used to create a nonparametric adaptive smoother of the distribution of points. VOStat provides both a 2-dimensional grey-scale map and a 3-dimensional perspective plot of the smoothed distribution that can be interactively rotated. These operations are computed with spatstat. A Voronoi tessellation and its associated adaptively smoothed distribution are shown in Figure 2f -g. [as.matrix, owin, dirichelet, adaptive.density, plot, surf3d] 5.9. Directional statistics Plotting directional data A circular plot is done to provide an idea about the general shape of the data. It is an anlog of barplot or histogram for directional data. A directional data plot is shown in Figure 2a.

Summarizing directional data
Since directional data deal with angles, one has to be careful about 2π radians wrapping back to 0. VOStat computes the mean direction as well as the mean resultant length. [circular] Kernel density estimation This tools computes kernel density estimate from directional data. Testing goodness-of-fit with a von Mises distribution Usually fitting a distribution to a data is not enough until we know if the fit is adequate. Watson's test checks goodness of fit of a given data set with a fitted von Mises distribution. It requires bootstrapping. [CircStats] Testing goodness-of-fit with a uniform distribution Same as above but with uniform distribution in place of von Mises distribution.

Survival analysis
Maximum likelihood Kaplan-Meier estimate of the survival curve for censored data Astronomers often observe objects without detection, giving rise to upper limits of the observed property. In statistics, these limits are called left-censored data points, and are treated with methods from 'survival analysis'. The Kaplan-Meier estimator is the unique maximum likelihood estimator of a randomly censored univariate sample. Confidence intervals are based on Greenwood's formula. A VOStat plot of a Kaplan-Meier estimator is shown in Figure 3. [Surv, surfit, plot, lines] Tests for censored data Two-sample tests to compare different univariate samples with censoring. Generalizations of the Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests are applied. Here, censoring patterns need not be random or the same in the two samples. [survdiff] Maximum likelihood Lynden-Bell-Woodroofe estimator While censoring occurs when known objects are not detected in some property, truncation occurs when an unknown number of unknown objects are not detected due to sensitivity limitations. Analogous to the better-known Kaplan-Meier distribution for censoring, the Lynden-Bell-Woodroofe estimator gives the unique maximum likelihood estimator for a randomly truncated univariate variable (Lynden-Bell 1971;Woodroofe 1985). Useful for estimation of luminosity functions in flux-limited astronomical surveys, its mathematical properties are often better than heuristic functions like the widely-used 1/V max estimator with arbitrary binning (Schmidt 1968).

Time series analysis
Nearly all time series analysis in R and CRAN requires regularly spaced data without gaps. These methods are thus inappropriate for many irregularly spaced astronomical datasets.

Time series plotting with smoothing
This VOStat function plots a univariate time series with Gaussian kernel smoothing based on user-provided bandwidth. A time series plot is shown in Figure 4a. [plot.ts, ksmooth, lines] (Partial) Autocorrelation Function (ACF and PACF) Computation and plot of the autocorrelation function of the univariate time series; this is the correlation as a function of lag time. The partial autocorrelation function removes the effects of intermediate lags. A PACF plot is shown in Figure 4b. [acf, pacf] Autoregressive modeling This is a maximum likelihood fit of the time series to a stochastic autoregressive model, where the order of the model is selected by the maximum of the Akaike Information Criterion (AIC), a penalized likelihood measure. VOStat provides plots of AIC against order, and shows the periodogram of the resulting best-fit model. [ar, plot] Periodogram with smoothing option This function provides a Schuster periodogram based on a Fourier transform of the time series, with Daniell (boxcar) smoothing. The time series is first detrended by a linear function, a 10% taper is applied, and a 95% confidence interval assuming white noise is shown. A periodogram of an astronomical time series is shown in Figure 4c. [spec.pgram]

Conclusion
While many resources are applied to the reduction of data as it arrives from telescopes, VOStat is one of the relatively few resources available to assist the astronomer in the later stages of scientific analysis, such as the discovery of patterns in the observations and comparison of observed properties of celestial objects with astrophysical theory. As eloquently phrased by Gregory (2005): The goal of science is to unlock natures secrets. ... Our understanding comes through the development of theoretical models which are capable of explaining the existing observations as well as making testable predictions. ... Fortunately, a variety of sophisticated mathematical and computational approaches have been developed to help us through this interface, these go under the general heading of statistical inference.
VOStat provides a user-friendly, interactive, Web-based interface that applies several dozen statistical operations to a user-supplied dataset. A unique difference from other statistical services is its integration with other interactive tools of the Virtual Observatory using the VO's SAMP communication system between applications. Using VO tools, an astronomer can extract carefully chosen datasets from huge, widely distributed, multiwavelength databases, analyzing and enhancing the data in various ways within VOStat. Some results from VOStat analysis can then be broadcast back to other VO applications through SAMP for visualization and further analysis.
The engine behind VOStat is the remarkable R statistical software system, the most comprehensive available public domain data analysis environment. We emphasize that VOStat taps only a small fraction of R, and a tiny (< 1%) fraction of the specialized CRAN packages. Thus, we strongly encourage VOStat users to extend the on-line analysis with a more thorough interactive analysis on their home computer using R and CRAN. Educational materials for learning R are available at http://www.r-project.org and in many books, including one designed specially for astronomy (Feigelson & Babu 2012). This work is supported by the NSF 'SI2-SSE' grant AST-1047586 (PI -G. J. Babu).

STOP
Core radius (pc)   Partial autocorrelation function showing complex structure, with 95% confidence intervals assuming white noise. (c) Fourier periodogram with detrending, tapering and smoothing, with 95% confidence interval assuming white noise. Data and analysis are described in Feigelson & Babu (2012).