Validating Quantitative Measurement Using Qualitative Data: Combining Rasch Scaling and Latent Semantic Analysis in Psychiatry

An extension of concurrent validity is proposed that uses qualitative data for the purpose of validating quantitative measures. The approach relies on Latent Semantic Analysis (LSA), which places verbal (written) statements in a high-dimensional semantic space. Using data from a medical / psychiatric domain as a case study, Near-Death Experiences (NDE), we established concurrent validity by connecting NDErs' qualitative (written) experiential accounts with their locations on a Rasch-scalable measure of NDE intensity. Concurrent validity received strong empirical support, since the variance in the Rasch measures could be predicted reliably from the coordinates of the accounts in the LSA-derived semantic space (R2 = 0.33). These coordinates also predicted NDErs' age with considerable precision (R2 = 0.25). Both estimates are probably artificially low due to the small available data sample (n = 588). Rasch scalability of NDE intensity appears to be a prerequisite for these findings, as each intensity level is associated (at least probabilistically) with a well-defined pattern of item endorsements.


Overview
Issues of validity have a long-standing history in measurement within the social sciences, as it may not be clear that questionnaires and tests indeed capture what they are intended to measure, a requirement often called construct validity (Messick, 1989). A test is said to show face validity if test-takers' (respondents') subjective hypotheses concerning a test's purpose correspond to the designers' intentions. While important for tests' public acceptance (Holden, 2010), face validity is not considered authoritative. Instead, most serious validation research relies on outside statistical criteria: predictive validity reflects the possibility of predicting particular behavioral outcomes, and concurrent validity revolves around a measure's correlation with existing measures of the same construct. Borsboom and Markus (2013) have criticized these approaches for confusing consistency with truth, and others have argued that validity should revolve around the suitability of an instrument for application in actual practice (Fisher, 2003). Rasch scaling aids in this respect, as it provides a variety of fit statistics to identify items that fail to show construct invariance, as well as persons for whom the items do not combine to form linear measures (Michell, 2003).
The preceding notions of validity stand in stark contrast to the aims of qualitative psychology, which focuses on understanding behavior from informants' perspectives within a dynamic and negotiated reality (see, e.g., Minichiello, 1990). According to Miles and Huberman (1984, p. 225), "In qualitative research, numbers tend to get ignored. After all, the hallmark of qualitative research is that it goes beyond how much there is of something to tell us about its essential qualities." Exceptional in this respect are qualitative studies (Lange et al., 2014) that encode their data in a format suitable for multi-faceted Rasch scaling (Linacre, 2013). However, most qualitative research efforts report their findings in the informants' subjective language, thereby making it difficult to judge their validity, as it is hard to connect the findings to those obtained by other researchers.
Using data from a medical domain (psychiatry), specifically the reporting of near-death experiences (NDE), this paper aims to show that purely qualitative data can be transformed, via unsupervised learning, into a form suitable for the validation of quantitative measures. The approach is based on algorithms recently developed in Artificial Intelligence and Natural Language Processing to perform Latent Semantic Analysis (LSA) on large data sets (see below). LSA takes respondents' qualitative (written) accounts of their own subjective opinions, insights, and interpretations of events, and places these inside a high-dimensional semantic space. Also available is a Rasch-scaled questionnaire that assesses important aspects of this domain, and standard statistical techniques will be used to predict respondents' locations on this Rasch dimension from their coordinates in the semantic space. Thus, the preceding can be seen as a special case of establishing concurrent validity by connecting qualitative and quantitative data. The approach involves considerable simplification, as LSA reduces the richness and complexity of the data valued by qualitative researchers; it is not clear whether this simplification enhances or detracts from the overall effectiveness of the approach.

Latent Semantic Analysis
Assuming that we have d documents and t relevant terms, one can construct a matrix X whose rows 1, …, i, …, t represent a dictionary of selected terms (or "tokens"), and whose columns 1, …, j, …, d represent the documents containing these terms. Each entry X_ij reflects the weighted frequency with which dictionary term i occurs in document j. The weighting does not affect the following overview; the topic is further addressed in the Term Weighting section below.
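The construction of X can be sketched as follows, using a hypothetical toy corpus (the documents, terms, and counts are invented for illustration only):

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
docs = [["peace", "light", "peace"], ["light", "tunnel"], ["peace"]]

# Dictionary of relevant terms: one row of X per term.
terms = sorted({tok for d in docs for tok in d})

# X[i][j] = raw frequency of term i in document j (unweighted here;
# weighting is discussed in the Term Weighting section).
X = [[Counter(d)[t] for d in docs] for t in terms]
```

In a real application the dictionary would be restricted to terms deemed relevant (as done with the 1500-token dictionary below), and the raw counts would be weighted before decomposition.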
It has been known since the last century that any matrix X can be written as the product of three matrices:

X = T S D'    (Eq. 1)

where T has size t x d and D has size d x d, both with orthonormal columns (i.e., T'T = I and D'D = I), while S is a d x d diagonal matrix. The matrices T, D, and S are referred to as the left singular vectors, the right singular vectors, and the diagonal matrix of singular values, respectively, and together these define X's Singular Value Decomposition (SVD). By convention, the diagonal elements of S are positive and ordered in decreasing magnitude. The utility of SVD lies in the fact that keeping just the k largest singular values (i.e., setting S_k+1 = S_k+2 = … = S_d = 0) yields a matrix X* which approximates X increasingly well for greater k. Doing so produces two useful results. First, X* is the matrix of rank k that is closest to X in a least-squares sense. Second, since the corresponding columns of T and D are multiplied by 0, they can simply be deleted, yielding T* and D* with sizes t x k and d x k, respectively. LSA (see, e.g., Landauer, Foltz, and Laham, 1998) uses SVD to reduce the dimensionality of the semantic space needed to represent a set of documents from d to k. The choice of k is important: it should be chosen such that X* captures mainly the "real" structure in X while omitting the sampling error. There is no obviously proper way of achieving this goal, and researchers therefore experiment with different choices of k to achieve the best results. In practice, the initial dimensionality t is on the order of several thousand, whereas the reduced space has a dimensionality of k = 300-500.
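The truncation step of Eq. 1 can be illustrated with numpy on a small random matrix (the sizes and data here are arbitrary placeholders, not the study's matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))          # toy matrix: t = 6 terms x d = 4 documents

# Full SVD: X = T @ diag(s) @ Dt, with singular values s in decreasing order.
T, s, Dt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values and the matching columns of T and D.
k = 2
T_k, S_k, Dt_k = T[:, :k], np.diag(s[:k]), Dt[:k, :]

# X_star is the rank-k matrix closest to X in the least-squares sense.
X_star = T_k @ S_k @ Dt_k
err = np.linalg.norm(X - X_star)
```

The approximation error `err` shrinks as k grows, and reconstruction is exact when k equals the full rank.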
Of course, the approach would be of little use if Eq. 1 had to be recomputed each time a new document y is encountered. However, assuming that the model is correct (i.e., X = X*), a document's coordinates in the reduced space can be obtained by "folding in" its (weighted) term-frequency vector y (Deerwester et al., 1990, p. 399):

y* = y' T* S*^-1    (Eq. 2)

where S* is the k x k diagonal matrix of retained singular values. While the mathematics behind SVD has long been known, computing T, S, and D at the sizes needed for actual text-analysis applications became feasible only with the advent of modern computers and the development of efficient algorithms. Fortunately, recent years have seen the rapid development of extremely powerful methods based on sparse matrices, capable of distributed and/or parallel processing. For instance, Řehůřek and Sojka (2010) describe the Python-based Gensim software, which is capable of analyzing enormous data sets on a standard personal computer.
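The folding-in formula of Eq. 2 can be sketched as follows (again on an arbitrary toy matrix). A useful sanity check is that folding in a column of X itself reproduces that document's existing coordinates, i.e., the corresponding row of D*:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((6, 4))                      # toy t x d term-document matrix

T, s, Dt = np.linalg.svd(X, full_matrices=False)
k = 2
T_k, S_k = T[:, :k], np.diag(s[:k])         # T* and S* of Eq. 2

# Fold a new document's term-frequency vector y into the k-dimensional space:
# y* = y' T* S*^-1  (Eq. 2), without recomputing the SVD.
y = rng.random(6)
y_star = y @ T_k @ np.linalg.inv(S_k)       # k coordinates in semantic space
```

Because S* is diagonal, the inverse is trivial; in practice one would divide by `s[:k]` elementwise rather than invert a matrix.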

Term Weighting
It is possible to base LSA directly on the frequency with which term i occurs in document j. However, in practice better results are obtained by transforming the term-document co-occurrence matrix into a locally/globally weighted term frequency - inverse document frequency (tf-idf) matrix with real-valued entries. Such transformed values increase proportionally with a term's occurrence within a document but, in order to control for the fact that some words are more common than others, they are discounted via the logarithm of the inverse of the fraction of documents in which the term appears. For instance, assume that there are just three documents d1, d2, and d3, and that some term z occurs 3 times in d1 and once in d2. Then z's inverse document frequency is idf = ln(3/2) = 0.41 (using natural logarithms), and since tf is counted per document, tf-idf_z = 3 × 0.41 = 1.22 in d1 and 1 × 0.41 = 0.41 in d2.
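The worked example above, assuming the usual convention that tf is counted per document, reduces to a few lines:

```python
import math

# Occurrences of term z in documents d1, d2, d3 (the toy corpus above).
counts = [3, 1, 0]
n_docs = len(counts)

# Global weight: z appears in 2 of the 3 documents.
df = sum(c > 0 for c in counts)
idf = math.log(n_docs / df)              # ln(3/2) ~= 0.41

# Local x global weight per document: one tf-idf entry per (term, document).
tfidf = [c * idf for c in counts]        # ~= [1.22, 0.41, 0.0]
```

A term occurring in every document gets idf = ln(1) = 0 and is thus weighted out entirely, which is exactly the intended control for overly common words.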

Case Study: Near-Death Experiences
It is well known to physicians and mental health professionals that people (adults and children alike), when suddenly faced with their own death, experience a distinctive state of consciousness in which their existence is seemingly unbounded by a physical body or earthly environs. Such near-death experiences (NDE) are among the most potent of psychological episodes (for an overview, see, e.g., Holden, Greyson, & James, 2009). In fact, the Diagnostic and Statistical Manual of Mental Disorders contains the V-Code category "Religious or Spiritual Problem" in part to acknowledge and guide clinicians in addressing the impact and aftereffects of NDEs and related experiences (Lukoff, 1998). Greyson (1983) quantified NDE intensity by means of a series of questions derived from statements of individuals who shared their accounts following such experiences. Specifically, NDErs (i.e., those reporting an NDE) were asked to rate the occurrence of 16 different experiences in terms of three ordered categories that generically represent 'not present,' 'mildly or ambiguously present,' or 'definitively present.' Indicative of its measurement validity, the NDE scale proved to be Rasch scalable (Lange, Greyson, & Houran, 2004).
The questions constituting Greyson's NDE measure are listed in Table 1, together with their item difficulties in logits. Note that items at the lower end of this scale (i.e., with the highest endorsement) refer to experiences of peace, joy, and harmony, followed by finding insight and mystical or religious experiences at intermediate levels, while items with the highest values (least endorsed) refer to an awareness of events occurring in a different place or time. By the very nature of Rasch scaling, this sequence defines a true hierarchy such that events higher in this sequence become salient only after those lower in the sequence have already been reported, at least probabilistically. In other words, there exists a well-defined relation between NDE intensity and the verbal, i.e., qualitative, contents of the NDE.
Naturally, those who have not had near-death experiences cannot be expected to provide meaningful information about NDE. Therefore, Table 1 shows the item locations as computed only for those who were classified psychiatrically as having had a "True NDE" (cf. Greyson, 1983). An independent sample of 833 NDE accounts was available, based on which a semantic space was constructed using Řehůřek and Sojka's (2010) Gensim software. This space used a dictionary with 1500 tokens (terms) that were deemed relevant, and the resulting 1500 x 833 token-by-document matrix was decomposed according to Eq. 1. A total of 588 of the 833 people had also completed Greyson's NDE scale, and their Rasch NDE intensity estimates were combined with their accounts' coordinates in the semantic space as computed using Eq. 2. Using the glm procedure provided by the R language, the first 50, 100, …, 400 coordinates (ordered by importance) were used to predict respondents' estimated Rasch NDE intensity via standard multiple regression.
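The regression step can be sketched as follows. The analysis reported above used R's glm; this is an equivalent ordinary-least-squares sketch in Python, with synthetic placeholder data standing in for the real LSA coordinates and Rasch estimates (only the sample size and number of predictors are taken from the text):

```python
import numpy as np

# Placeholder data: 588 respondents, first 250 semantic coordinates each.
rng = np.random.default_rng(2)
n, k = 588, 250
coords = rng.normal(size=(n, k))             # stand-in for LSA coordinates
rasch = coords[:, 0] + rng.normal(size=n)    # stand-in for Rasch NDE estimates

# Multiple regression of the Rasch measures on the semantic coordinates.
A = np.column_stack([np.ones(n), coords])    # design matrix with intercept
beta, *_ = np.linalg.lstsq(A, rasch, rcond=None)

# R^2: proportion of variance in the Rasch measures explained by the space.
pred = A @ beta
r2 = 1 - ((rasch - pred) ** 2).sum() / ((rasch - rasch.mean()) ** 2).sum()
```

With k = 250 predictors and n = 588 cases, the adjustment mentioned below matters: adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1) penalizes the raw fit for the large number of coordinates.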
Not surprisingly, Table 2 shows that using increasingly complex semantic spaces adds predictive information, as R2 increases continuously as more predictors are added. Consistent with other research, the R2 values adjusted for sample size and number of predictor variables reach a maximum (0.33) at about 250 predictor variables, and this adjusted value is perhaps the more appropriate estimate of the quality of prediction. Nevertheless, the finding that at least one third of the variation in respondents' quantitative Rasch NDE measures can be predicted from their qualitative NDE accounts strongly supports the notion that qualitative data can be used to validate quantitative variables. The correlation between the variance explained by the first 250 factors and their weights in the multiple regression equation is essentially zero (r = -0.06); that is, the semantic factors do not contribute to the prediction in proportion to the variance they explain in the SVD context.
It is further noted that respondents' age could reliably be predicted from the semantics of their NDE accounts (R2 = 0.25). However, the multiple correlation with NDErs' gender proved negligible.

Conclusion
A standard approach to establishing concurrent validity is to correlate responses to one set of questions with responses to other sets of questions purporting to measure the same construct. This approach need not be limited to comparisons between quantitative measures, since Latent Semantic Analysis can be used to capture the meaning of free-style written descriptions in order to establish a correlation with a Rasch-scaled variable. It is my conjecture that one variable's Rasch scalability may well prove crucial in this context, as Rasch scalability defines an "item hierarchy" which essentially imposes an orderly progression on the semantic labels corresponding to the variable's quantitative properties.
It should be noted that the analysis of textual data is no longer severely limited by the size of the text corpus, economic factors, or hardware constraints. For instance, Řehůřek and Sojka's (2010) Gensim, used here, is freely available under the OSI-approved GNU LGPL license. This software is capable of analyzing the complete English Wikipedia, decomposing a sparse matrix of 3.9 million rows x 100,000 dictionary terms with about 760 million non-zero entries in about 13 hours on a standard PC.