Construct Validity of the Science Motivation and Beliefs Instrument (SLA-MB): A Case Study in Sumedang, Indonesia

Numerous instruments have been developed and used in science education research, and some of them have been translated into the local language of the country where they are used. Most researchers who use translated instruments do not report the quality of the translated version. One such instrument is the Scientific Literacy Assessment (SLA), which includes the Science Motivation and Beliefs scale (SLA-MB). In this study, the SLA-MB was translated into Indonesian (Bahasa). The purpose of this study is to investigate the Indonesian translation of the SLA-MB in terms of dimensionality, reliability, item quality and differential item functioning (DIF) based on IRT-Rasch analysis. We used ConQuest and Winsteps for the IRT-Rasch analysis. We employed a quantitative research method with a school survey. The research subjects were 223 Indonesian middle school students (aged 13-16), 64 boys and 159 girls. IRT-Rasch analysis of the Indonesian SLA-MB indicated that a three-dimensional model fit significantly better than a one-dimensional model, and the reliability of the dimensions ranged from about 0.60 to 0.82. In addition, the fit values of all items were acceptable, and we found no DIF for any of the SLA-MB items. Overall, our study suggests that the Indonesian version of the SLA-MB is acceptable for use as a research instrument in Indonesia.


Introduction
The rapid development of science and technology in the 21st century demands that people have not only basic literacy such as reading, writing and counting, but also the ability to cope with a century full of scientific and technological products. The basic capability that enables people to cope with this rapid development is called scientific literacy. As described by the Organisation for Economic Cooperation and Development (OECD) [1], scientific literacy is the ability to understand scientific knowledge, identify scientific issues, draw conclusions based on evidence and make decisions about how humans affect the environment. Thus, a scientifically literate society can make decisions and participate actively in discussions of current scientific issues [2]. Unfortunately, according to the Programme for International Student Assessment (PISA), a well-known triennial international survey, Indonesian students' achievement in scientific literacy was consistently in the low category from 2000 to 2012; in the 2012 survey, Indonesia ranked 64th of the 65 participating countries [3]. The Indonesian government has not stood still in the face of this long-standing low achievement. This is reflected in the publication of the new Curriculum 2013, one of whose goals is to increase the performance of Indonesian students in international surveys and studies, especially PISA [4].
However, scientific literacy involves not only cognitive aspects but also affective aspects, and several studies and experts have indicated that affective factors can significantly influence cognitive ones [5], [6]. One affective component that has been scrutinized and is related to scientific literacy is the motivation to learn science [7]. The motivation to learn science is an internal state that arouses, directs and sustains science-learning behaviour [7]. This view of motivation is essentially an elaboration of Bandura's social cognitive theory, which assumes that students learn effectively when their learning is self-regulated, that is, when they can control their motivation toward what they learn, which eventually shapes their learning outcomes [7], [8]. Making students motivated self-learners is the goal of every teacher [9], because students who have become motivated self-learners in science classes will be curious to solve and vigorously discuss science-related problems. This capability will eventually lead students to apply what they learn in their daily lives, which are multidisciplinary. This is aligned with one of the definitions and purposes of scientific literacy, namely that individuals can apply the methods, concepts and theories of science in their daily lives. Hence, the motivation to learn science plays an important role in improving students' scientific literacy. This matters especially for Indonesian students: besides the PISA results, Rachmatullah et al. showed that Indonesian students' motivation towards science is still low, at a value of 60% [10].
To evaluate Indonesian students' attitudes and motivation in learning science, we need a valid, reliable and high-quality instrument that meets psychometric standards, so that the results obtained from it can be trusted. Recently, Fives et al. developed a set of instruments to measure the scientific literacy of middle school students. The set measures not only students' cognitive aspects but also their motivation towards science, through the Scientific Literacy Assessment-Science Motivation and Beliefs (SLA-MB) scale [6]. In a previous study, we used the SLA-MB translated into Indonesian (Bahasa), but we reported only its content validation and not its construct validation [10]. Construct validity is important to report because, as Messick stated, it provides evidence supporting the trustworthiness of score interpretation in terms of explanatory concepts that account for both test performance and the relationships between scores and other variables. In short, construct validity is the evidence base for interpreting the scores obtained from students or other participants [11].
According to Messick [11], in addition to the content aspect mentioned above, construct validity has five other aspects: substantive, structural, generalizability, external and consequential. The substantive aspect concerns empirical evidence of the consistency of responses to the instrument [12]. The structural aspect concerns whether the consistency of responses reflects the particular domains or dimensions that underlie the instrument. The generalizability aspect assumes that an instrument must be fair across a variety of respondents, so that no group of respondents is disadvantaged when completing it. The external aspect concerns the correlation between the developed instrument and other similar instruments (e.g. with the same content), while the consequential aspect concerns the consequences or effects on respondents of completing the instrument. For more information about construct validity, see Messick's "Validity of Psychological Assessment" [11]. To determine the construct validity of an instrument, especially in social sciences such as psychology and education, including science education, the best-known methods are Classical Test Theory (CTT) and Item Response Theory (IRT). CTT assumes that an individual's item scores on a rating scale combine linearly into a total score, and each item is presumed to have the same difficulty level and standard error [13]. As a result, rating scale data (e.g. Likert-scale data) are often treated as interval data and raw scores are summed directly. In fact, the numbers on a Likert scale are ordinal data, for which the distance between one point and the next is not known exactly.
Consider the unclear distinction between "strongly agree", "agree" and "disagree": is it true that "agree" and "strongly agree" differ by exactly one point of attitude? Such a point or score belongs to categorical data, namely ordinal data, which cannot be used in parametric statistical tests [13], because only interval and ratio data can be used in those tests [14]. Thus, results obtained with the CTT method are often not comparable across studies, and the results of an instrument analyzed with CTT apply only to the same research subjects and cannot be extended to a wider population [15].
IRT is a psychometric method for examining whether an instrument is reliable, valid and widely usable [16]. An IRT approach such as the Rasch analysis conducted in this study converts ordinal data into linear measures and produces item parameters and person parameters expressed on the same scale [16]. Data analyzed through Rasch analysis become comparable to other types of measurement, such as height, which are continuous and have a specific unit. The unit of Rasch measures is the logit (logarithm of odds), which is at the interval level [13], [17]. Therefore, many science education researchers, mainly those working on educational assessment, have recently started using IRT to test the reliability, validity and legitimacy of instruments and have begun to move away from CTT [18]. For more details about Rasch models, see Boone et al. [19]. Based on the above, the purpose of this study is to uncover the construct validity (dimensionality, generalizability and reliability) of the SLA-MB translated into Indonesian using the IRT-Rasch model, a method that was not used by the instrument's developers.
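The link between an ordinal rating and a logit measure described above can be sketched in a few lines. This is a minimal illustration that assumes the Andrich rating scale form of the Rasch model; the source does not state which parameterisation was used, and the function names and all numeric values below are hypothetical.

```python
import math

def logit(p):
    """Log-odds of a probability p (0 < p < 1); this is the unit of Rasch measures."""
    return math.log(p / (1.0 - p))

def rating_scale_probs(theta, delta, thresholds):
    """Category probabilities under the Andrich rating scale model (an assumed form).

    theta: person measure in logits
    delta: item difficulty in logits
    thresholds: step parameters tau_1..tau_m in logits, shared by all items
    Returns the probabilities of categories 0..m.
    """
    # Cumulative sums of (theta - delta - tau_j); category 0 has an empty sum.
    numerators = [0.0]
    for tau in thresholds:
        numerators.append(numerators[-1] + (theta - delta - tau))
    weights = [math.exp(v) for v in numerators]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical person, item and thresholds for a five-category Likert item.
probs = rating_scale_probs(theta=0.5, delta=0.0, thresholds=[-1.5, -0.5, 0.5, 1.5])
```

The person and the item sit on the same logit scale, so only the difference between them (together with the thresholds) determines how likely each response category is.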

Research methods
This research was originally intended to uncover the scientific literacy of Indonesian middle school students in one district, Sumedang, so the data collection procedure was modelled on that of PISA, and the subjects were students aged 13-16 years. In Indonesia, students in that age range are in the ninth grade of middle school. The research subjects were therefore 223 ninth-grade middle school students, 159 girls and 64 boys. The students came from eight middle schools in four sub-regions of Sumedang district. Four of the eight schools hold A accreditation and the other four hold B accreditation, so that each sub-region is represented by both an A-accredited and a B-accredited school. Because the data were used only for a pilot study and for the purpose of generalization, we sampled only one class from each school.
As mentioned in the previous section, we employed one part of the scientific literacy instruments developed by Fives et al. [6], namely the Scientific Literacy Assessment-Science Motivation and Beliefs (SLA-MB). When developing the SLA-MB, Fives et al. divided it into three indicators: value of science, self-efficacy and personal epistemology. The SLA-MB is a rating scale using a Likert scale from 1 to 5. Before it was used as the research instrument, it was translated from English into Indonesian by experts in English-Indonesian translation and then checked by an expert in science education for contextual fit. Table 1 lists each item of each SLA-MB indicator in both the original English version and the Indonesian version.
The construct validity of an instrument or assessment consists of six aspects: content, substantive, structural, generalizability, external and consequential. In this study we did not examine all of these aspects; we examined only three that we believe are crucial and closely tied to the definition of construct validity: the substantive, structural and generalizability aspects. Before analyzing them, when coding the data, we reversed the scores of negatively worded items. We first examined the structural aspect using a dimensionality test based on IRT-Rasch analysis. We tested two Rasch models for the SLA-MB, a unidimensional model and a multidimensional (three-dimensional) model, the latter in accordance with the content validity. A higher chi-square (χ²) value and lower final deviance and Akaike Information Criterion (AIC) values than those of the competing model indicate that a model fits the instrument well. To examine the substantive aspect, we analyzed reliability using both IRT-Rasch (PV-reliability) and CTT (Cronbach's alpha). We also report item fit indices for each dimension as mean-square (MNSQ) values, for which the rating scale benchmark is 0.7-1.4 [19]. Lastly, to examine the generalizability aspect, we conducted a gender Differential Item Functioning (DIF) analysis to check whether any item shows gender bias. According to Boone et al., biased items can be identified by a DIF contrast > 0.64 [19]. We used ConQuest 4.5.0, SPSS 22 and Winsteps 3.68.2 to analyze our data.
Saya dapat menggunakan sains untuk membuat suatu keputusan mengenai kehidupan sehari-hari saya.

SE2: I know how to use the scientific method to solve problems.
SE3: It is easy for me to tell the difference between scientific findings and advertisements.
SE4: When I do my work in science class, I am able to find the important ideas.
SE5: I can use math to answer scientific questions. Saya dapat menggunakan matematika untuk menjawab pertanyaan ilmiah.
SE6: I can tell the difference between observations and conclusions in a story.
SE7: It is easy for me to make a graph of my data.
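The data preparation and two of the quantities named in the research methods (reverse coding of negatively worded items, Cronbach's alpha, and AIC-based model comparison) can be sketched as follows. This is only an illustration under stated assumptions: the responses, deviances and parameter counts are hypothetical, the function names are not taken from ConQuest, SPSS or Winsteps, and the AIC is assumed to take the usual form of deviance plus twice the number of parameters.

```python
import statistics

def reverse_code(score, low=1, high=5):
    """Reverse a Likert score so a negatively worded item points the same way as the rest."""
    return low + high - score

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])
    item_vars = [statistics.variance([r[i] for r in responses]) for i in range(k)]
    total_var = statistics.variance([sum(r) for r in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def aic(deviance, n_params):
    """Akaike Information Criterion from a final deviance (-2 log-likelihood)."""
    return deviance + 2 * n_params

# Hypothetical responses of four students to three items; the third item is
# assumed to be negatively worded and is therefore reverse coded.
raw = [[4, 5, 2], [3, 3, 3], [5, 4, 1], [2, 2, 4]]
coded = [[r[0], r[1], reverse_code(r[2])] for r in raw]
alpha = cronbach_alpha(coded)

# Hypothetical final deviances and parameter counts for the two competing models.
aic_1d = aic(deviance=9000.0, n_params=30)
aic_3d = aic(deviance=8900.0, n_params=35)
preferred = "three-dimensional" if aic_3d < aic_1d else "one-dimensional"
```

The model with the lower deviance and AIC is the one preferred in the comparison; the difference between the two deviances is the quantity usually referred to a chi-square distribution.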

Findings
As can be seen in Table 2, the three-dimensional model fit the data better than the one-dimensional model. The substantive aspect of the SLA-MB was examined through a reliability test, the results of which are shown in Table 3. According to Table 3, the reliability values of every dimension, both from CTT (Cronbach's alpha) and from IRT-Rasch (PV-reliability), are above 0.6, which means that the results obtained from the instrument can be accepted.
The Value of Science dimension has reliability values of 0.691 (CTT) and 0.735 (IRT-Rasch), the Self-efficacy dimension 0.603 and 0.662, and the Personal Epistemology dimension 0.785 and 0.817, respectively. In addition to reliability, we examined the substantive aspect by exploring the fit of each item to the Rasch model; the results are shown as weighted and unweighted MNSQ values in Table 4. For the Value of Science dimension, the MNSQ values of the items ranged from 0.87 to 1.24, for the Self-efficacy dimension from 0.79 to 1.21, and for the Personal Epistemology dimension from 0.65 to 1.31. Because the MNSQ values lie within the 0.7-1.4 benchmark for almost all items in each dimension of the SLA-MB, the items fit the Rasch model. As shown in Table 4, we did not find gender DIF, that is gender bias, in any of the SLA-MB items: the DIF contrast of every item is below 0.64. The DIF contrast ranged from 0.03 to 0.27 for the Value of Science dimension, from 0.03 to 0.54 for the Self-efficacy dimension and from 0.09 to 0.35 for the Personal Epistemology dimension.
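The comparison of the reported values with the stated benchmarks can be written out explicitly. The sketch below uses only the reliability values and MNSQ ranges quoted in the text; the function and variable names are hypothetical, and treating the benchmark bounds as inclusive is an assumption.

```python
# Benchmarks stated in the text.
RELIABILITY_MIN = 0.6
MNSQ_LOW, MNSQ_HIGH = 0.7, 1.4

# Values reported in the text: (Cronbach's alpha, PV-reliability) per dimension.
reliability = {
    "Value of Science": (0.691, 0.735),
    "Self-efficacy": (0.603, 0.662),
    "Personal Epistemology": (0.785, 0.817),
}

# Values reported in the text: (lowest MNSQ, highest MNSQ) per dimension.
mnsq_range = {
    "Value of Science": (0.87, 1.24),
    "Self-efficacy": (0.79, 1.21),
    "Personal Epistemology": (0.65, 1.31),
}

def reliable(values, minimum=RELIABILITY_MIN):
    """True if every reliability coefficient exceeds the minimum."""
    return all(v > minimum for v in values)

def fully_within_benchmark(low, high, lower=MNSQ_LOW, upper=MNSQ_HIGH):
    """True if the whole reported MNSQ range lies inside the benchmark (bounds inclusive)."""
    return lower <= low and high <= upper

all_reliable = all(reliable(v) for v in reliability.values())
outside = [d for d, (lo, hi) in mnsq_range.items() if not fully_within_benchmark(lo, hi)]
```

With these inputs every dimension passes the reliability check, and only the Personal Epistemology range extends below the lower MNSQ bound, which is consistent with the statement that almost all items meet the benchmark.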

Discussions
Validity is an overall evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores or other modes of assessment [11]. The validity of an instrument therefore not only supports the instrument or assessment but also reflects what we explore through it. A valid instrument yields findings and data that reflect what the instrument is supposed to capture. Building on the findings on the construct validity of the SLA-MB described in the previous section, this section discusses the extent to which the aspects of construct validity examined in this study can be explained. First, the structural aspect of construct validity concerns whether the consistency of participants' responses reflects particular domains or dimensions. Table 2 shows that, based on the IRT-Rasch analysis, the SLA-MB comprises three domains, in accordance with the content validity established by the original developers of the SLA-MB. These three domains or dimensions are value of science, self-efficacy and personal epistemology. Disclosing the domains or dimensions not only reveals the structural form of an assessment or research instrument but also clarifies which variables the instrument traces. Basically, an instrument or assessment can uncover one or more variables, whether knowledge, performance or attitudes of the respondents [18]. Why should we use a statistical analysis, in this case IRT-Rasch analysis, rather than simply dividing the items according to their content?
As explained previously, construct validity deals not only with the theoretical aspect (in this case the content) but also with the empirical aspect reflected in the sample data. Thus, content validity and structural validity complement each other. The implication of establishing an instrument's dimensionality is that it uncovers the boundaries of each item: an item measures the property that the instrument is supposed to measure and lies in the same domain or dimension as the other items that measure the same variable.

MSCEIS
IOP Publishing IOP Conf. Series: Journal
The results of the dimensionality investigation bear on the quality of every item in every dimension. Based on Table 4, especially the fit indices (MNSQ) of each item, almost all items have MNSQ values that meet the benchmark. This shows that the items fit the Rasch model, in the sense that each item fits the dimension to which it belongs. However, one item in the Personal Epistemology dimension, EB4, overfits the Rasch model, with an MNSQ of 0.66. We regard this item as still fitting the Rasch model because 0.66 is close to the cutoff. In addition to item fit, the substantive aspect of an instrument can also be explained by its reliability, or internal consistency. An instrument with sufficient reliability can be expected to provide consistent measurement results [18]. Based on the reliability tests using CTT (Cronbach's alpha) and the Rasch model (PV-reliability), every dimension of the SLA-MB has reliability values indicating good internal consistency [20].
The last finding of our study concerns gender Differential Item Functioning (DIF) in the SLA-MB. The purpose of the gender DIF analysis is to examine whether one gender group is advantaged and the other disadvantaged when responding to each item of the SLA-MB. Based on the results in Table 4, we found no substantial gender bias in any of the items (DIF contrast < 0.64). This indicates that the SLA-MB is not biased in favor of one gender group, which also supports the generalizability of the SLA-MB. The absence of DIF implies that the survey results can be used for comparative studies of gender, because no item shows significant gender bias. In contrast, if we had found gender bias, for example some items being easier for girls to endorse than for boys, the findings could not be used for a comparative analysis of gender, because girls would predictably obtain higher results.
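The DIF criterion discussed above can be illustrated with a short sketch. It assumes that the DIF contrast of an item is the difference between its difficulty estimated separately for each gender group, and that the 0.64 cutoff is applied to the absolute value; both are assumptions of this example. The item labels and difficulty values are hypothetical and are not taken from Table 4.

```python
DIF_CUTOFF = 0.64  # logits, the criterion cited from Boone et al. [19]

def dif_contrast(difficulty_group_a, difficulty_group_b):
    """Difference in logits between the item difficulties of two groups."""
    return difficulty_group_a - difficulty_group_b

def flag_dif(contrasts, cutoff=DIF_CUTOFF):
    """Items whose absolute DIF contrast exceeds the cutoff (absolute value is an assumption)."""
    return [item for item, c in contrasts.items() if abs(c) > cutoff]

# Hypothetical item difficulties (logits) estimated separately for girls and boys.
girls = {"item_a": -0.20, "item_b": 0.35, "item_c": 1.10}
boys = {"item_a": -0.05, "item_b": 0.10, "item_c": 0.30}

contrasts = {item: dif_contrast(girls[item], boys[item]) for item in girls}
flagged = flag_dif(contrasts)
```

In this made-up example only the third item would be flagged. In the study itself the largest reported contrast is 0.54, so no item exceeds the cutoff.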

Conclusions and limitations
Based on our exploration of the construct validity of the SLA-MB using IRT-Rasch analysis, the SLA-MB fits a three-dimensional Rasch model consisting of the Value of Science, Self-efficacy and Personal Epistemology dimensions, which corresponds to the content validity analyzed by the SLA-MB developers. Regarding the substantive aspect, the SLA-MB has acceptable internal consistency and MNSQ values indicating that each item fits the Rasch model. Regarding the generalizability aspect, no item of the SLA-MB shows gender bias, so each item has the same difficulty level for female and male students. With these results, we conclude that the Indonesian version of the SLA-MB can be used as an instrument to explore the motivation, values and beliefs of Indonesian students towards science.
Although this study revealed the construct validity of the SLA-MB using IRT-Rasch analysis, not all aspects of construct validity were examined. Further studies are therefore strongly advised to examine the remaining aspects of construct validity, in order to strengthen the validity and legitimacy of the instrument and to fulfil psychometric prerequisites.