Investigating male bias in multiple choice questions: contrasting formative and summative settings

Previous studies have claimed that male advantage may arise from multiple choice question (MCQ) types; we have made a detailed evaluation of this hypothesis, finding limited evidence that female students are disadvantaged by MCQs in summative assessment. Additionally, we find no significant evidence of a gender gap around the use of multiple choice-type questions, including variants such as multiple response questions, in formative assessment. Our findings suggest that the use of an MCQ format is not a significant factor in the gender gap in assessment.

by males in second level physics and astronomy modules (at Level 5 in the Framework for Higher Education Qualifications). Previously, we considered the effect of written exam question scaffolding on this gap (Dawkins et al 2017). However, the use of multiple choice-type questions, alongside others, in both formative and summative assessment in these modules also leaves us well-placed to address the research question of whether there is evidence to support the assertion that the gender gap is exacerbated by the use of these question types.
The Open University offers open access distance learning courses with a substantial online component. A total of 360 credits are required for a degree, and students are encouraged to progress through a defined qualification pathway in order to provide themselves with adequate mathematical preparation prior to attempting the second and third level physical sciences. There are no formal pre-requisites at entrance and our students have a diverse range of educational backgrounds and motivations for their studies. We have a significant number of part-time students who are studying at a later stage of life than in most conventional universities and who contribute to a demographically diverse student population. However, despite these differences, we see a similar gender gap in attainment to that noted more widely within the sector.
In this paper we examine the role of the use of multiple choice questions (MCQs) in this attainment gap, firstly by considering their use in a summative setting. Our 60 credit core physics module at Level 2 (S207) and the two 30 credit astronomy modules (S282 and S283) are the first opportunity for the study of physics or astronomy as individual disciplines and present key topics at an introductory to intermediate level. The physics course covers a substantial fraction of the Core of Physics as defined by the UK Institute of Physics, including material on all the major topics. The astronomy modules form an introduction to the Sun, stars, galaxies and cosmology (S282) and planetary science and astrobiology (S283). Data collected over three presentations of each module (totalling 1270 students on S207 (24% female), 712 students on S282 (30% female) and 601 students on S283 (32% female)) allow us to compare the performances of male and female students in the multiple choice computer-marked section of the exam with those in the constructed response section.
Additionally, we make use of interactive computer-marked assignments (iCMAs), short problems requiring numerical open responses or selected responses (such as multiple choice), that are used in formative assessment in Level 2 physics. In this study we analyse iCMA responses from a total of 1411 students (75% male; 25% female) to identify any gender bias and its variation by question type, to allow us to explore the difference between the use of multiple choice-type questions in formative and summative contexts.

Summative assessment
In table 1 we present the mean percentage scores in the MCQ and written sections of the end of course examination of the male and female students from each of three cohorts of the Level 2 physics and astronomy modules. In the majority of cohorts, we see both males and females achieving higher marks in the MCQ section of the exam. We also see a tendency for the increase in the scores of the males in the MCQ section to be greater than that of the females, and the sixth column of table 1 shows the female score difference subtracted from that of the males. Positive values indicate situations where the male score has increased by more, or decreased by less, than the female one in the multiple choice section. Although this is indicative of possible male bias, it is not conclusive as in both situations bias will be convolved with any difference in ability between the male and female cohorts.
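The sixth-column quantity described above can be written compactly as follows. The notation is introduced here purely for illustration and does not appear in the original: $\bar{x}$ denotes a mean percentage score, the subscripts M and F denote the male and female students of a cohort, and the superscripts denote the exam section.

```latex
\Delta = \left(\bar{x}_{M}^{\mathrm{MCQ}} - \bar{x}_{M}^{\mathrm{written}}\right)
       - \left(\bar{x}_{F}^{\mathrm{MCQ}} - \bar{x}_{F}^{\mathrm{written}}\right)
```

With this definition, $\Delta > 0$ corresponds to the male score increasing by more, or decreasing by less, than the female score in the multiple choice section.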
To give us an indication of the magnitude of this effect, we consider the male and female results from the two summative assessment sections separately. For each section we carry out a Welch's t-test of the null hypothesis that the true mean scores of the male and female students in that section are equal, and report the resulting p-value. These probabilities are presented in the final two columns of table 1. A low probability of the true means of the male and female scores in a section being the same could be caused either by gender bias within the section, or by a difference in ability between the male and female students. Marked differences between the two sections are hence of interest, as these indicate that factors other than ability are at play. We test to a significance level of 0.05, with a Bonferroni correction applied for nine tests so that each cohort is tested at 0.0056. Only S282 in 2013-4 shows a significant difference between the means, in the MCQ section of the assessment. However, the 0.089 probability of the same mean in the written section implies that ability cannot be discounted here. The opposite effect is seen in 2014-5, with p=0.013 in the MCQ section and 0.933 in the written section, which is suggestive of bias in the MCQ section for that particular paper, although not to a level that is statistically significant. Comparison of these figures for all cohorts shows that there is limited evidence of male bias in the MCQ sections of specific examination papers, but provides no support for a consistent male advantage. Overall, the variation between modules and cohorts is notable and no consistent significant effect is seen around the use of MCQs in the summative setting.
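The per-section comparison described above could be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' analysis code: the function names are hypothetical, and the p-value uses a normal approximation to the t distribution, which is an assumption that is only reasonable for samples of the size reported here (hundreds of students per module).

```python
import math
from statistics import NormalDist, mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


def two_sided_p_normal(t):
    """Two-sided p-value, treating t as standard normal (ASSUMPTION:
    large-sample approximation; an exact test would use the t
    distribution with the Welch-Satterthwaite df)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after a Bonferroni correction."""
    return alpha / n_tests
```

Calling `bonferroni_threshold(0.05, 9)` gives the per-test level of roughly 0.0056 quoted above. In practice, a library routine for Welch's test would normally be preferred over the hand-rolled version shown here.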

Formative assessment
Table 1. Differing attainment in MCQ and written sections of summative assessment; the gender difference in the variation between scores in the MCQ and written sections (MCQ-written scores for the females subtracted from MCQ-written scores for the males); probability of male and female scores coming from distributions with the same mean. Scores are given as percentages. The distributions of the mean scores are typically truncated normal, with a standard deviation of around 20. [Table body not recovered.]

We consider now iCMA responses from three Level 2 physics cohorts, overlapping with the summative cohorts (through 2012-3 and 2013-4). (iCMAs are used only in the physics modules and not in astronomy.) In the iCMAs, we identified 15 of the 56 questions as taking multiple choice formats. Of these, eight were multiple response questions (MRQs), four were text-based MCQs, or questions containing such an MCQ element, one was a graph-based MCQ and two were in a true/false choice format. Two of these question types are shown in figure 1 to illustrate the variation beyond the basic MCQs that are used in the summative setting.
Given these subtleties in question presentation, we wished to evaluate whether individual questions demonstrated any significant male bias, while accounting for student ability. Students were divided into strata by ability, defined by their overall performance on the full iCMA question set. We then calculated a Mantel-Haenszel alpha for each question, which estimates the ratio of the odds of success between the groups of the male and female students by evaluating [...] Conversely, a positive value suggests a female bias, with values of $|\alpha^{*}_{\mathrm{MH}}| \geq 1$ deemed to be potentially significant in either respect. Using a chi-squared distribution, each alpha value is also tested against the null hypothesis that the odds ratio is equal to one at each stratum, with an alternative hypothesis that at least one odds ratio is different from unity. A question is deemed to have significant bias if $p < 0.05$, in addition to $|\alpha^{*}_{\mathrm{MH}}| \geq 1$ (Zwick 2012). Applying the Bonferroni correction here would suggest $p < 0.0011$ as appropriate to determine significance.
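The stratified comparison described above can be sketched as follows. This is a minimal illustration of the standard Mantel-Haenszel common odds ratio and its continuity-corrected chi-squared test (one degree of freedom), not necessarily the authors' exact procedure. Two points are assumptions made here: that the starred alpha corresponds to the ETS delta scale, $-2.35 \ln \alpha_{\mathrm{MH}}$, chosen because it is positive when the female group is favoured, and that male students form the reference group. The function name and the tuple layout are hypothetical.

```python
import math


def mantel_haenszel(strata):
    """Mantel-Haenszel statistics for one question.

    strata: list of (male_correct, male_wrong, female_correct,
    female_wrong) counts, one tuple per ability stratum.
    Returns (alpha, delta, chi2, p).
    Raises ZeroDivisionError if the strata carry no usable information.
    """
    num = den = 0.0
    sum_a = sum_expected = sum_var = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue  # stratum too small to contribute
        # Terms of the common odds ratio estimator.
        num += a * d / n
        den += b * c / n
        # Observed, expected and variance of male-correct count
        # under the null hypothesis of no association.
        n_male, n_female = a + b, c + d
        n_correct, n_wrong = a + c, b + d
        sum_a += a
        sum_expected += n_male * n_correct / n
        sum_var += n_male * n_female * n_correct * n_wrong / (n * n * (n - 1))
    alpha = num / den
    # ASSUMPTION: ETS delta scale; negative values favour the male group.
    delta = -2.35 * math.log(alpha)
    # Continuity-corrected Mantel-Haenszel chi-squared statistic.
    chi2 = max(0.0, abs(sum_a - sum_expected) - 0.5) ** 2 / sum_var
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return alpha, delta, chi2, p
```

Under these assumptions, a question would be flagged only when both the magnitude of `delta` is at least 1 and `p` falls below the chosen threshold, mirroring the two-part criterion in the text.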
Of the 15 questions in multiple choice formats, none are observed to have a significant male or female bias. Interestingly, the two questions with the lowest values of p (0.02 and 0.04) demonstrate a slight female bias (with corresponding $\alpha^{*}_{\mathrm{MH}}$ values of 0.14 and 0.39). Both were MRQs, one on the topic of mechanics (illustrated in figure 1(a)) and the other, quantum physics. The next lowest (p=0.06), showed a stronger female bias ($\alpha^{*}_{\mathrm{MH}} = 1.5$) in one cohort; this question contained a substantial MCQ element and covered aspects of thermodynamics. The full set of $\alpha^{*}_{\mathrm{MH}}$ and p values is shown in table 2. It is particularly interesting to note that the S207 2012-3 and 2013-4 cohorts show no significant evidence of bias, in parallel with the behaviour they demonstrated in the summative setting. In total, we find that there is no evidence to suggest a female disadvantage owing to multiple choice-type questions within a formative setting.

Future work
Our finding of no significant male bias around the use of multiple choice-type questions in either a formative or summative setting is not in agreement with a number of other studies concerning the use of MCQs in physics assessment. Hazel et al (1997) suggest an MCQ attainment gap in physics and call for the use of the question type only in a diagnostic setting where common misconceptions can be employed as distractors. More recently, Wilson et al (2016) continue to note a gender gap around the use of MCQs in the competitive setting of [...]
In the formative data, we found a suggestion of occasional female advantage. As we described, and illustrated in figure 1, the iCMA questions are not conventional MCQs, but can involve reading a reasonable quantity of text, which has been noted to favour female students and, for example, is one of the factors used by Wilson et al (2016) in the analysis of their MCQ question set. The structure of our formative assessment also permits multiple attempts, with limited feedback. Whilst not all students wish to engage with the questions in this way, its availability may provide the student with a more positive experience of these question types than if they had only ever encountered them in a competitive, summative setting, and could hence be related to the idea of gender difference reflecting cumulative effects. Exploring this potential connection to the student experience and the students' wider background is of interest for future work.

Conclusions
We find equivocal evidence of an increased gain for males over females when MCQs are used in summative assessment. Our findings highlight the need for careful use of this question type in examinations but do not support the view that it is intrinsically problematic. When used in formative assessment, we found no evidence of male bias across a variety of MCQ formats and topics in physics covering mechanics, optics and electromagnetism, thermodynamics, quantum mechanics and solid-state physics.