This guest blog has been kindly provided by Dr Dennis Sherwood of Silver Bullet Machine, an intelligent innovation consultancy, who was a speaker at the first of this year’s Recruitment & Admission Forum series of webcasts.
Calling all engineers!
Engineers love solving problems, and are very good at it. So this blog poses a real problem – a problem that has eluded solution for at least a decade, and one that does much damage every year. You are invited to think of a solution – or indeed more than one – and post your thoughts either in the comments on this page or in the thread on the Engineering Academics Network page on LinkedIn.
The problem – the Great Grading Scandal
Every year, about 6 million GCSE, AS and A level grades are awarded in England. And every year, about 1.5 million of those grades are wrong – about half too high, half too low. That’s, on average, 1 wrong grade in every 4. In this context, “wrong” means “the originally-awarded grade would be changed if the script were to be re-marked by a senior examiner, whose mark, and hence grade, is deemed by Ofqual, the exam regulator, to be ‘definitive’” – or, in more everyday language, ‘right’.
But when a student is informed “Physics, Grade B”, the student is more likely to think “Oh dear, I didn’t do as well as I had hoped” than “the system got it wrong – the grade should have been an A”. So relatively few grades are challenged: in 2019 in England, for example, there were 343,905 appeals, resulting in 69,760 grade changes – when in fact, as I have just mentioned, about 1.5 million grades were wrong. Exam grades are therefore highly unreliable, but very few people know it. That’s what I call the “Great Grading Scandal”.
The evidence – Ofqual’s research
Ofqual’s November 2018 report, Marking Consistency Metrics – An update, presents the results of a study in which whole cohorts of GCSE, AS and A level scripts, in each of 14 subjects, were marked twice, once by an ordinary examiner and once by a senior examiner. For each subject, Ofqual could then determine the percentage of the originally-awarded grades for each subject that were confirmed by a senior examiner, so determining a measure of the reliability of that subject’s grades. Since this research involved whole cohorts, the results are unbiased – unlike studies based on appeals, which tend to be associated with scripts marked just below grade boundaries.
If grades were fully reliable, 100% of the scripts in each subject would have their original grades confirmed. In fact, Ofqual’s results ranged from 96% for Maths to 52% for the combined A level in English Language and Literature. Physics grades are about 88% reliable; Economics, about 74%; Geography, 65%; History, 56%. The statement “1 grade in 4 is wrong” is an average, and masks the variability by subject, and also by mark within subject (in all subjects, any script marked at or very close to a grade boundary has a probability of about 50% of being right – or indeed wrong).
The cause – “fuzzy” marks
Why are there so many erroneous grades? Not because of “sloppy marking”, although that does not help. The cause is a concept familiar to every engineer reading this: measurement uncertainty. Except for the most narrowly defined questions, one examiner might give a script 64, and another 66. Neither examiner has made a mistake; both marks are legitimate. We all know that.
In general, a script marked m is a sample from a population in the range m ± f, where f is the measure of the subject’s “fuzziness” – a measure that, unsurprisingly, varies by subject, with Maths having a smaller value for f, and History a larger value.
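The effect of this fuzziness on grades can be illustrated with a short simulation. The sketch below is purely illustrative: the grade boundaries are invented, and re-marks are modelled as drawn uniformly from m ± f, which is a simplifying assumption rather than Ofqual’s actual mark distribution.

```python
import random

# Invented grade boundaries, for illustration only: a mark at or
# above the threshold earns that grade.
BOUNDARIES = [("A", 70), ("B", 60), ("C", 50), ("D", 40)]

def grade(mark):
    for g, threshold in BOUNDARIES:
        if mark >= threshold:
            return g
    return "U"

def p_grade_change(m, f, trials=100_000):
    """Estimate the probability that a re-mark, drawn uniformly from
    m - f .. m + f, lands in a different grade from the original mark m."""
    original = grade(m)
    changes = sum(grade(random.randint(m - f, m + f)) != original
                  for _ in range(trials))
    return changes / trials

random.seed(1)
# A script marked exactly on a boundary: close to a coin flip.
print(p_grade_change(60, 5))
# A script marked mid-band: every legitimate re-mark gives the same grade.
print(p_grade_change(65, 3))
```

With these invented numbers, the boundary script’s grade changes on roughly 5 re-marks in 11 (about 45%), while the mid-band script’s grade never changes – which is why reliability varies so sharply with a mark’s distance from a grade boundary.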
Ofqual’s current policies
This fundamental fact is not recognised by Ofqual. Their policy for determining grades – a policy that is current and has been in place for years – is to map the mark m given to a script by the original examiner onto a pre-determined grade scale. And their policy for appeals is that if a script is re-marked m*, then the originally awarded grade is changed if m* corresponds to a grade different from that determined by the original mark m.
Ofqual policies therefore assume that the originally-given mark m and the re-mark m* are precise measurements. In fact, they are not. That’s the problem.
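The contrast between the two views can be made concrete. In the hypothetical sketch below (again with invented boundaries), the current policy collapses a script to the single grade implied by m, whereas the m ± f view yields the set of grades that legitimate re-marks could produce.

```python
# Invented grade boundaries, for illustration only.
BOUNDARIES = [("A", 70), ("B", 60), ("C", 50), ("D", 40)]

def grade(mark):
    # Current policy: the mark m is treated as exact and maps to one grade.
    for g, threshold in BOUNDARIES:
        if mark >= threshold:
            return g
    return "U"

def possible_grades(m, f):
    # Fuzzy view: every grade reachable by a legitimate re-mark in m ± f.
    return sorted({grade(x) for x in range(m - f, m + f + 1)})

print(grade(64))               # prints B: the one grade the current policy awards
print(possible_grades(64, 5))  # prints ['B', 'C']: what fair re-marks could give
```

Any appeal policy that treats a re-mark m* as decisive implicitly assumes that this set always has exactly one member; the challenge below is, in essence, to design a policy that remains reliable when it does not.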
Your challenge is to identify as many alternatives as you can for one or both of these policies such that your solutions:
- recognise that the original mark m is not a precise measurement, but rather of the form m ± f, where the fuzziness f is a constant for each subject (and not dependent, for example, on the mark m, and which, for the purposes of this challenge, is assumed to be known), and
- result in assessments, as shown on candidates’ certificates, that have a high probability (approaching 100%) of being confirmed, not changed, as the result of a fair re-mark m*, thereby ensuring that the first-awarded assessment is reliable.
Genuinely, we want to hear your thoughts either in the comments on this page or in the thread on the Engineering Academics Network page on LinkedIn.
Click here for more details about the forthcoming webcasts in the EPC Recruitment and Admissions Forum Series and to book your place.