Confidence Scores for Exam Questions

Guessing correctly doesn't necessarily indicate knowledge.


The Problem With Guessing

Neither multiple-choice exams (MCEs) nor free-response exams (FREs) necessarily show how confident the student is in their answers; they just show the final answer or the thought process, respectively. That means a student can guess and get it right, even if they don't know for sure that the answer or thought process is correct!

Take MCEs. For a four-choice question, the student could improve their odds of guessing correctly from 25% to 33% or 50% just by eliminating one or two answer choices. While this demonstrates some knowledge about what the answer isn't, it isn't necessarily proof that they know what it is, even if they choose correctly.

Or FREs. A student may have only some idea of what the question is asking, take a guess at which formula or process to apply, and still get it correct. (This is much less likely than on an MCE, but still possible, based on personal experience.)

This is unfair to the students who truly know the answer but get scored the same as the guesser. What would happen if students had to note down how confident they are that they have the correct answer?


Brier Scores for Exams

Introducing Brier scores (BS). The formula for a BS (as related to exams) is:

\[BS = \frac{1}{N} \sum_{t=1}^{N} \left(p_{t} - o_{t}\right)^{2}\]

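where \(N\) is the number of questions, \(p_{t}\) is the student's stated confidence (a probability from 0 to 1) that their answer to question \(t\) is correct, and \(o_{t}\) is the outcome for question \(t\): 1 if the answer was correct and 0 if it was not.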

A perfect Brier score is 0, indicating that all predictions were correct, since 1-1 = 0 ("I think I got it right and I did get it right") and 0-0 = 0 ("I think I got it wrong and I did get it wrong"). One issue is that if the Brier score were all that mattered, exams could easily be gamed by intentionally giving a wrong answer and confidently predicting that it is wrong, which is generally easier than getting the right answer.

The question becomes: how are correct, confident answers rewarded?
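To make the gaming problem concrete, here is a minimal sketch of the formula in Python. The function name `brier_score` and the sample answers are my own illustrations, not part of any existing grading tool.

```python
def brier_score(confidences, outcomes):
    """Mean squared difference between stated confidence and actual outcome.

    confidences: probabilities in [0, 1], one per question (p_t)
    outcomes:    1 if the answer was correct, 0 if not (o_t)
    """
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(confidences, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Honest student: fully confident and correct on all three questions.
honest = brier_score([1.0, 1.0, 1.0], [1, 1, 1])

# Gaming student: deliberately wrong on all three, and says so (0% confidence).
gamer = brier_score([0.0, 0.0, 0.0], [0, 0, 0])

print(honest, gamer)  # both are 0.0, so the raw score cannot tell them apart
```

Both students end up with a perfect Brier score even though one of them answered nothing correctly.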


Implementation

Predictions must be at least 0.5 (50%) because the goal is to get the answer correct. If the student is less than 50% confident in their answer, they should either change it to one they are more confident in or be penalized accordingly. With this floor, a deliberately wrong answer costs at least (0.5 - 0)^2 = 0.25, so intentionally answering wrong no longer earns a perfect score. Each question will therefore have two sections: section A, for the actual answer choices, and section B, for the student's confidence level in their chosen answer (50%, 60%, 70%, 80%, 90%, or 100%).

Scantron forms or tests can be modified to have two sets of answers per question so that the BS can be calculated for that question. The total BS is then calculated to give the student their score, with the lowest score being the best (alternatively, the BS can be inverted so that the highest score wins).
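As a rough sketch of how such a two-section answer sheet could be scored: the names `score_exam` and `ALLOWED_CONFIDENCES`, the sample answer key, and the choice of `1 - BS` as the "inverted" score are all assumptions of mine, since the scheme above doesn't pin down how the inversion works.

```python
# Section B options, as fractions. Anything else is rejected.
ALLOWED_CONFIDENCES = (0.5, 0.6, 0.7, 0.8, 0.9, 1.0)


def score_exam(responses, answer_key):
    """Return (brier, inverted) for one student's exam.

    responses:  list of (chosen_answer, confidence) pairs, one per question
    answer_key: list of correct answers, in the same order
    """
    if len(responses) != len(answer_key) or not answer_key:
        raise ValueError("responses and answer key must match in length")
    total = 0.0
    for (choice, confidence), correct in zip(responses, answer_key):
        if confidence not in ALLOWED_CONFIDENCES:
            raise ValueError(f"confidence {confidence} is not an allowed option")
        outcome = 1 if choice == correct else 0
        total += (confidence - outcome) ** 2
    brier = total / len(answer_key)
    # Assumed inversion: 1 - BS, so that higher is better.
    return brier, 1 - brier


answer_key = ["B", "D", "A", "C"]
responses = [("B", 1.0), ("D", 0.5), ("C", 0.5), ("C", 1.0)]
brier, inverted = score_exam(responses, answer_key)
print(brier, inverted)  # 0.125 0.875
```

In this example the student gets three of four questions right, but the hesitant correct answer and the hesitant wrong answer each cost 0.25 on their question.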


Benefits

Exams that sort people by skill should rarely have perfect scores because the signal gets lost, at least to some extent. One person getting a perfect LSAT or perfect SAT signals that person is incredibly intelligent; 1% of the test-taking population getting perfect scores indicates they're pretty smart and the test is too easy.

Adding confidence scores lets further sorting happen. If Ernest, Jameson, Douglas, Alistair, Tannatt, and Raymond all get question X correct, but their confidences are 50-60-70-80-90-100%, respectively, then their scores will be 0.25, 0.16, 0.09, 0.04, 0.01, and 0.00, indicating Raymond is top dog amongst his peers. This is done without needing to make the test more difficult, but simply by figuring out how correct each person really is.
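The arithmetic for question X can be checked in a few lines. This is just an illustration using the names and confidences from the example; each per-question score is (confidence - 1)^2 because everyone answered correctly.

```python
# Stated confidence for question X, which all six students answered correctly.
confidences = {
    "Ernest": 0.5,
    "Jameson": 0.6,
    "Douglas": 0.7,
    "Alistair": 0.8,
    "Tannatt": 0.9,
    "Raymond": 1.0,
}

# Outcome is 1 for everyone, so the per-question score is (p - 1)^2.
# Rounded to two places to hide floating-point noise.
scores = {name: round((p - 1) ** 2, 2) for name, p in confidences.items()}

# Lower is better, so sort ascending to get the ranking.
ranking = sorted(scores, key=scores.get)

print(scores)
print(ranking[0])  # Raymond
```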

Combine this with a more challenging test and the stars will really shine.


Literature

After writing all of the above, I found that literature already exists under the name "confidence-based marking". There truly is very little that's new left under the sun.

A.R. Gardner-Medwin discusses it in their Confidence-Based Marking - towards deeper learning and better exams, whose abstract states:

... A critical point is that they [students] benefit either by finding reasons to place greater reliance on an answer or by seeing reasons for reservation. This places a premium on careful thinking, and on checks and the tying together of different facets of knowledge, thereby encouraging deeper learning. In exams it generates higher quality data than conventional scores, with greater statistical reliability and validity as a measure of knowledge, and less contamination from chance factors associated with weak and uncertain knowledge. The puzzle remains, why this seemingly sensible strategy for objectively marked tests is so readily embraced by students and yet so little used by teachers.

I'd guess that high performers are fans because it helps distinguish them from the low performers, and low performers aren't fans because they are more likely to get worse scores than in the binary correct-incorrect scoring system.

I won't go through every study I can find, but Gardner-Medwin's paper has some good references.

