Writing Exam Questions I

September 30, 2021 Martin Jones

Practical advice for writing software development exams that are fit for purpose.

I’ve written many exam papers and marked many, many more. This post and the ones that follow contain suggestions that I follow when writing exams. I show examples of how to fix poor form and provide a checklist for good exam questions.

TL;DR: Here’s the checklist.

These posts represent my own practical experience setting exam questions for software developers. Although formal research has been conducted, this is most definitely not it! Instead, this is a view from the trenches.

These are some of the biggest posts I have written, so I have broken them into three sections and a checklist:

Fit for Purpose (the contents of this post) - does the question meet our assessment needs
Phrasing - does the writing and structure allow the exam taker to focus on creating an answer
Weighting and Marking - are the questions weighted fairly and transparently, in a way that does not hinder a marker
Exam Question Writing Checklist - a handy checklist of all the points to consider

Fit for Purpose

An exam, and its questions, must be fit for purpose. That, of course, raises the question of the exam’s purpose. It could be purely educational, e.g. highlighting a taker’s weaknesses so that the taker may address them. It might be part of ongoing assessments that periodically encourage the taker to prepare and understand what was taught. It might also be used to assess the quality of instruction rather than the taker’s ability. And, as a last example, it might serve no direct educational purpose but is instead a trial to be completed to demonstrate competence.

Regrettably, one does not have to look far to find exams that are not fit for purpose. Or at least, not the purpose that I would assume. Let’s provide two real-world examples from the DMVs of New York and New Jersey.

I took my NY driving test late in life, but I had held a British licence for many years prior. I could have exchanged my British licence for a NY one, but I opted to return it to the DVLA and start a fresh licence in NY, taking both the practical test and written test. Example questions for the written test were available from New York and third parties for a price. When I took the written portion (not a lot of actual writing was required) I was shocked to find the questions were identical to the examples!

This means that anyone that knows this (and apparently it was common knowledge) can pass the test merely by memorizing the example questions. Memorizing is clearly a skill, but does it align with the purpose of the exam? No, well, at least not to my mind. The exam should be a measure of driving competence, not memory competence. The stakes are high here; the driving incompetent should not be rewarded with a licence.

The NY written driving test appeared to me to be unreliable - if my experience wasn’t a fluke, then people can pass by memorizing a small subset of questions rather than the larger set of facts and rules they’re expected to know as drivers. A reliable exam strongly correlates ability with the score and will allow for a wide spread between competent and incompetent takers.

For the second example, we’ll consider New Jersey’s knowledge test. This is not a test I have taken myself but helping a friend from Europe prepare allowed me to experience it vicariously. My friend, with several safe years of driving in Europe under her belt, failed. She was upset and furious. Livid, in fact. Was the NJ knowledge test harder than the tests she had taken in Europe? No. From her experience, it’s an invalid test. She failed because of multiple questions on irrelevant details, something like ‘at what age can you get licence X’.

A valid exam must focus on the relevant.

Exams that we consider fit for purpose are both valid and reliable. In addition they need to be well phrased and well weighted. I found out, after deciding on the terms valid and reliable, that they are more formally defined for very similar purposes!

Returning to our examples, could they be fit for purpose, and I have just misunderstood their purpose? Yes, absolutely. For the NY test, the ease of administration at the DMV building might be an alternative measure of fitness. In addition, it might be politically considered ‘fitter’ to allow more people to drive in support of the economy whilst being able to point to an exam that exists but is rarely questioned by people once they have passed.

For the NJ knowledge test, it’s possible the DMV sees the exam as partly educational to ensure that new drivers (predominantly young I assume) know the rules around licensing and are encouraged to commit that to memory. Personally, I strongly feel that should have happened earlier, and that driving tests should only be about demonstrating competence in, you know, safe driving. A slightly less generous interpretation is that they have attempted to make it more reliable but have used the wrong measure - number of passes/failures rather than correlation of pass/fail to competence. Irrelevant questions can give the illusion of making the test harder, and maybe they had been asked to toughen it up. This might make a graph of test number vs percentage passing look as expected, e.g. 50% of first-time takers pass, 90% of second-time takers pass, etc. That’s not the right measurement, though. ~100% of good, competent drivers should pass first time. Bad drivers ideally would never pass.
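To make the difference between those two measures concrete, here is a minimal C++ sketch. Everything in it is hypothetical: the `Taker` struct, the two function names, and the four-person data sets are invented for illustration and are not real DMV figures.

```cpp
#include <array>
#include <cstddef>

// Hypothetical record of one test taker: whether they are actually a
// competent driver, and whether they passed the written test.
struct Taker {
    bool competent;
    bool passed;
};

// Fraction of all takers who passed. This is the "number of passes/failures"
// measure, which says nothing about who passed.
template <std::size_t N>
constexpr double pass_rate(const std::array<Taker, N>& takers) {
    static_assert(N > 0, "need at least one taker");
    std::size_t passed = 0;
    for (std::size_t i = 0; i < N; ++i) {
        if (takers[i].passed) ++passed;
    }
    return static_cast<double>(passed) / N;
}

// Fraction of takers whose result matches their competence
// (competent and passed, or not competent and failed).
template <std::size_t N>
constexpr double agreement_rate(const std::array<Taker, N>& takers) {
    static_assert(N > 0, "need at least one taker");
    std::size_t agree = 0;
    for (std::size_t i = 0; i < N; ++i) {
        if (takers[i].competent == takers[i].passed) ++agree;
    }
    return static_cast<double>(agree) / N;
}

// Invented data: a test of relevant questions, where results track competence...
constexpr std::array<Taker, 4> relevant_test{{
    {true, true}, {true, true}, {false, false}, {false, false}}};

// ...and a test padded with irrelevant questions, where they do not.
constexpr std::array<Taker, 4> irrelevant_test{{
    {true, true}, {true, false}, {false, true}, {false, false}}};
```

With this invented data, both tests report a 50% pass rate, yet only the first agrees with competence for every taker (an agreement rate of 1.0 versus 0.5). The pass-rate graph alone cannot tell them apart.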

A good exam and good questions are ‘fit for purpose’. The purpose needs to be agreed upon and communicated. My disappointment with NJ and NY can potentially be explained by a difference in perceived purpose. The following points demonstrate how to improve the general fitness of exam questions.

FFP1: Test Application, not Recall

Testing knowledge by assessing fact recall is not a good measure of a software developer’s skill. A developer’s job is one of application, creativity, and problem-solving. By contrast, in some other disciplines, fact recall is critical. A nurse incorrectly recalling a medicine dosage could have fatal consequences.

So, even if the purpose of the exam is to see if a taker has knowledge of X, I believe it is better to test the existence of that knowledge through application.

Consider the following C# exam question:

Describe when you would implement an interface implicitly versus explicitly.

wrong

This question is valid but unreliable - it can be answered purely by recall rather than application or problem-solving. This also assumes that the exam is being taken in a closed-book setting. Given that COVID-19 had forced most training to be remote at the time of writing, closed-book exams are not a viable option without invasive and unreliable proctoring software.

Let’s turn this recall question into an application question with minimal changes.

The GEM library within your current project provides the IGEMControlled interface. Most of the other classes within the project are expected to implement this interface. The functions of the interface are only called from within the GEM library.

Would you recommend implementing this interface explicitly? Justify your answer.

correct

Any taker that has understood the distinctions between implicit and explicit implementation should be able to deduce that, because the functions should not be part of the public interface of the implementing type, explicit implementation is the correct choice. Just stating ‘implicit’ or ‘explicit’ is a coin flip and not a reliable measure of ability. Hence we need the taker to justify their choice. We could even award all the marks to the justification, but we would need to make that clear to the taker.

A taker using reference materials is now no longer a concern. In fact, let’s just make it open book because the answer cannot be found directly in the course text, reference materials, or through an online search. If searching for this question yields a result, then it should be retired immediately. Such a search result would demonstrate that it has been shared. It might even provide a clue as to who leaked the question.

Should the GEM library be considered a distraction or a red herring? It’s neither; it’s useful because it makes the question more readable by being concrete. It also requires the taker to determine which facts are relevant without overloading them.

FFP2: Tie Each Question Back to Learning Point(s)

As a sanity check, every question should be tied back to a learning point or points. This might be as simple as a statement in the marker’s scheme. For our C# question above, the question’s purpose is trivial to see: know when to use explicit interface implementation.

Tying a question to a learning point or points forces us to check that it is relevant for this course, and hence valid. Its presence in the marking scheme allows a marker to see the intended purpose, thus permitting them to assign marks according to purpose. Of course, all of the marking must remain consistent across all takers and all markers.

FFP3: Tie Each Question Back to the Lab(s) that Exercises Learning Point(s)

As well as tying it back to a learning point in the material, we should also check that there is a lab that has covered that learning point so that we know the taker has had a chance to apply this knowledge. Make sure that the question isn’t merely a rephrasing of what we have in the lab!

In an ideal world, the topics for a course are determined early (it would be wishful thinking to have learning points this early, though). Once topics have been determined, then course material, labs, and exams should ideally be written by different people. Realistically, it’s likely to be the same person, but we should still try to maintain that separation. The process of tying the course material, labs, and exams to learning points highlights missed learning points and shows whether the material is too general or too specific.

FFP4: Fair Distribution, not Bell Curve

Reliable exams use the full range of marks; 0% - 100% should be available to the taker. Don’t make 100% artificially hard to achieve, and don’t make the passing mark artificially easy to achieve. A reliable exam should allow us to determine the relative abilities of the takers. We probably don’t need or want to rank them, but takers need to see that the exam is fair and marks represent their competence.

Bell curves, or expected mark distributions, are useful for determining discrepancies but not fixing them. A bell curve used to ‘fix’ results by moving them closer to an expected distribution is problematic at least and may be a sign of academic corruption. It is, of course, desirable for different papers to be considered of the same level, but fitting to a bell curve does not consider differences between the students and the instructors.

FFP5: Reliable Questions Widen the Spread of Marks

Some exam questions are, by their nature, less reliable. True/false questions are unreliable because takers have a 50% chance of getting the correct answer by guessing alone. Even putting a large number of binary questions together can be an issue if they’re correlated; a good exam taker will rightly use that to their advantage. This is likely to lessen the spread between marks, possibly meaning that we have to lower the passing mark and depriving takers of a strong connection between their mark and competence.

Multiple-choice questions provide more possibilities than a binary question meaning that there is less chance that guessing will be beneficial, but it’s hard to make them as reliable as a written question. Ease of marking is a valid consideration, though.
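To put rough numbers on the effect of guessing, here is a small C++ sketch of the binomial arithmetic. It is a simplification under stated assumptions: every question is guessed independently, every option is equally likely to be picked, and the function name `guess_pass_probability` and its parameters are my own invention. Correlated questions, as mentioned above, would break the independence assumption.

```cpp
// Probability of getting at least `pass_mark` of `questions` right by pure
// guessing, when each question has `options` equally likely choices.
// Assumes independent questions; all names here are illustrative.
constexpr double guess_pass_probability(int questions, int options, int pass_mark) {
    const double p = 1.0 / options;  // chance of guessing a single question
    double total = 0.0;
    for (int k = pass_mark; k <= questions; ++k) {
        // Binomial coefficient C(questions, k), built up multiplicatively.
        double ways = 1.0;
        for (int i = 1; i <= k; ++i) {
            ways = ways * (questions - k + i) / i;
        }
        // Probability of exactly k correct guesses: C(n, k) * p^k * (1 - p)^(n - k).
        double prob = ways;
        for (int i = 0; i < k; ++i) prob *= p;
        for (int i = 0; i < questions - k; ++i) prob *= (1.0 - p);
        total += prob;
    }
    return total;
}
```

Under those assumptions, `guess_pass_probability(10, 2, 6)` comes to about 0.377, so more than a third of pure guessers would reach 60% on ten true/false questions. With four options per question, `guess_pass_probability(10, 4, 6)` drops to about 0.02.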

FFP6: Be Ready to Accept New Answers

Written answers permit flexibility even to the point where a taker may have an original answer that would score poorly by the marking scheme but is well reasoned enough that the marking scheme should be disregarded.

This is a major downside to multiple-choice questions. And, because a taker cannot challenge any of the answers, the writer must ensure that one of the possibilities is unambiguously a better answer than all the others. Writing good multiple-choice questions is hard!

FFP7: No Trick Questions

Challenging questions can be great. Misleading (trick) questions are not. It’s the exam writer’s job to provide a way for a taker to demonstrate what they know. As well as being unnecessarily antagonistic, trick questions are less reliable.

FFP8: Review with Fresh and Contrary Eyes

You should have a colleague read your exam questions with contrary eyes, exploring different ways of misinterpreting instructions and thinking of ways of answering which are not in the spirit of the question. Unlike the exam itself, where trick questions have no place, this is where being adversarial is okay.

If you can’t have someone else review, then make sure you do a review a day or two after writing the exam question.

FFP9: Focus on Relevant Information

Only use irrelevant information if it is useful, as in the C# exam question above. The GEM library was not relevant to the question other than it made the question more concrete and therefore easier to understand. Consider this to be exam writing’s Chekhov’s gun.

Some exam questions have multiple distracting pieces of information. There could be a valid reason for this, such as it really being a question about determining what is relevant. But, aside from a case like that, an attempt to bamboozle leads to a less reliable exam question.

FFP10: All of the Above, None of the Above, Neither of Those

Don’t bother with ‘all of the above’ and ‘none of the above’ options. Consider the exam question below:

In C++, which of the following things can be const?
A. member functions
B. member variables
C. function parameters
D. all of A, B, C
E. none of A, B, C

wrong

Let’s say that a taker is pretty sure that A and B are true but is uncertain about C. The taker cannot select both A and B - only one option may be selected. Therefore, the answer must be D. They came to the correct solution by good exam technique rather than by knowing anything about C, which is, admittedly, satisfying for the taker.

The corrected question is presented below:

Select which of the following things in C++ can be const:
A. member functions
B. member variables
C. function parameters

correct

The taker has three binary choices and no logic or correlations to ponder. The earlier version had five possibilities with a bias towards ‘all of the above’, and the improved version has eight possibilities with no bias. The rewritten question is more reliable.
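For anyone checking the answer key, here is a minimal C++ sketch showing that all three things can indeed be const. The `Counter` class and the `twice` function are made up purely for illustration.

```cpp
// Hypothetical class used only to illustrate the three cases.
class Counter {
public:
    explicit Counter(int step) : step_(step) {}

    // const member function: callable on a const Counter and unable to
    // modify the object's members.
    int next_after(int current) const { return current + step_; }

private:
    // const member variable: set once in the constructor, never reassigned.
    const int step_;
};

// const function parameter: `value` cannot be modified inside the function.
int twice(const int value) { return value * 2; }
```

A marker could go one step further and confirm that assigning to `step_` or to `value` fails to compile, which is what a taker who really understands const would predict.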

FFP11: Use Current Code

Only use code that would be considered reasonable code. For example, do not test on pre-generic collection types in C#. They haven’t been necessary since 2005 and can be learned readily through the APIs if the test taker understands the concepts.

FFP12: Use a Die if Writing Multiple Choice Questions

Use a genuine random device to decide which option, or options, is correct in multiple-choice questions. People are bad at being random number generators. You can splash out and buy yourself a fancy set of gaming dice.
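If no dice are to hand, a software stand-in is possible. The sketch below is my own substitute for the physical die, not a replacement for it: it uses the standard library’s `std::random_device`, which is usually backed by a non-deterministic source but is allowed to fall back to a pseudo-random generator on some platforms. The function name `roll_correct_option` is invented for illustration.

```cpp
#include <random>

// Picks which option (1..option_count) should hold the correct answer.
// Assumes option_count >= 1. std::random_device stands in for a physical
// die; on some platforms it may not be truly non-deterministic.
int roll_correct_option(int option_count) {
    std::random_device device;
    std::uniform_int_distribution<int> die(1, option_count);
    return die(device);
}
```

For a four-option question, `roll_correct_option(4)` returns a position from 1 to 4, and the correct answer goes in that slot.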

FFP13: No Hint-Laden Review Sessions

Review sessions must not be used to provide hints about what is on the imminent exam. It makes the exam less reliable by unfairly rewarding those that turn up for the review session but have not studied all topics.

While hints cannot be given about what is on the actual exam, stressing the most important concepts, and hence what is most likely to be on any exam, is okay. This should not be new information because time spent on a concept should be proportional to its importance and exam appearance likelihood.

The easiest way of not being able to provide hints is not to know the exam questions when doing the review!

There may be situations where this suggestion has to be relaxed a little. Consider multiple instructors teaching separate groups of people taking the same exam. If you realize that you devoted far less time to a topic that will be on the exam than the other instructors, then you may wish to redress the imbalance. There are at least two ways of doing this. Firstly, you can simply say that other instructors devoted more time to an important topic, and in the review you want to make it clear that it is important. The other way is to let a Q&A session ‘naturally’ end up at a question which will allow you to spend some time on the topic. The second way is more useful if the topic is not as important. My quotes around ‘naturally’ are there to imply that you, as the instructor, will have nudged the questions in that direction by steering or seeding. Good luck seeding in a remote environment though!

FFP14: Provide Past Papers

Providing past papers makes the exam more reliable because takers know the mechanics of the questions and can focus on exam topic competence.

What if you don’t have past papers? Write them! I have had to do this, and, yes, it was more than double the work. Good exam questions are elusive - you have to hunt for them. If you need twice as many, then it will take a lot longer.

I also recommend releasing the marking scheme, if you’re willing. That requires a good quality marking scheme, which I discuss in the third post on exams. In my experience, marking schemes are rarely of a high enough quality and remain hidden to avoid causing a riot. If you know that the marking scheme will be seen, then that can help encourage quality.