Test scores and knowledge

10 March 2024

Let me introduce you to Andrea, Blaise, Casey, and Desi.

They are students of astro-statistics at the University of the Moon. It’s quite a complex subject matter, but they’re all pretty brilliant pupils.

Space exams

Now, at the University of the Moon they know how to administer tests. They won’t make silly mistakes such as hinting at the answer within the question.

This particular test includes 100 extremely specific and well-crafted True-or-False questions. Each correct answer will score 1 point. Time limit is 2 hours.

Here are the results:

Andrea: 77 pts
Blaise: 56 pts
Casey: 42 pts
Desi: 0 pts

One for them, one for you

After all is said and done, it is my turn to test you.

Who should pass? And why?

I invite you to give it some thought. (Remember: it was a pretty hard test.)

Whenever you’re ready to proceed, you’ll find my reasoning below.

Scores and interpretations

I imagine Andrea would pass in pretty much any scenario. Many people would let Blaise through, and some even Casey, while most would fail poor Desi.

(I don’t actually know, though. I haven’t tested this on anyone. I just woke up this morning with a few ideas in mind, and crafted this post to convey them.)

Well, in any case, don’t be so quick to jump to conclusions!

Here on the Moon, we know all too well that lone numbers often mean jack shit.

But let’s take a few steps back, and start from the beginning.

Terrestrial schools

On Earth, people can be extremely practical. They set a threshold, and let the data alone decide. After all… that’s easy, impartial, and objective. Right?

Maaaaybe.

You see, this process is prone to errors, the most obvious of which is probably the threshold itself. What should it be?

If you’re overly practical, you could think something like: 100 questions, 2 options, if I divide 100 by 2 I get 50. Anything above that is good enough.

But maybe the test was really difficult, so why not bring the threshold down to 40? Or maybe yours is an elite school, so why not raise it to 60 or even 70?

Anyone who studied statistics should know better, though.

Earth science (for starters)

If you set the threshold at 50, Blaise would pass… albeit ranking quite close to a chimpanzee – to paraphrase Hans Rosling, the author of Factfulness.

(In short, this means that even chimps picking random answers are expected to score about 50 points, despite clearly having zero knowledge of the subject.)
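
If you'd like to check the chimp baseline yourself, here is a minimal sketch in Python. It assumes every question is an independent coin flip with a 50% chance of success, which is a simplification, and the helper name `p_at_least` is hypothetical.

```python
from math import comb

N_QUESTIONS = 100  # from the test description
P_GUESS = 0.5      # assumption: a pure guesser is right half the time


def p_at_least(score: int, n: int = N_QUESTIONS, p: float = P_GUESS) -> float:
    """Probability that a pure guesser scores `score` or more out of `n`."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(score, n + 1))


# Mean of a binomial distribution: number of trials times success chance.
expected_score = N_QUESTIONS * P_GUESS
print(f"Expected score of a guesser: {expected_score:.0f}")

for name, score in [("Andrea", 77), ("Blaise", 56), ("Casey", 42)]:
    print(f"{name} ({score} pts): a guesser does at least as well "
          f"with probability {p_at_least(score):.2g}")
```

Under these assumptions, a guesser matches or beats Blaise's 56 roughly 14% of the time, while reaching Andrea's 77 by luck alone is vanishingly unlikely.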

Does that mean that anything below 50 is pure trash?

Well, no. There are many more errors of judgement lying around, and some of them are so well hidden that they manage to evade the numbers entirely.

More data, more insight

Remember it was a hard test. And there was a time limit. Maybe that was on purpose? Maybe the average student was not expected to answer everything?

At this point, we can only be reasonably sure that Andrea gave us irrefutable proof of knowledge: at the very least, they got 77% of the answers right.

This is already noticeably better than a congregation of bonobos and highly smart piglets. But we can’t possibly know by how much, without looking at the test.

For all we know, Andrea could have answered the first 84 questions in sequence before their time ran out, getting slightly over 90% of them right. Not bad!

By following this logic, what if Casey got 42 out of 69 questions? They may still rank as the second-best performer, if Blaise answered everything.

(Casey would score roughly 61%, while Blaise would lag behind at 56%.)
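
The arithmetic behind this scenario can be sketched in a few lines of Python. The numbers of attempted questions (84, 69 and 100) are the hypothetical ones from the scenario above, not known facts, and the `accuracy` helper is made up for illustration.

```python
from typing import Optional


def accuracy(correct: int, attempted: int) -> Optional[float]:
    """Share of attempted questions answered correctly, or None if nothing was attempted."""
    if attempted == 0:
        return None
    return correct / attempted


# (points scored, questions attempted); the attempted counts are hypothetical
papers = {"Andrea": (77, 84), "Blaise": (56, 100), "Casey": (42, 69)}

# Rank by accuracy on attempted questions instead of by raw points.
ranking = sorted(papers, key=lambda name: accuracy(*papers[name]), reverse=True)

for name in ranking:
    correct, attempted = papers[name]
    print(f"{name}: {correct}/{attempted} = {accuracy(correct, attempted):.1%}")
```

Ranking by accuracy rather than by raw points is just one possible reading of the same scores, and it only works if we know how many questions each student attempted.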

And what about Desi? If you think they’re just awful, you might be surprised.

Beyond numbers, beyond logic

In fact, Desi may very well be a rebel genius. They answered each and every question, deliberately picking the wrong answer. And with no mistake at that!

Think about it: to pick the wrong choice, you need to be able to tell the right one. If you make a mistake, you’ll end up with 1 point – ruining your streak.
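
To get a feel for how unlikely a full streak of wrong answers is without any knowledge, here is a rough sketch. It assumes Desi answered all 100 questions and that a guesser's answers are independent 50/50 picks; both are assumptions, not facts from the exam.

```python
N_QUESTIONS = 100
P_GUESS = 0.5  # assumption: independent 50/50 guesses

# Chance of getting every single question wrong purely by guessing.
p_zero_by_luck = (1 - P_GUESS) ** N_QUESTIONS
print(f"P(0 out of {N_QUESTIONS} by guessing) = {p_zero_by_luck:.1e}")

# With only two options, every deliberately wrong answer reveals the right one,
# so flipping each of Desi's answers would turn the score around.
desi_score = 0
flipped_score = N_QUESTIONS - desi_score
print(f"Score with every answer flipped: {flipped_score}")
```

By symmetry, a guesser is exactly as unlikely to score 0 as to score 100.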

So yeah, maybe Desi is our true top of the class, decidedly outperforming even Andrea who went for a more classical approach.

Or maybe…! Maybe Desi felt suddenly sick, and had to hand in a blank paper. 0 out of 0. How do you even rank that?

Lots of maybes. And there are even more.

Casey could be a slow reader, or maybe they have dyslexia, which would put them at an unfair disadvantage (under these particular circumstances).

Blaise could have got 50 answers right before feeling time-pressured to fill in everything, so they tried their luck with extraordinarily poor results.

Andrea may have got the same 50 answers right, then guessed the rest with slightly better than average luck.
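
These two scenarios can be given numbers with a short sketch. It assumes both students truly knew 50 answers and guessed the other 50 as independent coin flips, so Blaise would have hit only 6 of the guesses and Andrea 27. The helper `p_exactly` is hypothetical.

```python
from math import comb


def p_exactly(k: int, n: int, p: float = 0.5) -> float:
    """Probability of exactly k lucky guesses out of n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


KNOWN = 50             # answers assumed to be genuinely known
GUESSED = 100 - KNOWN  # answers assumed to be pure guesses

blaise_lucky = 56 - KNOWN  # guesses Blaise would have got right
andrea_lucky = 77 - KNOWN  # guesses Andrea would have got right

# How likely is it to guess as badly as Blaise, or worse?
p_blaise = sum(p_exactly(k, GUESSED) for k in range(0, blaise_lucky + 1))

# How likely is it to guess as well as Andrea, or better?
p_andrea = sum(p_exactly(k, GUESSED) for k in range(andrea_lucky, GUESSED + 1))

print(f"Blaise: {blaise_lucky}/{GUESSED} lucky guesses or fewer, p = {p_blaise:.2g}")
print(f"Andrea: {andrea_lucky}/{GUESSED} lucky guesses or more, p = {p_andrea:.2g}")
```

Under these assumptions, Blaise's streak of bad luck would be far rarer than Andrea's streak of good luck, which is a hint that at least one of the two stories needs more data before we believe it.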

There’s more than meets the eye

We could go on with several more hypothetical scenarios. We could find more and more hidden pitfalls. Even better, we could question our system entirely.

Is that test the best we can do? Is it clear what exactly we want to assess? Are we able to steer clear of biases and interpret the results correctly?

Do we even need objective rankings all the time?

Maybe Andrea had the potential to reach 92, but they were feeling lazy. Maybe Casey leveraged all their current knowledge and yet couldn’t go past 42.

On the Moon, we carefully consider all of the above – and more. Not because we are flawless, but quite the opposite! We are well aware we’re not perfect.

So we look at the stars, and we hear them whisper: competence is important, we all would like to shine brighter, but love and beauty are what we live for.