When Machines Cheat
How annotation artifacts in benchmark data can distort evaluation results
In my previous post, I described a short experiment I ran to see how well and consistently the LLM-as-a-Judge approach works when calculating the faithfulness metric across different RAG evaluation toolkits. In particular, I wanted to understand how well such toolkits handle three fundamental types of answer–context relationships: entailment, neutrality, and contradiction.
For that, I used SNLI, a well-known benchmark dataset for natural language inference (NLI). The results showed that even when two toolkits used the same judge LLM, their behavior could diverge significantly.
But then, a colleague of mine raised an interesting question: how reliable and trustworthy is SNLI as an evaluation dataset in the first place?
A partial answer to that question can be traced back to 2018, when a team of NLP researchers began to suspect that NLI benchmarks might be leaking clues that allowed AI models to game the evaluation and appear to perform better than they actually did. To test this suspicion, they took two of the most widely used NLI datasets at the time, SNLI and MultiNLI, and did something quite unorthodox: they removed the premise sentences entirely and trained a model to predict entailment, contradiction, or neutrality using only the hypothesis. In theory, such a model should perform no better than random chance.
However, that’s not what happened. When they ran the numbers, they found that on SNLI a simple hypothesis-only classifier correctly predicted the label about 67% of the time, and on MultiNLI it achieved over 50% accuracy, far above the roughly 33% you would expect from random guessing among three labels. In other words, the model could often guess the correct label without ever looking at the premise.
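If you want to get a feel for this kind of test yourself, here is a minimal sketch of a hypothesis-only baseline, assuming scikit-learn and the Hugging Face datasets release of SNLI. It is an illustrative bag-of-words setup, not the exact architecture the researchers used:

```python
# Hypothesis-only baseline for SNLI (illustrative sketch): a bag-of-words
# classifier that never sees the premise.
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

snli = load_dataset("snli")

# Drop examples without a gold label (label == -1 in the SNLI release).
train = snli["train"].filter(lambda ex: ex["label"] != -1)
test = snli["test"].filter(lambda ex: ex["label"] != -1)

# Note: only the hypothesis is used; the premise is ignored entirely.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5)
X_train = vectorizer.fit_transform(train["hypothesis"])
X_test = vectorizer.transform(test["hypothesis"])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train["label"])

print("Hypothesis-only accuracy:", accuracy_score(test["label"], clf.predict(X_test)))
# Anything far above ~0.33 (random chance for three labels) signals that
# the hypotheses alone leak information about the label.
```

Any simple classifier will do here; the point is not the model but the fact that it is blind to the premise.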
So how was that possible?
The researchers dug deeper and found that the datasets contained what they called annotation artifacts, namely hidden statistical patterns introduced unintentionally by the people who wrote and labeled the data. For example, hypotheses containing words like nobody, never, or no were overwhelmingly labeled as contradictions, while hypotheses with abstract or generic terms such as animal, instrument, or outdoors tended to be labeled as entailments, as annotators often generalized from the premise. These patterns created spurious correlations that models could exploit without actually understanding the text.
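One way to surface such artifacts is to count how strongly individual hypothesis words skew toward each label. Below is a rough sketch of that analysis, a simplified version of the smoothed pointwise-mutual-information statistics the researchers reported, again assuming the Hugging Face SNLI release:

```python
# Rough artifact probe: which hypothesis words are most skewed toward one label?
# A simplified take on the word-label association analysis from the paper.
import math
from collections import Counter, defaultdict
from datasets import load_dataset

LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}

train = load_dataset("snli", split="train").filter(lambda ex: ex["label"] != -1)

word_label = defaultdict(Counter)   # word -> counts per label
label_totals = Counter()            # total word tokens per label

for ex in train:
    for w in set(ex["hypothesis"].lower().split()):
        word_label[w][ex["label"]] += 1
        label_totals[ex["label"]] += 1

def pmi(word, label, smoothing=100):
    # PMI-style score with additive smoothing so rare words do not dominate.
    joint = word_label[word][label] + smoothing
    word_total = sum(word_label[word].values()) + smoothing * len(LABELS)
    label_share = label_totals[label] / sum(label_totals.values())
    return math.log((joint / word_total) / label_share)

# Only rank reasonably frequent words.
frequent = [w for w, c in word_label.items() if sum(c.values()) >= 50]

for label_id, name in LABELS.items():
    top = sorted(frequent, key=lambda w: pmi(w, label_id), reverse=True)[:10]
    print(name, "->", top)
```

On artifact-ridden data, the contradiction list tends to fill up with negation words, exactly the kind of pattern the researchers described.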
The team also divided the data into “easy” examples (those their hypothesis-only model could solve) and “hard” ones (those it couldn’t). When they re-evaluated the top-performing NLI models of that time, performance dropped sharply on the hard subset. In other words, what had looked like strong inferential ability turned out, in large part, to be clever pattern matching.
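Reproducing that split is conceptually simple: whatever the hypothesis-only baseline gets right is “easy”, the rest is “hard”, and the premise-aware model is scored on each subset separately. A minimal sketch follows, reusing the clf, X_test, and test names from the baseline sketch above, with full_model_predict as a hypothetical stand-in for whatever full NLI model you want to stress-test:

```python
import numpy as np

# Partition the test set by whether the hypothesis-only baseline solved it.
hyp_only_preds = clf.predict(X_test)
gold = np.array(test["label"])
easy_mask = hyp_only_preds == gold      # artifact-exploitable examples
hard_mask = ~easy_mask                  # examples that require the premise

# full_model_predict is a hypothetical stand-in for any premise-aware NLI model.
full_preds = np.array(full_model_predict(test["premise"], test["hypothesis"]))

print("Full model, easy subset:", (full_preds[easy_mask] == gold[easy_mask]).mean())
print("Full model, hard subset:", (full_preds[hard_mask] == gold[hard_mask]).mean())
# A large gap between the two numbers suggests the model leans on artifacts.
```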
So what does this mean? Well, for my own experiment, it probably means I should repeat it using only the “hard” subset of SNLI. But more broadly, it’s a good reminder that how we build and choose our evaluation data matters just as much as the models and systems we test.
In my upcoming book on AI evaluation, I dedicate considerable space to the topic of evaluation data design, covering the role of both human annotators and AI systems in the process. But if you’ve encountered similar issues in your own work, or have thoughts on how to make evaluation data more reliable, please share them in the comments.
Thank you for reading, and till next time!
Panos
News and Updates
On October 21st and 22nd, 2025, I will be teaching the first edition of my live online course AI Evaluations Bootcamp on the O’Reilly Learning Platform. You can see more details and register here. The course is free if you are already subscribed to the platform, and you can also take advantage of a 10-day free trial period.
On November 20th, I will be giving a masterclass titled “Grounding LLMs on Solid Knowledge: Assessing and Improving Knowledge Graph Quality in GraphRAG Applications” at the Connected Data London 2025 conference. Tickets are available here.
I have already started writing my new book on Evaluating AI Systems, and I am on the hunt for “war stories” on AI evaluation. If you have such stories, use cases, techniques, tools, or lessons you would like to share, I’d love to hear from you.
If you are interested in the field of semantic data modeling and knowledge graphs, my book Semantic Modeling for Data - Avoiding Pitfalls and Breaking Dilemmas remains available at O’Reilly, Amazon, and most major bookstores. Also, if you’ve already read it and have thoughts, I’d really appreciate it if you left a rating or review; it helps others discover the book and join the conversation.


