Breaking Things
What is adversarial evaluation and how to perform it on your AI system
Last week I taught the second edition of my AI Evaluation Bootcamp on the O’Reilly learning platform. The goal of the course is to give people a systematic way of thinking about and executing AI evaluations, understanding what works, what doesn’t, and where things are likely to break.
This time around, more than 1,000 people registered, and over 130 joined the live sessions. The conversations were lively and practical, with lots of questions about trade-offs, failure modes, and what actually happens once models leave the lab.
One session of the course focuses on an evaluation approach called adversarial evaluation, typically used to assess robustness and safety of AI systems. What caught me off guard was how many participants said they had never run an adversarial evaluation before, or had only heard the term without seeing how it applies in practice.
That reaction stuck with me, and it’s what prompted this post.
What is adversarial evaluation
In introductory machine learning courses and tutorials, we are taught a simple and powerful recipe: split the available data into training, validation, and test sets, then use the training data to fit the model, the validation data to tune decisions and compare alternatives, and the test data to estimate how the model will perform once it leaves the lab. This is sound advice, but it relies on a strong assumption, namely that future inputs will be drawn from the same distribution as the data used to train and test the system.
In practice, this assumption rarely holds. Once an ML or AI system is deployed, it is highly likely that it will be exposed to new users, new contexts, and new behaviors. The input distribution changes, sometimes gradually and sometimes abruptly, and when that happens, performance measured on a standard test set can become a poor indicator of how the system actually behaves.
Adversarial evaluation addresses this gap by shifting the goal of evaluation. Instead of estimating performance under typical, average conditions, it focuses on understanding how a system behaves under specific, targeted conditions that are known or suspected to be difficult, high-impact, or both.
When do you need adversarial evaluation
Adversarial evaluation is important for assessing an AI system’s robustness (that is, the system’s resilience to variation, noise, adversarial inputs, or extreme conditions) and safety (that is, how the system behaves in situations where errors could cause harm, escalate risk, or violate policy).
Consider, for example, an LLM-powered customer support assistant whose main task is to read incoming customer messages and produce a structured summary that includes the user’s issue, relevant product, and urgency level. You have evaluated the system using a dataset of historical support tickets and it performs well: accuracy is high, summaries look reasonable, and most errors are minor. But then, during a demo, your manager starts asking questions:
What happens when a message mixes multiple issues in a single request?
How does the system handle long, unstructured complaints filled with irrelevant details?
Does performance degrade for messages written in non-native or heavily broken English?
What about inputs that include emotionally charged language or threats?
Should we worry about requests that implicitly require policy-sensitive handling, such as refunds, legal claims, or safety concerns?
To answer such questions you need to run an adversarial evaluation.
How to perform an adversarial evaluation
There are many ways to run an adversarial evaluation, but most follow the same basic steps.
Step 1: Define the task and system under evaluation
Start by clearly specifying the task and the concrete system you are testing. In the customer support assistant example, the task is structured summarization of support requests, and the system is a specific LLM-based pipeline, including the prompt, model, and post-processing logic.
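To make this concrete, here is a minimal sketch of what the system under evaluation might look like for the support assistant. The names (TicketSummary, summarize_ticket, call_llm) are illustrative placeholders rather than any particular library’s API; the point is that the prompt, the model call, and the post-processing logic are all part of what you are testing.

```python
import json
from dataclasses import dataclass

@dataclass
class TicketSummary:
    issue: str
    product: str
    urgency: str  # e.g. "low", "medium", "high"

PROMPT_TEMPLATE = (
    "Summarize the customer message below as JSON with keys "
    "'issue', 'product', and 'urgency'.\n\nMessage:\n{message}"
)

def summarize_ticket(message: str, call_llm) -> TicketSummary:
    """System under evaluation: prompt + model call + post-processing.

    `call_llm` stands in for whatever model client you actually use;
    it takes a prompt string and returns the model's raw text output.
    """
    raw = call_llm(PROMPT_TEMPLATE.format(message=message))
    fields = json.loads(raw)  # the parsing logic is part of the system too
    return TicketSummary(fields["issue"], fields["product"], fields["urgency"])
```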
Step 2: Specify your adversaries
Identify the input patterns and behaviors you want to test. In the customer support assistant example, these could be structural, such as unusually long or poorly formatted inputs, or semantic, such as ambiguous requests, mixed intents, or language that requires implicit judgment or contextual interpretation. A small sketch of writing these down as named categories follows the list below.
Good sources for adversarial phenomena include:
Known limitations of the model or architecture
Implicit assumptions made by prompts or downstream logic
Historical failure cases or near-misses
Edge cases surfaced or suggested by domain experts or support staff
Scenarios where errors would be particularly costly or risky
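Before collecting any data, it helps to write the adversaries down as named categories with a one-line description each. Here is a sketch for the support-assistant example, loosely based on the manager’s questions above; the category names are illustrative, not a standard taxonomy.

```python
# Illustrative adversary categories for the support-assistant example.
ADVERSARY_CATEGORIES = {
    "mixed_intents": "several distinct issues combined in one message",
    "long_unstructured": "very long complaints padded with irrelevant detail",
    "non_native_english": "heavily broken or non-native phrasing",
    "emotional_or_threatening": "emotionally charged language or threats",
    "policy_sensitive": "requests touching refunds, legal claims, or safety",
    "implicit_urgency": "urgency expressed indirectly rather than stated outright",
}
```

Naming the categories up front also gives you the slices you will break results down by in Step 4.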
Step 3: Acquire evaluation data
Collect or construct examples that exhibit these phenomena. Some may come from real support logs that were previously filtered out or underrepresented. Others can be written by domain experts who intentionally construct difficult cases, while additional examples may be generated synthetically to cover rare but plausible combinations.
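For the synthetic part, one simple pattern is to perturb real or expert-written seed tickets with one transformation per adversary category. The sketch below assumes seed examples stored as dicts with a message and a gold expected summary; the perturbation functions are placeholders you would write for each category (for example, appending irrelevant detail, or merging two tickets into one).

```python
import random

def build_adversarial_set(seed_examples, perturbations, n_per_category=20):
    """Construct adversarial examples by perturbing seed tickets.

    seed_examples: list of dicts like {"message": ..., "expected": {...}}
    perturbations: dict mapping category name -> function(message) -> message
    """
    dataset = []
    for category, perturb in perturbations.items():
        for ex in random.sample(seed_examples, min(n_per_category, len(seed_examples))):
            dataset.append({
                "category": category,
                "message": perturb(ex["message"]),
                # Gold labels carry over from the seed, but review them manually:
                # a perturbation can legitimately change the expected urgency.
                "expected": ex["expected"],
            })
    return dataset
```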
Step 4: Apply the adversarial data and (re-)evaluate
Feed the adversarial inputs through the system and re-evaluate task performance (for example, whether the correct issue, product, and urgency are extracted) both overall and broken down by adversary. The latter is often more informative than aggregate scores as it allows you to see how the system behaves under specific conditions. You might discover, for instance, that the system is particularly vulnerable to long, multi-issue requests or indirect expressions of urgency, while remaining relatively resilient to spelling errors or informal language.
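The per-adversary breakdown can be as simple as grouping results by the category tag attached in the previous step. The sketch below uses exact match on the three extracted fields as a stand-in for whatever task metric you actually use, and assumes predict wraps the system under evaluation and returns a dict with the same keys as the gold labels.

```python
from collections import defaultdict

def evaluate_by_adversary(dataset, predict):
    """Report accuracy overall and per adversary category."""
    totals, correct = defaultdict(int), defaultdict(int)
    for ex in dataset:
        pred = predict(ex["message"])
        ok = all(pred.get(k) == ex["expected"][k]
                 for k in ("issue", "product", "urgency"))
        totals[ex["category"]] += 1
        correct[ex["category"]] += int(ok)
    overall = sum(correct.values()) / max(sum(totals.values()), 1)
    by_category = {c: correct[c] / totals[c] for c in totals}
    return overall, by_category
```

Sorting by_category from worst to best is usually the quickest way to see which adversaries the system is most vulnerable to.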
In a nutshell
Adversarial evaluation gives you a clearer understanding of where your system’s assumptions break and helps surface concrete weaknesses that standard testing tends to miss. Use it to decide whether a system is really ready for deployment, and to identify where additional safeguards or monitoring are required.
How to learn more
If you want to go deeper and get hands-on practice, I’ll be teaching the third edition of the AI Evaluations Bootcamp on the O’Reilly learning platform on March 3–4, 2026. You can see more details and register here. The course is free if you’re already subscribed to the platform; otherwise, you can take advantage of a 10-day free trial.
Separately, next week I’ll be sending out the first chapters of my upcoming book on Evaluating AI Systems for external review. I’m especially interested in feedback from practitioners who work with AI evaluation in real systems: engineers, technical leads, product managers, auditors, and others who regularly have to interpret evaluation results and make decisions based on them. If that sounds like you, and you’d be interested in reviewing one or two chapters and sharing candid feedback (including disagreements or blind spots), feel free to reach out.
Thank you for reading, till next time!
Panos

