Writing the 'Evaluating AI Systems' Book
An update from the trenches and an invitation to review
It’s been a little over three months since my last post, and that’s because I’ve been heads down writing the first chapters of my book on Evaluating AI Systems. As any writer will tell you, what you plan to write and what you actually end up writing can diverge quite a bit once you start wrestling seriously with the material.
For example, one challenge I faced was that some chapters grew far longer than expected (well over forty pages), forcing me to decide whether (and how) to split them without losing coherence. Another recurring challenge was finding the right balance between being thorough and being concise, i.e., giving readers enough context and nuance to do the topic justice without overwhelming them or letting the narrative lose focus.
To be honest, writing this book feels intimidating at times because it pulls me out of my comfort zone of semantic data modeling and knowledge graphs. My previous book, Semantic Modeling for Data, lived in a much more structured and predictable space, at least in my head. AI evaluation, by contrast, is messier, more interdisciplinary, and full of gray areas. That complexity makes it harder to write about with confidence, but it’s also precisely what makes the topic worth tackling.
With that said, I’m happy to share that I’m now finalizing the first full draft of the first five chapters, which I’ll soon send out for external review and feedback. Below is a brief overview of what each of them covers.
Chapter 1: The Importance of AI Evaluation
This opening chapter sets the stage for the whole book by clarifying what we actually mean by an AI system (rather than, e.g., just a model) and why evaluation has become such a critical concern. It explores why AI evaluation is inherently challenging, given the complexity of real-world systems and the contexts in which they operate. The chapter also introduces a systematic way of thinking about evaluation, namely not as a single metric or benchmark, but as a structured process that supports real-world decisions.
Chapter 2: Framing and Scoping the Evaluation
Before choosing metrics or collecting evaluation data, more fundamental questions need to be answered. This chapter focuses on how to set clear evaluation goals, such as benchmarking performance, assessing risk, informing deployment decisions, or diagnosing failures. It also discusses how to determine what exactly is being evaluated (a full system, a subsystem, a model, or data components), how to clearly define tasks and success criteria, and how to select appropriate evaluation dimensions such as effectiveness, robustness, fairness, explainability, safety, and societal impact.
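To give a flavor of what framing and scoping can look like in practice (this is my own illustrative sketch, not an excerpt from the book), here is a hypothetical evaluation plan for a customer-support chatbot, written down as a plain Python dict; all names and values are made up:

eval_plan = {
    "goal": "inform the go/no-go deployment decision",          # why we evaluate
    "target": "full system (retriever + LLM + guardrails)",     # what we evaluate
    "task": "answer customer billing questions",                # the task in scope
    "success_criteria": "factually correct, policy-compliant",  # what "good" means
    "dimensions": ["effectiveness", "robustness", "safety"],    # dimensions to assess
}

Writing the plan down this explicitly, before touching metrics or data, is the kind of discipline the chapter argues for.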
Chapter 3: Picking Evaluation Metrics
Metrics are often treated as the core of AI evaluation, yet they are frequently misunderstood or misapplied. This chapter examines how to choose and interpret metrics so that they meaningfully reflect real-world performance and decision-making needs. It covers different types of metrics, such as reference-based versus reference-free, automatic versus judgment-based, and execution-based, along with what makes a metric useful or misleading. It also describes how to design custom metrics and how to manage and evolve them over time as systems and goals change.
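To illustrate the reference-based versus reference-free distinction (a sketch of my own, not taken from the book), here is a minimal Python example; the function names and data are hypothetical:

def exact_match(prediction: str, reference: str) -> float:
    # Reference-based: requires a gold answer to compare against.
    return float(prediction.strip().lower() == reference.strip().lower())

def within_length_budget(prediction: str, max_words: int = 50) -> float:
    # Reference-free: checks a property of the output on its own.
    return float(len(prediction.split()) <= max_words)

examples = [
    {"prediction": "Paris", "reference": "Paris"},
    {"prediction": "The capital of France is Paris.", "reference": "Paris"},
]
em = sum(exact_match(e["prediction"], e["reference"]) for e in examples) / len(examples)
wlb = sum(within_length_budget(e["prediction"]) for e in examples) / len(examples)
print(f"exact match: {em:.2f}, within length budget: {wlb:.2f}")

Note that the second prediction is arguably correct yet scores zero on exact match, which hints at why the chapter treats a metric's usefulness (or misleadingness) as a design question rather than a given.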
Chapter 4: Acquiring Evaluation Data
Every evaluation depends on data, namely the inputs, examples, and test cases used to probe system behavior and compute metrics. This chapter guides the reader through the process of acquiring and constructing evaluation data that is credible, diverse, and aligned with evaluation goals. It explores where to look for evaluation data, how to collect and sample it, how to synthesize data at scale, how to assess data quality, and how to manage and maintain evaluation datasets as systems evolve.
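As a small taste of synthesizing data at scale (again, my own sketch rather than the book's approach), here is one simple way to generate robustness probes from seed inputs by injecting character-swap typos; the seed questions are invented:

import random

random.seed(0)  # keep the synthetic set reproducible

def swap_typo(text: str) -> str:
    # Create a perturbed variant by swapping two adjacent characters.
    if len(text) < 2:
        return text
    i = random.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

seed_cases = ["What is the capital of France?", "Summarize this contract clause."]
synthetic = [{"input": swap_typo(q), "source": q, "tag": "typo"} for q in seed_cases]
for case in synthetic:
    print(case)

Real pipelines would of course use richer perturbations and quality checks, which is exactly the territory the chapter maps out.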
Chapter 5: Acquiring Judgments
At the heart of most evaluations are judgments, namely labels, ratings, or decisions that determine what counts as correct, relevant, appropriate, safe, or high-quality. This chapter focuses on how to design judgment processes deliberately. It discusses what judgments should look like, how to define rubrics and instructions, who should act as judges, how to calibrate and validate human judges, how to develop and assess LLM-based judges, and how to design judgment workflows that are both reliable and practical.
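For calibrating judges, one standard tool is chance-corrected agreement. Here is a small, self-contained implementation of Cohen's kappa (my own sketch; the labels are invented) that could be used to compare, say, a human judge against an LLM-based one:

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # Agreement between two judges, corrected for agreement expected by chance.
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in set(labels_a) | set(labels_b)) / n ** 2
    return (observed - expected) / (1 - expected)

human = ["good", "bad", "good", "good", "bad"]
llm_judge = ["good", "bad", "bad", "good", "bad"]
print(f"kappa: {cohens_kappa(human, llm_judge):.2f}")

A kappa well below 1.0 would be a signal to revisit the rubric or the judge before trusting its labels at scale.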
As these chapters go out for external review, I’m especially interested in feedback from practitioners who actively work with AI evaluation in real systems: engineers, technical leads, product managers, auditors, and others who regularly have to reason about evaluation results and make decisions based on them. If that sounds like you, and you’d be interested in reviewing one or two chapters and sharing candid feedback (including disagreements or blind spots), feel free to reach out.
Thank you for reading, till next time!
Panos
News and Updates
On January 26th and 27th, 2026, as well as on March 3rd and 4th, 2026, I will be teaching the 2nd and 3rd editions of my live online course AI Evaluations Bootcamp on the O’Reilly Learning Platform. You can see more details and register here. The course is free if you are already subscribed to the platform; if not, you can take advantage of a 10-day free trial.
If you are interested in the field of semantic data modeling and knowledge graphs, my book Semantic Modeling for Data - Avoiding Pitfalls and Breaking Dilemmas remains available at O’Reilly, Amazon, and most major bookstores. Also, if you’ve already read it and have thoughts, I’d really appreciate it if you left a rating or review; it helps others discover the book and join the conversation.