Really enjoyed this chapter. My tidied notes from my readings follow below. 150 pages in and we’re starting to get to the good stuff :)
Overview and Context
This chapter serves as the first of two chapters (Chapters 3 and 4) dealing with evaluation in AI Engineering. While Chapter 4 will delve into evaluation within systems, Chapter 3 addresses the fundamental question of how to evaluate open-ended responses from foundation models and LLMs at a high level. The importance of evaluation cannot be overstated, though the author perhaps takes this somewhat for granted. The chapter provides a comprehensive framework for understanding various evaluation methodologies and their applications.
Challenges in Evaluating Foundation Models
The evaluation of foundation models presents several unique and complex challenges that make systematic assessment difficult:
- Existing benchmarks become increasingly inadequate as models improve in their capabilities
- As models become better at writing and mimicking human-like responses, evaluation becomes more complex and nuanced
- Many foundation models are API-driven black boxes, limiting access to internal workings
- Models continuously develop new capabilities, requiring constant adaptation of evaluation methods
- There has been notably limited investment in evaluation studies and technologies compared to the extensive resources devoted to enhancing model capabilities
- The improvement in model performance necessitates the continuous development of new benchmarks
- Without a systematic approach to evaluation, it is hard to tell whether changes actually make models or applications better
Language Model Metrics
The chapter includes a technically detailed section on understanding language model metrics, which, while math-heavy, provides fundamental insights into model capabilities:
- Entropy
- Cross-entropy
- Perplexity
These metrics serve as underlying measures to understand what’s happening within the models and assess their power and conversational abilities. While this section spans 4-5 pages of technical content, it provides some useful foundational understanding of how we can measure a language model’s intrinsic capabilities.
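To make the relationship between these metrics concrete, here is a minimal sketch (mine, not the book's) of how cross-entropy and perplexity relate, using made-up per-token probabilities:

```python
# Sketch: cross-entropy and perplexity from per-token probabilities.
# The probability values below are illustrative assumptions.
import math

token_probs = [0.25, 0.60, 0.10, 0.45]  # p(token_i | preceding tokens), assumed values

# Cross-entropy: average negative log-likelihood per token (in nats here).
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is the exponential of the cross-entropy: lower means the model
# finds the text less "surprising".
perplexity = math.exp(cross_entropy)

print(f"Cross-entropy: {cross_entropy:.3f} nats/token")
print(f"Perplexity:    {perplexity:.3f}")
```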
Downstream Task Performance Measurement
The chapter transitions from intrinsic metrics to evaluating actual capabilities, dividing evaluation into exact and subjective approaches.
Exact Evaluation
There are two principal approaches to exact evaluation:
Functional Correctness Assessment
- Evaluates whether the LLM can successfully complete its assigned tasks
- Focuses on practical capability rather than theoretical metrics
- Example: in coding tasks, checking whether generated code passes all unit tests (a minimal sketch follows this list)
- Provides clear, objective measures of success
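As a toy illustration of the idea (the generated snippet and test cases are my own stand-ins, not examples from the chapter), functional correctness for code generation boils down to executing the model's output against a test suite:

```python
# Sketch: functional-correctness check for generated code.
# In practice you would sandbox untrusted model output before executing it.
generated_code = """
def add(a, b):
    return a + b
"""

test_cases = [((1, 2), 3), ((-1, 1), 0), ((0, 0), 0)]

def passes_all_tests(code: str) -> bool:
    namespace: dict = {}
    try:
        exec(code, namespace)  # run the generated code in an isolated namespace
        fn = namespace["add"]
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return False  # code that errors out fails the functional check

print(passes_all_tests(generated_code))  # True -> functionally correct for this suite
```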
Similarity Measurements Against Reference Data
Four distinct methods are identified:
- Human Evaluator Judgment
- Requires manual comparison of texts by human evaluators
- Highly accurate but time and resource-intensive
- Limited scalability due to human involvement
- Often considered the gold standard despite limitations
- Exact Match Checking
- Compares generated response against reference responses for exact matches
- Most effective with shorter, specific outputs
- Less useful for verbose or creative outputs
- Provides binary yes/no results
- Lexical Similarity
- Employs established metrics like BLEU, ROUGE, and METEOR
- Focuses on word overlap and structural similarities
- Known to be somewhat crude in their assessment
- Widely used despite limitations due to ease of implementation
- Semantic Similarity
- Utilizes embeddings for comparing textual meaning
- Less sensitive to specific word choices than lexical approaches
- Quality depends heavily on the underlying embedding model
- May require significant computational resources
- Generally provides more nuanced comparison than lexical methods (both lexical and semantic scoring are sketched below)
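To illustrate the difference between the last two methods, here is a rough sketch of lexical scoring (BLEU via sacrebleu) next to semantic scoring (cosine similarity of sentence embeddings). The libraries and the embedding model name are my own choices for illustration, not recommendations from the chapter:

```python
# Sketch: lexical vs. semantic similarity against a reference answer.
# Assumes `sacrebleu` and `sentence-transformers` are installed; the
# embedding model name is an illustrative assumption.
from sacrebleu.metrics import BLEU
from sentence_transformers import SentenceTransformer, util

reference = "The cat sat on the mat."
candidate = "A cat was sitting on the mat."

# Lexical similarity: n-gram overlap with the reference (sentence-level BLEU).
bleu = BLEU(effective_order=True)
lexical_score = bleu.sentence_score(candidate, [reference]).score

# Semantic similarity: cosine similarity between sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
emb_ref, emb_cand = model.encode([reference, candidate], convert_to_tensor=True)
semantic_score = util.cos_sim(emb_ref, emb_cand).item()

print(f"BLEU (lexical):               {lexical_score:.1f}")
print(f"Cosine similarity (semantic): {semantic_score:.3f}")
```

Note how a paraphrase with little word overlap can score poorly on BLEU while still scoring high on embedding similarity, which is exactly the trade-off the chapter describes.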
The chapter includes a brief but relevant sidebar on embeddings and their significance in evaluation, though this digression seemed a bit out of place in the overall flow.
AI as Judge
This section explores the increasingly popular approach of using AI systems to evaluate other AI systems.
Benefits
- Significantly faster than human evaluation processes
- Generally more cost-effective than human evaluation at scale
- Studies have shown strong correlation with human evaluations in many cases
- AI judges can provide detailed explanations for their decisions
- Offers greater flexibility in evaluation approaches
- Enables systematic and consistent evaluation at scale
Three Main Approaches
- Individual Response Evaluation
- Assesses response quality based solely on the original question
- Often implements numerical scoring systems (e.g., a 1-5 scale; a prompt sketch follows this list)
- Evaluates responses in isolation without comparison
- Reference Response Comparison
- Evaluates generated response against established reference responses
- Usually produces binary (true/false) outcomes
- Helps ensure responses meet specific criteria
- Generated Response Comparison
- Compares two generated responses to determine relative quality
- Predicts likely user preferences between options
- Particularly useful for:
- Post-training alignment
- Test-time compute optimization
- Model ranking through comparative evaluation
- Generating preference data
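As a concrete (and entirely illustrative) example of the first approach, individual response evaluation, a judge can be as simple as a prompt asking a model to score an answer on a 1-5 scale. The prompt wording, scale, and model name below are my assumptions, not taken from the book:

```python
# Sketch: AI-as-judge scoring a single response on a 1-5 scale.
# Assumes the `openai` package (v1+) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating an answer to a question.
Rate the answer's helpfulness and correctness on a scale of 1 to 5.
Reply with only the number.

Question: {question}
Answer: {answer}"""

def judge_response(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    completion = client.chat.completions.create(
        model=model,  # assumed judge model; swap in whatever judge you use
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # keep the judge as deterministic as possible
    )
    return int(completion.choices[0].message.content.strip())

print(judge_response("What does perplexity measure?",
                     "It measures how uncertain a model is about a text."))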
Implementation Considerations
- Table 3-3 (page 139) provides an overview of different AI judge criteria used by various AI tools
- Notable lack of standardization across different platforms and approaches (see above)
- Various scoring systems available, each with their own trade-offs
- Adding examples to prompts can improve accuracy but increases token count and costs
- Careful balance needed between evaluation quality and resource consumption
Limitations and Challenges
- AI judges can show inconsistency in their judgments
- Costs can escalate quickly, especially when using stronger models or including more context
- Evaluation criteria often remain ambiguous and difficult to standardize
- Several inherent biases identified:
- Self-bias: Models tend to favor responses generated by themselves
- Verbosity bias: Tendency to favor longer, more detailed answers
- Other biases common to AI applications in general (e.g., position bias, where the first of two compared responses tends to be favored)
Specialized Judges
This section challenges the conventional wisdom of using the strongest available model as a judge, introducing a compelling alternative:
- Small, specialized judges can be as effective as larger models for specific evaluation tasks
- More cost-effective and efficient than using large language models
- Can be trained for highly specific evaluation criteria
- Demonstrates comparable performance to larger models like GPT-4 in specific domains
Three types of specialized judges are identified:
1. Reward models (evaluating prompt-response pairs; sketched below)
2. Reference-based judges
3. Preference models
This represents a novel approach that could significantly impact evaluation methodology in the field.
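For a rough sense of what a reward-model judge looks like in practice, here is a sketch that scores a (prompt, response) pair with a sequence-classification head. The specific model name is just an assumed example of a publicly available reward model, not one the book endorses:

```python
# Sketch: scoring a (prompt, response) pair with a reward-model-style classifier.
# Assumes `transformers` and `torch` are installed; the model name is an assumption.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed example
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

prompt = "Explain what perplexity measures."
response = "Perplexity measures how surprised a language model is by a text."

# The pair is encoded together; the single output logit acts as a quality score.
inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits[0].item()  # higher = judged better
print(f"Reward score: {score:.3f}")
```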
Comparative Evaluation for Model Ranking
Methodology
- Focuses on binary choices between two samples
- Simpler for both humans and AI to make comparative judgments
- Used in major leaderboards like LMSYS's Chatbot Arena
- Requires evaluation of multiple combinations to establish rankings
- Various rating algorithms are available for turning pairwise outcomes into an overall ranking (a minimal sketch follows this list)
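To show how pairwise preferences can become a ranking, here is a minimal Elo-style sketch. The match data and K-factor are made up, and real leaderboards use more sophisticated statistical rating models of this general family:

```python
# Sketch: Elo-style model ranking from pairwise comparison outcomes.
from collections import defaultdict

K = 32  # update step size (assumed)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Shift both ratings toward the observed pairwise outcome."""
    e_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_win)
    ratings[loser] -= K * (1 - e_win)

# Hypothetical preference data: (winner, loser) for each comparison.
matches = [("model_a", "model_b"), ("model_a", "model_c"),
           ("model_b", "model_c"), ("model_a", "model_b")]

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating
for winner, loser in matches:
    update(ratings, winner, loser)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```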
Advantages
- More intuitive evaluation process
- Often more reliable than absolute scoring
- Reduces cognitive load on evaluators
- Provides clear preference data
Challenges
- Highly data-intensive, which limits scalability
- Lacks standardization across implementations
- Difficulty in converting comparative measures to absolute metrics
- Quality control remains a significant concern
- The number of required comparisons grows quickly with model count (full coverage of n models needs up to n(n-1)/2 pairings)
Key Takeaways and Future Implications
- The emergence of smaller, specialized judge models represents a significant shift from the traditional approach of using the largest available models
- Comparative evaluation offers promising approaches but requires careful consideration of scalability and implementation
- The field continues to evolve rapidly, requiring flexible and adaptable evaluation strategies
- Sets up crucial discussion for system-level evaluation in Chapter 4
- Highlights the ongoing tension between evaluation quality and resource efficiency
The chapter effectively establishes the foundational understanding necessary for the more practical, system-focused evaluation discussions to follow in Chapter 4.