Final notes on ‘Prompt Engineering for LLMs’

Detailed notes covering Chapters 10 and 11 of ‘Prompt Engineering for LLMs’ by Berryman and Ziegler, focusing on LLM application evaluation and future trends. Chapter 10 explores comprehensive testing frameworks including offline example suites and online AB testing, while Chapter 11 discusses multimodality, user interfaces, and core principles for effective prompt engineering. Includes personal insights on the book’s emphasis on completion models versus chat models.
llm
prompt-engineering
books-i-read
evaluation
Author

Alex Strick van Linschoten

Published

January 17, 2025

Here are the final notes from ‘Prompt Engineering for LLMs’, a book I’ve been reading over the past few days (and enjoying!).

Chapter 10: Evaluating LLM Applications

The chapter begins with an interesting anecdote about GitHub Copilot: the first code written in the project’s repository was the evaluation harness, which highlights the importance of testing in LLM applications. The authors, who worked on Copilot from its inception, emphasise this as a best practice.

Evaluation Framework

When evaluating LLM applications, three main aspects can be assessed:

  • The model itself - its capabilities and limitations
  • Individual interactions with the model (prompts and responses)
  • The integration of multiple interactions within the broader application

As a general rule of thumb, you should always track and record the following (a minimal instrumentation sketch follows the list):

  • Latency
  • Token consumption statistics
  • Overall system-level metrics
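
To make this concrete, here is a minimal instrumentation sketch (mine, not the book’s) that wraps a chat completion call and records latency and token counts. It assumes the OpenAI Python client, but any provider that reports usage data works the same way.

```python
# Minimal instrumentation sketch (not from the book): wraps a chat completion
# call and records latency plus token usage. Assumes the OpenAI Python client;
# swap in whatever provider/observability stack you actually use.
import time
from openai import OpenAI

client = OpenAI()

def tracked_completion(messages: list[dict], model: str = "gpt-4o-mini") -> dict:
    start = time.perf_counter()
    response = client.chat.completions.create(model=model, messages=messages)
    latency_s = time.perf_counter() - start
    record = {
        "latency_s": round(latency_s, 3),
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "text": response.choices[0].message.content,
    }
    # In a real application you would ship `record` to your logging/metrics system.
    return record
```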

Offline Evaluation

Example Suites

The foundation of offline evaluation is creating example suites - collections of 10-20 (minimum) input-output pairs that serve as test cases. These should be accompanied by scripts that apply your application’s logic to each example and compare the results (a minimal runner sketch appears at the end of this subsection).

Example sources come from three main areas:

  • Existing examples from your project
  • Real-time user data collection
  • Synthetic creation

When using synthetic data, it’s crucial to use different LLMs for creation versus application/judging to avoid potential biases.
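
As a rough illustration of what such a script can look like, here is a minimal example-suite runner. It assumes a JSONL file of input/expected pairs and a hypothetical run_application() function wrapping your application’s logic, with exact matching as the comparison (suitable for the binary or classification cases discussed below).

```python
# Example-suite runner sketch (my illustration, not code from the book).
# Assumes a JSONL file of {"input": ..., "expected": ...} pairs and a
# hypothetical `run_application(input)` function wrapping your app's logic.
import json

def load_examples(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def exact_match(output: str, expected: str) -> bool:
    return output.strip().lower() == expected.strip().lower()

def evaluate(path: str, run_application) -> float:
    examples = load_examples(path)
    passed = 0
    for example in examples:
        output = run_application(example["input"])
        if exact_match(output, example["expected"]):
            passed += 1
    return passed / len(examples)  # fraction of gold-standard matches
```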

Evaluation Approaches

  1. Gold Standard Matching
  • Can be exact or partial matching
  • Particularly effective for binary decisions or multi-label classification
  • Can leverage the logprobs tricks from Chapter 7 to assess model confidence
  • Free-form text requires more creative evaluation approaches
  • Tool-use scenarios may be easier to evaluate, especially in agent-driven applications
  2. Functional Testing
  • A step up from unit tests but not full end-to-end testing
  • Focuses on testing specific system components
  3. LLM as Judge
  • Currently trendy but requires careful implementation
  • Should include a human verification loop, preferably with multiple humans
  • Key insight: always frame the evaluation as if the LLM is grading someone else’s work, never its own
  • Recommendations for quantitative measures (a judge-prompt sketch follows this list):
    • Use graded scoring and multi-aspect coverage
    • Implement 1-5 scales with specific criteria
    • Place all instructions and criteria before the content to be evaluated
    • Break down “Goldilocks” questions (was it just right?) into separate questions about whether it was enough and whether it was too much
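
To illustrate those recommendations (my own sketch, not the book’s template), the judge prompt below puts the criteria and 1-5 scale before the content, frames the LLM as grading someone else’s answer, and splits the “Goldilocks” question into completeness (enough?) and conciseness (not too much?).

```python
# LLM-as-judge prompt sketch (my own illustration of the chapter's advice).
# Criteria come before the content, the scale is 1-5 with explicit anchors,
# and the judge is framed as grading someone else's work.
JUDGE_TEMPLATE = """You are grading another assistant's answer to a user question.

Score each aspect from 1 (very poor) to 5 (excellent):
- relevance: does the answer address the question?
- completeness: does it cover everything the question asked for?
- conciseness: is it free of unnecessary padding?

Return JSON like {{"relevance": 4, "completeness": 3, "conciseness": 5}}.

Question:
{question}

Answer to grade:
{answer}
"""

def build_judge_prompt(question: str, answer: str) -> str:
    return JUDGE_TEMPLATE.format(question=question, answer=answer)
```

Parsing the returned JSON then gives per-aspect scores that can be averaged or tracked separately, with spot-checks by human reviewers.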

Online Evaluation

The chapter transitions to discussing why we need online testing despite having offline evaluation capabilities. While offline testing is safer and more scalable, real human interactions are unpredictable and require live testing.

Key points about online evaluation:

  • AB testing is the standard approach
  • Existing solutions include Optimizely, VWO, and AB Tasty
  • Applications need to support running in two modes (A and B); a simple bucketing sketch follows this list
  • Consider rollout timing and users on older versions
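
A common way to support the two modes is deterministic bucketing. The sketch below (an assumption on my part, not from the book) hashes a stable user ID so each user consistently sees the same variant without any stored assignment.

```python
# Deterministic A/B bucketing sketch (not from the book): hashes a stable user
# ID so each user always lands in the same arm, without storing assignments.
import hashlib

def assign_variant(user_id: str, experiment: str = "prompt-v2", split: float = 0.5) -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    return "A" if bucket < split else "B"

# Usage: route the request through the old or new prompt depending on the arm.
variant = assign_variant("user-1234")
```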

Five main metrics for online evaluation (from most to least straightforward):

  1. Direct feedback (user responses to suggestions)
  2. Functional correctness
  3. User acceptance (following suggestions)
  4. Achieved impact (user benefit)
  5. Incidental metrics (surrounding measurements)

Direct feedback data is particularly valuable, as it can later be used for model fine-tuning. It’s recommended to track more incidental metrics rather than fewer, both as quality indicators and as a way to investigate unexpected changes. A small feedback-logging sketch follows.
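
For instance, feedback logging might record events against a suggestion ID so they can later be joined with the prompt and response. The filename and log_event helper here are illustrative, not from the book.

```python
# Feedback-logging sketch (my illustration): records direct feedback and
# acceptance events against a suggestion ID so they can be joined with the
# prompt/response later, e.g. to build fine-tuning datasets.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class FeedbackEvent:
    suggestion_id: str
    event: str      # "thumbs_up", "thumbs_down", "accepted", "dismissed"
    variant: str    # which A/B arm produced the suggestion
    timestamp: float

def log_event(event: FeedbackEvent) -> None:
    # Append-only JSONL is enough for a prototype; swap for a real event store.
    with open("feedback_events.jsonl", "a") as f:
        f.write(json.dumps(asdict(event)) + "\n")

log_event(FeedbackEvent("sugg-42", "accepted", variant="B", timestamp=time.time()))
```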

Chapter 11: Looking Ahead

The final chapter covers several forward-looking topics:

  • Multimodality in LLMs
  • User experience and interface considerations
  • Published artifacts from Anthropic
  • Risks and rewards of custom interfaces
  • Trends in model intelligence, cost, and speed

Book-Level Conclusions

Two main lessons emerge from the book:

  1. LLMs as Text Completion Engines
  • They fundamentally mimic their training data
  • Success comes from aligning prompts with training data patterns
  • Particularly relevant for completion models
  2. Empathy with LLMs
  • Think of them as mechanical friends with internet knowledge
  • Five key insights:
    • LLMs are easily distracted; keep prompts focused
    • If humans can’t understand the prompt, LLMs will struggle
    • Provide clear instructions and examples
    • Include all necessary information (LLMs aren’t psychic)
    • Give space for “thinking out loud” (chain of thought); a brief prompt sketch follows this list
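
As a tiny illustration of that last point (my wording, not the book’s), a prompt can explicitly reserve space for reasoning before the final answer:

```python
# Chain-of-thought prompt sketch: ask for step-by-step reasoning first, then
# the final answer, so the model has room to "think out loud".
COT_PROMPT = """Answer the question below.

First, write your reasoning step by step under a "Reasoning:" heading.
Then give the final answer on its own line, prefixed with "Answer:".

Question: {question}
"""
```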

Personal Reflections

The book, while not revolutionary, provides valuable insights and is a recommended read: at around 250 pages it can be completed in about 10-11 days. Some points were novel, though none were completely mind-blowing. The heavy emphasis on completion models over chat models is both intriguing and occasionally confusing, but it makes sense given the authors’ background with GitHub Copilot.