What follows are my notes on chapter 9 of Chip Huyen’s ‘AI Engineering’ book. This chapter is on optimising inference, and I learned a lot while reading it! It covers techniques like prompt caching and architectural considerations that I was vaguely aware of, but hadn’t fully appreciated in terms of how they work in real inference systems.
Chapter 9: Overview
Machine learning inference optimization operates across three fundamental domains: model optimization, hardware optimization, and service optimization. While hardware optimization often requires significant capital investment and is largely out of an individual engineer’s hands, model and service optimizations offer substantial, directly actionable opportunities for AI engineers to improve performance.
Critical Cost Insight: A 2023 survey found that inference can account for up to 90% of machine learning costs in deployed AI systems, often exceeding the cost of training. That makes inference optimization not just an engineering challenge but a business necessity.
Core Concepts and Bottlenecks
Understanding inference bottlenecks is essential for effective optimization. Two primary types of computational bottlenecks impact inference performance:
Compute-Bound Bottlenecks: Tasks limited by raw arithmetic capacity: the processor spends its time performing computation rather than waiting on data. In LLM inference, the prefill step, which processes all prompt tokens in parallel, is typically compute-bound.
Memory Bandwidth-Bound Bottlenecks: Tasks limited by how quickly data can be moved between memory and the processor rather than by arithmetic. LLM decoding is typically memory bandwidth-bound, because the model weights (and the growing KV cache) must be read from memory for every generated token.
Because prefill and decode sit on opposite sides of this divide, production systems increasingly decouple the prefill step from the decode step, so that each can be batched and scheduled on hardware suited to its bottleneck - a practice that has become common as organizations optimize their inference pipelines.
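To make this concrete, here is a back-of-the-envelope arithmetic intensity calculation (FLOPs per byte of weights moved). The model size, precision, and hardware figures below are illustrative assumptions of mine, not numbers from the book:

```python
# Back-of-the-envelope: why prefill tends to be compute-bound and decode
# memory bandwidth-bound. All numbers are illustrative assumptions.

params = 7e9            # a 7B-parameter model
bytes_per_param = 2     # fp16 weights
weight_bytes = params * bytes_per_param

def arithmetic_intensity(tokens_per_pass: int) -> float:
    """FLOPs per byte of weights read for one forward pass over `tokens_per_pass` tokens.

    A transformer forward pass costs roughly 2 * params FLOPs per token, while
    the weights only need to be read from memory once per pass, however many
    tokens are processed together.
    """
    flops = 2 * params * tokens_per_pass
    return flops / weight_bytes

# Prefill: the whole prompt (say 1024 tokens) goes through in one pass.
print(f"prefill, 1024 tokens: {arithmetic_intensity(1024):,.0f} FLOPs/byte")
# Decode: one new token per pass per request, so intensity is tiny.
print(f"decode, 1 token:      {arithmetic_intensity(1):,.0f} FLOPs/byte")

# Compare against the hardware's compute-to-bandwidth ratio (its "ridge point").
# An accelerator with ~1000 TFLOPS of fp16 compute and ~3 TB/s of bandwidth sits
# around ~330 FLOPs/byte: prefill lands well above it (compute-bound), decode
# far below it (memory bandwidth-bound).
```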
Inference APIs and Service Patterns
Two fundamental approaches to inference deployment exist:
- Online Inference APIs
- Optimized for minimal latency
- Designed for real-time responses
- Typically more expensive per inference
- Critical for interactive applications
- Batch Inference APIs
- Optimized for cost efficiency
- Can tolerate longer processing times (potentially hours)
- Allows providers to optimize resource utilization
- Ideal for bulk processing tasks
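As a concrete illustration of the batch pattern, here is roughly what submitting a bulk job looks like with OpenAI’s Batch API (other providers offer similar services). Treat the exact field names and model name as examples to check against the current docs rather than gospel:

```python
# Sketch of a batch inference job, roughly following OpenAI's Batch API.
# Verify field names and models against the current documentation.
import json
from openai import OpenAI

client = OpenAI()

# Each line of the .jsonl file is one independent request.
requests = [
    {
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": f"Summarise document {i}"}],
        },
    }
    for i in range(3)
]
with open("requests.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in requests))

batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",   # the provider may take up to a day
)
print(batch.id, batch.status)  # poll later and download the output file when done
```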
Inference Performance Metrics
Several key metrics help quantify inference performance:
Latency Components
- Time to First Token
- Measures duration between query submission and initial response
- Critical for user experience in interactive applications
- Often a key optimization target for real-time systems
- Time per Output Token
- Generation speed after the first token
- Impacts overall completion time
- Can vary based on model architecture and optimization
- Inter-token Latency
- Time intervals between consecutive tokens
- Affects perceived smoothness of generation
- Important for streaming applications
Total latency can be approximated as: time_to_first_token + (time_per_output_token × number_of_output_tokens)
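These latency components are easy to measure yourself when a service streams tokens. Here is a minimal sketch, with a stand-in stream_tokens() generator in place of a real streaming client:

```python
# Minimal sketch: measuring time-to-first-token and time-per-output-token from
# a streaming response. stream_tokens() is a stand-in for a real streaming client.
import time

def stream_tokens(prompt: str):
    # Placeholder generator: pretend the model emits a token every 50 ms.
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(0.05)
        yield tok

start = time.perf_counter()
arrival_times = []

for token in stream_tokens("Say hello"):
    arrival_times.append(time.perf_counter())

ttft = arrival_times[0] - start                                # time to first token
gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
time_per_output_token = sum(gaps) / len(gaps)                  # average inter-token latency
total_latency = arrival_times[-1] - start

print(f"TTFT:                  {ttft * 1000:.0f} ms")
print(f"time per output token: {time_per_output_token * 1000:.0f} ms")
print(f"total latency:         {total_latency * 1000:.0f} ms")
```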
Throughput and Goodput Metrics
Throughput: The number of output tokens per second an inference service can generate across all users and requests. This raw metric provides insight into system capacity.
Goodput: The number of requests per second that successfully meet the Service Level Objective (SLO). This metric offers a more realistic view of useful system capacity.
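A tiny worked example (with made-up numbers and an arbitrary SLO) makes the distinction concrete:

```python
# Toy illustration of throughput vs goodput with made-up numbers.
# Each request records its output tokens, total latency, and time-to-first-token.
requests = [
    {"tokens": 200, "latency_s": 4.0, "ttft_s": 0.4},
    {"tokens": 150, "latency_s": 9.0, "ttft_s": 2.5},    # slow: misses the SLO
    {"tokens": 300, "latency_s": 5.5, "ttft_s": 0.6},
]
window_s = 10.0                                # observation window
slo = {"ttft_s": 1.0, "latency_s": 8.0}        # example service level objective

throughput = sum(r["tokens"] for r in requests) / window_s   # tokens/s, all requests
goodput = sum(
    1 for r in requests
    if r["ttft_s"] <= slo["ttft_s"] and r["latency_s"] <= slo["latency_s"]
) / window_s                                                 # requests/s meeting the SLO

print(f"throughput: {throughput:.0f} tokens/s")
print(f"goodput:    {goodput:.1f} requests/s meeting the SLO")
```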
Resource Utilization Metrics
- Model FLOPS Utilization (MFU)
- Ratio of actual to theoretical FLOPS
- Indicates computational efficiency
- Key metric for hardware optimization
- Model Bandwidth Utilization (MBU)
- Percentage of achievable memory bandwidth utilized
- Critical for memory-intensive operations
- Helps identify memory bottlenecks
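Both are simple ratios of achieved to peak performance. A sketch with illustrative (made-up) model and hardware numbers:

```python
# MFU and MBU as simple ratios of achieved to peak performance.
# All model and hardware numbers below are illustrative, not benchmarks.

params = 7e9                      # 7B-parameter model
bytes_per_param = 2               # fp16

# Model FLOPS Utilization: observed FLOPs/s vs the chip's advertised peak.
system_tokens_per_s = 5_000       # total decode throughput across all requests
observed_flops_per_s = 2 * params * system_tokens_per_s   # ~2*params FLOPs per token
peak_flops_per_s = 1_000e12       # ~1000 TFLOPS of fp16 compute
mfu = observed_flops_per_s / peak_flops_per_s

# Model Bandwidth Utilization: achieved memory traffic vs peak bandwidth.
# For memory-bound decoding of a single stream, every generated token requires
# reading (roughly) all of the weights once.
single_stream_tokens_per_s = 60
achieved_bandwidth = params * bytes_per_param * single_stream_tokens_per_s
peak_bandwidth = 3.35e12          # bytes/s
mbu = achieved_bandwidth / peak_bandwidth

print(f"MFU: {mfu:.1%}   MBU: {mbu:.1%}")
```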
Hardware Considerations and AI Accelerators
While NVIDIA GPUs dominate the market, various specialized chips exist for inference:
Popular AI Accelerators
- NVIDIA GPUs (market leader)
- AMD accelerators
- Google TPUs
- Various emerging specialized chips
Inference vs Training Hardware: Inference-optimized chips prioritize lower precision and faster memory access over large memory capacity, contrasting with training-focused hardware that requires substantial memory capacity.
Key hardware optimization considerations include:
- Memory size and bandwidth requirements
- Chip architecture and how well it matches the workload
- Power consumption profiles
- Cost-performance ratios
Model Optimization Techniques
Core Approaches
- Quantization
- Reduces numerical precision (e.g., 32-bit floats to 16-bit, 8-bit, or lower)
- Decreases memory footprint and memory bandwidth pressure
- Weight-only quantization is particularly common
- Can halve model size (or better) with minimal quality impact; see the sketch after this list
- Pruning
- Removes non-essential parameters
- Preserves core model behavior
- Multiple techniques available
- Requires careful validation
- Distillation
- Creates smaller, more efficient models
- Maintains key capabilities
- Covered extensively in Chapter 8
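Here is the promised quantization sketch: a minimal weight-only scheme using symmetric per-tensor int8. Real libraries are more careful (per-channel or group-wise scales, outlier handling), but the core idea is just storing low-precision weights plus a scale:

```python
# Minimal sketch of weight-only quantization: symmetric per-tensor int8.
import numpy as np

rng = np.random.default_rng(0)
w_fp32 = rng.standard_normal((4096, 4096)).astype(np.float32)  # one weight matrix

# Quantize: map the fp32 range onto int8 with a single scale factor.
scale = np.abs(w_fp32).max() / 127.0
w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)

# At inference time, dequantize (or fuse the scale into the matmul).
w_dequant = w_int8.astype(np.float32) * scale

x = rng.standard_normal((1, 4096)).astype(np.float32)
y_ref = x @ w_fp32
y_q = x @ w_dequant

print(f"memory: {w_fp32.nbytes / 1e6:.0f} MB fp32 -> {w_int8.nbytes / 1e6:.0f} MB int8")
print(f"relative error: {np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref):.4f}")
```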
Advanced Decoding Strategies
Speculative Decoding
This approach combines a large model with a smaller, faster model:
- A small draft model proposes several tokens cheaply
- The large target model verifies them in one pass, keeping the tokens it agrees with and correcting the first mismatch
- Speeds up generation without changing output quality, since the target model has the final say
- Relatively easy to adopt
- Integrated into frameworks like vLLM and llama.cpp
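To see the mechanics, here is a toy greedy version over characters. The “models” are stand-in functions that just read off a target string; in a real system both are LLMs and the target model scores all draft tokens in a single batched forward pass:

```python
# Toy greedy speculative decoding over characters. The "draft" and "target"
# models here are stand-in functions, not real LLMs.

TARGET_TEXT = "the quick brown fox jumps over the lazy dog"

def target_next(prefix: str) -> str:
    """Expensive model: always 'right' (the next char of TARGET_TEXT)."""
    return TARGET_TEXT[len(prefix)] if len(prefix) < len(TARGET_TEXT) else ""

def draft_next(prefix: str) -> str:
    """Cheap model: usually right, but guesses badly in one spot."""
    nxt = target_next(prefix)
    return "z" if prefix.endswith(" ") and nxt == "l" else nxt

def speculative_decode(prompt: str, max_len: int = len(TARGET_TEXT), k: int = 4) -> str:
    out = prompt
    while len(out) < max_len:
        # 1. Draft model proposes up to k tokens cheaply.
        draft = []
        for _ in range(k):
            tok = draft_next(out + "".join(draft))
            if not tok:
                break
            draft.append(tok)
        # 2. Target model verifies left to right: keep the agreeing prefix,
        #    then take one token from the target itself (correction or bonus token).
        accepted = []
        for tok in draft:
            if tok == target_next(out + "".join(accepted)):
                accepted.append(tok)
            else:
                break
        correction = target_next(out + "".join(accepted))
        out += "".join(accepted) + correction
        if not correction:
            break
    return out

print(speculative_decode("the "))
```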
Inference with Reference
- Performs mini-RAG-style retrieval over the input context during decoding
- Copies likely spans from the prompt rather than generating them token by token
- Adds some memory overhead
- Useful for maintaining context accuracy
Parallel Decoding
Rather than strictly sequential token generation, this method:
- Generates multiple tokens simultaneously rather than strictly one at a time
- Uses a verification or resolution step to keep the parallel guesses coherent
- Algorithmically complex but offers significant speed benefits
- Lookahead decoding is a well-known example of this approach
Attention Optimization
Several strategies exist for optimizing attention mechanisms:
- Key-Value Cache Optimization
- Critical for large context windows, since the cache grows with context length
- Requires substantial memory
- Various techniques exist for reducing its size (see the sketch after this list)
- Specialized Attention Kernels
- FlashAttention is the leading example
- Implementations are typically tuned for specific hardware
- FlashAttention-3, for instance, targets NVIDIA H100 (Hopper) GPUs
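Here is the promised KV cache sketch: a single attention head during decoding, where each step appends the new token’s key and value instead of recomputing them for the whole prefix (numpy, toy dimensions):

```python
# Minimal sketch of a KV cache for one attention head during decoding.
# Without the cache, every step would recompute K and V for the whole prefix;
# with it, each step only computes K and V for the newest token.
import numpy as np

d = 64                                  # head dimension
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

k_cache, v_cache = [], []               # grows by one entry per generated token

def attend(x_new: np.ndarray) -> np.ndarray:
    """One decode step: append the new token's K/V, attend over the whole cache."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)
    v_cache.append(x_new @ W_v)
    K = np.stack(k_cache)               # (t, d): all cached keys so far
    V = np.stack(v_cache)               # (t, d)
    scores = K @ q / np.sqrt(d)         # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                  # attention output for the new token

for step in range(5):
    token_embedding = rng.standard_normal(d)
    out = attend(token_embedding)

print(f"cache holds {len(k_cache)} keys/values of {d} floats each")
# Memory grows linearly with context length (times layers x heads in a real model),
# which is why cache size reduction matters for long contexts.
```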
Service-Level Optimization
Batching Strategies
- Static Batching
- Processes fixed-size batches
- Waits for complete batch (e.g., 100 requests)
- Simple but potentially inefficient
- Dynamic Batching
- Uses time windows for batch formation
- Processes incomplete batches after timeout
- Balances latency and throughput (see the sketch after this list)
- Continuous Batching
- Returns completed responses immediately
- Dynamically manages resource utilization
- Similar to a bus route that continuously picks up new passengers
- Optimizes occupation rate
- Introduced by the Orca paper
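Here is the promised batching sketch: dynamic batching with a size cap and a timeout, using asyncio. Continuous batching takes the same idea further by admitting and retiring requests between individual decoding steps rather than whole requests:

```python
# Sketch of dynamic batching: collect requests until the batch is full OR a
# timeout expires, whichever comes first.
import asyncio
import time

MAX_BATCH = 8
MAX_WAIT_S = 0.05

async def run_model(prompts):
    await asyncio.sleep(0.02)                     # stand-in for one forward pass
    return [f"response to {p}" for p in prompts]

async def batcher(queue: asyncio.Queue):
    while True:
        first = await queue.get()                 # block until a request arrives
        batch = [first]
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                             # timeout: run a partial batch
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        results = await run_model([prompt for prompt, _ in batch])
        for (_, future), result in zip(batch, results):
            future.set_result(result)             # hand each caller its answer

async def submit(queue: asyncio.Queue, prompt: str) -> str:
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    server = asyncio.create_task(batcher(queue))
    answers = await asyncio.gather(*(submit(queue, f"req-{i}") for i in range(5)))
    print(answers)
    server.cancel()

asyncio.run(main())
```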
Prefill-Decode Decoupling
- Separates prefill and decode operations
- Essential for large-scale inference providers
- Allows optimal resource allocation
- Improves overall system efficiency
Prompt Caching
- Reuses the processed (prefilled) state of text segments that overlap across requests, such as a long shared system prompt
- Offered by providers such as Google (Gemini) and Anthropic
- May incur storage costs
- Requires careful cost-benefit analysis
- Must be explicitly enabled
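As a toy illustration of the idea (not any provider’s actual mechanism), here is a prefix cache that pays the expensive prefill cost once for a long shared system prompt and reuses it on subsequent requests:

```python
# Toy illustration of prompt caching: reuse the expensive prefill work for a
# shared prefix (e.g. a long system prompt) across requests.
import hashlib
import time

prefix_cache = {}                            # hash of prefix -> processed state

def expensive_prefill(text: str):
    time.sleep(0.5)                          # stand-in for encoding the prefix into a KV cache
    return f"<kv-state for {len(text)} chars>"

def answer(system_prompt: str, user_message: str) -> str:
    key = hashlib.sha256(system_prompt.encode()).hexdigest()
    if key not in prefix_cache:              # cache miss: pay the prefill cost once
        prefix_cache[key] = expensive_prefill(system_prompt)
    state = prefix_cache[key]
    return f"answer to {user_message!r} using {state}"

SYSTEM = "You are a meticulous assistant. " * 200    # long prefix shared across requests

start = time.perf_counter()
answer(SYSTEM, "first question")
print(f"cold request: {time.perf_counter() - start:.2f}s (prefill paid)")

start = time.perf_counter()
answer(SYSTEM, "second question")
print(f"warm request: {time.perf_counter() - start:.2f}s (prefill skipped)")
```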
Parallelism Strategies
- Replica Parallelism
- Creates multiple copies of the model
- Distributes requests across replicas
- Simplest form of parallelism
- Tensor Parallelism
- Splits individual tensors across devices
- Enables processing of larger models
- Requires communication between devices to combine partial results (see the sketch after this list)
- Pipeline Parallelism
- Divides model computation into stages
- Assigns stages to different devices
- Optimizes resource utilization
- Reduces memory requirements
- Context Parallelism
- Processes different parts of input context in parallel
- Particularly useful for long sequences
- Can significantly reduce latency
- Sequence Parallelism
- Splits operations that act along the sequence dimension (such as layer norm and dropout) across devices
- Typically used together with tensor parallelism
- Requires careful implementation
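Here is the promised tensor parallelism sketch: a single linear layer split column-wise across two “devices” (simulated with plain numpy arrays), showing that the partial results recombine into exactly the full computation:

```python
# Sketch of tensor parallelism for one linear layer: split the weight matrix
# column-wise across two "devices", compute partial results independently,
# then gather. Real implementations do the same thing across actual GPUs.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 1024, 4096
W = rng.standard_normal((d_in, d_out)).astype(np.float32)
x = rng.standard_normal((1, d_in)).astype(np.float32)

# Column-parallel split: each device holds half of the output columns.
W_dev0, W_dev1 = np.split(W, 2, axis=1)

y_dev0 = x @ W_dev0        # runs on device 0
y_dev1 = x @ W_dev1        # runs on device 1 at the same time

# An all-gather (here just a concatenate) reassembles the full activation.
y_parallel = np.concatenate([y_dev0, y_dev1], axis=1)
y_reference = x @ W

print(np.allclose(y_parallel, y_reference))   # True: same result, half the weights per device
```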
Implementation Considerations
When implementing inference optimizations:
- Multiple optimization techniques are typically combined in production
- Hardware-specific optimizations require careful testing
- Service-level optimizations often provide significant gains with minimal model modifications
- Optimization choices depend heavily on specific use cases and requirements