Finally back on track and reading the next chapter of Chip Huyen’s book, ‘AI Engineering’. Here are my notes on the chapter.
Overview and Core Philosophy
“Data will be mostly just toil, tears and sweat.”
This is how we start the chapter :) This candid assessment frames dataset engineering as a discipline that requires both technical sophistication and pragmatic persistence. The chapter arguably could have come earlier in the book, but its placement lets it build on concepts established in previous chapters.
Data Curation: The Foundation
Data curation spans use cases including pre-training, training from scratch, and fine-tuning, with specific considerations for chain-of-thought reasoning and tool use. The process rests on three fundamental aspects:
- Data Quality: The equivalent of ingredient quality in cooking
- Data Coverage: Analogous to having the right mix of ingredients
- Data Quantity: Determining the optimal volume of ingredients
Quality Criteria
Data quality encompasses multiple dimensions:
- Relevance to task requirements
- Consistency in format and structure
- Sufficient uniqueness
- Regulatory compliance (especially critical in regulated industries)
Coverage Considerations
Coverage involves strategic decisions about data proportions:
- Large language models often utilize significant code data (up to 50%) in training, which appears to enhance logical reasoning capabilities beyond just coding
- Language distribution can be surprisingly efficient (even 1% representation of a language can enable meaningful capabilities)
- Training proportions may vary across different stages of the training process
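To make the proportions concrete, here is a minimal sketch of how a training mixture might be sampled. The mixture weights are purely illustrative; the chapter does not prescribe exact proportions beyond noting that code-heavy mixes and thin language slices both show up in practice.

```python
import random

# Illustrative mixture weights only; not a recommendation from the chapter.
MIXTURE = {
    "web_text_en": 0.40,
    "code":        0.50,   # code data seems to help reasoning, not just coding
    "web_text_fr": 0.01,   # a thin slice of a language can still be meaningful
    "books":       0.09,
}

def sample_source(rng: random.Random) -> str:
    """Pick a data source according to the mixture weights."""
    r, cumulative = rng.random(), 0.0
    for source, weight in MIXTURE.items():
        cumulative += weight
        if r < cumulative:
            return source
    return list(MIXTURE)[-1]  # guard against floating-point drift

rng = random.Random(0)
counts = {}
for _ in range(10_000):
    source = sample_source(rng)
    counts[source] = counts.get(source, 0) + 1
print(counts)  # roughly mirrors the weights above
```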
Quantity and Optimization
A key phenomenon discussed is ossification, where extensive pre-training can effectively freeze model weights, potentially hampering fine-tuning adaptability. This effect is particularly pronounced in smaller models.
Key quantity considerations include:
- Task complexity correlation with data requirements
- Base model performance implications
- Model size considerations (OpenAI notes that with ~100 examples, more advanced models show superior fine-tuning performance)
- Potential for using lower quality or less relevant data for initial fine-tuning to reduce high-quality data requirements
- Recognition of performance plateaus where additional data yields diminishing returns
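One practical way to find that plateau is a data-scaling check: fine-tune on progressively larger subsets and stop adding data once the evaluation score stops moving. The sketch below stubs out the training-and-evaluation step (`finetune_and_eval` is a hypothetical placeholder) so the scaffolding runs on its own.

```python
import random

def finetune_and_eval(examples: list) -> float:
    """Hypothetical stub: replace with a real fine-tune + evaluation loop.
    Returns a toy score that saturates as the dataset grows."""
    n = len(examples)
    return round(1.0 - 1.0 / (1.0 + n / 500), 3)

dataset = [{"prompt": f"q{i}", "response": f"a{i}"} for i in range(5_000)]
random.Random(0).shuffle(dataset)

for size in (100, 500, 1_000, 2_000, 5_000):
    score = finetune_and_eval(dataset[:size])
    print(f"{size:>5} examples -> eval score {score}")
# If the score barely moves between the last two sizes, more data of the
# same kind is probably past the plateau.
```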
Data Acquisition Process
The chapter provides a detailed example workflow for creating an instruction-response dataset:
- Initial dataset identification (~10,000 examples)
- Low-quality instruction removal (reducing to ~9,000)
- Low-quality response filtering (removing 3,000)
- Manual response writing for the ~3,000 instructions whose responses were removed
- Topic gap identification and template creation (100 templates)
- AI synthesis of 2,000 new instructions
- Manual annotation of synthetic instructions
Final result: 11,000 high-quality examples
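A rough sketch of that curation flow in code, with toy heuristics standing in for the real quality filters and with the AI-synthesis step assumed to have already produced its ~2,000 instructions:

```python
def instruction_ok(example: dict) -> bool:
    """Toy instruction filter: non-empty and reasonably specific."""
    return len(example["instruction"].split()) >= 4

def response_ok(example: dict) -> bool:
    """Toy response filter: non-empty and not an outright refusal."""
    response = example["response"].strip().lower()
    return bool(response) and not response.startswith("i cannot")

def curate(raw: list, synthesized: list) -> list:
    kept = [ex for ex in raw if instruction_ok(ex)]             # ~10k -> ~9k
    good = [ex for ex in kept if response_ok(ex)]               # ~6k keep their responses
    needs_rewrite = [ex for ex in kept if not response_ok(ex)]  # ~3k to annotate
    for ex in needs_rewrite:
        ex["response"] = "TODO: write response manually"        # the manual step
    return good + needs_rewrite + synthesized                   # ~9k + 2k = 11k
```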
Data Augmentation and Synthesis
Synthesis Objectives
- Increasing data quantity
- Expanding coverage
- Enhancing quality
- Addressing privacy concerns
- Enabling model distillation
Notable Research: An Anthropic paper (2022) found that language model-generated datasets can match or exceed human-written ones in quality for certain tasks.
Note that some teams actually prefer AI-generated preference data, since human annotators fatigue and label inconsistently.
Synthesis Applications
The chapter distinguishes between pre-training and post-training synthesis:
- Synthetic data appears more frequently in post-training
- Pre-training limitation: AI can reshape existing knowledge but struggles to synthesize new knowledge
LLaMA 3 Synthesis Pipeline
A comprehensive workflow example:
- AI generation of problem descriptions
- Solution generation in multiple programming languages
- Unit test generation
- Error correction
- Cross-language translation with test verification
- Conversation and documentation generation with back-translation verification
This pipeline generated 2.7 million synthetic coding examples for LLaMA 3.1’s supervised fine-tuning.
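A heavily compressed sketch of what such a generate-solve-test loop can look like. Here `llm` is a hypothetical text-generation callable, not any specific API, and only the unit-test verification step (the main quality gate) is fleshed out; the error-correction and cross-language translation steps are omitted.

```python
import os
import subprocess
import tempfile

def passes_tests(solution: str, tests: str) -> bool:
    """Run the generated solution against its generated unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=30)
        return result.returncode == 0
    finally:
        os.remove(path)

def synthesize_examples(llm, n: int) -> list:
    """`llm` is a hypothetical callable: prompt string in, completion out."""
    examples = []
    for _ in range(n):
        problem = llm("Write a self-contained programming problem description.")
        solution = llm(f"Solve this problem in Python:\n{problem}")
        tests = llm(f"Write unittest-style tests for this problem:\n{problem}")
        if passes_tests(solution, tests):   # keep only verified pairs
            examples.append({"problem": problem, "solution": solution})
    return examples
```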
Model Collapse Considerations
The chapter addresses the risk of model collapse in synthetic data usage:
- Potential loss of training signal through repeated synthetic data use
- Current research suggests proper implementation can avoid collapse
- Importance of quality control in synthetic data generation
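One commonly cited mitigation (not a recipe from the chapter) is to keep the human data in every training mix and cap the synthetic share; the 30% cap below is purely illustrative.

```python
import random

def build_training_mix(human: list, synthetic: list,
                       max_synth_fraction: float = 0.3, seed: int = 0) -> list:
    """Keep all human data; cap synthetic data at a fixed share of the mix."""
    rng = random.Random(seed)
    cap = int(len(human) * max_synth_fraction / (1 - max_synth_fraction))
    mix = human + rng.sample(synthetic, min(cap, len(synthetic)))
    rng.shuffle(mix)
    return mix
```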
Model Distillation
Notable example: BuzzFeed fine-tuned Flan-T5 with LoRA on examples generated by OpenAI's text-davinci-003, achieving an 80% inference cost reduction.
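For flavor, a minimal sketch of LoRA fine-tuning a Flan-T5 model on teacher-generated pairs using the `transformers` and `peft` libraries. This is not BuzzFeed's actual pipeline; the model size, LoRA hyperparameters, and dataset are assumptions for illustration.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "google/flan-t5-base"           # assumed model size, for illustration
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

lora = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q", "v"],          # T5 attention projections
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()      # only a small fraction of weights train

# `distilled_pairs` would come from the teacher model's generations
distilled_pairs = [("Summarize: <article text>", "A short summary ...")]
```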
Data Processing Best Practices
Expert Tip: “Manual inspection of data has probably the highest value to prestige ratio of any activity in machine learning.” - Greg Brockman, OpenAI co-founder
Processing Guidelines
The chapter emphasizes efficiency optimization:
- Order optimization (e.g., deduplication before cleaning if computationally advantageous)
- Trial run validation before full dataset processing
- Data preservation (avoid in-place modifications; see the sketch after this list)
- Original data retention for:
  - Alternative processing needs
  - Team requirements
  - Error recovery
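A small sketch of the "keep the original, trial-run first" advice: read from a raw path that is never overwritten, process a sample, inspect it, then write the full output to a separate location. The file paths are illustrative.

```python
import json
from pathlib import Path

RAW = Path("data/raw/instructions.jsonl")        # original, never overwritten
OUT = Path("data/processed/instructions.jsonl")  # processed copy lives elsewhere

def clean(example: dict) -> dict:
    example["instruction"] = example["instruction"].strip()
    return example

def run(sample_size=None):
    OUT.parent.mkdir(parents=True, exist_ok=True)
    with RAW.open() as src, OUT.open("w") as dst:
        for i, line in enumerate(src):
            if sample_size is not None and i >= sample_size:
                break
            dst.write(json.dumps(clean(json.loads(line))) + "\n")

run(sample_size=100)   # trial run: inspect the output before...
run()                  # ...committing to the full pass
```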
Technical Processing Approaches
Deduplication strategies include:
- Pairwise comparison
- Hashing methods
- Dimensionality reduction techniques
Multiple libraries are referenced (page 400) for implementation.
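As a baseline, exact-duplicate removal with a hash of normalized text looks like the sketch below; near-duplicate detection (MinHash/LSH, embedding similarity) follows the same shape with a fuzzier key.

```python
import hashlib

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def dedupe(examples: list, field: str = "text") -> list:
    seen, unique = set(), []
    for example in examples:
        key = hashlib.sha256(normalize(example[field]).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique

docs = [{"text": "Hello  world"}, {"text": "hello world"}, {"text": "Bye"}]
print(len(dedupe(docs)))  # 2: the first two normalize to the same string
```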
Data Cleaning and Formatting
- HTML tag removal for signal enhancement
- Careful prompt template formatting (sketched below), crucial for:
  - Fine-tuning operations
  - Instruction tuning
  - Model performance optimization
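A sketch of both steps: strip markup so the model trains on text rather than HTML, then render each pair with the prompt template the model will also see at inference time. The template and the regex-based stripping are deliberately simple illustrations.

```python
import html
import re

# Illustrative template; match whatever template the model will see at inference.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def strip_html(raw: str) -> str:
    text = re.sub(r"<[^>]+>", " ", raw)   # crude tag removal
    text = html.unescape(text)            # &amp; -> &, etc.
    return " ".join(text.split())

def to_training_text(example: dict) -> str:
    return PROMPT_TEMPLATE.format(
        instruction=strip_html(example["instruction"]),
        response=strip_html(example["response"]),
    )

print(to_training_text({
    "instruction": "<p>Summarize the article:</p><div>...</div>",
    "response": "The article argues that &amp; ...",
}))
```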
Data Inspection
The chapter emphasizes the importance of manual data inspection:
- Utilize various data exploration tools
- Dedicate time to direct data examination (recommended: 15 minutes of direct observation)
- Consider this step non-optional in the process
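A tiny helper in that spirit: sample a handful of examples and read them. The file path and field names are assumptions; the point is simply to spend the time looking.

```python
import json
import random
from pathlib import Path

def show_random_examples(path: str, k: int = 20, seed: int = 0) -> None:
    """Print k random examples from a JSONL file for manual reading."""
    with Path(path).open() as f:
        rows = [json.loads(line) for line in f]
    for example in random.Random(seed).sample(rows, min(k, len(rows))):
        print("INSTRUCTION:", example.get("instruction", "")[:200])
        print("RESPONSE:   ", example.get("response", "")[:200])
        print("-" * 60)

# show_random_examples("data/processed/instructions.jsonl")
```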