What Is Context Engineering?
If you've been working with AI models in 2026, you've probably noticed something: the quality of your prompts matters less than the quality of the context you feed your models. This shift has a name: context engineering. A new peer-reviewed paper spanning 9,649 experiments shows why it's replacing prompt engineering as the critical skill for AI practitioners.
Context engineering is the systematic practice of structuring, formatting, and delivering information to large language models (LLMs) through their context windows. Unlike prompt engineering, which focuses on how you ask, context engineering focuses on what information surrounds your request — the schemas, files, data formats, and retrieval architecture that determine whether a model succeeds or fails at complex tasks.
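To make that concrete, here is a minimal Python sketch of the idea: the question stays fixed while the surrounding context is assembled deliberately from schema files. The directory layout and the call_llm() helper are placeholders for whatever stack you use, not anything prescribed by the paper discussed below.

```python
from pathlib import Path

def build_context(question: str, schema_dir: str, max_files: int = 3) -> str:
    """Assemble structured context around a fixed question."""
    sections = []
    for path in sorted(Path(schema_dir).glob("*.yaml"))[:max_files]:
        sections.append(f"## Schema: {path.name}\n{path.read_text()}")
    # The request itself barely changes; the information around it does.
    return "\n\n".join(sections) + f"\n\n## Task\n{question}"

# prompt = build_context("Which tables store customer billing data?", "schemas/")
# answer = call_llm(prompt)  # call_llm is a stand-in for any model client
```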
The paper "Structured Context Engineering for File-Native Agentic Systems" by Damon McMillan, published February 2026, provides the first large-scale empirical study of how context structure affects LLM agent performance.
The Study: 9,649 Experiments Across 11 Models
McMillan's research tested:
- 11 models spanning frontier and open-source tiers
- 4 data formats: YAML, Markdown, JSON, and TOON
- Schema scales from 10 to 10,000 database tables
- Two architectures: single-context vs. file-based context retrieval
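To make the architectural distinction concrete, here is a hedged sketch of the two setups, assuming a directory of schema files. The llm() placeholder and the file-selection step are illustrative, not the paper's actual harness.

```python
from pathlib import Path

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def single_context_run(question: str, schema_dir: str) -> str:
    """Single-context: every schema file is inlined into one large prompt."""
    schema_text = "\n\n".join(
        p.read_text() for p in sorted(Path(schema_dir).glob("*")) if p.is_file()
    )
    return llm(f"{schema_text}\n\nQuestion: {question}")

def file_based_run(question: str, schema_dir: str) -> str:
    """File-based retrieval: show an index first, then load only requested files."""
    files = {p.name: p for p in Path(schema_dir).glob("*") if p.is_file()}
    index = "\n".join(sorted(files))
    wanted = llm(f"Schema files:\n{index}\n\nList the files needed for: {question}")
    picked = [files[name] for name in wanted.split() if name in files]
    context = "\n\n".join(p.read_text() for p in picked)
    return llm(f"{context}\n\nQuestion: {question}")
```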
Finding #1: Model Choice Dwarfs Everything Else
Frontier models (Claude Opus 4.5, GPT-5.2, Gemini 2.5 Pro) outperformed open-source models by a massive 21 percentage points on accuracy. That gap dwarfs any effect from format choice or retrieval architecture.
| Tier | Models Tested | Relative Accuracy |
|---|---|---|
| Frontier | Claude Opus 4.5, GPT-5.2, Gemini 2.5 Pro | +21 pts vs. open-source |
| Open Source | DeepSeek V3.2, Kimi K2, Llama 4 | Baseline |
Takeaway: Model selection is your highest-leverage decision — not prompt tweaking.
Finding #2: File-Based Context Helps Frontier Models, Hurts Open Source
For frontier models, file-based context retrieval improved accuracy by 2.7% (p=0.029). For open-source models, however, it reduced accuracy by 7.7% (p<0.001).
If you're using tools like Claude Code or similar AI coding agents, the filesystem-native workflow matters — but only if the model behind it can handle it.
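A practical way to act on this finding is to route by model tier. The model identifiers below are assumptions for illustration; only the direction of the effect comes from the study.

```python
# Route context architecture by model tier, per Finding #2.
FRONTIER_MODELS = {"claude-opus-4.5", "gpt-5.2", "gemini-2.5-pro"}  # assumed IDs

def choose_architecture(model_name: str) -> str:
    """Frontier models benefited from file-based retrieval; open-source models did not."""
    if model_name in FRONTIER_MODELS:
        return "file_based"      # +2.7% accuracy for frontier models in the study
    return "single_context"      # file-based retrieval cost open-source models 7.7%
```

In an agent harness, the returned string would simply select between the single_context_run and file_based_run paths sketched earlier.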
Finding #3: Format Doesn't Matter (Much)
Format choice had no statistically significant effect on aggregate accuracy (chi-squared=2.45, p=0.484). Whether you use YAML, Markdown, JSON, or TOON, the models performed roughly the same.
| Format | Token Efficiency | Best For |
|---|---|---|
| YAML | Good | Config files |
| Markdown | Moderate | Human-readable docs |
| JSON | Verbose | Programmatic interop |
| TOON | Most compact | Token-constrained scenarios |
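If you want to verify the overhead on your own data rather than trust the table, a quick token count is enough. The sketch below serializes one record as JSON, YAML, and Markdown and counts tokens with tiktoken's cl100k_base encoding; TOON is omitted because it needs its own encoder, and the tokenizer choice is an assumption rather than what the paper used.

```python
import json
import yaml       # pip install pyyaml
import tiktoken   # pip install tiktoken

record = {"table": "orders", "columns": ["id", "customer_id", "total", "created_at"]}

as_json = json.dumps(record, indent=2)
as_yaml = yaml.safe_dump(record, sort_keys=False)
as_md = "| table | columns |\n|---|---|\n| orders | id, customer_id, total, created_at |"

enc = tiktoken.get_encoding("cl100k_base")
for name, text in [("json", as_json), ("yaml", as_yaml), ("markdown", as_md)]:
    print(f"{name:10s} {len(enc.encode(text)):4d} tokens")
```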
Finding #4: The "Grep Tax" — Compact Doesn't Mean Faster
TOON, designed to minimize tokens, actually caused models to spend more tokens reasoning about an unfamiliar format. Familiarity beats compression.
Finding #5: File-Native Agents Scale to 10,000 Tables
File-native agents can navigate databases with up to 10,000 tables by splitting the schema into domain-partitioned files, far more than fits in any single context window.
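Here is a minimal sketch of what domain partitioning can look like, assuming table names carry a domain prefix; the paper may partition differently, so treat this as one possible layout.

```python
from collections import defaultdict
from pathlib import Path

def write_partitions(tables: dict[str, str], out_dir: str) -> None:
    """tables maps table name -> DDL; the domain is taken from the name prefix."""
    by_domain = defaultdict(list)
    for name, ddl in tables.items():
        domain = name.split("_", 1)[0]          # e.g. "billing_invoices" -> "billing"
        by_domain[domain].append(ddl)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for domain, ddls in by_domain.items():
        # One file per domain; an agent reads only the partitions it needs.
        (out / f"{domain}.sql").write_text("\n\n".join(ddls))

# write_partitions({"billing_invoices": "CREATE TABLE billing_invoices (...);",
#                   "auth_users": "CREATE TABLE auth_users (...);"}, "schemas/")
```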
Prompt Engineering vs. Context Engineering
| Dimension | Prompt Engineering | Context Engineering |
|---|---|---|
| Focus | How you ask | What surrounds the question |
| Scope | Single instruction | Entire info architecture |
| Scale | Hundreds of tokens | Thousands to millions |
| Primary lever | Wording, examples | Data format, retrieval, file org |
| Impact | Marginal at frontier | +2.7% to -7.7% depending on model |
Practical Takeaways
- Invest in model selection first — the 21-point gap is the largest effect
- Match architecture to model — file-based for frontier, single-context for open-source
- Don't obsess over format — no significant aggregate difference
- Beware the grep tax — familiar formats > hyper-optimized ones
- Organize for scale — domain-partitioned file structures work up to 10K tables
The Future of Context Engineering
Expect evolution in:
- Dynamic context assembly — agents auto-determining needed context
- Context-aware fine-tuning — models trained for specific context structures
- Standardized context protocols — like Anthropic's Model Context Protocol
- Context compression — better approaches without the grep tax
Originally published on Serenities AI