Everything you need to know about Context Windows, Prompt Engineering, and Building Better AI Systems
Maintained by Milan Amrut Joshi, Professor of Data Science, Northwestern University
A curated, research-backed guide to the emerging discipline of Context Engineering for Large Language Models.
Papers · Videos · Blog Posts · Tools · Techniques · Courses · Roadmap
- What is Context Engineering?
- Context Engineering vs Prompt Engineering
- Why It Matters (2025-2026)
- Key Concepts
- Context Window Sizes
- Research Papers
- YouTube Videos & Talks
- Blog Posts & Articles
- Tools & Frameworks
- Courses & Tutorials
- Techniques & Patterns
- Roadmap
- Contributing
- Citation
Context Engineering is the art and science of designing, managing, and optimizing the information provided to Large Language Models (LLMs) within their context window to maximize the quality, accuracy, and relevance of their outputs.
While prompt engineering focuses on how you ask, context engineering focuses on what information surrounds your ask: the retrieval strategy, the memory architecture, the token budget allocation, the ordering of information, and the system-level design of context pipelines.
```
┌────────────────────────────────────────────────────────────┐
│                       CONTEXT WINDOW                       │
│                                                            │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │    SYSTEM    │  │  RETRIEVED   │  │   CONVERSATION   │  │
│  │    PROMPT    │  │  DOCUMENTS   │  │     HISTORY      │  │
│  │              │  │    (RAG)     │  │                  │  │
│  │ - Role       │  │ - Chunks     │  │ - Past turns     │  │
│  │ - Rules      │  │ - Metadata   │  │ - Summaries      │  │
│  │ - Examples   │  │ - Rankings   │  │ - Key facts      │  │
│  └──────────────┘  └──────────────┘  └──────────────────┘  │
│                                                            │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │   TOOLS &    │  │   FEW-SHOT   │  │       USER       │  │
│  │   SCHEMAS    │  │   EXAMPLES   │  │      QUERY       │  │
│  │              │  │              │  │                  │  │
│  │ - Functions  │  │ - Input/     │  │ - Current        │  │
│  │ - APIs       │  │   Output     │  │   request        │  │
│  │ - Formats    │  │   pairs      │  │ - Constraints    │  │
│  └──────────────┘  └──────────────┘  └──────────────────┘  │
│                                                            │
│               ▼  Token Budget Management  ▼                │
│                 ▼  Information Ordering  ▼                 │
│                 ▼  Relevance Filtering  ▼                  │
└────────────────────────────────────────────────────────────┘
```
| Dimension | Prompt Engineering | Context Engineering |
|---|---|---|
| Focus | Crafting the query/instruction | Designing the entire information environment |
| Scope | Single prompt | Full context pipeline (retrieval, memory, tools) |
| Abstraction | Text-level | System-level architecture |
| Key Question | "How do I phrase this?" | "What information does the model need, and how should it be structured?" |
| Includes | Instructions, few-shot examples | RAG, memory, tool definitions, token budgets, ordering |
| Skill Level | 🟢 Beginner to Intermediate | 🟡 Intermediate to Advanced |
| Optimization | Wording, formatting, chain-of-thought | Retrieval quality, chunking, compression, caching |
| Analogy | Writing a good exam question | Designing the entire exam prep system |
| Dynamic? | Mostly static templates | Dynamic, adapts per query and session |
| Measurable Impact | Quality of single response | System-level accuracy, cost, latency |
- Context windows are exploding: from 4K tokens (GPT-3) to 2M+ tokens (Gemini). Managing this space effectively is a core engineering challenge.
- RAG is now standard: every production LLM application uses some form of retrieval. Context engineering defines how retrieved data is structured and ranked.
- Agentic AI demands it: AI agents that use tools, maintain memory, and plan across steps require sophisticated context management.
- Cost optimization: tokens cost money. Smart context engineering reduces costs by 50-90% while maintaining quality.
- Accuracy at scale: the "lost in the middle" problem and context dilution mean that more context is not always better. Engineering is required.
Context Window
The fixed-size buffer of tokens an LLM can process in a single forward pass. Everything the model "knows" at inference time must fit within this window: system prompt, retrieved documents, conversation history, tool schemas, and the user query.
Token Limits & Budget Allocation
Given a finite context window, context engineering involves deciding how many tokens to allocate to each component. A common allocation:
- System prompt: 5-10%
- Retrieved documents: 40-60%
- Conversation history: 15-25%
- Few-shot examples: 5-10%
- User query + response buffer: 10-20%
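As a rough illustration, this split can be expressed as a budget helper. The component names and midpoint percentages below are illustrative defaults drawn from the ranges above, not a standard:

```python
# Hypothetical helper: split a model's context window into per-component
# token budgets using midpoints of the ranges suggested above.

DEFAULT_ALLOCATION = {
    "system_prompt": 0.075,        # 5-10%
    "retrieved_documents": 0.50,   # 40-60%
    "conversation_history": 0.20,  # 15-25%
    "few_shot_examples": 0.075,    # 5-10%
    "query_and_response": 0.15,    # 10-20%
}

def allocate_budget(context_window: int, allocation=DEFAULT_ALLOCATION) -> dict:
    """Return a token budget per component; fractions must sum to 1.0."""
    assert abs(sum(allocation.values()) - 1.0) < 1e-6
    return {name: round(context_window * frac) for name, frac in allocation.items()}

budget = allocate_budget(128_000)
print(budget["retrieved_documents"])  # 64000
```

In practice these fractions shift per application; a RAG-heavy system might push retrieved documents to 60% while an agentic system reserves more room for tool schemas and history.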
Retrieval-Augmented Generation (RAG)
The pattern of retrieving relevant documents from an external knowledge base and injecting them into the context window. RAG bridges the gap between parametric knowledge (model weights) and non-parametric knowledge (external data).
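A minimal sketch of the pattern, using bag-of-words cosine similarity as a stand-in for a real embedding model. The corpus, prompt wording, and `top_k` parameter are all illustrative:

```python
import math
from collections import Counter

def vectorize(text: str) -> Counter:
    # Bag-of-words term counts; a real system would use dense embeddings.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    q = vectorize(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, vectorize(doc)), reverse=True)
    return ranked[:top_k]

def build_context(query: str, corpus: list[str]) -> str:
    # Inject retrieved documents ahead of the question.
    docs = retrieve(query, corpus)
    context = "\n".join(f"[doc {i + 1}] {d}" for i, d in enumerate(docs))
    return f"Answer using only these documents:\n{context}\n\nQuestion: {query}"

corpus = [
    "The context window is the token buffer an LLM processes at once.",
    "Tomatoes are botanically fruits.",
    "RAG injects retrieved documents into the context window.",
]
print(build_context("What is a context window?", corpus))
```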
Memory Management
Strategies for maintaining information across sessions or long conversations: summarization, key-fact extraction, vector-based episodic memory, and hierarchical memory architectures (short-term, long-term, working memory).
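A sketch of the short-term/long-term split: keep the last few turns verbatim and fold older turns into a running summary. `summarize` is a stub standing in for what would typically be an LLM call:

```python
def summarize(old_summary: str, turns: list[str]) -> str:
    # Stub: a real system would ask the model to compress these turns.
    return (old_summary + " " + " ".join(t[:40] for t in turns)).strip()

class ConversationMemory:
    def __init__(self, keep_recent: int = 4):
        self.keep_recent = keep_recent
        self.summary = ""
        self.turns: list[str] = []

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        if len(self.turns) > self.keep_recent:
            # Fold overflow turns into the long-term summary.
            overflow = self.turns[:-self.keep_recent]
            self.summary = summarize(self.summary, overflow)
            self.turns = self.turns[-self.keep_recent:]

    def as_context(self) -> str:
        header = f"Summary of earlier conversation: {self.summary}\n" if self.summary else ""
        return header + "\n".join(self.turns)

mem = ConversationMemory(keep_recent=2)
for t in ["user: hi", "bot: hello", "user: what is RAG?", "bot: retrieval..."]:
    mem.add(t)
print(mem.as_context())
```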
Context Compression
Techniques to reduce token usage while preserving information: extractive summarization, LLMLingua-style token pruning, semantic deduplication, and information-theoretic compression.
Information Ordering
The position of information within the context window affects recall. Models exhibit primacy and recency biases. Context engineering accounts for this by placing critical information at the beginning and end of the context.
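One common mitigation can be sketched as an interleaving: given documents ranked best-first, place the strongest ones at the edges of the context so the weakest land in the middle. The function below is an illustrative helper, not a library API:

```python
def edge_ordering(ranked_docs: list[str]) -> list[str]:
    """Reorder best-first docs so the strongest sit at the start and end."""
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        # Even ranks go to the front, odd ranks to the (reversed) back.
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

ranked = ["doc1 (best)", "doc2", "doc3", "doc4", "doc5 (worst)"]
print(edge_ordering(ranked))
# → ['doc1 (best)', 'doc3', 'doc5 (worst)', 'doc4', 'doc2']
```

The best document stays first, the second-best moves to the end, and the worst ends up buried in the middle where recall is weakest.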
| Model | Context Window | Provider | Year | Notes |
|---|---|---|---|---|
| GPT-4o | 128K tokens | OpenAI | 2024 | Multimodal, widely deployed |
| o3 | 200K tokens | OpenAI | 2025 | Reasoning model, extended context |
| Claude 3.5 Sonnet | 200K tokens | Anthropic | 2024 | Strong long-context performance |
| Claude Opus 4 | 200K tokens | Anthropic | 2025 | Frontier model |
| Claude Sonnet 4 | 200K tokens | Anthropic | 2025 | Balanced performance and speed |
| Gemini 2.0 Flash | 1M tokens | Google | 2025 | Fast, extended context |
| Gemini 2.0 Pro | 2M tokens | Google | 2025 | Largest production context window |
| Llama 3.1 405B | 128K tokens | Meta | 2024 | Open-weight |
| Llama 4 Maverick | 1M tokens | Meta | 2025 | Open-weight, MoE architecture |
| Mistral Large 2 | 128K tokens | Mistral | 2024 | European AI lab |
| DeepSeek V3 | 128K tokens | DeepSeek | 2025 | MoE, cost-efficient |
| DeepSeek R1 | 128K tokens | DeepSeek | 2025 | Reasoning-focused |
| Command R+ | 128K tokens | Cohere | 2024 | RAG-optimized |
| Grok-2 | 128K tokens | xAI | 2024 | Real-time data access |
| Qwen 2.5 72B | 128K tokens | Alibaba | 2024 | Multilingual |
Note: Context window size alone does not determine quality. Effective utilization across the full window varies significantly between models. See the RULER benchmark and Needle-in-a-Haystack for empirical evaluations.
Full details, abstracts, and annotations available in papers/README.md
| # | Paper | Authors | Year | Key Contribution |
|---|---|---|---|---|
| 1 | Lost in the Middle: How Language Models Use Long Contexts | Liu et al. | 2023 | |
| 2 | Extending Context Window of LLMs via Positional Interpolation | Chen et al. | 2023 | |
| 3 | LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens | Ding et al. | 2024 | |
| 4 | Ring Attention with Blockwise Transformers | Liu et al. | 2023 | |
| 5 | YaRN: Efficient Context Window Extension of LLMs | Peng et al. | 2023 | |
| 6 | Effective Long-Context Scaling of Foundation Models | Xiong et al. (Meta) | 2023 | |
| 7 | LongLoRA: Efficient Fine-tuning of Long-Context LLMs | Chen et al. | 2023 | |
| 8 | RULER: What's the Real Context Size of Your LLM? | Hsieh et al. | 2024 | |
| 9 | Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention | Munkhdalai et al. (Google) | 2024 | |
| 10 | Data Engineering for Scaling Language Models to 128K Context | Fu et al. | 2024 | |
| # | Paper | Authors | Year | Key Contribution |
|---|---|---|---|---|
| 11 | Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | Lewis et al. | 2020 | |
| 12 | Self-RAG: Learning to Retrieve, Generate, and Critique | Asai et al. | 2023 | |
| 13 | RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval | Sarthi et al. | 2024 | |
| 14 | Corrective Retrieval-Augmented Generation (CRAG) | Yan et al. | 2024 | |
| 15 | Active Retrieval Augmented Generation (FLARE) | Jiang et al. | 2023 | |
| 16 | Dense Passage Retrieval for Open-Domain QA | Karpukhin et al. | 2020 | |
| 17 | ColBERT: Efficient and Effective Passage Search via Late Interaction | Khattab & Zaharia | 2020 | |
| 18 | Adaptive-RAG: Learning to Adapt Retrieval-Augmented LLMs | Jeong et al. | 2024 | |
| 19 | Seven Failure Points When Engineering a RAG System | Barnett et al. | 2024 | |
| 20 | A Survey on RAG Meets LLMs | Fan et al. | 2024 | |
| # | Paper | Authors | Year | Key Contribution |
|---|---|---|---|---|
| 21 | Chain-of-Thought Prompting Elicits Reasoning in LLMs | Wei et al. | 2022 | |
| 22 | Tree of Thoughts: Deliberate Problem Solving with LLMs | Yao et al. | 2023 | |
| 23 | DSPy: Compiling Declarative Language Model Calls | Khattab et al. | 2023 | |
| 24 | Automatic Prompt Optimization with Gradient Descent and Beam Search | Pryzant et al. | 2023 | |
| 25 | Large Language Models Are Human-Level Prompt Engineers | Zhou et al. | 2022 | |
| 26 | Principled Instructions Are All You Need | Bsharat et al. | 2023 | |
| 27 | Graph of Thoughts: Solving Elaborate Problems with LLMs | Besta et al. | 2023 | |
| # | Paper | Authors | Year | Key Contribution |
|---|---|---|---|---|
| 28 | MemGPT: Towards LLMs as Operating Systems | Packer et al. | 2023 | |
| 29 | Reflexion: Language Agents with Verbal Reinforcement Learning | Shinn et al. | 2023 | |
| 30 | LLMLingua: Compressing Prompts for Accelerated Inference | Jiang et al. | 2023 | |
| 31 | Voyager: An Open-Ended Embodied Agent with LLMs | Wang et al. | 2023 | |
| 32 | Cognitive Architectures for Language Agents | Sumers et al. | 2023 | |
| 33 | LongMem: Augmenting LLMs with Long-Term Memory | Wang et al. | 2023 | |
| 34 | Walking Down the Memory Maze: Beyond Context Limit through Interactive Reading | Chen et al. | 2023 | |
Full playlist with timestamps and key takeaways in videos/README.md
Full list with summaries and key takeaways in blogs/README.md
Full comparison with features, pricing, and use cases in tools/README.md
| Tool | Description | Language | Stars | License |
|---|---|---|---|---|
| LangChain | Comprehensive LLM application framework | Python/JS | 98k+ | MIT |
| LlamaIndex | Data framework for LLM context augmentation | Python | 37k+ | MIT |
| DSPy | Programming (not prompting) language models | Python | 19k+ | MIT |
| Haystack | End-to-end NLP / RAG framework | Python | 17k+ | Apache 2.0 |
| Semantic Kernel | Microsoft's LLM orchestration SDK | C#/Python | 22k+ | MIT |
| CrewAI | Multi-agent orchestration framework | Python | 24k+ | MIT |
| AutoGen | Multi-agent conversation framework | Python | 35k+ | MIT |
| Database | Type | Hosted | Open Source | Key Feature |
|---|---|---|---|---|
| Pinecone | Cloud-native | Yes | No | Fully managed, enterprise-grade |
| Weaviate | Hybrid | Yes | Yes | GraphQL API, hybrid search |
| Chroma | Embedded/Cloud | Yes | Yes | Developer-friendly, lightweight |
| Milvus | Distributed | Yes | Yes | Billion-scale vector search |
| Qdrant | Cloud/Self-hosted | Yes | Yes | Rust-based, filtering support |
| pgvector | PostgreSQL extension | No | Yes | Use existing Postgres infra |
| FAISS | Library | No | Yes | Meta's similarity search library |
| Model | Provider | Dimensions | Context | Notes |
|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 3072 | 8191 | Leading commercial embedding |
| text-embedding-3-small | OpenAI | 1536 | 8191 | Cost-effective |
| embed-v4 | Cohere | 1024 | 512 | Multilingual, compressed |
| voyage-3 | Voyage AI | 1024 | 32000 | Long-context embeddings |
| BGE-M3 | BAAI | 1024 | 8192 | Leading open-source multilingual |
| GTE-Qwen2 | Alibaba | 1536 | 32000 | Long-context open-source |
| NomicEmbed | Nomic | 768 | 8192 | Fully open-source, auditable |
| Tool | Purpose | Key Feature |
|---|---|---|
| MemGPT / Letta | LLM memory management | Virtual context, self-editing memory |
| Mem0 | Memory layer for AI | Personalized memory for agents |
| LangMem | Long-term memory for LangChain | Persistent conversational memory |
| Instructor | Structured output from LLMs | Pydantic-based extraction |
| Guardrails AI | LLM output validation | Structure, type, and quality checks |
Full list with curriculum details in courses/README.md
| # | Course | Provider | Level | Format | Cost |
|---|---|---|---|---|---|
| 1 | ChatGPT Prompt Engineering for Developers | DeepLearning.AI + OpenAI | π’ | Video | Free |
| 2 | Building Systems with the ChatGPT API | DeepLearning.AI + OpenAI | π‘ | Video | Free |
| 3 | LangChain for LLM Application Development | DeepLearning.AI + LangChain | π‘ | Video | Free |
| 4 | Building and Evaluating Advanced RAG | DeepLearning.AI + LlamaIndex | π‘ | Video | Free |
| 5 | Stanford CS324: Large Language Models | Stanford | π΄ | Lecture | Free |
| 6 | Stanford CS25: Transformers United | Stanford | π΄ | Seminar | Free |
| 7 | Hugging Face NLP Course | Hugging Face | π’ | Interactive | Free |
| 8 | Full Stack LLM Bootcamp | FSDL | π‘ | Video | Free |
| 9 | Practical Deep Learning for Coders | fast.ai | π‘ | Video | Free |
| 10 | LLM University | Cohere | π’ | Interactive | Free |
| 11 | Prompt Engineering Specialization | DeepLearning.AI | π’ | Video | Paid |
| 12 | UC Berkeley LLM Agents MOOC | UC Berkeley | π΄ | Video | Free |
Full deep-dive with code examples in techniques/README.md
The way you split documents determines retrieval quality.
| Strategy | Approach | Pros | Cons |
|---|---|---|---|
| Fixed-Size | Split every N tokens with overlap | Simple, predictable | Breaks meaning |
| Semantic | Split by meaning/topic boundaries | Coherent, better retrieval | Expensive, complex |
| Recursive | Try large chunks first, then split smaller if needed | Adaptive, respects document structure | More complex |
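A minimal sketch of the fixed-size strategy, splitting on words as a cheap proxy for tokens. The `chunk_size` and `overlap` values are illustrative:

```python
def chunk_fixed(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-count chunks, each sharing `overlap` words with its neighbor."""
    assert overlap < chunk_size
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final chunk already reaches the end of the text
    return chunks

chunks = chunk_fixed("word " * 500, chunk_size=200, overlap=40)
print(len(chunks))  # 3
```

A production pipeline would count real tokenizer tokens and split on sentence or section boundaries where possible, but the overlap mechanic is the same.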
```
 Naive RAG  (🟢)    ──▶   Advanced RAG  (🟡)   ──▶   Modular RAG  (🔴)

 Query                    Query Rewrite               Router ──▶ RAG
   ▼                        ▼                           ├───▶ Agent
 Retrieve                 HyDE                          └───▶ Direct
   ▼                        ▼
 Generate                 Retrieve                    Composable
                            ▼                         pipelines
                          Rerank
                            ▼
                          Generate
```
Reduce token usage while preserving signal:
- Extractive: Select only the most relevant sentences/paragraphs
- Abstractive: Summarize retrieved chunks before injection
- Token-level: Use LLMLingua to prune low-information tokens (up to 20x compression)
- Semantic deduplication: Remove redundant information across retrieved chunks
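The extractive approach can be sketched with a simple relevance score: keep only the sentences that share the most vocabulary with the query. A real system would use embedding similarity or LLMLingua-style token pruning instead of word overlap:

```python
import re

def compress(text: str, query: str, keep: int = 2) -> str:
    """Keep the `keep` sentences most lexically similar to the query, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    q_terms = set(query.lower().split())
    scored = sorted(
        sentences,
        key=lambda s: len(q_terms & set(s.lower().split())),
        reverse=True,
    )
    kept = set(scored[:keep])
    # Preserve original order of the surviving sentences.
    return " ".join(s for s in sentences if s in kept)

text = ("Context windows are finite. The weather was nice yesterday. "
        "Compression keeps context windows small.")
print(compress(text, "context windows compression", keep=2))
```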
Reuse expensive context across requests:
- Prefix caching: Cache system prompts and few-shot examples (supported by Anthropic, Google)
- KV-cache sharing: Share key-value caches across similar requests
- Semantic caching: Cache responses for semantically similar queries
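At the application level, prefix caching can be sketched as a hash-keyed store: hash the static prefix (system prompt plus few-shot examples) and reuse any work keyed on it. Provider-side prefix caching works analogously but caches the model's KV state rather than application output; the class and `compute` callback below are illustrative:

```python
import hashlib

class PrefixCache:
    def __init__(self):
        self._store: dict[str, str] = {}

    @staticmethod
    def key(prefix: str) -> str:
        return hashlib.sha256(prefix.encode()).hexdigest()

    def get_or_compute(self, prefix: str, compute) -> tuple[str, bool]:
        """Return (result, was_cache_hit); `compute` runs only on a miss."""
        k = self.key(prefix)
        if k in self._store:
            return self._store[k], True
        self._store[k] = compute(prefix)  # expensive step runs once per prefix
        return self._store[k], False

cache = PrefixCache()
system_prompt = "You are a helpful assistant. Rules: ..."
result, hit = cache.get_or_compute(system_prompt, lambda p: f"processed:{len(p)}")
result2, hit2 = cache.get_or_compute(system_prompt, lambda p: f"processed:{len(p)}")
print(hit, hit2)  # False True
```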
```
 Full Attention (O(n²)):       Sliding Window (O(n·w)):

 ■ ■ ■ ■ ■ ■                   ■ · · · · ·
 ■ ■ ■ ■ ■ ■                   ■ ■ · · · ·
 ■ ■ ■ ■ ■ ■                   ■ ■ ■ · · ·
 ■ ■ ■ ■ ■ ■                   · ■ ■ ■ · ·
 ■ ■ ■ ■ ■ ■                   · · ■ ■ ■ ·
 ■ ■ ■ ■ ■ ■                   · · · ■ ■ ■
```
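The two patterns can be contrasted as boolean masks: full attention allows all n×n query-key pairs, while a causal sliding window allows only about n·w. The values of `n` and `w` below are illustrative:

```python
def full_mask(n: int) -> list[list[bool]]:
    # Every token attends to every other token: n*n allowed pairs.
    return [[True] * n for _ in range(n)]

def sliding_window_mask(n: int, w: int) -> list[list[bool]]:
    # Token i attends to itself and the previous w-1 tokens.
    return [[i - w < j <= i for j in range(n)] for i in range(n)]

n, w = 6, 3
full_pairs = sum(sum(row) for row in full_mask(n))                 # 36
window_pairs = sum(sum(row) for row in sliding_window_mask(n, w))  # 15
print(full_pairs, window_pairs)  # 36 15
```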
Chain multiple retrieval steps to answer complex questions:
1. Decompose the query into sub-questions
2. Retrieve for each sub-question independently
3. Synthesize intermediate answers
4. Use intermediate answers to refine retrieval
5. Generate the final comprehensive answer
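The loop above can be sketched with stub components; the knowledge base and the `decompose` and `synthesize` functions here are illustrative stand-ins for LLM and retriever calls:

```python
KB = {
    "capital of France": "Paris is the capital of France.",
    "population of Paris": "Paris has about 2.1 million residents.",
}

def decompose(query: str) -> list[str]:
    # Stub: a real system would ask an LLM to split the query.
    return ["capital of France", "population of Paris"]

def retrieve(sub_question: str) -> str:
    return KB.get(sub_question, "")

def synthesize(query: str, evidence: list[str]) -> str:
    # Stub: a real system would generate a final answer from the evidence.
    return f"Q: {query}\nEvidence: " + " ".join(evidence)

def multi_hop(query: str) -> str:
    evidence = []
    for sub_q in decompose(query):          # step 1: decompose
        evidence.append(retrieve(sub_q))    # step 2: retrieve per sub-question
    return synthesize(query, evidence)      # steps 3-5: synthesize the answer

print(multi_hop("How many people live in the capital of France?"))
```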
- Dynamic few-shot: Select examples most similar to the current query
- Diverse few-shot: Ensure coverage of edge cases and formats
- Ordered few-shot: Place most relevant examples closest to the query (recency bias)
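Dynamic selection with recency-aware ordering can be sketched as follows; word overlap stands in for embedding similarity, and the example pool is illustrative:

```python
EXAMPLES = [
    ("Translate 'cat' to French", "chat"),
    ("Summarize this article about economics", "..."),
    ("Translate 'dog' to French", "chien"),
]

def overlap(a: str, b: str) -> int:
    return len(set(a.lower().split()) & set(b.lower().split()))

def select_examples(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Pick the k most similar examples; ascending sort puts the most similar last,
    so it lands closest to the query in the assembled prompt (recency bias)."""
    ranked = sorted(EXAMPLES, key=lambda ex: overlap(query, ex[0]))
    return ranked[-k:]

query = "Translate 'bird' to French"
for inp, out in select_examples(query):
    print(f"{inp} -> {out}")
```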
```
┌─────────────────────────────────────────────┐
│            SYSTEM PROMPT LAYERS             │
├─────────────────────────────────────────────┤
│  1. IDENTITY  │ Role, persona, expertise    │
│  2. CONTEXT   │ Background information      │
│  3. RULES     │ Constraints, boundaries     │
│  4. FORMAT    │ Output structure            │
│  5. EXAMPLES  │ Reference behaviors         │
│  6. FALLBACK  │ Edge case handling          │
└─────────────────────────────────────────────┘
```
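Assembling the six layers into one prompt can be sketched as a simple join; the section contents below are placeholders for a hypothetical product:

```python
LAYERS = [
    ("IDENTITY", "You are a senior support engineer for AcmeDB."),
    ("CONTEXT", "AcmeDB is a managed Postgres service."),
    ("RULES", "Never reveal internal hostnames. Cite docs when possible."),
    ("FORMAT", "Answer in short paragraphs; use code blocks for SQL."),
    ("EXAMPLES", "User: How do I connect? -> Show a psql example."),
    ("FALLBACK", "If unsure, say so and link to support."),
]

def build_system_prompt(layers=LAYERS) -> str:
    # Each layer becomes a labeled section, in the fixed order above.
    return "\n\n".join(f"## {name}\n{body}" for name, body in layers)

print(build_system_prompt())
```

Keeping the layers as data rather than one prose blob makes it easy to version, test, and swap individual sections per deployment.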
```
               🎯 CONTEXT ENGINEERING MASTERY
                              │
                ┌─────────────┴─────────────┐
                │                           │
           FOUNDATIONS                APPLICATIONS
                │                           │
        ┌───────┴───────┐           ┌───────┴───────┐
        │               │           │               │
     THEORY         PRACTICE    PRODUCTION      RESEARCH
        ▼               ▼           ▼               ▼
```
🟢 BEGINNER (Weeks 1-4)

```
├── Understand transformer attention mechanisms
├── Learn token counting and context window basics
├── Master basic prompt engineering patterns
├── Study the Illustrated Transformer blog post
├── Complete DeepLearning.AI prompt engineering course
└── Build a simple chatbot with system prompts
```

🟡 INTERMEDIATE (Weeks 5-10)

```
├── Implement naive RAG with vector database
├── Learn chunking strategies (fixed, semantic, recursive)
├── Study embedding models and similarity search
├── Implement context compression techniques
├── Build an advanced RAG system with reranking
├── Learn evaluation metrics (faithfulness, relevance, recall)
├── Study "Lost in the Middle" paper and information ordering
└── Complete LangChain / LlamaIndex course
```

🔴 ADVANCED (Weeks 11-16)

```
├── Design multi-agent systems with shared context
├── Implement hierarchical memory (short/long/working)
├── Build modular RAG pipelines with routing
├── Study DSPy for programmatic prompt optimization
├── Implement context caching and cost optimization
├── Learn to evaluate with RAGAS, DeepEval, or custom evals
├── Study agentic RAG patterns (CRAG, Self-RAG, FLARE)
└── Build a production system with monitoring and fallbacks
```

⚫ EXPERT (Ongoing)

```
├── Contribute to open-source context engineering tools
├── Publish research on novel context management techniques
├── Design context architectures for enterprise systems
├── Optimize for cost, latency, and quality simultaneously
└── Mentor others in context engineering practices
```
We welcome contributions from the community. Here is how you can help:
- Add a resource: open a PR with a new paper, video, blog post, or tool
- Fix errors: found a broken link or incorrect information? Open an issue
- Improve explanations: help make the techniques section clearer
- Add code examples: contribute working code for context engineering patterns
- Translate: help translate this guide into other languages
Please read our contribution guidelines before submitting.
```bash
# Fork the repository
git clone https://github.com/mlnjsh/context-engineering.git
cd context-engineering

# Create a feature branch
git checkout -b add-new-resource

# Make your changes and commit
git add .
git commit -m "Add [resource type]: [resource name]"

# Push and create a PR
git push origin add-new-resource
```

If you find this resource helpful in your research or work, please consider citing it:
```bibtex
@misc{joshi2025contextengineering,
  title  = {Context Engineering: The Complete Guide},
  author = {Joshi, Milan Amrut},
  year   = {2025},
  url    = {https://github.com/mlnjsh/context-engineering},
  note   = {A curated guide to context engineering for large language models}
}
```

This work is licensed under the MIT License.
Built with care by Professor Milan Amrut Joshi
Professor of Data Science, Northwestern University
If this resource helped you, please consider giving it a star.
- Milan Amrut Joshi – Project Author
- Simon Willison – LLM context & prompt engineering expert
- Brex – Prompt engineering best practices