Context Engineering: Giving AI the Right Knowledge

November 10, 2024 · AI Tools · Tags: AI Engineering, Paradigm Evolution, Context Engineering, RAG · Series: AI Engineering Series


What is Context Engineering?

In June 2025, Andrej Karpathy offered a definition of Context Engineering in a post on X (Twitter): “the delicate art and science of filling the context window with just the right information for the next step.”

This definition captures the core distinction from Prompt Engineering:

Prompt Engineering: Optimizes “what you say” – focuses on how input instructions are expressed
Context Engineering: Optimizes “what the model knows” – focuses on what information the model can access

Using a chef as an analogy: Prompt Engineering adjusts the menu instructions given to the chef, while Context Engineering manages the complete ingredient warehouse the chef can draw from.

From GPT-3 to today’s long-context models such as Gemini 1.5, context windows have grown from a few thousand tokens to 1M. This shift has moved the focus from “how to condense prompts” to “how to use a very large context space effectively.”

Origins and Development

Bill Schilit’s Pioneering Work

The concept of Context Engineering can be traced back to Bill Schilit’s 1995 PhD thesis, “A System Architecture for Context-Aware Mobile Computing.” The thesis introduced the concept of Context-Aware Computing:

mermaid
flowchart TD
    A[User Behavior] --> B[Context Collection]
    B --> C[Context Analysis]
    C --> D[Intelligent Response]

    classDef src fill:#bbdefb,stroke:#2196F3,color:#1B5E20
    classDef proc fill:#fff3e0,stroke:#FF9800,color:#BF360C
    classDef out fill:#c8e6c9,stroke:#4CAF50,color:#1B5E20
    class A src
    class B,C proc
    class D out

Schilit defined three types of context:

Computational Context: Device status, network conditions
User Context: Location, time, identity
Physical Context: Environment, sensor data

Although aimed at mobile computing at the time, these ideas anticipate how AI systems manage context today.

The Emergence of RAG

In 2020, Lewis et al. published “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” at NeurIPS, formally introducing RAG (Retrieval-Augmented Generation):

mermaid
flowchart TD
    A[User Query] --> B[Retrieval Module]
    B --> C[Vector Database]
    C --> D[Relevant Documents]
    D --> E[Large Model]
    E --> F[Enhanced Response]

    classDef src fill:#bbdefb,stroke:#2196F3,color:#1B5E20
    classDef proc fill:#fff3e0,stroke:#FF9800,color:#BF360C
    classDef store fill:#c8e6c9,stroke:#4CAF50,color:#1B5E20
    classDef spec fill:#f3e5f5,stroke:#9C27B0,color:#4A148C
    classDef out fill:#bbdefb,stroke:#2196F3,color:#1B5E20
    class A src
    class B proc
    class C store
    class D store
    class E spec
    class F out

The core innovation of RAG is combining an external knowledge base with a large model. Knowledge can then be updated without retraining, and answers grounded in retrieved text are less prone to hallucination.
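
The flow in the diagram can be sketched in a few lines. This is a toy illustration, not the method from the paper: retrieval here is plain word overlap instead of vector search, and `KNOWLEDGE_BASE`, `retrieve`, and `build_prompt` are hypothetical names chosen for this example.

```python
import re

# Toy knowledge base; in practice these would be chunks stored in a vector database.
KNOWLEDGE_BASE = [
    "Employees accrue 1.5 vacation days per month.",
    "Overtime must be approved by a manager in advance.",
    "The office is closed on public holidays.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, top_k=2):
    """Rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, documents):
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"- {doc}" for doc in documents)
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = retrieve("Who approves overtime?", KNOWLEDGE_BASE, top_k=1)
prompt = build_prompt("Who approves overtime?", docs)
print(prompt)
```

The resulting prompt is what would be sent to the large model; swapping `retrieve` for an embedding-based search changes the quality of the context, not the shape of the pipeline.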

Formal Introduction of Context Engineering

In June 2025, Karpathy strongly advocated for the term on X (Twitter), and it spread rapidly through the AI engineering community. In September of the same year, Anthropic formally proposed “Context Engineering” as an independent engineering discipline.

Four Pillars

1. Knowledge Layer (What the model knows)

This is the foundational layer of Context Engineering, determining the model’s basic cognitive framework.

Core Components:

System Prompts: Define the model’s basic behavioral guidelines
Tool Definitions: Specifications for callable tools
Pre-training Knowledge: The model’s inherent capabilities

Design Principles:

yaml
Knowledge Layer Design Principles:
  - Minimization: Avoid unnecessary redundant information
  - Structured: Use clear formats for easy parsing
  - Hierarchical: Core knowledge first, specialized knowledge loaded on demand

Practical Example:

python
# Excellent system prompt design
system_prompt = """
You are a professional software development assistant, specializing in:
1. Code Quality: Readability, maintainability, performance
2. Best Practices: Design patterns, architectural principles, coding standards  
3. Security Considerations: Input validation, error handling, access control
Please respond in Chinese, keeping professional terms in English.
"""
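
The "Hierarchical" principle (core knowledge first, specialized knowledge loaded on demand) can be sketched as follows. The section names, the keyword trigger, and `build_system_prompt` are assumptions made for illustration; a real system might select sections with a classifier or a retrieval step instead.

```python
CORE_PROMPT = "You are a professional software development assistant."

# Hypothetical specialized sections, keyed by the topic that triggers them.
SPECIALIZED_KNOWLEDGE = {
    "security": "Security checklist: validate input, handle errors, enforce access control.",
    "performance": "Performance checklist: profile first, then optimize hot paths.",
}

def build_system_prompt(user_message):
    """Start from the core prompt and append only the sections the message mentions."""
    sections = [CORE_PROMPT]
    lowered = user_message.lower()
    for topic, text in SPECIALIZED_KNOWLEDGE.items():
        if topic in lowered:
            sections.append(text)
    return "\n\n".join(sections)

print(build_system_prompt("Can you review this login handler for security issues?"))
```

A message that mentions neither topic gets only the core prompt, which keeps the always-present part of the context small.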

2. Memory Layer (What the model remembers)

Manages how information is stored and recalled over the course of a conversation.

Memory Types:

Short-term Memory: Current conversation history
Long-term Memory: Persistent memory in vector databases
Working Memory: State information for current tasks

KV Cache Basics: Modern large models use a KV cache to store the attention key and value vectors of tokens already processed, so they do not have to be recomputed for each new token:

Token 1 → Key1, Value1
Token 2 → Key2, Value2  
Token 3 → Key3, Value3
...
Token N → KeyN, ValueN
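
To make the idea concrete, here is a toy single-head version in plain Python. It is a sketch under simplifying assumptions: real models keep one such cache per layer and per attention head and compute the vectors from embeddings, whereas the query, key, and value vectors below are hand-written. `attend` and `KVCache` are illustrative names.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * v[i] for w, v in zip(weights, values)) for i in range(dim)]

class KVCache:
    """Stores one key and one value vector per token that has been processed."""
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Append only the new token's key/value; earlier entries are reused as-is.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

# Toy per-token (query, key, value) vectors.
tokens = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]

cache = KVCache()
for q, k, v in tokens:
    cached_output = cache.step(q, k, v)

# Recomputing everything from scratch for the last token gives the same result.
full_output = attend(tokens[-1][0], [t[1] for t in tokens], [t[2] for t in tokens])
print(cached_output, full_output)
```

The output for the newest token is identical either way; the cache only avoids redoing the key and value work for earlier tokens, at the cost of memory that grows with context length.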

Optimization Strategies:

python
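# Importance-ranked memory compression
def rank_by_importance(memory_chunks):
    """Highest 'importance' first; the field name is an assumption about chunk shape"""
    return sorted(memory_chunks, key=lambda c: c.get('importance', 0), reverse=True)

def compress_memory(memory_chunks, max_tokens=4000):
    """
    Keep the most important chunks that fit within max_tokens
    """
    # 1. Rank by importance
    ranked_chunks = rank_by_importance(memory_chunks)

    # 2. Greedy selection: add chunks until the next one would exceed the budget
    compressed = []
    current_tokens = 0

    for chunk in ranked_chunks:
        if current_tokens + chunk['tokens'] <= max_tokens:
            compressed.append(chunk)
            current_tokens += chunk['tokens']
        else:
            break

    return compressed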

3. Retrieval Layer (What the model retrieves)

This is the most complex yet critical component of Context Engineering, responsible for finding the most relevant information from massive data.

In-depth RAG Analysis:

Vectorization (Embedding)

python
from sentence_transformers import SentenceTransformer

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Document vectorization
documents = [
    "Python is an interpreted programming language",
    "Machine learning is a branch of artificial intelligence", 
    "Deep learning uses neural network architectures"
]

embeddings = model.encode(documents)
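
Once documents are vectors, retrieval reduces to finding the vectors closest to the query vector. The sketch below shows that step with cosine similarity in plain Python. The 3-dimensional vectors are hand-made stand-ins for real embeddings, and `cosine_similarity` and `top_k` are names introduced here for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, doc_vectors, k=2):
    """Return (index, similarity) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vector, v)) for i, v in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hand-made 3-dimensional vectors standing in for real embeddings.
doc_vectors = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.6],
]
query_vector = [0.0, 1.0, 0.3]

results = top_k(query_vector, doc_vectors, k=2)
print(results)
```

With real embeddings the query would be encoded by the same model as the documents, and a vector database would replace the linear scan in `top_k`.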

Chunking Strategy

python
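import re

def split_into_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace"""
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

def smart_chunking(text, max_length=512, overlap=50):
    """
    Paragraph-first chunking (lengths are in characters, not tokens):
    1. Split on blank lines and keep paragraphs that fit as whole chunks
    2. Split oversized paragraphs at sentence boundaries
    3. Carry the last `overlap` characters into the next chunk of the same paragraph
    """
    chunks = []

    for para in text.split('\n\n'):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_length:
            chunks.append(para)
            continue

        # Further split by sentences
        current_chunk = ""
        for sentence in split_into_sentences(para):
            if len(current_chunk) + len(sentence) <= max_length:
                current_chunk += sentence + " "
            else:
                previous = current_chunk.strip()
                if previous:
                    chunks.append(previous)
                # Seed the next chunk with the tail of the previous one.
                # A chunk can therefore exceed max_length by up to `overlap`,
                # and a single over-long sentence is kept whole.
                tail = previous[-overlap:] if overlap > 0 else ""
                current_chunk = (tail + " " if tail else "") + sentence + " "

        if current_chunk.strip():
            chunks.append(current_chunk.strip())

    return chunks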
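
The role of the `overlap` parameter is easiest to see in a fixed-size sliding window, a simpler alternative to paragraph-based chunking. This is a minimal sketch that counts words rather than model tokens; `sliding_window_chunks` is a name introduced for this example.

```python
def sliding_window_chunks(words, window=8, overlap=2):
    """Split a word list into windows of `window` words that share `overlap` words."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + window])
        if start + window >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

words = [f"w{i}" for i in range(10)]
chunks = sliding_window_chunks(words, window=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts with the last word of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk. Larger overlaps improve recall at the cost of storing more duplicated text.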

Vector Databases

Major vector database comparison:

Database	Release Date	Features	Use Cases
Pinecone	2021.08	Cloud-native, easy to use	Quick deployment in production
Chroma	2022.10	Open-source, lightweight	Development, testing, local deployment
FAISS	2017.03	Facebook open-source, high-performance	Large-scale vector computation
Milvus	2019.04	Distributed, scalable	Ultra-large-scale data processing

Reranking

Reranking reorders retrieval results by relevance before they reach the model. A well-known finding here is the “Lost in the Middle” problem:

In 2023, Liu et al. found that models use information best when it appears at the beginning or end of a long context, while relevant information placed in the middle is easily overlooked.

Solution:

python
from rank_bm25 import BM25Okapi

def lost_in_middle_fix(documents, query):
    """
    One common mitigation for "Lost in the Middle": rank documents with BM25,
    then place the strongest ones at the beginning and end of the context.
    """
    # 1. BM25 relevance scores (get_scores returns one score per document)
    tokenized_docs = [doc.lower().split() for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    scores = bm25.get_scores(query.lower().split())

    # 2. Rank document indices from most to least relevant
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)

    # 3. Alternate between the front and the back so the weakest land in the middle
    front, back = [], []
    for rank, idx in enumerate(ranked):
        (front if rank % 2 == 0 else back).append(documents[idx])

    return front + back[::-1]

4. Generation Layer (What the model generates)

Ensures the model’s output meets requirements and is well-structured.

Core Control Mechanisms:

Output Constraints: JSON format, word count limits
Chain of Thought (CoT) Guidance: Step-by-step reasoning
Format Control: Markdown tables, code blocks, etc.

Practical Example:

python
def structured_generation(prompt, schema, call_llm, validate_format, fix_format):
    """
    Structured generation controller.
    call_llm, validate_format and fix_format are supplied by the caller
    (model client and format checkers); they are placeholders here.
    """
    # 1. Add format constraints
    constrained_prompt = f"""
Please strictly follow the format below:

{schema['format']}

User Question: {prompt}

Please ensure:
- Use the specified format
- Content is accurate and complete
- Language is clear and easy to understand
"""

    # 2. Call the model
    response = call_llm(constrained_prompt)

    # 3. Validate the format; attempt a repair only if validation fails
    if validate_format(response, schema):
        return response
    return fix_format(response, schema)
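
For JSON output constraints, the validation step could look like the sketch below. It assumes the schema carries a `required` list of keys, which is an assumption of this example rather than something the code above defines, and `extract_json` is a hypothetical best-effort repair helper.

```python
import json

def validate_format(response, schema):
    """Return True if response is a JSON object containing every required key."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in schema["required"])

def extract_json(response):
    """Best-effort repair: keep only the text between the first '{' and the last '}'."""
    start, end = response.find("{"), response.rfind("}")
    return response[start:end + 1] if start != -1 and end > start else response

schema = {"required": ["answer", "confidence"]}
raw = 'Sure! Here is the result: {"answer": "42", "confidence": 0.9}'

print(validate_format(raw, schema))
print(validate_format(extract_json(raw), schema))
```

The raw reply fails validation because of the chatty preamble, while the extracted object passes. When a repair like this is not enough, re-prompting the model with the validation error is the usual fallback.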

Core Challenges in Context Management

Token Budget Allocation

Different information types have vastly different value densities:

python
def value_based_allocation(context_tokens, total_budget):
    """
    Value-based budget allocation strategy
    """
    priorities = {
        'system_prompt': 0.15,      # System instructions 15%
        'core_context': 0.25,       # Core context 25%
        'recent_history': 0.20,     # Recent history 20%
        'tool_definitions': 0.15,   # Tool definitions 15%
        'retrieved_docs': 0.25      # Retrieved documents 25%
    }
    
    allocated = {}
    remaining = total_budget
    
    for component, ratio in priorities.items():
        component_budget = int(total_budget * ratio)
        allocated[component] = min(component_budget, remaining)
        remaining -= allocated[component]
    
    return allocated
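
Fixed ratios can leave budget unused when a component needs less than its share. One possible refinement, sketched here under the assumption that each component's actual token need is known up front, is to cap every component at its need and hand the leftover to components that are still short. `allocate_with_redistribution` and the numbers in `needs` are hypothetical.

```python
def allocate_with_redistribution(needs, ratios, total_budget):
    """
    Give each component min(need, share), then hand unused budget to
    components that still need more, in the order they appear in `ratios`.
    """
    allocated = {}
    for name, ratio in ratios.items():
        share = int(total_budget * ratio)
        allocated[name] = min(share, needs.get(name, 0))

    leftover = total_budget - sum(allocated.values())
    for name in ratios:
        if leftover <= 0:
            break
        extra = min(needs.get(name, 0) - allocated[name], leftover)
        allocated[name] += extra
        leftover -= extra
    return allocated

ratios = {
    "system_prompt": 0.15,
    "core_context": 0.25,
    "recent_history": 0.20,
    "tool_definitions": 0.15,
    "retrieved_docs": 0.25,
}
# Hypothetical token counts actually required by each component.
needs = {
    "system_prompt": 500,
    "core_context": 2000,
    "recent_history": 3000,
    "tool_definitions": 400,
    "retrieved_docs": 4000,
}

allocation = allocate_with_redistribution(needs, ratios, total_budget=8000)
print(allocation)
```

With an 8,000-token budget, the system prompt and tool definitions use less than their 1,200-token shares, and the freed tokens go first to recent history and then to retrieved documents. The dictionary order of `ratios` acts as the priority order for leftovers, which is a design choice worth making explicit.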

Context Window Evolution

Model	Release Date	Context	Features
GPT-3	2020.06	2K	Foundational work
GPT-3.5	2022.03	4K (16K variant in 2023.06)	First major breakthrough
GPT-4	2023.03	8K / 32K	Commercially available
Claude 2	2023.07	100K	Long text specialist
GPT-4 Turbo	2023.11	128K	Practical long context
Gemini 1.5	2024.02	1M	Multimodal long context
DeepSeek-V2	2024.05	128K	Open-source long context

Prompt Caching Technology

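Anthropic’s Prompt Caching can cut input costs by up to 90% on the cached portion of a prompt. The real feature caches prompt prefixes on the provider side; the class below is only a local sketch of the bookkeeping idea (store a prompt under its hash and track reuse):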

python
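import time

class PromptCache:
    """Local, in-memory illustration of prompt reuse"""
    def __init__(self):
        self.cache = {}

    def get_cached_prompt(self, prompt_hash):
        entry = self.cache.get(prompt_hash)
        if entry is not None:
            entry['usage_count'] += 1  # track how often the entry is reused
        return entry

    def cache_prompt(self, prompt_hash, prompt_data):
        self.cache[prompt_hash] = {
            'data': prompt_data,
            'timestamp': time.time(),
            'usage_count': 0
        }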
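
Where the savings come from is simple arithmetic. The sketch below compares input-token cost with and without a cached prefix. The multipliers (a cache write costing 1.25x a normal input token, a cache read 0.10x) are assumptions that should be checked against current provider pricing, and the model assumes the cache never expires between calls.

```python
def relative_cost(prefix_tokens, suffix_tokens, calls,
                  write_multiplier=1.25, read_multiplier=0.10):
    """
    Compare input-token cost with and without caching a shared prefix.
    Costs are in units of "one uncached input token"; the multipliers are
    assumptions, not authoritative prices.
    """
    uncached = calls * (prefix_tokens + suffix_tokens)
    cached = (
        prefix_tokens * write_multiplier                  # first call writes the cache
        + (calls - 1) * prefix_tokens * read_multiplier   # later calls read it
        + calls * suffix_tokens                           # the varying part is never cached
    )
    return uncached, cached, 1 - cached / uncached

uncached, cached, saving = relative_cost(prefix_tokens=10_000, suffix_tokens=200, calls=20)
print(f"uncached={uncached:.0f} cached={cached:.0f} saving={saving:.1%}")
```

Under these assumptions, 20 calls sharing a 10,000-token prefix cost about 35,500 token-units instead of 204,000. Savings approach the 90% ceiling only when the prefix dominates the prompt and is reused many times.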

Hybrid Strategy: RAG + Long Context

Modern systems typically combine multiple strategies:

python
hybrid_context_strategy = {
    'short_term': {
        'memory_type': 'working_memory',
        'max_tokens': 2000,
        'refresh_rate': 'turn_by_turn'
    },
    'long_term': {
        'memory_type': 'vector_db', 
        'max_tokens': 8000,
        'refresh_rate': 'hourly'
    },
    'retrieval': {
        'method': 'semantic_search',
        'top_k': 5,
        'reranking': True
    }
}
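
A configuration like this still needs code that turns it into an actual context string. The following is a minimal sketch of that step, assuming a crude one-token-per-word estimate and inputs that are already ordered by priority (most recent turn first, best memory hit first). `assemble_context` and its helpers are hypothetical, and the budgets are shrunk so the trimming is visible.

```python
def estimate_tokens(text):
    """Very rough token estimate: one token per whitespace-separated word."""
    return len(text.split())

def take_within_budget(items, max_tokens):
    """Keep items in order until the next one would exceed the budget."""
    kept, used = [], 0
    for item in items:
        cost = estimate_tokens(item)
        if used + cost > max_tokens:
            break
        kept.append(item)
        used += cost
    return kept

def assemble_context(strategy, recent_turns, memory_hits, retrieved_docs):
    """Build one context string from the three sources named in the strategy."""
    parts = []
    parts += take_within_budget(recent_turns, strategy["short_term"]["max_tokens"])
    parts += take_within_budget(memory_hits, strategy["long_term"]["max_tokens"])
    parts += retrieved_docs[: strategy["retrieval"]["top_k"]]
    return "\n\n".join(parts)

# Small budgets so the effect of each limit shows up in the output.
strategy = {
    "short_term": {"max_tokens": 6},
    "long_term": {"max_tokens": 4},
    "retrieval": {"top_k": 1},
}
context = assemble_context(
    strategy,
    recent_turns=["User asked about overtime", "Assistant cited the handbook"],
    memory_hits=["User works in Berlin", "User prefers short answers"],
    retrieved_docs=["Overtime needs manager approval.", "Holidays are listed online."],
)
print(context)
```

Each source is trimmed independently, so a long conversation cannot crowd out retrieved documents. A production version would count tokens with the model's own tokenizer rather than by words.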

Getting Started: Building a Document Q&A Bot

The following walkthrough builds a small end-to-end RAG system as a hands-on way to understand Context Engineering.

Step 1: Environment Setup

text
# requirements.txt
sentence-transformers==2.2.2
chromadb==0.4.18
langchain==0.1.0
openai==1.3.7
numpy==1.24.3
pandas==2.0.3
rank-bm25

bash
pip install -r requirements.txt

Step 2: Document Processing

python
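# DocumentProcessor is a project-local helper, not a published package;
# the methods used below are assumed to exist on it
from document_processor import DocumentProcessor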

# Initialize processor
processor = DocumentProcessor(
    chunk_size=512,
    chunk_overlap=50,
    embedding_model='all-MiniLM-L6-v2'
)

# Load documents
documents = processor.load_documents([
    'data/company_handbook.pdf',
    'data/tech_specs.md',
    'data/policies.txt'
])

# Process documents
processed_chunks = processor.process_documents(documents)

# Create vector database
vector_db = processor.create_vector_db(processed_chunks)

Step 3: Retrieval System

python
from rank_bm25 import BM25Okapi

class DocumentRetriever:
    def __init__(self, vector_db, reranker=True):
        # vector_db is assumed to expose search(query, k) and to return
        # a list of dicts that each carry a 'text' key
        self.vector_db = vector_db
        self.reranker = reranker

    def retrieve(self, query, top_k=5):
        """Two-stage retrieval: vector search + reranking"""
        # Stage 1: over-fetch candidates with vector search
        vector_results = self.vector_db.search(query, top_k * 2)

        # Stage 2: rerank the candidates, then keep the best top_k
        if self.reranker and vector_results:
            return self.rerank_results(query, vector_results)[:top_k]
        return vector_results[:top_k]

    def rerank_results(self, query, results):
        """Rerank candidates by BM25 keyword relevance to the query"""
        # Index only the candidates so scores line up one-to-one with results
        tokenized = [r['text'].lower().split() for r in results]
        bm25 = BM25Okapi(tokenized)
        scores = bm25.get_scores(query.lower().split())

        # Attach the score so later stages can order documents by it
        ranked = sorted(zip(results, scores), key=lambda x: x[1], reverse=True)
        return [{**result, 'score': float(score)} for result, score in ranked]

Step 4: Q&A System

python
class RAGSystem:
    def __init__(self, retriever, llm_client):
        self.retriever = retriever
        # llm_client is assumed to expose generate(prompt) and return text
        self.llm_client = llm_client

    def generate_response(self, query, context_window=4000):
        """Generate response"""
        # 1. Retrieve relevant documents
        relevant_docs = self.retriever.retrieve(query)

        # 2. Build context
        context = self.build_context(relevant_docs, context_window)

        # 3. Generate response
        prompt = self.create_prompt(query, context)
        response = self.llm_client.generate(prompt)

        return response, relevant_docs

    def build_context(self, docs, max_tokens):
        """Pack the highest-scoring documents into the token budget"""
        context_parts = []
        used_tokens = 0

        for doc in self.rank_documents_by_importance(docs):
            # Fall back to a rough 4-characters-per-token estimate
            # when the document carries no stored token count
            doc_tokens = doc.get('tokens', len(doc['text']) // 4 + 1)
            if used_tokens + doc_tokens > max_tokens:
                break
            context_parts.append(doc['text'])
            used_tokens += doc_tokens

        return "\n\n".join(context_parts)

    def create_prompt(self, query, context):
        """Combine the retrieved context and the question into one prompt"""
        return (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}"
        )

    def rank_documents_by_importance(self, docs):
        """Rank documents by importance"""
        # Simple implementation: sort by relevance score (0.0 if missing)
        return sorted(docs, key=lambda x: x.get('score', 0.0), reverse=True)

Step 5: System Integration

python
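import os

def main():
    """Main program"""
    # Build the index as in Step 2
    processor = DocumentProcessor(
        chunk_size=512,
        chunk_overlap=50,
        embedding_model='all-MiniLM-L6-v2'
    )
    documents = processor.load_documents(['data/company_handbook.pdf'])
    vector_db = processor.create_vector_db(processor.process_documents(documents))

    # Initialize components
    retriever = DocumentRetriever(vector_db)
    # OpenAIClient stands for any wrapper exposing generate(prompt);
    # the key is read from the environment instead of being hard-coded
    llm_client = OpenAIClient(api_key=os.environ["OPENAI_API_KEY"])

    # Create RAG system
    rag_system = RAGSystem(retriever, llm_client)

    # Test Q&A
    query = "What is the company's overtime policy?"
    response, sources = rag_system.generate_response(query)

    print("Response:", response)
    print("Sources:", sources)

if __name__ == "__main__":
    main()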

Limitations and Future

Current Limitations

Passive Information Supply: Context Engineering optimizes what information is supplied to the model, but the model cannot proactively decide what it needs or fetch it on its own
Missing Execution Control: Knowing the correct information does not guarantee correct execution
Insufficient Feedback Loops: There are no built-in mechanisms for verifying and correcting execution results
Limited Long-term Autonomy: Complex tasks that require multi-step decision making remain out of reach

Context Engineering addresses the question of “how to give the model the right information,” and serves as the foundation for the subsequent AI Harness Engineering phase, which handles execution control, feedback correction, and multi-step decision making.

Part of series: AI Engineering Series
