Context Engineering: Giving AI the Right Knowledge

What is Context Engineering?

In June 2025, Andrej Karpathy provided a definition of Context Engineering on the OpenAI engineering blog: “the delicate art and science of filling the context window with just the right information for the model to take the next step.”

This definition is exceptionally elegant. The core distinction from Prompt Engineering lies in:

  • Prompt Engineering: Optimizes “what you say” – focuses on how input instructions are expressed
  • Context Engineering: Optimizes “what the model knows” – focuses on what information the model can access

Imagine this:

  • Prompt Engineering is like adjusting the menu instructions given to a chef
  • Context Engineering is like preparing a complete ingredient warehouse for the chef

From ChatGPT-3 to today’s GPT-4o and DeepSeek-V2, model context windows have exploded from 4K to 1M. This shift has changed our focus from “how to condense prompts” to “how to effectively utilize massive context space.”

Origins and Development

Bill Schilit’s Pioneering Work

The concept of Context Engineering can be traced back to 1995 in Bill Schilit’s PhD thesis “A System Architecture for Context-Aware Computing” (note: 1995, not 1994). This paper introduced the concept of Context-Aware Computing:

mermaid
flowchart LR
    A[User Behavior] --> B[Context Collection]
    B --> C[Context Analysis]
    C --> D[Intelligent Response]

Schilit defined three types of context:

  1. Computational Context: Device status, network conditions
  2. User Context: Location, time, identity
  3. Physical Context: Environment, sensor data

Although targeting mobile computing at the time, these ideas directly influenced later AI context management.

The Emergence of RAG

In 2020, Lewis et al. published “Retrieval-Augmented Generation over Pre-trained Language Models” at the NeurIPS conference, formally introducing RAG (Retrieval-Augmented Generation):

mermaid
flowchart LR
    A[User Query] --> B[Retrieval Module]
    B --> C[Vector Database]
    C --> D[Relevant Documents]
    D --> E[Large Model]
    E --> F[Enhanced Response]

The core innovation of RAG: combining external knowledge bases with large models, solving model knowledge updates and hallucination problems.

Formal Introduction of Context Engineering

In September 2025, Anthropic formally proposed “Context Engineering” as an independent engineering discipline. In June of the same year, Karpathy strongly advocated for this concept on X (Twitter), allowing it to rapidly spread throughout the AI engineering community.

Four Pillars

1. Knowledge Layer (What the model knows)

This is the foundational layer of Context Engineering, determining the model’s basic cognitive framework.

Core Components:

  • System Prompts: Define the model’s basic behavioral guidelines
  • Tool Definitions: Specifications for callable tools
  • Pre-training Knowledge: The model’s inherent capabilities

Design Principles:

yaml
1
2
3
4
Knowledge Layer Design Principles:
  - Minimization: Avoid unnecessary redundant information
  - Structured: Use clear formats for easy parsing
  - Hierarchical: Core knowledge first, specialized knowledge loaded on demand

Practical Example:

python
1
2
3
4
5
6
7
8
# Excellent system prompt design
system_prompt = """
You are a professional software development assistant, specializing in:
1. Code Quality: Readability, maintainability, performance
2. Best Practices: Design patterns, architectural principles, coding standards  
3. Security Considerations: Input validation, error handling, access control
Please respond in Chinese, keeping professional terms in English.
"""

2. Memory Layer (What the model remembers)

Manages information storage and memory management during model conversations.

Memory Types:

  • Short-term Memory: Current conversation history
  • Long-term Memory: Persistent memory in vector databases
  • Working Memory: State information for current tasks

KV Cache Basics: Modern large models use KV Cache to cache attention computation results:

1
2
3
4
5
Token 1 → Key1, Value1
Token 2 → Key2, Value2  
Token 3 → Key3, Value3
...
Token N → KeyN, ValueN

Optimization Strategies:

python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
# Intelligent memory compression
def compress_memory(memory_chunks, max_tokens=4000):
    """
    Compress memory based on importance ranking
    """
    # 1. Rank by importance
    ranked_chunks = rank_by_importance(memory_chunks)
    
    # 2. Progressive compression
    compressed = []
    current_tokens = 0
    
    for chunk in ranked_chunks:
        if current_tokens + chunk['tokens'] <= max_tokens:
            compressed.append(chunk)
            current_tokens += chunk['tokens']
        else:
            break
    
    return compressed

3. Retrieval Layer (What the model retrieves)

This is the most complex yet critical component of Context Engineering, responsible for finding the most relevant information from massive data.

In-depth RAG Analysis:

Vectorization (Embedding)

python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
from sentence_transformers import SentenceTransformer

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Document vectorization
documents = [
    "Python is an interpreted programming language",
    "Machine learning is a branch of artificial intelligence", 
    "Deep learning uses neural network architectures"
]

embeddings = model.encode(documents)

Chunking Strategy

python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
def smart_chunking(text, max_length=512, overlap=50):
    """
    Intelligent chunking algorithm:
    1. Natural segmentation by paragraphs
    2. Maintain semantic integrity
    3. Add necessary context markers
    """
    chunks = []
    paragraphs = text.split('\n\n')
    
    for para in paragraphs:
        if len(para) <= max_length:
            chunks.append(para)
        else:
            # Further split by sentences
            sentences = split_into_sentences(para)
            current_chunk = ""
            
            for sentence in sentences:
                if len(current_chunk) + len(sentence) <= max_length:
                    current_chunk += sentence + " "
                else:
                    chunks.append(current_chunk.strip())
                    current_chunk = sentence + " "
            
            if current_chunk:
                chunks.append(current_chunk.strip())
    
    return chunks

Vector Databases

Major vector database comparison:

DatabaseRelease DateFeaturesUse Cases
Pinecone2021.08Cloud-native, easy to useQuick deployment in production
Chroma2022.10Open-source, lightweightDevelopment, testing, local deployment
FAISS2017.03Facebook open-source, high-performanceLarge-scale vector computation
Milvus2019.04Distributed, scalableUltra-large-scale data processing

Reranking

The relevance ranking issue of retrieval results. The most famous discovery is the “Lost in the Middle” problem:

In 2023, Liu et al. (not Gao) discovered: in long documents, the most relevant information often appears at the beginning or end, while information in the middle is easily overlooked.

Solution:

python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
from rank_bm25 import BM25Okapi

def lost_in_middle_fix(documents, query):
    """
    Algorithm to solve "Lost in the Middle" problem
    """
    # 1. BM25 initial ranking
    tokenized_docs = [doc.split() for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    
    # 2. Position weight adjustment
    doc_scores = []
    for i, doc in enumerate(documents):
        # Give higher weights to beginning and end
        position_weight = 1.0
        if i < len(documents) * 0.2 or i > len(documents) * 0.8:
            position_weight = 1.5
        
        score = bm25.get_scores(doc.split(), query)[0] * position_weight
        doc_scores.append((i, score))
    
    # 3. Re-sort
    doc_scores.sort(key=lambda x: x[1], reverse=True)
    return [documents[i] for i, _ in doc_scores]

4. Generation Layer (What the model generates)

Ensures the model’s output meets requirements and is well-structured.

Core Control Mechanisms:

  • Output Constraints: JSON format, word count limits
  • Chain of Thought (CoT) Guidance: Step-by-step reasoning
  • Format Control: Markdown tables, code blocks, etc.

Practical Example:

python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
def structured_generation(prompt, schema):
    """
    Structured generation controller
    """
    # 1. Add format constraints
    constrained_prompt = f"""
Please strictly follow the format below:

{schema['format']}

User Question: {prompt}

Please ensure:
- Use the specified format
- Content is accurate and complete
- Language is clear and easy to understand
"""
    
    # 2. Call the model
    response = call_llm(constrained_prompt)
    
    # 3. Format validation and correction
    if validate_format(response, schema):
        return response
    else:
        return fix_format(response, schema)

Core Challenges in Context Management

Token Budget Allocation

Different information types have vastly different value densities:

python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
def value_based_allocation(context_tokens, total_budget):
    """
    Value-based budget allocation strategy
    """
    priorities = {
        'system_prompt': 0.15,      # System instructions 15%
        'core_context': 0.25,       # Core context 25%
        'recent_history': 0.20,     # Recent history 20%
        'tool_definitions': 0.15,   # Tool definitions 15%
        'retrieved_docs': 0.25      # Retrieved documents 25%
    }
    
    allocated = {}
    remaining = total_budget
    
    for component, ratio in priorities.items():
        component_budget = int(total_budget * ratio)
        allocated[component] = min(component_budget, remaining)
        remaining -= allocated[component]
    
    return allocated

Context Window Evolution

ModelRelease DateContextFeatures
GPT-32020.064KFoundational work
GPT-3.52022.0316KFirst major breakthrough
GPT-42023.0332KCommercially available
Claude 22023.07100KLong text specialist
GPT-4 Turbo2023.11128KPractical long context
DeepSeek-V22024.091MOpen-source long context benchmark
Gemini 1.52024.021MMultimodal long context

Prompt Caching Technology

Anthropic’s Prompt Caching can save 50-90% in costs:

python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
class PromptCache:
    def __init__(self):
        self.cache = {}
    
    def get_cached_prompt(self, prompt_hash):
        return self.cache.get(prompt_hash)
    
    def cache_prompt(self, prompt_hash, prompt_data):
        self.cache[prompt_hash] = {
            'data': prompt_data,
            'timestamp': time.time(),
            'usage_count': 0
        }

Hybrid Strategy: RAG + Long Context

Modern systems typically combine multiple strategies:

python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
hybrid_context_strategy = {
    'short_term': {
        'memory_type': 'working_memory',
        'max_tokens': 2000,
        'refresh_rate': 'turn_by_turn'
    },
    'long_term': {
        'memory_type': 'vector_db', 
        'max_tokens': 8000,
        'refresh_rate': 'hourly'
    },
    'retrieval': {
        'method': 'semantic_search',
        'top_k': 5,
        'reranking': True
    }
}

Getting Started: Building a Document Q&A Bot

Let’s build a complete RAG system – the best practice for understanding Context Engineering.

Step 1: Environment Setup

python
1
2
3
4
5
6
7
# requirements.txt
sentence-transformers==2.2.2
chromadb==0.4.18
langchain==0.1.0
openai==1.3.7
numpy==1.24.3
pandas==2.0.3
bash
1
pip install -r requirements.txt

Step 2: Document Processing

python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
from document_processor import DocumentProcessor

# Initialize processor
processor = DocumentProcessor(
    chunk_size=512,
    chunk_overlap=50,
    embedding_model='all-MiniLM-L6-v2'
)

# Load documents
documents = processor.load_documents([
    'data/company_handbook.pdf',
    'data/tech_specs.md',
    'data/policies.txt'
])

# Process documents
processed_chunks = processor.process_documents(documents)

# Create vector database
vector_db = processor.create_vector_db(processed_chunks)

Step 3: Retrieval System

python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
class DocumentRetriever:
    def __init__(self, vector_db, reranker=True):
        self.vector_db = vector_db
        self.reranker = reranker
        self.bm25 = None
        
        if reranker:
            self.initialize_bm25()
    
    def initialize_bm25(self):
        """Initialize BM25 reranker"""
        from rank_bm25 import BM25Okapi
        
        # Prepare BM25 index
        all_chunks = [chunk['text'] for chunk in self.vector_db.get_all_chunks()]
        tokenized_chunks = [doc.split() for doc in all_chunks]
        self.bm25 = BM25Okapi(tokenized_chunks)
    
    def retrieve(self, query, top_k=5):
        """Two-stage retrieval: vector search + reranking"""
        # Stage 1: Vector search
        vector_results = self.vector_db.search(query, top_k * 2)
        
        # Stage 2: Reranking
        if self.reranker and self.bm25:
            reranked = self.rerank_results(query, vector_results)
            return reranked[:top_k]
        else:
            return vector_results[:top_k]
    
    def rerank_results(self, query, results):
        """Handle Lost in the Middle problem"""
        texts = [r['text'] for r in results]
        scores = self.bm25.get_scores(query.split(), texts)
        
        # Add position weights
        final_scores = []
        for i, (result, score) in enumerate(zip(results, scores)):
            position_weight = 1.0
            if i < len(results) * 0.2 or i > len(results) * 0.8:
                position_weight = 1.3
            
            final_scores.append((result, score * position_weight))
        
        # Re-sort
        final_scores.sort(key=lambda x: x[1], reverse=True)
        return [result for result, _ in final_scores]

Step 4: Q&A System

python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
class RAGSystem:
    def __init__(self, retriever, llm_client):
        self.retriever = retriever
        self.llm_client = llm_client
    
    def generate_response(self, query, context_window=4000):
        """Generate response"""
        # 1. Retrieve relevant documents
        relevant_docs = self.retriever.retrieve(query)
        
        # 2. Build context
        context = self.build_context(relevant_docs, context_window)
        
        # 3. Generate response
        prompt = self.create_prompt(query, context)
        response = self.llm_client.generate(prompt)
        
        return response, relevant_docs
    
    def build_context(self, docs, max_tokens):
        """Intelligent context building"""
        context_parts = []
        used_tokens = 0
        
        # Sort documents by importance
        sorted_docs = self.rank_documents_by_importance(docs)
        
        for doc in sorted_docs:
            if used_tokens + doc['tokens'] <= max_tokens:
                context_parts.append(doc['content'])
                used_tokens += doc['tokens']
            else:
                break
        
        return "\n\n".join(context_parts)
    
    def rank_documents_by_importance(self, docs):
        """Rank documents by importance"""
        # Simple implementation: sort by relevance score
        return sorted(docs, key=lambda x: x['score'], reverse=True)

Step 5: System Integration

python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
def main():
    """Main program"""
    # Initialize components
    processor = DocumentProcessor()
    retriever = DocumentRetriever(processor.vector_db)
    llm_client = OpenAIClient(api_key="your-api-key")
    
    # Create RAG system
    rag_system = RAGSystem(retriever, llm_client)
    
    # Test Q&A
    query = "What is the company's overtime policy?"
    response, sources = rag_system.generate_response(query)
    
    print("Response:", response)
    print("Sources:", sources)

if __name__ == "__main__":
    main()

Limitations and Future

Current Limitations

  1. Passive Information Supply: Context Engineering mainly optimizes “what information to provide” but cannot proactively decide “what information should be provided”
  2. Missing Execution Control: Knowing correct information ≠ correct execution
  3. Insufficient Feedback Loops: Lack of verification and correction mechanisms for execution results
  4. Long-term Autonomy: Cannot handle complex tasks requiring multi-step decision making

Future Development Directions

  1. Active Context Selection: AI autonomously decides what information is needed
  2. Execution Safety Controls: Real-time constraints during execution
  3. Multi-round Feedback Mechanisms: Dynamically adjust context based on execution results
  4. Autonomous Decision-making Capability: Leap from “can answer” to “can act”

Context Engineering is an important milestone in the AI engineering process. It has evolved us from “how to make AI say the right things” to “how to make AI know the right information,” laying the foundation for the next phase of AI Harness Engineering.