What is Context Engineering?
In June 2025, Andrej Karpathy offered a definition of Context Engineering in a post on X (Twitter): “the delicate art and science of filling the context window with just the right information for the next step.”
The definition is concise, and it points at the core distinction from Prompt Engineering:
- Prompt Engineering: Optimizes “what you say” – focuses on how input instructions are expressed
- Context Engineering: Optimizes “what the model knows” – focuses on what information the model can access
Imagine this:
- Prompt Engineering is like refining the instructions you hand to a chef
- Context Engineering is like stocking the chef’s pantry with the right ingredients
From GPT-3 to today’s GPT-4o and Gemini 1.5, model context windows have grown from a few thousand tokens to as many as 1M. This shift has changed our focus from “how to condense prompts” to “how to effectively utilize massive context space.”
Origins and Development
Bill Schilit’s Pioneering Work
The concept of Context Engineering can be traced back to Bill Schilit’s 1995 PhD thesis, “A System Architecture for Context-Aware Mobile Computing,” which developed the idea of Context-Aware Computing:
```mermaid
flowchart LR
    A[User Behavior] --> B[Context Collection]
    B --> C[Context Analysis]
    C --> D[Intelligent Response]
```
Schilit defined three types of context:
- Computational Context: Device status, network conditions
- User Context: Location, time, identity
- Physical Context: Environment, sensor data
Although this work targeted mobile computing, its ideas anticipate much of what later became AI context management.
The Emergence of RAG
In 2020, Lewis et al. published “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” at the NeurIPS conference, formally introducing RAG (Retrieval-Augmented Generation):
```mermaid
flowchart LR
    A[User Query] --> B[Retrieval Module]
    B --> C[Vector Database]
    C --> D[Relevant Documents]
    D --> E[Large Model]
    E --> F[Enhanced Response]
```
The core innovation of RAG: combining an external knowledge base with a large model, so that knowledge can be updated without retraining and answers can be grounded in retrieved text, which mitigates (but does not eliminate) hallucination.
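The retrieve-then-generate flow can be sketched in a few lines. This is a toy illustration and not the method from the paper: it swaps the dense retriever for bag-of-words cosine similarity, and `bag_of_words`, `retrieve`, `build_prompt`, and the sample corpus are hypothetical names introduced here.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, corpus, top_k=2):
    """Return the top_k corpus entries most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, bag_of_words(doc)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, docs):
    """Place the retrieved passages ahead of the question."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Context:\n{context}\n\nQuestion: {query}"


corpus = [
    "Employees may work remotely up to three days per week.",
    "Overtime must be approved by a manager in advance.",
    "The cafeteria is open from 8 am to 3 pm.",
]
top_docs = retrieve("Who approves overtime?", corpus, top_k=1)
prompt = build_prompt("Who approves overtime?", top_docs)
```

The resulting `prompt` is what would be sent to the model; only the retrieval step decides which passages the model gets to see.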
Formal Introduction of Context Engineering
In June 2025, Karpathy strongly advocated for the term on X (Twitter), and it spread rapidly through the AI engineering community. In September of the same year, Anthropic published its engineering guide “Effective context engineering for AI agents,” treating Context Engineering as an engineering discipline in its own right.
Four Pillars
1. Knowledge Layer (What the model knows)
This is the foundational layer of Context Engineering, determining the model’s basic cognitive framework.
Core Components:
- System Prompts: Define the model’s basic behavioral guidelines
- Tool Definitions: Specifications for callable tools
- Pre-training Knowledge: The model’s inherent capabilities
Design Principles:
- Minimization: Avoid unnecessary redundant information
- Structured: Use clear formats for easy parsing
- Hierarchical: Core knowledge first, specialized knowledge loaded on demand
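One way to read these three principles together is as a small prompt assembler: a short core section that is always present, plus specialized sections appended only when the task calls for them. The sketch below is illustrative only; `CORE`, `SPECIALIZED`, and `build_system_prompt` are hypothetical names and not part of any library.

```python
CORE = "You are a software development assistant. Be concise and accurate."

# Specialized knowledge, keyed by topic and loaded only on demand
SPECIALIZED = {
    "security": "Validate all input and never log secrets.",
    "performance": "Measure before optimizing; prefer simple data structures.",
}


def build_system_prompt(topics):
    """Core section first, then one tagged section per requested topic."""
    sections = [f"<core>\n{CORE}\n</core>"]
    for topic in topics:
        if topic in SPECIALIZED:  # skip unknown topics instead of padding the prompt
            sections.append(f"<{topic}>\n{SPECIALIZED[topic]}\n</{topic}>")
    return "\n".join(sections)


prompt = build_system_prompt(["security"])
```

The tags keep the prompt easy to parse, and a request that needs no specialized knowledge pays only for the core section.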
Practical Example:
```python
# Excellent system prompt design
system_prompt = """
You are a professional software development assistant, specializing in:
1. Code Quality: Readability, maintainability, performance
2. Best Practices: Design patterns, architectural principles, coding standards
3. Security Considerations: Input validation, error handling, access control
Please respond in Chinese, keeping professional terms in English.
"""
```
2. Memory Layer (What the model remembers)
Manages how information is stored, retained, and recalled over the course of a conversation.
Memory Types:
- Short-term Memory: Current conversation history
- Long-term Memory: Persistent memory in vector databases
- Working Memory: State information for current tasks
KV Cache Basics:
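Modern large models use a KV cache to store the attention keys and values already computed for earlier tokens, so each new token only requires computing its own key and value: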
```
Token 1 → Key1, Value1
Token 2 → Key2, Value2
Token 3 → Key3, Value3
...
Token N → KeyN, ValueN
```
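To make the idea concrete, here is a minimal single-head attention sketch in plain Python. It assumes random projection matrices in place of trained weights, and `matvec` and `KVCache` are hypothetical names introduced for illustration; real inference engines add batching, multiple heads, and careful memory management. The point is only that each decoding step computes one new key/value pair and reuses the rest.

```python
import math
import random


def matvec(x, w):
    """Row vector x (length n) times matrix w (n x m)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


class KVCache:
    """Single-head attention that stores the keys and values of past tokens."""

    def __init__(self, d_model, seed=0):
        rng = random.Random(seed)

        def rand_matrix():
            # Random projections stand in for trained weights
            return [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_model)]

        self.w_q, self.w_k, self.w_v = rand_matrix(), rand_matrix(), rand_matrix()
        self.keys, self.values = [], []
        self.projections = 0  # number of key/value projections actually computed

    def step(self, x):
        """Attend from the new token x to all cached tokens plus itself."""
        # Only the new token's key and value are computed; older ones are reused
        self.keys.append(matvec(x, self.w_k))
        self.values.append(matvec(x, self.w_v))
        self.projections += 1
        q = matvec(x, self.w_q)
        scale = math.sqrt(len(x))
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in self.keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]  # softmax over cached positions
        # Weighted sum of the cached values
        return [sum(w * v[j] for w, v in zip(weights, self.values)) for j in range(len(x))]


cache = KVCache(d_model=4)
tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
outputs = [cache.step(t) for t in tokens]
```

After three steps the cache holds three key/value pairs and has performed exactly three projections; without the cache, step three alone would have had to recompute all three.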
Optimization Strategies:
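```python
# Intelligent memory compression
def rank_by_importance(memory_chunks):
    """Placeholder ranking: assumes each chunk may carry an 'importance' score."""
    return sorted(memory_chunks, key=lambda c: c.get('importance', 0), reverse=True)


def compress_memory(memory_chunks, max_tokens=4000):
    """
    Compress memory based on importance ranking.

    Each chunk is a dict with a 'tokens' count.
    """
    # 1. Rank by importance
    ranked_chunks = rank_by_importance(memory_chunks)
    # 2. Keep the highest-ranked chunks until the token budget is used up
    compressed = []
    current_tokens = 0
    for chunk in ranked_chunks:
        if current_tokens + chunk['tokens'] <= max_tokens:
            compressed.append(chunk)
            current_tokens += chunk['tokens']
        else:
            # Stop at the first chunk that does not fit
            break
    return compressed
```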
3. Retrieval Layer (What the model retrieves)
This is the most complex yet critical component of Context Engineering, responsible for finding the most relevant information from massive data.
In-depth RAG Analysis:
Vectorization (Embedding)
```python
from sentence_transformers import SentenceTransformer

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Document vectorization
documents = [
    "Python is an interpreted programming language",
    "Machine learning is a branch of artificial intelligence",
    "Deep learning uses neural network architectures"
]
embeddings = model.encode(documents)
```
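Once documents are embedded, retrieval reduces to a nearest-neighbour search over vectors. The sketch below uses hand-written three-dimensional vectors in place of real model output so that it runs without downloading a model; `cosine_similarity` and `top_k` are hypothetical helpers, and the vector values are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings"; real ones from all-MiniLM-L6-v2 have 384 dimensions
doc_vecs = [
    [0.9, 0.1, 0.0],  # programming languages
    [0.1, 0.9, 0.2],  # machine learning
    [0.0, 0.8, 0.6],  # deep learning
]
query_vec = [0.0, 0.7, 0.7]  # a query about neural networks
best = top_k(query_vec, doc_vecs, k=2)
```

A vector database performs the same comparison, but with an index that avoids scoring every stored vector on each query.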
Chunking Strategy
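```python
import re


def split_into_sentences(text):
    """Naive sentence splitter: breaks after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def smart_chunking(text, max_length=512, overlap=50):
    """
    Paragraph-first chunking:
    1. Natural segmentation by paragraphs (blank lines)
    2. Maintain semantic integrity by splitting long paragraphs on sentences
    Lengths are measured in characters. `overlap` is kept in the signature,
    but this simplified version does not apply it.
    """
    chunks = []
    paragraphs = text.split('\n\n')
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue  # skip empty paragraphs
        if len(para) <= max_length:
            chunks.append(para)
        else:
            # Further split by sentences
            sentences = split_into_sentences(para)
            current_chunk = ""
            for sentence in sentences:
                if len(current_chunk) + len(sentence) <= max_length:
                    current_chunk += sentence + " "
                else:
                    # Avoid emitting an empty chunk when the first sentence is too long
                    if current_chunk:
                        chunks.append(current_chunk.strip())
                    current_chunk = sentence + " "
            if current_chunk:
                chunks.append(current_chunk.strip())
    return chunks
```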
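Overlap between neighbouring chunks, which the `overlap` parameter refers to, is easiest to see with a fixed-size sliding window. This is a simplified character-based sketch; `sliding_window_chunks` is a hypothetical helper, and production systems usually count tokens instead of characters.

```python
def sliding_window_chunks(text, max_length=512, overlap=50):
    """Fixed-size chunks; each repeats the last `overlap` characters of the previous one."""
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be non-negative and smaller than max_length")
    step = max_length - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_length])
        if start + max_length >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = sliding_window_chunks("abcdefghij", max_length=4, overlap=2)
```

Here each chunk shares two characters with its predecessor, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.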
Vector Databases
Major vector database comparison:
| Database | Release Date | Features | Use Cases |
|---|---|---|---|
| Pinecone | 2021.08 | Cloud-native, easy to use | Quick deployment in production |
| Chroma | 2022.10 | Open-source, lightweight | Development, testing, local deployment |
| FAISS | 2017.03 | Facebook open-source, high-performance | Large-scale vector computation |
| Milvus | 2019.04 | Distributed, scalable | Ultra-large-scale data processing |
Reranking
Reranking reorders retrieved results by relevance before they reach the model. A well-known finding in this area is the “Lost in the Middle” problem:
In 2023, Liu et al. found that language models use information best when it appears at the beginning or end of a long context, and that accuracy drops when the relevant information sits in the middle.
Solution:
```python
from rank_bm25 import BM25Okapi


def lost_in_middle_fix(documents, query):
    """
    Heuristic for the "Lost in the Middle" problem:
    BM25 relevance combined with a position weight
    """
    # 1. BM25 initial ranking
    tokenized_docs = [doc.split() for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    # get_scores() takes the tokenized query and returns one score per document
    scores = bm25.get_scores(query.split())
    # 2. Position weight adjustment
    doc_scores = []
    for i, score in enumerate(scores):
        # Give higher weights to beginning and end
        position_weight = 1.0
        if i < len(documents) * 0.2 or i > len(documents) * 0.8:
            position_weight = 1.5
        doc_scores.append((i, score * position_weight))
    # 3. Re-sort
    doc_scores.sort(key=lambda x: x[1], reverse=True)
    return [documents[i] for i, _ in doc_scores]
```
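A complementary mitigation works on the order of passages in the prompt instead of their scores: given passages already sorted from most to least relevant, place the strongest ones at the two ends of the context and let the weakest fall in the middle. The helper below is a hypothetical sketch of that idea (`reorder_for_long_context` is a name introduced here, not a library function).

```python
def reorder_for_long_context(ranked_docs):
    """Interleave a best-first list so the top items sit at the start and the end."""
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        # Even ranks go to the front half, odd ranks to the back half
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    # Reverse the back half so the second-best document ends up last
    return front + back[::-1]


ordered = reorder_for_long_context(["d1", "d2", "d3", "d4", "d5"])
```

With five passages ranked `d1` (best) to `d5` (worst), the best passage opens the context, the second best closes it, and the weakest lands in the middle, where the model is least likely to use it anyway.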
4. Generation Layer (What the model generates)
Ensures the model’s output meets requirements and is well-structured.
Core Control Mechanisms:
- Output Constraints: JSON format, word count limits
- Chain of Thought (CoT) Guidance: Step-by-step reasoning
- Format Control: Markdown tables, code blocks, etc.
Practical Example:
```python
def structured_generation(prompt, schema, call_llm, validate_format, fix_format):
    """
    Structured generation controller.

    The three helpers are supplied by the caller:
    call_llm(prompt) -> str
    validate_format(text, schema) -> bool
    fix_format(text, schema) -> str
    """
    # 1. Add format constraints
    constrained_prompt = f"""
    Please strictly follow the format below:
    {schema['format']}
    User Question: {prompt}
    Please ensure:
    - Use the specified format
    - Content is accurate and complete
    - Language is clear and easy to understand
    """
    # 2. Call the model
    response = call_llm(constrained_prompt)
    # 3. Format validation and correction
    if validate_format(response, schema):
        return response
    return fix_format(response, schema)
```
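For JSON output, the validation step can be as simple as parsing the reply and checking for required keys. The sketch below assumes a hypothetical schema shape, a dict with a `required` list of key names; it is an illustration and not a full JSON Schema validator, and both function names are introduced here.

```python
import json


def validate_json_reply(reply, schema):
    """True if reply parses as a JSON object containing every required key."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in schema["required"])


def extract_json_object(reply):
    """Best-effort repair: keep the text between the first '{' and the last '}'."""
    start, end = reply.find("{"), reply.rfind("}")
    return reply[start:end + 1] if start != -1 and end > start else reply


schema = {"required": ["answer", "confidence"]}
raw = 'Sure! Here is the result: {"answer": "42", "confidence": 0.9}'
fixed = extract_json_object(raw)
```

The raw reply fails validation because of the chatty preamble, while the extracted object passes. When repair is not possible, retrying the model call with the validation error appended is a common fallback.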
Core Challenges in Context Management
Token Budget Allocation
Different information types have vastly different value densities:
```python
def value_based_allocation(context_tokens, total_budget):
    """
    Value-based budget allocation strategy.

    Note: context_tokens is not used by this fixed-ratio version;
    the split depends only on total_budget.
    """
    priorities = {
        'system_prompt': 0.15,     # System instructions 15%
        'core_context': 0.25,      # Core context 25%
        'recent_history': 0.20,    # Recent history 20%
        'tool_definitions': 0.15,  # Tool definitions 15%
        'retrieved_docs': 0.25     # Retrieved documents 25%
    }
    allocated = {}
    remaining = total_budget
    for component, ratio in priorities.items():
        component_budget = int(total_budget * ratio)
        allocated[component] = min(component_budget, remaining)
        remaining -= allocated[component]
    return allocated
```
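To see what such an allocation means in practice, the sketch below applies per-component ratios to actual content by truncating each section to its share. It is a simplified illustration: `fit_to_budget` is a hypothetical helper, and whitespace-separated words stand in for tokens, which a real tokenizer would count differently.

```python
def fit_to_budget(sections, ratios, total_budget):
    """Trim each section to its share of the budget (words approximate tokens)."""
    fitted = {}
    for name, text in sections.items():
        budget = int(total_budget * ratios.get(name, 0))
        words = text.split()
        # Keep the first `budget` words; a section without a ratio gets nothing
        fitted[name] = " ".join(words[:budget])
    return fitted


ratios = {"system_prompt": 0.25, "retrieved_docs": 0.75}
sections = {
    "system_prompt": "You are a helpful assistant",
    "retrieved_docs": "one two three four five six seven eight",
}
fitted = fit_to_budget(sections, ratios, total_budget=8)
```

Plain truncation is the crudest way to enforce a budget; summarizing or dropping whole low-value items usually preserves more meaning per token.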
Context Window Evolution
| Model | Release Date | Context | Features |
|---|---|---|---|
| GPT-3 | 2020.06 | 2K | Foundational work |
| GPT-3.5 | 2022.03 | 4K (16K from 2023.06) | First major breakthrough |
| GPT-4 | 2023.03 | 8K / 32K | Commercially available |
| Claude 2 | 2023.07 | 100K | Long text specialist |
| GPT-4 Turbo | 2023.11 | 128K | Practical long context |
| Gemini 1.5 | 2024.02 | 1M | Multimodal long context |
| DeepSeek-V2 | 2024.05 | 128K | Open-source long context |
Prompt Caching Technology
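Anthropic’s Prompt Caching can cut the cost of cached input tokens by up to 90% for long, repeated prompt prefixes. The idea can be illustrated with a simple in-process cache: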
```python
import time


class PromptCache:
    """In-process illustration of the caching idea (not Anthropic's API)."""

    def __init__(self):
        self.cache = {}

    def get_cached_prompt(self, prompt_hash):
        entry = self.cache.get(prompt_hash)
        if entry is not None:
            entry['usage_count'] += 1  # record the cache hit
        return entry

    def cache_prompt(self, prompt_hash, prompt_data):
        self.cache[prompt_hash] = {
            'data': prompt_data,
            'timestamp': time.time(),
            'usage_count': 0
        }
```
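Caching only pays off when the reusable part of the prompt is byte-for-byte stable, so a cache key is best derived from the static prefix (system prompt, tool definitions) and not from the per-request question. A minimal sketch with `hashlib` follows; `prefix_cache_key` is a hypothetical helper name, and the tool definitions are made-up sample data.

```python
import hashlib
import json


def prefix_cache_key(system_prompt, tool_definitions):
    """Stable SHA-256 key for the static part of a prompt."""
    # sort_keys makes the serialization independent of dict key order
    payload = json.dumps(
        {"system": system_prompt, "tools": tool_definitions},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


tools_a = [{"name": "search", "description": "Search the handbook"}]
tools_b = [{"description": "Search the handbook", "name": "search"}]
key_a = prefix_cache_key("You are a helpful assistant.", tools_a)
key_b = prefix_cache_key("You are a helpful assistant.", tools_b)
```

Both keys are identical even though the dicts were written in a different key order, while any change to the system prompt produces a new key and therefore a cache miss.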
Hybrid Strategy: RAG + Long Context
Modern systems typically combine multiple strategies:
```python
hybrid_context_strategy = {
    'short_term': {
        'memory_type': 'working_memory',
        'max_tokens': 2000,
        'refresh_rate': 'turn_by_turn'
    },
    'long_term': {
        'memory_type': 'vector_db',
        'max_tokens': 8000,
        'refresh_rate': 'hourly'
    },
    'retrieval': {
        'method': 'semantic_search',
        'top_k': 5,
        'reranking': True
    }
}
```
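A configuration like this only matters once something reads it. The sketch below shows one hypothetical consumer, `assemble_context`, which caps recent turns and retrieved passages by the configured limits. Word counts stand in for tokens, and the function names and sample data are assumptions made for illustration.

```python
def take_within_budget(items, max_tokens):
    """Keep items in order until the (word-count) budget is exhausted."""
    kept, used = [], 0
    for item in items:
        cost = len(item.split())
        if used + cost > max_tokens:
            break
        kept.append(item)
        used += cost
    return kept


def assemble_context(config, recent_turns, retrieved_docs):
    """Combine recent turns and retrieved passages under the configured limits."""
    # Walk the turns newest-first so the most recent survive, then restore order
    newest_first = list(reversed(recent_turns))
    turns = take_within_budget(newest_first, config["short_term"]["max_tokens"])
    turns.reverse()
    docs = retrieved_docs[: config["retrieval"]["top_k"]]
    docs = take_within_budget(docs, config["long_term"]["max_tokens"])
    return {"history": turns, "documents": docs}


config = {
    "short_term": {"max_tokens": 6},
    "long_term": {"max_tokens": 10},
    "retrieval": {"top_k": 2},
}
context = assemble_context(
    config,
    recent_turns=["hello there", "what is the overtime policy", "please be brief"],
    retrieved_docs=["overtime needs approval", "remote work rules", "cafeteria hours"],
)
```

With these tiny limits only the latest turn fits in the history budget, and `top_k` trims the retrieved passages to two before the token limit is even checked.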
Getting Started: Building a Document Q&A Bot
Let’s build a complete RAG system – the best practice for understanding Context Engineering.
Step 1: Environment Setup
```
# requirements.txt
sentence-transformers==2.2.2
chromadb==0.4.18
langchain==0.1.0
openai==1.3.7
numpy==1.24.3
pandas==2.0.3
rank-bm25==0.2.2
```
```bash
pip install -r requirements.txt
```
Step 2: Document Processing
```python
# document_processor is assumed to be a local helper module of this project;
# it is not defined in this article and is not a published package
from document_processor import DocumentProcessor

# Initialize processor
processor = DocumentProcessor(
    chunk_size=512,
    chunk_overlap=50,
    embedding_model='all-MiniLM-L6-v2'
)

# Load documents
documents = processor.load_documents([
    'data/company_handbook.pdf',
    'data/tech_specs.md',
    'data/policies.txt'
])

# Process documents
processed_chunks = processor.process_documents(documents)

# Create vector database
vector_db = processor.create_vector_db(processed_chunks)
```
Step 3: Retrieval System
```python
from rank_bm25 import BM25Okapi


class DocumentRetriever:
    def __init__(self, vector_db, reranker=True):
        # vector_db is assumed to expose search(query, k), returning a list of
        # dicts with a 'text' key, best match first
        self.vector_db = vector_db
        self.reranker = reranker

    def retrieve(self, query, top_k=5):
        """Two-stage retrieval: vector search + reranking"""
        # Stage 1: Vector search (over-fetch so the reranker has candidates)
        vector_results = self.vector_db.search(query, top_k * 2)
        # Stage 2: Reranking
        if self.reranker and vector_results:
            reranked = self.rerank_results(query, vector_results)
            return reranked[:top_k]
        return vector_results[:top_k]

    def rerank_results(self, query, results):
        """Rescore candidates with BM25 plus a position weight"""
        texts = [r['text'] for r in results]
        # BM25Okapi.get_scores() takes only the tokenized query and returns one
        # score per indexed document, so the index is built over the candidates
        bm25 = BM25Okapi([text.split() for text in texts])
        scores = bm25.get_scores(query.split())
        # Add position weights
        final_scores = []
        for i, (result, score) in enumerate(zip(results, scores)):
            position_weight = 1.0
            if i < len(results) * 0.2 or i > len(results) * 0.8:
                position_weight = 1.3
            final_scores.append((result, score * position_weight))
        # Re-sort
        final_scores.sort(key=lambda x: x[1], reverse=True)
        return [result for result, _ in final_scores]
```
Step 4: Q&A System
```python
class RAGSystem:
    def __init__(self, retriever, llm_client):
        self.retriever = retriever
        # llm_client is assumed to expose generate(prompt) -> str
        self.llm_client = llm_client

    def generate_response(self, query, context_window=4000):
        """Generate response"""
        # 1. Retrieve relevant documents
        relevant_docs = self.retriever.retrieve(query)
        # 2. Build context
        context = self.build_context(relevant_docs, context_window)
        # 3. Generate response
        prompt = self.create_prompt(query, context)
        response = self.llm_client.generate(prompt)
        return response, relevant_docs

    def create_prompt(self, query, context):
        """Combine the retrieved context and the question into one prompt"""
        return (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}"
        )

    def build_context(self, docs, max_tokens):
        """Intelligent context building"""
        context_parts = []
        used_tokens = 0
        # Sort documents by importance
        sorted_docs = self.rank_documents_by_importance(docs)
        for doc in sorted_docs:
            # Fall back to a rough 4-characters-per-token estimate
            tokens = doc.get('tokens', len(doc['text']) // 4)
            if used_tokens + tokens <= max_tokens:
                context_parts.append(doc['text'])
                used_tokens += tokens
            else:
                break
        return "\n\n".join(context_parts)

    def rank_documents_by_importance(self, docs):
        """Rank documents by importance"""
        # Simple implementation: sort by relevance score when one is present
        return sorted(docs, key=lambda x: x.get('score', 0), reverse=True)
```
Step 5: System Integration
```python
import os


def main():
    """Main program"""
    # Initialize components
    # DocumentProcessor and OpenAIClient are this project's own wrappers (not
    # library classes); the processor is assumed to hold the vector database
    # built in Step 2 as processor.vector_db
    processor = DocumentProcessor()
    retriever = DocumentRetriever(processor.vector_db)
    # Read the API key from the environment instead of hard-coding it
    llm_client = OpenAIClient(api_key=os.environ["OPENAI_API_KEY"])
    # Create RAG system
    rag_system = RAGSystem(retriever, llm_client)
    # Test Q&A
    query = "What is the company's overtime policy?"
    response, sources = rag_system.generate_response(query)
    print("Response:", response)
    print("Sources:", sources)


if __name__ == "__main__":
    main()
```
Limitations and Future
Current Limitations
- Passive Information Supply: Context Engineering optimizes “what information to provide,” but that choice is made for the model in advance; the model cannot proactively decide what information it needs
- Missing Execution Control: Knowing correct information ≠ correct execution
- Insufficient Feedback Loops: Lack of verification and correction mechanisms for execution results
- Limited Long-term Autonomy: Cannot handle complex tasks requiring multi-step decision making
Future Development Directions
- Active Context Selection: AI autonomously decides what information is needed
- Execution Safety Controls: Real-time constraints during execution
- Multi-round Feedback Mechanisms: Dynamically adjust context based on execution results
- Autonomous Decision-making Capability: Leap from “can answer” to “can act”
Context Engineering is an important milestone in AI engineering. It has moved the focus from “how to make AI say the right things” to “how to make AI know the right information,” laying the foundation for the next phase: AI Harness Engineering.