From Context to Harness: Info Is Ready, But AI Is Still Unreliable

Scenario: Information Is Correct, But Execution Goes Wrong

Let’s start with a real-world story:

Background: A company deployed a RAG-based technical documentation Q&A system. This system worked perfectly—when users asked “How to configure Redis cluster?” it could accurately retrieve relevant information from technical documents and provide detailed configuration steps.

Problem: When a user asked “Delete temporary files in the test directory,” the system correctly retrieved the right technical documentation, but during execution it mistakenly deleted the entire project’s core code.

Result: Technical knowledge transfer was perfect, but the execution result was catastrophic.

This scenario reveals a critical issue: Context Engineering solved the knowledge problem but not the execution problem.

The Core Problem

Two Dimensions of Challenges

Context Engineering gives the model the right information, but it does not control what the model does with that information. That leaves two dimensions of challenge:

1. Safety Dimension (What NOT to do)

  • Permission Boundaries: What the model should and shouldn’t do
  • Safety Red Lines: Absolutely prohibited operations
  • Compliance Requirements: Legal regulations and company policies

2. Reliability Dimension (How to verify)

  • Result Verification: How to judge if execution results are correct
  • Error Detection: Mechanisms to detect execution anomalies
  • Correction Capability: Remediation when problems occur
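
To make the two dimensions concrete, here is a minimal sketch of how the "delete temporary files in the test directory" request from the opening scenario could be guarded. The function name `safe_delete_temp_files`, the `allowed_root` parameter, and the `*.tmp` pattern are assumptions made for illustration, not part of any particular system. The boundary check stands for the safety dimension, and the post-deletion check stands for the reliability dimension.

```python
from pathlib import Path


def safe_delete_temp_files(
    target_dir: Path, allowed_root: Path, pattern: str = "*.tmp"
) -> list[Path]:
    """Delete files matching `pattern` in `target_dir`, but only inside `allowed_root`."""
    root = allowed_root.resolve()
    target = target_dir.resolve()

    # Safety dimension: refuse anything outside the permitted boundary.
    if target != root and root not in target.parents:
        raise PermissionError(f"{target} is outside the allowed root {root}")

    deleted = []
    for path in target.glob(pattern):
        if path.is_file():
            path.unlink()
            deleted.append(path)

    # Reliability dimension: verify the result instead of trusting the action.
    leftovers = [p for p in deleted if p.exists()]
    if leftovers:
        raise RuntimeError(f"Deletion could not be verified for: {leftovers}")
    return deleted
```

With a guard like this, a request that resolves to the project root instead of the test directory fails with `PermissionError` before anything is removed.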

Knowledge vs. Execution Differences

| Dimension | Context Engineering | Execution Challenges |
|---|---|---|
| Goal | Provide correct information | Control correct execution |
| Difficulty | Information retrieval and integration | Behavioral constraints and verification |
| Focus point | Information quality | Behavioral safety |
| Method | Optimize context | Design constraint systems |

The Maturity of Tool Calling Capabilities

Between late 2022 and 2023, the tool calling capabilities of AI systems evolved rapidly, which directly drove the shift in engineering paradigm.

OpenAI Function Calling (June 2023)

OpenAI officially launched Function Calling in June 2023:

python
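# OpenAI Function Calling example, written against the legacy (pre-1.0) openai SDK.
# Newer SDK versions use `tools` and `tool_choice` in place of `functions` and `function_call`.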
import openai

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "What's the weather like in Boston?"}
    ],
    functions=[
        {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    ],
    function_call="auto"
)

This breakthrough allowed models to:

  • Understand tool purposes: Infer what a tool does from its function description
  • Parameter parsing: Extract structured arguments from user intent
  • Execution coordination: Request external tool calls on demand; the model returns a function name and JSON arguments, and the application performs the actual call
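
Because the model only proposes a call, the application decides whether and how to run it. The sketch below shows one way that dispatch step might look, assuming the proposal arrives as a dict with a `name` and a JSON-string `arguments` field (the shape used by the legacy function-calling response). The `TOOL_REGISTRY` allowlist, `dispatch_function_call`, and the stubbed `get_current_weather` body are hypothetical names for illustration.

```python
import json


def get_current_weather(location: str, unit: str = "celsius") -> dict:
    """Hypothetical stand-in for a real weather lookup."""
    return {"location": location, "unit": unit, "forecast": "unknown"}


# Allowlist: tool name -> (callable, set of required argument names).
TOOL_REGISTRY = {"get_current_weather": (get_current_weather, {"location"})}


def dispatch_function_call(function_call: dict) -> dict:
    """Validate and execute a model-proposed call of the form
    {"name": ..., "arguments": "<JSON string>"}."""
    name = function_call["name"]
    if name not in TOOL_REGISTRY:
        raise ValueError(f"Unknown tool: {name}")
    func, required = TOOL_REGISTRY[name]

    # Arguments arrive as a JSON string and may be malformed or incomplete.
    args = json.loads(function_call["arguments"])
    missing = required - set(args)
    if missing:
        raise ValueError(f"Missing required arguments: {sorted(missing)}")
    return func(**args)
```

This dispatch step is the natural place to attach constraints: anything not in the registry never runs, no matter what the model asks for.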

ReAct Pattern (2022-2023)

The ReAct (Reasoning + Acting) pattern was proposed by Yao et al. in a 2022 preprint and published at ICLR 2023:

mermaid
flowchart TB
    A[User Question] --> B[Think]
    B --> C{Need Tool?}
    C -->|Yes| D[Call Tool]
    C -->|No| E[Direct Answer]
    D --> F[Observe Result]
    F --> B
    E --> G[Final Answer]

Core innovations of ReAct:

  • Reasoning loop: Complete closed loop of think-act-observe
  • Tool orchestration: Ordered calls of multiple tools
  • Result integration: Integrate tool results into final answers
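
The think-act-observe loop can be sketched in a few lines. This is a toy version under stated assumptions: `think` stands in for an LLM call, `scripted_think` is a hard-coded policy, and `REACT_TOOLS` holds a single fake lookup tool. None of these names come from the ReAct paper.

```python
from typing import Callable


def react_loop(
    question: str,
    think: Callable[[list[str]], dict],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Run think -> act -> observe until the policy returns a final answer."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = think(transcript)  # stands in for an LLM call
        transcript.append(f"Thought: {step['thought']}")
        if "answer" in step:
            return step["answer"]
        observation = tools[step["tool"]](step["input"])
        transcript.append(f"Action: {step['tool']}[{step['input']}]")
        transcript.append(f"Observation: {observation}")
    raise RuntimeError("No final answer within the step budget")


def scripted_think(transcript: list[str]) -> dict:
    """Hypothetical policy: look up once, then answer from the observation."""
    observations = [t for t in transcript if t.startswith("Observation: ")]
    if not observations:
        return {
            "thought": "I need to look this up.",
            "tool": "lookup",
            "input": "redis default port",
        }
    return {
        "thought": "I have what I need.",
        "answer": observations[-1].removeprefix("Observation: "),
    }


REACT_TOOLS = {
    "lookup": lambda query: {"redis default port": "6379"}.get(query, "not found")
}
```

Calling `react_loop("What port does Redis use by default?", scripted_think, REACT_TOOLS)` takes one tool step and then answers from the observation. The `max_steps` budget is a small constraint in its own right, since it keeps the loop from running forever.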

Toolformer (February 2023)

Toolformer was proposed by Schick et al. in a February 2023 preprint and published at NeurIPS 2023:

python
# Toolformer autonomously learning to use tools (illustrative pseudocode:
# identify_tool_needs and integrate_results are placeholders, not the paper's code)
class Toolformer:
    def __init__(self, model, tools):
        self.model = model
        self.tools = tools

    def learn_tool_usage(self, training_data):
        # Start from the original data so a dataset is returned even if no tool is needed
        enhanced_data = list(training_data)

        # 1. Identify scenarios needing tools
        tool_needs = self.identify_tool_needs(training_data)

        # 2. Learn calling patterns autonomously
        for need in tool_needs:
            tool_call = self.model.generate_tool_call(need)
            result = self.tools.execute(tool_call)

            # 3. Integrate the call and its result into the training data
            enhanced_data = self.integrate_results(tool_call, result, enhanced_data)

        return enhanced_data

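The significance of Toolformer:

  • Autonomous learning: The model learns when and how to call tools in a self-supervised way, starting from a handful of demonstrations per tool
  • Tool set: The paper works with a small fixed set of APIs (calculator, question answering, search, translation, calendar); the approach is meant to carry over to other tools that can be called through text
  • Context awareness: Tool calls are inserted where the surrounding text suggests they will help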
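
The "learns by itself" part rests on a filtering step: a sampled tool call is kept as training data only if seeing its result makes the following tokens easier to predict. The function below is a simplified sketch of that criterion as I understand it from the paper. The function and parameter names are mine, and the loss values would come from the language model in a real pipeline.

```python
def keep_api_call(
    loss_with_result: float,
    loss_no_call: float,
    loss_call_without_result: float,
    threshold: float,
) -> bool:
    """Keep a sampled call only if its result lowers the loss on the following
    tokens by at least `threshold`, compared with the better of
    (a) no call at all and (b) the call without its result."""
    baseline = min(loss_no_call, loss_call_without_result)
    return baseline - loss_with_result >= threshold
```

For example, `keep_api_call(1.0, 2.5, 2.0, 1.0)` keeps the call, while `keep_api_call(1.8, 2.5, 2.0, 1.0)` discards it because the result barely helps.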

AutoGPT (March 2023)

AutoGPT was one of the first widely known autonomous Agent frameworks:

python
# AutoGPT-style autonomous execution mode (illustrative sketch: decompose_objective,
# create_execution_plan, execute_task, and evaluate_completion are placeholders)
class AutoAgent:
    def __init__(self, name, objective):
        self.name = name
        self.objective = objective
        self.tasks = []
        self.completed_tasks = {}  # maps each finished task to its result

    def generate_plan(self):
        # 1. Decompose objective into tasks
        self.tasks = self.decompose_objective(self.objective)

        # 2. Generate execution plan
        return self.create_execution_plan(self.tasks)

    def execute_plan(self):
        # 3. Execute task sequence autonomously, skipping tasks already done
        #    (tasks must be hashable, e.g. strings, to serve as dict keys)
        for task in self.tasks:
            if task not in self.completed_tasks:
                self.completed_tasks[task] = self.execute_task(task)

        return self.evaluate_completion()

The Emergence of New Requirements

Shift from “Can Answer” to “Can Execute”

With the maturation of tool calling capabilities, the focus of AI systems has undergone a fundamental shift:

| Phase | Focus | Key Question |
|---|---|---|
| Early | Can it answer | “Does it work?” |
| Prompt Engineering | How to answer better | “How to make it better?” |
| Context Engineering | What information to know | “What should it know?” |
| Agent Era | Can it execute safely | “Can it act safely?” |

Complexity of Execution Scenarios

Modern AI Agents face increasingly complex execution scenarios:

1. File System Operations

python
# Risks in file operations
file_operations = {
    "safe": ["read_file", "list_directory", "create_file"],
    "dangerous": ["delete_directory", "modify_system_file", "execute_script"]
}

2. Network Access

python
# Risks in network operations
network_operations = {
    "safe": ["fetch_public_data", "send_api_request"],
    "dangerous": ["access_internal_system", "modify_database", "exfiltrate_data"]
}

3. Code Execution

python
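# Risks in code execution (illustrative; "safe" here assumes a sandbox,
# since unrestricted Python can do anything a system command can)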
code_execution = {
    "safe": ["run_python_code", "execute_query"],
    "dangerous": ["system_command", "eval_user_input", "import_untrusted_module"]
}
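
One way to turn these risk lists into behavior is a small policy gate that runs before every tool call. The sketch below reuses the operation names from the three examples above. The `Decision` enum, the `gate` function, and the deny-by-default rule for unknown operations are assumptions made for illustration.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_APPROVAL = "require_approval"
    DENY = "deny"


SAFE_OPERATIONS = {
    "read_file", "list_directory", "create_file",
    "fetch_public_data", "send_api_request",
    "run_python_code", "execute_query",
}
DANGEROUS_OPERATIONS = {
    "delete_directory", "modify_system_file", "execute_script",
    "access_internal_system", "modify_database", "exfiltrate_data",
    "system_command", "eval_user_input", "import_untrusted_module",
}


def gate(operation: str, human_approved: bool = False) -> Decision:
    """Decide whether an operation may run before the agent executes it."""
    if operation in SAFE_OPERATIONS:
        return Decision.ALLOW
    if operation in DANGEROUS_OPERATIONS:
        # Dangerous operations need an explicit human sign-off.
        return Decision.ALLOW if human_approved else Decision.REQUIRE_APPROVAL
    # Anything the policy has never heard of is denied by default.
    return Decision.DENY
```

Here `gate("read_file")` is allowed outright, `gate("delete_directory")` pauses for approval, and an unlisted operation such as `gate("format_disk")` is denied. A real policy would likely treat some entries, such as `exfiltrate_data`, as never allowed rather than approvable.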

The Tension Between Capability and Safety

While pursuing AI execution capabilities, we face a dilemma:

  • Need: AI needs sufficient capability to complete complex tasks
  • Risk: The stronger the capability, the greater the potential damage
  • Challenge: How to balance capability and safety

The Core Cognitive Shift

From Information Optimization to Behavior Control

Context Engineering focuses on optimizing information flow, while Harness Engineering focuses on controlling behavior flow:

| Optimization Direction | Context Engineering | Harness Engineering |
|---|---|---|
| Focus point | Information quality | Behavioral constraints |
| Method | Provide correct information | Design safety mechanisms |
| Goal | Knowledge accuracy | Execution safety |
| Evaluation | Information relevance | Behavioral reliability |

Human Steer vs. Agent Execute

Core Philosophy: Human Steer, Agent Execute

  • Human Steer: Humans set objectives, define constraints, monitor processes
  • Agent Execute: AI executes autonomously within constraint frameworks
python
# Human Steer, Agent Execute example
class ControlledAgent:
    def __init__(self, human_constraints):
        self.constraints = human_constraints
        self.execution_context = None
    
    def execute_task(self, task):
        # 1. Human-defined task and constraints
        if not self.validate_task_constraints(task):
            return "Task violates constraints"
        
        # 2. AI execution within constraint framework
        self.execution_context = self.setup_execution_context(task)
        
        # 3. Continuous monitoring during execution
        result = self.monitor_execution(task)
        
        # 4. Result verification and reporting
        return self.validate_and_report(result)
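
The class above leaves its helper methods undefined. A runnable variant of the same four steps might look like the sketch below, in which the human supplies the allowed actions and a verification function, and the agent supplies the work. `Constraints`, `SteeredAgent`, and the `run` and `verify` callables are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Constraints:
    allowed_actions: set[str]  # set by a human, not by the agent


@dataclass
class SteeredAgent:
    constraints: Constraints
    log: list[str] = field(default_factory=list)

    def execute_task(
        self, action: str, run: Callable[[], str], verify: Callable[[str], bool]
    ) -> str:
        # 1. Check the task against the human-defined constraints
        if action not in self.constraints.allowed_actions:
            self.log.append(f"rejected: {action}")
            return "Task violates constraints"

        # 2. Execute within the constraint framework
        result = run()

        # 3. Record what happened so a human can monitor the process
        self.log.append(f"executed: {action}")

        # 4. Verify the result before reporting it
        if not verify(result):
            return "Result failed verification"
        return result
```

For instance, an agent created with `Constraints({"list_directory"})` will run a `list_directory` task and report its verified result, while a `delete_directory` task is rejected and logged without ever calling `run`.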

Evolution of Engineering Paradigms

Prompt Engineering   →   Context Engineering   →   Harness Engineering
Optimize language        Optimize info             Control behavior
"How to say it"          "What to know"            "How to act"

This evolution reflects the transformation of AI systems from language models to action systems.

Preview: Core Solutions of Harness Engineering

Context Engineering exposes the execution problem but does not provide a complete solution. The next article covers Harness Engineering in detail, a discipline designed specifically for the safety and reliability of AI execution.

The core of Harness Engineering includes:

  1. Tool Injection System: Safe tool calling mechanisms
  2. State Management System: Task execution state tracking
  3. Verification Loop System: Execution result verification and correction
  4. Constraint Layering System: Multi-level execution constraints
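
As a small preview of the third item, a verification loop can be as simple as "run, check, retry, and keep a record". The sketch below is one possible shape, not the design the next article describes. `run_with_verification`, its parameters, and the log format are assumptions made for illustration.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_verification(
    action: Callable[[int], T],
    verify: Callable[[T], bool],
    max_attempts: int = 3,
) -> tuple[T, list[str]]:
    """Run `action(attempt)`, verify its result, and retry on failure.

    Returns the verified result plus a state log of every attempt."""
    state_log: list[str] = []
    for attempt in range(1, max_attempts + 1):
        result = action(attempt)
        if verify(result):
            state_log.append(f"attempt {attempt}: verified")
            return result, state_log
        state_log.append(f"attempt {attempt}: failed verification")
    raise RuntimeError(
        f"No verified result after {max_attempts} attempts: {state_log}"
    )
```

The state log doubles as a minimal form of state tracking: after a failure, a human can see how many attempts were made and where verification stopped passing.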

These systems together form the “safety reins” for AI Agents, evolving AI from “can answer” to “can execute safely.”

Harness Engineering isn’t about limiting AI capabilities but about ensuring they are exercised responsibly and controllably. This marks a new stage in AI engineering.