CodeAct: Executable Code Actions Elicit Better LLM Agents

Review date: 2026-05-25
Review author: Zhongzhu Zhou
Paper reviewed: Executable Code Actions Elicit Better LLM Agents
Paper authors: Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji
arXiv: 2402.01030
Status/Venue: ICML 2024 (Proceedings of the 41st International Conference on Machine Learning, PMLR 235)

Short Answer

CodeAct is a simple but high-impact insight: instead of asking an LLM agent to emit JSON objects or free-text commands to call tools one at a time, just let it write and execute Python code. Code is a universal action language: it already supports loops, conditionals, variables, imported libraries, and built-in error messages. By treating each agent turn as a Python code execution rather than a single JSON tool invocation, CodeAct collapses a fragmented action space into a single, expressive, and easily debuggable medium. Across 17 LLMs tested on API-Bank and the authors’ new M³ToolEval benchmark, CodeAct achieves up to 20% higher success rates than JSON or text-based alternatives. More importantly, CodeAct lets LLMs self-debug and improve across multiple turns using interpreter feedback, which single-call JSON actions do not naturally support. The authors also build CodeActAgent, an open-source 7B model fine-tuned on 7,139 trajectories, that matches the agent-task performance of models five to ten times its size.

Prerequisites: What You Need to Know First

Before diving into the technical details, I want to make sure the core concepts are solid. CodeAct operates at the intersection of language models, tool use, and software execution environments. Readers coming from a pure NLP background will want to understand the agent loop and why action formats matter; readers from a systems background will want to understand why existing LLMs struggle with multi-step tasks.

1. The Standard LLM Agent Loop

An LLM agent is a language model that interacts with an external environment over multiple turns. The standard loop looks like this:

sequenceDiagram
    participant U as User
    participant A as LLM Agent
    participant E as Environment (Python Interpreter / APIs / Web)

    U->>A: Instruction (natural language)
    loop Multi-turn interaction
        A->>A: Think / Plan (chain-of-thought, optional)
        A->>E: Action (tool call, code, JSON, text...)
        E->>A: Observation (execution result, API response, error)
    end
    A->>U: Final Answer

The agent receives an instruction, decides on an action, sends it to the environment, observes the result, and repeats until it has enough information to answer. Simple in concept, but the format of the action turns out to matter enormously — which is exactly what CodeAct is about.

2. What “Tool Use” Means in LLMs

Modern LLMs like GPT-4 and Claude can call external tools. In practice, “tool use” typically means the model generates a structured string — often JSON — describing which function to call and with what arguments. For example, to look up a phone price in Germany, a JSON-action agent might emit:

{"tool": "lookup_phone_price", "country": "Germany"}

The framework then parses this JSON, calls the actual Python function, and returns the result as text. The model never sees Python code at execution time — it just produces a string that gets interpreted.

The problem: This JSON is a very constrained language. It can describe exactly one function call at a time. If you need to call five tools and combine their results, you need five separate turns, each with a new JSON object. There’s no way to express “call tool A and tool B in a loop and average the results” in a single JSON action.
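
To make the one-call-per-turn constraint concrete, here is a minimal sketch of how a framework might dispatch a JSON action. The tool registry, the `lookup_phone_price` body, and the prices are hypothetical placeholders, not taken from the paper.

```python
import json

# Hypothetical tool; the price table is made up for illustration.
def lookup_phone_price(country: str) -> float:
    prices = {"Germany": 699.0, "Japan": 98000.0}
    return prices[country]

# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"lookup_phone_price": lookup_phone_price}

def dispatch_json_action(action_text: str) -> str:
    """Parse one JSON action, call the named tool, return the observation as text."""
    action = json.loads(action_text)
    name = action.pop("tool")          # remaining keys are the tool's arguments
    result = TOOLS[name](**action)
    return str(result)

observation = dispatch_json_action('{"tool": "lookup_phone_price", "country": "Germany"}')
print(observation)
```

Each call to `dispatch_json_action` handles exactly one tool invocation, so combining several results has to happen across turns, in the model's context, rather than inside the action itself.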

3. The Action Format Landscape

Before CodeAct, three main action formats were used in LLM agents:

FormatExampleStrengthsWeaknesses
Free TextAction: lookup_rates, country: GermanyHuman-readableHard to parse reliably; no structure
JSON{"tool": "lookup_rates", "country": "Germany"}Parseable; schema-validatedOne tool per turn; no loops; no data flow
Code (CodeAct)rates = lookup_rates("Germany")Full expressiveness; libraries; self-debugModel must know Python

The key insight of CodeAct is that code is not just a programming construct — it is a general-purpose action language that already supports everything JSON cannot: loops, conditionals, variable binding, error handling, and library imports.

4. Why Python Specifically?

The authors chose Python for four reasons:

  1. Pre-training data richness. Python is the #1 language on the TIOBE index (2024) and dominates ML/data science GitHub repositories. LLMs have seen vastly more Python than any structured JSON schema.
  2. Existing ecosystem. import pandas, import numpy, import sklearn — the entire scientific Python stack is available. No need to pre-define tools.
  3. Built-in feedback. Python’s TypeError, ValueError, and traceback output are already in natural language. Agents can read their own error messages and self-debug.
  4. Control and data flow. for, if, while, list comprehensions, generator expressions — everything an agent might need to compose tool outputs.

5. The Multi-Turn Self-Debug Mechanism

One under-appreciated aspect is that CodeAct enables a form of automated curriculum learning through error feedback. When a Python snippet fails, the interpreter returns a traceback. This traceback is fed back as the next “observation.” A sufficiently capable LLM reads the traceback and produces a corrected code block. No human intervention needed. This is qualitatively different from JSON agents, where a failed API call just returns {"error": "Invalid argument"} — there is no path trace, no line number, no type information.

flowchart LR
    A["Agent writes Python code\n(Turn N)"] --> B["Python interpreter executes"]
    B -- "Success" --> C["Observation: result\nAgent answers / continues"]
    B -- "Failure" --> D["Observation: traceback\n(TypeError, NameError, ...)"]
    D --> E["Agent reads traceback\nand self-debugs\n(Turn N+1)"]
    E --> A

This loop is the secret behind CodeAct’s multi-turn improvement — it closes the feedback cycle without human-curated error mappings.
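
The feedback path can be sketched in a few lines. This is an illustrative executor built on Python's built-in `exec`, not the paper's implementation; the function name `run_action` and the example snippets are assumptions.

```python
import contextlib
import io
import traceback

def run_action(code: str, namespace: dict) -> str:
    """Execute one code action; return its stdout, plus the traceback if it fails."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception:
        # The traceback text becomes the next observation for the agent.
        return buffer.getvalue() + traceback.format_exc()
    return buffer.getvalue()

ns = {}
# Turn N: a buggy action (multiplying a string by a float).
first = run_action("price = '699'\nprint(price * 1.19)", ns)
# Turn N+1: a corrected action, as an agent might write after reading the error.
second = run_action("print(round(float(price) * 1.19, 2))", ns)
print(first)
print(second)
```

The first observation contains the exception type, the offending line number, and the message, which is the information the agent needs to produce the second, corrected action.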

The CodeAct Framework: Design and Formalization

1. A Unified Action Space

CodeAct’s central claim is: all agent–environment interactions should be expressed as a single block of executable Python code. The formal setup is:

  • Agent $\mathcal{A}$: the LLM
  • User $\mathcal{U}$: the human providing instructions
  • Environment $\mathcal{E}$: Python interpreter + APIs + external tools (exposed as Python functions)

At each turn $t$, the agent receives an observation $o_t$ (either from the user or from code execution) and produces an action $a_t$, which is a Python code block enclosed in <execute>...</execute> tags. The environment executes $a_t$ and returns $o_{t+1}$ (the stdout, stderr, or result of the code).

Formally, a trajectory is:

$$\tau = (o_1, a_1, o_2, a_2, \ldots, o_T, a_T, o_{T+1}) \tag{1}$$

where $o_1$ is the initial user instruction, each $a_t \in \mathcal{C}$ (the space of valid Python programs), and $o_{t+1} = \mathcal{E}(a_t)$ is the execution output.

The total trajectory length $T$ is bounded by a maximum turn limit (set to 10 in the M³ToolEval experiments).
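
The trajectory definition can be sketched as a short rollout loop. This is a minimal illustration under stated assumptions: the LLM is replaced by a scripted policy, and `environment`, `rollout`, and `scripted_policy` are hypothetical names.

```python
import contextlib
import io
import traceback
from typing import Callable, List, Optional

def environment(code: str, state: dict) -> str:
    """E(a_t): run the code action and return its printed output or traceback."""
    out = io.StringIO()
    try:
        with contextlib.redirect_stdout(out):
            exec(code, state)
    except Exception:
        return out.getvalue() + traceback.format_exc()
    return out.getvalue()

def rollout(instruction: str,
            policy: Callable[[List[str]], Optional[str]],
            max_turns: int = 10) -> List[str]:
    """Build tau = (o_1, a_1, o_2, ..., a_T, o_{T+1}) with T <= max_turns."""
    state: dict = {}
    trajectory = [instruction]                          # o_1
    for _ in range(max_turns):
        action = policy(trajectory)                     # a_t
        if action is None:                              # policy decides it is done
            break
        trajectory.append(action)
        trajectory.append(environment(action, state))   # o_{t+1}
    return trajectory

# Scripted stand-in for the LLM: two fixed actions, then stop.
script = ["x = 6 * 7", "print(x)"]

def scripted_policy(history: List[str]) -> Optional[str]:
    turn = len(history) // 2
    return script[turn] if turn < len(script) else None

tau = rollout("Compute 6 times 7.", scripted_policy)
print(tau)
```

The resulting list alternates observations and actions exactly as in the trajectory equation, and the loop bound plays the role of the maximum turn limit.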

2. Advantages Over JSON: A Formal Comparison

Let me make the four claimed advantages precise.

Advantage 1: Control Flow. In JSON, expressing “compute the sum of tool outputs for a list of inputs” requires $N$ separate JSON turns. In CodeAct:

# The tool functions below are provided by the environment.
# tax_rate is assumed to have been defined earlier in the session.
countries = ["USA", "Japan", "Germany", "India"]
final_prices = {}
for country in countries:
    local_price = lookup_phone_price("CodeAct 1", country)
    rate = lookup_rates(country)
    converted = convert_and_tax(local_price, rate, tax_rate)
    shipping = estimate_shipping_cost(country)
    final_prices[country] = estimate_final_price(converted, shipping)
# Pick the country with the lowest final price.
best = min(final_prices, key=final_prices.get)
print(best, final_prices[best])

This is one CodeAct turn. A JSON agent needs one turn per tool call, which for this snippet means five calls for each of four countries. The loop saves both wall-clock time and LLM API calls.

Advantage 2: Data Flow. JSON actions are stateless: each action is a fresh call, and values from prior calls cannot be stored as intermediate variables. In Python, rate computed in turn $N$ can be referenced in turn $N+1$ without re-querying. The Python interpreter maintains state across <execute> blocks in the same session (similar to a Jupyter notebook).
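
A minimal sketch of the difference, assuming the session is modeled as a single namespace dictionary passed to `exec` (the variable names and the value 0.92 are illustrative):

```python
session = {}                            # one namespace for the whole session
exec("rate = 0.92", session)            # turn N defines `rate`
exec("total = 100 * rate", session)     # turn N+1 reuses it without re-querying

stateless_error = None
try:
    exec("total = 100 * rate", {})      # fresh namespace: `rate` no longer exists
except NameError as err:
    stateless_error = err

print(session["total"], repr(stateless_error))
```

Reusing one namespace is what gives the notebook-like behavior; creating a new one per action reproduces the stateless situation of independent tool calls.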

Advantage 3 — Existing Packages. Compare the effort to add a new tool:

| Action Type | Cost to Add New Tool |
| --- | --- |
| JSON / Text | Define schema, register handler, update prompt |
| CodeAct | None: import tool_library and call directly |

The entire PyPI ecosystem (~500,000 packages) becomes the tool library. The model’s pre-training exposure to package documentation means it often knows how to use packages without any explicit tool definition.

Advantage 4 — Automated Feedback. Python tracebacks are structured natural-language error messages. Compare:

# JSON agent error
{"status": "error", "code": -1}

# CodeAct Python traceback
Traceback (most recent call last):
  File "<execute>", line 3, in <module>
    converted = local_price * rate
TypeError: can't multiply sequence by non-int of type 'float'

The traceback tells the agent exactly what failed, on which line, with what types. This is actionable information for self-correction.

3. System Prompt Design

The CodeAct system prompt (Appendix E) instructs the model to:

  1. Interact with a “Python Jupyter Notebook” environment
  2. Enclose code in <execute>...</execute> tags
  3. Use !pip install [package] for missing packages
  4. Attempt fewer things per block (shorter, more targeted code)
  5. Stop and provide an answer once the execution result is sufficient

This is notably minimal. Unlike many agent frameworks that require extensive few-shot demonstrations, CodeAct’s zero-shot prompt is just 6 lines of instructions. The rest is handled by the model’s pre-trained Python knowledge.
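
On the framework side, the tag convention implies a small parsing step. The following is a hedged sketch of one way to do it with a regular expression; `extract_action` and the sample reply are illustrative, not the authors' code.

```python
import re
from typing import Optional

# Non-greedy match so only the first <execute> block is captured.
EXECUTE_RE = re.compile(r"<execute>(.*?)</execute>", re.DOTALL)

def extract_action(model_output: str) -> Optional[str]:
    """Return the code in the first <execute> block, or None for a plain-text turn."""
    match = EXECUTE_RE.search(model_output)
    return match.group(1).strip() if match else None

reply = "Let me check the data first.\n<execute>\nprint(1 + 1)\n</execute>"
print(extract_action(reply))
print(extract_action("The answer is 2."))
```

A turn without an `<execute>` block yields `None`, which a loop can treat as the agent's final answer.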

graph TD
    subgraph System["System Prompt (zero-shot, 6 lines)"]
        SP1["You interact with a Python Jupyter environment"]
        SP2["Code in &lt;execute&gt; tags"]
        SP3["Install packages with !pip install"]
        SP4["Attempt fewer things per block"]
        SP5["Stop and answer once result obtained"]
    end
    subgraph Turn["Turn Structure"]
        OBS["Observation (user or env)"]
        THINK["Optional think / CoT"]
        CODE["&lt;execute&gt; Python code &lt;/execute&gt;"]
        RESULT["Environment returns stdout/stderr"]
    end
    System --> Turn

Research Question 1: Does Code Familiarity Help for Atomic Tool Calls?

Experimental Setup

The first experiment tests the simplest possible scenario: a single tool call. The task is to call one API function correctly with the right arguments. This ablates the control/data-flow advantage of code, leaving only the question: does using code syntax help LLMs call tools compared to JSON or text?

The authors repurpose API-Bank (Li et al., 2023), a benchmark of 113 level-1 (single tool call) instances covering calendar, finance, weather, and more APIs.

Evaluation metric: Correctness — whether the model-generated API call’s output exactly matches the ground-truth output.
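
As a small illustration of an exact-match metric of this kind (the function name and the toy outputs are assumptions, not API-Bank data):

```python
from typing import Sequence

def correctness(predicted_outputs: Sequence[str], reference_outputs: Sequence[str]) -> float:
    """Percentage of instances whose tool-call output exactly equals the reference."""
    matches = sum(p == r for p, r in zip(predicted_outputs, reference_outputs))
    return 100.0 * matches / len(reference_outputs)

# Toy example: three of four outputs match exactly.
score = correctness(["a", "b", "c", "d"], ["a", "x", "c", "d"])
print(score)
```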

Models tested: 8 open-source (CodeLlama-7B/13B/34B, Llama-2-7B-chat/13B/70B, Mistral-7B-Instruct, Lemur-70B) and 9 closed-source (claude-2, claude-instant-1, gemini-pro, gpt-3.5-turbo-0613/1106, gpt-4-0613, gpt-4-1106-preview, text-davinci-002/003), 17 in total.

Results and Analysis (Table 2)

$$\text{Correctness}_{\text{CodeAct}} \geq \text{Correctness}_{\text{JSON}} \text{ for 8/17 models} \tag{2}$$

For most LLMs, CodeAct achieves comparable or better performance than JSON and text, even on single-tool-call tasks where CodeAct’s control-flow advantage cannot contribute. Key observations:

  1. Open-source LLMs favor code over JSON. For CodeLlama-13B (code 38.1 vs. JSON 12.0 vs. text 17.8), Llama-2-70B (35.6 vs. 14.3 vs. 36.6), and Lemur-70B (58.6 vs. 46.6 vs. 56.1), the code format outperforms JSON, by a wide margin for the first two. The hypothesis: open-source models have heavy code pre-training data but no JSON tool-call fine-tuning, so code format is “native” while JSON tool calling is foreign.

  2. Closed-source LLMs show smaller gaps. GPT-4, claude-2, and gpt-3.5-turbo likely already have JSON tool-calling fine-tuning in their training, so the margin is smaller and JSON sometimes leads (for gpt-4-0613, the chart below shows 82.0 for JSON vs. 75.4 for code). Still, CodeAct remains competitive.

  3. Text format sometimes wins. Llama-2-70B-chat performs best on text (36.6 vs. 35.6 for code). This suggests text can be as natural a format as code for instruction-tuned chat models, while JSON is consistently the weakest for open-source models.

xychart-beta
    title "API-Bank Correctness (%) by Action Format"
    x-axis [CodeLlama-7B, CodeLlama-13B, Llama-2-70B-chat, Lemur-70B, gpt-4-0613, gpt-4-1106]
    y-axis "Correctness (%)" 0 --> 90
    bar [12.5, 38.1, 35.6, 58.6, 75.4, 76.8]
    bar [12.0, 12.0, 14.3, 46.6, 82.0, 82.7]
    bar [17.0, 14.0, 36.6, 56.1, 74.4, 75.4]

The takeaway for RQ1: Code is a natural tool-calling format for LLMs with substantial code pre-training. JSON is not an obvious choice for general-purpose LLMs that haven’t been specifically fine-tuned for it.

Research Question 2: Does Control Flow Help for Complex Tasks?

The M³ToolEval Benchmark

The second experiment tackles what JSON fundamentally cannot do: tasks requiring multiple tool calls, across multiple turns, with composition of results. The authors curate M³ToolEval — Multiple Models Multi-turn Tool Evaluation — consisting of 82 human-annotated instances spanning:

  • Web browsing (7 tools: click, scroll, go back, etc.)
  • Finance (price lookup, currency conversion, tax calculation, shipping)
  • Travel itinerary (flight search, hotel booking, weather)
  • Science information processing (data retrieval, calculation, summarization)

Each instance requires calling multiple different tools in a single task. Table A.7 in the paper compares M³ToolEval against prior benchmarks:

| Benchmark | Multi-turn | Multi-tool | No external API dependency |
| --- | --- | --- | --- |
| ToolBench | | | |
| APIBench | | | |
| API-Bank | | | |
| M³ToolEval | ✓ | ✓ | ✓ |

This is the first benchmark that simultaneously requires multi-turn interaction, multiple tools, and uses local APIs (no external service dependencies that could cause spurious failures).

Experimental Setup

  • Maximum 10 interaction turns
  • Metric: success rate (answer matches ground truth) + average turns (lower is better)
  • Same 17 LLMs as RQ1
  • Zero-shot setting (no demonstrations in prompt)

Results (Table 3)

This is where CodeAct’s advantage becomes decisive:

$$\text{SuccessRate}_{\text{CodeAct}} - \text{SuccessRate}_{\text{JSON}} \approx 20 \text{ points absolute (best model)} \tag{3}$$

For gpt-4-1106-preview: CodeAct = 74.4%, JSON = 53.7%, Text = 53.7%. The 20.7-point absolute improvement is the headline result of the paper.

For average turns: CodeAct = 5.5 turns vs. JSON = 7.6 vs. Text = 7.7. CodeAct achieves more with 2.1 fewer turns than JSON on average.
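
The arithmetic behind these figures can be checked directly. The snippet uses only the numbers quoted above; the variable names are illustrative.

```python
codeact_success, json_success = 74.4, 53.7   # gpt-4-1106-preview success rates (%)
codeact_turns, json_turns = 5.5, 7.6         # average interaction turns

gain_points = round(codeact_success - json_success, 1)
turns_saved = round(json_turns - codeact_turns, 1)
relative_turn_saving = round(100 * (json_turns - codeact_turns) / json_turns, 1)

print(gain_points, turns_saved, relative_turn_saving)   # 20.7 2.1 27.6
```

The gain is an absolute difference in percentage points, while the turn saving relative to the JSON baseline works out to a little under 28%.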

xychart-beta
    title "M³ToolEval Success Rate (%) — CodeAct vs JSON vs Text"
    x-axis [claude-2, gpt-3.5-0613, gpt-4-0613, gpt-4-1106, text-davinci-003, gemini-pro]
    y-axis "Success Rate (%)" 0 --> 80
    bar [54.9, 51.2, 67.1, 74.4, 20.7, 22.0]
    bar [39.0, 26.8, 56.1, 53.7, 18.3, 19.5]
    bar [29.3, 20.7, 45.1, 53.7, 7.3, 11.0]

Why does CodeAct use fewer turns? Because it can do in one turn what JSON needs multiple turns for. In the phone-price example from Figure 1, CodeAct writes a loop that calls lookup_rates, lookup_phone_price, convert_and_tax, and estimate_shipping_cost for all four countries in one <execute> block. JSON needs at least 4 × 4 = 16 separate turns.

The gap pattern is instructive: the improvement is much larger for strong closed-source models such as gpt-4 and gpt-3.5 than for open-source ones. This is because CodeAct requires the model to write correct, runnable Python, which is a higher bar than producing a syntactically valid JSON string. Weaker models produce broken Python more often, partially erasing the conceptual advantage.

Why Doesn’t JSON Close the Gap with Better Prompting?

One might ask: can’t we just give the JSON agent more turns to compensate? Yes, up to the turn limit, but the turn limit itself matters. Every extra turn costs LLM API time, money, and latency. More fundamentally, the data flow problem cannot be solved by more turns: if the JSON agent needs the output of turn $N$ as input to turn $N+3$, it must store that value somewhere, and storing/retrieving intermediate values in JSON is an awkward manual process (asking the LLM to “remember” a number). Python variables handle this natively.

Research Question 3: Multi-Turn Interactions and Existing Software

CodeActAgent in Action

Figure 3 in the paper shows a complete example of CodeActAgent (Mistral-7B fine-tuned) handling a data analysis task:

  1. User uploads a dataset and asks to “look at it, check for missing values, train a regression model.”
  2. Agent emits <execute>import pandas as pd; df = pd.read_csv(url); df.info()</execute> → sees data types and missing values.
  3. Finds a type error (a column is str instead of float due to a $ character). Fixes it with df[col] = df[col].str.replace('$', '').astype(float).
  4. Trains sklearn LinearRegression with train-test split → prints MSE and R².
  5. User asks for coefficient visualization → agent writes Matplotlib code → gets an error about tick_params argument → reads the traceback → corrects the argument → produces the plot.

sequenceDiagram
    participant U as User
    participant A as CodeActAgent
    participant E as Python Interpreter

    U->>A: "Download dataset, check missing values, train regression model"
    A->>E: import pandas as pd; df = pd.read_csv(url); df.info(); df.isnull().sum()
    E->>A: [shape, dtypes, null counts — 'car name' has '$']
    A->>E: df['price'] = df['price'].str.replace('$','').astype(float); ...split/train...
    E->>A: MSE=10.79, R²=0.795
    U->>A: "Visualize regression coefficients, rotate x-axis 45°"
    A->>E: plt.xticks(feature_names, rotation=45); plt.show()
    E->>A: TypeError: tick_params() got unexpected keyword argument 'ticks'
    A->>E: plt.xticks(feature_names, rotation=45, ha='right', fontsize=12); plt.show()
    E->>A: [Figure 640×480]
    A->>U: Visualization complete. Coefficients plotted.

This multi-turn, self-debugging interaction is what makes CodeAct powerful in practice. The agent never needed a pre-defined “fix_column_type” tool — it invented the fix on the fly using Python string methods it learned during pre-training.

The Self-Debug Rate

In the CodeActInstruct dataset, the authors specifically curate trajectories where the model encounters errors and recovers:

“We selectively preserve those trajectories wherein the model initially encounters errors but subsequently rectifies these inaccuracies in later interactions.”

This curation is critical. If you train on trajectories that always succeed on the first try, you don’t teach the model how to recover from errors. CodeActInstruct explicitly includes “stumble-then-recover” trajectories to build the self-debug capability.

CodeActInstruct: Building the Training Dataset

Design Philosophy

CodeActInstruct is a purpose-built instruction-tuning dataset for CodeAct. It covers five domains that together stress-test different aspects of the agent-environment interaction:

| Domain | Capability | Source Dataset | # Instances |
| --- | --- | --- | --- |
| Information Seeking | Web search via Wikipedia API | HotpotQA | 1,664 |
| Software Package (Tool) | Math with sympy | MATH | 1,732 |
| Software Package (Tool) | Self-debug Python code | APPS (code gen) | 647 |
| External Memory | Table queries (SQLite + Pandas) | WikiTableQuestion | 1,065 |
| Robot Planning | Embodied task via ALFWorld | ALFWorld | 2,031 |
| Total | | | 7,139 |

Trajectory Generation Pipeline

Generating CodeActInstruct trajectories is a three-step pipeline:

flowchart LR
    DS["Source Dataset\n(HotpotQA, MATH, APPS,\nWikiTableQuestion, ALFWorld)"]
    CON["Convert to multi-turn\n(MINT framework)\nSingle-turn → interactive with\nmax 5 turns + self-debug"]
    GEN["Trajectory Generation\ngpt-3.5-turbo-0613,\nclaude-2,\ngpt-4-0613 (hard problems only)"]
    FILT["Data Selection\n1. Code-as-Actions filter\n2. Self-Improving filter\n3. Instruction-Following filter"]
    FINAL["Final: 7,139 trajectories\n(6,728 from gpt-3.5/claude + 411 from gpt-4)"]

    DS --> CON --> GEN --> FILT --> FINAL

Step 1 — Convert to multi-turn: The MINT framework (Wang et al., 2023e) converts single-turn problems into multi-turn interactive settings. For APPS (code generation), this means giving the model up to 5 attempts to pass test cases rather than requiring a correct one-shot solution.

Step 2 — Generate trajectories: The authors use gpt-3.5-turbo-0613 and claude-2 as teacher models for most problems (cheaper), and gpt-4-0613 for problems that gpt-3.5 cannot solve.

Step 3 — Data selection heuristics (critical):

  1. Code-as-Actions: Exclude trajectories where the model doesn’t follow the code format (either calls APIs incorrectly or produces non-executable actions).

  2. Self-Improving: Keep trajectories where the model initially fails but later corrects. Exclude trajectories that fail throughout (useless for learning recovery) and exclude trajectories that succeed without ever encountering an error (useful but don’t teach debugging).

  3. Instruction-Following: Exclude trajectories with an odd number of turns (indicates the model didn’t follow the turn-taking format correctly).

After all three filters: 6,728 from gpt-3.5 and claude, 411 from gpt-4.
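
The three heuristics can be sketched as a single predicate over a trajectory record. The `Step` and `Trajectory` structures and their field names are hypothetical; the paper's actual data format and filtering code may differ.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    code_ok: bool    # action was a well-formed, executable code block
    errored: bool    # execution of this action raised an error

@dataclass
class Trajectory:
    steps: List[Step]
    n_turns: int     # total number of turns in the conversation
    solved: bool     # final answer was correct

def keep(traj: Trajectory) -> bool:
    """Apply the three selection heuristics as described in the text above."""
    code_as_actions = all(s.code_ok for s in traj.steps)
    self_improving = traj.solved and any(s.errored for s in traj.steps)
    instruction_following = traj.n_turns % 2 == 0
    return code_as_actions and self_improving and instruction_following

stumble_then_recover = Trajectory([Step(True, True), Step(True, False)], n_turns=4, solved=True)
first_try_success = Trajectory([Step(True, False)], n_turns=2, solved=True)
never_solved = Trajectory([Step(True, True), Step(True, True)], n_turns=4, solved=False)

print(keep(stumble_then_recover), keep(first_try_success), keep(never_solved))
```

Only the stumble-then-recover trajectory passes all three checks in this toy example.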

Comparison with Prior Datasets

CodeActInstruct is about 3.5× the size of FireAct and 3.8× the size of AgentInstruct in number of trajectories, and covers 5 domains vs. FireAct’s 2 (QA + search). Its key differentiator is the explicit multi-turn self-improvement data.

$$|\text{CodeActInstruct}| = 7{,}139 \quad \text{vs} \quad |\text{AgentInstruct}| = 1{,}866 \quad \text{vs} \quad |\text{FireAct}| = 2{,}063 \tag{4}$$
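
The size ratios follow directly from the trajectory counts in Equation (4); a quick check, with illustrative variable names:

```python
codeact_instruct, agent_instruct, fireact = 7139, 1866, 2063

ratio_vs_fireact = round(codeact_instruct / fireact, 2)
ratio_vs_agentinstruct = round(codeact_instruct / agent_instruct, 2)

print(ratio_vs_fireact, ratio_vs_agentinstruct)   # 3.46 3.83
```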

CodeActAgent: Fine-Tuning Open-Source LLMs

Training Setup

The authors fine-tune two open-source backbones:

  • LLaMA-2 7B (Touvron et al., 2023)
  • Mistral-7B (Jiang et al., 2023)

Training is supervised fine-tuning (SFT) on a mixture of CodeActInstruct (7,139 agent trajectories) and general conversation data (69,230 examples from OpenOrca, ShareGPT, and CapyBara). The mixture ratio is important: agent data alone would hurt general task performance.

Training infrastructure:

  • 4× A100-40GB SXM nodes
  • Megatron-LLM fork (Cano et al., 2023)
  • Tensor Parallel degree: 4
  • Learning rate: $1 \times 10^{-5}$ with 50 warmup steps, cosine decay to $1 \times 10^{-6}$
  • 5 epochs, batch size 32
  • Sequence length: 4,096 (LLaMA-2), 16,384 (Mistral)
  • Use 3rd epoch checkpoint (best empirically)
  • Loss computed only on assistant responses (not user/system turns)
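
The learning-rate schedule in the list above can be sketched as follows. Two things are assumptions rather than stated facts: the warmup is taken to be linear, and `total_steps` is a hypothetical parameter, since the review does not give the total number of optimizer steps.

```python
import math

def learning_rate(step: int, total_steps: int, peak: float = 1e-5,
                  floor: float = 1e-6, warmup_steps: int = 50) -> float:
    """Linear warmup to `peak` (assumed), then cosine decay down to `floor`."""
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

# Sample the schedule at the start, end of warmup, midpoint, and final step.
for s in (0, 49, 525, 1000):
    print(s, learning_rate(s, total_steps=1000))
```

With these defaults the rate reaches its peak at the last warmup step and ends at the floor value on the final step.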

Evaluation Protocol

CodeActAgent is evaluated on:

| Task Type | Benchmark | In-domain (ID) or Out-of-domain (OD) |
| --- | --- | --- |
| Code as Action (agent) | MINT (subset) | ID for agent tasks |
| Text as Action (agent) | Miniwob++, ScienceWorld | OD (never seen during training) |
| General LLM | MMLU, HumanEval, GSM8K, MTBench | OD |

The split between ID and OD is crucial: CodeActAgent (Mistral) achieves 57.4 MINT (ID) and 32.4 MINT (OD), while on M³ToolEval (OD) it achieves 12.2 — already competitive with some closed-source models.

Key Results (Table 5)

xychart-beta
    title "CodeActAgent Mistral-7B vs. Baselines (MINT benchmark)"
    x-axis ["Mistral Base", "Mistral Instruct", "AgentLM-7B", "FireAct-7B", "CodeActAgent (Mistral)", "gpt-3.5-0613", "gpt-4-0613"]
    y-axis "MINT Score (ID)" 0 --> 80
    bar [0, 18.8, 0, 0, 57.4, 33.9, 68.6]

CodeActAgent (Mistral, 7B) achieves 57.4 on MINT ID, roughly 24 points higher than gpt-3.5-turbo-0613 (33.9) and within striking distance of gpt-4-0613 (68.6). It also beats AgentLM-7B and FireAct-7B, which use similar-size backbones but different data.

Three critical findings:

  1. Generalizes to text actions (OD). CodeActAgent (LLaMA-2) achieves 25.5 on MiniWob++ text actions and 17.6 on ScienceWorld — comparable to AgentLM-7B (28.9 / 13.7) despite AgentLM being explicitly tuned for text actions. This suggests CodeAct training generalizes to other action modalities.

  2. Maintains general LLM capability. On MMLU (59.1), HumanEval (34.7), GSM8K (58.0), CodeActAgent (Mistral) matches or exceeds Mistral Instruct, confirming that the mixed training does not degrade general performance (except a slight MMLU drop for LLaMA-2 variant).

  3. LLaMA-2 variant unexpectedly doesn’t improve. CodeActAgent (LLaMA-2) shows near-zero improvement on most agent tasks. The authors attribute this to LLaMA-2’s generally weak instruction-following capability — fine-tuning on agent data can’t compensate for the base model’s weaknesses on complex multi-step planning.

Ablation Study

What Components Actually Matter?

Table A.8 in the appendix performs a systematic ablation of CodeActAgent (Mistral) by removing training data components:

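| Model | MINT (ID) | MINT (OD) | Miniwob++ | SciWorld | MMLU | HumanEval | GSM8K | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CodeActAgent (Mistral) | 57.4 | 32.4 | 46.2 | 15.9 | 59.1 | 34.7 | 58.0 | 46.8 |
| w/o CodeAct data | 32.9 | 23.0 | 47.8 | 17.0 | 59.9 | 33.2 | 59.5 | 46.2 |
| w/o general conversations | 50.5 | 13.9 | 0.0 | 11.0 | 52.4 | 27.9 | 26.8 | 22.6 |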

Interpretation:

  • Removing CodeAct training data causes a ~25 point drop on MINT ID and ~9 point drop on OD — confirming agent-specific data is essential.
  • Removing general conversation data causes catastrophic failure on Miniwob++ (46.2 → 0.0) and GSM8K (58.0 → 26.8) — showing that CodeAct training alone causes the model to “forget” general conversational and reasoning abilities. Both data types are necessary.

This is an important negative result: you cannot get a good general-purpose agent model by training exclusively on agent trajectories. The mixture is the trick.

$$\mathcal{L}(\theta) = \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}_{\text{CodeAct}} \cup \mathcal{D}_{\text{general}}} \left[ - \sum_{t} \log p_\theta(y_t \mid \mathbf{x}, y_{<t}) \right] \tag{5}$$

where the loss is only computed over assistant tokens $\mathbf{y}$, not user/system tokens.
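
For a single sequence, the masked objective reduces to summing negative log-probabilities over assistant positions only. A minimal sketch in plain Python, where `token_probs` (the model's probability for each ground-truth token) and the mask are made-up inputs:

```python
import math
from typing import Sequence

def masked_nll(token_probs: Sequence[float], is_assistant: Sequence[bool]) -> float:
    """Sum of -log p(y_t | x, y_<t) over assistant tokens; other tokens are skipped."""
    return -sum(math.log(p) for p, keep in zip(token_probs, is_assistant) if keep)

probs = [0.9, 0.5, 0.25, 0.8]           # hypothetical per-token probabilities
mask = [False, True, True, False]       # True marks assistant tokens
loss = masked_nll(probs, mask)          # -(log 0.5 + log 0.25) = log 8

print(loss)
```

The user and system positions (mask `False`) contribute nothing, so the gradient only pushes the model toward the assistant's responses.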

vs. Voyager (Wang et al., 2023a)

Voyager also uses code for agent actions in Minecraft, but writes function definitions (entire plans) rather than imperative code blocks. This disallows dynamic atomic action adjustment: in CodeAct, each <execute> block is a small, targeted action. In Voyager, the agent writes one function that handles all possible cases, which is harder and requires more prompt engineering. CodeAct is more flexible and works zero-shot.

vs. OpenCodeInterpreter (Zheng et al., 2024)

This concurrent work focuses specifically on competitive programming. It is useful for code generation and debugging tasks but is not designed as a general agent framework, and it offers no CodeActInstruct-equivalent training data for diverse domains.

vs. TaskWeaver (Qiao et al., 2023)

The closest conceptually — also uses code in the action space. But TaskWeaver relies on qualitative closed-source model demonstrations and doesn’t provide rigorous quantitative benchmarks, open-source models, or training data. CodeAct provides all three.

Limitations and Boundary Conditions

1. Python Knowledge Required

CodeAct’s advantage is proportional to the model’s Python capability. For models with minimal code pre-training (e.g., some smaller open-source instruction-tuned chat models), CodeAct offers no advantage and may hurt performance due to syntax errors. JSON may be a better choice for models below a certain Python competence threshold.

2. Security and Sandboxing

CodeAct directly executes arbitrary Python code. This is a fundamental security concern. The paper acknowledges:

“CodeAct directly grants access for the agent to freely execute code in a sandbox environment, in the worst scenario (e.g., in Sci-Fi movies), such an agent may potentially break free of the sandbox restriction and cause harm to the world through cyber-attack.”

In production, CodeAct requires a robust sandboxed execution environment (e.g., Docker containers with no network access, resource limits, and filesystem isolation). This is an engineering burden that JSON-based systems avoid.
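
As a starting point only, the sketch below runs each action in a separate Python process with a wall-clock limit. This is process isolation, not a security sandbox: it does not restrict network or filesystem access, which is what the container-level measures above are for. The function name `run_isolated` and the timeout values are assumptions.

```python
import subprocess
import sys

def run_isolated(code: str, timeout_s: float = 5.0) -> str:
    """Run code in a child Python process and return its combined stdout and stderr."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],   # -I: isolated mode (ignores env vars, user site)
            capture_output=True,
            text=True,
            timeout=timeout_s,                    # child is killed when the limit is hit
        )
    except subprocess.TimeoutExpired:
        return "TimeoutError: execution exceeded the time limit"
    return proc.stdout + proc.stderr

ok_output = run_isolated("print(2 + 2)")
runaway_output = run_isolated("while True: pass", timeout_s=0.5)
print(ok_output, runaway_output)
```

Note that a fresh process per action discards interpreter state between turns, so a real deployment that wants notebook-style persistence would need a long-lived kernel inside the sandbox instead.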

3. Hallucination of Package APIs

LLMs sometimes confidently write import nonexistent_package or call df.some_made_up_method(). In a pure text agent, this produces a wrong answer silently. In CodeAct, this produces an ImportError or AttributeError — which is actually better (the agent can detect and correct it), but only if the model has enough capability to recover. Weaker models may loop on errors without converging.

4. Gap Between Open and Closed-Source Models

On M³ToolEval, the best open-source model (Lemur-70B) achieves 13.4% while the best closed-source (gpt-4-1106-preview) achieves 74.4%. CodeAct narrows this gap proportionally but doesn’t eliminate it. The fundamental capability gap in multi-step reasoning remains.

5. Turn-Limit Sensitivity

The paper uses a 10-turn limit for M³ToolEval. Real-world tasks may require more. Every additional turn costs LLM inference compute and introduces the risk of context-window overflow for models with short contexts (LLaMA-2’s 4,096 token limit is a practical concern for long trajectories).

Reproducibility Notes

The paper provides strong reproducibility:

  • Code and data: github.com/xingyaoww/code-act — includes all training code, model checkpoints, evaluation scripts.
  • Models: CodeActAgent (LLaMA-2 7B) and CodeActAgent (Mistral 7B) are released publicly.
  • Dataset: CodeActInstruct (7,139 trajectories, 10.5M tokens) is released.
  • Benchmark: M³ToolEval (82 instances, 4 domains) is available for evaluation.
  • Training details: Fully specified in Appendix D — optimizer, learning rate schedule, batch size, hardware, sequence length.

One important note: trajectory generation used gpt-3.5-turbo-0613 and claude-2, which may be deprecated or changed by the time you run this. Regenerating trajectories with current model versions may yield different results.

The 3rd-epoch checkpoint (not the final checkpoint) is used. This is unusual and suggests overfitting occurs in later epochs — worth noting if you reproduce training.

My Analysis and Takeaways

Why This Paper Matters

CodeAct is important not because it introduces a new architecture or training algorithm, but because it demonstrates a design principle: when you’re already forcing LLMs to generate structured text, make that structure a general-purpose programming language rather than a task-specific schema.

The implications are broader than the paper’s experiments. If code is the natural action format for LLMs, then:

  1. Tool definitions are unnecessary. You don’t need to write function schemas for every capability — just provide Python packages and let the model discover the API from documentation or trial and error.

  2. Error handling is built-in. Python’s exception system is a natural curriculum for agent self-improvement. You don’t need human-curated error mappings.

  3. Arbitrary computation is possible. Sorting, filtering, math, string manipulation — anything computable is available to the agent as a first-class citizen.

The Fundamental Insight About Action Spaces

The paper implicitly makes a deep point about the relationship between expressiveness and learnability of action spaces:

  • Too restrictive (atomic JSON calls): easy to parse, hard to compose, many turns
  • Too expressive (unconstrained text): hard to parse, flexible but fragile
  • Code: expressive, parseable (interpreter), learnable (pre-training data), composable (data flow)

Code threads the needle. This is why the concept has since been adopted in OpenHands (formerly OpenDevin), Claude Computer Use, and many other agent frameworks built since 2024.

Open Questions

  1. Optimal code granularity. How much code should each <execute> block contain? The paper recommends “fewer things at a time,” but the optimal tradeoff between block length and turn count is unexplored.

  2. Non-Python languages. Bash, JavaScript, SQL are all plausible alternatives. When would you prefer SQL to Python for a data agent? This is underexplored.

  3. Mixture ratio for training. The paper uses a specific ratio of CodeActInstruct to general conversation data. The optimal ratio (and whether it depends on task distribution at test time) is not systematically ablated.

  4. Longer context requirements. As tasks become more complex, trajectory lengths grow. The 4,096-token limit of LLaMA-2 is already a bottleneck in the paper. Future work should explore explicit trajectory compression.

Summary

| Aspect | Summary |
| --- | --- |
| Core idea | Use executable Python as the agent action space |
| Key advantages | Control flow, data flow, existing packages, automated error feedback |
| Benchmarks | API-Bank (atomic calls), M³ToolEval (multi-tool, multi-turn) |
| Main result | Up to 20% absolute improvement in success rate; 30% fewer turns |
| Dataset | CodeActInstruct: 7,139 trajectories across 5 domains |
| Model | CodeActAgent (Mistral-7B): matches closed-source on agent benchmarks |
| Key limitation | Requires Python competence; security sandbox needed |
| Venue | ICML 2024 |
| Code | github.com/xingyaoww/code-act |

CodeAct is a conceptually clean, well-executed paper that changes how the field thinks about agent action spaces. Its influence is visible in every major agent framework built since its release — which is the clearest indicator of a paper that got something fundamentally right.

Deep Dive: The CodeAct Agent Loop Algorithm

Let me formalize the CodeAct agent loop more precisely to make the mechanics explicit. I’ll write it as pseudocode and then explain each step.

Pseudocode: CodeAct Multi-Turn Agent Loop

Algorithm 1: CodeAct Agent Loop

Input:
  instruction  : str     -- User's natural-language task description
  tools        : dict    -- Python functions available to the agent (may be empty)
  max_turns    : int     -- Maximum number of interaction turns (default: 10)
  llm          : LLM     -- Language model backbone (e.g., Mistral-7B, gpt-4)
  interpreter  : PythonInterpreter  -- Stateful Python execution environment

Output:
  answer       : str     -- Final answer to the user's instruction

1:  context ← [system_prompt] + [("user", instruction)]
2:  turn ← 0

3:  while turn < max_turns:
4:      response ← llm.generate(context)
5:
6:      if "<execute>" in response:
7:          code_block ← extract_between_tags(response, "<execute>", "</execute>")
8:          observation ← interpreter.run(code_block)    // stdout + stderr
9:          context.append(("assistant", response))
10:         context.append(("environment", observation))
11:         turn ← turn + 1
12:
13:     else if "Answer:" in response:
14:         answer ← extract_answer(response)
15:         return answer
16:
17:     else:
18:         // Free-form natural language response (e.g., asking for clarification)
19:         context.append(("assistant", response))
20:         user_reply ← wait_for_user()
21:         context.append(("user", user_reply))
22:         turn ← turn + 1
23:
24: return None   // Exceeded max turns without answer

Line-by-line explanation:

  • Line 1: The context is initialized with the system prompt (the 6-line CodeAct instruction from Appendix E) and the user’s instruction. The tools dictionary, if non-empty, is also appended as Python function definitions in the system prompt.

  • Line 4: The LLM generates a response conditioned on the full context window. This is a standard autoregressive generation step.

  • Lines 6–11: If the response contains an <execute> block, the code is extracted and passed to the Python interpreter. The interpreter is stateful: variables defined in turn $t$ are available in turn $t+1$. The observation (stdout + stderr) is appended to context as an “environment” turn.

  • Line 13–15: If the response contains “Answer:”, the agent believes it has sufficient information and returns the answer. This is the success path.

  • Line 17–22: If neither <execute> nor “Answer:” appears, the agent is asking the user for clarification. This is uncommon in practice (the system prompt encourages executing code rather than asking).

  • Line 24: If max_turns turns elapse without an answer, the interaction terminates. Success rate measurements count this as failure.
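The loop above can be sketched in runnable Python. This is a minimal illustration, not the authors' implementation: `PersistentInterpreter`, `codeact_loop`, and `scripted_llm` are hypothetical names, the system prompt is a placeholder, and the "model" is a scripted stub that reacts to the last observation. Unlike Algorithm 1, the sketch simply stops when the reply contains neither code nor an answer instead of waiting for a user.

```python
import contextlib
import io
import re
import traceback

EXECUTE_RE = re.compile(r"<execute>(.*?)</execute>", re.DOTALL)


class PersistentInterpreter:
    """Stateful executor: one namespace dict is shared by every turn."""

    def __init__(self, tools=None):
        self.namespace = dict(tools or {})

    def run(self, code):
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            try:
                exec(code, self.namespace)
            except Exception:
                # The traceback text becomes the observation the model reads.
                buffer.write(traceback.format_exc())
        return buffer.getvalue()


def codeact_loop(instruction, llm, interpreter, max_turns=10):
    context = [("system", "You may run Python inside <execute> tags."),
               ("user", instruction)]
    for _ in range(max_turns):
        response = llm(context)
        context.append(("assistant", response))
        match = EXECUTE_RE.search(response)
        if match:
            observation = interpreter.run(match.group(1).strip())
            context.append(("environment", observation))
        elif "Answer:" in response:
            return response.split("Answer:", 1)[1].strip()
        else:
            return None  # simplification: no user to ask in this sketch
    return None


def scripted_llm(context):
    """Hypothetical stand-in for a model: fails once, reads the error, fixes it."""
    role, last = context[-1]
    if role == "user":
        return "<execute>price = float('$12.50')\nprint(price * 2)</execute>"
    if "ValueError" in last:
        return ("<execute>price = float('$12.50'.replace('$', ''))\n"
                "print(price * 2)</execute>")
    return "Answer: " + last.strip()


answer = codeact_loop("Double the price $12.50.", scripted_llm,
                      PersistentInterpreter())
print(answer)
```

The scripted run takes three model calls: a failing action, a corrected action derived from the `ValueError` traceback, and the final answer.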

Key Invariant: Interpreter State Persistence

The most important property of the CodeAct loop is that the Python interpreter maintains state across turns:

$$\text{State}(t+1) = \text{State}(t) \cup \text{NewBindings}(\text{code}_t) \tag{6}$$

After executing x = 42 in turn 3, the variable x is available in turn 4. This is the fundamental mechanism behind data flow — and it’s simply how a Python REPL works. The implementation detail that makes this work is that the interpreter is not restarted between turns, i.e., it behaves like a persistent Jupyter kernel.
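A small sketch of this invariant, using Python's built-in `exec` with a shared namespace dict as a stand-in for the persistent kernel (the variable names are illustrative). It also shows one nuance of Equation 6: the "union" behaves like a dict update, so a later turn that rebinds a name overrides the earlier value.

```python
shared = {}                        # plays the role of State(t)
exec("x = 42", shared)             # turn 3 defines x
exec("y = x + 1", shared)          # turn 4 can read x
exec("x = 'rebound'", shared)      # a later turn overrides the binding

fresh_error = None
try:
    exec("y = x + 1", {})          # restarted interpreter: x is gone
except NameError as err:
    fresh_error = str(err)

print(shared["y"], shared["x"], fresh_error)
```

If the interpreter were restarted between turns (the empty dict above), every turn would have to recompute or re-fetch its inputs, which is the situation the JSON loop is in.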

Comparison: JSON Agent Loop Algorithm

For contrast, here is the equivalent JSON agent loop:

Algorithm 2: JSON Tool-Calling Agent Loop (for comparison)

1:  context ← [system_prompt + tool_schemas] + [("user", instruction)]
2:  turn ← 0

3:  while turn < max_turns:
4:      response ← llm.generate(context)
5:
6:      if "Action:" in response:
7:          json_str ← extract_json(response)
8:          tool_name ← json_str["tool"]
9:          args ← json_str["args"]
10:         if tool_name in registered_tools:
11:             result ← registered_tools[tool_name](**args)
12:         else:
13:             result ← {"error": f"Unknown tool {tool_name}"}
14:         context.append(("assistant", response))
15:         context.append(("observation", str(result)))
16:         turn ← turn + 1
17:
18:     else if "Answer:" in response:
19:         return extract_answer(response)
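20:     else:
21:         // No parsable action or answer: record the reply and count the turn so the loop terminates
22:         context.append(("assistant", response))
23:         turn ← turn + 1
24:
25: return None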

Key differences:

  1. Lines 8–11 (vs. lines 7–8 of Algorithm 1): JSON requires parsing the tool name from a structured string and dispatching to a pre-registered handler. CodeAct directly executes arbitrary Python, so no dispatch table is needed.

  2. Line 13: If the JSON names an unknown tool, it fails with a generic error. In CodeAct, the Python interpreter returns NameError: name 'unknown_tool' is not defined — more informative.

  3. Data flow: The JSON loop has no mechanism for variable binding across turns. If turn 3 computes price = 42.0, turn 4 cannot reference price — it must re-query the tool. CodeAct’s interpreter state handles this automatically.
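To make the second difference concrete, here is a hedged sketch that feeds the same mistake (a misspelled tool name) to both styles of loop and compares the observation the model would see. `get_price`, `json_step`, and `code_step` are hypothetical helpers invented for this illustration.

```python
import json
import traceback

registered_tools = {"get_price": lambda item: 42.0}  # hypothetical tool


def json_step(action_text):
    """Dispatch one JSON action through a tool registry."""
    action = json.loads(action_text)
    tool = registered_tools.get(action["tool"])
    if tool is None:
        return {"error": f"Unknown tool {action['tool']}"}
    return {"result": tool(**action["args"])}


def code_step(code, namespace):
    """Execute one code action; on failure return the full traceback."""
    try:
        exec(code, namespace)
        return "ok"
    except Exception:
        return traceback.format_exc()


json_obs = json_step('{"tool": "get_prise", "args": {"item": "phone"}}')
code_obs = code_step("p = get_prise('phone')", dict(registered_tools))
print(json_obs)
print(code_obs)
```

The JSON observation is whatever string the framework author chose to write, while the code observation is a standard Python traceback naming the undefined symbol and the line that used it. Recent Python versions may additionally suggest the closest defined name.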

Understanding the Experiments: Statistical Nuances

Why 17 LLMs?

The choice to test 17 models is not arbitrary. The paper wants to show that CodeAct’s advantage is not model-specific: it holds across closed-source and open-source, large and small, instruction-tuned and base models. Table 2 and Table 3 both include a “frequency of best-performing format” row at the bottom:

| Metric | CodeAct | JSON | Text |
| --- | --- | --- | --- |
| Best format (API-Bank, open-source) | 5 | 0 | 1 |
| Best format (API-Bank, closed-source) | 4 | 5 | 0 |
| Best format (M³ToolEval, open-source) | 5 | 4 | 3 |
| Best format (M³ToolEval, closed-source) | 7 | 2 | 1 |

On M³ToolEval (the more meaningful benchmark), CodeAct is the best format for 12 out of 17 LLMs (5 open-source plus 7 closed-source), a strong majority. By the same row sums, JSON is best for 6 and text for 4 (a model can tie for best, so the counts sum to more than 17).

The Closed-Source JSON Anomaly

Why does JSON sometimes beat CodeAct for closed-source models on API-Bank? Because GPT-3.5, GPT-4, and Claude have all been fine-tuned with JSON function calling as a specific capability. Their training data includes function-calling examples in exactly the JSON schema format. For single atomic calls (API-Bank), this targeted fine-tuning gives them an advantage over code-format. For multi-tool tasks (M³ToolEval), this advantage disappears because no amount of JSON fine-tuning can add for-loops to the action space.

This analysis predicts: as open-source models receive more instruction tuning (including function-calling), the JSON advantage for closed-source models will shrink. By 2025-2026, this prediction has largely come true — most state-of-the-art open-source models have explicit function-calling capability, and code-first agents (like Claude Computer Use, OpenHands) have become the dominant paradigm.

The Turn Efficiency Formula

The expected number of turns for a JSON agent to complete a task requiring $N$ tool calls is at minimum $N$ (ignoring retries). For CodeAct, a loop over $N$ tools is a single turn:

$$T_{\text{JSON}} \geq N \quad \text{vs} \quad T_{\text{CodeAct}} = 1 + \delta \tag{7}$$

where $\delta \geq 0$ is the number of additional self-debug turns triggered by Python errors. In the paper’s experiments, $\delta_{\text{avg}} \approx 0.5$ for strong models (gpt-4), meaning CodeAct uses about 1.5 turns for a task that needs $N \geq 4$ tool calls.

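The result: 30% fewer turns on average (the paper’s headline number), which means roughly 30% fewer LLM API calls. The inference-cost saving is of a similar order but not identical, because cost scales with the tokens processed per call rather than with the number of calls.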
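As a worked instance of Equation 7, the sketch below plugs in the illustrative values used above ($N = 4$, $\delta = 0.5$). The helper names are hypothetical, and the JSON figure is the lower bound with no retries.

```python
def json_turns(n_tool_calls, retries=0):
    """Lower bound for a JSON agent: one turn per tool call, plus retries."""
    return n_tool_calls + retries


def codeact_turns(delta):
    """One code action covering all calls, plus self-debug turns."""
    return 1 + delta


n, delta = 4, 0.5
t_json = json_turns(n)
t_code = codeact_turns(delta)
saving = 1 - t_code / t_json
print(t_json, t_code, f"{saving:.1%}")
```

For this loop-shaped task the saving is far larger than the averaged figure, which is expected: an average over a benchmark also includes tasks that need only one or two tool calls, where a single code action saves little.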

The Data Mixture Problem: A Closer Look

Why Does Removing General Conversation Hurt So Much?

The ablation result in Table A.8 shows that removing general conversation data catastrophically hurts Miniwob++ (46.2 → 0.0). This is surprising — why would text-based web interaction require general conversation data?

The explanation is a form of catastrophic forgetting. When fine-tuned exclusively on CodeActInstruct (agent trajectories), the model learns to always respond in the code-as-action format. When evaluated on Miniwob++ (which expects text-format actions), the model can’t switch to the text format — it has forgotten that text actions are even possible. Adding general conversation data reminds the model that natural language responses are also valid.

This is a general principle in instruction tuning: mixture diversity prevents format lock-in. A model trained on one narrow distribution of outputs will produce that format even when it’s inappropriate. Diverse training maintains the model’s ability to modulate output format based on context.

The Data Mixing Objective

Formally, the training distribution is:

$$\mathcal{D}_{\text{train}} = \alpha \cdot \mathcal{D}_{\text{CodeAct}} + (1-\alpha) \cdot \mathcal{D}_{\text{general}} \tag{8}$$

where $\alpha = 7{,}139 / (7{,}139 + 69{,}230) \approx 0.093$ in the paper’s setup. The agent trajectories are about 9% of all training examples by count, but since they are much longer (~1,482 tokens per instance vs. ~797 for general data), they represent approximately 16% of total training tokens.

The authors don’t ablate over $\alpha$; finding the optimal mixture ratio is left as an open problem.
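The two ways of measuring the mixture (by example count and by token count) can be checked with a few lines of arithmetic. The instance counts and mean token lengths are the ones quoted above; treating the mean lengths as exact is an assumption of this sketch.

```python
n_agent, n_general = 7_139, 69_230
tok_agent, tok_general = 1_482, 797   # approximate mean tokens per instance

alpha_by_count = n_agent / (n_agent + n_general)

agent_tokens = n_agent * tok_agent
general_tokens = n_general * tok_general
alpha_by_tokens = agent_tokens / (agent_tokens + general_tokens)

print(f"{alpha_by_count:.3f} {alpha_by_tokens:.3f}")
```

The token-weighted share is noticeably larger than the count-weighted one, which matters if the mixture is ever tuned: a ratio specified over examples and a ratio specified over tokens are not interchangeable.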

CodeAct’s Influence: The Post-2024 Landscape

How the Field Moved After CodeAct

CodeAct’s most lasting contribution is not its specific model (CodeActAgent has been superseded by many larger and better models) but its conceptualization of code as a universal agent action space. Looking at major agent frameworks built after the paper:

OpenHands (formerly OpenDevin, 2024): Directly builds on CodeAct’s framework. Uses a CodeAct-style code execution loop as the core agent loop. The lead author of CodeAct (Xingyao Wang) is also a contributor to OpenHands.

Claude Computer Use (Anthropic, Oct 2024): Uses code (bash commands + Python) as the primary action language for computer control tasks. JSON tool calls are available but secondary.

Codex/o3-mini Agent Mode (OpenAI, 2025): Generates Python code for execution as part of task solving — exactly the CodeAct pattern.

Agent frameworks (LangChain, LlamaIndex, 2024+): Both added Python REPL tools as first-class action types after CodeAct demonstrated their superiority.

timeline
    title CodeAct's Influence on Agent Frameworks
    2024-02 : CodeAct paper (arXiv)
    2024-06 : ICML 2024 acceptance
    2024-07 : OpenHands (OpenDevin) adopts CodeAct loop
    2024-10 : Claude Computer Use (code-first actions)
    2025-01 : Most SOTA agent benchmarks use code actions
    2025-06 : Code-as-action becomes default in all major agent frameworks

What Hasn’t Changed: The Safety Problem

Despite widespread adoption, the fundamental security concern remains unresolved. Executing arbitrary LLM-generated Python in a production environment requires:

  1. Process isolation (Docker container or VM)
  2. Network restrictions (no outbound HTTP except allowed endpoints)
  3. Filesystem restrictions (no access to sensitive paths)
  4. Resource limits (CPU time, memory, disk I/O)
  5. Code auditing (detecting potentially malicious patterns before execution)

Most agent frameworks today run CodeAct in Docker containers with restricted capabilities. But the attack surface is real: an adversarially crafted instruction could cause the agent to execute os.system("rm -rf /") if the sandbox is misconfigured. The paper acknowledges this but doesn’t provide a solution — it remains an active research area as of 2026.
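As a rough illustration of item 4 only, the sketch below runs a code action in a separate Python process with a wall-clock timeout and a CPU-time limit. It is Unix-only (it relies on the standard `resource` module), `run_untrusted` is a hypothetical helper, and it is not a sandbox: it does nothing about network or filesystem access, which is what containers or VMs are for.

```python
import resource
import subprocess
import sys


def _limit_cpu(seconds):
    """Build a hook that caps CPU time in the child before it starts."""
    def apply():
        resource.setrlimit(resource.RLIMIT_CPU, (seconds, seconds))
    return apply


def run_untrusted(code, wall_timeout=2.0, cpu_seconds=2):
    """Run code in a fresh Python process; returns (status, output)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=wall_timeout,
            preexec_fn=_limit_cpu(cpu_seconds),
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""
    status = "ok" if proc.returncode == 0 else "error"
    return status, proc.stdout + proc.stderr


print(run_untrusted("print(1 + 1)"))
print(run_untrusted("while True: pass", wall_timeout=1.0)[0])
```

Note the tension with the state-persistence invariant: a fresh process per action loses variables between turns, so practical setups tend to keep one long-lived interpreter inside the isolated environment and apply limits to that environment instead.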

Experimental Figure Annotations

Figure 1 (Paper): The Motivating Example

The paper’s Figure 1 shows side-by-side comparison of JSON vs. CodeAct for determining the cheapest country to buy a smartphone. The JSON agent needs at least 8 turns (2 per country × 4 countries minimum), while CodeAct does it in 1 turn using a for-loop. Key annotations:

  • “Control & Data Flow of Code Simplifies Complex Operations” — the for-loop handles all countries in one block
  • “Re-use ‘min’ Function from Existing Software Infrastructures (Python library)”: min(final_prices, key=...) is a built-in that JSON agents can’t use
  • “Fewer Actions Required!” — the comparison caption
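The single code action described in these annotations might look roughly like the following. The three tool functions and all numbers are hypothetical stubs made up for this sketch; the figure in the paper uses its own tool set and values.

```python
# Hypothetical tool stubs with made-up data (not the paper's APIs or numbers).
PRICES = {"USA": 799.0, "Japan": 110_000.0, "Germany": 750.0, "India": 60_000.0}
TO_USD = {"USA": 1.0, "Japan": 0.0067, "Germany": 1.08, "India": 0.012}
TAX = {"USA": 0.08, "Japan": 0.10, "Germany": 0.19, "India": 0.18}


def lookup_phone_price(country):
    return PRICES[country]           # price in local currency


def lookup_rates(country):
    return TO_USD[country], TAX[country]


def convert_and_tax(price, rate, tax):
    return price * rate * (1 + tax)  # price in USD including tax


# One code action: control flow (the loop) plus data flow (the variables).
final_prices = {}
for country in PRICES:
    rate, tax = lookup_rates(country)
    local_price = lookup_phone_price(country)
    final_prices[country] = convert_and_tax(local_price, rate, tax)

cheapest = min(final_prices, key=final_prices.get)
print(cheapest, round(final_prices[cheapest], 2))
```

A JSON agent would need a separate turn for each `lookup_*` call and would still have to do the final comparison in natural language, since intermediate values cannot be stored in variables.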

Figure 2 (Paper): The Agent Framework Diagram

Figure 2 shows the general multi-turn interaction framework. Key components:

  • Agent: receives observations, runs chain-of-thought, emits actions
  • Environment: two types — Computer interface (information seeking, tools, external memory) and Physical world (robots)
  • User: provides instructions, receives natural-language responses
  • Planning module: chain-of-thought, self-reflection, learning from prior observations

The diagram makes explicit that CodeAct is a general framework — not specific to any one application. The same loop works for web browsing (click_url in Python), database queries (sqlite3 in Python), math (sympy in Python), and robot control (robot APIs in Python).

Figure 3 (Paper): CodeActAgent in Action

The data-analysis session in Figure 3 demonstrates four self-debug cycles, two of which are:

  1. Type error on $ character in price column → fixed with str.replace('$','')
  2. tick_params() wrong argument → fixed by switching to xticks(rotation=...)

Both fixes are derived purely from reading the traceback, with no human-provided hint about what’s wrong.

Final Verdict

CodeAct is one of the cleaner papers in the LLM agent literature of 2024 — it makes a specific, testable claim (code outperforms JSON/text), provides rigorous experiments across 17 models and two benchmarks, releases all code/data/models, and builds a concrete artifact (CodeActAgent) that demonstrates the idea works in practice.

The 20% success rate improvement headline is real and reproducible. The more important contribution is conceptual: code is the right level of abstraction for agent actions. Not natural language (too ambiguous), not JSON (too constrained), but executable code — the universal “language of computation” that LLMs have been reading and writing since before they were called LLMs.

The sandboxing and safety gap is the paper’s main unresolved issue, and it’s an important one for anyone deploying CodeAct-style agents in production. But for research purposes, the case is made compellingly.