Teaching Agents to Ask for Help

Metric	Before	After 29 Steps
Completion Rate	5.5%	44.5%
Reward	-0.528	+0.407 (peak +0.824 at step 23)
Format Errors	88.3%	31.3%
Phase	Scaffold (6 expert / 2 solo)	Independence (2 expert / 6 solo)

What Is This?

Imagine teaching a new employee. You wouldn't just hand them a manual and walk away. You also wouldn't stand behind them dictating every keystroke.

The best approach? Let them try, and tell them an expert is available if they get stuck.

We give a small language model (Qwen3-4B) a set of ~35 API tools, a task description, and access to a brilliant advisor (GPT-5.1). Then we use reinforcement learning (GRPO) to teach it when calling the expert leads to better outcomes — and ultimately, when it can fly solo.

The Environment: Agent World Model (AWM)

The Agent World Model is a benchmark of 1,005 simulated web environments — each one a unique "website" backed by a SQLite database and a REST API exposed as MCP tools.

What's inside each environment?

Component	Details
Database	SQLite with pre-seeded data (users, records, relationships)
API	30-40 REST endpoints auto-generated as MCP tools
Tasks	10 per environment — natural language instructions the agent must complete
Verifier	Python code that checks the final DB state for correctness
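To make the verifier idea concrete, here is a minimal sketch of a state-checking function. The `workflows` table, its columns, and the function name are assumptions invented for illustration; the actual AWM schemas and verifier code are not shown in this post.

```python
import sqlite3


def verify_workflow_created(conn: sqlite3.Connection) -> bool:
    """Return True if exactly one draft workflow with the expected name exists."""
    row = conn.execute(
        "SELECT COUNT(*) FROM workflows WHERE name = ? AND status = ?",
        ("Lead to Support Sync", "draft"),
    ).fetchone()
    return row[0] == 1


# A throwaway in-memory database standing in for one environment (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workflows (id INTEGER PRIMARY KEY, name TEXT, status TEXT)")

before = verify_workflow_created(conn)  # nothing has been created yet
conn.execute(
    "INSERT INTO workflows (name, status) VALUES (?, ?)",
    ("Lead to Support Sync", "draft"),
)
after = verify_workflow_created(conn)  # the expected row now exists

print(before, after)  # False True
```

The point of checking final database state rather than the agent's transcript is that any sequence of API calls that leaves the right rows behind counts as success.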

Example domains (from 1,005 environments)

Domain	Example Environments	Example Task
Workflow Automation	FlowLatch, FlowMesh, DocRelay	"Create a workflow named 'Lead to Support Sync' in draft status..."
E-Commerce	Amazon, eBay, Shopify Admin	"Search for 'wireless headphones' and add the top-rated item..."
Dev Tools	GitHub, Jira, ChatGPT	"Create a new branch 'feature/dark-mode' and open a pull request..."
IoT / Smart Home	NestGrid, RoomAura, VetLoop	"Register a smart thermostat for room 805 with firmware v3.1.0..."
Social Media	YouTube, Reddit, LinkedIn	"Subscribe to 'Kurzgesagt' and add their latest video to playlist..."

Training uses 53 tasks across 8 workflow automation environments, with 29 held-out tasks for validation.

What Does the Agent Actually Do?

The agent interacts through two simple primitives:

list_tools — discovers all available API endpoints (typically 30-40 per environment)
call_tool — calls a specific API endpoint with JSON arguments

And optionally:

ask_expert — consults GPT-5.1 for a step-by-step plan (max 3 calls per task)

Every tool call must be wrapped in XML:

<tool_call>
{"name": "call_tool", "arguments": {"tool_name": "create_workflow",
  "arguments": "{\"name\": \"Lead Sync\", \"status\": \"draft\"}"}}
</tool_call>

At step 1, about 88% of rollouts fail simply because they do not produce this wrapper correctly. Formatting is the gateway skill the agent must learn first.
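The wrapper can be checked mechanically. The sketch below is one plausible parser, not the training code's actual one: `parse_tool_call` and its error rules are assumptions, and it returns `None` wherever a rollout would count as a format error under those rules.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_call(text: str):
    """Return (name, arguments) for a well-formed call, or None on a format error."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        return None  # missing or unclosed <tool_call> wrapper
    try:
        payload = json.loads(match.group(1))
        name = payload["name"]
        args = payload["arguments"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # body is not a JSON object with the expected keys
    if not isinstance(name, str) or not isinstance(args, dict):
        return None
    # call_tool carries the endpoint's arguments as a JSON-encoded string.
    inner = args.get("arguments")
    if isinstance(inner, str):
        try:
            args = {**args, "arguments": json.loads(inner)}
        except json.JSONDecodeError:
            return None
    return name, args


good = r'''<tool_call>
{"name": "call_tool", "arguments": {"tool_name": "create_workflow",
  "arguments": "{\"name\": \"Lead Sync\", \"status\": \"draft\"}"}}
</tool_call>'''
bad = '{"name": "list_tools", "arguments": {}}'  # valid JSON, but no wrapper

print(parse_tool_call(good))
print(parse_tool_call(bad))  # None
```

Note the two layers of JSON: the outer object names the primitive, while the inner `arguments` string must itself decode cleanly. Either layer can be the one that breaks.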

The Expert Tool

The ask_expert tool is a GPT-5.1 call with a key advantage: it has already analyzed the Python verifier code and knows exactly what database state constitutes success.

Parameter	Details
Model	GPT-5.1 (Azure OpenAI)
Input	Task description + available MCP tool schemas + error context
Output	Precise step-by-step plan with exact tool names and argument values
Max calls	3 per episode
Secret weapon	Verifier-informed — it knows the answer key

Think of it like an open-book exam where one of the "books" is a brilliant tutor who already read the answer key — but you have to decide when it's worth raising your hand.

Architecture

┌──────────────────┐     ┌──────────────────┐
│  Code Verifier   │────▶│  GPT-5.1 Expert  │
│ (success criteria)│     │ (verifier-informed)│
└──────────────────┘     └────────▲─────────┘
                                  │ ask_expert
                                  │ (on-demand)
┌─────────────────────────────────┴──────────────┐
│              Qwen3-4B Agent                     │
│  1. list_tools  →  discover 35 API endpoints   │
│  2. call_tool   →  attempt task independently   │
│  3. ask_expert  →  consult if error or stuck    │
│  4. call_tool   →  follow expert's plan         │
│  5. verify      →  check DB state for reward    │
└─────────────────────┬──────────────────────────┘
                      │ call_tool / list_tools
                      ▼
┌─────────────────────────────────────────────────┐
│           AWM Environment (Docker)              │
│  ┌──────────┐  ┌──────────┐  ┌──────────────┐  │
│  │ SQLite DB│  │ REST API │  │ MCP Tools    │  │
│  │ (state)  │  │ (30-40   │  │ (auto-       │  │
│  │          │  │ endpoints)│  │ generated)   │  │
│  └──────────┘  └──────────┘  └──────────────┘  │
└─────────────────────────────────────────────────┘

The workflow:

1. Discover: the agent calls list_tools to get the API catalog.
2. Try first: the agent attempts the task independently using call_tool.
3. Ask if stuck: if an error occurs, the agent calls ask_expert with context.
4. Execute plan: the agent follows the expert's recovery plan.
5. Verify: the code verifier checks the SQLite database for correctness.
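The workflow above implies two pieces of per-episode bookkeeping: the cap of 3 ask_expert calls, and whether the agent attempted anything before asking. The `Episode` class below is a hypothetical sketch of that bookkeeping. It simplifies "tried first" to "a call_tool precedes the first ask_expert" and does not check whether that attempt actually produced an error.

```python
from dataclasses import dataclass, field

MAX_EXPERT_CALLS = 3  # per-task cap stated for ask_expert


@dataclass
class Episode:
    """Hypothetical bookkeeping for one rollout's sequence of tool calls."""
    actions: list = field(default_factory=list)
    expert_calls: int = 0

    def record(self, action: str) -> bool:
        """Log an action; refuse ask_expert once the budget is spent."""
        if action == "ask_expert":
            if self.expert_calls >= MAX_EXPERT_CALLS:
                return False
            self.expert_calls += 1
        self.actions.append(action)
        return True

    def tried_before_asking(self) -> bool:
        """True if a call_tool attempt precedes the first ask_expert."""
        if "ask_expert" not in self.actions:
            return False
        first_ask = self.actions.index("ask_expert")
        return "call_tool" in self.actions[:first_ask]


# The workflow above, as a scripted trajectory.
ep = Episode()
for action in ["list_tools", "call_tool", "ask_expert", "call_tool"]:
    ep.record(action)

print(ep.tried_before_asking())  # True: an attempt came before the question
print(ep.expert_calls)           # 1
```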

Training Setup

Parameter	Value
Model	Qwen3-4B (SFT checkpoint)
Algorithm	GRPO (Group Relative Policy Optimization)
GPUs	8x NVIDIA B200 (183GB each)
Batch size	16 prompts × 8 rollouts = 128 rollouts/step
Training tasks	53 workflow automation scenarios
Validation tasks	29 held-out scenarios
Reward	Code verifier + LLM judge (GPT-5.1)

Reward Hierarchy

Not all completions are equal. The reward system creates a clear hierarchy that teaches the agent how to use the expert, not just whether to use it:

Pattern	Reward	What Happened
Solo completion	2.0	Agent solved it without calling the expert
Recovery expert	1.3	Agent tried first, hit an error, asked for help, then succeeded
Blind expert	0.8	Agent called expert immediately without trying first
Incomplete	0.1	Agent did something but didn't finish
Format error	-1.0	Agent produced malformed XML — no tool call executed

The key insight: solo completions get the highest reward, but recovery completions (try first, then ask) get more than blind expert calls. This teaches the agent to attempt tasks independently and only escalate when genuinely stuck.
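To get a feel for how the hierarchy plays out across a batch, the sketch below averages the table's rewards over a rollout composition. The counts are a made-up example, not measured data, and the function names are assumptions. The real reward also involves an LLM judge, which this ignores.

```python
# Per-pattern rewards copied from the table above.
REWARDS = {
    "solo": 2.0,
    "recovery": 1.3,
    "blind": 0.8,
    "incomplete": 0.1,
    "format_error": -1.0,
}


def average_reward(counts: dict) -> float:
    """Mean reward over a batch described by per-pattern rollout counts."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("batch is empty")
    return sum(REWARDS[pattern] * n for pattern, n in counts.items()) / total


def format_error_rate(counts: dict) -> float:
    """Fraction of rollouts in the batch that were format errors."""
    return counts.get("format_error", 0) / sum(counts.values())


# Hypothetical batch of 128 rollouts (16 prompts x 8 rollouts).
batch = {"solo": 20, "recovery": 15, "blind": 22, "incomplete": 31, "format_error": 40}

print(round(average_reward(batch), 3))     # 0.314
print(round(format_error_rate(batch), 3))  # 0.312
```

Because a format error costs -1.0 while an incomplete attempt earns +0.1, converting format errors into any well-formed attempt raises the average even before completions improve.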

Experiments

Experiment 1: Baseline (No Expert)

The control group — agent has only list_tools and call_tool.

Milestone	Step	Value
Start	1	Reward: -0.80, format errors (FE): 92%
First improvement	9	Reward: -0.40
Crosses zero	~28	Reward: +0.02
Final (step 29)	29	Reward: +0.05, 22 completions

The baseline works, but slowly. It takes ~28 steps just to reach positive reward.

Experiment 2: Expert-Assisted (Always Available)

Every rollout has access to ask_expert.

Step Window	Expert Reward	Baseline Reward	Expert Advantage
Steps 1-5	-0.37	-0.73	+0.36
Steps 6-10	-0.01	-0.52	+0.51
Steps 11-17	+0.19	-0.22	+0.41

Massive early boost, but the agent learns "always ask first" rather than "ask selectively."

Experiment 3: Mixed Mode (Fixed 50/50 Split)

Half the rollouts get expert access, half don't. Plus a +0.5 bonus for completing solo.

The 50/50 split dominated early (3.4x more completions in steps 1-5), but the baseline overtook it by step 16.

Why? Solo chains maintain 85-87% format errors throughout — they're dead weight. Half the batch produces almost no useful gradient signal.

A fixed 50/50 split wastes compute. You don't yank training wheels off a kid who can't balance yet.
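One way to see why rollouts that all fail the same way contribute so little: in GRPO-style training, each rollout's advantage is its reward measured relative to the other rollouts for the same prompt. The sketch below uses the common mean-centered, std-scaled form. Whether this project's trainer uses exactly this normalization, and how expert and solo rollouts are grouped, is not stated here, so treat both as assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; eps guards against 0
    return [(r - mean) / (std + eps) for r in rewards]


# A group with varied outcomes, using reward values from the hierarchy table.
varied = [2.0, 1.3, 0.8, 0.1, -1.0, -1.0, -1.0, -1.0]
# A group where every rollout is a format error.
all_failed = [-1.0] * 8

adv_varied = group_relative_advantages(varied)
adv_failed = group_relative_advantages(all_failed)

print([round(a, 2) for a in adv_varied])
print(adv_failed)  # every entry is 0.0: nothing to reinforce or penalize
```

When every rollout in a group gets the same reward, all advantages are zero and the group produces no policy gradient, which is the "dead weight" effect described above.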

Experiment 4: Adaptive Ratio — Training Wheels (The Breakthrough)

Instead of a fixed ratio, adapt the expert/solo split based on how well the agent is doing:

Format Error Rate	Phase	Expert	Solo	Rationale
> 70%	Scaffold	6	2	Can't even format tools — heavy expert support
40-70%	Balanced	4	4	Has basics — equal exposure
< 40%	Independence	2	6	Proficient — push toward self-reliance

Combined with "Try First" prompting and graduated reward shaping, this is the winning strategy.
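The thresholds in the table translate directly into a small selection rule. The sketch below is an assumed implementation: `choose_phase` and `simulate` are illustrative names, and the linear decay of the format-error rate is a toy model rather than anything measured.

```python
def choose_phase(format_error_rate: float):
    """Map a format-error rate in [0, 1] to (phase, expert_rollouts, solo_rollouts)."""
    if format_error_rate > 0.70:
        return ("Scaffold", 6, 2)
    if format_error_rate >= 0.40:
        return ("Balanced", 4, 4)
    return ("Independence", 2, 6)


def simulate(initial_fe: float, decay_per_step: float, steps: int):
    """Toy schedule: the format-error rate falls linearly by decay_per_step each step."""
    schedule = []
    fe = initial_fe
    for step in range(1, steps + 1):
        phase, expert, solo = choose_phase(fe)
        schedule.append((step, phase, expert, solo))
        fe = max(0.0, fe - decay_per_step)
    return schedule


# Hypothetical run: start at 88% format errors, lose 2 points per step.
schedule = simulate(initial_fe=0.88, decay_per_step=0.02, steps=30)
for step, phase, expert, solo in schedule[::5]:
    print(f"step {step:2d}: {phase} ({expert}E/{solo}S)")
```

Each group of 8 rollouts always splits into expert plus solo chains that sum to 8; only the proportion moves as the agent's formatting improves.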

Adaptive Training Results (Step by Step)

Step	Reward	Format Error %	Completions	Rate	Phase
1	-0.528	88.3%	7	5.5%	Scaffold (6E/2S)
5	-0.427	87.5%	13	10.2%	Scaffold
10	-0.021	65.6%	25	19.5%	→ Balanced
11	+0.316	65.6%	28	21.9%	Balanced (4E/4S)
15	+0.589	56.3%	41	32.0%	Balanced
20	+0.603	50.0%	37	28.9%	Balanced
23	+0.824	43.0%	51	39.8%	Balanced
28	+0.426	35.2%	53	41.4%	→ Independence
29	+0.407	31.3%	57	44.5%	Independence (2E/6S)
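The Rate column follows from the batch size: each step runs 16 prompts × 8 rollouts = 128 rollouts, so the rate is completions divided by 128. A quick consistency check over the rows above (the variable names are only for this sketch):

```python
ROLLOUTS_PER_STEP = 16 * 8  # 16 prompts x 8 rollouts per prompt

# step -> (completions, reported rate in percent), copied from the table above
reported = {
    1: (7, 5.5),
    5: (13, 10.2),
    10: (25, 19.5),
    11: (28, 21.9),
    15: (41, 32.0),
    20: (37, 28.9),
    23: (51, 39.8),
    28: (53, 41.4),
    29: (57, 44.5),
}


def completion_rate(completions: int) -> float:
    """Completion rate in percent, rounded to one decimal place."""
    return round(100 * completions / ROLLOUTS_PER_STEP, 1)


mismatches = [step for step, (n, pct) in reported.items() if completion_rate(n) != pct]
print(mismatches)  # []
```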

The Training Wheels Analogy

Scaffold (Steps 1-10): Training wheels firmly on. FE: 88% → 66%. Reward: -0.53 → -0.02.
Balanced (Steps 11-27): One training wheel comes off. Peak reward: +0.824. Completions climb to 51.
Independence (Steps 28+): Both wheels off, just a hand hovering. 57 completions (44.5%), FE down to 31.3%.

You don't yank training wheels off a kid who can't balance yet. That's what the fixed 50/50 split did — and it's why it failed.

Head-to-Head: Final Scoreboard

Metric	Adaptive (Try-First)	Adaptive v1	Mixed 50/50	Baseline
Peak reward	+0.824	+0.068	Never positive	Never positive
Peak completions/step	57 (44.5%)	25	~15	~5
Lowest format errors	31.3%	~70%	~75%	~64%
Phase transitions	All 3	None	N/A	N/A
Sustained positive reward	19 consecutive steps	No	No	No

Expert Calling Behavior Over Time

How does the agent actually use the expert as training progresses?

Pattern	After 8 Steps	After 16 Steps	After 29 Steps	Trend
Blind expert (called first)	255	360	1,115	Decreasing share
Recovery expert (after error)	28	55	272	Steadily growing
Solo completions (no expert)	9	34	635	Exploding growth

The shift happened. Solo completions grew from 9 after 8 steps to 635 after 29, and the blind-to-solo ratio fell from roughly 28:1 to under 2:1.

Key Findings

1. "Try First, Ask If Stuck" Is The Winning Strategy

Changing a single instruction, from "call the expert for complex tasks" to "try first, then ask if stuck", combined with reward shaping, produced an agent whose completion rate is roughly 10x the baseline's (44.5% vs ~4%).

2. Graduated Reward Shaping Beats Binary Bonuses

A blunt +0.5 solo bonus rarely triggered early on. The graduated system (solo=2.0, recovery=1.3, blind=0.8) gives the model a gradient to climb at every skill level. Over 29 steps: 635 solo completions and 272 recovery expert calls.

3. Format Errors Are The Gateway Skill

88% of rollouts die from malformed XML in step 1. By step 29, this drops to 31.3%. An agent that can't format a tool call can't learn anything else.

4. All Three Adaptive Phases Were Reached

The model moved through every phase: it crossed from Scaffold into Balanced at step 10 and from Balanced into Independence at step 28. Completions kept rising through each transition rather than regressing.

5. Scaffolding Then Independence > Either Alone

Expert scaffolding teaches what good tool calls look like (steps 1-10). Balanced exposure teaches when to ask (steps 11-27). Independence proves the agent can ride solo (steps 28+).

Bug Fixes

Bug	Impact	Fix
SQLite escaped quotes	`\'` not recognized — seed data silently failed	Convert `\'` to `''` in `db_manager.py`
FastAPI 500 errors	Opaque errors, no traceback	Injected Starlette exception handler
Hardcoded Azure config	Couldn't switch models	Model/endpoint/version from env vars
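The first bug is easy to reproduce. SQLite does not treat a backslash as an escape character inside string literals; a literal single quote is written by doubling it. The sketch below shows the failure and a naive version of the conversion. `fix_escaped_quotes` is an illustrative name, not the actual code in `db_manager.py`, and a plain replace like this would mishandle a literal backslash that happens to precede a closing quote.

```python
import sqlite3


def fix_escaped_quotes(sql: str) -> str:
    """Rewrite backslash-escaped single quotes as SQL-style doubled quotes."""
    return sql.replace("\\'", "''")


# Seed statement written with a backslash escape, as a code generator might emit it.
seed = "INSERT INTO users (name) VALUES ('O\\'Brien')"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

try:
    conn.execute(seed)
    raw_ok = True
except sqlite3.Error:
    raw_ok = False  # SQLite rejects the backslash-escaped form

conn.execute(fix_escaped_quotes(seed))
name = conn.execute("SELECT name FROM users").fetchone()[0]

print(raw_ok)  # False
print(name)    # O'Brien
```

Where the seed data is available as values rather than finished SQL text, parameterized queries (`?` placeholders) avoid the quoting problem entirely.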

Research Directions

1. Curriculum Learning for Expert Withdrawal

Track per-task difficulty. Easy tasks lose expert access first; hard tasks keep it longer.

2. Confidence-Calibrated Expert Calls

Train the agent to output a confidence score before each action. Below a threshold, it calls the expert. The holy grail: an agent that knows when it doesn't know.

3. Expert Distillation

Distill the expert's knowledge into a small auxiliary model. Remove the latency and cost of live GPT-5.1 calls while preserving the scaffolding benefit.

4. Multi-Expert Ensembles

What if the agent could choose between a "tool formatting" expert, a "task planning" expert, and an "error recovery" expert?

5. Self-Play Expert Bootstrapping

Use the agent's own successful trajectories as "expert" demonstrations for the next round.

6. Meta-Learning Across Tasks

Train across diverse environments so the agent learns a general "when to ask for help" policy that transfers to any environment.

The broader vision: The most capable AI systems won't be the ones that know everything — they'll be the ones that know the boundaries of their own knowledge and can effectively collaborate with other systems (or humans) to fill the gaps.

Quick Start

Train locally

pip install "trl[vllm]" "openenv-core[core]>=0.2.1" openai datasets

# Start the AWM server
uvicorn agent_world_model_env.server.app:app --host 127.0.0.1 --port 8899

# Train with GRPO
python train_grpo_awm.py --model Qwen/Qwen3-0.6B --max-steps 50

# Train with expert tool
export AZURE_OPENAI_API_KEY="your-key"
export AZURE_OPENAI_ENDPOINT="https://your-endpoint.openai.azure.com/"
python train_grpo_awm.py --model Qwen/Qwen3-0.6B --use-expert --max-steps 50

Run inference benchmark

python run_awm_task_dynamic_expert.py workflow_automation_1
python run_awm_task_dynamic_expert.py workflow_automation_1 --expert-only
