Teaching Agents to Ask for Help

Dynamic Expert-in-the-Loop GRPO Training on Agent World Model

"The mark of wisdom is not knowing everything — it's knowing when to ask."

| Metric | Before | After 29 Steps |
| --- | --- | --- |
| Completion Rate | 5.5% | 44.5% |
| Reward | -0.528 | +0.824 (peak) |
| Format Errors | 88.3% | 31.3% |
| Phase | Scaffold (6 expert / 2 solo) | Independence (2 expert / 6 solo) |

What Is This?

Imagine teaching a new employee. You wouldn't just hand them a manual and walk away. You also wouldn't stand behind them dictating every keystroke.

The best approach? Let them try, and tell them an expert is available if they get stuck.

We give a small language model (Qwen3-4B) a set of ~35 API tools, a task description, and access to a brilliant advisor (GPT-5.1). Then we use reinforcement learning (GRPO) to teach it when calling the expert leads to better outcomes — and ultimately, when it can fly solo.
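The group-relative part of GRPO can be illustrated with a small sketch. It assumes the "6 expert / 2 solo" split refers to the makeup of one group of 8 rollouts for the same task, which is an interpretation rather than something the text spells out. The reward values and the helper names `build_group` and `group_relative_advantages` are hypothetical, chosen only for illustration.

```python
from statistics import mean, pstdev

def build_group(num_expert, num_solo):
    """Label each rollout in a group by whether the expert tool is offered.

    Assumption: a group mixes expert-enabled and solo rollouts of one task.
    """
    return ["expert"] * num_expert + ["solo"] * num_solo

def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout relative to its own group, as GRPO does.

    Each reward is centered on the group mean and scaled by the group
    standard deviation, so no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for one task in the scaffold phase (6 expert, 2 solo).
modes = build_group(6, 2)
rewards = [0.8, 0.8, -0.5, 0.8, -0.5, 0.8, -0.5, 0.8]
advantages = group_relative_advantages(rewards)

for mode, reward, adv in zip(modes, rewards, advantages):
    print(f"{mode:6s} reward={reward:+.2f} advantage={adv:+.3f}")
```

Rollouts that beat the group average get a positive advantage and are reinforced, whichever mode produced them. That is how a solo rollout that succeeds can be rewarded over an expert-assisted one that fails.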

The Environment: Agent World Model (AWM)

The Agent World Model is a benchmark of 1,005 simulated web environments — each one a unique "website" backed by a SQLite database and a REST API exposed as MCP tools.

What's inside each environment?

| Component | Details |
| --- | --- |
| Database | SQLite with pre-seeded data (users, records, relationships) |
| API | 30-40 REST endpoints auto-generated as MCP tools |
| Tasks | 10 per environment: natural-language instructions the agent must complete |
| Verifier | Python code that checks the final DB state for correctness |
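As a rough idea of what a state-based verifier might look like, here is a minimal sketch for the workflow task quoted in the domain examples ("Create a workflow named 'Lead to Support Sync' in draft status"). The `workflows` table, its columns, and the function name are hypothetical. The actual AWM schemas and verifier interface are not shown in this document.

```python
import sqlite3

def verify_workflow_created(conn, name, status):
    """Return True if exactly one workflow with this name and status exists.

    Assumption: a hypothetical `workflows` table with `name` and `status` columns.
    """
    row = conn.execute(
        "SELECT COUNT(*) FROM workflows WHERE name = ? AND status = ?",
        (name, status),
    ).fetchone()
    return row[0] == 1

# Stand-in for an environment database (in-memory, hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workflows (id INTEGER PRIMARY KEY, name TEXT, status TEXT)"
)

# Before the agent acts, the check fails.
before = verify_workflow_created(conn, "Lead to Support Sync", "draft")

# Simulate the effect of a successful tool call by the agent.
conn.execute(
    "INSERT INTO workflows (name, status) VALUES (?, ?)",
    ("Lead to Support Sync", "draft"),
)
after = verify_workflow_created(conn, "Lead to Support Sync", "draft")

print(before, after)  # False True
```

Checking the final database state, rather than the sequence of tool calls, means any valid path to the goal counts as a completion.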

Example domains (from 1,005 environments)

| Domain | Example Environments | Example Task |
| --- | --- | --- |
| Workflow Automation | FlowLatch, FlowMesh, DocRelay | "Create a workflow named 'Lead to Support Sync' in draft status..." |
| E-Commerce | Amazon, eBay, Shopify Admin | "Search for 'wireless headphones' and add the top-rated item..." |
| Dev Tools | GitHub, Jira, ChatGPT | "Create a new branch 'feature/dark-mode' and open a pull request..." |
| IoT / Smart Home | NestGrid, RoomAura, VetLoop | "Register a smart thermostat for room 805 with firmware v3.1.0..." |
| Social Media | YouTube, Reddit, LinkedIn | "Subscribe to 'Kurzgesagt' and add their latest video to playlist..." |

Training uses 53 tasks across 8 workflow automation environments, with 29 held-out tasks for validation.


Built with OpenEnv · Source · EXPERT_ENHANCEMENT.md