Qwen-AgentWorld Deployment Guide: Run the Open 35B Language World Model Locally
A practical English guide to Qwen-AgentWorld, Alibaba Qwen's language world model for AI agents. Learn its seven-domain design, training pip...
Qwen-AgentWorld is a language world model released by the Qwen team for simulating agent environments. Instead of only answering questions like a general chat model, it is designed to predict what an environment would return after an agent takes an action.
This makes it especially relevant for AI agent research, simulated reinforcement learning, benchmark evaluation, and local experiments around terminal, software engineering, search, MCP, web, operating system, and Android-style environments.
This article is a lightly rewritten and translated version of the original Chinese article. The structure, technical flow, commands, tables, and key ideas are preserved, while the language has been adjusted for smoother English reading and SEO publishing.
Source note: The original article was published on CSDN and states that it follows the CC BY-SA 4.0 license. Original source: Qwen-AgentWorld完整部署指南:免费开源,性能超GPT-5.4,5分钟跑起来.
Verification note: Official Qwen pages confirm the public release of Qwen-AgentWorld-35B-A3B model weights and AgentWorldBench. The larger Qwen-AgentWorld-397B-A17B is included in official benchmark results, but the public model page and GitHub release primarily point to the 35B-A3B model weights.
1. Background: Why Do We Need a Language World Model?
Over the past two years, AI agents have moved quickly from simple chat assistants into tools that can operate websites, run terminal commands, control mobile apps, and complete software engineering tasks.
But training a strong agent is expensive. It often requires large volumes of real environment interaction, and that creates several practical problems:
Building and maintaining environments is tedious.
Data collection is slow and hard to scale.
Real environments carry risk, especially when testing failure cases or injecting controlled disruptions.
A Language World Model, or LWM, is built to solve this problem. The idea is simple but powerful: let a model play the role of the environment. Given an agent action and the interaction history, the model predicts the next environment state.
With that setup, agents can be trained and evaluated in simulation instead of always relying on real systems.
On 2026-06-24, the Qwen team released Qwen-AgentWorld, a native language world model that unifies seven agent interaction domains in one model. The companion benchmark, AgentWorldBench, was also released.
2. Core Idea: What Makes It a “Native” World Model?
The word native is important here. Qwen-AgentWorld is not just a general-purpose LLM adapted after training to imitate an environment. Its world-modeling goal is built into the training process from the beginning.
Comparison Dimension
Traditional Approach
Qwen-AgentWorld
Training starting point
Fine-tune a general LLM
Treat environment modeling as the goal from CPT onward
Training process
Usually SFT or RL only
CPT → SFT → RL
Environment knowledge
Added through extra data or adaptation
Internalized during training
Domain coverage
One or a few domains
Seven domains in one model
In other words, Qwen-AgentWorld is not just a general model wrapped with prompts. It is trained from the lower layers of the pipeline to predict the next state of an environment.
That gives the model a more structured understanding of environment dynamics, especially when simulating long interaction trajectories.
3. Seven Domains: Text and GUI Environments in One Model
Qwen-AgentWorld splits agent interaction scenarios into two large groups: text-based environments and GUI-based environments.
Tool calling and Model Context Protocol interactions
Search
Text
Search engine interaction and retrieval behavior
Terminal
Text
Linux terminal command execution
SWE
Text
Software engineering tasks, such as code fixes
Web
GUI
Browser and webpage interaction
OS
GUI
Desktop operating system interaction
Android
GUI
Mobile app and Android-style UI interaction
For the three GUI domains, observations are represented as renderable code rather than raw pixel frames. This lets a text-based world model cover visual environments without directly processing full image sequences.
The model was trained on more than 10 million real-world interaction trajectories across the seven domains.
4. Three-Stage Training Pipeline
Qwen-AgentWorld uses a connected three-stage training pipeline: CPT → SFT → RL.
Stage 1: CPT — Injecting Environment Knowledge
During continual pre-training, the model learns from large-scale real environment interaction trajectories. This stage embeds environment dynamics into the model weights.
The original article also mentions a turn-level information-theoretic loss mask. The goal is to identify which dialogue turns actually carry environment-state information and reduce noise from less useful turns.
Supervised fine-tuning turns next-state prediction into a chain-of-thought style reasoning pattern.
Instead of directly outputting a predicted result, the model learns to reason through why a state should change before generating the next observation.
Stage 3: RL — Refining Simulation Fidelity
The reinforcement learning stage uses hybrid reward signals, including the GSPO algorithm, to improve output quality.
The optimization focuses on:
Format correctness
Factual accuracy
Context consistency
Realism
Overall simulation quality
Emergent behaviors mentioned in the original article:
Qwen-AgentWorld reportedly shows self-correction behavior, information-leakage prevention in search scenarios, and multi-step causal reasoning for some command-output predictions.
5. Open-Source Model List
Release
Parameters
Activated Parameters
Context Length
Positioning
Qwen-AgentWorld-35B-A3B
35B
3B
256K tokens
Public, efficient open model
Qwen-AgentWorld-397B-A17B
397B
17B
Not clearly listed in the original table
Flagship benchmark model
AgentWorldBench
—
—
—
Evaluation benchmark
35B-A3B Architecture Details
Base model: Qwen3.5-35B-A3B-Base
Model type: Causal Language Model / Language World Model
Architecture style: Hybrid linear attention + MoE
Hidden dimension: 2048
Layers: 40 layers
Layer layout: repeated groups with Gated DeltaNet, Gated Attention, and MoE components
AgentWorldBench scores each model across five dimensions: Format, Factuality, Consistency, Realism, and Quality. Scores are normalized to a 0–100 scale, where higher is better.
Full Ranking by Overall Score
Model
MCP
Search
Terminal
SWE
Android
Web
OS
Overall
Qwen-AgentWorld-397B-A17B
68.24
37.82
57.73
68.49
60.20
50.98
67.89
58.71
GPT-5.4
70.10
37.26
53.69
66.29
60.00
51.80
68.58
58.25
Claude Opus 4.6
69.90
29.30
57.51
64.55
61.74
51.42
70.20
57.80
Claude Opus 4.8
54.93
35.14
59.18
64.10
61.50
54.66
66.62
56.59
Qwen-AgentWorld-35B-A3B
64.79
36.69
53.96
65.63
58.17
49.55
65.92
56.39
Claude Sonnet 4.6
70.00
28.79
56.98
64.52
58.03
50.78
63.17
56.04
Qwen3.5-397B-A17B
68.31
30.81
55.30
64.44
54.90
48.55
60.85
54.74
Gemini 3.1 Pro
59.07
30.21
52.47
59.07
61.40
52.83
66.92
54.57
DeepSeek-V4-Pro
63.27
27.61
51.26
59.44
55.17
50.32
63.70
52.97
Qwen3.5-35B-A3B
57.87
25.98
46.13
47.58
53.18
47.10
56.27
47.73
Key takeaways from the original article:
Qwen-AgentWorld-397B-A17B reaches an overall score of 58.71 and ranks first in the listed AgentWorldBench table.
Qwen-AgentWorld-35B-A3B improves by +8.66 points over the base Qwen3.5-35B-A3B model.
Practical note: Treat benchmark numbers as reference data from the official benchmark setup. Real results will depend on hardware, prompt design, serving framework, context length, and the environment being simulated.
7. Four Application Patterns and Experimental Results
The original article describes using Qwen-AgentWorld-397B-A17B for simulated RL across 4,000 out-of-distribution OpenClaw environments, then testing zero-shot generalization in new domains.
Controlled perturbations can expose weak points in an agent more effectively than standard real-environment training.
Configuration
Tool Decathlon
MCPMark
Base SFT
32.4
21.5
Sim RL without control
31.5
24.6
Sim RL with control
36.1
33.8
Improvement
+3.7
+12.3
Pattern 3: Fictional World Construction — Search Domain
The Search-domain experiment uses a fictional but self-consistent search world for training, then evaluates generalization on real search tasks.
Configuration
WideSearch F1 Item
WideSearch F1 Row
Base SFT, 35B
34.02
13.72
+ Sim RL fictional world
50.31
24.21
Improvement
+16.29
+10.49
Pattern 4: Agent Foundation Model — LWM RL Warm-Up Transfer
The article also describes LWM RL warm-up as a way to improve downstream agent performance without extra RL fine-tuning on those specific tasks.
Metric
Terminal-Bench 2.0
SWE-Bench Verified
SWE-Bench Pro
WideSearch F1
Claw-Eval
BFCL v4
Base SFT
33.25
64.47
42.18
33.38
53.60
62.29
+ LWM RL warm-up
39.55
67.86
47.42
46.17
64.88
71.25
Improvement
+6.30
+3.39
+5.24
+12.79
+11.28
+8.96
Highlight: The warm-up data comes from single-turn, non-agentic trajectories, yet the improvement transfers to more complex multi-turn tool-calling agent tasks. That suggests world-modeling knowledge can transfer beyond its original training format.
8. Quick Deployment Guide
Method 1: Deploy with SGLang
SGLang is recommended in the original article for fast serving.
Official-docs note: The current Hugging Face model card also recommends using --language-model-only with vLLM because the model architecture includes visual component definitions while the checkpoint contains language model weights. If vLLM initialization fails, try adding that flag.
Each test sample includes ground-truth observation data from real environment execution. The benchmark evaluates world-modeling ability across format, factuality, consistency, realism, and quality.
10. Fine-Tuning Suggestions
If you want to customize Qwen-AgentWorld for a specific domain, the original article recommends three common fine-tuning frameworks.
The original article includes several images related to Qwen-AgentWorld domains and benchmark results. These were kept in the relevant sections.
CSDN platform icons, promotion modules, author subscription blocks, QR codes, reward buttons, and unrelated recommendation images were removed according to the publishing requirements.
FAQ
What is Qwen-AgentWorld?
Qwen-AgentWorld is a language world model from the Qwen team. It predicts the next environment state after an agent takes an action, making it useful for agent simulation, training, and evaluation.
Is Qwen-AgentWorld the same as a normal chat model?
No. A normal chat model is mainly optimized for conversation and instruction following. Qwen-AgentWorld is trained as an environment simulator, so its main use case is predicting observations in agent interaction environments.
Which Qwen-AgentWorld model is publicly available?
Official pages list Qwen-AgentWorld-35B-A3B as the publicly released model weight. AgentWorldBench is also available as an evaluation benchmark. The larger 397B model appears in benchmark tables, but the public model release mainly points to the 35B-A3B version.
Can Qwen-AgentWorld be deployed with vLLM?
Yes. The Hugging Face model card includes a vLLM serving example. If you run into initialization issues, the official model card recommends adding --language-model-only because the checkpoint contains language model weights.
Can Qwen-AgentWorld be deployed with SGLang?
Yes. SGLang is one of the recommended serving options and can expose an OpenAI-compatible API endpoint. The model can then be called through local API requests.
Why does Qwen-AgentWorld need a long context window?
Agent environment simulation often depends on long interaction histories. A shorter context window may lose important state information, so the official guidance recommends keeping at least 128K tokens when possible.
What is AgentWorldBench used for?
AgentWorldBench is the benchmark released with Qwen-AgentWorld. It evaluates language world models across seven domains using dimensions such as format, factuality, consistency, realism, and quality.
Is Qwen-AgentWorld suitable for production use?
It can be useful for research, evaluation, simulation, and internal experiments. For production systems, you still need to evaluate latency, hardware cost, safety, prompt reliability, and whether simulated results match your real environment closely enough.
Related Tools
Qwen-AgentWorld GitHub: Official repository for Qwen-AgentWorld code, prompts, and evaluation workflow.