Alibaba dropped Qwen3.5 today, timed almost to the hour before China’s Lunar New Year holiday. The press release framing (“agentic AI era”, “new benchmark for capability per unit of inference cost”) is the usual noise. What’s underneath is actually interesting — and in a few ways, genuinely novel. Let’s get into it.
What It Actually Is
Qwen3.5-397B-A17B is a sparse Mixture-of-Experts model with 397B total parameters, 17B active per token. That active/total ratio (~4.3%) is unusually lean; for comparison, Mixtral 8x7B activates roughly 13B of its 47B total (about 28%). The practical implication: per-token compute and bandwidth are closer to a 17B dense model than a 397B one, even though you still hold all 397B parameters in memory, and that total count gives it a much larger effective knowledge capacity.
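The sparsity gap is easier to see as a quick back-of-the-envelope calculation (the Mixtral numbers are the commonly cited ones, not from this release):

```python
# Fraction of parameters actually touched per decoded token.
qwen_active, qwen_total = 17e9, 397e9            # Qwen3.5-397B-A17B
mixtral_active, mixtral_total = 12.9e9, 46.7e9   # commonly cited Mixtral 8x7B figures

print(f"Qwen3.5: {qwen_active / qwen_total:.1%} active per token")      # ~4.3%
print(f"Mixtral: {mixtral_active / mixtral_total:.1%} active per token")  # ~27.6%
```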
It’s also a native multimodal model, meaning vision capability was baked in from pretraining, not bolted on via an adapter. This matters more than it sounds: early-fusion training means the model actually reasons across modalities rather than translating between them. The benchmark numbers on visual math (MathVision: 88.6, best in its comparison set) suggest this approach pays off.
The Architecture Is Worth Paying Attention To
The interesting piece is Gated Delta Networks (GDN) replacing standard attention in most layers. The 60-layer stack is structured as repeating blocks:
3× (GatedDeltaNet → MoE) → 1× (GatedAttention → MoE)
Only 1 in every 4 token-mixing layers uses full quadratic attention. The rest use linear attention via GDN, a state-based recurrence that scales near-linearly with sequence length. This is not a gimmick. It's why they can credibly claim 8.6×–19× higher decoding throughput vs Qwen3-Max while running competitive benchmarks.
The MoE layer uses 512 experts, activating 10 routed + 1 shared per token — a much larger expert pool than typical MoE designs, keeping individual expert size small (intermediate dim: 1024) for cache efficiency. Built-in Multi-Token Prediction (MTP) enables speculative decoding out of the box, stacking further throughput gains on top of the GDN speedup.
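A rough sketch of the stack as described above; the module names and composition here are illustrative placeholders, not lifted from the released code:

```python
# Illustrative reconstruction of the 60-layer hybrid pattern described above.
N_LAYERS = 60

def build_stack():
    layers = []
    for i in range(N_LAYERS):
        # Every 4th block uses full gated (quadratic) attention; the other 3 use
        # GatedDeltaNet, a linear-time recurrent token mixer.
        mixer = "GatedAttention" if (i + 1) % 4 == 0 else "GatedDeltaNet"
        # Each token mixer is followed by a sparse MoE FFN:
        # 512 experts, 10 routed + 1 shared active per token, expert dim 1024.
        layers.append((mixer, "MoE(512 experts, 10 routed + 1 shared)"))
    return layers

stack = build_stack()
assert sum(m == "GatedAttention" for m, _ in stack) == 15  # only 1 in 4 mixers is quadratic
```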
Context Length: Native 256K, Extensible to 1M
The model natively handles 262,144 tokens. With YaRN RoPE scaling, that extends to 1,010,000 tokens. The hosted Qwen3.5-Plus on Alibaba Cloud uses 1M by default.
Two practical caveats worth knowing upfront:
The Jinja2 chat template strips thinking tokens from conversation history in multi-turn sessions. If you're rolling your own inference stack without using the template directly, implement this stripping yourself; otherwise you'll bloat context with chain-of-thought tokens that should be ephemeral. A minimal sketch follows below these caveats.
They also recommend keeping context above 128K even if you don’t need it, because the model’s thinking capability degrades at shorter windows. This is a consequence of training on long-horizon tasks and worth knowing if you’re running memory-constrained deployments.
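For the first caveat, a minimal stripping sketch, assuming the standard <think>...</think> delimiters and an OpenAI-style message list:

```python
import re

# Chain-of-thought is ephemeral: drop it from prior assistant turns before re-sending history.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(messages: list[dict]) -> list[dict]:
    cleaned = []
    for msg in messages:
        if msg.get("role") == "assistant":
            msg = {**msg, "content": THINK_RE.sub("", msg["content"])}
        cleaned.append(msg)
    return cleaned
```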
Benchmarks: Where It’s Strong, Where It Isn’t
Genuinely impressive:
- Instruction following (IFBench: 76.5, MultiChallenge: 67.6) — best in the comparison set, beating GPT-5.2 and Claude Opus 4.5. This is one of the most practically important capabilities for agentic use cases.
- BrowseComp (78.6) — web search agent tasks, best in class.
- Multilingual (NOVA-63: 59.1) — 201-language coverage with a 250K-token vocabulary; it outperforms everything else in the multilingual evals.
- Visual math (MathVision: 88.6) — early-fusion clearly pays off.
Where it trails:
- HLE (Humanity’s Last Exam: 28.7) — hardest open-ended STEM. GPT-5.2 scores 35.5, Gemini-3 Pro 37.5. A real gap.
- Long context (LongBench v2: 63.2) — competitive but not a leader; Claude Opus 4.5 hits 64.4, Gemini-3 Pro 68.2.
- Reasoning ceiling (AIME26: 91.3) — trails GPT-5.2 (96.7).
The honest picture: Qwen3.5 is a particularly strong instruction-follower and multilingual agent that is competitive with but not uniformly better than frontier closed models. Its efficiency advantage is where it really separates itself.
Thinking Mode: On by Default, Toggleable
The model outputs <think>...</think> chains before responding by default. Unlike Qwen3, which had the soft /think and /no_think prompt switches, Qwen3.5 drops them; you control thinking via API parameters:
Thinking mode (default):

```
temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0
```

Non-thinking mode:

```
temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5
extra_body={"chat_template_kwargs": {"enable_thinking": False}}
```
The presence_penalty=1.5 in non-thinking mode specifically counteracts the model’s tendency toward repetition without chain-of-thought. Know this before you tune sampling parameters yourself.
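Against an OpenAI-compatible endpoint (both serving stacks below expose one), a non-thinking call looks roughly like this; the base URL is a placeholder, and top_k plus the template kwarg ride along via extra_body:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=[{"role": "user", "content": "Summarize this release in one sentence."}],
    # Recommended non-thinking sampling from above.
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print(resp.choices[0].message.content)
```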
Running It: Hardware Reality Check
The full 397B model in BF16 is ~794GB of weights alone, which already overflows a single 8× H100 80GB node (640GB); realistically you're looking at 16× H100 for comfortable headroom. This is not a laptop model.
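The weights-only arithmetic, before KV cache and activations make it worse:

```python
# Weights-only memory estimate; KV cache and activation memory come on top.
params = 397e9
bf16_gb = params * 2 / 1e9   # 2 bytes per parameter -> ~794 GB
node_gb = 8 * 80             # one 8x H100 80GB node -> 640 GB

print(f"BF16 weights: {bf16_gb:.0f} GB vs {node_gb} GB per 8x H100 node")
# The BF16 checkpoint alone overflows a single node; plan for two nodes or a quantized variant.
```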
Practical options:
- Qwen3.5-Plus on Alibaba Cloud — hosted, 1M context, production API.
- Community quantizations — GGUF/AWQ versions are already appearing on HuggingFace. Q4_K_M lands around 200GB, which fits across 3× H100s and is borderline for a single 192GB MI300X.
- Wait for smaller variants — the 397B is “the first in the Qwen3.5 series.” Smaller models are presumably coming.
Recommended serving stack: SGLang from main or vLLM nightly — both have Qwen3.5-specific support merged.
```bash
# SGLang: best throughput, with MTP speculative decoding enabled
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B --tp-size 8 \
  --mem-fraction-static 0.8 --context-length 262144 \
  --reasoning-parser qwen3 \
  --speculative-algo NEXTN --speculative-num-steps 3 \
  --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
```

```bash
# vLLM: the text-only flag frees KV cache if you don't need vision
vllm serve Qwen/Qwen3.5-397B-A17B \
  --tensor-parallel-size 8 --max-model-len 262144 \
  --reasoning-parser qwen3 --language-model-only
```
For 1M context via YaRN on vLLM:
```bash
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen/Qwen3.5-397B-A17B \
  --hf-overrides '{"text_config": {"rope_parameters": {
      "rope_type": "yarn", "factor": 4.0,
      "original_max_position_embeddings": 262144,
      "mrope_interleaved": true, "mrope_section": [11, 11, 10],
      "rope_theta": 10000000, "partial_rotary_factor": 0.25}}}' \
  --max-model-len 1010000
```
Note: YaRN uses a static scaling factor, which can hurt performance on shorter inputs. Only apply it when you actually need million-token contexts.
What to Test First
Based on where the model actually differentiates:
- Constraint-heavy instruction prompts — IFBench and MultiChallenge wins are most relevant for agentic reliability. Test deeply nested conditionals before trusting it in an agentic loop.
- Non-English workloads — run your domain data through it before defaulting to an English-only fine-tune. The tokenizer efficiency gain alone (10–60% fewer tokens for non-Latin scripts) compounds; a quick way to check is sketched after this list.
- Mixed-content documents — tables, formulas, charts together. OmniDocBench 1.5 at 90.8 is best in class.
- Long-context agentic loops — 256K native + MTP + instruction-following strength makes this a plausible backbone for agents that accumulate tool call history.
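On the tokenizer point, a quick way to sanity-check the claim on your own corpus; gpt2 here is just an arbitrary ungated baseline, so swap in whatever tokenizer you currently deploy:

```python
from transformers import AutoTokenizer

qwen = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-397B-A17B")
base = AutoTokenizer.from_pretrained("gpt2")  # arbitrary baseline for comparison

samples = ["把你的领域文本放在这里。", "ضع نصوص مجالك هنا."]  # replace with your domain data

q = sum(len(qwen(t)["input_ids"]) for t in samples)
b = sum(len(base(t)["input_ids"]) for t in samples)
print(f"Qwen3.5 tokenizer: {q} tokens vs baseline {b} ({1 - q / b:.0%} fewer)")
```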
Bottom Line
Qwen3.5-397B-A17B is a technically coherent release with real architectural novelty. The Gated Delta Network / sparse MoE hybrid is not marketing; it's why the throughput numbers are credible. The efficiency story (60% cheaper, ~8.6–19× faster decoding vs Qwen3-Max) is the headline that actually matters for production deployment decisions.
It is not uniformly the best model. It trades raw reasoning ceiling for better inference economics and multilingual breadth. For most production agentic workloads — long contexts, tool use, multilingual — that’s probably the right trade. For pure research-grade reasoning, you’ll still want a model optimized specifically for that.
Apache 2.0 on a competitive frontier-class model is significant. The quantized variants landing over the next few weeks will be the real test of how broadly this gets adopted outside large infrastructure shops.
Weights: huggingface.co/Qwen/Qwen3.5-397B-A17B · Hosted: Alibaba Cloud Model Studio · Agent lib: Qwen-Agent · CLI: qwen-code