When the World Becomes a Prompt: How Text in the Environment Can Hijack Embodied AI

When the World Becomes a Prompt: How Text in the Environment Can Hijack Embodied AI
Embodied AI systems are often praised for their ability to handle the messy edges of the real world. When sensors fail, maps are incomplete, or something unexpected happens in front of a robot or vehicle, these systems lean on multimodal reasoning—vision plus language plus action—to make sensible decisions. That flexibility is exactly what makes them useful.

It’s also what makes them vulnerable.

A recent paper, https://arxiv.org/html/2510.00181v1, explores a new attack class called CHAI (Command Hijacking against embodied AI) that turns a model’s ability to interpret language in images into a liability. Instead of manipulating pixels in subtle, adversarial ways, CHAI uses something far simpler: readable, natural-language instructions embedded directly into the scene.

Think signs, labels, or text that a human might ignore—but a multimodal model might treat as a command.

When perception becomes a control channel

Large Visual-Language Models do more than detect objects. They read. They interpret. They connect what they see with instructions and goals.

That’s powerful. A drone can read a sign that says “Emergency landing zone” and make a safe choice. A robotic vehicle can understand “Detour” or “Stop.”

CHAI flips that strength into an attack surface.

If a model treats text in the environment as part of its reasoning process, then an attacker can plant deceptive instructions directly in view. No need for low-level perturbations or invisible noise. Just well-placed language that the model takes seriously.

In effect, the environment itself becomes a prompt.

How CHAI works

The attack strategy is systematic rather than random:

  • deceptive natural language is embedded into visual input
  • the token space is searched to discover effective phrases
  • a dictionary of prompts is built
  • an attacker model generates what the authors call Visual Attack Prompts

Instead of guessing what wording might work, CHAI learns which phrases most reliably hijack behavior. That makes it more repeatable and scalable than ad-hoc tricks.

The idea is closer to prompt engineering than classic adversarial examples.

Tested on real robotic tasks

The experiments aren’t confined to toy benchmarks. The attack is evaluated across multiple embodied scenarios:

  • drone emergency landing
  • autonomous driving
  • aerial object tracking
  • a real robotic vehicle

Across four Large Visual-Language Model agents, CHAI consistently outperforms previous attack methods. That consistency matters: it suggests the weakness isn’t a one-off bug, but something structural to how these systems interpret multimodal inputs.

If a model is designed to read and reason over language in the world, it’s hard to simply “turn off” that capability without losing usefulness.

A different kind of adversarial threat

Traditional adversarial robustness focuses on imperceptible pixel changes—tiny perturbations that fool classifiers while remaining invisible to humans. Defenses are built around smoothing, filtering, or training against noise.

CHAI doesn’t play that game.

The attacks are semantically meaningful. They look like ordinary signs or messages. To a human observer, nothing seems suspicious. To the model, it might look like a high-priority instruction.

That makes standard defenses less relevant. You can’t just denoise text that’s meant to be read.

Why this matters

As robots and vehicles become more autonomous, they’ll rely more heavily on models that interpret rich, multimodal input. Reading signs, labels, and instructions is essential for operating in human environments.

But every capability that broadens understanding also broadens exposure.

If a system treats any visible language as potentially authoritative, attackers gain a low-cost, high-leverage way to interfere. A printed sign or sticker could become a control interface.

The paper highlights a growing tension: the smarter and more context-aware embodied AI becomes, the more creative the attack surface. Defending these systems likely requires new strategies—ones that reason about trust, intent, and the reliability of environmental language, not just pixel-level robustness.

In other words, the world itself has become part of the prompt—and that changes how we need to think about security.