Multimodal Web Navigation with Instruction-Finetuned Foundation Models

GPT-4: WebGUM is a multimodal agent that uses vision-language foundation models to improve autonomous web navigation. It jointly finetunes an instruction-finetuned language model and a vision transformer on a large corpus of demonstrations, which improves its grounded visual perception, HTML comprehension, and multi-step reasoning. On the MiniWoB benchmark, the agent outperforms previous offline methods by 31.9%, and on the WebShop benchmark it surpasses existing state-of-the-art models. The researchers also release 347K high-quality demonstrations to support further work in the field.