New Web Agent to Navigate Real Websites

A team of researchers from Google DeepMind and the University of Tokyo has developed a new web agent system called WebAgent that can follow natural language instructions to complete tasks on real-world websites. The system combines two large language models (LLMs) – one specialized for website navigation and one for general programming – to overcome challenges such as long HTML documents and an open-ended action space.

WebAgent uses a model called HTML-T5 to plan the sub-steps needed to accomplish the overall instruction and to summarize long HTML documents into task-relevant snippets. It then feeds these snippets into Flan-U-PaLM, a 540B-parameter LLM trained on code, which generates Python programs to execute the sub-steps on the actual website.
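
This division of labor can be sketched as a simple two-stage loop. The code below is a hypothetical illustration of the pipeline's structure, not the authors' implementation: the model wrappers (`html_t5`, `flan_u_palm`) and their method names are invented placeholders.

```python
# Illustrative sketch of one WebAgent planning/acting step.
# The model objects and their methods are hypothetical placeholders.

def run_webagent_step(instruction: str, html: str, html_t5, flan_u_palm) -> str:
    """One step of the modular plan-summarize-generate pipeline."""
    # Stage 1: HTML-T5 decomposes the instruction into the next sub-step
    # and extracts the HTML snippets relevant to that sub-step.
    sub_instruction, snippets = html_t5.plan_and_summarize(instruction, html)

    # Stage 2: Flan-U-PaLM is prompted with the sub-instruction and the
    # condensed snippets and emits an executable Python program.
    prompt = f"Instruction: {sub_instruction}\nHTML:\n{snippets}\nProgram:"
    return flan_u_palm.generate(prompt)
```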

Key results:

  • Achieved a 70% success rate on real estate and social media websites, 50% higher than single-LLM approaches
  • HTML-T5 outperformed the prior best method by 15% on the MiniWoB benchmark of 56 web tasks
  • Outperformed single generalist or specialist LLMs on static HTML comprehension tasks

The modular approach lets each model play to its strengths – HTML-T5 handles instruction following and HTML structure, while Flan-U-PaLM generates the programs. HTML-T5 uses specialized local-global attention and training on HTML data to better capture document structure.
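
To make the attention pattern concrete, here is a minimal NumPy sketch of a local-plus-global attention mask. This illustrates only the masking idea; the function name and parameters are invented, and the actual model (following LongT5-style attention) constructs its global positions differently.

```python
import numpy as np

def local_global_mask(n_tokens: int, radius: int, block: int) -> np.ndarray:
    """Boolean attention mask: each token attends to a local window of
    +/- `radius` neighbors, plus the first token of every `block`-sized
    chunk, which serves as a coarse global summary position.
    (Simplified illustration, not the exact HTML-T5 mechanism.)"""
    idx = np.arange(n_tokens)
    # Local window: positions j with |i - j| <= radius are visible.
    local = np.abs(idx[:, None] - idx[None, :]) <= radius
    # Global: block-boundary positions are visible to every token.
    global_cols = (idx % block == 0)
    return local | global_cols[None, :]

mask = local_global_mask(n_tokens=16, radius=2, block=4)
print(mask.astype(int))
```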

Key actions that WebAgent can perform (a code sketch follows the list):

  • Fill out forms on websites by locating form elements such as text boxes, drop-downs, and checkboxes, and populating them.
  • Click buttons, links, tabs, and menu items to navigate between pages and sections of a website.
  • Scroll up or down on a page to bring specific elements into view.
  • Interact with search bars to look up information by entering text and submitting queries.
  • Scrape and extract information from webpages by locating relevant DOM elements.
  • Execute JavaScript code snippets to control page behavior.
  • Set the values of input elements such as date pickers, sliders, and radio buttons.
  • Upload files by locating upload fields and submitting file paths programmatically.
  • Download files from links and export web data.
  • Automate multi-page workflows by chaining together sequences of actions.
  • Summarize page content by identifying the relevant DOM elements.
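
Since the generated programs drive a real browser, many of these actions map naturally onto Selenium calls. The snippet below is a hand-written illustration of the kind of program WebAgent might emit; the URL, element names, and selectors are all hypothetical.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/search")          # hypothetical page

# Fill a text box and submit a search query.
box = driver.find_element(By.NAME, "q")           # hypothetical field name
box.send_keys("2-bedroom apartments")
box.submit()

# Scroll a link into view, then click it to navigate.
link = driver.find_element(By.LINK_TEXT, "Next")  # hypothetical link text
driver.execute_script("arguments[0].scrollIntoView();", link)
link.click()

# Extract text from relevant DOM elements.
results = driver.find_elements(By.CSS_SELECTOR, ".result-title")
print([r.text for r in results])

driver.quit()
```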

Broader Impact:

This work could enable more capable web agents that assist people with complex online tasks. The modular design is also more scalable, since additional expert models can be plugged in. And code generation provides an open-ended action space beyond a fixed set of predefined actions.

However, security and misuse remain a concern if such agents are deployed autonomously without human supervision. More research is still needed to ensure robust and safe web navigation across the diversity of real-world websites.

The specialized HTML-T5 model exemplifies how inductive biases can make LLMs better suited for particular domains, an area likely to grow. This could reduce the need for massive general models.

Overall, the work demonstrates how combining modular LLMs with complementary skills and training can achieve better performance on complex real-world tasks. As LLMs advance, finding the right decompositions and specializations will be key to realizing their full potential.
