Reinforcing Large Language Models with Retrospective Policy Optimization

Recent months have seen the rise of powerful new autonomous language agents built on top of large language models (LLMs) like GPT-3, GPT-4, or LLaMA. These agents can perform multi-step reasoning and tool use, interacting with APIs and environments to complete tasks. However, most of these agents do not actually learn from environmental rewards and feedback.

A new paper from Salesforce Research introduces Retroformer, an elegant framework that reinforces LLMs to become better learning agents. Retroformer uses a “retrospective” LLM that summarizes past failures and suggests improvements. This retrospective model is optimized using policy gradient reinforcement learning to generate better prompts and feedback for the main LLM agent.
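The mechanics can be pictured as a small control loop: the frozen actor LLM attempts the task, the retrospective model reflects on the failed attempt, the reflection is appended to the next prompt, and the retrospective model is reinforced according to how much the episode return improved. Below is a minimal Python sketch of that loop under my own assumptions; actor_llm, retrospective_llm, reinforce_update, the Trial structure, and the 4-attempt budget are hypothetical stand-ins rather than the paper's actual API, and the real system would plug in LLM calls, a search environment, and a proper policy-gradient update.

```python
# Hypothetical sketch of the actor / retrospective loop; not the paper's code.
from dataclasses import dataclass, field

@dataclass
class Trial:
    prompt: str
    actions: list = field(default_factory=list)
    reward: float = 0.0

def actor_llm(prompt: str) -> Trial:
    """Frozen base LLM: attempts the task given the current prompt."""
    return Trial(prompt=prompt)  # placeholder for the LLM + environment rollout

def retrospective_llm(trial: Trial) -> str:
    """Smaller, tunable LLM: summarizes what went wrong and what to change."""
    return "Reflection: avoid repeating the failed search query."  # placeholder

def reinforce_update(reflection: str, reward_delta: float) -> None:
    """Policy-gradient step: reinforce reflections that improved the return."""
    pass  # placeholder: scale the reflection's log-prob gradient by reward_delta

def self_improving_agent(task_prompt: str, max_attempts: int = 4) -> Trial:
    prompt = task_prompt
    prev_trial, prev_reflection, best = None, "", None
    for _ in range(max_attempts):
        trial = actor_llm(prompt)
        if best is None or trial.reward > best.reward:
            best = trial
        # Credit the reflection that shaped this attempt with the change in return.
        if prev_trial is not None:
            reinforce_update(prev_reflection, trial.reward - prev_trial.reward)
        prev_reflection = retrospective_llm(trial)
        prompt = task_prompt + "\n" + prev_reflection  # refine the prompt; base weights stay frozen
        prev_trial = trial
    return best
```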

Retroformer achieved strong results on HotPotQA, a challenging multi-hop question answering benchmark requiring complex reasoning and Wikipedia search actions. Over just 4 attempts, the success rate improved by 18% as Retroformer’s retrospective model learned to assign credit and give focused feedback. This demonstrates the power of policy optimization for iterative LLM refinement.

Retroformer provides a general framework for taking any large language model like GPT and transforming it into an agent that actively improves itself using environmental rewards, without ever fine-tuning the base LLM weights. The authors believe Retroformer is one of the first efforts to successfully apply policy gradient methods to optimize language agent prompting.
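For readers who want the underlying objective, the standard REINFORCE estimator applied to this setting would look roughly like the line below. Treating the change in episode return between consecutive attempts as the reward signal is my reading of the setup; the paper's exact objective may differ (for example, it may add a clipped-ratio scheme as in PPO).

\nabla_\theta J(\theta) \approx \mathbb{E}\big[(G_{k+1} - G_k)\,\nabla_\theta \log \pi_\theta(\text{reflection}_k \mid \text{trajectory}_k)\big]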

The results indicate that current LLMs still struggle with credit assignment – understanding the root causes behind multi-step failures. By learning to generate focused reflections, Retroformer agents act more rationally, preventing repetitive mistakes. The structured self-critiquing learned by the retrospective model leads to faster skill acquisition.

Retroformer could enable a new paradigm of large language agents that efficiently self-improve based on experience, while remaining safely constrained. Rather than full unsupervised learning, Retroformer allows gradual, measured performance gains targeting specific reasoning skills. Future work may expand the approach to optimize different agent modules or enhance other capabilities like summarization and memory. By using frozen base models and optimizing prompts, Retroformer points toward reliable and controllable augmentation of LLMs.
