Reinforcing Large Language Models with Retrospective Policy Optimization

Recent months have seen the rise of powerful new autonomous language agents built on top of large language models (LLMs) like GPT-3, GPT-4, or LLaMA. These agents can perform multi-step reasoning and tool use, interacting with APIs and environments to complete tasks. However, most of these agents do not actually learn from environmental rewards and feedback.

A new paper from Salesforce Research introduces Retroformer, an elegant framework that reinforces LLMs to become better learning agents. Retroformer uses a “retrospective” LLM that summarizes past failures and suggests improvements. This retrospective model is optimized using policy gradient reinforcement learning to generate better prompts and feedback for the main LLM agent.
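The mechanics can be pictured as a small control loop: the frozen actor LLM attempts the task, the retrospective model reflects on the failed attempt, the reflection is appended to the next prompt, and the retrospective model is reinforced according to how much the episode return improved. Below is a minimal Python sketch of that loop under my own assumptions; actor_llm, retrospective_llm, reinforce_update, the Trial structure, and the 4-attempt budget are hypothetical stand-ins rather than the paper's actual API, and the real system would plug in LLM calls, a search environment, and a proper policy-gradient update.

```python
# Hypothetical sketch of the actor / retrospective loop; not the paper's code.
from dataclasses import dataclass, field

@dataclass
class Trial:
    prompt: str
    actions: list = field(default_factory=list)
    reward: float = 0.0

def actor_llm(prompt: str) -> Trial:
    """Frozen base LLM: attempts the task given the current prompt."""
    return Trial(prompt=prompt)  # placeholder for the LLM + environment rollout

def retrospective_llm(trial: Trial) -> str:
    """Smaller, tunable LLM: summarizes what went wrong and what to change."""
    return "Reflection: avoid repeating the failed search query."  # placeholder

def reinforce_update(reflection: str, reward_delta: float) -> None:
    """Policy-gradient step: reinforce reflections that improved the return."""
    pass  # placeholder: scale the reflection's log-prob gradient by reward_delta

def self_improving_agent(task_prompt: str, max_attempts: int = 4) -> Trial:
    prompt = task_prompt
    prev_trial, prev_reflection, best = None, "", None
    for _ in range(max_attempts):
        trial = actor_llm(prompt)
        if best is None or trial.reward > best.reward:
            best = trial
        # Credit the reflection that shaped this attempt with the change in return.
        if prev_trial is not None:
            reinforce_update(prev_reflection, trial.reward - prev_trial.reward)
        prev_reflection = retrospective_llm(trial)
        prompt = task_prompt + "\n" + prev_reflection  # refine the prompt; base weights stay frozen
        prev_trial = trial
    return best
```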

Retroformer achieved strong results on HotPotQA, a challenging multi-hop question answering benchmark requiring complex reasoning and Wikipedia search actions. Over just 4 attempts, the success rate improved by 18% as Retroformer’s retrospective model learned to assign credit and give focused feedback. This demonstrates the power of policy optimization for iterative LLM refinement.

Retroformer provides a general framework for taking any large language model like GPT and transforming it into an agent that actively improves itself using environmental rewards, without ever fine-tuning the base LLM weights. The authors believe Retroformer is one of the first efforts to successfully apply policy gradient methods to optimize language agent prompting.
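For readers who want the underlying objective, the standard REINFORCE estimator applied to this setting would look roughly like the line below. Treating the change in episode return between consecutive attempts as the reward signal is my reading of the setup; the paper's exact objective may differ (for example, it may add a clipped-ratio scheme as in PPO).

\nabla_\theta J(\theta) \approx \mathbb{E}\big[(G_{k+1} - G_k)\,\nabla_\theta \log \pi_\theta(\text{reflection}_k \mid \text{trajectory}_k)\big]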

The results indicate that current LLMs still struggle with credit assignment – understanding the root causes behind multi-step failures. By learning to generate focused reflections, Retroformer agents act more rationally, preventing repetitive mistakes. The structured self-critiquing learned by the retrospective model leads to faster skill acquisition.

Retroformer could enable a new paradigm of large language agents that efficiently self-improve based on experience, while remaining safely constrained. Rather than full unsupervised learning, Retroformer allows gradual, measured performance gains targeting specific reasoning skills. Future work may expand the approach to optimize different agent modules or enhance other capabilities like summarization and memory. By using frozen base models and optimizing prompts, Retroformer points toward reliable and controllable augmentation of LLMs.
