Scaling Up Language Models with Agent Ensembles

A new study reveals that simply increasing the number of agents in an ensemble can boost the performance of large language models (LLMs) across a variety of tasks.

Researchers from Tencent conducted comprehensive experiments evaluating different ensemble sizes on benchmarks for reasoning, generation, and other capabilities. They found that using their simple “sampling-and-voting” technique to instantiate multiple agents consistently improved results for various LLMs, including Meta’s 13B- and 70B-parameter Llama2-Chat models and OpenAI’s GPT-3.5 Turbo.

Remarkably, smaller models like Llama2-13B could match or even surpass the performance of much larger models like GPT-3.5 Turbo by scaling the ensemble up to 15-20 agents. The gains were especially pronounced on difficult reasoning tasks, with accuracy improvements of 6-24% on math problems and 1-11% on general reasoning.

Figure: accuracy increases with ensemble size for Llama2-13B, Llama2-70B, and GPT-3.5-Turbo on GSM8K.

For open-ended generation tasks such as code generation, the researchers used BLEU scores to quantify the similarity among the agents’ generated texts. The BLEU metric measures n-gram overlap between one agent’s text and each of the other agents’ texts. The agent whose text has the highest average BLEU score against all the others is deemed the “winner” for that round of voting, and its text is taken as the consensus response most representative of the ensemble. So in tasks where agents generate free-form text, voting selects the response most similar to the full set of responses, rather than simply picking the most frequent word-for-word output.
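The similarity-based vote can be sketched as follows. This is a minimal illustration, not the paper’s code: `bleu` is a simplified sentence-level BLEU (clipped n-gram precision with a brevity penalty, no smoothing), and `similarity_vote` is a hypothetical helper that returns the index of the response with the highest mean BLEU against all the others.

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    c_tokens, r_tokens = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(c_tokens[i:i + n]) for i in range(len(c_tokens) - n + 1))
        r_ngrams = Counter(tuple(r_tokens[i:i + n]) for i in range(len(r_tokens) - n + 1))
        overlap = sum((c_ngrams & r_ngrams).values())  # clipped matches
        precisions.append(overlap / max(sum(c_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0  # no smoothing: any empty precision zeroes the score
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = 1.0 if len(c_tokens) > len(r_tokens) else math.exp(1 - len(r_tokens) / max(len(c_tokens), 1))
    return bp * math.exp(log_avg)

def similarity_vote(responses):
    """Return the index of the response with the highest average
    BLEU score against every other response in the ensemble."""
    def avg_bleu(i):
        others = [r for j, r in enumerate(responses) if j != i]
        return sum(bleu(responses[i], o) for o in others) / len(others)
    return max(range(len(responses)), key=avg_bleu)
```

With three sampled completions where two agree, `similarity_vote` picks one of the agreeing pair, since each scores a perfect BLEU of 1.0 against its duplicate.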

The researchers posit that larger agent ensembles help overcome errors and inconsistencies when models attempt complex multi-step reasoning. Each agent produces a potentially different response, and majority voting selects the most common coherent answer.
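For tasks with a single final answer, the sampling-and-voting loop reduces to a majority vote over repeated samples. A minimal sketch, assuming a `generate` callable that wraps one sampled LLM call (the `toy_generate` stand-in below is purely illustrative):

```python
from collections import Counter

def sample_and_vote(generate, prompt, n_agents=10):
    """Sampling-and-voting sketch: query the model n_agents times,
    then return the most frequent final answer (majority vote)."""
    answers = [generate(prompt) for _ in range(n_agents)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for a sampled LLM call; a real `generate` would hit
# an API with nonzero temperature so agents can disagree.
samples = iter(["42", "41", "42", "42", "43"])
toy_generate = lambda prompt: next(samples)

print(sample_and_vote(toy_generate, "What is 6 * 7?", n_agents=5))  # → 42
```

Because occasional reasoning slips are largely uncorrelated across samples, the correct answer tends to dominate the vote even when any single sample is unreliable.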

They also found ensemble methods compatible with other techniques like chain-of-thought prompting and agent debate frameworks, further boosting results. However, standalone ensembles proved highly competitive, achieving top accuracy with no extra prompting or debate training.

Analyzing task difficulty, the study uncovered how gains correlate with inherent problem complexity, reasoning steps, and solution probability. These insights led to customized sampling optimizations tailored to task properties.

The simple yet effective ensemble method could make deploying LLMs more affordable. Combining multiple smaller models may provide a lower-cost alternative to gigantic single models with billions of parameters.

If the approach scales up in practice, it could enable wider access to powerful AI assistants, reasoning tools, and generators. The researchers aim to reduce the computational expenses of ensembling in future work. But for now, their results suggest that when it comes to improving LLMs, more agents is all you need.
