Rethinking Calibration for More Robust Large Language Models

Large language models (LLMs) like GPT-3 have shown impressive capabilities when prompted with instructions or given a few examples to learn from (known as in-context learning). However, recent research has shown that LLMs’ predictions can be sensitive to, and biased by, small details of the prompt, such as the template wording and the choice and order of the in-context examples.

To address this “prompt brittleness”, researchers from Google and the University of Cambridge proposed a calibration technique called Batch Calibration (BC) in a recent paper. BC aims to remove biases that come from the prompt and example context itself rather than from the input being classified.

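The key idea is to estimate the contextual bias for each class by averaging the LLM’s predicted class scores over all samples in a batch. This estimated bias is then subtracted from each sample’s scores to get calibrated predictions. BC is computationally lightweight: it works at inference time, and the bias estimate reuses the predictions the model has already produced for the batch rather than requiring extra forward passes.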
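The averaging-and-subtracting step can be sketched in a few lines of Python. This is a minimal illustration, not the authors’ implementation: the function name `batch_calibrate`, the choice to work on per-class log-probabilities, and the toy numbers are all assumptions made for the example.

```python
import math


def batch_calibrate(log_probs):
    """Subtract the per-class batch mean from each sample's class scores.

    log_probs: list of N rows, each a list of C per-class log-probabilities
    (working in log space is an assumption of this sketch).
    Returns (calibrated_scores, predicted_class_indices).
    """
    n = len(log_probs)
    num_classes = len(log_probs[0])
    # Estimated contextual bias: the mean score of each class over the batch.
    bias = [sum(row[j] for row in log_probs) / n for j in range(num_classes)]
    # Calibrated scores: each sample's scores minus the estimated bias.
    calibrated = [[row[j] - bias[j] for j in range(num_classes)]
                  for row in log_probs]
    # Predict the class with the highest calibrated score.
    preds = [max(range(num_classes), key=lambda j: row[j])
             for row in calibrated]
    return calibrated, preds


# Hypothetical batch of four inputs and two classes. The made-up numbers
# mimic a prompt that pushes the model toward class 0 on every input.
probs = [
    [0.70, 0.30],
    [0.60, 0.40],
    [0.90, 0.10],
    [0.55, 0.45],
]
scores = [[math.log(p) for p in row] for row in probs]

raw_preds = [max(range(2), key=lambda j: row[j]) for row in scores]
calibrated, preds = batch_calibrate(scores)
print("raw predictions:       ", raw_preds)
print("calibrated predictions:", preds)
```

In this toy batch the uncalibrated model picks class 0 every time. After the per-class mean is removed, only the input whose score for class 0 is well above the batch average keeps that label. The estimate depends on the batch being reasonably representative of the inputs the model will see.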

Experiments across 13 text classification tasks showed that BC consistently outperforms prior calibration methods, including Contextual Calibration and Prototypical Calibration. On average, BC boosted the performance of PaLM 2-S and PaLM 2-L by 6-8% in the challenging 1-shot learning setup.

The researchers demonstrated that BC remains effective when the in-context examples or prompt templates are varied, which makes it more robust to prompt design choices. This is a desirable property for real-world deployment of LLMs, where prompt engineering can be difficult and error-prone.

BC improves zero-shot (ZS) image classification: accuracy (%) on image classification tasks with the zero-shot CLIP ViT-16/B

BC was also extended to calibrate vision-language models like CLIP in zero-shot image classification. Again, substantial gains were achieved, suggesting the method applies beyond text-only models.

Overall, BC represents an important step towards more stable and reliable LLMs. By removing unwanted biases introduced by the input context, it lets predictions reflect the input being classified rather than artifacts of the prompt.

The simple and generalizable nature of BC makes it a promising calibration technique. Possible future directions include extending it to conditional text generation and integrating it into optimization-based prompt search methods.
