Rethinking Calibration for More Robust Large Language Models

Large language models (LLMs) like GPT-3 have shown impressive capabilities when prompted with instructions or given a few examples to learn from (known as in-context learning). However, recent research has revealed that LLMs’ predictions can be sensitive to, and biased by, small details such as the wording of the prompt or the choice and order of the examples.

To address this “prompt brittleness”, researchers from Google and the University of Cambridge proposed a new calibration technique called Batch Calibration (BC) in a recently published paper. BC aims to remove biases that come from the prompt/example context itself.

The key idea is to estimate the contextual bias for each class by averaging the LLM’s predicted scores over all samples in a batch. This estimated bias is then subtracted from each sample’s scores to get calibrated predictions. BC is computationally lightweight: it reuses the predictions already computed for the batch, so the additional cost is negligible.
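The averaging-and-subtracting step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes the model's outputs are available as a matrix of per-class log-probabilities, and the function name batch_calibrate and the example numbers are hypothetical.

```python
import numpy as np

def batch_calibrate(scores):
    """Subtract the per-class batch mean from every sample's scores.

    scores: array of shape (batch_size, num_classes). Assumed here to hold
    per-class log-probabilities from the model (an assumption of this sketch).
    Returns the calibrated scores and the calibrated class predictions.
    """
    scores = np.asarray(scores, dtype=float)
    # Estimated contextual bias: average score of each class over the batch.
    bias = scores.mean(axis=0, keepdims=True)
    calibrated = scores - bias
    return calibrated, calibrated.argmax(axis=1)

# Hypothetical batch of 4 inputs and 2 classes, where the prompt context
# pushes every prediction towards class 0.
scores = np.log(np.array([
    [0.70, 0.30],
    [0.60, 0.40],
    [0.90, 0.10],
    [0.55, 0.45],
]))

raw_preds = scores.argmax(axis=1)            # predictions before calibration
calibrated, preds = batch_calibrate(scores)  # predictions after calibration
print(raw_preds, preds)
```

In this toy batch the uncalibrated model picks class 0 for every input. After the batch mean is removed, only the input whose class-0 score stands out relative to the rest of the batch keeps that label.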

Illustration of Batch Calibration

Experiments across 13 text classification tasks showed BC consistently improves on prior calibration methods, including Contextual Calibration and Prototypical Calibration. On average, BC boosted the performance of the PaLM 2-S and PaLM 2-L models by 6-8% in the challenging 1-shot learning setup.

The researchers demonstrated that BC remains effective when the examples or prompt templates are varied, making it more robust. This is a very desirable property for real-world deployment of LLMs, where prompt engineering can be difficult and error-prone.

BC improves zero-shot (ZS) image classification: Accuracy (%) on image classification tasks with the zero-shot CLIP ViT-16/B

BC was also extended to calibrate vision-language models like CLIP in zero-shot image classification. Again, substantial gains were achieved, showing the broad applicability of the method.

Overall, BC represents an important step towards more stable and reliable LLMs. By removing unwanted biases introduced by the prompt context, it lets predictions reflect the actual input rather than incidental details of how the prompt was written.

The simple and generalizable nature of BC makes it a promising calibration technique. In the future, exploring extensions to conditional text generation and integrating BC into optimization-based prompt search methods could further unlock the power of large language models.
