Evaluating Llama 3: The Impact of Quantization on Large Language Model Performance


Llama 3, an open large language model, has been evaluated under various levels of quantization using the Massive Multitask Language Understanding (MMLU) test. The model comes in two variants, 70B and 8B, and because their weights are published, quantization methods make it possible to run them on consumer hardware. The study measured how quantization affects the model’s ability to answer MMLU questions correctly. MMLU is a broad benchmark of over 14,000 multiple-choice questions across 57 categories.
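To make the metric concrete, here is a minimal sketch of how accuracy on a multiple-choice test like MMLU might be tallied, overall and per category. It is not the study's evaluation code: the function name `score_answers`, the record layout, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict


def score_answers(records):
    """Return (overall_accuracy, per_category_accuracy).

    `records` is an iterable of (category, predicted_letter, correct_letter)
    tuples. This layout is a hypothetical one chosen for the sketch.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, predicted, correct in records:
        totals[category] += 1
        # Compare answer letters case-insensitively, ignoring stray spaces.
        if predicted.strip().upper() == correct.strip().upper():
            hits[category] += 1

    per_category = {c: hits[c] / totals[c] for c in totals}
    n_total = sum(totals.values())
    overall = sum(hits.values()) / n_total if n_total else 0.0
    return overall, per_category


# Made-up sample rows, purely to show the call shape.
sample = [
    ("astronomy", "A", "A"),
    ("astronomy", "c", "C"),
    ("virology", "B", "D"),
    ("virology", "D", "D"),
]
overall, per_category = score_answers(sample)
print(f"overall accuracy: {overall:.2f}")
for name, acc in sorted(per_category.items()):
    print(f"{name}: {acc:.2f}")
```

Running the same tally for each quantized version of a model gives one accuracy figure per quantization level, which is the kind of comparison the study reports.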

The findings show that quantization, which reduces a model’s memory usage by converting parts of it to lower-precision numerical representations, does reduce the model’s correctness, but quality remains good down to a certain level of quantization. Specifically, models quantized to around 5 bits per weight (bpw) showed minimal impact on performance. The study also compared different quantization formats and found that GGUF “I-Quants” generally offered the best quality for a given file size. Quantization through the `transformers` library was slightly lower in quality, except for its 4-bit NormalFloat (nf4) format, which performed comparably.
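The basic idea of storing weights at lower precision can be illustrated with a toy round-to-nearest scheme. This is only a sketch under simplifying assumptions: it uses a single scale for one block of synthetic Gaussian weights, and it is not the algorithm behind GGUF I-Quants or nf4, which are more elaborate. The function names and the synthetic data are invented for the example.

```python
import random


def quantize_dequantize(weights, bits):
    """Symmetric round-to-nearest quantization of one block of weights.

    Each weight is mapped to a signed integer with `bits` bits, then mapped
    back to a float, so the result shows what precision was lost.
    """
    if bits < 2:
        raise ValueError("this toy scheme needs at least 2 bits")
    levels = 2 ** (bits - 1) - 1  # largest positive integer code, e.g. 7 for 4 bits
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and restored weights."""
    restored = quantize_dequantize(weights, bits)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


# Synthetic stand-in for a block of model weights (an assumption, not real data).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

for bits in (8, 5, 4, 3, 2):
    print(f"{bits} bits: mean abs error = {mean_abs_error(weights, bits):.6f}")
```

The reconstruction error grows as the bit width shrinks, which is the mechanism behind the quality loss the study measures. How much of that error actually shows up in benchmark answers is what the MMLU runs are meant to quantify.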

Interestingly, the 70B variant was less affected by quantization than the 8B variant, which suggests that the larger model is relatively sparse. The study also highlighted the limitations of the MMLU test and suggested areas for further research, particularly how quantization affects different types of tasks, such as programming versus creative writing.

This evaluation clarifies the trade-offs between model size, computational requirements, and performance, and offers guidance for deploying large language models on limited hardware.
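As a rough aid for the hardware question, the storage needed for the weights alone can be estimated from the parameter count and the bits per weight. The sketch below assumes nominal parameter counts of 8 billion and 70 billion taken from the variant names, and the function name `approx_weight_size_gib` is made up for the example. Real quantized files will differ somewhat, and running a model needs additional memory beyond the weights.

```python
def approx_weight_size_gib(n_params, bits_per_weight):
    """Approximate storage for the weights alone, in GiB (2**30 bytes)."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Nominal sizes inferred from the variant names; the exact counts are an assumption.
MODELS = {"8B": 8e9, "70B": 70e9}
BPW_LEVELS = (16.0, 8.0, 5.0, 4.0)

for name, n_params in MODELS.items():
    for bpw in BPW_LEVELS:
        size = approx_weight_size_gib(n_params, bpw)
        print(f"{name} at {bpw:g} bpw: about {size:.1f} GiB")
```

Under these assumptions, moving from 16 bpw to around 5 bpw cuts the weight storage to roughly a third, which is why the quality at about 5 bpw matters so much for consumer hardware.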
Read more at GitHub…