Answer.AI – Enabling 70B Finetuning on Consumer GPUs

Answer.AI has unveiled FSDP+QLoRA, a groundbreaking open-source project that enables fine-tuning of models with up to 70 billion parameters on just two consumer-grade GPUs. The key to this capability lies in integrating Fully Sharded Data Parallel (FSDP) with quantization libraries, which traditionally store quantized weights in integer formats incompatible with FSDP's float datatype requirements. To address this, Answer.AI presents two methods: storing the quantized bytes in a float tensor via `torch.view`, a byte-preserving type conversion demonstrated by HQQ, and making the dequantization kernels datatype-agnostic, the approach taken by bitsandbytes.
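The byte-preserving trick can be sketched in a few lines. `torch.Tensor.view(dtype)` reinterprets a tensor's underlying bytes without copying or converting them, so packed integer weights can masquerade as a float tensor for FSDP's sharding machinery and be recovered bit-for-bit afterwards. The tensor below is a stand-in, not real quantizer output:

```python
import torch

# Stand-in for 4-bit weights packed into bytes by a quantizer.
packed = torch.randint(0, 256, (1024,), dtype=torch.uint8)

# Reinterpret the same storage as bfloat16 (2 bytes per element),
# giving FSDP the float dtype it expects for sharding.
as_float = packed.view(torch.bfloat16)
assert as_float.shape == (512,)

# Viewing back to uint8 recovers the identical bytes -- no numeric
# conversion ever happened, so nothing is lost.
restored = as_float.view(torch.uint8)
assert torch.equal(restored, packed)
```

Because `view` shares storage rather than casting values, this round trip is lossless, which is exactly what a quantized format needs.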

The implementation involves updating quantization libraries to support FSDP, ensuring that quantized weights are stored in float datatypes so the model can be sharded efficiently. This matters for mixed precision training: if FSDP casts the storage tensor, say from float32 to bfloat16, the packed quantized bytes are numerically converted rather than preserved, effectively randomizing the weights, so the storage dtype must match the training dtype. Additionally, bitsandbytes required further modifications to remain compatible with FSDP, such as preserving quantization metadata during sharding and preventing re-quantization of already quantized weights.
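The re-quantization guard can be sketched as follows. The class and helper names here are hypothetical, not the bitsandbytes API, but they illustrate the idea: tag a parameter with its quantization state so that FSDP's shard/regather round trips never quantize it a second time.

```python
import torch

class QuantizedParam(torch.nn.Parameter):
    """Hypothetical parameter that remembers its quantization state."""
    def __new__(cls, data, quantized=False, quant_state=None):
        self = super().__new__(cls, data, requires_grad=False)
        self.quantized = quantized
        self.quant_state = quant_state  # scales, zero points, block sizes
        return self

def maybe_quantize(param, quantize_fn):
    """Quantize exactly once; repeated calls (e.g. after FSDP regathers
    shards) return the parameter unchanged instead of corrupting it."""
    if getattr(param, "quantized", False):
        return param
    qdata, state = quantize_fn(param.data)
    return QuantizedParam(qdata, quantized=True, quant_state=state)
```

Keeping the metadata (`quant_state`) on the parameter object, off the sharded storage path, is what lets it survive FSDP's flatten/unflatten cycle.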

For training frameworks, the integration process involves managing model loading and quantization to minimize CPU memory usage and correctly setting up FSDP wrapping policies, mixed precision, and sharding settings for QLoRA fine-tuning. This includes handling LoRA layers with custom wrapping policies and ensuring that the quantized weight storage type matches the model’s non-quantized weight datatype.
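A minimal configuration sketch of those two pieces is below, using PyTorch's FSDP API. The `LoraLinear` class is a stand-in for whatever adapter module your PEFT library provides; everything else uses real `torch.distributed.fsdp` names:

```python
import functools
import torch
from torch.distributed.fsdp import MixedPrecision
from torch.distributed.fsdp.wrap import lambda_auto_wrap_policy

class LoraLinear(torch.nn.Module):
    """Stand-in for the trainable LoRA adapter class in a PEFT library."""
    def __init__(self, in_f, out_f, r=8):
        super().__init__()
        self.lora_A = torch.nn.Linear(in_f, r, bias=False)
        self.lora_B = torch.nn.Linear(r, out_f, bias=False)

# Custom policy: wrap LoRA modules into their own FSDP units so the
# trainable float adapters shard separately from the frozen base model.
wrap_policy = functools.partial(
    lambda_auto_wrap_policy,
    lambda_fn=lambda m: isinstance(m, LoraLinear),
)

# Keep every dtype at bfloat16: the quantized weights are *stored* as
# bfloat16 bytes, and an FSDP cast would scramble the packed values.
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)
```

These two objects would then be passed to the `FullyShardedDataParallel` constructor as `auto_wrap_policy` and `mixed_precision`; the point is that the storage dtype in `MixedPrecision` must agree with the dtype the quantized weights were viewed into.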

The successful implementation of FSDP+QLoRA has been incorporated into development builds of Axolotl and the Hugging Face ecosystem, marking a significant step forward in the efficient training of large-scale models on limited hardware. This advancement opens the door for broader adoption and further innovation in the field of AI model training.
Read more at Answer.AI…