Training a 1 Trillion Parameter Model With PyTorch Fully Sharded Data Parallel on AWS

Training scales linearly as the number of GPUs increases from 8 to 512.
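
To make the scaling claim concrete, here is a minimal sketch of how scaling efficiency is commonly computed, assuming the usual definition: measured speedup over the baseline divided by the ideal (linear) speedup. The function name and the throughput figures are hypothetical and for illustration only; they are not results from the article.

```python
def scaling_efficiency(base_gpus, base_throughput, gpus, throughput):
    """Return measured speedup divided by ideal (linear) speedup.

    A value of 1.0 means perfectly linear scaling relative to the baseline.
    """
    ideal_speedup = gpus / base_gpus                  # e.g. 512 / 8 = 64x
    measured_speedup = throughput / base_throughput   # observed gain over baseline
    return measured_speedup / ideal_speedup


# Hypothetical throughput values (samples per second), NOT taken from the article.
linear = scaling_efficiency(8, 100.0, 512, 6400.0)     # 64x GPUs, 64x throughput
sublinear = scaling_efficiency(8, 100.0, 512, 5120.0)  # 64x GPUs, 51.2x throughput

print(f"linear case:    {linear:.2f}")
print(f"sublinear case: {sublinear:.2f}")
```

Under this definition, linear scaling from 8 to 512 GPUs means the 512-GPU run delivers about 64 times the throughput of the 8-GPU run.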
Read more at Medium…