Training Code Models to Follow Instructions with OctoPack

A new paper titled “OctoPack: Instruction Tuning Code Large Language Models” proposes a novel method for improving the capabilities of code-generating AI systems by leveraging a massive dataset of GitHub commits.

The researchers compiled CommitPack, a dataset containing close to 4 terabytes of GitHub commits across 350 programming languages. They then filtered this data down to 2 gigabytes of high-quality commits with informative messages, creating CommitPackFT. These commit messages serve as natural instructions, explaining the purpose of code changes.

CommitPack dataset
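
To make this concrete, here is a minimal sketch of how a CommitPackFT record could be turned into an instruction-tuning example. The dataset name and field names below follow the public Hugging Face release, but the exact schema and prompt template are assumptions on my part, not the authors' exact pipeline:

```python
from datasets import load_dataset

# Stream one record from the released CommitPackFT data
# (dataset name and field names are assumptions based on the public release).
ds = load_dataset("bigcode/commitpackft", "python", split="train", streaming=True)
record = next(iter(ds))

# The commit message becomes the instruction; the pre-commit file is the
# input and the post-commit file is the target output.
sample = {
    "instruction": record["subject"],   # commit message, e.g. "Fix off-by-one error"
    "input": record["old_contents"],    # file contents before the commit
    "output": record["new_contents"],   # file contents after the commit
}

# One possible prompt template for supervised fine-tuning.
prompt = f"{sample['input']}\n\nQuestion: {sample['instruction']}\n\nAnswer:\n"
target = sample["output"]
```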

The researchers show that “instruction tuning” code large language models such as StarCoder and CodeGeeX2 on CommitPackFT leads to significant performance gains. Their best model, OctoCoder, improves by 23% on average across 3 code intelligence tasks compared to its base model StarCoder, and it outperforms all other openly licensed code LLMs on code fixing, code explanation and code synthesis.
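
As a quick illustration of what the tuned model looks like in practice, here is a minimal sketch of prompting the released OctoCoder checkpoint with Hugging Face transformers. The Question/Answer template follows the model card; the generation settings are just illustrative:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "bigcode/octocoder"  # released OctoCoder checkpoint on the Hub
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# OctoCoder was tuned on instruction data, so it expects a
# Question/Answer style prompt rather than raw code completion.
prompt = "Question: Write a Python function that checks whether a string is a palindrome.\n\nAnswer:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```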

To benchmark model performance, the researchers expanded the popular Python code synthesis benchmark HumanEval into HumanEvalPack. Their new benchmark covers 3 scenarios – code fixing, code explanation and code synthesis – across 6 programming languages: Python, JavaScript, Java, Go, C++ and Rust.

HumanEvalPack benchmark
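
For anyone who wants to inspect the benchmark, here is a minimal sketch of loading its Python split from the Hugging Face Hub; the config and field names reflect the public release, but treat them as assumptions:

```python
from datasets import load_dataset

# HumanEvalPack ships one config per language; load the Python split.
tasks = load_dataset("bigcode/humanevalpack", "python", split="test")
example = tasks[0]

# In the code-fixing scenario, each task seeds a bug into the reference
# solution; a model must repair it so the hidden unit tests pass.
print(example["declaration"])     # function signature and docstring
print(example["buggy_solution"])  # solution with the seeded bug
print(example["test"])            # unit tests used for pass@k scoring
```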

On this more comprehensive benchmark, OctoCoder even surpasses some non-permissively licensed models like InstructCodeT5+. However, it still falls short of models like GPT-4, highlighting room for improvement. GPT-4 achieves near-perfect scores on Python code synthesis, saturating the original HumanEval, but it struggles more on the new explanation and fixing tasks proposed in this work.

Overall, this paper demonstrates that Git commit data is a valuable resource for improving code LLMs. The proposed method of instruction tuning on commits could make these models far more useful for real-world coding applications. The authors also contribute two open resources – CommitPackFT as instruction data and HumanEvalPack as a benchmark.

The code, data and models have all been open-sourced, so this approach can be adopted by any researchers or companies working on coding assistants and auto-completion tools powered by large language models. More capable and steerable AI systems that generate, modify and explain code could lead to huge boosts in programmer productivity.
