Training Code Models to Follow Instructions with OctoPack

A new paper titled “OctoPack: Instruction Tuning Code Large Language Models” proposes a novel method for improving the capabilities of code-generating AI systems by leveraging a massive dataset of GitHub commits.

The researchers compiled CommitPack, a dataset containing close to 4 terabytes of GitHub commits across 350 programming languages. They then filtered this data down to 2 gigabytes of high-quality commits with informative messages, creating CommitPackFT. These commit messages serve as natural instructions, explaining the purpose of code changes.

CommitPack dataset
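
To make this concrete, here is a minimal sketch of how a CommitPackFT record could be turned into an instruction-tuning example. The dataset name and field names below follow the public Hugging Face release, but the exact schema and prompt template are assumptions on my part, not the authors' exact pipeline:

```python
from datasets import load_dataset

# Stream one record from the released CommitPackFT data
# (dataset name and field names are assumptions based on the public release).
ds = load_dataset("bigcode/commitpackft", "python", split="train", streaming=True)
record = next(iter(ds))

# The commit message becomes the instruction; the pre-commit file is the
# input and the post-commit file is the target output.
sample = {
    "instruction": record["subject"],   # commit message, e.g. "Fix off-by-one error"
    "input": record["old_contents"],    # file contents before the commit
    "output": record["new_contents"],   # file contents after the commit
}

# One possible prompt template for supervised fine-tuning.
prompt = f"{sample['input']}\n\nQuestion: {sample['instruction']}\n\nAnswer:\n"
target = sample["output"]
```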

The researchers show that “instruction tuning” code large language models such as StarCoder and CodeGeeX2 on CommitPackFT leads to significant performance gains. Their best model, OctoCoder, improves by 23% on average across 3 code intelligence tasks compared to its base model StarCoder, and it outperforms all other openly licensed code LLMs on code fixing, code explanation and code synthesis.
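
As a quick illustration of what the tuned model looks like in practice, here is a minimal sketch of prompting the released OctoCoder checkpoint with Hugging Face transformers. The Question/Answer template follows the model card; the generation settings are just illustrative:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "bigcode/octocoder"  # released OctoCoder checkpoint on the Hub
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# OctoCoder was tuned on instruction data, so it expects a
# Question/Answer style prompt rather than raw code completion.
prompt = "Question: Write a Python function that checks whether a string is a palindrome.\n\nAnswer:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```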

To benchmark model performance, the researchers expanded the popular Python code synthesis benchmark HumanEval into HumanEvalPack. Their new benchmark covers 3 scenarios – code fixing, code explanation and code synthesis – across 6 programming languages: Python, JavaScript, Java, Go, C++ and Rust.

HumanEvalPack benchmark
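
For anyone who wants to inspect the benchmark, here is a minimal sketch of loading its Python split from the Hugging Face Hub; the config and field names reflect the public release, but treat them as assumptions:

```python
from datasets import load_dataset

# HumanEvalPack ships one config per language; load the Python split.
tasks = load_dataset("bigcode/humanevalpack", "python", split="test")
example = tasks[0]

# In the code-fixing scenario, each task seeds a bug into the reference
# solution; a model must repair it so the hidden unit tests pass.
print(example["declaration"])     # function signature and docstring
print(example["buggy_solution"])  # solution with the seeded bug
print(example["test"])            # unit tests used for pass@k scoring
```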

On this more comprehensive benchmark, OctoCoder even surpasses some non-permissively licensed models like InstructCodeT5+. However, it still falls short of models like GPT-4, highlighting room for improvement. GPT-4 achieves near-perfect scores on Python code synthesis, saturating the original HumanEval, but it struggles more on the new explanation and fixing tasks proposed in this work.

Overall, this paper demonstrates that Git commit data is a valuable resource for improving code LLMs. The proposed method of instruction tuning on commits could make these models far more useful for real-world coding applications. The authors also contribute two open resources – CommitPackFT as instruction data and HumanEvalPack as a benchmark.

The code, data and models have all been open-sourced, so this approach can be adopted by any researchers or companies working on coding assistants and auto-completion tools powered by large language models. More capable and steerable AI systems that generate, modify and explain code could lead to huge boosts in programmer productivity.
