MathCoders: Enhancing Mathematical Reasoning of Open-Source Language Models

A group of researchers from The Chinese University of Hong Kong, Shanghai Artificial Intelligence Laboratory, and City University of Hong Kong have developed a new method to significantly improve the mathematical reasoning capabilities of open-source large language models (LLMs) like Llama and CodeLlama.

In their paper “MATHCODER: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning”, the researchers introduce MathCoder, a framework that includes a novel math instruction-following dataset called MathCodeInstruct and a customized supervised fine-tuning approach.

The process of dataset creation and model fine-tuning. (a) First, solutions for problems in the GSM8K and MATH datasets are collected from GPT-4. Then, the CodeLlama-34B model is fine-tuned on this data, producing MathCoder-Initial. New problems are created using a novel prompt, and their solutions are generated using MathCoder-Initial. (b) Finally, the new problems and solutions are combined with the existing training data to create the final dataset, which is used to fine-tune the base Llama-2 model, producing the final MathCoder model.

MathCodeInstruct contains over 80,000 math problems paired with solutions that interleave natural language, code, and execution results in a format the researchers call LCE (Language, Code, Execution). The key highlights of this dataset are:

Solutions are collected from the powerful closed-source model GPT-4 Code Interpreter, ensuring high quality
Additional problems are generated using a prompting technique called “problem interpolation”, which creates problems of intermediate difficulty between basic and advanced math questions
Multiple LCE solutions are distilled for each generated problem to further validate quality
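
To make the LCE format concrete, here is a minimal sketch of how one interleaved solution might be represented and flattened into a single training string. The `Block` class, the `render` function, and the angle-bracket tags are illustrative assumptions, not the delimiters or data schema used in the paper.

```python
from dataclasses import dataclass
from typing import List, Literal

# Hypothetical block kinds for an LCE solution; the names are illustrative.
Kind = Literal["language", "code", "execution"]


@dataclass
class Block:
    kind: Kind
    content: str


def render(blocks: List[Block]) -> str:
    """Flatten an LCE solution into one string, wrapping each block in made-up tags."""
    return "\n".join(f"<{b.kind}>\n{b.content}\n</{b.kind}>" for b in blocks)


# A toy solution: reasoning, then code, then the code's output, then the conclusion.
solution = [
    Block("language", "Compute the sum of the first 100 positive integers."),
    Block("code", "print(sum(range(1, 101)))"),
    Block("execution", "5050"),
    Block("language", "The answer is 5050."),
]

text = render(solution)
print(text)
```

The point of the interleaving is that the final natural-language block can refer to a value that was computed rather than guessed.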

The supervised fine-tuning approach trains models like Llama and CodeLlama on this dataset so that, at inference time, the code blocks they generate are executed in real time. This allows the model to assess the execution results and continue reasoning accordingly, similar to how the GPT-4 Code Interpreter operates.
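
The generate, execute, and continue loop described above can be sketched roughly as follows. This is a toy illustration under stated assumptions: `toy_generate` is a scripted stand-in for a fine-tuned model, the dictionary-based block format is invented for the example, and `exec` is used without any sandboxing, which a real system would need.

```python
import contextlib
import io
from typing import Callable, List


def run_code(code: str, env: dict) -> str:
    """Execute a code block and return what it printed, or the error message."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, env)  # unsandboxed: acceptable only for a toy demo
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buf.getvalue().strip()


def solve(generate: Callable[[List[dict]], dict], question: str,
          max_steps: int = 8) -> List[dict]:
    """Alternate between model generation and code execution."""
    history = [{"kind": "question", "content": question}]
    env: dict = {}  # shared across code blocks, like a notebook session
    for _ in range(max_steps):
        block = generate(history)
        history.append(block)
        if block["kind"] == "code":
            # Feed the execution result back so the next step can use it.
            result = run_code(block["content"], env)
            history.append({"kind": "execution", "content": result})
        elif block["kind"] == "answer":
            break
    return history


def toy_generate(history: List[dict]) -> dict:
    """Hypothetical stand-in for a fine-tuned model: returns scripted replies."""
    last = history[-1]
    if last["kind"] == "question":
        return {"kind": "code", "content": "print(17 * 23)"}
    if last["kind"] == "execution":
        return {"kind": "answer", "content": f"The product is {last['content']}."}
    return {"kind": "answer", "content": "Unable to solve."}


trace = solve(toy_generate, "What is 17 * 23?")
for step in trace:
    print(step["kind"], "->", step["content"])
```

Keeping one `env` dictionary across code blocks mirrors a notebook session, so variables defined in an earlier block remain available to later ones.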

Evaluation on benchmarks like GSM8K and MATH shows MathCoder models substantially outperforming other open-source methods for math problem solving. MathCoder achieves state-of-the-art scores among open-source LLMs of 83.9% on GSM8K and 45.2% on MATH, even surpassing proprietary models like ChatGPT-3.5 and PaLM-2 on these two benchmarks.

The key implications of this work are:

Demonstrating the possibility of integrating reasoning, coding, and execution in open-source models to enhance complex reasoning capabilities
Providing an effective framework and high-quality dataset to train performant math problem-solving models without needing massive resources
Closing the gap between open-source and closed-source LLMs on mathematical challenges

Model performance comparison for MathCoders with CodeLlama and Llama-2 as base

Possible use cases include tutoring systems, quantitative analysis, financial modeling, and scientific computing. The availability of accurate open-source math models could make AI assistance more accessible for math-heavy fields.

The researchers plan to release the dataset and models to spur progress in this domain. While math is the focus here, the principles used in MathCoder could eventually generalize to other reasoning-based tasks involving computation.
