MathCoder: Enhancing Mathematical Reasoning of Open-Source Language Models

A group of researchers from The Chinese University of Hong Kong, Shanghai Artificial Intelligence Laboratory, and City University of Hong Kong have developed a new method to significantly improve the mathematical reasoning capabilities of open-source large language models (LLMs) like Llama and CodeLlama.

In their paper “MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning”, the researchers introduce MathCoder, a framework that includes a novel math instruction-following dataset called MathCodeInstruct and a customized supervised fine-tuning approach.

The process of dataset creation and model fine-tuning. (a) First, solutions for problems in the GSM8K and MATH datasets are collected from GPT-4. The CodeLlama-34B model is then fine-tuned on this data, producing MathCoder-Initial. New problems are created using a novel prompting technique, and their solutions are generated with MathCoder-Initial. (b) Finally, the new problems and solutions are combined with the existing training data to create the final dataset, which is used to fine-tune the base Llama-2 model, producing the final MathCoder model.

MathCodeInstruct contains over 80,000 math problems paired with solutions that interleave natural language, code, and execution results in a format the researchers call LCE (Language, Code, Execution). The key highlights of this dataset are:

  • Solutions are collected from the powerful closed-source model GPT-4 Code Interpreter, ensuring high quality
  • Additional problems are generated using an innovative prompting technique called “problem interpolation”, creating intermediate difficulty level problems between basic and advanced math questions
  • Multiple LCE solutions are distilled for each generated problem to further validate quality
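To make the LCE format concrete, here is a minimal sketch of how an interleaved Language-Code-Execution solution might be serialized into a single training string. The delimiter tokens and the `render_lce` helper below are illustrative assumptions, not necessarily the exact special tokens used in the paper.

```python
# Illustrative block delimiters; the paper's actual special tokens may differ.
LCE_TAGS = {"text": "<|text|>", "code": "<|code|>", "execution": "<|execution|>"}
END_TAG = "<|endofblock|>"

def render_lce(blocks):
    """Serialize a list of (kind, content) blocks into one training string."""
    parts = []
    for kind, content in blocks:
        parts.append(f"{LCE_TAGS[kind]}\n{content}\n{END_TAG}")
    return "\n".join(parts)

# A toy LCE solution: natural language, then code, then its execution result.
solution = [
    ("text", "Let x be the number of apples. We need 3*x + 2 = 14."),
    ("code", "from sympy import symbols, solve\n"
             "x = symbols('x')\n"
             "print(solve(3*x + 2 - 14, x))"),
    ("execution", "[4]"),
    ("text", "So the answer is 4 apples."),
]

print(render_lce(solution))
```

During fine-tuning, a loss mask would typically exclude the execution blocks, since those come from the interpreter rather than the model.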

The supervised fine-tuning approach trains models like Llama and CodeLlama on this dataset while executing code blocks in real-time. This allows the model to assess the execution results and continue reasoning accordingly, similar to how the GPT-4 Code Interpreter operates.
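The generate-execute-continue loop described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: `model_generate` is a hypothetical stand-in for an LLM call that stops at the end of a code block, and real systems would sandbox execution rather than call `exec` directly.

```python
import io
import contextlib

def execute_python(code: str) -> str:
    """Run a generated code block and capture its stdout.
    NOTE: no sandboxing here; a production system must isolate execution."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as e:
        return f"Error: {e}"
    return buf.getvalue().strip()

def solve_with_lce(model_generate, prompt: str, max_rounds: int = 5) -> str:
    """Alternate between model generation and code execution.
    `model_generate(transcript)` is assumed to return (text, code_or_None),
    where code_or_None is the code block the model wants executed next."""
    transcript = prompt
    for _ in range(max_rounds):
        text, code = model_generate(transcript)
        transcript += text
        if code is None:  # model produced a final answer with no more code
            break
        result = execute_python(code)
        # Feed the execution result back so the model can keep reasoning.
        transcript += f"\n[execution]\n{result}\n"
    return transcript
```

The key design point is that execution results are appended to the context before generation resumes, so the model conditions on real interpreter output instead of hallucinating it.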

Evaluation on benchmarks like GSM8K and MATH shows MathCoder models substantially outperforming other open-source methods for math problem solving. Impressively, MathCoder achieves state-of-the-art scores of 83.9% on GSM8K and 45.2% on MATH, even surpassing proprietary models like ChatGPT-3.5 and PaLM-2.

The key implications of this work are:

  • Demonstrating the possibility of integrating reasoning, coding, and execution in open-source models to enhance complex reasoning capabilities
  • Providing an effective framework and high-quality dataset to train performant math problem-solving models without needing massive resources
  • Closing the gap between open-source and closed-source LLMs on mathematical challenges
Model performance comparison for MathCoder with CodeLlama and Llama-2 as base models

Possible use cases include tutoring systems, quantitative analysis, financial modeling, and scientific computing. The availability of accurate open-source math models could make AI assistance more accessible for math-heavy fields.

The researchers plan to release the dataset and models to spur progress in this domain. While math is the focus here, the principles used in MathCoder could eventually generalize to other reasoning-based tasks involving computation.
