A team of researchers from Microsoft and the Shenzhen Institute of Advanced Technology has developed a new method called Reinforced Evol-Instruct that significantly enhances the mathematical reasoning abilities of large language models (LLMs). Their method is presented in a new paper titled “WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct.”

Reinforced Evol-Instruct is designed to strengthen the mathematical reasoning of LLMs like Llama-2. It has three main steps:

- Supervised Fine-Tuning: The LLM is first fine-tuned on math problems and solutions to adapt it to the mathematical reasoning task.
- Reward Model Training: Two reward models are trained – an Instruction Reward Model that scores the quality of math questions, and a Process Reward Model that provides feedback on each step of the solution.
- Active Evol-Instruct: The math questions are iteratively evolved to increase diversity and complexity using two techniques – downward evolution to simplify questions and upward evolution to make them harder. The LLM is then trained via reinforcement learning using the rewards from the two models.
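The way the two reward signals might feed a single training reward can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `ScoredSample` container, the assumption that scores lie in [0, 1], and the choice to multiply the per-step scores and then multiply by the instruction score are all hypothetical choices made here for clarity.

```python
from dataclasses import dataclass
from math import prod
from typing import List


@dataclass
class ScoredSample:
    """One evolved question plus model scores (hypothetical container)."""
    question: str
    instruction_score: float   # assumed Instruction Reward Model output in [0, 1]
    step_scores: List[float]   # assumed Process Reward Model outputs, one per solution step


def process_reward(step_scores: List[float]) -> float:
    """Aggregate per-step scores into one value.

    Assumption: use the product, so a single weak step lowers the whole reward.
    """
    if not step_scores:
        return 0.0
    return prod(step_scores)


def combined_reward(sample: ScoredSample) -> float:
    """Assumption: final reward = instruction score * aggregated process score."""
    return sample.instruction_score * process_reward(sample.step_scores)


# A good question (0.9) whose solution has one doubtful middle step (0.5).
sample = ScoredSample(
    question="A train travels 120 km in 2 hours. What is its average speed?",
    instruction_score=0.9,
    step_scores=[1.0, 0.5, 1.0],
)
reward = combined_reward(sample)
print(f"combined reward: {reward:.2f}")
```

Multiplying the two signals means a sample earns a high reward only when both the question and every solution step score well. Other aggregations, such as taking the minimum step score, would be equally reasonable under these assumptions.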

By evolving the math training data and leveraging both instruction and process rewards, Reinforced Evol-Instruct produces LLMs with significantly improved mathematical reasoning abilities, as demonstrated by the WizardMath results. The authors present the method as a general framework for enhancing LLMs in logical and mathematical reasoning.

The researchers applied Reinforced Evol-Instruct to the 70-billion-parameter version of Llama-2, an open-source LLM from Meta. The resulting model, called **WizardMath-70B-V1.0**, achieves state-of-the-art results among open-source models on the mathematical reasoning benchmarks GSM8k and MATH. On GSM8k it even surpasses several commercial models, including OpenAI’s ChatGPT, Google’s PaLM-2, and Anthropic’s Claude Instant.

On the GSM8k benchmark consisting of grade school math problems, **WizardMath-70B-V1.0** attains 81.6% accuracy, trailing top proprietary models like GPT-4 at 92%, Claude 2 at 88%, and Flan-PaLM 2 at 84.7%, but exceeding ChatGPT at 80.8%, Claude Instant at 80.9%, and PaLM-2 at 80.7%. The previous best open-source LLM was Llama-2 at 56.8%. On the more advanced MATH dataset, WizardMath reaches 22.7% versus 19.1% for OpenAI’s davinci-002 model. The authors released two additional versions of the model with 13B and 7B parameters.
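To make these comparisons easier to read, the GSM8k figures quoted above can be turned into percentage-point gaps. The short Python sketch below uses only the numbers in this article; the variable names are illustrative.

```python
# GSM8k accuracies (%) as quoted in this article.
gsm8k = {
    "GPT-4": 92.0,
    "Claude 2": 88.0,
    "Flan-PaLM 2": 84.7,
    "Claude Instant": 80.9,
    "ChatGPT": 80.8,
    "PaLM-2": 80.7,
    "Llama-2": 56.8,
}
wizardmath = 81.6

# Positive gap: WizardMath is ahead; negative gap: it trails.
gaps = {name: round(wizardmath - acc, 1) for name, acc in gsm8k.items()}

# Print from the largest deficit to the largest lead.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name:15s} {gap:+.1f} points")
```

On these numbers, WizardMath leads Llama-2 by 24.8 points and trails GPT-4 by 10.4 points, while its margin over ChatGPT, Claude Instant, and PaLM-2 is under one point each.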

The large gain over other open-source LLMs, together with the narrow lead over several commercial ones, highlights the effectiveness of Reinforced Evol-Instruct. By evolving diverse math questions and solutions for training, and by using both instruction and process rewards, the method produces models that are stronger at mathematical reasoning.

The researchers suggest WizardMath could have broad implications for education by powering tools that help students learn and practice math concepts. It could also enable new applications like chatbots that provide step-by-step solving of math problems. The ready availability of WizardMath as an open-source model also makes state-of-the-art math reasoning accessible to other researchers and developers.

Overall, the WizardMath paper demonstrates the potent combination of evolved instructions and reinforcement learning for advancing LLMs’ reasoning abilities. The techniques provide a valuable template for further enhancing large language models in mathematical and logical reasoning.