Open-R1: Rebuilding DeepSeek-R1 for the Open Source Community
Open Source · LLM · Reasoning · Reinforcement Learning · DeepSeek-R1

Published on January 29, 2025

Introduction

Have you ever been stuck on a tricky problem, wishing you had a bit more time to think it through? Well, that's exactly what Large Language Models (LLMs) are starting to do! Models like OpenAI's o1 have shown that when LLMs are given more computational power during inference (the thinking process), they get much better at reasoning tasks like math, coding, and logic. It's like giving your brain a little extra time to work things out.

However, the exact recipe behind these powerful reasoning models has been a closely guarded secret. That is, until recently, when DeepSeek released their DeepSeek-R1 model. This release not only matched or exceeded the performance of models like o1, but also came with a detailed tech report outlining their training process. This was a big deal, and it got everyone in the AI community very excited!

DeepSeek's approach involved using pure reinforcement learning to teach a base language model how to reason without any human supervision. Think of it like teaching a dog a new trick, but instead of treats, you're giving the model feedback based on how well it solves problems. The figure below shows that creating a powerful reasoning model is now quite simple if you have a good base model and high-quality data:

[Diagram of Base Model + Data Mixture -> Reasoning Model]

Despite this exciting release, some key questions remain:

  • Data Collection: How were the specific datasets for reasoning created?
  • Model Training: DeepSeek didn't release the training code, so we don't know which hyperparameters work best, or how they change across model sizes and families.
  • Scaling Laws: What are the trade-offs between compute power and data when training these models?

These questions led to the creation of the Open-R1 project. Our goal is to reconstruct DeepSeek-R1's data and training pipeline, validate their claims, and push the boundaries of open reasoning models. By doing this, we hope to make the process transparent, share insights with the open-source community, and create a foundation for future models to build upon.

In this blog post, we'll explore the key ingredients behind DeepSeek-R1, what we plan to replicate, and how you can contribute to the Open-R1 project. Let's dive in!

How Did DeepSeek Do It?

DeepSeek-R1 is built on top of DeepSeek-V3, a powerful base model. Think of DeepSeek-V3 as the foundation of a house, and DeepSeek-R1 as the finished building. DeepSeek-V3 is a 671B Mixture of Experts (MoE) model, which means it's like having a team of specialists working together, with only a fraction of them active for any given input. It performs as well as other top models like Claude 3.5 Sonnet and GPT-4o. What's even more impressive is that it was trained for only ~$5.5M, thanks to architectural changes like Multi-Token Prediction (MTP) and Multi-Head Latent Attention (MLA), plus a lot of hardware optimization. It's like building a high-performance car but making it fuel-efficient at the same time!

DeepSeek actually released two models: DeepSeek-R1-Zero and DeepSeek-R1, each with a different training approach. DeepSeek-R1-Zero skipped supervised fine-tuning altogether and relied entirely on reinforcement learning (RL), using Group Relative Policy Optimization (GRPO), a PPO variant that drops the separate value model and instead estimates the baseline from a group of completions sampled for the same prompt, to keep the process efficient. The model was guided by a simple rule-based reward system that scored answers on accuracy and on whether they followed the expected format. This was enough for the model to develop reasoning skills like breaking down problems and checking its own work, but its responses were often unclear and hard to read.
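The exact reward functions aren't public, but to make the idea concrete, here is a minimal sketch of what such rule-based rewards could look like in Python. The `<think>`/`<answer>` tags, the exact-match check, and the equal weighting are assumptions on our part:

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion wraps its reasoning and answer in the expected tags."""
    pattern = r"<think>.+?</think>\s*<answer>.+?</answer>"
    return 1.0 if re.search(pattern, completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the final answer inside <answer> tags matches the reference exactly."""
    match = re.search(r"<answer>(.+?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0

# Example: a well-formed, correct completion earns both rewards.
completion = "<think>17 * 24 = 408</think>\n<answer>408</answer>"
print(format_reward(completion) + accuracy_reward(completion, "408"))  # 2.0
```

In GRPO, per-completion rewards like these are compared within the group of completions sampled for the same prompt, so the model is nudged toward the answers that score above the group average.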

That's where DeepSeek-R1 comes in. It started with a "cold start" phase, fine-tuning on a small set of carefully chosen examples to improve clarity and readability. From there, it went through more RL and refinement stages, including rejection sampling to filter out low-quality outputs, using both human-preference and verifiable reward signals. The result is a model that not only reasons well but also produces clear and consistent answers. It's like taking a rough draft and polishing it into a final, well-written piece.
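To make the rejection-sampling step concrete, here is a minimal sketch. The `generate` and `verify` helpers are hypothetical stand-ins for an actual model-sampling call and whatever mix of verifiable checks and preference scores the real pipeline uses:

```python
def rejection_sample(prompts, generate, verify, samples_per_prompt=8):
    """Keep only the completions that pass a verifier, for later fine-tuning.

    `generate(prompt)` and `verify(prompt, completion)` are hypothetical
    stand-ins for an LLM sampling call and a quality check (e.g. an
    exact-match grader plus a reward-model score threshold).
    """
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = generate(prompt)
            if verify(prompt, completion):
                kept.append({"prompt": prompt, "completion": completion})
    return kept
```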

Open-R1: Filling in the Missing Pieces

The release of DeepSeek-R1 is a huge step forward for the AI community. However, not everything was released. While the model weights are open, the datasets and code used to train the model are not. It's like getting a recipe for a delicious cake but not having the list of ingredients or the instructions on how to bake it.

The goal of Open-R1 is to create these missing pieces so that the entire research and industry community can build similar or even better models. By doing this in the open, everyone can contribute! It's like a community cookbook where everyone can share their recipes and improve on them.

Here's our plan of attack:

  • Replicate the R1-Distill models: We'll create a high-quality reasoning dataset by distilling knowledge from DeepSeek-R1. Think of it like extracting the essence of the model's reasoning abilities; a rough sketch of what this distillation step could look like follows the list.
  • Replicate the pure RL pipeline: We'll recreate the reinforcement learning process that DeepSeek used to create R1-Zero. This will involve creating new, large-scale datasets for math, reasoning, and code. It's like building the training ground for the model.
  • Show we can go from base model → SFT → RL: We'll demonstrate how to go from a base model to a reasoning model through multi-stage training. It's like showing the complete journey of how to build the model.
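
To illustrate the first item, one plausible way to distill reasoning traces (not necessarily the exact setup we'll end up using) is to query a server hosting DeepSeek-R1 through an OpenAI-compatible client and store its step-by-step solutions. The endpoint URL and model name below are placeholders:

```python
from openai import OpenAI

# Placeholder endpoint and model name: point this at whatever server
# (e.g. vLLM or an inference provider) is serving DeepSeek-R1.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def distill_trace(problem: str) -> str:
    """Ask the teacher model for a full step-by-step solution to one problem."""
    response = client.chat.completions.create(
        model="deepseek-r1",  # placeholder model identifier
        messages=[{"role": "user", "content": problem}],
        temperature=0.6,
        max_tokens=4096,
    )
    return response.choices[0].message.content

# Collect (problem, trace) pairs that a smaller model can be fine-tuned on.
problems = ["What is 17 * 24? Show your reasoning."]
traces = [{"problem": p, "trace": distill_trace(p)} for p in problems]
```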

The synthetic datasets will allow anyone to fine-tune existing or new LLMs into reasoning models. The training recipes involving RL will serve as a starting point for building similar models from scratch and will allow researchers to build even more advanced methods. It's like providing a toolkit for anyone to build their own reasoning models.
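As a concrete example of that fine-tuning step, here's a minimal sketch using the trl library's SFTTrainer. The dataset id is a placeholder, and the exact keyword arguments vary between trl versions:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset id: any dataset whose "text" column holds full
# reasoning traces (problem + chain of thought + answer) works the same way.
dataset = load_dataset("open-r1/synthetic-reasoning-traces", split="train")

training_args = SFTConfig(
    output_dir="qwen2.5-1.5b-reasoner",
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # any small open model works here
    train_dataset=dataset,
    args=training_args,
)
trainer.train()
```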

We're not just stopping at math datasets. There's a lot of potential in exploring other areas, like code and scientific fields such as medicine, where reasoning models could have a significant impact. Imagine a model that can help doctors diagnose diseases or help scientists discover new medicines!

This initiative isn't just about replicating results—it's about sharing insights with the community. By documenting what works, what doesn't, and why, we hope to save others from wasting time and resources on unproductive paths. It's like sharing your experiences so that others can learn from your mistakes and successes.
