Nvidia researchers boost LLMs’ reasoning skills by making them ‘think’ during pre-training

Nvidia researchers have developed a new technique that flips the script on how large language models (LLMs) learn to reason.

The method, called reinforcement learning pre-training (RLP), integrates RL into the initial training phase rather than saving it for the end.

This approach encourages models to “think for themselves before predicting what comes next, thus teaching independent thinking behavior early in pre-training,” the researchers state in their paper.

By learning to reason in plain text without needing external verifiers, models trained with RLP show significant improvements in learning complex reasoning tasks downstream, suggesting a future of AI that is more capable and adaptable for real-world tasks.

The typical LLM training cycle

Typically, large language models are first pre-trained on large amounts of text using a “next token prediction” objective, where they are given a string of text and asked to continually guess what the next word (or token) will be. At this stage, they learn basic grammar, facts, and associations.
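
As a rough illustration (not Nvidia’s training code), this standard objective can be expressed in a few lines with Hugging Face Transformers; the “gpt2” checkpoint below is just a placeholder model:

```python
# Minimal sketch of the next-token prediction objective (illustrative only).
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Large language models learn by predicting the next token."
inputs = tokenizer(text, return_tensors="pt")

# With labels set to the input ids, the library computes the standard
# next-token cross-entropy loss (targets are shifted internally).
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # average negative log-likelihood per token
```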

In the later post-training phase, models often learn complex reasoning skills such as chain-of-thought (CoT), where a model explains its reasoning step by step. This stage usually involves supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF), both of which require specialized, curated datasets.

The paper’s authors argue that this sequential process does not correspond to human understanding, which is “not a linear token-by-token process, but rather a parallel integration of input with prior knowledge.” Existing pre-training methods lack this mechanism, hindering a model’s ability to develop deep reasoning from the start.

How reinforcement learning pre-training works

RLP reformulates this process by treating CoT generation as an action the model takes before predicting the next token. At each step, the model first generates an internal “thought,” or chain of reasoning. It then predicts the next word in the text using the original context augmented with that new thought.

The model receives a reward based on how much its thinking improved the accuracy of its prediction compared to a baseline that did not generate a thought (pure prediction of the next token). This reward signal is automatically calculated based on the change in probability, eliminating the need for external verifiers or human-labeled data.
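
In code, the reward comes down to a difference of two log-probabilities. The sketch below is illustrative only: it borrows a small off-the-shelf model (“gpt2”) as a stand-in, and the thought-sampling settings and helper function are assumptions rather than the paper’s recipe:

```python
# Illustrative sketch of an RLP-style reward signal (not the paper's code).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_logprob(context: str, next_token: str) -> float:
    """Log-probability the model assigns to `next_token` given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    tok_id = tokenizer(next_token, add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        logits = model(ctx_ids).logits[0, -1]  # distribution over the next token
    return torch.log_softmax(logits, dim=-1)[tok_id].item()

context = "The capital of France is"
next_token = " Paris"

# 1. The model "thinks" first: here we simply sample a short continuation
#    as a stand-in for the internal chain of thought.
thought_ids = model.generate(
    tokenizer(context, return_tensors="pt").input_ids,
    max_new_tokens=16, do_sample=True, top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
context_plus_thought = tokenizer.decode(thought_ids[0], skip_special_tokens=True)

# 2. Score the true next token with and without the thought.
logp_with_thought = next_token_logprob(context_plus_thought, next_token)
logp_baseline = next_token_logprob(context, next_token)

# 3. The reward is the improvement in predictive log-likelihood:
#    positive only if the thought actually helped.
reward = logp_with_thought - logp_baseline
print(f"reward = {reward:.4f}")
```

Because both scores come from the model itself, the signal can be computed on ordinary text with no external verifier or human labels.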

The reward is positive only when the thought generated helps the model better predict the next token. By rewarding thoughts based on their predictive benefit, RLP effectively teaches the model how to think usefully on the same massive, unstructured data sets used for standard pre-training.

This continuous feedback loop allows the model to learn when a simple predictive guess is sufficient and when it needs to engage in deeper reasoning. As the researchers put it, “RLP was designed to shape thinking in base models, rewarding only those thoughts that measurably help in predicting the next token.”

This foundational approach, however, does not make the later stages of fine-tuning obsolete. According to Bryan Catanzaro, vice president of deep learning applied research at Nvidia and co-author of the paper, RLP is designed to complement, not replace, these crucial steps. “RLP is not intended to replace later post-training stages such as supervised fine-tuning or reinforcement learning from human feedback,” Catanzaro told VentureBeat. “These stages remain crucial to refining the model’s behavior… It was actually designed to amplify the effectiveness of these later phases, giving the model a head start.”

RLP in action

In experiments with Qwen3-1.7B and Nemotron-Nano-12B, the Nvidia team tested RLP on a set of mathematical and scientific reasoning benchmarks. The results show that RLP-enhanced models consistently outperformed their conventionally trained counterparts, with particularly strong gains on reasoning-heavy tasks.

For businesses, this improved reasoning could translate into more reliable results in multi-step workflows like financial analysis or legal document summarization.

“RLP encourages the model during pre-training to think before predicting, helping the model internalize a more coherent reasoning style,” Catanzaro said. “This could help reduce subtle logic errors, especially in longer workflows.”

While emphasizing that RLP-trained models will still need the usual safeguards like verification layers, human oversight, and consistency checks, Catanzaro said that “RLP provides a stronger foundation.”

Importantly, the benefits of RLP increase rather than disappear during subsequent stages of fine-tuning (catastrophic forgetting, where later stages of training cause a model to lose previously learned skills and knowledge, is a common problem in LLM training). The RLP-trained model achieved an overall score 7-8% higher than baselines after an identical post-training regimen. The researchers conclude that RLP “establishes robust reasoning foundations that are not eliminated by post-training alignment but are instead compounded by it.”

The technique’s efficiency is another important finding. On the Qwen3-1.7B model, RLP improved performance by 17% over standard continuous pre-training and also outperformed a similar technique, reinforcement pre-training (RPT), which uses prefix-matching rewards. This advantage held even when the baseline model was trained with 35 times more data to match the computational cost, confirming that the gains come from the method itself, not just from more compute.

Furthermore, RLP demonstrates impressive scalability and versatility, successfully extracting a reasoning signal from general-purpose web data, not just curated datasets. When applied to Nemotron-Nano-12B, a hybrid Mamba-Transformer model, RLP achieved a 35% relative improvement over a heavily trained baseline while using only a small fraction of the data.

While these results point to a more efficient path to building powerful models, Catanzaro frames the innovation as a fundamental change in the learning process itself, rather than an immediate solution to high training costs.

“This research is interesting because it offers a change in the way models absorb information during pre-training, leading to a smarter learning process,” he explained. “It wouldn’t replace large-scale pre-training, but it would offer another creative method for building the best models possible.”

A new foundation for AI training

Ultimately, RLP points to a future where pre-training will no longer be a monolithic process of predicting the next token. Instead, the next generation of models could be built on a hybrid of goals, creating AI that learns to think more robustly from day one. Catanzaro offers a powerful analogy to frame this shift:

“Predicting the next token teaches a model what the world is like; reinforcement-style goals like RLP can teach it to think about what it’s seeing,” he said. “Combining these two goals can help models develop deeper, more structured thinking much earlier in training… Tools like RLP can build on this foundation, making learning more active, curious, and even more efficient.”

There is still a lot to learn about the dynamics of reinforcement learning in the pre-training phase, but what seems clear is that “introducing exploration early in training opens up a new axis for scaling – not just in size, but in the way models learn to reason,” said Catanzaro.
