arxiv:2601.05242

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Published on Jan 8
Submitted by LIU Shih-yang on Jan 9
#1 Paper of the day

Abstract

AI-generated summary: Multi-reward reinforcement learning suffers from reward normalization collapse in GRPO, which GDPO addresses by decoupling reward normalization for improved training stability and performance across reasoning tasks.

As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in multi-reward settings without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
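
To make the collapse argument concrete, here is a minimal sketch contrasting the two orderings on a toy group of rollouts. It assumes per-group z-score normalization and equal reward weights; the function names `grpo_advantages` and `gdpo_advantages` are illustrative and not taken from the paper's released code.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style: sum the K rewards per rollout, then z-normalize the summed
    reward across the group of G rollouts. `rewards` has shape (G, K)."""
    total = rewards.sum(axis=1)                               # (G,)
    return (total - total.mean()) / (total.std() + eps)

def gdpo_advantages(rewards: np.ndarray, weights=None, eps: float = 1e-6) -> np.ndarray:
    """GDPO-style (as described in the abstract): z-normalize each reward
    across the group separately, then aggregate the normalized values."""
    _, K = rewards.shape
    w = np.ones(K) if weights is None else np.asarray(weights, dtype=float)
    normed = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + eps)  # (G, K)
    return normed @ w                                          # (G,)

# Toy group of 4 rollouts scored by a correctness reward and a format reward.
# Rollouts 1-3 all sum to 1, so GRPO assigns them one identical advantage even
# though they trade correctness against formatting in different ways.
r = np.array([[1.0, 1.0],   # correct, well formatted
              [1.0, 0.0],   # correct, badly formatted
              [0.0, 1.0],   # wrong,   well formatted
              [0.0, 1.0]])  # wrong,   well formatted
print(grpo_advantages(r))   # [ 1.73, -0.58, -0.58, -0.58] -> resolution lost
print(gdpo_advantages(r))   # [ 1.58, -0.73, -0.42, -0.42] -> distinct combinations stay distinct
```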

Community

Paper author / submitter

GDPO is a drop-in replacement for GRPO in verl and TRL — only minor code changes needed.

We release a slurm-free, easy-to-run implementation supporting multiple RL frameworks (verl / TRL / NeMo-RL) so you can quickly validate GDPO on tool-calling and math reasoning tasks.

⏱️ Each run can be completed in ~1 hour on 8×A100s, or ~2.5 hours on a single A100.

🔄 Switching from GRPO to GDPO is easy.
👉 Try it yourself: https://github.com/NVlabs/GDPO

Really cool paper!
I've created a podcast that explains the key concepts:
https://researchpod-share.vercel.app/episode/c83f1820-279a-4cc0-afe1-b927a0c20ec8


I enjoyed listening to the AI paper podcast!

When you RL models for real-world use, you care about more than one thing: accuracy, conciseness, alignment, faithfulness, etc.

But most RL pipelines still compress all of that into one scalar advantage in the loss function — and a lot of preference signal gets washed out.

We’re introducing GDPO, a simple fix that lets you express multi-dimensional preferences with a single advantage. Key idea: swap the order of reward normalization and aggregation.

Works out-of-the-box as a GRPO add-on—code is provided for veRL, TRL, and NeMo-RL.
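
Read as a code-level summary of that order swap, here is a hedged sketch: `rewards` is a hypothetical (group_size x num_rewards) array, not an object from the veRL, TRL, or NeMo-RL APIs, and `znorm` stands in for whatever per-group normalization the pipeline uses.

```python
import numpy as np

def znorm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # z-score over the group dimension (axis 0)
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

rewards = np.random.rand(8, 3)          # 8 rollouts in a group, 3 reward signals

adv_grpo = znorm(rewards.sum(axis=1))   # aggregate first, then normalize (GRPO)
adv_gdpo = znorm(rewards).sum(axis=1)   # normalize each reward, then aggregate (GDPO)
```

Only the advantage computation changes; the rest of the policy-gradient update can stay as it is, which is what makes the swap a drop-in modification.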


Models citing this paper: 0
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 5