Reddit - r/MachineLearning 1h ago

A debugger for RL reward functions that detects reward hacking during training [P]

While experimenting with GRPO training, I kept running into an issue: when reward increases, it becomes difficult to tell whether the policy is genuinely improving or simply exploiting the reward function.

So I built a small library called rewardspy that wraps an existing reward function and continuously monitors indicators that often precede reward hacking.

It currently tracks things like:

Rolling reward statistics
Reward variance collapse
Reward component imbalance
Response length drift
Reward slope changes
GRPO group collapse
Anol
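
To make a few of these indicators concrete, here is a minimal sketch of how a wrapper could watch for reward variance collapse, response length drift, and GRPO group collapse. This is not rewardspy's actual API: `RewardMonitor`, `group_collapsed`, the thresholds, and the `(prompt, response)` reward signature are all hypothetical and chosen only for illustration. "Group collapse" is assumed here to mean that every completion in a GRPO group gets the same reward, so the group-normalized advantages are all zero.

```python
from collections import deque
from statistics import mean, pvariance


class RewardMonitor:
    """Hypothetical wrapper around a reward function (not rewardspy's API)."""

    def __init__(self, reward_fn, window=100, var_floor=1e-4, length_ratio=1.5):
        self.reward_fn = reward_fn
        self.var_floor = var_floor          # assumed threshold for "collapsed" variance
        self.length_ratio = length_ratio    # assumed threshold for length drift
        self.rewards = deque(maxlen=window)
        self.lengths = deque(maxlen=window)
        self.baseline_length = None         # mean length of the first full window

    def __call__(self, prompt, response):
        # Delegate to the wrapped reward function, then record what we saw.
        reward = self.reward_fn(prompt, response)
        self.rewards.append(reward)
        self.lengths.append(len(response))
        if self.baseline_length is None and len(self.lengths) == self.lengths.maxlen:
            self.baseline_length = mean(self.lengths)
        return reward

    def warnings(self):
        """Return the names of indicators that fire on the current window."""
        found = []
        if len(self.rewards) == self.rewards.maxlen:
            # Near-zero variance: every recent response earns about the same reward.
            if pvariance(self.rewards) < self.var_floor:
                found.append("reward_variance_collapse")
            # Recent responses are much longer than the first full window.
            if mean(self.lengths) > self.length_ratio * self.baseline_length:
                found.append("response_length_drift")
        return found


def group_collapsed(group_rewards, tol=1e-8):
    """True when all rewards in one GRPO group are (nearly) identical."""
    return max(group_rewards) - min(group_rewards) <= tol


if __name__ == "__main__":
    # Toy reward that ignores its inputs, so the variance check should fire.
    monitor = RewardMonitor(lambda prompt, response: 1.0, window=4)
    for _ in range(4):
        monitor("question", "short answer")
    print(monitor.warnings())
    print(group_collapsed([0.5, 0.5, 0.5, 0.5]))
```

Since the wrapper returns the underlying reward unchanged, training behaves the same with or without it; the checks only read from the rolling window.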

This is my first major RL project, so I would absolutely love some technical advice.

Check it out here: https://github.com/AvAdiii/rewardspy

(Credits to u/Oranoleo12, posting on their behalf)
