Reddit - r/MachineLearning

A debugger for RL reward functions that detects reward hacking during training [P]

While experimenting with GRPO training, I kept running into the same issue: when the reward goes up, it is hard to tell whether the policy is genuinely improving or simply exploiting the reward function.

So I built a small library called rewardspy that wraps an existing reward function and continuously monitors indicators that often precede reward hacking.

It currently tracks things like:

  • Rolling reward statistics
  • Reward variance collapse
  • Reward component imbalance
  • Response length drift
  • Reward slope changes
  • GRPO group collapse
  • Anol
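
To make a few of these checks concrete, here is a minimal sketch of how a wrapper might watch for them. It is not rewardspy's actual API: `RewardMonitor`, `group_collapsed`, and every threshold below are hypothetical names and values chosen for illustration, and the reward function is assumed to take `(prompt, response)` and return a float. The GRPO check relies on the fact that group-normalized advantages vanish when every sample in a group gets the same reward.

```python
from collections import deque
from statistics import mean, pvariance


class RewardMonitor:
    """Hypothetical wrapper around a reward function (illustration only)."""

    def __init__(self, reward_fn, window=100, var_floor=1e-4, length_ratio=1.5):
        self.reward_fn = reward_fn
        self.rewards = deque(maxlen=window)   # rolling reward history
        self.lengths = deque(maxlen=window)   # rolling response lengths
        self.var_floor = var_floor            # assumed threshold for "collapsed" variance
        self.length_ratio = length_ratio      # assumed threshold for length drift
        self.baseline_length = None           # mean length of the first full window

    def __call__(self, prompt, response):
        # Score exactly as the wrapped function would, recording stats on the side.
        reward = self.reward_fn(prompt, response)
        self.rewards.append(reward)
        self.lengths.append(len(response))
        return reward

    def warnings(self):
        found = []
        if len(self.rewards) < self.rewards.maxlen:
            return found  # not enough data for a full window yet
        if pvariance(self.rewards) < self.var_floor:
            found.append("reward_variance_collapse")
        current = mean(self.lengths)
        if self.baseline_length is None:
            self.baseline_length = current
        elif current > self.length_ratio * self.baseline_length:
            found.append("response_length_drift")
        return found


def group_collapsed(group_rewards, tol=1e-6):
    """True when all rewards in one GRPO group are (nearly) identical,
    which leaves the group-relative advantages at zero."""
    return max(group_rewards) - min(group_rewards) <= tol
```

In a training loop, the wrapper would be passed wherever the original reward function was used, with `warnings()` polled every few steps and `group_collapsed` applied to each group's rewards before the advantages are computed.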

This is my first major RL project, so I would absolutely love some technical advice.

Check it out here: https://github.com/AvAdiii/rewardspy

(Credits to u/Oranoleo12, posting on their behalf)
