A debugger for RL reward functions that detects reward hacking during training [P]
While experimenting with GRPO training, I kept running into an issue: when reward increases, it becomes difficult to tell whether the policy is genuinely improving or simply exploiting the reward function.
So I built a small library called rewardspy that wraps an existing reward function and continuously monitors indicators that often precede reward hacking.
It currently tracks things like:
- Rolling reward statistics
- Reward variance collapse
- Reward component imbalance
- Response length drift
- Reward slope changes
- GRPO group collapse
- Anol
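
For anyone wondering what this kind of monitoring looks like in practice, here is a minimal sketch of the general idea. It is not rewardspy's actual API: the class name `RewardMonitor`, the `(prompt, response)` reward signature, the window size, the `var_floor` threshold, and the `group_collapsed` helper are all assumptions made up for illustration. It covers only three of the signals above (rolling statistics, variance collapse, and response length drift), plus a simple check for a GRPO group whose rewards are all identical, which leaves the group-normalized advantages at zero.

```python
from collections import deque
from statistics import fmean, pstdev, pvariance
from typing import Callable


class RewardMonitor:
    """Hypothetical wrapper around a reward function (not the rewardspy API)."""

    def __init__(self, reward_fn: Callable[[str, str], float],
                 window: int = 64, var_floor: float = 1e-4):
        self.reward_fn = reward_fn
        self.var_floor = var_floor            # assumed threshold for "collapsed"
        self.rewards = deque(maxlen=window)   # rolling window of rewards
        self.lengths = deque(maxlen=window)   # rolling window of response lengths

    def __call__(self, prompt: str, response: str) -> float:
        # Delegate to the wrapped function, then record the observation.
        reward = float(self.reward_fn(prompt, response))
        self.rewards.append(reward)
        self.lengths.append(len(response.split()))  # whitespace token count
        return reward

    def report(self) -> dict:
        # Not enough data to compare window halves yet.
        if len(self.rewards) < 2:
            return {}
        lengths = list(self.lengths)
        half = len(lengths) // 2
        variance = pvariance(self.rewards)
        return {
            "mean_reward": fmean(self.rewards),
            "reward_variance": variance,
            "variance_collapsed": variance < self.var_floor,
            # Newer-half mean length minus older-half mean length.
            "length_drift": fmean(lengths[half:]) - fmean(lengths[:half]),
        }


def group_collapsed(group_rewards: list[float], eps: float = 1e-8) -> bool:
    """True when all rewards in one GRPO group are (nearly) identical."""
    return pstdev(group_rewards) < eps
```

In a training loop you would pass `RewardMonitor(my_reward_fn)` wherever the raw reward function was used and call `report()` every few steps. A rising mean reward together with `variance_collapsed` or a steadily growing `length_drift` is a hint to inspect samples by hand, not proof of hacking.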
This is my first major RL project, so I would absolutely love some technical advice.
Check it out here: https://github.com/AvAdiii/rewardspy
(Credits to u/Oranoleo12, posting on their behalf)