RLHF for finer alignment with Gemma 3


How do we best align a model for human interaction? In RLHF we first learn a proxy for human preferences, the reward model (RM), which is then used to align a language policy. Yet the RM is an imperfect approximation of human preferences, and prolonged optimization against it inevitably leads to reward hacking. In this presentation we'll review techniques developed for Gemma to mitigate this hacking and allow for longer training.
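For context, the sketch below shows the standard KL-regularized RLHF reward, one common guard against reward hacking: the reward-model score is offset by a penalty that keeps the policy close to a reference model. This is an illustrative assumption, not the Gemma-specific technique covered in the talk; the names shaped_reward, rm_score, and beta are hypothetical.

```python
# A minimal sketch of the standard KL-regularized RLHF reward, a common guard
# against reward hacking. This is NOT the Gemma-specific method from the talk;
# the function and parameter names here are illustrative assumptions.

def shaped_reward(rm_score: float,
                  policy_logprob: float,
                  ref_logprob: float,
                  beta: float = 0.1) -> float:
    """Reward-model score minus a KL-style penalty toward a reference model.

    The penalty discourages the policy from drifting into regions where the
    reward model is a poor proxy for human preferences, which is one way to
    delay reward hacking during prolonged RL training.
    """
    log_ratio = policy_logprob - ref_logprob  # per-sample estimate of KL drift
    return rm_score - beta * log_ratio


# Two completions with the same RM score: the one that drifts further from
# the reference model receives a lower shaped reward.
print(shaped_reward(rm_score=2.0, policy_logprob=-1.0, ref_logprob=-2.0))  # 1.90
print(shaped_reward(rm_score=2.0, policy_logprob=-1.9, ref_logprob=-2.0))  # 1.99
```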

Subscribe to Google for Developers → https://goo.gle/developers

Speakers: Louis Rouillard
Products Mentioned: Gemma
