DeepSeek R1 Theory Tutorial – Architecture, GRPO, KL Divergence


Learn the theory behind DeepSeek R1 from @deeplearningexplained. The course explores how R1 develops its reasoning ability through reinforcement learning, focusing on Group Relative Policy Optimization (GRPO) and how it improves upon traditional PPO methods. You’ll also learn what role KL divergence plays in keeping training stable, with practical code examples and clear mathematical explanations.

❤️ Support for this channel comes from our friends at Scrimba – the coding platform that’s reinvented interactive learning: https://scrimba.com/freecodecamp

Contents
⌨️ (0:00:00) Introduction
⌨️ (0:01:49) R1 Overview – Overview
⌨️ (0:03:52) R1 Overview – DeepSeek R1-zero path
⌨️ (0:05:32) R1 Overview – Reinforcement learning setup
⌨️ (0:08:36) R1 Overview – Group Relative Policy Optimization (GRPO)
⌨️ (0:13:04) R1 Overview – DeepSeek R1-zero result
⌨️ (0:16:53) R1 Overview – Cold start supervised fine-tuning
⌨️ (0:17:44) R1 Overview – Consistency reward for CoT
⌨️ (0:18:35) R1 Overview – Supervised fine-tuning data generation
⌨️ (0:21:06) R1 Overview – Reinforcement learning with neural reward model
⌨️ (0:22:53) R1 Overview – Distillation
⌨️ (0:26:16) GRPO – Overview
⌨️ (0:26:55) GRPO – PPO vs GRPO
⌨️ (0:30:25) GRPO – PPO formula overview
⌨️ (0:33:25) GRPO – GRPO formula overview
⌨️ (0:36:48) GRPO – GRPO pseudo code
⌨️ (0:38:56) GRPO – GRPO Trainer code
⌨️ (0:49:24) KL Divergence – Overview
⌨️ (0:49:55) KL Divergence – KL Divergence in GRPO vs PPO
⌨️ (0:51:20) KL Divergence – KL Divergence refresher
⌨️ (0:55:32) KL Divergence – Monte Carlo estimation of KL divergence
⌨️ (0:56:43) KL Divergence – Schulman blog
⌨️ (0:57:38) KL Divergence – k1 = log(q/p)
⌨️ (1:00:01) KL Divergence – k2 = 0.5*log(p/q)^2
⌨️ (1:02:19) KL Divergence – k3 = (p/q – 1) – log(p/q)
⌨️ (1:04:44) KL Divergence – benchmarking
⌨️ (1:07:28) Conclusion
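
The three estimators named in the KL Divergence chapters (k1, k2, k3) can be tried out with a minimal sketch like the one below. It is not the code from the video. It assumes samples x are drawn from q, that the ratio is p(x)/q(x), and that the quantity being estimated is KL(q || p). The two unit-variance Gaussians and the helper names (kl_estimators, normal_logpdf, benchmark) are hypothetical choices made only for illustration.

```python
import math
import random


def kl_estimators(log_ratio):
    """Per-sample estimators of KL(q || p) for one sample x ~ q.

    log_ratio is log p(x) - log q(x), i.e. log(p/q) as in the chapter titles.
    """
    ratio = math.exp(log_ratio)
    k1 = -log_ratio                  # k1 = log(q/p): unbiased, can be negative
    k2 = 0.5 * log_ratio ** 2        # k2 = 0.5 * (log(p/q))^2: biased, never negative
    k3 = (ratio - 1.0) - log_ratio   # k3 = (p/q - 1) - log(p/q): unbiased, never negative
    return k1, k2, k3


def normal_logpdf(x, mean, std=1.0):
    """Log density of a normal distribution (assumed toy distributions)."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def benchmark(mu_p=0.1, n=100_000, seed=0):
    """Average each estimator over n samples from q = N(0, 1), with p = N(mu_p, 1)."""
    rng = random.Random(seed)
    sums = [0.0, 0.0, 0.0]
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)  # sample from q
        log_ratio = normal_logpdf(x, mu_p) - normal_logpdf(x, 0.0)
        for i, k in enumerate(kl_estimators(log_ratio)):
            sums[i] += k
    true_kl = 0.5 * mu_p ** 2  # closed form for two unit-variance Gaussians
    return true_kl, [s / n for s in sums]


if __name__ == "__main__":
    true_kl, means = benchmark()
    print("true KL:", true_kl)
    for name, value in zip(("k1", "k2", "k3"), means):
        print(name, "mean:", value)
```

Running the script prints the closed-form KL next to the Monte Carlo average of each estimator. Under these assumptions, individual k1 samples can be negative while individual k2 and k3 samples cannot, which is one reason a k3-style estimator is attractive as a per-token penalty.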

🎉 Thanks to our Champion and Sponsor supporters:
👾 Drake Milly
👾 Ulises Moralez
👾 Goddard Tan
👾 David MG
👾 Matthew Springman
👾 Claudio
👾 Oscar R.
👾 jedi-or-sith
👾 Nattira Maneerat
👾 Justin Hual

—

Learn to code for free and get a developer job: https://www.freecodecamp.org

Read hundreds of articles on programming: https://freecodecamp.org/news

#programming #freecodecamp #learn #learncode #learncoding