Learn about DeepSeek R1's innovative AI architecture from @deeplearningexplained. The course explores how R1 achieves exceptional reasoning through reinforcement learning, focusing on Group Relative Policy Optimization (GRPO) and how it improves upon traditional PPO methods. You'll also understand KL divergence's role in model stability, with practical code examples and clear mathematical explanations.
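As a rough illustration of the "group relative" idea in GRPO: instead of a learned value model supplying the baseline (as in PPO), each sampled completion's reward is normalized against the other completions drawn for the same prompt. The sketch below is a minimal, assumption-laden example, not the course's or any library's implementation. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are choices made here for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per completion sampled for a single
    prompt. `eps` (an assumption) guards against division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, scored by a hypothetical rule-based reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that beat the group average get a positive advantage and the rest get a negative one, so no separate critic network is needed to provide the baseline.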
Contents
⌨️ (0:00:00) Introduction
⌨️ (0:01:49) R1 Overview - Overview
⌨️ (0:03:52) R1 Overview - DeepSeek R1-zero path
⌨️ (0:05:32) R1 Overview - Reinforcement learning setup
⌨️ (0:08:36) R1 Overview - Group Relative Policy Optimization (GRPO)
⌨️ (0:13:04) R1 Overview - DeepSeek R1-zero result
⌨️ (0:16:53) R1 Overview - Cold start supervised fine-tuning
⌨️ (0:17:44) R1 Overview - Consistency reward for CoT
⌨️ (0:18:35) R1 Overview - Supervised Fine tuning data generation
⌨️ (0:21:06) R1 Overview - Reinforcement learning with neural reward model
⌨️ (0:22:53) R1 Overview - Distillation
⌨️ (0:26:16) GRPO - Overview
⌨️ (0:26:55) GRPO - PPO vs GRPO
⌨️ (0:30:25) GRPO - PPO formula overview
⌨️ (0:33:25) GRPO - GRPO formula overview
⌨️ (0:36:48) GRPO - GRPO pseudo code
⌨️ (0:38:56) GRPO - GRPO Trainer code
⌨️ (0:49:24) KL Divergence - Overview
⌨️ (0:49:55) KL Divergence - KL Divergence in GRPO vs PPO
⌨️ (0:51:20) KL Divergence - KL Divergence refresher
⌨️ (0:55:32) KL Divergence - Monte Carlo estimation of KL divergence
⌨️ (0:56:43) KL Divergence - Schulman blog
⌨️ (0:57:38) KL Divergence - k1 = log(q/p)
⌨️ (1:00:01) KL Divergence - k2 = 0.5*(log(p/q))^2
⌨️ (1:02:19) KL Divergence - k3 = (p/q - 1) - log(p/q)
⌨️ (1:04:44) KL Divergence - benchmarking
⌨️ (1:07:28) Conclusion
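The three estimators named in the KL Divergence chapters (k1, k2, k3) can be compared with a small Monte Carlo experiment. The sketch below is a hedged illustration, not the code shown in the video: it assumes samples are drawn from q, that r = p(x)/q(x), and it picks two unit-variance Gaussians (q = N(0, 1), p = N(0.1, 1)) so that the true KL(q || p) is known in closed form as 0.1^2 / 2 = 0.005. The helper names and the sample count are invented for the example.

```python
import math
import random

MU_P = 0.1                   # mean of p; q is the standard normal (assumption)
TRUE_KL = MU_P ** 2 / 2      # closed-form KL(q || p) for unit-variance Gaussians


def log_ratio(x):
    """log r = log p(x) - log q(x) for p = N(MU_P, 1) and q = N(0, 1)."""
    return -0.5 * (x - MU_P) ** 2 + 0.5 * x ** 2


def estimate_kl(num_samples=100_000, seed=0):
    """Return the sample means of k1, k2, k3 and the raw k3 samples."""
    rng = random.Random(seed)
    k1, k2, k3 = [], [], []
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)        # sample from q
        log_r = log_ratio(x)
        r = math.exp(log_r)
        k1.append(-log_r)              # log(q/p): unbiased, can be negative
        k2.append(0.5 * log_r ** 2)    # 0.5*(log(p/q))^2: biased, non-negative
        k3.append((r - 1.0) - log_r)   # (p/q - 1) - log(p/q): unbiased, non-negative
    n = float(num_samples)
    return sum(k1) / n, sum(k2) / n, sum(k3) / n, k3


mean_k1, mean_k2, mean_k3, k3_samples = estimate_kl()
print(f"true KL = {TRUE_KL}")
print(f"k1 = {mean_k1:.5f}, k2 = {mean_k2:.5f}, k3 = {mean_k3:.5f}")
```

Individual k1 samples can be negative even though KL divergence itself never is, which is one reason a non-negative estimator such as k3 is attractive as a per-token penalty.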
🎉 Thanks to our Champion and Sponsor supporters:
👾 Drake Milly
👾 Ulises Moralez
👾 Goddard Tan
👾 David MG
👾 Matthew Springman
👾 Claudio
👾 Oscar R.
👾 jedi-or-sith
👾 Nattira Maneerat
👾 Justin Hual
Description and video by freeCodeCamp.org. This page is an independent companion view; the video is embedded from YouTube.