Chapters (99)
- 0:00 Course Introduction
- 7:27 Table of Contents & Course Overview
- 23:30 LESSON 1 — H100 Hopper GPU Architecture
- 25:47 H100 Specifications: HBM3, Bandwidth & Power
- 26:22 Tensor Cores Overview
- 27:18 Tensor Memory Accelerator (TMA)
- 34:44 Transformer Engine
- 34:58 L2 Cache Architecture
- 35:21 GPCs, TPCs & SM Layout
- 37:00 Thread Block Clusters
- 46:22 Distributed Shared Memory
- 52:44 SM Sub-Partitions (SMSPs)
- 54:01 Warp Schedulers & Dispatch Units
- 1:02:37 Shared Memory & Data Movement
- 1:12:20 Occupancy
- 1:32:49 LESSON 2 — Clusters, Data Types, Inline PTX & Pointers
- 1:32:57 Thread Block Clusters Programming
- 1:42:11 Configuring Cluster Dimensions
- 1:48:08 Inline PTX Assembly
- 1:59:31 State Spaces
- 2:06:01 Data Types in PTX
- 2:07:16 Generic Pointers
- 2:09:59 Address Space Conversion
- 2:15:14 LESSON 3 — Asynchronicity & Barriers
- 2:15:22 Introduction to Async Operations
- 2:28:06 Proxies
- 2:28:56 Fences & Memory Ordering
- 2:36:17 Fence Ordering & Visibility
- 2:38:58 Fence Scopes
- 2:40:30 Acquire & Release Fences
- 2:45:18 Expected Count & Thread Arrival
- 2:46:01 M-Barrier Arrive Operations
- 2:55:37 M-Barrier PTX Instructions
- 3:07:21 Barrier Wait Operations
- 3:10:03 Phase & Parity
- 3:59:42 Commit Operations
- 4:10:06 LESSON 4 — CuTensorMap Descriptors
- 4:16:18 Tensor Shape, Stride & Data Type
- 4:22:52 Element Stride & Dimensions
- 4:24:13 Box Dimensions (Tile Size)
- 4:30:30 Bank Conflicts
- 4:31:05 Swizzling
- 4:33:02 Swizzle Formula Deep Dive
- 4:52:48 Interleave Layouts
- 5:04:22 Out-of-Bounds Fill (OOB)
- 5:06:01 LESSON 5 — cp.async.bulk (Async Bulk Copies via TMA)
- 5:08:04 Bulk Tensor Operations (1D–5D)
- 5:27:31 Multicast Operations
- 5:47:44 Prefetch
- 5:53:41 LESSON 6 — WGMMA Part 1 (Warp Group Matrix Multiply Accumulate)
- 5:59:18 Warp Groups & Matrix Multiplication
- 6:03:21 WGMMA Descriptors
- 6:07:26 Accumulators & Register Reuse
- 6:30:34 Scale Factors (Scale D, Scale A, Scale B)
- 6:47:05 Core Matrices & 16×16 Tiles
- 7:44:03 LESSON 7 — WGMMA Part 2
- 7:46:02 Commit Groups & Wait Groups
- 8:04:31 WGMMA with FP8 Data Types
- 8:48:46 LESSON 8 — Kernel Design
- 8:50:58 Compute-Bound vs. Memory-Bound Kernels
- 8:54:10 Warp Specialization
- 9:08:56 Cooperative vs. Ping-Pong Pipelines
- 9:09:47 Pipelining Fundamentals
- 9:12:47 Circular Buffering
- 9:36:38 Ping-Pong Pipeline Deep Dive
- 9:37:34 Epilogue Handling in Pipelines
- 9:43:52 Persistent Scheduling
- 10:48:13 Split-K & Stream-K Strategies
- 10:57:20 Data-Parallel Tile Scheduling
- 11:35:23 Epilogue Fusion (Bias, Activation, Scaling)
- 11:41:35 Epilogue Operations Overview
- 12:05:32 CUTLASS SOURCE CODE WALKTHROUGH
- 13:04:17 Main Loop & Scheduling Policies
- 13:51:03 Dispatch Policy
- 15:18:49 SM90 Tile Scheduler
- 17:58:46 SM90 Epilogue (TMA Warp Specialized)
- 19:22:42 SM90 Builder
- 19:44:58 Collective Builder
- 19:49:56 FAST.CU KERNEL WALKTHROUGH
- 19:55:19 Main Loop Implementation
- 20:06:51 Producer Warp Group (Dependence Wall)
- 20:12:08 Consumer Warp Group
- 21:29:30 Prologue
- 21:47:09 MULTI-GPU PROGRAMMING — Part 1
- 21:56:19 NVSwitch
- 22:03:23 Topology & System Architecture
- 22:17:03 NVSwitch, BlueField DPUs & Storage Fabrics
- 22:37:19 CUDA Peer-to-Peer Communication
- 22:37:57 MPI (Message Passing Interface)
- 22:46:29 P2P Limitations & Trade-offs
- 22:49:29 MULTI-GPU PROGRAMMING — Part 2
- 22:52:20 SLURM Resource Allocation
- 22:52:52 PMIx Process Management
- 23:05:27 NCCL (NVIDIA Collective Communications Library)
- 23:15:20 NCCL Internals & Ring Algorithm
- 23:17:43 AllReduce Operations
- 23:34:18 NCCL Collectives: Broadcast, AllGather, ReduceScatter
- 23:36:16 Parallelism Strategies: Data, Tensor, Pipeline & Expert Parallelism
- 24:37:56 Course Conclusion & Next Steps
Learn CUDA programming for NVIDIA Hopper GPUs. You will learn to build efficient WGMMA pipelines and leverage CUTLASS optimizations to perform the massive matrix multiplications that power modern AI. Beyond single-chip performance, the curriculum covers multi-GPU scaling and the NCCL primitives necessary for training trillion-parameter models. To get the most out of these lessons, you should have a foundational grasp of C++ syntax and linear algebra, particularly how matrices are tiled and multiplied.
- Course website: https://cudacourseh100.github.io
- Course repo: https://github.com/cudacourseh100/H100-Course
- X: https://x.com/_PrateekShukla_
- GitHub Sponsors: https://github.com/sponsors/prateekshukla1108
✏️ Developed by Prateek_Shukla
❤️ Support for this channel comes from our friends at Scrimba – the coding platform that's reinvented interactive learning: https://scrimba.com/freecodecamp
Description and video by freeCodeCamp.org. This page is an independent companion view; the video is embedded from YouTube.