27 January 2025
this is a very optimistic goal that i am setting for myself. currently i don’t know shiz about CUDA, GPU programming, or writing kernels. i want to learn. i don’t believe in the word “cracked”, but i definitely want to get better at certain skills that i feel will be important in the coming months or years. let’s see how far i can go
i have created a comprehensive plan with the help of chatgpt and deepseek to get started and follow a timeline. i will try to adhere to it.
the plan is structured to build both deep theoretical understanding and practical hands-on experience from day 1. by the end of the 30 days, the goal is to be proficient in writing custom GPU kernels, optimizing performance, and integrating GPU programming into ML workloads.
Week 1: Foundations of GPU Programming (CUDA & Triton)
Goal: Understand the GPU execution model, memory hierarchy, and write basic kernels.
Day 1: Introduction & Your First CUDA & Triton Kernel
- Theoretical:
- Why use GPUs? Parallelism vs. Serial Execution
- How NVIDIA GPUs execute thousands of threads in parallel.
- GPU architecture overview: Cores, Threads, Warps, Blocks, Grids
- Hands-on:
- Write a "Hello GPU" kernel (basic CUDA kernel execution).
- Write a simple CUDA/Triton kernel for vector addition (CPU vs GPU)
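a minimal sketch of what the day 1 vector-addition exercise could look like in CUDA (array size and launch configuration are my arbitrary choices, not part of the plan):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread computes exactly one element of the output vector.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard: the last block may have spare threads
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // host buffers
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    // device buffers + host-to-device copies
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // enough blocks of 256 threads to cover all n elements
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0]=%.1f  c[n-1]=%.1f\n", hc[0], hc[n - 1]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

the same loop on the CPU would walk all n elements serially; on the GPU every element is one thread, which is the whole point of the parallelism-vs-serial comparison.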
Day 2: Memory Management & Thread Hierarchies
- Theoretical:
- CUDA memory types: Global, Shared, Local, Register, Constant, Texture memory
- Threading concepts: Thread Indexing, Block Indexing
- Memory management: cudaMalloc, cudaMemcpy, and error handling.
- Hands-on:
- Modify Day 1’s kernel to use shared memory for performance gains.
- Implement matrix addition using both CUDA and Triton.
- Profile with nvprof (or Nsight Compute, which replaces it on newer GPUs) to compare execution times.
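the error-handling part of day 2 usually boils down to wrapping every runtime call in a check macro. a sketch of that pattern (the macro name is my own, not a CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so a failure surfaces immediately
// with a readable message, instead of silently corrupting later results.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    float *d;
    float h[1024] = {0};
    CUDA_CHECK(cudaMalloc(&d, sizeof(h)));                       // allocate on device
    CUDA_CHECK(cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice)); // copy up
    CUDA_CHECK(cudaFree(d));                                     // release
    printf("allocation, copy, and free all succeeded\n");
    return 0;
}
```

every CUDA runtime function returns a cudaError_t, so the same macro covers allocation, copies, and frees alike.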
Day 3: GPU Performance Optimization Basics
- Theoretical:
- Memory coalescing, bank conflicts, global memory vs. shared memory access
- Using profilers: Nsight, CUDA Profiler, PyTorch Profiler
- Hands-on:
- Implement a matrix multiplication kernel using CUDA.
- Measure execution time and optimize memory access patterns.
- Run PyTorch’s profiler to inspect CUDA kernel calls.
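one common way the day 3 memory-access optimization plays out is shared-memory tiling of the matmul kernel: each block stages a tile of A and B in shared memory so each global element is loaded once per tile rather than once per output element. a sketch, assuming square N×N matrices (the TILE size of 16 is my arbitrary choice):

```cuda
#include <cuda_runtime.h>

#define TILE 16

__global__ void matMulTiled(const float *A, const float *B, float *C, int N) {
    // per-block staging buffers in fast on-chip shared memory
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // sweep tiles of A (along its row) and B (along its column)
    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // coalesced loads; zero-pad when the tile hangs off the matrix edge
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done with this tile before it is overwritten
    }

    if (row < N && col < N)
        C[row * N + col] = acc;
}
```

comparing this against a naive one-thread-per-output kernel in the profiler is exactly the kind of before/after measurement day 3 asks for.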
Day 5: Writing Custom PyTorch GPU Kernels (CUDA + Triton)