27 January 2025
this is a very optimistic goal that i am setting for myself. currently i don’t know shiz about CUDA, GPU programming, or writing kernels. i want to learn. i don’t believe in the word “cracked”, but i definitely want to get better at certain skills that i feel will be important in the coming months or years. let’s see how far i can go
i have created a comprehensive plan with the help of chatgpt and deepseek to get started and follow a timeline. i will try to adhere to it.
the plan is structured to build both deep theoretical understanding and practical hands-on experience from day 1. by the end of the 30 days, the goal is to be proficient in writing custom GPU kernels, optimizing performance, and integrating GPU programming into ML workloads.
Week 1: Foundations of GPU Programming (CUDA & Triton)
Goal: Understand the GPU execution model, memory hierarchy, and write basic kernels.
Day 1: Introduction & Your First CUDA & Triton Kernel
- Theoretical:
- Why use GPUs? Parallelism vs. Serial Execution
- How NVIDIA GPUs execute thousands of threads in parallel.
- GPU architecture overview: Cores, Threads, Warps, Blocks, Grids
- Hands-on:
- Write a "Hello GPU" kernel (basic CUDA kernel execution).
- Write a simple CUDA/Triton kernel for vector addition (CPU vs GPU)
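a minimal sketch of what the day 1 vector-addition exercise could look like in CUDA (array size and launch configuration are my arbitrary choices, not part of the plan):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread computes exactly one element of the output vector.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard: the last block may have spare threads
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // host buffers
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    // device buffers + host-to-device copies
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // enough blocks of 256 threads to cover all n elements
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0]=%.1f  c[n-1]=%.1f\n", hc[0], hc[n - 1]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

the same loop on the CPU would walk all n elements serially; on the GPU every element is one thread, which is the whole point of the parallelism-vs-serial comparison.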
Day 2: Memory Management & Thread Hierarchies
- Theoretical:
- CUDA memory types: Global, Shared, Local, Register, Constant, Texture memory
- Threading concepts: Thread Indexing, Block Indexing
- Memory management: cudaMalloc, cudaMemcpy, and error handling.
- Hands-on:
- Modify Day 1’s kernel to use shared memory for performance gains.
- Implement matrix addition using both CUDA and Triton.
- Profile with nvprof (or Nsight Compute, which replaces it on newer GPUs) to compare execution times.
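the error-handling part of day 2 usually boils down to wrapping every runtime call in a check macro. a sketch of that pattern (the macro name is my own, not a CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so a failure surfaces immediately
// with a readable message, instead of silently corrupting later results.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    float *d;
    float h[1024] = {0};
    CUDA_CHECK(cudaMalloc(&d, sizeof(h)));                       // allocate on device
    CUDA_CHECK(cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice)); // copy up
    CUDA_CHECK(cudaFree(d));                                     // release
    printf("allocation, copy, and free all succeeded\n");
    return 0;
}
```

every CUDA runtime function returns a cudaError_t, so the same macro covers allocation, copies, and frees alike.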
Day 3: GPU Performance Optimization Basics
- Theoretical:
- Memory coalescing, bank conflicts, global memory vs. shared memory access
- Using profilers: Nsight, CUDA Profiler, PyTorch Profiler
- Hands-on:
- Implement a matrix multiplication kernel using CUDA.
- Measure execution time and optimize memory access patterns.
- Run PyTorch’s profiler to inspect CUDA kernel calls.
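one common way the day 3 memory-access optimization plays out is shared-memory tiling of the matmul kernel: each block stages a tile of A and B in shared memory so each global element is loaded once per tile rather than once per output element. a sketch, assuming square N×N matrices (the TILE size of 16 is my arbitrary choice):

```cuda
#include <cuda_runtime.h>

#define TILE 16

__global__ void matMulTiled(const float *A, const float *B, float *C, int N) {
    // per-block staging buffers in fast on-chip shared memory
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // sweep tiles of A (along its row) and B (along its column)
    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // coalesced loads; zero-pad when the tile hangs off the matrix edge
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done with this tile before it is overwritten
    }

    if (row < N && col < N)
        C[row * N + col] = acc;
}
```

comparing this against a naive one-thread-per-output kernel in the profiler is exactly the kind of before/after measurement day 3 asks for.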
Day 5: Writing Custom PyTorch GPU Kernels (CUDA + Triton)