CS61Cuda Project: Matmul & CUDA Fundamentals
Welcome to CS61Cuda!! This is a mini‑project that introduces you to GPU programming with CUDA by building up to a fast matrix multiply. You’ll start with a CPU reference, write your first CUDA kernels, learn to reason about grids/blocks/threads, and finally add simple vectorization (SIMD) on the GPU. An optional performance sandbox lets you explore optimizations for bragging rights.
1 Learning Goals
- Map data‑parallel work to CUDA’s grid/block/thread hierarchy.
- Practice indexing and bounds checks in 1D/2D.
- Understand memory access patterns (coalescing) and why they matter.
- See the benefits of TLP (thread‑level parallelism) and DLP (data‑level/SIMD) on the GPU.
- Read simple performance counters and reason about memory‑ vs compute‑bound kernels.
2 Repo Layout & What You Edit
cs61cuda/
├─ CMakeLists.txt # or Makefile (both provided)
├─ include/
│ └─ matmul.h # shared function prototypes
├─ src/
│ ├─ main.cpp # driver: parses flags, allocs, calls your code
│ ├─ task_1_cuda_copy.cu # ✏️ Task 1: 2D copy kernel
│ ├─ task_2_cpu_baseline.cpp # ✏️ Task 2: implement CPU matmul
│ ├─ task_3_cuda_naive.cu # ✏️ Task 3: naive CUDA matmul (1 output/thread)
│ ├─ task_4_cuda_simd.cu # ✏️ Task 4: vectorized CUDA matmul (no shared mem)
│ ├─ utils.cu # timers, error checks, random init, compare
│ └─ check.cuh # CUDA error macros (provided)
├─ tests/
│ ├─ correctness_tests.py # local correctness checks
│ └─ perf_runner.py # runs sizes & prints throughput
├─ data/
│ └─ generate.py # makes small sample matrices (optional)
├─ scripts/
│ ├─ build.sh # nvcc or cmake build helper
│ └─ run_all.sh # runs all tasks & tests
└─ README.md              # quick overview

You will edit: `src/task_1_cuda_copy.cu`, `src/task_2_cpu_baseline.cpp`, `src/task_3_cuda_naive.cu`, `src/task_4_cuda_simd.cu`.
3 Grading Summary
| Task | Description | Points |
|---|---|---|
| 1 | Welcome to 61Cuda: Copy kernel (2D grid, indexing, bounds, coalescing) | 10 |
| 2 | Baseline CPU matmul (triple-loop reference) | 15 |
| 3 | CUDA naive matmul (1 output/thread) | 30 |
| 4 | CUDA SIMD matmul (vectorized loads, no shared memory) | 30 |
| 5 | Performance engineering (optional) | 15 EC |
Total: 85 required points + 15 extra credit (100 possible).
Autograder policy. We check correctness on hidden sizes, run times on a standard GPU, and basic style (clear bounds checks, no UB). We do not require a specific speedup, but we verify your kernels scale sensibly with size.
Collaboration. Discuss ideas high‑level with peers; code must be your own. Cite any online sources you consulted.
4 CUDA Primer (read first)
4.1 Core Concepts
- A kernel is a C/C++ function annotated `__global__` and launched with `<<<grid, block>>>`.
- Each launch creates a 2‑D/3‑D grid of blocks; each block contains many threads. Every thread runs the same kernel on different data.
Example: myKernel<<<dim3(2,2), dim3(16,16)>>>(args) creates a 2×2 grid of blocks, each with 16×16 threads (total: 4 blocks × 256 threads = 1024 threads).
- Built-in variables inside kernels: `blockIdx.{x,y,z}`, `threadIdx.{x,y,z}`, `blockDim.{x,y,z}`, `gridDim.{x,y,z}`.
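Putting these pieces together, a complete minimal kernel and its launch might look like the following illustrative sketch (the kernel `add_one_kernel` and the sizes are hypothetical, not part of the project):

```
__global__ void add_one_kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (idx < n) data[idx] += 1.0f;                    // each thread touches one element
}

// Host side: 4 blocks of 256 threads cover n = 1000 elements (the bounds check
// above handles the 24 extra threads in the last block).
// add_one_kernel<<<dim3(4), dim3(256)>>>(d_data, 1000);
```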
2D Indexing Example:
int i = blockIdx.y * blockDim.y + threadIdx.y; // row
int j = blockIdx.x * blockDim.x + threadIdx.x; // column

Grid size calculation: When dimensions don’t divide evenly, round up:

dim3 grid((width + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);

4.2 Memory Hierarchy
- Global (device DRAM): large, high-latency (~400-800 cycles); visible to all threads. Allocated with `cudaMalloc()`, accessed via pointers. Persists across kernel launches.
- Shared (on-chip, per-block): small (~48KB per SM), low-latency (~20-30 cycles); visible to threads in the same block. Declared with `__shared__`. Contents do not persist between kernel launches.
- Registers (per-thread): fastest (~1 cycle), limited (~255 per thread). Automatic for local variables. Private to each thread.

Memory transfer patterns:
- Host → Device: `cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice)`
- Device → Host: `cudaMemcpy(dst, src, size, cudaMemcpyDeviceToHost)`
- Device → Device: `cudaMemcpy(dst, src, size, cudaMemcpyDeviceToDevice)`
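For example, a minimal host-side allocate → copy in → (launch) → copy out → free round trip might look like the sketch below; error checking is omitted here for brevity, and in the project you would wrap these calls with the provided `checkCuda` macro:

```
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
    const int N = 1024;                                  // illustrative size
    std::vector<float> h_in(N, 1.0f), h_out(N, 0.0f);

    float* d_buf = nullptr;
    cudaMalloc(&d_buf, N * sizeof(float));               // global (device DRAM) allocation
    cudaMemcpy(d_buf, h_in.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    // ... launch a kernel that reads/writes d_buf here ...

    cudaMemcpy(h_out.data(), d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
    printf("h_out[0] = %f\n", h_out[0]);
    return 0;
}
```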
4.3 Performance Concepts
Memory coalescing: threads in a warp (32 threads) should access consecutive addresses to combine loads/stores into few transactions.
- ✅ Good: thread 0 → addr[0], thread 1 → addr[1], ..., thread 31 → addr[31] (coalesced into 1-2 transactions)
- ❌ Bad: thread 0 → addr[0], thread 1 → addr[100], thread 2 → addr[200], ... (32 separate transactions)
Warp: A group of 32 threads that execute in lockstep (SIMT - Single Instruction, Multiple Threads). Threads in a warp share the same instruction stream. Warps are the fundamental unit of execution on the GPU.
Synchronization: `__syncthreads()` is a barrier for all threads in a block (not across blocks). Use it before reading shared memory written by other threads (see the short example at the end of this subsection). Important: all threads in a block must reach the barrier, or you’ll deadlock.

Occupancy: The ratio of active warps to maximum warps per SM. Higher occupancy doesn’t always mean better performance (register pressure, shared memory limits).
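To make the barrier rule concrete, here is a small standalone example (not one of the graded tasks; `reverse_within_block` is a hypothetical name) where each block stages data in shared memory and every thread then reads an element written by a different thread:

```
// Illustrative only. Assumes the array length is an exact multiple of
// blockDim.x (e.g. launch with <<<len / 256, 256>>>).
__global__ void reverse_within_block(const float* in, float* out) {
    __shared__ float tile[256];                     // matches blockDim.x == 256
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[idx];                    // each thread fills one slot
    __syncthreads();                                // all writes now visible block-wide
    // Every thread reaches the barrier above -- no early returns before it.
    out[idx] = tile[blockDim.x - 1 - threadIdx.x];  // read a slot another thread wrote
}
```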
4.4 Common Patterns & Best Practices
You’ll always (1) map data to threads, (2) ensure bounds checks, (3) choose grid/block sizes, and (4) verify results.
Bounds checking pattern:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= N) return; // Guard against out-of-bounds

Block size guidelines:
- Use multiples of 32 (warp size): 32, 64, 128, 256, 512, 1024
- For 2D: common sizes are 8×8, 16×16, 32×32
- Balance: larger blocks = more shared memory per block, but fewer blocks = less parallelism
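Putting the bounds check, the round-up grid size, and the block-size guidance together, a minimal illustrative 1D example (the `saxpy_kernel` name and the 256-thread block are just for this sketch, not part of the project) looks like:

```
__global__ void saxpy_kernel(float a, const float* x, float* y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;                        // the last block may be partially full
    y[idx] = a * x[idx] + y[idx];
}

void saxpy(float a, const float* d_x, float* d_y, int n) {
    dim3 block(256);                             // multiple of the warp size (32)
    dim3 grid((n + block.x - 1) / block.x);      // round up so every element is covered
    saxpy_kernel<<<grid, block>>>(a, d_x, d_y, n);
}
```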
4.5 Common Pitfalls
- Forgetting bounds checks: Always check `if (i >= rows || j >= cols) return;` when grid size rounds up.
- Wrong indexing order: Use `threadIdx.x` for columns (fastest-varying dimension) to enable coalescing in row-major layouts.
- Missing synchronization: If threads write then read shared memory, use `__syncthreads()` in between.
- Not checking errors: Always call `checkCuda(cudaGetLastError())` after kernel launches.
- Race conditions: Multiple threads writing to the same global memory location without atomics.
- Register spilling: Too many local variables can cause register spilling to local memory (slow).
4.6 Debugging Tips
- Start small: Test with 8×8 matrices before scaling up.
- Use printf in kernels (sparingly) for small sizes: `if (threadIdx.x == 0 && blockIdx.x == 0) printf("value: %f\n", x);`
- Run cuda-memcheck: `cuda-memcheck ./your_program` to catch memory errors, race conditions, and invalid accesses.
- Check occupancy: Use `nvidia-smi` to monitor GPU usage, or profiling tools if available.
- Synchronize before copying: Use `cudaDeviceSynchronize()` before copying results back to host.
- Verify with small inputs: Always test correctness on small, known inputs first.
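The provided `check.cuh` defines the `checkCuda` macro used throughout. If you are curious what such a macro typically looks like, here is a common illustrative formulation (the provided version may differ; `CHECK_CUDA` is just a stand-in name):

```
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Illustrative error-check macro; the project's check.cuh provides the real one.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)
```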
4.7 CUDA Documentation & Finding Functions
You will need to look up CUDA functions in the official documentation. For each function you use:
- If the function has its own page (e.g., `cudaMalloc`, `cudaMemcpy`), link directly to that page.
- If the function is part of a larger API section (e.g., vector types like `float4`), link to the relevant section of the programming guide.
Key Documentation Resources:
- CUDA C++ Programming Guide - Complete reference for CUDA programming
  - Memory Hierarchy - Detailed memory model
  - Built-in Variables - `blockIdx`, `threadIdx`, etc.
  - Vector Types - `float4`, `int4`, etc.
  - Synchronization Functions - `__syncthreads()`, etc.
- CUDA Runtime API - Host-side functions
  - Memory Management - `cudaMalloc`, `cudaFree`, `cudaMemcpy`
  - Device Management - `cudaDeviceSynchronize`, `cudaGetLastError`
- CUDA Best Practices Guide - Performance optimization tips
Example: If you use cudaMalloc, link to: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g37d37965bfb4803b6d4e59ff26856356
If you use float4, link to: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#vector-types
4.8 Quick Reference
| Concept | Syntax | Notes |
|---|---|---|
| Kernel launch | `kernel<<<grid, block>>>(args)` | Grid/block are `dim3` |
| Global memory alloc | `cudaMalloc(&ptr, size)` | Returns device pointer |
| Copy to device | `cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice)` | |
| Copy from device | `cudaMemcpy(dst, src, size, cudaMemcpyDeviceToHost)` | |
| Shared memory | `__shared__ float sdata[256]` | Per-block, ~48KB limit |
| Synchronization | `__syncthreads()` | Block-level barrier |
| Error check | `checkCuda(cudaGetLastError())` | After kernel launch |
| Device sync | `cudaDeviceSynchronize()` | Wait for kernel completion |
| Vector load | `float4 vec = *reinterpret_cast<float4*>(&ptr[i])` | Aligned access |
| Vector store | `*reinterpret_cast<float4*>(&ptr[i]) = vec` | Aligned access |
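As one illustrative use of the vector types above (not project API; the kernel `scale2_vec4` is hypothetical), the sketch below doubles an array using `float4` loads/stores for the aligned bulk and a scalar loop for the tail:

```
__global__ void scale2_vec4(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;       // one float4 (4 floats) per thread
    int n4 = n / 4;                                      // number of complete float4 chunks
    if (i < n4) {
        float4 v = reinterpret_cast<float4*>(data)[i];   // 16-byte aligned vector load
        v.x *= 2.f; v.y *= 2.f; v.z *= 2.f; v.w *= 2.f;
        reinterpret_cast<float4*>(data)[i] = v;          // vector store
    }
    if (i == 0) {                                        // scalar tail: last n % 4 elements
        for (int t = n4 * 4; t < n; ++t) data[t] *= 2.f;
    }
}
```

Pointers returned by `cudaMalloc` are sufficiently aligned for `float4`; launch enough blocks to cover `n/4` threads, and at least one block so the tail loop runs.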
5 Task 1 — Welcome to 61Cuda: 2D Copy Kernel (10 pts)
5.1 Conceptual Overview
Warm up with indexing and coalesced global memory access. You’ll copy a dense matrix from device input to device output using a 2D grid of 2D blocks, one element per thread.
5.2 Data Flow
- Input: `in ∈ R^{rows×cols}` (row-major, `in[r*cols + c]`), allocated in device memory.
- Processing: For each `(r,c)` covered by a thread, read `in[r,c]` and write it to `out[r,c]`. Use 2D indexing and guard against out-of-bounds.
- Output: `out ∈ R^{rows×cols}` identical to `in` (device memory).
5.3 Your Task
Implement `copy2D_kernel` and the host wrapper `copy2D(...)` in `src/task_1_cuda_copy.cu`.
#include "check.cuh"
__global__ void copy2D_kernel(const float* __restrict__ in,
float* __restrict__ out,
int rows, int cols) {
// TODO: compute r, c using blockIdx/threadIdx and blockDim
// TODO: bounds check: if (r >= rows || c >= cols) return;
// TODO: copy one element
}
void copy2D(const float* d_in, float* d_out, int rows, int cols, dim3 block){
// Suggest block = (16,16,1)
dim3 grid((cols + block.x - 1)/block.x,
(rows + block.y - 1)/block.y);
copy2D_kernel<<<grid, block>>>(d_in, d_out, rows, cols);
checkCuda(cudaGetLastError());
checkCuda(cudaDeviceSynchronize());
}

Testing & Tips
- Prefer `threadIdx.x` to index columns to encourage coalesced row-major accesses.
- Run: `./build/cs61cuda --task=copy --M=64 --N=128 --verify`.
6 Task 2 — CPU Baseline Matmul (15 pts)
6.1 Conceptual Overview
Implement a correct triple‑loop matrix multiply in row‑major order. This is the correctness oracle for later tasks.
6.2 Data Flow
- Input: `A ∈ R^{M×K}`, `B ∈ R^{K×N}` (host memory, row-major).
- Processing: For every `(i,j)`, compute `C[i,j] = Σ_{k=0..K-1} A[i,k] * B[k,j]`.
- Output: `C ∈ R^{M×N}` (host memory).
6.3 Your Task
Fill `mm_cpu(...)` in `src/task_2_cpu_baseline.cpp`.
Starter (skeleton) — src/task_2_cpu_baseline.cpp
#include <cstddef>
void mm_cpu(const float* A, const float* B, float* C,
int M, int N, int K){
// TODO: triple nested loops over i (rows of A), j (cols of B), k (shared dim)
// Use row-major: A[i*K + k], B[k*N + j], C[i*N + j]
}

Testing & Tips
- Use small sizes first (e.g., 8×8×8). GPU results in later tasks are compared against this function with tolerance 1e‑4.
7 Task 3 — CUDA Naive Matmul (30 pts)
7.1 Conceptual Overview
Parallelize Task 2 on the GPU: map each output element C[i,j] to a single thread. This exposes thread‑level parallelism (TLP); performance is often limited by memory bandwidth.
7.2 Data Flow
- Input: `A ∈ R^{M×K}`, `B ∈ R^{K×N}` (device memory), populated from the host.
- Processing: Each thread computes one output `C[i,j]` by streaming over the `k` dimension.
- Output: `C ∈ R^{M×N}` (device memory), copied back to host by the driver when verifying.
7.3 Your Task
Implement mm_naive_kernel and the host launcher mm_naive(...).
Starter (skeleton) — src/task_3_cuda_naive.cu
#include "check.cuh"
__global__ void mm_naive_kernel(const float* __restrict__ A,
const float* __restrict__ B,
float* __restrict__ C,
int M, int N, int K){
// TODO: compute i (row) and j (col) from 2D grid/block
// TODO: bounds guard
// TODO: accumulate over k
}
void mm_naive(const float* dA, const float* dB, float* dC,
int M, int N, int K, dim3 block){
dim3 grid((N + block.x - 1)/block.x,
(M + block.y - 1)/block.y);
mm_naive_kernel<<<grid, block>>>(dA, dB, dC, M, N, K);
checkCuda(cudaGetLastError());
checkCuda(cudaDeviceSynchronize());
}

Notes on memory access
- Threads in a warp vary `j` at a fixed `k`, so accesses to `B[k*N + j]` are coalesced. Threads that share a row all read the same element `A[i*K + k]` (a broadcast), and the cache helps.
Run `./build/cs61cuda --task=naive --M=512 --N=512 --K=512 --verify`.
9 Task 5 — (Optional) Performance Engineering (15 EC)
Make it faster. Ideas:
- Shared‑memory tiling (classic 16×16 or 32×32 tiles).
- Register tiling: each thread computes a small `r×c` tile.
- Loop unrolling & `#pragma unroll` for the `k` loop.
- Occupancy tuning: vary `blockDim` to trade registers vs. parallelism.
- Mixed precision: keep FP32 accumulation but try `__half` inputs (only if you also keep an FP32 correctness path for grading).
- Software prefetch: read the next `k` slice early.
9.1 Performance Targets & Grading
We’ll measure performance in GFLOP/s (billions of floating-point operations per second) on a standard GPU. The formula is: GFLOP/s = (2 × M × N × K) / (time_in_seconds × 1e9).
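For example, a 1024×1024×1024 multiply that takes 5 ms achieves 2·1024³ / (0.005 × 1e9) ≈ 430 GFLOP/s. The timers in `utils.cu` handle measurement for you; purely as an illustrative sketch (assuming `mm_naive` from Task 3 is declared in `include/matmul.h`), a CUDA-event-based timer that reports GFLOP/s could look like this:

```
#include <cuda_runtime.h>
#include "matmul.h"   // assumed to declare mm_naive(...)

// Illustrative: time one call of a matmul launcher and return GFLOP/s.
double time_matmul_gflops(const float* dA, const float* dB, float* dC,
                          int M, int N, int K) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    mm_naive(dA, dB, dC, M, N, K, dim3(16, 16));   // kernel under test
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                    // wait for the kernel to finish

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);        // elapsed time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return (2.0 * M * N * K) / ((ms * 1e-3) * 1e9);  // GFLOP/s per the formula above
}
```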
Grading Rubric:
- Full credit (15 pts): ≥ 2.0× speedup over Task 4 baseline
- Example: If Task 4 runs at 500 GFLOP/s, you need ≥ 1000 GFLOP/s
- Must maintain correctness (within 1e-4 tolerance)
- Must include analysis in README-perf.md
- Partial credit (10 pts): 1.5× - 2.0× speedup over Task 4 baseline
- Example: If Task 4 runs at 500 GFLOP/s, you need 750-1000 GFLOP/s
- Must maintain correctness
- Must include analysis in README-perf.md
- Minimal credit (5 pts): 1.2× - 1.5× speedup over Task 4 baseline
- Example: If Task 4 runs at 500 GFLOP/s, you need 600-750 GFLOP/s
- Must maintain correctness
- Must include analysis in README-perf.md
- No credit (0 pts): < 1.2× speedup, or correctness failures, or missing README-perf.md
Note: Baseline Task 4 performance will vary by GPU. We’ll normalize based on the reference implementation’s performance on the grading hardware. The speedup ratios above are relative to your Task 4 implementation.
We’ll publish a lightweight leaderboard (GFLOP/s). Please write a short README-perf.md describing: * What optimizations you tried * Why each optimization helped (or didn’t) * Performance measurements (GFLOP/s) for Task 4 baseline and your optimized version * Any insights about memory access patterns, occupancy, or other performance factors
Command‑Line Interface (Driver)
./cs61cuda --task={copy|cpu|naive|simd}
--M=1024 --N=1024 --K=1024
--block=16 --vec=4 --repeat=10 --verify

- `--task=copy` ignores `K`.
- `--vec` controls the SIMD width in Task 4; the grader uses 4.
- `--verify` runs `mm_cpu`, then compares results (L2 relative error ≤ 1e-4).
10 Correctness & Floating‑Point Notes
- We compare with a small absolute+relative tolerance (`1e-4`).
- Random inputs in `[-1, 1]` with a fixed seed.
- Your GPU kernels must not read/write out of bounds; failing `cuda-memcheck` is an automatic zero for that test.
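The compare routine in the provided `utils.cu` is authoritative; as a rough sketch of the L2 relative-error idea (the helper name `l2_relative_error` is hypothetical), the check amounts to:

```
#include <cmath>

// Illustrative: relative L2 error between a reference C_ref and a GPU result C_gpu.
double l2_relative_error(const float* C_ref, const float* C_gpu, int count) {
    double diff2 = 0.0, ref2 = 0.0;
    for (int i = 0; i < count; ++i) {
        double d = static_cast<double>(C_ref[i]) - static_cast<double>(C_gpu[i]);
        diff2 += d * d;
        ref2  += static_cast<double>(C_ref[i]) * C_ref[i];
    }
    return std::sqrt(diff2) / (std::sqrt(ref2) + 1e-30);   // guard against divide-by-zero
}
// Accept if l2_relative_error(...) <= 1e-4.
```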
11 Debugging Checklist
- After each kernel launch: `cudaDeviceSynchronize(); checkCuda(cudaGetLastError());`
- Use `printf` inside kernels sparingly on tiny sizes.
- Try `cuda-memcheck` for invalid addresses / race conditions.
- Start small (e.g., `M=N=K=8`), then scale.
12 Style & Submission
- Clear variable names (`i`, `j`, `k`, `M`, `N`, `K`).
- Comments: what the mapping is, what each thread computes, any assumptions.
- No undefined behavior (no out‑of‑bounds pointer math, no aliasing shenanigans).
- Submit your edited `.cpp`/`.cu` files and `README.md` answering the reflection prompts below.
12.1 Reflection (graded in Task 5 rubric even if you skip perf EC)
- Where is your naive kernel memory‑bound? Which array dominates traffic and why?
- Why does vectorizing along columns improve coalescing? What’s the trade‑off?
- What would shared‑memory tiling change about the access pattern?
13 Reference: Shapes & Indexing
Row‑major:
A: MxK → A[i*K + k]
B: KxN → B[k*N + j]
C: MxN → C[i*N + j]

Grid/block formulas used throughout:
int i = blockIdx.y * blockDim.y + threadIdx.y;
int j = blockIdx.x * blockDim.x + threadIdx.x;

14 Rubric Details
Task 1 (10)
- correct 2D indexing + bounds
- coalesced mapping (x→cols)
- passes tests, tidy style
Task 2 (15)
- correct triple-loop for arbitrary M,N,K
- clear comments & no UB
Task 3 (30)
- correct one-element/thread mapping
- correct launch geometry & bounds
- reasonable performance (scaled timing)
- comments explaining memory pattern
Task 4 (30)
- correct SIMD (V outputs/thread), handles tails
- proper vector loads/stores when aligned
- coalesced writes & justified mapping
- performance better than naive on large sizes
Task 5 EC (15)
- measurable speedup over Task 4
- README-perf.md analysis