CS61Cuda Project: Matmul & CUDA Fundamentals

Welcome to CS61Cuda!! This is a mini‑project that introduces you to GPU programming with CUDA by building up to a fast matrix multiply. You’ll start with a CPU reference, write your first CUDA kernels, learn to reason about grids/blocks/threads, and finally add simple vectorization (SIMD) on the GPU. An optional performance sandbox lets you explore optimizations for bragging rights.

1 Learning Goals

  • Map data‑parallel work to CUDA’s grid/block/thread hierarchy.
  • Practice indexing and bounds checks in 1D/2D.
  • Understand memory access patterns (coalescing) and why they matter.
  • See the benefits of TLP (thread‑level parallelism) and DLP (data‑level/SIMD) on the GPU.
  • Read simple performance counters and reason about memory‑ vs compute‑bound kernels.

2 Repo Layout & What You Edit

cs61cuda/
├─ CMakeLists.txt # or Makefile (both provided)
├─ include/
│ └─ matmul.h # shared function prototypes
├─ src/
│ ├─ main.cpp # driver: parses flags, allocs, calls your code
│ ├─ task_1_cuda_copy.cu # ✏️ Task 1: 2D copy kernel
│ ├─ task_2_cpu_baseline.cpp # ✏️ Task 2: implement CPU matmul
│ ├─ task_3_cuda_naive.cu # ✏️ Task 3: naive CUDA matmul (1 output/thread)
│ ├─ task_4_cuda_simd.cu # ✏️ Task 4: vectorized CUDA matmul (no shared mem)
│ ├─ utils.cu # timers, error checks, random init, compare
│ └─ check.cuh # CUDA error macros (provided)
├─ tests/
│ ├─ correctness_tests.py # local correctness checks
│ └─ perf_runner.py # runs sizes & prints throughput
├─ data/
│ └─ generate.py # makes small sample matrices (optional)
├─ scripts/
│ ├─ build.sh # nvcc or cmake build helper
│ └─ run_all.sh # runs all tasks & tests
└─ README.md # quick overview

You will edit: src/task_1_cuda_copy.cu, src/task_2_cpu_baseline.cpp, src/task_3_cuda_naive.cu, src/task_4_cuda_simd.cu.

3 Grading Summary

| Task | Description | Points |
| --- | --- | --- |
| 1 | Welcome to 61Cuda: Copy kernel (2D grid, indexing, bounds, coalescing) | 10 |
| 2 | Baseline CPU matmul (triple-loop reference) | 15 |
| 3 | CUDA naive matmul (1 output/thread) | 30 |
| 4 | CUDA SIMD matmul (vectorized loads, no shared memory) | 30 |
| 5 | Performance engineering (optional) | 15 EC |

Total: 85 required + 15 extra credit.

Autograder policy. We check correctness on hidden sizes, measure run times on a standard GPU, and check basic style (clear bounds checks, no UB). We do not require a specific speedup, but we do verify that your kernels scale sensibly with size.

Collaboration. Discuss ideas at a high level with peers; the code you submit must be your own. Cite any online sources you consulted.

4 CUDA Primer (read first)

4.1 Core Concepts

  • A kernel is a C/C++ function annotated __global__ and launched with <<<grid, block>>>.
  • Each launch creates a 2‑D/3‑D grid of blocks; each block contains many threads. Every thread runs the same kernel on different data.

Example: myKernel<<<dim3(2,2), dim3(16,16)>>>(args) creates a 2×2 grid of blocks, each with 16×16 threads (total: 4 blocks × 256 threads = 1024 threads).

  • Built‑in variables inside kernels: blockIdx.{x,y,z}, threadIdx.{x,y,z}, blockDim.{x,y,z}, gridDim.{x,y,z}.

2D Indexing Example:

int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
int j = blockIdx.x * blockDim.x + threadIdx.x;  // column

Grid size calculation: When dimensions don’t divide evenly, round up:

dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);

4.2 Memory Hierarchy

  • Global (device DRAM): large, high‑latency (~400-800 cycles); visible to all threads. Allocated with cudaMalloc(), accessed via pointers. Persists across kernel launches.
  • Shared (on‑chip, per‑block): small (~48KB per block by default), low‑latency (~20-30 cycles); visible to threads in the same block. Declared with __shared__. Contents do not persist across kernel launches.
  • Registers (per‑thread): fastest (~1 cycle), limited (up to 255 per thread). Automatic for local variables. Private to each thread.

Memory transfer patterns:

  • Host → Device: cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice)
  • Device → Host: cudaMemcpy(dst, src, size, cudaMemcpyDeviceToHost)
  • Device → Device: cudaMemcpy(dst, src, size, cudaMemcpyDeviceToDevice)
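
A minimal sketch of the full round trip, using a hypothetical roundTrip helper (error checks omitted for brevity; in the project, wrap each call in checkCuda, and note that main.cpp already handles allocation and copies for you):

#include <cuda_runtime.h>
#include <cstddef>

// Illustrative helper (not part of the starter code): copies a rows x cols
// float matrix to the device, would launch a kernel, and copies the result back.
void roundTrip(const float* h_in, float* h_out, int rows, int cols) {
    size_t bytes = static_cast<size_t>(rows) * cols * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  bytes);                               // device allocations
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   // host -> device
    // ... launch a kernel that reads d_in and writes d_out ...
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost); // device -> host
    cudaFree(d_in);
    cudaFree(d_out);
}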

4.3 Performance Concepts

  • Memory coalescing: threads in a warp (32 threads) should access consecutive addresses so the hardware can combine their loads/stores into a few transactions (see the sketch at the end of this subsection).

    • Good: thread 0 → addr[0], thread 1 → addr[1], ..., thread 31 → addr[31] (coalesced into 1-2 transactions)
    • Bad: thread 0 → addr[0], thread 1 → addr[100], thread 2 → addr[200], ... (32 separate transactions)
  • Warp: A group of 32 threads that execute in lockstep (SIMT - Single Instruction, Multiple Threads). Threads in a warp share the same instruction stream. Warps are the fundamental unit of execution on the GPU.

  • Synchronization: __syncthreads() is a barrier for all threads in a block (not across blocks). Use before reading shared memory written by other threads. Important: All threads in a block must reach the barrier, or you’ll deadlock.

  • Occupancy: The ratio of active warps to maximum warps per SM. Higher occupancy doesn’t always mean better performance (register pressure, shared memory limits).
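
To make the coalescing bullet above concrete, here is an illustrative pair of kernels (not part of the starter code) contrasting a coalesced copy with a strided one; the strided version forces each warp's memory requests to be split into many separate transactions:

__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];          // adjacent threads read adjacent addresses
}

// With stride = 32, adjacent threads read addresses 32 floats apart,
// so each warp's load is served by many separate memory transactions.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * stride) % n;           // scattered read pattern
    if (i < n) out[i] = in[j];
}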

4.4 Common Patterns & Best Practices

You’ll always (1) map data to threads, (2) ensure bounds checks, (3) choose grid/block sizes, and (4) verify results.

Bounds checking pattern:

int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= N) return;  // Guard against out-of-bounds

Block size guidelines:

  • Use multiples of 32 (the warp size): 32, 64, 128, 256, 512, 1024 threads per block.
  • For 2D kernels, common shapes are 8×8, 16×16, and 32×32.
  • Balance: larger blocks can use more shared memory and registers per block, but they also mean fewer blocks in flight, which can limit parallelism.
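
Putting the mapping, bounds check, and block-size guidelines together, here is a minimal 1D sketch (a hypothetical scale kernel, not one of the project tasks):

__global__ void scale_kernel(float* data, float alpha, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (idx >= n) return;                             // guard: grid is rounded up
    data[idx] *= alpha;
}

void scale(float* d_data, float alpha, int n) {
    dim3 block(256);                                  // multiple of the warp size
    dim3 grid((n + block.x - 1) / block.x);           // round up to cover all n elements
    scale_kernel<<<grid, block>>>(d_data, alpha, n);
}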

4.5 Common Pitfalls

  • Forgetting bounds checks: Always check if (i >= rows || j >= cols) return; when grid size rounds up.
  • Wrong indexing order: Use threadIdx.x for columns (fastest dimension) to enable coalescing in row-major layouts.
  • Missing synchronization: If threads write then read shared memory, use __syncthreads() between.
  • Not checking errors: Always call checkCuda(cudaGetLastError()) after kernel launches.
  • Race conditions: Multiple threads writing to the same global memory location without atomics.
  • Register spilling: Too many local variables can cause register spilling to local memory (slow).

4.6 Debugging Tips

  • Start small: Test with 8×8 matrices before scaling up.
  • Use printf in kernels (sparingly) for small sizes: if (threadIdx.x == 0 && blockIdx.x == 0) printf("value: %f\n", x);
  • Run cuda-memcheck: cuda-memcheck ./your_program to catch memory errors, race conditions, and invalid accesses.
  • Check occupancy: Use nvidia-smi to monitor GPU usage, or profiling tools if available.
  • Synchronize before copying: Use cudaDeviceSynchronize() before copying results back to host.
  • Verify with small inputs: Always test correctness on small, known inputs first.
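
The provided check.cuh already defines the error macro you should use; the sketch below only illustrates what a checkCuda-style wrapper typically looks like (the provided version may differ):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative only -- use the macro from check.cuh in your submission.
#define checkCuda(call)                                                      \
    do {                                                                     \
        cudaError_t err_ = (call);                                           \
        if (err_ != cudaSuccess) {                                           \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                         cudaGetErrorString(err_), __FILE__, __LINE__);      \
            std::exit(EXIT_FAILURE);                                         \
        }                                                                    \
    } while (0)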

4.7 CUDA Documentation & Finding Functions

You will need to look up CUDA functions in the official documentation. For each function you use:

  • If the function has its own page (e.g., cudaMalloc, cudaMemcpy), link directly to that page.
  • If the function is part of a larger API section (e.g., vector types like float4), link to the relevant section of the programming guide.

Key documentation resources are the CUDA Runtime API reference and the CUDA C++ Programming Guide.

Example: If you use cudaMalloc, link to: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g37d37965bfb4803b6d4e59ff26856356

If you use float4, link to: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#vector-types

4.8 Quick Reference

| Concept | Syntax | Notes |
| --- | --- | --- |
| Kernel launch | kernel<<<grid, block>>>(args) | Grid/block are dim3 |
| Global memory alloc | cudaMalloc(&ptr, size) | Returns device pointer |
| Copy to device | cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice) | |
| Copy from device | cudaMemcpy(dst, src, size, cudaMemcpyDeviceToHost) | |
| Shared memory | __shared__ float sdata[256] | Per-block, ~48KB limit |
| Synchronization | __syncthreads() | Block-level barrier |
| Error check | checkCuda(cudaGetLastError()) | After kernel launch |
| Device sync | cudaDeviceSynchronize() | Wait for kernel completion |
| Vector load | float4 vec = *reinterpret_cast<float4*>(&ptr[i]) | Aligned access |
| Vector store | *reinterpret_cast<float4*>(&ptr[i]) = vec | Aligned access |
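
As a concrete use of the vector load/store rows above, here is an illustrative float4 copy with a scalar tail (a sketch, not part of the starter code; it assumes the buffers start 16-byte aligned, which cudaMalloc guarantees for the base pointer):

__global__ void copy_vec4(const float* __restrict__ in,
                          float* __restrict__ out, int n) {
    int i4 = blockIdx.x * blockDim.x + threadIdx.x;    // index in units of float4
    int n4 = n / 4;                                    // number of full float4 groups
    if (i4 < n4) {
        float4 v = *reinterpret_cast<const float4*>(&in[4 * i4]);  // one 16-byte load
        *reinterpret_cast<float4*>(&out[4 * i4]) = v;              // one 16-byte store
    } else if (i4 == n4) {
        for (int k = 4 * n4; k < n; ++k)               // scalar tail: at most 3 elements
            out[k] = in[k];
    }
}

void copy_vec4_launch(const float* d_in, float* d_out, int n) {
    int work = (n + 3) / 4;                            // one thread per group, rounded up
    if (work == 0) return;
    dim3 block(256);
    dim3 grid((work + block.x - 1) / block.x);
    copy_vec4<<<grid, block>>>(d_in, d_out, n);
}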

5 Task 1 — Welcome to 61Cuda: 2D Copy Kernel (10 pts)

5.1 Conceptual Overview

Warm up with indexing and coalesced global memory access. You’ll copy a dense matrix from device input to device output using a 2D grid of 2D blocks, one element per thread.

5.2 Data Flow

  • Input: in ∈ R^{rows×cols} (row‑major in[r*cols + c]), allocated in device memory.
  • Processing: For each (r,c) covered by a thread, read in[r,c] and write it to out[r,c]. Use 2‑D indexing and guard against out‑of‑bounds.
  • Output: out ∈ R^{rows×cols} identical to in (device memory).

5.3 Your Task

Implement copy2D_kernel and the host wrapper copy2D(...) in src/task_1_cuda_copy.cu.

#include "check.cuh"

__global__ void copy2D_kernel(const float* __restrict__ in,
float* __restrict__ out,
int rows, int cols) {
// TODO: compute r, c using blockIdx/threadIdx and blockDim
// TODO: bounds check: if (r >= rows || c >= cols) return;
// TODO: copy one element
}

void copy2D(const float* d_in, float* d_out, int rows, int cols, dim3 block){
// Suggest block = (16,16,1)
dim3 grid((cols + block.x - 1)/block.x,
(rows + block.y - 1)/block.y);
copy2D_kernel<<<grid, block>>>(d_in, d_out, rows, cols);
checkCuda(cudaGetLastError());
checkCuda(cudaDeviceSynchronize());
}

Testing & Tips

  • Prefer threadIdx.x to index columns to encourage coalesced row‑major accesses.
  • Run: ./build/cs61cuda --task=copy --M=64 --N=128 --verify.

6 Task 2 — CPU Baseline Matmul (15 pts)

6.1 Conceptual Overview

Implement a correct triple‑loop matrix multiply in row‑major order. This is the correctness oracle for later tasks.

6.2 Data Flow

  • Input: A ∈ R^{M×K}, B ∈ R^{K×N} (host memory, row‑major).
  • Processing: For every (i,j), compute C[i,j] = Σ_{k=0..K-1} A[i,k] * B[k,j].
  • Output: C ∈ R^{M×N} (host memory).

6.3 Your Task

Fill in mm_cpu(...) in src/task_2_cpu_baseline.cpp.

Starter (skeleton) — src/task_2_cpu_baseline.cpp

#include <cstddef>

void mm_cpu(const float* A, const float* B, float* C,
            int M, int N, int K) {
    // TODO: triple nested loops over i (rows of A), j (cols of B), k (shared dim)
    // Use row-major: A[i*K + k], B[k*N + j], C[i*N + j]
}

Testing & Tips

  • Use small sizes first (e.g., 8×8×8). GPU results in later tasks are compared against this function with tolerance 1e‑4.

7 Task 3 — CUDA Naive Matmul (30 pts)

7.1 Conceptual Overview

Parallelize Task 2 on the GPU: map each output element C[i,j] to a single thread. This exposes thread‑level parallelism (TLP); performance is often limited by memory bandwidth.

7.2 Data Flow

  • Input: A ∈ R^{M×K}, B ∈ R^{K×N} (device memory), populated from host.
  • Processing: Each thread computes one output C[i,j] by streaming the k dimension.
  • Output: C ∈ R^{M×N} (device memory), copied back to host by the driver when verifying.

7.3 Your Task

Implement mm_naive_kernel and the host launcher mm_naive(...).

Starter (skeleton) — src/task_3_cuda_naive.cu

#include "check.cuh"

__global__ void mm_naive_kernel(const float* __restrict__ A,
const float* __restrict__ B,
float* __restrict__ C,
int M, int N, int K){
// TODO: compute i (row) and j (col) from 2D grid/block
// TODO: bounds guard
// TODO: accumulate over k
}

void mm_naive(const float* dA, const float* dB, float* dC,
int M, int N, int K, dim3 block){
dim3 grid((N + block.x - 1)/block.x,
(M + block.y - 1)/block.y);
mm_naive_kernel<<<grid, block>>>(dA, dB, dC, M, N, K);
checkCuda(cudaGetLastError());
checkCuda(cudaDeviceSynchronize());
}

Notes on memory access

  • Threads in a warp vary j at fixed k, so loads of B[k*N + j] are coalesced. Loads of A[i*K + k] fetch the same element for every thread in the warp that shares row i, so they are effectively broadcasts served by the cache.

Run: ./build/cs61cuda --task=naive --M=512 --N=512 --K=512 --verify.

8 Task 4 — CUDA SIMD Matmul (Vectorized, No Shared Memory) (30 pts)

8.1 Conceptual Overview

Augment TLP with simple data‑level parallelism (DLP): each thread computes a short contiguous vector of outputs in a row using vector loads/stores (e.g., float4). No shared memory yet.

8.2 Data Flow

  • Input: A ∈ R^{M×K}, B ∈ R^{K×N} (device).
  • Processing: A thread at (i, j0_group) produces V outputs C[i, j0..j0+V-1]. Inner loop streams over k, reading one scalar A[i,k] and a vector of V neighbors from row B[k, :].
  • Output: C ∈ R^{M×N} (device).

8.3 Your Task

Implement mm_simd_kernel<V>() and its launcher. Default V=4; handle tails where N % V ≠ 0.

Starter (skeleton) — src/task_4_cuda_simd.cu

#include "check.cuh"

template<int V>
__global__ void mm_simd_kernel(const float* __restrict__ A,
const float* __restrict__ B,
float* __restrict__ C,
int M, int N, int K){
// TODO: compute i (row) and j0 (first column of this thread's vector)
// TODO: maintain acc[V] and handle aligned vs tail paths
}

void mm_simd(const float* dA, const float* dB, float* dC,
int M, int N, int K, int vec, dim3 block){
// TODO: call specialized kernel for vec==4, else fall back
}

Run: ./build/cs61cuda --task=simd --M=1024 --N=1024 --K=1024 --vec=4 --verify.

Discussion prompts

  • Why does vectorizing along columns improve coalescing for loads from B and stores to C?
  • What changes would shared‑memory tiling introduce (Task 5 EC)?

9 Task 5 — (Optional) Performance Engineering (15 EC)

Make it faster. Ideas:

  • Shared‑memory tiling (classic 16×16 or 32×32 tiles).
  • Register tiling: each thread computes a small r×c tile.
  • Loop unrolling & #pragma unroll for k.
  • Occupancy tuning: vary blockDim to trade registers vs parallelism.
  • Mixed precision: keep FP32 accumulate but try __half inputs (only if you also keep a FP32 correctness path for grading).
  • Software prefetch: read the next k slice early.

9.1 Performance Targets & Grading

We’ll measure performance in GFLOP/s (billions of floating-point operations per second) on a standard GPU. The formula is: GFLOP/s = (2 × M × N × K) / (time_in_seconds × 1e9).
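
For reference, one way to time a kernel and convert to GFLOP/s is with CUDA events. The sketch below assumes the mm_naive launcher signature from Task 3 and that include/matmul.h declares it; the official numbers come from tests/perf_runner.py:

#include <cuda_runtime.h>
#include "matmul.h"   // assumed to declare mm_naive(...) from Task 3

// Sketch: time `repeat` launches of the naive kernel and return GFLOP/s.
double time_naive_gflops(const float* dA, const float* dB, float* dC,
                         int M, int N, int K, int repeat) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < repeat; ++r)
        mm_naive(dA, dB, dC, M, N, K, dim3(16, 16));
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // wait for all launches to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // total milliseconds across repeats
    double seconds = (ms / 1e3) / repeat;    // average time per launch

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (2.0 * M * N * K) / (seconds * 1e9);   // GFLOP/s
}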

Grading Rubric:

  • Full credit (15 pts): ≥ 2.0× speedup over Task 4 baseline
    • Example: If Task 4 runs at 500 GFLOP/s, you need ≥ 1000 GFLOP/s
    • Must maintain correctness (within 1e-4 tolerance)
    • Must include analysis in README-perf.md
  • Partial credit (10 pts): 1.5× - 2.0× speedup over Task 4 baseline
    • Example: If Task 4 runs at 500 GFLOP/s, you need 750-1000 GFLOP/s
    • Must maintain correctness
    • Must include analysis in README-perf.md
  • Minimal credit (5 pts): 1.2× - 1.5× speedup over Task 4 baseline
    • Example: If Task 4 runs at 500 GFLOP/s, you need 600-750 GFLOP/s
    • Must maintain correctness
    • Must include analysis in README-perf.md
  • No credit (0 pts): < 1.2× speedup, or correctness failures, or missing README-perf.md

Note: Baseline Task 4 performance will vary by GPU. We’ll normalize based on the reference implementation’s performance on the grading hardware. The speedup ratios above are relative to your Task 4 implementation.

We’ll publish a lightweight leaderboard (GFLOP/s). Please write a short README-perf.md describing:

  • What optimizations you tried
  • Why each optimization helped (or didn’t)
  • Performance measurements (GFLOP/s) for the Task 4 baseline and your optimized version
  • Any insights about memory access patterns, occupancy, or other performance factors

Command‑Line Interface (Driver)

./cs61cuda --task={copy|cpu|naive|simd} \
           --M=1024 --N=1024 --K=1024 \
           --block=16 --vec=4 --repeat=10 --verify
  • --task=copy ignores K.
  • --vec controls SIMD width in Task 4; grader uses 4.
  • --verify runs mm_cpu then compares results (L2 relative error ≤ 1e-4).

10 Correctness & Floating‑Point Notes

  • We compare with a small absolute+relative tolerance (1e-4).
  • Random inputs in [-1, 1] with fixed seed.
  • Your GPU kernels must not read/write out of bounds; failing cuda-memcheck is an automatic zero for that test.
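
The driver's comparison lives in utils.cu; the sketch below only illustrates one way an L2 relative-error check might look (the provided version may differ):

#include <cmath>

// Illustrative only -- the driver uses the comparison from utils.cu.
// Returns true if ||gpu - ref||_2 <= tol * ||ref||_2 (plus a tiny absolute floor).
bool close_l2(const float* gpu, const float* ref, int count, double tol = 1e-4) {
    double diff2 = 0.0, ref2 = 0.0;
    for (int idx = 0; idx < count; ++idx) {
        double d = static_cast<double>(gpu[idx]) - ref[idx];
        diff2 += d * d;
        ref2  += static_cast<double>(ref[idx]) * ref[idx];
    }
    return std::sqrt(diff2) <= tol * std::sqrt(ref2) + 1e-12;
}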

11 Debugging Checklist

  • After each kernel launch:
checkCuda(cudaGetLastError());      // catches launch-configuration errors
checkCuda(cudaDeviceSynchronize()); // catches asynchronous execution errors
  • Use printf inside kernels sparingly on tiny sizes.
  • Try cuda-memcheck for invalid addresses / race conditions.
  • Start small (e.g., M=N=K=8) then scale.

12 Style & Submission

  • Clear variable names (i, j, k, M, N, K).
  • Comments: what the mapping is, what each thread computes, any assumptions.
  • No undefined behavior (no out‑of‑bounds pointer math, no aliasing shenanigans).
  • Submit your edited .cpp/.cu files and README.md answering the reflection prompts below.

12.1 Reflection (graded in Task 5 rubric even if you skip perf EC)

  1. Where is your naive kernel memory‑bound? Which array dominates traffic and why?
  2. Why does vectorizing along columns improve coalescing? What’s the trade‑off?
  3. What would shared‑memory tiling change about the access pattern?

13 Reference: Shapes & Indexing

Row‑major:

A: MxK → A[i*K + k]
B: KxN → B[k*N + j]
C: MxN → C[i*N + j]

Grid/block formulas used throughout:

int i = blockIdx.y * blockDim.y + threadIdx.y;
int j = blockIdx.x * blockDim.x + threadIdx.x;

14 Rubric Details

Task 1 (10)

    1. correct 2D indexing + bounds
    2. coalesced mapping (x→cols)
    3. passes tests, tidy style

Task 2 (15)

    1. correct triple‑loop for arbitrary M,N,K
    2. clear comments & no UB

Task 3 (30)

    1. correct one‑element/thread mapping
    2. correct launch geometry & bounds
    3. reasonable performance (scaled timing)
    4. comments explaining memory pattern

Task 4 (30)

    1. correct SIMD (V outputs/thread), handles tails
    2. proper vector loads/stores when aligned
    3. coalesced writes & justified mapping
    4. performance better than naive on large sizes

Task 5 EC (15)

    1. measurable speedup over Task 4
    2. README‑perf.md analysis