GPU matrix multiplication: a literate CUDA program
Table of Contents

1. Mathematical preliminaries
2. The inner loop of matrix multiplication
3. The full function to multiply \(A\) and \(B\) on the CPU
4. The full function to multiply \(A\) and \(B\) on the GPU
5. The call site for matrixMultiplyKernel
6. The main entry point
7. Results
8. Appendix
This is a literate program that implements matrix multiplication in C++ using NVIDIA CUDA to accelerate the computation. I tested it on an NVIDIA GeForce GTX 1060 as well as a GeForce GTX 1660; the speedup over the CPU is about 1000x.
1. Mathematical preliminaries
An \(n\times n\) matrix \(A\) implements a linear function: \(A : \mathbb{R}^n \to \mathbb{R}^n\).
Given two \(n\times n\) matrices \(A\) and \(B\), we can compute \(C = AB\) using this component-wise formula:
\[ C_{ij} = \sum_{k=1}^n A_{ik}B_{kj} \]
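For example, for two \(2\times 2\) matrices the formula expands to:
\[
\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
\begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}
=
\begin{pmatrix}
a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\
a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22}
\end{pmatrix}
\]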
The straightforward way to implement this in C++ is to represent \(A\) as an array of float values of size n*n in row-major order; the mapping from \((i,j)\) to the array index is then \(i \cdot n + j\).
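As a quick illustration of that layout (the idx helper here is just for this sketch, it is not part of the program below):

// Row-major layout: element (i, j) of an n x n matrix lives at index i * n + j.
inline int idx(int i, int j, int n) { return i * n + j; }

void layoutExample(int n) {
    float* A = new float[n * n];   // flat storage for an n x n matrix
    A[idx(2, 3, n)] = 1.0f;        // same element as A[2 * n + 3]
    delete[] A;
}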
2. The inner loop of matrix multiplication
To calculate \(C_{ij} = \sum_{k=1}^n A_{ik}B_{kj}\), we start with the inner loop:
float sum = 0.0f;
for (int k = 0; k < n; k++) {
    sum += A[i * n + k] * B[k * n + j];
}
C[i * n + j] = sum;
3. The full function to multiply \(A\) and \(B\) on the CPU
To complete the CPU function, we just wrap the inner loop in loops over i and j.
// CPU matrix multiplication
void matrixMultiplyCPU(float* A, float* B, float* C, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            <<inner-loop>>
        }
    }
}
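As a quick sanity check, here is a tiny driver (illustrative only, not part of the program below) that multiplies two 2x2 matrices with this function; the expected result follows from the formula in section 1:

// Multiply two 2 x 2 matrices as a sanity check (row-major order).
float A2[4] = {1, 2, 3, 4};    // [[1, 2], [3, 4]]
float B2[4] = {5, 6, 7, 8};    // [[5, 6], [7, 8]]
float C2[4] = {0, 0, 0, 0};
matrixMultiplyCPU(A2, B2, C2, 2);
// Expected: C2 = {19, 22, 43, 50}, i.e. [[19, 22], [43, 50]]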
4. The full function to multiply \(A\) and \(B\) on the GPU
NVIDIA CUDA provides a framework for running many copies of a function (a kernel) in parallel across the GPU's cores.
The CUDA framework allows for 2D programming: notice the .x and .y values in the kernel below, which span a 2D \((x,y)\) plane.
// CUDA kernel for matrix multiplication
__global__ void matrixMultiplyKernel(float* A, float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int i = row;
    int j = col;
    if (row < n && col < n) {
        <<inner-loop>>
    }
}
To understand how we allocate this \((x,y)\) plane on the CUDA device, let's go up to the call site where matrixMultiplyKernel is called:
matrixMultiplyKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);
5. The call site for matrixMultiplyKernel
When you call a CUDA kernel from C++, you have to specify the grid dimension and the block dimension. The grid dimension defines the number of "thread blocks" to launch across the GPU device. The block dimension defines the number of threads that run inside each thread block.
matrixMultiplyKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);
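With 16 x 16 thread blocks, one thread per matrix element, the two dimensions are computed with a ceiling division so that every row and column of the \(N\times N\) matrix is covered. This is the same configuration used in the main function below:

// 256 threads per block, arranged as a 16 x 16 tile.
dim3 blockDim(16, 16);
// Enough blocks in each direction to cover all N columns and N rows.
dim3 gridDim((N + blockDim.x - 1) / blockDim.x,
             (N + blockDim.y - 1) / blockDim.y);
matrixMultiplyKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);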
6. The main entry point
In CUDA there's a dichotomy between host memory (allocated with new and malloc) and device memory, which lives on the GPU (allocated with cudaMalloc).
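A minimal sketch of the two kinds of allocation (the buffer names here are illustrative):

// Host memory: ordinary C++ allocation, accessible from CPU code.
float* h_buf = new float[N * N];

// Device memory: lives on the GPU; the host only touches it via cudaMemcpy.
float* d_buf = nullptr;
cudaMalloc(&d_buf, N * N * sizeof(float));

// ... use the buffers ...

cudaFree(d_buf);
delete[] h_buf;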
The basic flow of a CUDA C/C++ program is:
- allocate your host memory
- allocate your device memory
- initialize (or load) your data in host memory
- copy the data to device memory
- run CUDA kernels on the device
- copy the results back to the host
Now here's the main function for this matrix multiplication example, with some timing code:
#include <iostream>
#include <chrono>
#include <cmath>
#include <algorithm>
#include <cuda_runtime.h>

// Matrix dimensions
const int N = 1024; // Matrix size N x N

int main() {
    float *h_A, *h_B, *h_C_cpu, *h_C_gpu; // Host matrices
    float *d_A, *d_B, *d_C;               // Device matrices

    // Allocate host memory
    h_A = new float[N * N];
    h_B = new float[N * N];
    h_C_cpu = new float[N * N];
    h_C_gpu = new float[N * N];

    // Initialize matrices
    initializeMatrix(h_A, N);
    initializeMatrix(h_B, N);

    // CPU Matrix Multiplication
    auto cpu_start = std::chrono::high_resolution_clock::now();
    matrixMultiplyCPU(h_A, h_B, h_C_cpu, N);
    auto cpu_end = std::chrono::high_resolution_clock::now();
    auto cpu_duration = std::chrono::duration_cast<std::chrono::milliseconds>(cpu_end - cpu_start);

    // Allocate device memory
    cudaMalloc(&d_A, N * N * sizeof(float));
    cudaMalloc(&d_B, N * N * sizeof(float));
    cudaMalloc(&d_C, N * N * sizeof(float));

    // Copy data to device
    cudaMemcpy(d_A, h_A, N * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, N * N * sizeof(float), cudaMemcpyHostToDevice);

    // Define grid and block dimensions
    dim3 blockDim(16, 16);
    dim3 gridDim((N + blockDim.x - 1) / blockDim.x, (N + blockDim.y - 1) / blockDim.y);

    // GPU Matrix Multiplication
    auto gpu_start = std::chrono::high_resolution_clock::now();
    matrixMultiplyKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);
    cudaDeviceSynchronize();
    auto gpu_end = std::chrono::high_resolution_clock::now();
    auto gpu_duration = std::chrono::duration_cast<std::chrono::milliseconds>(gpu_end - gpu_start);

    // Copy result back to host
    cudaMemcpy(h_C_gpu, d_C, N * N * sizeof(float), cudaMemcpyDeviceToHost);

    // Verify results
    float maxError = 0.0f;
    for (int i = 0; i < N * N; i++) {
        maxError = std::max(maxError, std::abs(h_C_cpu[i] - h_C_gpu[i]));
    }

    // Print results
    std::cout << "Matrix size: " << N << " x " << N << std::endl;
    std::cout << "CPU Time: " << cpu_duration.count() << " ms" << std::endl;
    std::cout << "GPU Time: " << gpu_duration.count() << " ms" << std::endl;
    std::cout << "Maximum Error: " << maxError << std::endl;

    // Cleanup
    delete[] h_A;
    delete[] h_B;
    delete[] h_C_cpu;
    delete[] h_C_gpu;
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    return 0;
}
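The main function calls initializeMatrix, which isn't listed elsewhere in this write-up; it would need to be defined (or declared) above main. A minimal sketch, assuming we just want pseudo-random values in [0, 1] (the initialization used for the timings above may differ):

// Fill an n x n matrix with pseudo-random values in [0, 1].
// Requires <cstdlib> for rand() and RAND_MAX.
void initializeMatrix(float* M, int n) {
    for (int i = 0; i < n * n; i++) {
        M[i] = static_cast<float>(rand()) / RAND_MAX;
    }
}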
7. Results
On my NVIDIA GeForce GTX 1060, the GPU version took 7 ms, vs 7,281 ms on the CPU (Intel Xeon, 3 GHz):
./matmul
Matrix size: 1024 x 1024
CPU Time: 7281 ms
GPU Time: 7 ms
Maximum Error: 9.15527e-05
Hire us if you are interested in accelerating your software using your older NVIDIA GPUs. We don't use Python, PyTorch, or any fancy deep learning frameworks; we use first-principles thinking to make C/C++ programs that offload parallelizable work to the GPU device.
If you are interested in hiring a team of experienced CUDA C++ developers, reach out to tobi@defini.dev
8. Appendix
8.1. Imports
#include <iostream>
#include <chrono>
#include <cmath>
#include <algorithm>
#include <cuda_runtime.h>

// Matrix dimensions
const int N = 1024; // Matrix size N x N