GPU matrix multiplication: a literate CUDA program
Table of Contents

1. Mathematical preliminaries
2. The inner loop of matrix multiplication
3. The full function to multiply \(A\) and \(B\) on the CPU
4. The full function to multiply \(A\) and \(B\) on the GPU
5. The call site for matrixMultiplyKernel
6. The main entry point
7. Results
8. Appendix
This is a literate program that implements matrix multiplication in C++ using NVIDIA CUDA to accelerate the computation. I tested it on an NVIDIA GeForce GTX 1060 as well as a GeForce GTX 1660; the speedup over the CPU is about 1000x.
1. Mathematical preliminaries
An \(n\times n\) matrix \(A\) implements a linear function: \(A : \mathbb{R}^n \to \mathbb{R}^n\).
Given two \(n\times n\) matrices \(A\) and \(B\), we can compute \(C = AB\) using this component-wise formula:
\[ C_{ij} = \sum_{k=1}^n A_{ik}B_{kj} \]
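For example, for two \(2\times 2\) matrices the formula expands to:
\[
\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
\begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}
=
\begin{pmatrix}
a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\
a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22}
\end{pmatrix}
\]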
The straightforward way to implement this in C++ is to represent \(A\) as an array of float values of size n*n in row-major order; the mapping from \((i,j)\) to the array index is then \(i \cdot n + j\).
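As a quick illustration of that layout (the idx helper here is just for this sketch, it is not part of the program below):

// Row-major layout: element (i, j) of an n x n matrix lives at index i * n + j.
inline int idx(int i, int j, int n) { return i * n + j; }

void layoutExample(int n) {
    float* A = new float[n * n];   // flat storage for an n x n matrix
    A[idx(2, 3, n)] = 1.0f;        // same element as A[2 * n + 3]
    delete[] A;
}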
2. The inner loop of matrix multiplication
To calculate \(C_{ij} = \sum_{k=1}^n A_{ik}B_{kj}\), we start with the inner loop:
float sum = 0.0f;
for (int k = 0; k < n; k++) {
    sum += A[i * n + k] * B[k * n + j];
}
C[i * n + j] = sum;
3. The full function to multiply \(A\) and \(B\) on the CPU
To complete the CPU function, we just wrap the inner loop in loops over i and j.
// CPU matrix multiplication
void matrixMultiplyCPU(float* A, float* B, float* C, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            <<inner-loop>>
        }
    }
}
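As a quick sanity check, here is a tiny driver (illustrative only, not part of the program below) that multiplies two 2x2 matrices with this function; the expected result follows from the formula in section 1:

// Multiply two 2 x 2 matrices as a sanity check (row-major order).
float A2[4] = {1, 2, 3, 4};    // [[1, 2], [3, 4]]
float B2[4] = {5, 6, 7, 8};    // [[5, 6], [7, 8]]
float C2[4] = {0, 0, 0, 0};
matrixMultiplyCPU(A2, B2, C2, 2);
// Expected: C2 = {19, 22, 43, 50}, i.e. [[19, 22], [43, 50]]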
4. The full function to multiply \(A\) and \(B\) on the GPU
NVIDIA CUDA provides a framework for running many copies of a function (a kernel) in parallel across the GPU's cores.
The CUDA framework allows for 2D programming: notice the .x and .y values in the kernel below, which span a 2D \((x,y)\) plane.
// CUDA kernel for matrix multiplication
__global__ void matrixMultiplyKernel(float* A, float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int i = row;
    int j = col;
    if (row < n && col < n) {
        <<inner-loop>>
    }
}
To understand how we allocate this \((x,y)\) plane on the CUDA device, let's go up to the call site where matrixMultiplyKernel is called:
matrixMultiplyKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);
5. The call site for matrixMultiplyKernel
When you call a CUDA kernel from C++, you have to specify the grid dimension and the block dimension. The grid dimension defines the number of "thread blocks" to launch across the GPU device. The block dimension defines the number of threads that run inside each thread block.
matrixMultiplyKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);
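With 16 x 16 thread blocks, one thread per matrix element, the two dimensions are computed with a ceiling division so that every row and column of the \(N\times N\) matrix is covered. This is the same configuration used in the main function below:

// 256 threads per block, arranged as a 16 x 16 tile.
dim3 blockDim(16, 16);
// Enough blocks in each direction to cover all N columns and N rows.
dim3 gridDim((N + blockDim.x - 1) / blockDim.x,
             (N + blockDim.y - 1) / blockDim.y);
matrixMultiplyKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);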
6. The main entry point
In CUDA there's a dichotomy between host memory (allocated with new and malloc) and device memory, which lives on the GPU (allocated with cudaMalloc).
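A minimal sketch of the two kinds of allocation (the buffer names here are illustrative):

// Host memory: ordinary C++ allocation, accessible from CPU code.
float* h_buf = new float[N * N];

// Device memory: lives on the GPU; the host only touches it via cudaMemcpy.
float* d_buf = nullptr;
cudaMalloc(&d_buf, N * N * sizeof(float));

// ... use the buffers ...

cudaFree(d_buf);
delete[] h_buf;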
The basic flow of a CUDA C/C++ program is:
- allocate your host memory
- allocate your device memory
- initialize (or load) your data in host memory
- copy the data to device memory
- run CUDA kernels on the device
- copy the results back to the host
Now here's the main function for this matrix multiplication example, with some timing code:
#include <iostream>
#include <chrono>
#include <cmath>
#include <algorithm>
#include <cuda_runtime.h>

// Matrix dimensions
const int N = 1024; // Matrix size N x N

int main() {
    float *h_A, *h_B, *h_C_cpu, *h_C_gpu; // Host matrices
    float *d_A, *d_B, *d_C;               // Device matrices

    // Allocate host memory
    h_A = new float[N * N];
    h_B = new float[N * N];
    h_C_cpu = new float[N * N];
    h_C_gpu = new float[N * N];

    // Initialize matrices
    initializeMatrix(h_A, N);
    initializeMatrix(h_B, N);

    // CPU Matrix Multiplication
    auto cpu_start = std::chrono::high_resolution_clock::now();
    matrixMultiplyCPU(h_A, h_B, h_C_cpu, N);
    auto cpu_end = std::chrono::high_resolution_clock::now();
    auto cpu_duration = std::chrono::duration_cast<std::chrono::milliseconds>(cpu_end - cpu_start);

    // Allocate device memory
    cudaMalloc(&d_A, N * N * sizeof(float));
    cudaMalloc(&d_B, N * N * sizeof(float));
    cudaMalloc(&d_C, N * N * sizeof(float));

    // Copy data to device
    cudaMemcpy(d_A, h_A, N * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, N * N * sizeof(float), cudaMemcpyHostToDevice);

    // Define grid and block dimensions
    dim3 blockDim(16, 16);
    dim3 gridDim((N + blockDim.x - 1) / blockDim.x, (N + blockDim.y - 1) / blockDim.y);

    // GPU Matrix Multiplication
    auto gpu_start = std::chrono::high_resolution_clock::now();
    matrixMultiplyKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);
    cudaDeviceSynchronize();
    auto gpu_end = std::chrono::high_resolution_clock::now();
    auto gpu_duration = std::chrono::duration_cast<std::chrono::milliseconds>(gpu_end - gpu_start);

    // Copy result back to host
    cudaMemcpy(h_C_gpu, d_C, N * N * sizeof(float), cudaMemcpyDeviceToHost);

    // Verify results
    float maxError = 0.0f;
    for (int i = 0; i < N * N; i++) {
        maxError = std::max(maxError, std::abs(h_C_cpu[i] - h_C_gpu[i]));
    }

    // Print results
    std::cout << "Matrix size: " << N << " x " << N << std::endl;
    std::cout << "CPU Time: " << cpu_duration.count() << " ms" << std::endl;
    std::cout << "GPU Time: " << gpu_duration.count() << " ms" << std::endl;
    std::cout << "Maximum Error: " << maxError << std::endl;

    // Cleanup
    delete[] h_A;
    delete[] h_B;
    delete[] h_C_cpu;
    delete[] h_C_gpu;
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    return 0;
}
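The main function calls initializeMatrix, which isn't listed elsewhere in this write-up; it would need to be defined (or declared) above main. A minimal sketch, assuming we just want pseudo-random values in [0, 1] (the initialization used for the timings above may differ):

// Fill an n x n matrix with pseudo-random values in [0, 1].
// Requires <cstdlib> for rand() and RAND_MAX.
void initializeMatrix(float* M, int n) {
    for (int i = 0; i < n * n; i++) {
        M[i] = static_cast<float>(rand()) / RAND_MAX;
    }
}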
7. Results
On my NVIDIA GeForce GTX 1060, the GPU version took 7 ms, vs 7,281 ms on the CPU (Intel Xeon, 3 GHz):
./matmul
Matrix size: 1024 x 1024
CPU Time: 7281 ms
GPU Time: 7 ms
Maximum Error: 9.15527e-05
Hire us if you are interested in accelerating your software using your older NVIDIA GPUs. We don't use Python, PyTorch, or any fancy deep learning frameworks; we use first-principles thinking to make C/C++ programs that offload parallelizable work to the GPU device.
If you are interested in hiring a team of experienced CUDA C++ developers, reach out to tobi@defini.dev
8. Appendix
8.1. Imports
#include <iostream>
#include <chrono>
#include <cmath>
#include <algorithm>
#include <cuda_runtime.h>

// Matrix dimensions
const int N = 1024; // Matrix size N x N