CUDA Programming [5]: Matrix Multiplication -- Optimizing with Static Shared Memory

2025/7/13 16:48:13 Source: https://blog.csdn.net/buuliuda/article/details/141871677

Preface
Project Structure
Core Code
Result Analysis

Preface

All of the code can be downloaded from: https://github.com/leoda1/the-notes-of-cuda-programming/tree/main/code/matmul-shared-memory

Project Structure

📂 .vscode
📂 build
📂 inc
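- 📄 matmul.hpp - header for the matrix multiplication functions
- 📄 timer.hpp - header for the timer
- 📄 utils.hpp - header for the utility functions
📂 src
- 📄 main.cpp - main program
- 📄 matmul_gpu_basic.cu - basic GPU matrix multiplication in CUDA
- 📄 matmul_gpu_shared.cu - GPU matrix multiplication in CUDA using shared memory
- 📄 timer.cpp - timer implementation
- 📄 utils.cpp - utility function implementation
📄 CMakeLists.txt - CMake build configuration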

Core Code

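I recorded a detailed video walkthrough of the plain matrix multiplication code in matmul_gpu_basic.cu. Video link: (the video runs 32:00; watching at 2x speed is recommended). So how do we implement matrix multiplication with static shared memory?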
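First, shared memory is faster than local and global memory. For all threads of a warp, accessing shared memory is as fast as accessing a register, as long as there are no bank conflicts between the threads (bank conflicts are covered in CUDA Programming [7]). Static shared memory is declared with the __shared__ qualifier.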
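As a quick sanity check on how much shared memory the kernel in this post asks for, the arithmetic can be written down in plain C++. This is only a sketch: the helper names are hypothetical, and the two limits (48 KiB of static shared memory per block, 1024 threads per block) are assumed typical values for current NVIDIA GPUs rather than numbers taken from this post; query your own device to confirm them.

```cpp
#include <cstddef>

constexpr int BLOCKSIZE = 32;  // tile edge used by the kernel in this post

// Assumed limits, not taken from the post; check your device's properties.
constexpr std::size_t kStaticSharedLimitBytes = 48 * 1024;
constexpr int kMaxThreadsPerBlock = 1024;

// Static shared memory one block needs: one tile of M plus one tile of N.
constexpr std::size_t shared_bytes_per_block(int tile) {
    return 2 * static_cast<std::size_t>(tile) * static_cast<std::size_t>(tile) * sizeof(float);
}

// One thread per tile element.
constexpr int threads_per_block(int tile) { return tile * tile; }

// With 4-byte floats, two 32 x 32 tiles take 8 KiB, well under the assumed limit.
static_assert(sizeof(float) == 4, "this sketch assumes 4-byte floats");
static_assert(shared_bytes_per_block(BLOCKSIZE) == 8192, "two 32x32 float tiles");
static_assert(shared_bytes_per_block(BLOCKSIZE) <= kStaticSharedLimitBytes, "tiles fit");
static_assert(threads_per_block(BLOCKSIZE) <= kMaxThreadsPerBlock, "block size is legal");
```

Under these assumptions a 32 x 32 tile is the largest square tile with one thread per element: a 64 x 64 tile would still fit in shared memory (32 KiB) but would need 4096 threads per block.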
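After the matrices are copied from the host to the device with cudaMemcpy, A and B (M_device and N_device in the kernel) still have to be cut into tiles and staged in static shared memory. This works because each element of the product is a row of A multiplied element by element with a column of B and then summed, so the sum can be split into pieces. With tiling, each matrix is divided into many small tiles, and one thread block handles one tile at a time. Each thread then multiplies and accumulates the staged elements by index to produce its element of the result. The complete code: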

#define BLOCKSIZE 32

__global__ void Matmul_shared_static_kernel(float* M_device, float* N_device, float* P_device, int width)
{
    __shared__ float M_device_shared[BLOCKSIZE][BLOCKSIZE];
    __shared__ float N_device_shared[BLOCKSIZE][BLOCKSIZE];

    // index of the current element in the matrix
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int idx = iy * width + ix;

    // index of the current thread within its tile
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    float P_element = 0;

    // Walk the tiles along the shared dimension. This assumes width is a
    // multiple of BLOCKSIZE and a BLOCKSIZE x BLOCKSIZE thread block.
    for (int i = 0; i < width / BLOCKSIZE; i++)
    {
        /* iy * width is the row offset, i * BLOCKSIZE + tx is the column index */
        M_device_shared[ty][tx] = M_device[iy * width + (i * BLOCKSIZE + tx)];
        /* (i * BLOCKSIZE + ty) * width is the row offset, ix is the column index */
        N_device_shared[ty][tx] = N_device[(i * BLOCKSIZE + ty) * width + ix];
        __syncthreads();  // wait until both tiles are fully loaded

        for (int j = 0; j < BLOCKSIZE; j++)
        {
            P_element += M_device_shared[ty][j] * N_device_shared[j][tx];
        }
        __syncthreads();  // wait before the next iteration overwrites the tiles
    }

    P_device[idx] = P_element;
}
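To check the index arithmetic of the kernel above without a GPU, the same tiling can be replayed on the CPU. The sketch below is a hypothetical host-side emulation, not code from the linked repository: the names matmul_naive, matmul_tiled, and TILE are made up for illustration. The by/bx loops stand in for blockIdx, the ty/tx loops stand in for threadIdx, and the boundary between the load loop and the compute loop plays the role of __syncthreads().

```cpp
#include <cassert>
#include <vector>

constexpr int TILE = 4;  // stands in for BLOCKSIZE; kept small so checks run fast

// Naive reference: P = M * N for row-major width x width matrices.
std::vector<float> matmul_naive(const std::vector<float>& M,
                                const std::vector<float>& N, int width) {
    std::vector<float> P(width * width, 0.0f);
    for (int row = 0; row < width; ++row)
        for (int col = 0; col < width; ++col) {
            float acc = 0.0f;
            for (int k = 0; k < width; ++k)
                acc += M[row * width + k] * N[k * width + col];
            P[row * width + col] = acc;
        }
    return P;
}

// Tiled version using the same indexing as the kernel. Partial sums are
// accumulated directly in P instead of a per-thread register.
std::vector<float> matmul_tiled(const std::vector<float>& M,
                                const std::vector<float>& N, int width) {
    assert(width % TILE == 0);  // same restriction as the kernel
    std::vector<float> P(width * width, 0.0f);
    float Ms[TILE][TILE];  // plays the role of M_device_shared
    float Ns[TILE][TILE];  // plays the role of N_device_shared
    for (int by = 0; by < width / TILE; ++by)
        for (int bx = 0; bx < width / TILE; ++bx)
            for (int i = 0; i < width / TILE; ++i) {
                // Load phase: every "thread" copies one element of each tile.
                for (int ty = 0; ty < TILE; ++ty)
                    for (int tx = 0; tx < TILE; ++tx) {
                        int iy = by * TILE + ty;
                        int ix = bx * TILE + tx;
                        Ms[ty][tx] = M[iy * width + (i * TILE + tx)];
                        Ns[ty][tx] = N[(i * TILE + ty) * width + ix];
                    }
                // Compute phase: runs only after the whole tile is loaded.
                for (int ty = 0; ty < TILE; ++ty)
                    for (int tx = 0; tx < TILE; ++tx) {
                        int idx = (by * TILE + ty) * width + (bx * TILE + tx);
                        for (int j = 0; j < TILE; ++j)
                            P[idx] += Ms[ty][j] * Ns[j][tx];
                    }
            }
    return P;
}
```

Calling both functions on the same 8 x 8 input filled with small integers should give identical vectors, because every intermediate sum is then an exactly representable integer. For general floating-point data, compare with a tolerance instead of exact equality.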

Result Analysis

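Next comes the actual build. The steps are the same as in CUDA Programming [4], which covers how to configure a CUDA C++ build environment in VS Code and run the program.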

cmake -B build -G "Visual Studio 16 2019"
cmake --build build
./build/Debug/matmul_shared.exe

The output:

(joker) PS C:\Users\22681\Desktop\project\cudalearn\notes\code\matmul-shared-memory> ./build/Debug/matmul_shared.exe
Input size is:4096 x 4096
matmul in gpu warmup                                         uses 187.220901 ms
matmul in gpu without shared memory<<<256, 16>>>             uses 184.976639 ms
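The numbers in this log can be related to the kernel with a little arithmetic. The sketch below rests on two assumptions that the post does not state: that <<<256, 16>>> means 256 blocks per grid dimension with 16 threads per block dimension (consistent with 4096 / 16 = 256), and that the basic kernel reads one full row of M and one full column of N from global memory for every output element. The helper names are hypothetical.

```cpp
#include <cstdint>

struct LaunchShape {
    int grid_edge;   // blocks per grid dimension
    int block_edge;  // threads per block dimension
};

// Ceiling division. Note the kernel in this post has no bounds check, so it
// only works when width is an exact multiple of the block edge.
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

constexpr LaunchShape launch_shape(int width, int block_edge) {
    return LaunchShape{ceil_div(width, block_edge), block_edge};
}

// Global-memory reads per output element, under the assumptions above.
constexpr std::int64_t naive_loads_per_element(int width) {
    return 2LL * width;  // one row of M plus one column of N
}

// In the tiled kernel each thread reads one element of M and one of N per tile step.
constexpr std::int64_t tiled_loads_per_element(int width, int tile) {
    return 2LL * (width / tile);
}

static_assert(launch_shape(4096, 16).grid_edge == 256, "matches <<<256, 16>>>");
static_assert(naive_loads_per_element(4096) == 8192, "naive: 8192 reads");
static_assert(tiled_loads_per_element(4096, 32) == 256, "tiled, BLOCKSIZE 32: 256 reads");
```

Under these assumptions, tiling cuts global-memory reads per output element by a factor equal to the tile edge, which is the motivation for staging the tiles in shared memory.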
