欢迎来到尧图网

客户服务 关于我们

您的位置:首页 > 科技 > 能源 > 摩尔线程 国产显卡 MUSA 并行编程 学习笔记-2024/12/02

摩尔线程 国产显卡 MUSA 并行编程 学习笔记-2024/12/02

2025/2/22 2:07:13 来源:https://blog.csdn.net/weixin_47469677/article/details/144186732  浏览:    关键词:摩尔线程 国产显卡 MUSA 并行编程 学习笔记-2024/12/02

Learning Roadmap:

Section 1: Intro to Parallel Programming & MUSA

  1. Deep Learning Ecosystem(摩尔线程 国产显卡 MUSA 并行编程 学习笔记-2024/11/30-CSDN博客)
  2. Ubuntu+Driver+Toolkit+conda+pytorch+torch_musa环境安装(2024/11/24-Ubuntu Windows双系统安装 | 2024/11/30-GPU驱动&MUSA Toolkit安装)
  3. C/C++ Review(摩尔线程国产显卡 MUSA 并行编程学习笔记-2024/11/22-CSDN博客)
  4. GPU intros(摩尔线程国产显卡 MUSA 并行编程学习笔记-2024/11/25-CSDN博客)
  5. GPU硬件架构 (摩尔线程国产显卡 MUSA 并行编程学习笔记-2024/11/26-CSDN博客)
  6. Write First Kernels (Here) (2024/11/27-线程层级 | 2024/11/28-First MUSA Kernel to Count Thread)
  7. MUSA API
  8. Faster Matrix Multiplication
  9. Triton
  10. Pytorch Extensions(摩尔线程国产显卡 MUSA 并行编程学习笔记-2024/11/21-CSDN博客)
  11. MNIST Multilayer Perceptron

Section 2: Parallel Programming & MUSA in Depth

  1. Analyzing Parallel Program Performance on a Quad-Core CPU
  2. Scheduling Task Graphs on a Multi-Core CPU
  3. A Simple Renderer in MUSA
  4. Optimizing DNN Performance on DNN Accelerator Hardware
  5. llm.c

Ref:摩尔学院 High-Performance Computing with GPUs | Stanford CS149 - Video | Stanford CS149 - Syllabus

Kernel to Add Vector

Ref:  High-Performance Computing with GPUs Chapter 5

下面的代码将用CPU与GPU分别对两个长度为1000万的向量进行相加,并计算对应的平均耗时

代码地址

MUSA PLAY GROUND - Github

代码

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <musa_runtime.h>#define N 10000000  // Vector size = 10 million
#define BLOCK_SIZE 256// Example:
// A = [1, 2, 3, 4, 5]
// B = [6, 7, 8, 9, 10]
// C = A + B = [7, 9, 11, 13, 15]// CPU vector addition
void vector_add_cpu(float *a, float *b, float *c, int n) {for (int i = 0; i < n; i++) {c[i] = a[i] + b[i];}
}// MUSA kernel for vector addition
__global__ void vector_add_gpu(float *a, float *b, float *c, int n) {int i = blockIdx.x * blockDim.x + threadIdx.x;if (i < n) {c[i] = a[i] + b[i];}
}// Initialize vector with random values
void init_vector(float *vec, int n) {for (int i = 0; i < n; i++) {vec[i] = (float)rand() / RAND_MAX;}
}// Function to measure execution time
double get_time() {struct timespec ts;clock_gettime(CLOCK_MONOTONIC, &ts);return ts.tv_sec + ts.tv_nsec * 1e-9;
}int main() {float *h_a, *h_b, *h_c_cpu, *h_c_gpu;float *d_a, *d_b, *d_c;size_t size = N * sizeof(float);// Allocate host memoryh_a = (float*)malloc(size);h_b = (float*)malloc(size);h_c_cpu = (float*)malloc(size);h_c_gpu = (float*)malloc(size);// Initialize vectorssrand(time(NULL));init_vector(h_a, N);init_vector(h_b, N);// Allocate device memorymusaMalloc(&d_a, size);musaMalloc(&d_b, size);musaMalloc(&d_c, size);// Copy data to devicemusaMemcpy(d_a, h_a, size, musaMemcpyHostToDevice);musaMemcpy(d_b, h_b, size, musaMemcpyHostToDevice);// Define grid and block dimensionsint num_blocks = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;// N = 1024, BLOCK_SIZE = 256, num_blocks = 4// (N + BLOCK_SIZE - 1) / BLOCK_SIZE = ( (1025 + 256 - 1) / 256 ) = 1280 / 256 = 4 rounded // Warm-up runsprintf("Performing warm-up runs...\n");for (int i = 0; i < 3; i++) {vector_add_cpu(h_a, h_b, h_c_cpu, N);vector_add_gpu<<<num_blocks, BLOCK_SIZE>>>(d_a, d_b, d_c, N);musaDeviceSynchronize();}// Benchmark CPU implementationprintf("Benchmarking CPU implementation...\n");double cpu_total_time = 0.0;for (int i = 0; i < 20; i++) {double start_time = get_time();vector_add_cpu(h_a, h_b, h_c_cpu, N);double end_time = get_time();cpu_total_time += end_time - start_time;}double cpu_avg_time = cpu_total_time / 20.0;// Benchmark GPU implementationprintf("Benchmarking GPU implementation...\n");double gpu_total_time = 0.0;for (int i = 0; i < 20; i++) {double start_time = get_time();vector_add_gpu<<<num_blocks, BLOCK_SIZE>>>(d_a, d_b, d_c, N);musaDeviceSynchronize();double end_time = get_time();gpu_total_time += end_time - start_time;}double gpu_avg_time = gpu_total_time / 20.0;// Print resultsprintf("CPU average time: %f milliseconds\n", cpu_avg_time*1000);printf("GPU average time: %f milliseconds\n", gpu_avg_time*1000);printf("Speedup: %fx\n", cpu_avg_time / gpu_avg_time);// Verify results (optional)musaMemcpy(h_c_gpu, d_c, size, musaMemcpyDeviceToHost);bool correct = true;for (int i = 0; i < N; i++) {if (fabs(h_c_cpu[i] - h_c_gpu[i]) > 1e-5) {correct = false;break;}}printf("Results are %s\n", correct ? "correct" : "incorrect");// Free memoryfree(h_a);free(h_b);free(h_c_cpu);free(h_c_gpu);musaFree(d_a);musaFree(d_b);musaFree(d_c);return 0;
}

编译

    mcc 00_vector_add_v1.mu -o vector_add_v1 -mtgpu -O2 -lmusart./vector_add_v1

输出结果

如图所示,结果输出了CPU与GPU 对于长度为1000万的两个向量的相加,20次的平均速度,并验证了结果的准确性,GPU加速效果非常明显

Notes

GPU Kernel

  1. 这里可以看到GPU 计算与CPU计算的差异,在GPUkernel里面我们通过blockId,blockDim与threadId取到这个线程的全局唯一id,这个id与向量中的值一一对应。我们基于位置取到对应的向量值,并完成相加。
// MUSA kernel for vector addition
__global__ void vector_add_gpu(float *a, float *b, float *c, int n) {int i = blockIdx.x * blockDim.x + threadIdx.x;if (i < n) {c[i] = a[i] + b[i];}
}

命名

在host(CPU)上的变量由h_打头,在device(GPU)上的变量由d_打头

    float *h_a, *h_b, *h_c_cpu, *h_c_gpu;float *d_a, *d_b, *d_c;

异构计算流程

这里可以看到一个异构计算的典型流程

  • 在host端定义并初始化变量
    float *h_a, *h_b, *h_c_cpu, *h_c_gpu;float *d_a, *d_b, *d_c;size_t size = N * sizeof(float);// Allocate host memoryh_a = (float*)malloc(size);h_b = (float*)malloc(size);h_c_cpu = (float*)malloc(size);h_c_gpu = (float*)malloc(size);// Initialize vectorssrand(time(NULL));init_vector(h_a, N);init_vector(h_b, N);
  •  在Device端定义变量,并将Host端数据拷贝过来
    // Allocate device memorymusaMalloc(&d_a, size);musaMalloc(&d_b, size);musaMalloc(&d_c, size);// Copy data to devicemusaMemcpy(d_a, h_a, size, musaMemcpyHostToDevice);musaMemcpy(d_b, h_b, size, musaMemcpyHostToDevice);
  • 定义kernel中grid dim与block dim
    // Define grid and block dimensionsint num_blocks = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;// N = 1024, BLOCK_SIZE = 256, num_blocks = 4// (N + BLOCK_SIZE - 1) / BLOCK_SIZE = ( (1025 + 256 - 1) / 256 ) = 1280 / 256 = 4 rounded 
  • 进行计算
    // Warm-up runsprintf("Performing warm-up runs...\n");for (int i = 0; i < 3; i++) {vector_add_cpu(h_a, h_b, h_c_cpu, N);vector_add_gpu<<<num_blocks, BLOCK_SIZE>>>(d_a, d_b, d_c, N);musaDeviceSynchronize();}// Benchmark CPU implementationprintf("Benchmarking CPU implementation...\n");double cpu_total_time = 0.0;for (int i = 0; i < 20; i++) {double start_time = get_time();vector_add_cpu(h_a, h_b, h_c_cpu, N);double end_time = get_time();cpu_total_time += end_time - start_time;}double cpu_avg_time = cpu_total_time / 20.0;// Benchmark GPU implementationprintf("Benchmarking GPU implementation...\n");double gpu_total_time = 0.0;for (int i = 0; i < 20; i++) {double start_time = get_time();vector_add_gpu<<<num_blocks, BLOCK_SIZE>>>(d_a, d_b, d_c, N);musaDeviceSynchronize();double end_time = get_time();gpu_total_time += end_time - start_time;}double gpu_avg_time = gpu_total_time / 20.0;// Print resultsprintf("CPU average time: %f milliseconds\n", cpu_avg_time*1000);printf("GPU average time: %f milliseconds\n", gpu_avg_time*1000);printf("Speedup: %fx\n", cpu_avg_time / gpu_avg_time);
  • 将计算结果拷贝回Host端
    // Verify results (optional)musaMemcpy(h_c_gpu, d_c, size, musaMemcpyDeviceToHost);bool correct = true;for (int i = 0; i < N; i++) {if (fabs(h_c_cpu[i] - h_c_gpu[i]) > 1e-5) {correct = false;break;}}printf("Results are %s\n", correct ? "correct" : "incorrect");
  • 释放资源
    // Free memoryfree(h_a);free(h_b);free(h_c_cpu);free(h_c_gpu);musaFree(d_a);musaFree(d_b);musaFree(d_c);return 0;

通过数值而非三维向量定义Kernel

在这里,num_blocks与BLOCK_SIZE均为数值,而非摩尔线程 国产显卡 MUSA 并行编程 学习笔记-2024/11/28-CSDN博客 中的三维向量,这里只传递数值,MUSA会自动的将其变成一个拍扁的三维向量,例:GridDim = 1 会自动转换为 <1,0,0>

        vector_add_gpu<<<num_blocks, BLOCK_SIZE>>>(d_a, d_b, d_c, N);

版权声明:

本网仅为发布的内容提供存储空间,不对发表、转载的内容提供任何形式的保证。凡本网注明“来源:XXX网络”的作品,均转载自其它媒体,著作权归作者所有,商业转载请联系作者获得授权,非商业转载请注明出处。

我们尊重并感谢每一位作者,均已注明文章来源和作者。如因作品内容、版权或其它问题,请及时与我们联系,联系邮箱:809451989@qq.com,投稿邮箱:809451989@qq.com

热搜词