In 2006, NVIDIA introduced CUDA (Compute Unified Device Architecture), transforming the computing landscape and marking a pivotal moment in the history of parallel computing. Although graphics processing units (GPUs) were originally designed for rendering graphics and games, NVIDIA recognized their potential for general-purpose computation. That insight led to the development of CUDA, which turned GPUs from specialized graphics processors into powerful general-purpose computing engines.
The journey began in the early 2000s, when researchers started experimenting with GPUs for non-graphics computation, a concept known as GPGPU (General-Purpose computing on Graphics Processing Units). This early work was challenging, however, because programmers had to express their computational problems as graphics operations through OpenGL or DirectX.
The introduction of CUDA in 2006 changed this model completely. It gave developers a more direct way to program GPUs through a C-like interface, with no graphics-programming knowledge required. The first CUDA-capable GPU, the GeForce 8800 GTX, had 128 parallel processor cores and demonstrated computing power that was unprecedented at the time.
The CUDA architecture has evolved significantly since its inception:
- 2006 (G80): the first unified graphics and compute GPU architecture;
- 2008 (GT200): introduced double-precision floating-point operations;
- 2010 (Fermi): the first complete GPU computing architecture with L1/L2 caches;
- 2012 (Kepler): dynamic parallelism and Hyper-Q;
- 2014 (Maxwell): improved memory hierarchy and power efficiency;
- 2016 (Pascal): unified memory and the NVLink interconnect;
- 2018 (Volta): Tensor Cores for AI acceleration;
- 2020 (Ampere): third-generation Tensor Cores and Multi-Instance GPU;
- 2022 (Hopper): fourth-generation Tensor Cores and the Transformer Engine.
CUDA's importance stems from its ability to harness the massive parallel processing power of GPUs. Modern GPUs contain thousands of small, efficient cores designed to handle many tasks simultaneously. For comparison:
- A typical CPU might have 8-32 cores
- NVIDIA's A100 GPU has 6,912 CUDA cores
- The newer H100 GPU contains 18,432 CUDA cores
This parallel architecture makes GPUs extremely efficient in areas such as:
1. Deep learning and AI computation
2. Scientific simulation
3. Cryptography
4. Image and video processing
5. Financial modeling
6. Molecular dynamics
7. Weather forecasting
While CUDA C/C++ provides low-level access to GPU computing, many developers prefer higher-level languages such as Python. PyCUDA, developed by Andreas Klöckner in 2009, bridges this gap by providing Python bindings for CUDA. It enables developers to:
- Write GPU code with familiar Python syntax
- Retain CUDA's performance advantages
- Manage GPU resources automatically
- Integrate with popular Python scientific computing libraries
PyCUDA's advantages include:
- Automatic memory management
- Run-time code generation
- Seamless NumPy integration
- Error checking and debugging facilities
- An object-oriented interface to CUDA operations
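To see several of those advantages in one place, here is a minimal sketch (assuming a working PyCUDA and NumPy installation) that squares an array on the GPU with `pycuda.gpuarray`; allocation, transfer, and cleanup all happen behind the NumPy-style syntax:

import pycuda.autoinit  # Creates a CUDA context automatically
import numpy as np
from pycuda import gpuarray

# Copy a NumPy array to the GPU; device memory is allocated for us
x = gpuarray.to_gpu(np.random.randn(1024).astype(np.float32))

# NumPy-style elementwise expression, evaluated on the GPU
y = x * x

# .get() copies the result back to a host NumPy array
print(np.allclose(y.get(), x.get() ** 2))  # True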
Modern NVIDIA GPUs such as the A100 and H100 represent the pinnacle of GPU computing, with the following characteristics:
A100 (Ampere architecture):
- 80 GB HBM2e memory
- 1.6 TB/s memory bandwidth
- 312 TFLOPS for AI training
- Multi-Instance GPU technology
- Third-generation Tensor Cores
H100 (Hopper architecture):
- 80 GB HBM3 memory
- 3 TB/s memory bandwidth
- 1,000 TFLOPS for AI training
- Fourth-generation Tensor Cores
- Transformer Engine for accelerating AI
CUDA and GPU computing have transformed countless industries:
- AI/ML: powering the deep learning revolution
- Healthcare: accelerating medical image processing and drug discovery
- Finance: enhancing real-time risk analysis and trading algorithms
- Scientific research: speeding up complex simulations
- Cloud computing: enabling GPU-as-a-service offerings
Before diving into PyCUDA programming, it is essential to understand how GPUs differ fundamentally from CPUs and how those differences enable massive parallel processing.
CPU architecture:
1. Control units:
- Large, complex control logic
- Sophisticated branch prediction
- Out-of-order execution
- Advanced instruction pipelining
2. Cache memory:
- Large L1, L2, and L3 caches
- Complex cache-coherence protocols
- Optimized for data reuse
3. Processing units:
- Few but powerful (typically 4-32)
- Complex arithmetic logic units (ALUs)
- High clock speeds
- Advanced vector-processing units
GPU architecture:
1. Control units:
- Simple control logic shared across many cores
- In-order execution
- Hardware-managed thread scheduling
2. Cache memory:
- Small, fast caches per streaming multiprocessor (SM)
- Optimized for throughput rather than latency
- High-bandwidth memory interfaces
3. Processing units:
- Thousands of simple cores
- Streamlined ALUs
- Optimized for parallel operations
- Specialized Tensor Cores for ML operations
These architectural differences lead to fundamentally different approaches to computational tasks.
Understanding the processing models
CPU processing model:
1. Sequential execution:
- Tasks are processed one after another
- Handles complex tasks efficiently
- High per-thread performance
- Optimized for serial workloads
2. Resource utilization:
- Deep instruction pipelines
- Sophisticated task scheduling
- Dynamic resource allocation
GPU processing model:
1. Parallel execution:
- Many tasks are processed simultaneously
- Simple tasks replicated across cores
- High aggregate throughput
- Optimized for parallel workloads
2. Resource utilization:
- Wide SIMD execution
- Massive thread-level parallelism
- Hardware-managed scheduling
Real-world impact
These architectural differences translate directly into performance characteristics:
CPUs excel at:
- Complex decision-making tasks
- Single-threaded applications
- Tasks with unpredictable branching
- Low-latency operations
- Operating-system tasks
GPUs excel at:
- Parallel computation
- Matrix operations
- Image and video processing
- Deep learning training
- Scientific simulation
Memory architecture and bandwidth
Another important difference lies in the memory architecture:
CPU memory system:
- Standard DDR memory
- ~100 GB/s bandwidth
- Large memory capacity
- Complex cache hierarchy
GPU memory system:
- High-bandwidth memory (HBM)
- ~1-3 TB/s bandwidth (A100/H100)
- Optimized for parallel access
- Dedicated to compute workloads
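The bandwidth question is easy to probe from Python. The sketch below (illustrative only; it assumes PyCUDA is installed) times a host-to-device copy with CUDA events. Note that it measures the PCIe transfer rate between CPU and GPU, which is far lower than the on-board HBM bandwidth quoted above:

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np

# 256 MB of host data
data = np.random.randn(64 * 1024 * 1024).astype(np.float32)
gpu_buf = drv.mem_alloc(data.nbytes)

start, end = drv.Event(), drv.Event()
start.record()
drv.memcpy_htod(gpu_buf, data)
end.record()
end.synchronize()

seconds = start.time_till(end) / 1000.0  # time_till() returns milliseconds
print(f"Host-to-device bandwidth: {data.nbytes / seconds / 1e9:.1f} GB/s")
gpu_buf.free()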
Before diving into PyCUDA programming, let's make sure your system is configured correctly. This section walks you through the basic requirements and installation steps.
Prerequisites
First, verify that your system has a CUDA-capable NVIDIA GPU.
Minimum requirements:
- An NVIDIA GPU with compute capability 3.0 or higher
- At least 4 GB of GPU memory (recommended)
- Up-to-date NVIDIA drivers
To check your GPU:
import subprocess

def get_gpu_info():
    try:
        output = subprocess.check_output(
            ["nvidia-smi"], universal_newlines=True
        )
        print("GPU Information:")
        print(output)
    except Exception:
        print("No NVIDIA GPU detected or nvidia-smi not installed")

get_gpu_info()
The CUDA Toolkit is essential for GPU programming. Follow these steps:
1. Download and install the CUDA Toolkit from the NVIDIA developer website.
2. Verify the installation:

# Check the CUDA version
nvcc --version
Using a virtual environment is recommended:

# Create a new virtual environment
python -m venv pycuda_env

# Activate the environment
# On Windows:
pycuda_env\Scripts\activate
# On Linux/Mac:
source pycuda_env/bin/activate

# Install required packages
pip install numpy
Install PyCUDA with pip:
pip install pycuda
Let's verify that everything works with a simple PyCUDA program:
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np

def verify_pycuda_setup():
    try:
        # Get device properties
        print(f"CUDA version: {drv.get_version()}")
        print(f"Device name: {pycuda.autoinit.device.name()}")
        print(f"Compute capability: {pycuda.autoinit.device.compute_capability()}")
        print(f"Total memory: {pycuda.autoinit.device.total_memory() // (1024*1024)} MB")

        # Try a simple operation
        a = np.random.randn(100).astype(np.float32)
        a_gpu = drv.mem_alloc(a.nbytes)
        drv.memcpy_htod(a_gpu, a)

        print("\nSuccessfully allocated and copied memory to GPU!")
        print("PyCUDA is working correctly!")
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        print("Please check your installation.")

if __name__ == "__main__":
    verify_pycuda_setup()
Expected output (varies with your GPU):

CUDA version: (11, 5, 0)
Device name: NVIDIA A100-SXM4-80GB
Compute capability: (8, 0)
Total memory: 81037 MB

Successfully allocated and copied memory to GPU!
PyCUDA is working correctly!
Before writing your first PyCUDA program, you need the fundamental concepts that underpin GPU programming.
1. Host vs. device
GPU programming uses specific terminology to distinguish the CPU from the GPU:
- Host: the CPU and its memory (system RAM)
- Device: the GPU and its memory (VRAM)
2. Kernel functions
A kernel is a special function that runs on the GPU. Key characteristics:
- Executes in parallel across many GPU threads
- Marked with `__global__` in CUDA C
- Launched with the special `<<<>>>` syntax in CUDA C; in PyCUDA, you pass `block` and `grid` arguments instead
- Runs on the device (GPU)
An example of a simple kernel:
import pycuda.autoinit
import pycuda.driver as cuda
import numpy as np
from pycuda.compiler import SourceModule

# Define kernel with print statements
mod = SourceModule("""
#include <stdio.h>

__global__ void multiply_by_two(float *array)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // Print the index and value before multiplication
    printf("Before: idx = %d, value = %f\\n", idx, array[idx]);

    array[idx] *= 2;

    // Print the index and value after multiplication
    printf("After: idx = %d, value = %f\\n", idx, array[idx]);
}
""")

# Get function from module
multiply_by_two = mod.get_function("multiply_by_two")

# Create a sample array
array_size = 10
array = np.random.randn(array_size).astype(np.float32)

# Print original array in Python
print("Original array:")
print(array)

# Allocate memory on the device and copy the array to it
array_gpu = cuda.mem_alloc(array.nbytes)
cuda.memcpy_htod(array_gpu, array)

# Execute the kernel with one block of 10 threads
multiply_by_two(array_gpu, block=(array_size, 1, 1), grid=(1, 1))

# Copy the result back from the device
cuda.memcpy_dtoh(array, array_gpu)

# Print the modified array in Python
print("\nModified array (after multiply_by_two):")
print(array)
Original array:
[-4.3534854e-01  7.7988309e-01  6.1468828e-01 -1.5304610e-03
  1.1495910e+00  7.5472385e-01  2.4845517e+00  8.3205029e-02
  7.3480928e-01 -1.6091207e-01]

Modified array (after multiply_by_two):
[-8.7069708e-01  1.5597662e+00  1.2293766e+00 -3.0609220e-03
  2.2991819e+00  1.5094477e+00  4.9691033e+00  1.6641006e-01
  1.4696186e+00 -3.2182413e-01]

Before: idx = 0, value = -0.435349
Before: idx = 1, value = 0.779883
Before: idx = 2, value = 0.614688
Before: idx = 3, value = -0.001530
Before: idx = 4, value = 1.149591
Before: idx = 5, value = 0.754724
Before: idx = 6, value = 2.484552
Before: idx = 7, value = 0.083205
Before: idx = 8, value = 0.734809
Before: idx = 9, value = -0.160912
After: idx = 0, value = -0.870697
After: idx = 1, value = 1.559766
After: idx = 2, value = 1.229377
After: idx = 3, value = -0.003061
After: idx = 4, value = 2.299182
After: idx = 5, value = 1.509448
After: idx = 6, value = 4.969103
After: idx = 7, value = 0.166410
After: idx = 8, value = 1.469619
After: idx = 9, value = -0.321824
GPU programming uses a hierarchical thread model built from threads, blocks, and grids.
Let's break these concepts down with simple analogies to make them clearer:
1. Kernel
- Think of it as a special function that runs on the GPU
- Like a recipe followed by many cooks at once
- Every kernel instance runs the same code, but on different data
- In PyCUDA, this is CUDA C code wrapped in a SourceModule
2. Thread
- The smallest unit of execution in CUDA
- Like a single worker executing the kernel code
- Each thread processes its own data
- Think of it as one running instance of the kernel
3. Warp
- A group of 32 threads that execute together
- All threads in a warp execute the same instruction at the same time
- Like a squad of 32 workers moving in perfect sync
- A hardware-level grouping (its size cannot be changed)
4. Block
- A programmable group of threads (up to 1,024)
- Threads in a block can communicate and share memory
- Like workers from one department sharing the same room
- Multiple warps make up a block
5. Grid
- A collection of blocks
- Like several departments working on the same project
- Blocks in a grid cannot communicate directly
- The whole GPU program is organized as a grid
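A quick way to make this hierarchy concrete is to let every thread report its own coordinates. Here is a minimal sketch (the kernel name `whoami` is just for illustration):

import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
#include <stdio.h>

__global__ void whoami()
{
    // Each thread prints its position in the hierarchy
    printf("block %d, thread %d, global id %d\\n",
           blockIdx.x, threadIdx.x,
           blockIdx.x * blockDim.x + threadIdx.x);
}
""")

whoami = mod.get_function("whoami")

# A grid of 2 blocks with 4 threads each: 8 threads in total
whoami(block=(4, 1, 1), grid=(2, 1))
drv.Context.synchronize()  # Wait so the device printf output is flushed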
The restaurant kitchen analogy
To understand how these concepts work together, imagine the kitchen of a large restaurant:
1. A kernel is like a recipe (e.g., "chop the vegetables")
2. A thread is like one cook following that recipe
3. A warp is like a brigade of 32 cooks working in perfect sync
4. A block is like one kitchen station with several brigades
5. A grid is like the entire kitchen with many stations
Key points for beginners
- All threads in a warp execute the same instruction at the same time
- Threads within a block can share resources (shared memory)
- Blocks are independent and may execute in any order
- The GPU manages warp execution automatically
- You write kernel code from the perspective of a single thread
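One practical consequence of the first point is warp divergence: if an `if/else` splits the threads of a warp, the hardware runs the two paths one after another with the inactive threads masked off. A hypothetical kernel that would diverge this way, purely for illustration:

import pycuda.autoinit
from pycuda.compiler import SourceModule

# Even and odd threads take different branches, so every warp
# executes both branches serially instead of fully in parallel
divergent = SourceModule("""
__global__ void divergent_kernel(float *out)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx % 2 == 0)
        out[idx] = idx * 2.0f;    // Half of each warp is active here...
    else
        out[idx] = idx + 100.0f;  // ...and the other half here
}
""").get_function("divergent_kernel")

Branching on something like `blockIdx.x` instead would avoid the penalty, since whole warps would then take the same path.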
CUDA provides several types of memory:
1. Global memory:
- Accessible by all threads
- The largest but slowest
- The main way to move data between host and device
2. Shared memory:
- Shared among the threads of one block
- Much faster than global memory
- Limited in size (typically about 48 KB per block)
3. Local memory:
- Private to each thread
- Managed automatically
- Used for thread-local variables
4. Constant memory:
- Read-only
- Cached for fast access
- Accessible by all threads
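Constant memory is declared with `__constant__` in the kernel source and filled from the host; in PyCUDA, `mod.get_global()` returns the device address of such a symbol. A minimal sketch (the symbol name `coeffs` is just an example):

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule

mod = SourceModule("""
__constant__ float coeffs[4];  // Lives in cached constant memory

__global__ void apply_coeffs(float *data)
{
    int idx = threadIdx.x;
    data[idx] *= coeffs[idx % 4];  // Every thread reads the same small table
}
""")

# Copy host data into the constant symbol
ptr, size = mod.get_global("coeffs")
drv.memcpy_htod(ptr, np.array([1.0, 0.5, 0.25, 0.125], dtype=np.float32))

apply_coeffs = mod.get_function("apply_coeffs")
data = np.ones(16, dtype=np.float32)
apply_coeffs(drv.InOut(data), block=(16, 1, 1), grid=(1, 1))
print(data)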
Moving data between host and device involves three main steps:
1. Allocation:

# Allocate memory on the GPU
import pycuda.driver as drv

gpu_array = drv.mem_alloc(cpu_array.nbytes)
2. Transfer:

# Host to device (H2D)
drv.memcpy_htod(gpu_array, cpu_array)

# Device to host (D2H)
drv.memcpy_dtoh(cpu_array, gpu_array)
3. Deallocation:

# Free GPU memory
gpu_array.free()
Here is a complete example showing memory transfers:

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np

# Create data on the CPU
cpu_data = np.random.randn(1000).astype(np.float32)

# Allocate GPU memory
gpu_data = drv.mem_alloc(cpu_data.nbytes)

# Transfer to the GPU
drv.memcpy_htod(gpu_data, cpu_data)

# Transfer back to the CPU
result = np.empty_like(cpu_data)
drv.memcpy_dtoh(result, gpu_data)

# Verify
print("Data transfer successful:", np.allclose(cpu_data, result))

Data transfer successful: True
Let's build a simple but complete vector-addition example that demonstrates the basic GPU programming workflow. We will add two vectors element by element, breaking down each step and discussing the important concepts along the way.
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule

# Define the kernel in CUDA C
cuda_kernel = """
__global__ void vector_add(float *a, float *b, float *result, int n)
{
    // Calculate unique thread index
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // Check whether the thread is within array bounds
    if (idx < n)
    {
        result[idx] = a[idx] + b[idx];
    }
}
"""

# Compile the kernel
mod = SourceModule(cuda_kernel)

# Get a reference to the kernel function
vector_add = mod.get_function("vector_add")

def prepare_data(n):
    # Create input vectors
    a = np.random.randn(n).astype(np.float32)
    b = np.random.randn(n).astype(np.float32)
    # Create output vector
    result = np.zeros_like(a)
    return a, b, result

def calculate_grid_block_size(n):
    # Maximum threads per block on most GPUs
    max_threads_per_block = 512
    # Calculate block size
    block_size = min(max_threads_per_block, n)
    # Calculate grid size
    grid_size = (n + block_size - 1) // block_size
    return (grid_size, 1, 1), (block_size, 1, 1)

def gpu_vector_add(a, b, n):
    # Calculate grid and block dimensions
    grid_dim, block_dim = calculate_grid_block_size(n)
    # Allocate output array
    result = np.zeros_like(a)
    # Call the kernel
    vector_add(
        # Inputs and output
        drv.In(a), drv.In(b), drv.Out(result), np.int32(n),
        # Grid and block dimensions
        grid=grid_dim, block=block_dim
    )
    return result

# Example usage
def main():
    # Vector size
    n = 1_000_000
    # Prepare the data
    a, b, expected_result = prepare_data(n)
    # Run the GPU computation
    result = gpu_vector_add(a, b, n)
    # Verify the result
    print("Result correct:", np.allclose(result, a + b))
The example highlights two details worth calling out:
- Thread index calculation: `threadIdx.x + blockIdx.x * blockDim.x` gives every thread a unique element index, which is why the `if (idx < n)` bounds check is needed whenever n is not an exact multiple of the block size.
- Memory management: `drv.In` and `drv.Out` wrap the NumPy arrays and handle the host-device copies automatically.
Common mistakes to avoid:
1. Mixed data types

# Wrong: mixed types
a = np.random.randn(1000)  # defaults to float64

# Fix: explicitly use float32
a = np.random.randn(1000).astype(np.float32)
2. Out-of-bounds indexing

// Wrong: no bounds check
result[idx] = a[idx] + b[idx];  // potential memory error

// Fix: guard every access
if (idx < n)
    result[idx] = a[idx] + b[idx];
3. Oversized blocks

# Wrong: exceeds the maximum threads per block
block_dim = (2000, 1, 1)  # too large

# Fix: use the calculate_grid_block_size function
4. Memory leaks

# Wrong: manually allocating memory without freeing it
gpu_array = drv.mem_alloc(a.nbytes)

# Better: use drv.In and drv.Out for automatic management
Best practices:
1. Error handling:

try:
    result = gpu_vector_add(a, b, n)
except Exception as e:
    print(f"GPU operation failed: {e}")
2. Performance optimization
3. Memory management
4. Validation:
def validate_inputs(a, b):
    assert a.dtype == np.float32, "Array must be float32"
    assert a.shape == b.shape, "Arrays must have same shape"
    assert a.flags.c_contiguous, "Array must be contiguous"
We will first add two lists of numbers with plain Python code (running on the CPU), and then show how to perform the same operation with PyCUDA (running on the GPU).
# Create two lists of numbers
a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50]

# Create an empty list to store the results
c = []

# Loop over the lists, adding corresponding elements
for i in range(len(a)):
    c.append(a[i] + b[i])

# Display the result
print("Sum:", c)
Explanation: the loop visits one pair of elements at a time, so the additions happen strictly one after another on the CPU.
Output:

Sum: [11, 22, 33, 44, 55]
Now let's perform the same operation with PyCUDA to take advantage of the GPU.
Make sure PyCUDA is installed on your system.

pip install pycuda

Note: you need an NVIDIA GPU and the CUDA Toolkit installed for PyCUDA to work.
import pycuda.autoinit                     # Initialize the CUDA driver
import pycuda.driver as cuda               # Access CUDA functionality
import numpy as np                         # Library for efficient array computation
from pycuda.compiler import SourceModule   # Compile CUDA code

kernel_code = """
__global__ void add_arrays(float *a, float *b, float *c)
{
    int idx = threadIdx.x;     // Unique index for each thread
    c[idx] = a[idx] + b[idx];  // Perform the addition
}
"""
Explanation: each thread adds exactly one pair of elements; `threadIdx.x` alone is enough here because the kernel is launched with a single block.

mod = SourceModule(kernel_code)              # Compile the kernel code
add_arrays = mod.get_function("add_arrays")  # Get the compiled function
# Create NumPy arrays (they work well with PyCUDA)
a = np.array([1, 2, 3, 4, 5], dtype=np.float32)
b = np.array([10, 20, 30, 40, 50], dtype=np.float32)
c = np.empty_like(a)  # Empty array to store the result

Explanation: the arrays are declared as float32 so they match the `float` pointers in the kernel signature.
# Number of threads (one per element)
threads_per_block = len(a)

# Launch the kernel
add_arrays(
    cuda.In(a), cuda.In(b), cuda.Out(c),  # Input and output arrays
    block=(threads_per_block, 1, 1),      # Define the block size
    grid=(1, 1)                           # Define the grid size
)

Explanation: `cuda.In` and `cuda.Out` copy the arrays to the device before the launch and copy the result back afterwards.
print ( "数组 a:",a) print ( "数组 b:",b) print ( "总和 c:",c)
输出:
数组 a:[1. 2. 3. 4. 5.]数组 b:[10. 20. 30. 40. 50.]总和 c:[11. 22. 33. 44. 55.]
a) Allocation:

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np

# Host array creation
host_array = np.random.rand(1000).astype(np.float32)

# GPU memory allocation methods:
# Method 1: using drv.mem_alloc
gpu_array = drv.mem_alloc(host_array.nbytes)

# Method 2: using gpuarray (allocates and copies in one step)
from pycuda import gpuarray
gpu_array = gpuarray.to_gpu(host_array)
b) Transfer:

# Host-to-device transfer
drv.memcpy_htod(gpu_array, host_array)

# Device-to-host transfer
result_host = np.empty_like(host_array)
drv.memcpy_dtoh(result_host, gpu_array)
c) Deallocation:

# Free GPU memory
gpu_array.free()

# Or, if using gpuarray:
del gpu_array
a) Vector addition example:

from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void add_vectors(float *a, float *b, float *c, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        c[idx] = a[idx] + b[idx];
}
""")

# Get function from module
add_vectors = mod.get_function("add_vectors")

# Prepare data
n = 1000000
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros_like(a)

# Execute kernel
add_vectors(
    drv.In(a), drv.In(b), drv.Out(c), np.int32(n),
    block=(256, 1, 1),
    grid=((n + 255) // 256, 1)
)
a) Basic error checking:
PyCUDA surfaces CUDA errors as Python exceptions, so the idiomatic check is to synchronize after a launch and catch whatever is raised:

# After kernel execution
try:
    drv.Context.synchronize()
except Exception as e:
    raise RuntimeError(f"CUDA error: {e}")
b) Memory error prevention:

try:
    # Get device properties
    device = drv.Device(0)
    max_threads = device.get_attribute(drv.device_attribute.MAX_THREADS_PER_BLOCK)
    max_grid_x = device.get_attribute(drv.device_attribute.MAX_GRID_DIM_X)

    # Use these limits when launching kernels
    block_size = min(256, max_threads)
    grid_size = (n + block_size - 1) // block_size
except drv.RuntimeError as e:
    print(f"CUDA Error: {e}")
2. Kernel optimization:
3. Error handling:
1. Thread structure and basic concepts:

import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Basic kernel showing thread structure
kernel_code = """
__global__ void thread_example(int *output)
{
    // Thread identification
    int tid = threadIdx.x;   // Thread ID within the block
    int bid = blockIdx.x;    // Block ID
    int bdim = blockDim.x;   // Block dimension
    int gid = threadIdx.x + blockIdx.x * blockDim.x;  // Global thread ID

    // Store thread info
    output[gid] = gid;
}
"""

mod = SourceModule(kernel_code)
thread_example = mod.get_function("thread_example")
2. Block organization example:

# 2D block organization example
import numpy as np

kernel_2d = """
__global__ void matrix_operation(float *matrix, int width, int height)
{
    // 2D thread and block indexing
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < height && col < width) {
        int idx = row * width + col;
        matrix[idx] = row + col;  // Example operation
    }
}
"""

mod = SourceModule(kernel_2d)
matrix_operation = mod.get_function("matrix_operation")

# Example usage
width, height = 1024, 1024
matrix = np.zeros((height, width), dtype=np.float32)

# Define block and grid dimensions
block_dim = (16, 16, 1)  # 16x16 threads per block
grid_dim = ((width + block_dim[0] - 1) // block_dim[0],
            (height + block_dim[1] - 1) // block_dim[1],
            1)

matrix_operation(
    drv.Out(matrix),
    np.int32(width), np.int32(height),
    block=block_dim,
    grid=grid_dim
)
3. Grid configuration example:

# Compute sensible grid and block dimensions for a 1D launch
def calculate_grid_block_dim(total_elements, max_threads_per_block=1024):
    """Compute grid and block dimensions."""
    # Get device properties
    device = drv.Device(0)
    max_threads = device.get_attribute(drv.device_attribute.MAX_THREADS_PER_BLOCK)
    max_block_dim = min(max_threads_per_block, max_threads)

    # Calculate dimensions
    block_dim = min(total_elements, max_block_dim)
    grid_dim = (total_elements + block_dim - 1) // block_dim

    return grid_dim, block_dim

# Example usage
n = 1000000
grid_dim, block_dim = calculate_grid_block_dim(n)
print(f"Grid size: {grid_dim}, block size: {block_dim}")
4. Optimizing thread/block dimensions:

kernel_optimization = """
__global__ void optimized_operation(float *input, float *output, int n)
{
    // Shared memory for block-level data
    __shared__ float shared_data[256];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Load data into shared memory
    if (gid < n) {
        shared_data[tid] = input[gid];
    }
    __syncthreads();  // Make sure every thread has loaded its data

    // Process the data from shared memory
    if (gid < n) {
        output[gid] = shared_data[tid] * 2.0f;  // Example operation
    }
}
"""
2. Memory considerations:
3. Optimization guidelines:
4. Common pitfalls:
An example of computing suitable dimensions for a 2D matrix:

def get_optimal_block_dims(width, height):
    max_threads = 1024  # Typical maximum threads per block

    # Start with a 16x16 block size (common for 2D operations)
    block_x = min(16, width)
    block_y = min(16, height)

    # Calculate grid dimensions
    grid_x = (width + block_x - 1) // block_x
    grid_y = (height + block_y - 1) // block_y

    return (grid_x, grid_y), (block_x, block_y)

# Example usage
width, height = 2048, 1024
grid_dim, block_dim = get_optimal_block_dims(width, height)
print(f"Grid dimensions: {grid_dim}")
print(f"Block dimensions: {block_dim}")
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule

# 1. Map Operation Example - Square each element
map_kernel = SourceModule("""
__global__ void square_elements(float *a, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        a[idx] = a[idx] * a[idx];
}
""")

def demonstrate_map():
    # Create input data
    n = 10
    a = np.random.randn(n).astype(np.float32)
    a_copy = a.copy()  # Keep a copy for verification

    # Get the kernel function
    square = map_kernel.get_function("square_elements")

    # Allocate memory and copy data to GPU
    a_gpu = drv.mem_alloc(a.nbytes)
    drv.memcpy_htod(a_gpu, a)

    # Execute kernel
    square(a_gpu, np.int32(n),
           block=(256, 1, 1), grid=((n + 255) // 256, 1))

    # Get results back
    drv.memcpy_dtoh(a, a_gpu)

    print("\nMap Operation (Squaring Elements):")
    print("Input:", a_copy)
    print("Output:", a)
    print("Verification (CPU):", a_copy ** 2)

# 2. Reduce Operation Example - Sum all elements
reduce_kernel = SourceModule("""
__global__ void reduce_sum(float *a, float *out, int n)
{
    __shared__ float sdata[256];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? a[i] : 0;
    __syncthreads();

    // Tree-based reduction within the block
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}
""")

def demonstrate_reduce():
    # Create input data
    n = 1000
    a = np.random.randn(n).astype(np.float32)

    # Get the kernel function
    reduce = reduce_kernel.get_function("reduce_sum")

    # Calculate the number of blocks needed
    num_blocks = (n + 255) // 256
    out = np.zeros(num_blocks, dtype=np.float32)

    # Execute kernel (each block produces one partial sum)
    reduce(drv.In(a), drv.Out(out), np.int32(n),
           block=(256, 1, 1), grid=(num_blocks, 1))

    print("\nReduce Operation (Sum):")
    print("Input array size:", n)
    print("GPU sum:", np.sum(out))
    print("CPU sum:", np.sum(a))
    print("Difference:", abs(np.sum(out) - np.sum(a)))

# 3. Scan Operation Example - Cumulative sum
# Note: this kernel implements only the up-sweep (reduction) phase of a
# Blelloch scan, so intermediate entries differ from a true cumulative
# sum, as the sample output below shows.
scan_kernel = SourceModule("""
__global__ void scan(float *input, float *output, int n)
{
    __shared__ float temp[256];

    int thid = threadIdx.x;
    int offset = 1;

    temp[thid] = input[thid];
    __syncthreads();

    for (int d = n >> 1; d > 0; d >>= 1)
    {
        if (thid < d)
        {
            int ai = offset * (2 * thid + 1) - 1;
            int bi = offset * (2 * thid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
        __syncthreads();
    }

    output[thid] = temp[thid];
}
""")

def demonstrate_scan():
    # Create input data
    n = 8  # Using a small size for demonstration
    a = np.random.randn(n).astype(np.float32)

    # Get the kernel function
    scan = scan_kernel.get_function("scan")

    # Execute kernel
    out = np.zeros_like(a)
    scan(drv.In(a), drv.Out(out), np.int32(n),
         block=(n, 1, 1), grid=(1, 1))

    print("\nScan Operation (Cumulative Sum):")
    print("Input:", a)
    print("GPU scan:", out)
    print("CPU scan (verification):", np.cumsum(a))

# 4. Element-wise Operation Example - Vector addition
elementwise_kernel = SourceModule("""
__global__ void vector_add(float *a, float *b, float *c, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        c[idx] = a[idx] + b[idx];
}
""")

def demonstrate_elementwise():
    # Create input data
    n = 10
    a = np.random.randn(n).astype(np.float32)
    b = np.random.randn(n).astype(np.float32)

    # Get the kernel function
    vector_add = elementwise_kernel.get_function("vector_add")

    # Execute kernel
    c = np.zeros_like(a)
    vector_add(drv.In(a), drv.In(b), drv.Out(c), np.int32(n),
               block=(256, 1, 1), grid=((n + 255) // 256, 1))

    print("\nElement-wise Operation (Vector Addition):")
    print("Vector A:", a)
    print("Vector B:", b)
    print("GPU Result (A + B):", c)
    print("CPU Result (verification):", a + b)

# Run all demonstrations
if __name__ == "__main__":
    demonstrate_map()
    demonstrate_reduce()
    demonstrate_scan()
    demonstrate_elementwise()
Map Operation (Squaring Elements):
Input: [-1.0795693  -0.49946883 -0.85547155 -0.11256783 -1.809529   -2.3715997
  0.7603225  -0.8601724  -0.5761283  -0.47840685]
Output: [1.16547    0.24946912 0.73183155 0.01267152 3.274395   5.624485
 0.5780903  0.73989654 0.3319238  0.2288731 ]
Verification (CPU): [1.16547    0.24946912 0.73183155 0.01267152 3.274395   5.624485
 0.5780903  0.73989654 0.3319238  0.2288731 ]

Reduce Operation (Sum):
Input array size: 1000
GPU sum: -6.51562
CPU sum: -6.515627
Difference: 6.67572e-06

Scan Operation (Cumulative Sum):
Input: [-0.88912827  1.2799991  -0.06190103  0.66699725 -0.28361192  0.39906275
  1.0735904  -0.22892074]
GPU scan: [-0.88912827  0.39087087 -0.06190103  0.9959671  -0.28361192  0.11545083
  1.0735904   1.9560876 ]
CPU scan (verification): [-0.88912827  0.39087087  0.32896984  0.9959671   0.71235514  1.1114179
  2.1850083   1.9560876 ]

Element-wise Operation (Vector Addition):
Vector A: [-0.38900524 -0.5881734  -1.8390615  -1.0052649   0.19294897 -0.62156844
  0.29566982  0.8197685  -0.5074794   0.23306295]
Vector B: [-0.11977915 -1.2113439  -0.25459185  0.11812047  0.9040481  -0.07971474
  0.8966112  -0.5862821  -0.4948726   0.82815486]
GPU Result (A + B): [-0.5087844  -1.7995173  -2.0936534  -0.8871444   1.096997   -0.70128316
  1.192281    0.23348641 -1.002352    1.0612178 ]
CPU Result (verification): [-0.5087844  -1.7995173  -2.0936534  -0.8871444   1.096997   -0.70128316
  1.192281    0.23348641 -1.002352    1.0612178 ]
1. Map operation: this pattern applies the same function independently to every element.

Input:  [1, 2, 3, 4]
         ↓  ↓  ↓  ↓   (square each element)
Output: [1, 4, 9, 16]
2. Reduce operation: this pattern combines all elements into a single result. In our example, we sum all elements in parallel using a tree-based approach.

Level 0: [1  2  3  4  5  6  7  8]
Level 1: [3     7     11    15]
Level 2: [10          26]
Level 3: [36]
3. Scan operation: this pattern computes running totals; each output element contains the sum of all preceding input elements.

Input:  [1 2 3 4 5]
Output: [1 3 6 10 15]
         ↑ ↑   ↑
         1 1+2 1+2+3  etc.
4. Element-wise operation: this pattern performs an operation between corresponding elements of multiple arrays. In our example, we add two vectors element by element.

Array A: [1 2 3 4]
Array B: [5 6 7 8]
          ↓ ↓ ↓ ↓   (add corresponding elements)
Output:  [6 8 10 12]
Each pattern demonstrates a different aspect of parallel processing. The main advantage of running these patterns on a GPU is that the operation executes on many elements simultaneously, rather than one at a time as on a CPU.
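For the map, reduce, and element-wise patterns you do not always need to write raw kernels: PyCUDA ships `ElementwiseKernel` and `ReductionKernel` helpers that generate them for you. A short sketch (kernel names here are arbitrary):

import pycuda.autoinit
import numpy as np
from pycuda import gpuarray
from pycuda.elementwise import ElementwiseKernel
from pycuda.reduction import ReductionKernel

# Map / element-wise: a generated elementwise kernel
square = ElementwiseKernel(
    "float *x, float *y",
    "y[i] = x[i] * x[i]",
    "square_kernel")

# Reduce: a generated tree reduction (here a fused map + reduce: dot product)
dot = ReductionKernel(np.float32, neutral="0",
                      reduce_expr="a+b", map_expr="x[i]*y[i]",
                      arguments="float *x, float *y")

x = gpuarray.to_gpu(np.random.randn(1000).astype(np.float32))
y = gpuarray.empty_like(x)

square(x, y)            # map pattern
print(dot(x, x).get())  # reduce pattern (sum of squares)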
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule

# 1. Image Processing Examples

# 1.1 RGB to Grayscale Conversion
rgb_to_gray_kernel = SourceModule("""
__global__ void rgb_to_grayscale(unsigned char *rgb, unsigned char *gray,
                                 int width, int height)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    if (x < width && y < height) {
        int idx = y * width + x;
        int rgb_idx = idx * 3;

        // Standard grayscale conversion weights
        gray[idx] = (unsigned char)(
            0.299f * rgb[rgb_idx] +      // Red
            0.587f * rgb[rgb_idx + 1] +  // Green
            0.114f * rgb[rgb_idx + 2]    // Blue
        );
    }
}
""")

def demonstrate_grayscale():
    # Create sample RGB image (100x100 pixels)
    width, height = 100, 100
    rgb_image = np.random.randint(0, 256, (height, width, 3), dtype=np.uint8)
    gray_image = np.zeros((height, width), dtype=np.uint8)

    # Get the kernel function
    rgb_to_gray = rgb_to_gray_kernel.get_function("rgb_to_grayscale")

    # Execute kernel with 16x16 thread blocks
    block_size = (16, 16, 1)
    grid_size = ((width + block_size[0] - 1) // block_size[0],
                 (height + block_size[1] - 1) // block_size[1])

    rgb_to_gray(
        drv.In(rgb_image), drv.Out(gray_image),
        np.int32(width), np.int32(height),
        block=block_size, grid=grid_size
    )

    print("\nGrayscale Conversion:")
    print("Original RGB shape:", rgb_image.shape)
    print("Grayscale shape:", gray_image.shape)
    print("Sample pixel (RGB):", rgb_image[0, 0])
    print("Sample pixel (Gray):", gray_image[0, 0])

# 1.2 Simple Blur Filter
blur_kernel = SourceModule("""
__global__ void blur_filter(unsigned char *input, unsigned char *output,
                            int width, int height)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    if (x < width && y < height) {
        int idx = y * width + x;
        float sum = 0.0f;
        int count = 0;

        // Simple 3x3 average blur
        for (int dy = -1; dy <= 1; dy++) {
            for (int dx = -1; dx <= 1; dx++) {
                int nx = x + dx;
                int ny = y + dy;
                if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
                    sum += input[ny * width + nx];
                    count++;
                }
            }
        }

        output[idx] = (unsigned char)(sum / count);
    }
}
""")

def demonstrate_blur():
    # Create sample grayscale image
    width, height = 100, 100
    input_image = np.random.randint(0, 256, (height, width), dtype=np.uint8)
    output_image = np.zeros_like(input_image)

    # Get the kernel function
    blur = blur_kernel.get_function("blur_filter")

    # Execute kernel
    block_size = (16, 16, 1)
    grid_size = ((width + block_size[0] - 1) // block_size[0],
                 (height + block_size[1] - 1) // block_size[1])

    blur(
        drv.In(input_image), drv.Out(output_image),
        np.int32(width), np.int32(height),
        block=block_size, grid=grid_size
    )

    print("\nBlur Filter:")
    print("Input shape:", input_image.shape)
    print("Sample 3x3 region before blur:\n", input_image[0:3, 0:3])
    print("Sample 3x3 region after blur:\n", output_image[0:3, 0:3])

# 2. Matrix Operations

# 2.1 Matrix Multiplication
matrix_mul_kernel = SourceModule("""
__global__ void matrix_multiply(float *a, float *b, float *c,
                                int m, int n, int p)
{
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    int col = threadIdx.x + blockIdx.x * blockDim.x;

    if (row < m && col < p) {
        float sum = 0;
        for (int k = 0; k < n; k++) {
            sum += a[row * n + k] * b[k * p + col];
        }
        c[row * p + col] = sum;
    }
}
""")

def demonstrate_matrix_multiply():
    # Create sample matrices
    m, n, p = 4, 3, 2  # Small matrices for demonstration
    a = np.random.randn(m, n).astype(np.float32)
    b = np.random.randn(n, p).astype(np.float32)
    c = np.zeros((m, p), dtype=np.float32)

    # Get the kernel function
    matrix_multiply = matrix_mul_kernel.get_function("matrix_multiply")

    # Execute kernel
    block_size = (16, 16, 1)
    grid_size = ((p + block_size[0] - 1) // block_size[0],
                 (m + block_size[1] - 1) // block_size[1])

    matrix_multiply(
        drv.In(a), drv.In(b), drv.Out(c),
        np.int32(m), np.int32(n), np.int32(p),
        block=block_size, grid=grid_size
    )

    print("\nMatrix Multiplication:")
    print("Matrix A:\n", a)
    print("Matrix B:\n", b)
    print("Result (GPU):\n", c)
    print("Result (CPU verification):\n", np.dot(a, b))

# 2.2 Matrix Transpose
transpose_kernel = SourceModule("""
__global__ void matrix_transpose(float *input, float *output,
                                 int width, int height)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    if (x < width && y < height) {
        output[x * height + y] = input[y * width + x];
    }
}
""")

def demonstrate_transpose():
    # Create sample matrix
    width, height = 4, 3
    input_matrix = np.random.randn(height, width).astype(np.float32)
    output_matrix = np.zeros((width, height), dtype=np.float32)

    # Get the kernel function
    transpose = transpose_kernel.get_function("matrix_transpose")

    # Execute kernel
    block_size = (16, 16, 1)
    grid_size = ((width + block_size[0] - 1) // block_size[0],
                 (height + block_size[1] - 1) // block_size[1])

    transpose(
        drv.In(input_matrix), drv.Out(output_matrix),
        np.int32(width), np.int32(height),
        block=block_size, grid=grid_size
    )

    print("\nMatrix Transpose:")
    print("Input matrix:\n", input_matrix)
    print("Transposed (GPU):\n", output_matrix)
    print("Transposed (CPU verification):\n", input_matrix.T)

# Run demonstrations
if __name__ == "__main__":
    demonstrate_grayscale()
    demonstrate_blur()
    demonstrate_matrix_multiply()
    demonstrate_transpose()
Grayscale Conversion:
Original RGB shape: (100, 100, 3)
Grayscale shape: (100, 100)
Sample pixel (RGB): [ 20 193 233]
Sample pixel (Gray): 145

Blur Filter:
Input shape: (100, 100)
Sample 3x3 region before blur:
 [[ 14 251  61]
 [136  80  77]
 [221 229   9]]
Sample 3x3 region after blur:
 [[120 103 144]
 [155 119 125]
 [149 113  73]]

Matrix Multiplication:
Matrix A:
 [[-0.66616106 -0.10933682  0.5089306 ]
 [-0.12389754 -0.28968918  0.97915065]
 [ 0.14836024  0.61603856 -0.20268178]
 [-0.74302     0.47540665 -0.32892975]]
Matrix B:
 [[-1.8884981   0.39611024]
 [ 2.1660886  -1.6947594 ]
 [ 0.52779263  1.1569128 ]]
Result (GPU):
 [[ 1.2898206   0.51021475]
 [ 0.12327631  1.5746683 ]
 [ 0.9472421  -1.2197553 ]
 [ 2.2593582  -1.4805607 ]]
Result (CPU verification):
 [[ 1.2898206   0.51021475]
 [ 0.12327631  1.5746683 ]
 [ 0.9472421  -1.2197553 ]
 [ 2.2593582  -1.4805607 ]]

Matrix Transpose:
Input matrix:
 [[ 0.92167944 -1.2641314  -0.10569248  0.9058027 ]
 [-0.5706148   0.74336165  2.5295794   0.15187392]
 [-0.06107518  0.76098543  0.71997863  0.93019915]]
Transposed (GPU):
 [[ 0.92167944 -0.5706148  -0.06107518]
 [-1.2641314   0.74336165  0.76098543]
 [-0.10569248  2.5295794   0.71997863]
 [ 0.9058027   0.15187392  0.93019915]]
Transposed (CPU verification):
 [[ 0.92167944 -0.5706148  -0.06107518]
 [-1.2641314   0.74336165  0.76098543]
 [-0.10569248  2.5295794   0.71997863]
 [ 0.9058027   0.15187392  0.93019915]]
A. RGB-to-grayscale conversion: a basic image-processing operation that turns a color image into black and white.

RGB Image (3 channels)      Grayscale Image (1 channel)
[R   G   B]                 [Gray]
[230 120 40]        →       [150]

The conversion uses a weighted sum of the RGB values: Gray = 0.299*R + 0.587*G + 0.114*B.
2. Blur filter: a simple 3x3 averaging blur that smooths the image by averaging each pixel with its neighbors.

Original Image      Blurred Image
[5 2 7]             [4 4 5]
[1 9 3]      →      [4 5 4]
[6 4 8]             [5 5 6]
3. Matrix operations
A. Matrix multiplication: multiplies two matrices in parallel; each thread computes one element of the result matrix.

Matrix A     Matrix B     Result C
[1 2]        [5 6]        [19 22]
[3 4]    ×   [7 8]    =   [43 50]
4. Matrix transpose: flips the matrix along its diagonal, turning rows into columns and vice versa.

Original Matrix     Transposed Matrix
[1 2 3]             [1 4]
[4 5 6]      →      [2 5]
                    [3 6]
For both the image-processing and matrix operations above, the advantage of using the GPU is the same: each output pixel or matrix element is computed by its own thread, in parallel.
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
import numpy as np
import time

# 1. Thread configuration optimization
thread_config_kernel = SourceModule("""
__global__ void vector_add(float *a, float *b, float *c, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        c[idx] = a[idx] + b[idx];
}
""")

def test_thread_configurations():
    # Test data
    n = 10000000
    a = np.random.randn(n).astype(np.float32)
    b = np.random.randn(n).astype(np.float32)
    c = np.zeros_like(a)

    vector_add = thread_config_kernel.get_function("vector_add")

    # Test different thread configurations
    thread_configs = [32, 64, 128, 256, 512, 1024]
    times = []

    print("\nThread Configuration Tests:")
    print("Threads per Block | Grid Size | Time (ms)")
    print("-" * 45)

    for threads_per_block in thread_configs:
        grid_size = (n + threads_per_block - 1) // threads_per_block

        # Time the kernel execution
        start = drv.Event()
        end = drv.Event()

        start.record()
        vector_add(
            drv.In(a), drv.In(b), drv.Out(c), np.int32(n),
            block=(threads_per_block, 1, 1),
            grid=(grid_size, 1)
        )
        end.record()
        end.synchronize()

        time_ms = start.time_till(end)
        times.append(time_ms)
        print(f"{threads_per_block:^17} | {grid_size:^9} | {time_ms:^8.2f}")

# 2. Memory transfer optimization
def demonstrate_memory_transfer():
    n = 10000000
    data = np.random.randn(n).astype(np.float32)

    print("\nMemory Transfer Tests:")

    # Test 1: many small transfers
    start_time = time.time()
    chunk_size = 1000000
    for i in range(0, n, chunk_size):
        chunk = data[i:i+chunk_size]
        gpu_mem = drv.mem_alloc(chunk.nbytes)
        drv.memcpy_htod(gpu_mem, chunk)
    small_transfer_time = time.time() - start_time

    # Test 2: one large transfer
    start_time = time.time()
    gpu_mem = drv.mem_alloc(data.nbytes)
    drv.memcpy_htod(gpu_mem, data)
    large_transfer_time = time.time() - start_time

    print(f"Multiple small transfers: {small_transfer_time:.4f} seconds")
    print(f"Single large transfer: {large_transfer_time:.4f} seconds")
    print(f"Speedup: {small_transfer_time/large_transfer_time:.2f}x")

# 3. Coalesced vs. non-coalesced memory access
coalesced_kernel = SourceModule("""
// Coalesced memory access
__global__ void coalesced_access(float *input, float *output, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        output[idx] = input[idx];
}

// Non-coalesced (strided) memory access
__global__ void noncoalesced_access(float *input, float *output, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n/2)
        output[idx] = input[idx * 2];
}
""")

def demonstrate_memory_access():
    n = 10000000
    input_data = np.random.randn(n).astype(np.float32)
    output_data = np.zeros_like(input_data)

    coalesced = coalesced_kernel.get_function("coalesced_access")
    noncoalesced = coalesced_kernel.get_function("noncoalesced_access")

    print("\nMemory Access Pattern Tests:")

    # Test coalesced access
    start = drv.Event()
    end = drv.Event()

    start.record()
    coalesced(
        drv.In(input_data), drv.Out(output_data), np.int32(n),
        block=(256, 1, 1), grid=((n + 255) // 256, 1)
    )
    end.record()
    end.synchronize()
    coalesced_time = start.time_till(end)

    # Test non-coalesced access
    start.record()
    noncoalesced(
        drv.In(input_data), drv.Out(output_data), np.int32(n),
        block=(256, 1, 1), grid=((n + 255) // 256, 1)
    )
    end.record()
    end.synchronize()
    noncoalesced_time = start.time_till(end)

    print(f"Coalesced access time: {coalesced_time:.2f} ms")
    print(f"Non-coalesced access time: {noncoalesced_time:.2f} ms")
    print(f"Slowdown factor: {noncoalesced_time/coalesced_time:.2f}x")

# 4. Basic profiling example
def profile_kernel_execution():
    n = 10000000
    input_data = np.random.randn(n).astype(np.float32)

    # Create timing events
    start = drv.Event()
    end = drv.Event()

    print("\nKernel Profiling:")
    print("Operation           | Time (ms)")
    print("-" * 35)

    # Time memory allocation
    start.record()
    gpu_mem = drv.mem_alloc(input_data.nbytes)
    end.record()
    end.synchronize()
    print(f"Memory allocation   | {start.time_till(end):8.2f}")

    # Time host-to-device transfer
    start.record()
    drv.memcpy_htod(gpu_mem, input_data)
    end.record()
    end.synchronize()
    print(f"Host to Device copy | {start.time_till(end):8.2f}")

    # Time kernel execution (vector_add reused with one buffer)
    vector_add = thread_config_kernel.get_function("vector_add")
    start.record()
    vector_add(gpu_mem, gpu_mem, gpu_mem, np.int32(n),
               block=(256, 1, 1), grid=((n + 255) // 256, 1))
    end.record()
    end.synchronize()
    print(f"Kernel execution    | {start.time_till(end):8.2f}")

    # Time device-to-host transfer
    start.record()
    drv.memcpy_dtoh(input_data, gpu_mem)
    end.record()
    end.synchronize()
    print(f"Device to Host copy | {start.time_till(end):8.2f}")

if __name__ == "__main__":
    test_thread_configurations()
    demonstrate_memory_transfer()
    demonstrate_memory_access()
    profile_kernel_execution()
Thread Configuration Tests:
Threads per Block | Grid Size | Time (ms)
---------------------------------------------
       32         |  312500   |  26.43
       64         |  156250   |  29.92
      128         |   78125   |  29.33
      256         |   39063   |  29.32
      512         |   19532   |  29.35
     1024         |    9766   |  29.25

Memory Transfer Tests:
Multiple small transfers: 0.0124 seconds
Single large transfer: 0.0085 seconds
Speedup: 1.46x

Memory Access Pattern Tests:
Coalesced access time: 16.55 ms
Non-coalesced access time: 19.77 ms
Slowdown factor: 1.20x

Kernel Profiling:
Operation           | Time (ms)
-----------------------------------
Memory allocation   |     0.43
Host to Device copy |     8.27
Kernel execution    |     0.17
Device to Host copy |     6.47
1. Thread configuration optimization: finding the best number of threads per block. Key points:

Threads/Block: 32 64 128 256 512 1024
Performance:   ▂  ▄  ▆   █   ▆   ▄
2. Memory transfer optimization
Demonstrates the importance of batching transfers:

Multiple small transfers:
[Data] → [GPU] → [Data] → [GPU] → [Data] → [GPU]
(Slow, high overhead)

Single large transfer:
[Data Data Data Data] → [GPU]
(Fast, low overhead)
3. Coalesced vs. non-coalesced memory access
Shows the importance of memory access patterns:

Coalesced Access:
Thread 0: [0] →
Thread 1: [1] →    Memory: [0][1][2][3]
Thread 2: [2] →
Thread 3: [3] →

Non-coalesced Access:
Thread 0: [0] ----→
Thread 1: [2] ----→    Memory: [0][1][2][3]
Thread 2: [4] ----→
Thread 3: [6] ----→
4. Basic profiling
Shows how to measure the performance of the individual operations:

|--alloc--|--H2D--|--kernel--|--D2H--|
Key optimization tips:
1. Thread configuration:
2. Memory transfers (see the pinned-memory sketch after this list):
3. Memory access:
4. Profiling:
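One transfer-side technique not exercised above is page-locked (pinned) host memory, which generally speeds up host-device copies and is required for truly asynchronous transfers. A sketch using PyCUDA's `pagelocked_empty` (sizes are illustrative):

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np

n = 10_000_000

# Pinned (page-locked) host buffer: the GPU can DMA from it directly
pinned = drv.pagelocked_empty(n, dtype=np.float32)
pinned[:] = np.random.randn(n)

gpu_buf = drv.mem_alloc(pinned.nbytes)
stream = drv.Stream()

# Asynchronous copies are only legal from page-locked host memory
drv.memcpy_htod_async(gpu_buf, pinned, stream)
stream.synchronize()

gpu_buf.free()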
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule
import gc

# 1. Memory Leak Examples and Solutions
def demonstrate_memory_leak():
    print("\n1. Memory Leak Examples and Solutions")
    print("=" * 50)

    # Bad practice - not freeing GPU memory
    def bad_memory_management():
        for i in range(5):
            data = np.random.randn(1000000).astype(np.float32)
            gpu_mem = drv.mem_alloc(data.nbytes)  # Allocated but never freed
            drv.memcpy_htod(gpu_mem, data)
            print(f"Iteration {i+1}: Allocated {data.nbytes/1024/1024:.2f} MB")

    # Good practice - properly freeing GPU memory
    def good_memory_management():
        for i in range(5):
            data = np.random.randn(1000000).astype(np.float32)
            gpu_mem = drv.mem_alloc(data.nbytes)
            drv.memcpy_htod(gpu_mem, data)
            print(f"Iteration {i+1}: Allocated {data.nbytes/1024/1024:.2f} MB")
            gpu_mem.free()  # Properly free GPU memory

    print("Bad Memory Management (will eventually cause out of memory):")
    try:
        bad_memory_management()
    except drv.MemoryError:
        print("Out of GPU memory!")

    # Clear GPU memory
    gc.collect()

    print("\nGood Memory Management:")
    good_memory_management()

# 2. Synchronization Issues
synchronization_kernel = SourceModule("""
__global__ void increment_array(float *data, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        data[idx] += 1;
}
""")

def demonstrate_synchronization():
    print("\n2. Synchronization Issues")
    print("=" * 50)

    # Bad practice - no synchronization
    def bad_synchronization():
        data = np.ones(1000000, dtype=np.float32)
        gpu_mem = drv.mem_alloc(data.nbytes)
        drv.memcpy_htod(gpu_mem, data)

        increment = synchronization_kernel.get_function("increment_array")

        # Multiple kernel launches without synchronization
        for _ in range(5):
            increment(gpu_mem, np.int32(len(data)),
                      block=(256, 1, 1),
                      grid=((len(data) + 255) // 256, 1))
            # No synchronization here

        drv.memcpy_dtoh(data, gpu_mem)
        gpu_mem.free()
        return data[0]

    # Good practice - proper synchronization
    def good_synchronization():
        data = np.ones(1000000, dtype=np.float32)
        gpu_mem = drv.mem_alloc(data.nbytes)
        drv.memcpy_htod(gpu_mem, data)

        increment = synchronization_kernel.get_function("increment_array")

        # Multiple kernel launches with synchronization
        for _ in range(5):
            increment(gpu_mem, np.int32(len(data)),
                      block=(256, 1, 1),
                      grid=((len(data) + 255) // 256, 1))
            drv.Context.synchronize()  # Proper synchronization

        drv.memcpy_dtoh(data, gpu_mem)
        gpu_mem.free()
        return data[0]

    print("Bad Synchronization Result:", bad_synchronization())
    print("Good Synchronization Result:", good_synchronization())

# 3. Dimensioning Errors
dimension_kernel = SourceModule("""
__global__ void process_matrix(float *matrix, int width, int height)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    if (x < width && y < height) {
        int idx = y * width + x;
        matrix[idx] *= 2;
    }
}
""")

def demonstrate_dimensioning():
    print("\n3. Dimensioning Errors")
    print("=" * 50)

    # Bad practice - hard-coded dimensions
    def bad_dimensioning():
        width, height = 1024, 1024
        data = np.random.randn(height, width).astype(np.float32)
        gpu_mem = drv.mem_alloc(data.nbytes)
        drv.memcpy_htod(gpu_mem, data)

        process = dimension_kernel.get_function("process_matrix")

        # Hard-coded block/grid dimensions: a 32x32 grid of 32x32 blocks
        # happens to cover 1024x1024 exactly (hence 0 unprocessed elements
        # in the sample output), but silently breaks for other sizes
        process(gpu_mem, np.int32(width), np.int32(height),
                block=(32, 32, 1), grid=(32, 32, 1))

        drv.memcpy_dtoh(data, gpu_mem)
        gpu_mem.free()
        return np.sum(data == 0)  # Count unprocessed elements

    # Good practice - dimensions derived from the problem size
    def good_dimensioning():
        width, height = 1024, 1024
        data = np.random.randn(height, width).astype(np.float32)
        gpu_mem = drv.mem_alloc(data.nbytes)
        drv.memcpy_htod(gpu_mem, data)

        process = dimension_kernel.get_function("process_matrix")

        # Correct block/grid dimensions
        block_size = (32, 32, 1)
        grid_size = ((width + block_size[0] - 1) // block_size[0],
                     (height + block_size[1] - 1) // block_size[1])

        process(gpu_mem, np.int32(width), np.int32(height),
                block=block_size, grid=grid_size)

        drv.memcpy_dtoh(data, gpu_mem)
        gpu_mem.free()
        return np.sum(data == 0)  # Count unprocessed elements

    print("Unprocessed elements (Bad):", bad_dimensioning())
    print("Unprocessed elements (Good):", good_dimensioning())

# 4. Resource Management
def demonstrate_resource_management():
    print("\n4. Resource Management")
    print("=" * 50)

    # Bad practice - not using a context manager
    def bad_resource_management():
        data = np.random.randn(1000000).astype(np.float32)
        gpu_mem = drv.mem_alloc(data.nbytes)
        # If an error occurs here, the memory won't be freed
        drv.memcpy_htod(gpu_mem, data)
        return gpu_mem

    # Good practice - using a context manager
    class GPUMemoryContext:
        def __init__(self, data):
            self.data = data
            self.gpu_mem = None

        def __enter__(self):
            self.gpu_mem = drv.mem_alloc(self.data.nbytes)
            drv.memcpy_htod(self.gpu_mem, self.data)
            return self.gpu_mem

        def __exit__(self, exc_type, exc_val, exc_tb):
            if self.gpu_mem:
                self.gpu_mem.free()

    def good_resource_management():
        data = np.random.randn(1000000).astype(np.float32)
        with GPUMemoryContext(data) as gpu_mem:
            # Memory will be automatically freed after this block
            return gpu_mem

    print("Bad Resource Management:")
    try:
        mem = bad_resource_management()  # Memory leak if we forget to free
    except Exception as e:
        print(f"Error occurred: {e}")

    print("\nGood Resource Management:")
    try:
        with GPUMemoryContext(np.random.randn(1000000).astype(np.float32)) as mem:
            print("Memory allocated and will be automatically freed")
    except Exception as e:
        print(f"Error occurred but memory was freed: {e}")

if __name__ == "__main__":
    demonstrate_memory_leak()
    demonstrate_synchronization()
    demonstrate_dimensioning()
    demonstrate_resource_management()
1. Memory Leak Examples and Solutions
==================================================
Bad Memory Management (will eventually cause out of memory):
Iteration 1: Allocated 3.81 MB
Iteration 2: Allocated 3.81 MB
Iteration 3: Allocated 3.81 MB
Iteration 4: Allocated 3.81 MB
Iteration 5: Allocated 3.81 MB

Good Memory Management:
Iteration 1: Allocated 3.81 MB
Iteration 2: Allocated 3.81 MB
Iteration 3: Allocated 3.81 MB
Iteration 4: Allocated 3.81 MB
Iteration 5: Allocated 3.81 MB

2. Synchronization Issues
==================================================
Bad Synchronization Result: 6.0
Good Synchronization Result: 6.0

3. Dimensioning Errors
==================================================
Unprocessed elements (Bad): 0
Unprocessed elements (Good): 0

4. Resource Management
==================================================
Bad Resource Management:

Good Resource Management:
Memory allocated and will be automatically freed
Problem: not freeing GPU memory after use. Visual representation:

Bad Practice:
Iteration 1: [Memory Allocated]
Iteration 2: [Memory Allocated]
Iteration 3: [Memory Allocated] → OUT OF MEMORY!

Good Practice:
Iteration 1: [Memory Allocated] → [Freed]
Iteration 2: [Memory Allocated] → [Freed]
Iteration 3: [Memory Allocated] → [Freed]

Solutions:
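Besides freeing every allocation explicitly (or relying on `drv.In`/`drv.Out`), a memory pool both avoids leaks and cuts allocation overhead by recycling buffers. A sketch using PyCUDA's `pycuda.tools.DeviceMemoryPool`:

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.tools import DeviceMemoryPool

pool = DeviceMemoryPool()

for i in range(5):
    data = np.random.randn(1_000_000).astype(np.float32)
    # The allocation is drawn from the pool and returned to it
    # automatically when dev_buf is garbage-collected
    dev_buf = pool.allocate(data.nbytes)
    drv.memcpy_htod(dev_buf, data)

pool.free_held()  # Release any memory the pool is still holding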
2. Synchronization issues
Problem: GPU operations are not synchronized correctly. Visual representation:

Bad Practice (Race Condition):
Operation 1 ----→
Operation 2 --→
Operation 3 ------→
(Undefined order)

Good Practice:
Operation 1 ----→ |
                  | Sync
Operation 2 ----→ |
                  | Sync
Operation 3 ----→ |

Solutions:
3. Dimensioning errors
Problem: incorrect thread/block dimensions. Visual representation:

Bad Practice:
Matrix Size: 1024x1024
Thread Blocks: 32x32
Grid: 32x32
Coverage: [▓▓▓░░░] Incomplete!

Good Practice:
Matrix Size: 1024x1024
Thread Blocks: 32x32
Grid: (32+)x(32+)
Coverage: [▓▓▓▓▓▓] Complete!

Solutions:
4. Resource management
Problem: GPU resources are handled improperly. Visual representation:

Bad Practice:
[Allocate Memory] → [Use Memory] → [??? Memory Leak]

Good Practice:
[Context Start] → [Allocate Memory] → [Use Memory] → [Auto Free] → [Context End]

Solutions:
Key best practices:
1. Memory management:
2. Synchronization:
3. Dimensioning:
4. Resource management:
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule
import time

# 1. Shared Memory Example
# Demonstrates using shared memory for faster matrix operations
shared_memory_kernel = SourceModule("""
__global__ void matrix_mul_shared(float *a, float *b, float *c, int width)
{
    __shared__ float shared_a[16][16];
    __shared__ float shared_b[16][16];

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int row = blockIdx.y * blockDim.y + ty;
    int col = blockIdx.x * blockDim.x + tx;

    float sum = 0.0f;

    // Loop over the tiles
    for (int tile = 0; tile < width/16; tile++) {
        // Load data into shared memory
        shared_a[ty][tx] = a[row * width + tile * 16 + tx];
        shared_b[ty][tx] = b[(tile * 16 + ty) * width + col];
        __syncthreads();  // Ensure all data is loaded

        // Compute the partial dot product
        for (int k = 0; k < 16; k++) {
            sum += shared_a[ty][k] * shared_b[k][tx];
        }
        __syncthreads();  // Wait before the next iteration
    }

    c[row * width + col] = sum;
}
""")

def demonstrate_shared_memory():
    print("\n1. Shared Memory Example")
    print("=" * 50)

    # Matrix dimensions
    width = 64  # Must be a multiple of 16 for this example

    # Create input matrices
    a = np.random.randn(width, width).astype(np.float32)
    b = np.random.randn(width, width).astype(np.float32)
    c = np.zeros((width, width), dtype=np.float32)

    # Get the kernel function
    matrix_mul = shared_memory_kernel.get_function("matrix_mul_shared")

    # Execute kernel with timing
    start = drv.Event()
    end = drv.Event()

    start.record()
    matrix_mul(
        drv.In(a), drv.In(b), drv.Out(c), np.int32(width),
        block=(16, 16, 1),
        grid=(width//16, width//16)
    )
    end.record()
    end.synchronize()

    print(f"Shared Memory Execution Time: {start.time_till(end):.2f} ms")
    print(f"Result matches CPU: {np.allclose(c, np.dot(a, b))}")

# 2. Atomic Operations Example
atomic_kernel = SourceModule("""
__global__ void histogram_atomic(int *data, int *hist, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {
        // Atomic increment of the histogram bin
        atomicAdd(&hist[data[tid]], 1);
    }
}
""")

def demonstrate_atomic_operations():
    print("\n2. Atomic Operations Example")
    print("=" * 50)

    # Create random data for the histogram
    n = 1000000
    num_bins = 256
    data = np.random.randint(0, num_bins, n).astype(np.int32)
    hist = np.zeros(num_bins, dtype=np.int32)

    # Get the kernel function
    histogram = atomic_kernel.get_function("histogram_atomic")

    # Execute kernel
    # Note: drv.Out does not copy the zeroed host array to the device, so
    # the bins start out uninitialized; this is why the sample output below
    # reports a mismatch. Using drv.InOut(hist) would initialize the bins.
    histogram(
        drv.In(data), drv.Out(hist), np.int32(n),
        block=(256, 1, 1),
        grid=((n + 255) // 256, 1)
    )

    # Verify with CPU computation
    cpu_hist = np.bincount(data, minlength=num_bins)
    print("Histogram computation matches CPU:", np.array_equal(hist, cpu_hist))
    print("Sample histogram bins (0-9):", hist[:10])

# 3. Streams and Events Example
def demonstrate_streams():
    print("\n3. Streams and Events Example")
    print("=" * 50)

    # Create two streams for concurrent execution
    stream1 = drv.Stream()
    stream2 = drv.Stream()

    # Create some data
    n = 1000000
    data1 = np.random.randn(n).astype(np.float32)
    data2 = np.random.randn(n).astype(np.float32)

    # Allocate GPU memory
    gpu_data1 = drv.mem_alloc(data1.nbytes)
    gpu_data2 = drv.mem_alloc(data2.nbytes)

    # Create events for timing
    start1 = drv.Event()
    end1 = drv.Event()
    start2 = drv.Event()
    end2 = drv.Event()

    # Execute operations in different streams
    start1.record(stream1)
    drv.memcpy_htod_async(gpu_data1, data1, stream1)
    end1.record(stream1)

    start2.record(stream2)
    drv.memcpy_htod_async(gpu_data2, data2, stream2)
    end2.record(stream2)

    # Synchronize and get timing
    end1.synchronize()
    end2.synchronize()

    print(f"Stream 1 time: {start1.time_till(end1):.2f} ms")
    print(f"Stream 2 time: {start2.time_till(end2):.2f} ms")

    # Clean up
    gpu_data1.free()
    gpu_data2.free()

# 4. Advanced Memory Patterns Example
memory_pattern_kernel = SourceModule("""
// Example of memory coalescing and bank-conflict avoidance
__global__ void transpose_coalesced(float *input, float *output,
                                    int width, int height)
{
    __shared__ float tile[16][16+1];  // Padded to avoid bank conflicts

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height) {
        // Coalesced read from global memory
        tile[threadIdx.y][threadIdx.x] = input[y * width + x];
    }
    __syncthreads();

    // Transpose the block indices
    x = blockIdx.y * blockDim.y + threadIdx.x;
    y = blockIdx.x * blockDim.x + threadIdx.y;

    if (x < height && y < width) {
        // Coalesced write to global memory
        output[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }
}
""")

def demonstrate_memory_patterns():
    print("\n4. Advanced Memory Patterns Example")
    print("=" * 50)

    # Create a test matrix
    width, height = 1024, 1024
    input_matrix = np.random.randn(height, width).astype(np.float32)
    output_matrix = np.zeros((width, height), dtype=np.float32)

    # Get the kernel function
    transpose = memory_pattern_kernel.get_function("transpose_coalesced")

    # Execute and time the kernel
    start = drv.Event()
    end = drv.Event()

    start.record()
    transpose(
        drv.In(input_matrix), drv.Out(output_matrix),
        np.int32(width), np.int32(height),
        block=(16, 16, 1),
        grid=((width + 15) // 16, (height + 15) // 16)
    )
    end.record()
    end.synchronize()

    print(f"Coalesced transpose time: {start.time_till(end):.2f} ms")
    print("Transpose correct:", np.allclose(output_matrix, input_matrix.T))

if __name__ == "__main__":
    demonstrate_shared_memory()
    demonstrate_atomic_operations()
    demonstrate_streams()
    demonstrate_memory_patterns()
1. Shared Memory Example
==================================================
Shared Memory Execution Time: 0.34 ms
Result matches CPU: True

2. Atomic Operations Example
==================================================
Histogram computation matches CPU: False
Sample histogram bins (0-9): [-1085608738 -1093652818  1068795183 -1088075006 -1096309612 -1074392307
 -1087534583  1066090569  1064575260  1041087911]

3. Streams and Events Example
==================================================
Stream 1 time: 0.96 ms
Stream 2 time: 1.07 ms

4. Advanced Memory Patterns Example
==================================================
Coalesced transpose time: 3.11 ms
Transpose correct: True
Global Memory (Slow)
        ↓
Shared Memory (Fast)
        ↓
Threads in Block

Without Shared Memory:
Thread 1: [Global] → [Compute]
Thread 2: [Global] → [Compute]

With Shared Memory:
Thread 1: [Global] → [Shared] → [Compute]
Thread 2: [Global] → [Shared] → [Compute]
Benefits:
2. Atomic operations: operations that complete in a single uninterruptible step.

Without Atomic:
Thread 1: Read(5) → Add(1) → Write(6)
Thread 2: Read(5) → Add(1) → Write(6)
Result: 6 (Wrong!)

With Atomic:
Thread 1: AtomicAdd(1) → 6
Thread 2: AtomicAdd(1) → 7
Result: 7 (Correct!)

Use cases:
3. Streams and events: allow operations to execute concurrently.

Sequential:
[Operation 1] → [Operation 2] → [Operation 3]

With Streams:
Stream 1: [Operation 1] → [Operation 3]
Stream 2: [Operation 2] → [Operation 4]

Benefits:
4. Advanced memory patterns: techniques for optimal memory access.

Bad Pattern (Strided):
Thread 0: [0] → [4] → [8]
Thread 1: [1] → [5] → [9]
(Many memory transactions)

Good Pattern (Coalesced):
Thread 0: [0] → [1] → [2]
Thread 1: [3] → [4] → [5]
(Fewer memory transactions)

Key concepts:
Tips for using the advanced features:
1. Shared memory:
2. Atomic operations:
3. Streams:
4. Memory patterns:
Exploring CUDA programming through PyCUDA demonstrates the power of GPU parallel processing for computational tasks. Through the practical examples, we have seen how GPU acceleration can significantly improve the performance of a wide range of operations.
Key takeaways from our exploration include:
- GPU acceleration is most effective for large-scale computations that can take full advantage of parallelization
- Proper memory management between host and device is critical for achieving good performance
- Thread hierarchy and block organization are fundamental concepts of efficient CUDA programming
- Hardware-specific limits must be considered when designing CUDA kernels
While PyCUDA provides a Python interface for CUDA programming, understanding the underlying concepts helps developers optimize their code and make better use of GPU resources in machine learning applications.
Reference:
https://levelup.gitconnected.com/pycuda-from-ground-up-a-comprehensive-guide-6458f0c5df0e