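Pinned Memory: Multi-Device Asynchronous Copy

梦里梦外; 2022-12-23 06:56

Calling cudaMallocHost allocates pinned (page-locked) host memory. Compared with pageable memory, pinned memory copies to the device faster. It also allows asynchronous copies to several devices to be in flight at the same moment, so the transfers overlap.

### Using cudaMallocHost ###

    cudaMallocHost  // allocate pinned memory
    cudaFreeHost    // free pinned memory

### Copy performance: pinned vs. pageable memory ###

Allocate 400 MiB of pinned memory and 400 MiB of pageable memory, copy each buffer to the device, and read the execution times from nsys.

Code:

    #include <cstdlib>
    #include <cuda_runtime.h>

    void TestHostPinnedMem()
    {
        // Compare pinned and pageable host-to-device memcpy.
        const int size = 1024 * 1024 * 100;  // 100 Mi ints = 400 MiB

        // Pageable source buffer.
        int* h1 = (int*)malloc(size * sizeof(int));
        int* d1;
        cudaMalloc(&d1, size * sizeof(int));
        cudaMemcpy(d1, h1, size * sizeof(int), cudaMemcpyHostToDevice);

        // Pinned source buffer, copied into the same device buffer.
        int* h2;
        cudaMallocHost(&h2, size * sizeof(int));
        cudaMemcpy(d1, h2, size * sizeof(int), cudaMemcpyHostToDevice);

        cudaFree(d1);
        free(h1);
        cudaFreeHost(h2);
    }

Copy time for pageable memory:

    Begins: 0.291734s
    Ends: 0.37133s (+79.596 ms)
    HtoD memcpy 419,430,400 bytes
    Source memory kind: Pageable
    Destination memory kind: Device
    Throughput: 5.26951 GiB/s
    Correlation ID: 207
    Stream: Default stream (7)

Copy time for pinned memory:

    Begins: 0.512981s
    Ends: 0.547286s (+34.305 ms)
    HtoD memcpy 419,430,400 bytes
    Source memory kind: Pinned
    Destination memory kind: Device
    Throughput: 12.2266 GiB/s
    Correlation ID: 209
    Stream: Default stream (7)

Comparing the two runs, the pinned copy is about 2.3x faster, both in throughput (12.2266 vs. 5.26951 GiB/s) and in elapsed time (34.305 ms vs. 79.596 ms).

### Asynchronous memory copy ###

Allocate pinned memory and asynchronously copy 4 MiB of data to two different devices:

    void TestAsyncPinnedMem()
    {
        // Needs at least two CUDA devices (0 and 1).
        const int size = 1024 * 1024;  // 1 Mi ints = 4 MiB
        int *d0, *d1, *h1;
        cudaMallocHost(&h1, sizeof(int) * size);  // pinned source shared by both copies

        cudaSetDevice(0);
        cudaStream_t s0;
        cudaStreamCreate(&s0);
        cudaMalloc(&d0, size * sizeof(int));

        cudaSetDevice(1);
        cudaStream_t s1;
        cudaStreamCreate(&s1);
        cudaMalloc(&d1, size * sizeof(int));

        // Issue both copies back to back, one per device stream.
        cudaMemcpyAsync(d0, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s0);
        cudaMemcpyAsync(d1, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s1);
        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);

        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
        cudaFree(d0);
        cudaFree(d1);
        cudaFreeHost(h1);
    }

The nsys timeline of the two asynchronous copies shows them sharing a region of the time axis, which means the transfers do overlap.
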
![nsys timeline: the two pinned-memory copies overlap][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3lhbjMxNDE1_size_16_color_FFFFFF_t_70_pic_center]

The two copies took:

    Begins: 0.981878s
    Ends: 0.982224s (+346.270 μs)
    HtoD memcpy 4,194,304 bytes
    Source memory kind: Pinned
    Destination memory kind: Device
    Throughput: 12.1128 GiB/s
    Correlation ID: 213
    Stream: Stream 15

    Begins: 0.981892s
    Ends: 0.982237s (+344.959 μs)
    HtoD memcpy 4,194,304 bytes
    Source memory kind: Pinned
    Destination memory kind: Device
    Throughput: 12.1588 GiB/s
    Correlation ID: 214
    Stream: Stream 25

Now allocate pageable memory instead and asynchronously copy 4 MiB of data to two different devices:

    void TestAsyncPagableMem()
    {
        // Needs at least two CUDA devices (0 and 1).
        const int size = 1024 * 1024;  // 1 Mi ints = 4 MiB
        int *d0, *d1, *h1;
        h1 = (int*)malloc(sizeof(int) * size);  // pageable source shared by both copies

        cudaSetDevice(0);
        cudaStream_t s0;
        cudaStreamCreate(&s0);
        cudaMalloc(&d0, size * sizeof(int));

        cudaSetDevice(1);
        cudaStream_t s1;
        cudaStreamCreate(&s1);
        cudaMalloc(&d1, size * sizeof(int));

        // Same call sequence as the pinned test; only the host buffer differs.
        cudaMemcpyAsync(d0, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s0);
        cudaMemcpyAsync(d1, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s1);
        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);

        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
        cudaFree(d0);
        cudaFree(d1);
        free(h1);
    }

The nsys timeline of the two asynchronous calls shows no shared region on the time axis: the copies run serially. This matches the CUDA runtime's documented behavior that cudaMemcpyAsync between pageable host memory and device memory may be synchronous with respect to the host.

![nsys timeline: the two pageable-memory copies run back to back][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3lhbjMxNDE1_size_16_color_FFFFFF_t_70_pic_center 1]

The two copies took the times below. Compared with the pinned copies, they take roughly 1.8x to 2.4x as long. The pinned results also show that overlapping the copies to multiple devices did not lengthen either copy: each still reached about 12.1 GiB/s, close to the 12.2 GiB/s of the single 400 MiB pinned copy.

    Begins: 0.503774s
    Ends: 0.504617s (+843.035 μs)
    HtoD memcpy 4,194,304 bytes
    Source memory kind: Pageable
    Destination memory kind: Device
    Throughput: 4.97524 GiB/s
    Correlation ID: 212
    Stream: Stream 15

    Begins: 0.504744s
    Ends: 0.505368s (+623.935 μs)
    HtoD memcpy 4,194,304 bytes
    Source memory kind: Pageable
    Destination memory kind: Device
    Throughput: 6.72234 GiB/s
    Correlation ID: 213
    Stream: Stream 25

[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3lhbjMxNDE1_size_16_color_FFFFFF_t_70_pic_center]: /images/20221120/12da43099b9d4a4fb2a0fd1341a10842.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3lhbjMxNDE1_size_16_color_FFFFFF_t_70_pic_center 1]: /images/20221120/9a1ea1db0f8b4bb6ac8340f1b7c32d07.png