Pinned Memory: Asynchronous Copies to Multiple Devices

梦里梦外; 2022-12-23 06:56

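Calling cudaMallocHost allocates pinned (page-locked) memory on the host. Compared with pageable memory, pinned memory copies to the device faster. It also allows asynchronous copies to multiple devices to be in flight at the same time, so the transfers overlap.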

### Using cudaMallocHost ###

    cudaMallocHost  // allocate pinned host memory
    cudaFreeHost    // free pinned host memory

### Pinned vs. Pageable Memory Copy Performance ###

Allocate 400 MiB of pinned memory and 400 MiB of pageable memory, copy each buffer to the device, and read the copy times from nsys.  
Implementation:

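    #include <cstdlib>
    #include <cuda_runtime.h>

    void TestHostPinnedMem()
    {
        // Compare host-to-device copies from pageable and pinned memory.
        const int size = 1024 * 1024 * 100;   // 100 Mi ints = 400 MiB

        // Pageable source: ordinary malloc.
        int* h1 = (int*)malloc(size * sizeof(int));
        int* d1;
        cudaMalloc(&d1, size * sizeof(int));
        cudaMemcpy(d1, h1, size * sizeof(int), cudaMemcpyHostToDevice);

        // Pinned source: cudaMallocHost.
        int* h2;
        cudaMallocHost(&h2, size * sizeof(int));
        cudaMemcpy(d1, h2, size * sizeof(int), cudaMemcpyHostToDevice);

        cudaFree(d1);
        free(h1);            // pageable memory is released with free
        cudaFreeHost(h2);    // pinned memory is released with cudaFreeHost
    }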

Copy times are shown below. For pageable memory:

    Begins: 0.291734s
    Ends: 0.37133s (+79.596 ms)
    HtoD memcpy 419,430,400 bytes
    Source memory kind: Pageable
    Destination memory kind: Device
    Throughput: 5.26951 GiB/s
    Correlation ID: 207
    Stream: Default stream (7)

For pinned memory:

    Begins: 0.512981s
    Ends: 0.547286s (+34.305 ms)
    HtoD memcpy 419,430,400 bytes
    Source memory kind: Pinned
    Destination memory kind: Device
    Throughput: 12.2266 GiB/s
    Correlation ID: 209
    Stream: Default stream (7)

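Comparing the two runs, the pinned copy reaches about 2.3 times the throughput (12.23 vs. 5.27 GiB/s as reported by nsys) and finishes in less than half the time (34.3 ms vs. 79.6 ms).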
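
To make the comparison reproducible, here is a small sketch that recomputes throughput and speedup from the byte count and durations in the nsys output above. The function names `ThroughputGBps`, `ThroughputGiBps`, and `Speedup` are hypothetical helpers, not part of CUDA or nsys. One observation from the arithmetic: in this capture, the figure nsys labels GiB/s matches bytes / seconds / 10^9, so it appears to be a decimal rate. The binary (1024^3) rate is shown alongside for contrast.

```cpp
#include <cstdint>

// Throughput in units of 10^9 bytes per second (decimal GB/s).
// Assumption: this is the convention behind the numbers in the capture above.
double ThroughputGBps(std::uint64_t bytes, double seconds)
{
    return static_cast<double>(bytes) / seconds / 1e9;
}

// Throughput in binary units (1 GiB = 1024^3 bytes), for comparison.
double ThroughputGiBps(std::uint64_t bytes, double seconds)
{
    const double kGiB = 1024.0 * 1024.0 * 1024.0;
    return static_cast<double>(bytes) / kGiB / seconds;
}

// How many times faster the shorter duration is than the longer one.
double Speedup(double slow_seconds, double fast_seconds)
{
    return slow_seconds / fast_seconds;
}
```

With the values above, `ThroughputGBps(419430400, 0.079596)` comes out close to the reported 5.27, `ThroughputGBps(419430400, 0.034305)` close to 12.23, and `Speedup(0.079596, 0.034305)` is roughly 2.3. In binary units the pageable copy is nearer 4.9 GiB/s.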

### Asynchronous Memory Copies ###

Allocate pinned memory and asynchronously copy 4 MiB of data to two different devices. The implementation is as follows:

    void TestAsyncPinnedMem()
    {
        const int size = 1024 * 1024;   // 1 Mi ints = 4 MiB
        int *d0, *d1, *h1;
        cudaMallocHost(&h1, sizeof(int) * size);   // pinned host buffer

        // Device 0: its own stream and destination buffer.
        cudaSetDevice(0);
        cudaStream_t s0;
        cudaStreamCreate(&s0);
        cudaMalloc(&d0, size * sizeof(int));

        // Device 1: its own stream and destination buffer.
        cudaSetDevice(1);
        cudaStream_t s1;
        cudaStreamCreate(&s1);
        cudaMalloc(&d1, size * sizeof(int));

        // Issue both copies back to back, then wait for each stream.
        cudaMemcpyAsync(d0, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s0);
        cudaMemcpyAsync(d1, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s1);
        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);

        cudaFree(d0);
        cudaFree(d1);
        cudaFreeHost(h1);
    }

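The nsys timeline for the two asynchronous calls shows that they overlap on the time axis, which confirms that the copies run concurrently.  
![nsys timeline of the two pinned-memory copies][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3lhbjMxNDE1_size_16_color_FFFFFF_t_70_pic_center]  
The execution times of the two copies are: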

    Begins: 0.981878s
    Ends: 0.982224s (+346.270 μs)
    HtoD memcpy 4,194,304 bytes
    Source memory kind: Pinned
    Destination memory kind: Device
    Throughput: 12.1128 GiB/s
    Correlation ID: 213
    Stream: Stream 15

    Begins: 0.981892s
    Ends: 0.982237s (+344.959 μs)
    HtoD memcpy 4,194,304 bytes
    Source memory kind: Pinned
    Destination memory kind: Device
    Throughput: 12.1588 GiB/s
    Correlation ID: 214
    Stream: Stream 25
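
The overlap can also be checked numerically from the Begins/Ends timestamps instead of by reading the timeline. The sketch below is a minimal interval-intersection helper. `Interval` and `OverlapSeconds` are hypothetical names introduced for illustration, and the timestamps are assumed to share one clock, as they do within a single nsys capture.

```cpp
#include <algorithm>

// Start and end timestamps of one copy, in seconds, as shown by nsys.
struct Interval {
    double begin;
    double end;
};

// Length in seconds of the period during which both copies were in flight,
// or 0 if the two intervals do not intersect.
double OverlapSeconds(const Interval& a, const Interval& b)
{
    const double start = std::max(a.begin, b.begin);
    const double stop = std::min(a.end, b.end);
    return std::max(0.0, stop - start);
}
```

For the two pinned copies above, `OverlapSeconds({0.981878, 0.982224}, {0.981892, 0.982237})` is about 332 μs, meaning the copies were concurrent for nearly all of their roughly 345 μs duration.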

Allocate pageable memory and asynchronously copy 4 MiB of data to two different devices. The implementation is as follows:

    void TestAsyncPagableMem()
    {
        const int size = 1024 * 1024;   // 1 Mi ints = 4 MiB
        int *d0, *d1, *h1;
        h1 = (int*)malloc(sizeof(int) * size);   // pageable host buffer

        // Device 0: its own stream and destination buffer.
        cudaSetDevice(0);
        cudaStream_t s0;
        cudaStreamCreate(&s0);
        cudaMalloc(&d0, size * sizeof(int));

        // Device 1: its own stream and destination buffer.
        cudaSetDevice(1);
        cudaStream_t s1;
        cudaStreamCreate(&s1);
        cudaMalloc(&d1, size * sizeof(int));

        // Same call sequence as the pinned version; only the host buffer differs.
        cudaMemcpyAsync(d0, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s0);
        cudaMemcpyAsync(d1, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s1);
        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);

        cudaFree(d0);
        cudaFree(d1);
        free(h1);
    }

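The nsys timeline for the two asynchronous calls shows no overlap on the time axis: the copies run serially. This is consistent with pageable host memory having to be staged through an internal pinned buffer before the transfer, which keeps cudaMemcpyAsync from being fully asynchronous.  
![nsys timeline of the two pageable-memory copies][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3lhbjMxNDE1_size_16_color_FFFFFF_t_70_pic_center 1]  
The execution times of the two copies are listed below. Each takes roughly 1.8 to 2.4 times as long as its pinned counterpart (843 μs and 624 μs versus about 345 μs). The overlapped pinned copies each still reached about 12.1 GiB/s, nearly the 12.2 GiB/s of the single pinned copy measured earlier, so overlapping multi-device asynchronous copies from pinned memory did not lengthen the individual copies: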

    Begins: 0.503774s
    Ends: 0.504617s (+843.035 μs)
    HtoD memcpy 4,194,304 bytes
    Source memory kind: Pageable
    Destination memory kind: Device
    Throughput: 4.97524 GiB/s
    Correlation ID: 212
    Stream: Stream 15

    Begins: 0.504744s
    Ends: 0.505368s (+623.935 μs)
    HtoD memcpy 4,194,304 bytes
    Source memory kind: Pageable
    Destination memory kind: Device
    Throughput: 6.72234 GiB/s
    Correlation ID: 213
    Stream: Stream 25
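
To see what the serial execution costs end to end, the wall-clock span of both copies together can be computed from the same timestamps. This is a rough sketch. `Interval`, `WallClockSeconds`, and `AggregateGBps` are hypothetical helper names, and the rate is expressed in units of 10^9 bytes per second.

```cpp
#include <algorithm>
#include <cstdint>

// Start and end timestamps of one copy, in seconds, as shown by nsys.
struct Interval {
    double begin;
    double end;
};

// Time from the start of the earlier copy to the end of the later one.
double WallClockSeconds(const Interval& a, const Interval& b)
{
    return std::max(a.end, b.end) - std::min(a.begin, b.begin);
}

// Total bytes moved across both devices divided by the wall-clock span,
// in units of 10^9 bytes per second.
double AggregateGBps(std::uint64_t total_bytes, double wall_seconds)
{
    return static_cast<double>(total_bytes) / wall_seconds / 1e9;
}
```

With the timestamps in this post, the two pinned copies span about 359 μs in total, while the two pageable copies span about 1,594 μs. For the 8,388,608 bytes moved in each case, that is roughly 23.4 versus 5.3 in units of 10^9 bytes per second, or a bit over four times the aggregate rate for pinned memory.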

[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3lhbjMxNDE1_size_16_color_FFFFFF_t_70_pic_center]: /images/20221120/12da43099b9d4a4fb2a0fd1341a10842.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3lhbjMxNDE1_size_16_color_FFFFFF_t_70_pic_center 1]: /images/20221120/9a1ea1db0f8b4bb6ac8340f1b7c32d07.png