【CUDA Parallel Programming, Part 8】Implementing the K-means Algorithm with CUDA


This article explains how to implement the K-means algorithm from machine learning with the CUDA parallel computing framework. A detailed introduction to the K-means algorithm itself is given here; this article focuses on the parallel implementation.

First, a quick review of the serial K-means procedure:

Pseudocode:


  1. Create k points as the initial centroids (often chosen at random)
  2. While the cluster assignment of any point changes:
  3. For each data point in the data set:
  4. For each centroid:
  5. Compute the distance between the centroid and the data point
  6. Assign the data point to the cluster whose centroid is closest
  7. For each cluster, compute the mean of all points assigned to it and use that mean as the new centroid

Two parts of this procedure can be parallelized:

① Lines 3-4: the computation of the distances from each data point to the multiple centroids can be parallelized.

② Line 5: the computation of the distance from a data point to a given centroid can be parallelized. A serial sketch of these loops is given below for reference.
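
The sketch below is an illustrative rewrite of the serial assignment step, not code from this article (the names assign_serial, best and bestDist are made up). In the CUDA version, the outer loop over data points (pseudocode line 3) is mapped to one GPU thread per point, and the distance of line 5 is computed inside each thread by the device function euclid_dist_2.

    // Illustrative serial assignment step of K-means (not part of the article's code).
    // objects:  numObjs x numCoords data points
    // clusters: numClusters x numCoords centroids
    // membership[i] receives the index of the closest centroid for point i.
    void assign_serial(int numObjs, int numClusters, int numCoords,
                       float **objects, float **clusters, int *membership)
    {
        for (int i = 0; i < numObjs; i++)             // pseudocode line 3
        {
            int best = 0;
            float bestDist = 0.0f;
            for (int k = 0; k < numClusters; k++)     // pseudocode line 4
            {
                float d = 0.0f;
                for (int c = 0; c < numCoords; c++)   // pseudocode line 5
                    d += (objects[i][c] - clusters[k][c]) * (objects[i][c] - clusters[k][c]);
                if (k == 0 || d < bestDist)
                {
                    bestDist = d;
                    best = k;
                }
            }
            membership[i] = best;                     // pseudocode line 6
        }
    }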

The KMEANS class:


    class KMEANS
    {
    private:
        int numClusters;
        int numCoords;
        int numObjs;
        int *membership;     // [numObjs]
        char *filename;
        float **objects;     // [numObjs][numCoords] data objects
        float **clusters;    // [numClusters][numCoords] cluster centers
        float threshold;
        int loop_iterations;
    public:
        KMEANS(int k);
        void file_read(char *fn);
        void file_write();
        void cuda_kmeans();
        inline int nextPowerOfTwo(int n);
        void free_memory();
        virtual ~KMEANS();
    };//KMEANS

Member variables:

numClusters: the number of cluster centers

numCoords: the dimensionality of each data point

numObjs: the number of data points

membership: the cluster assignment of each data point, of size numObjs

filename: the name of the input file

objects: all data points, with dimensions [numObjs][numCoords]

clusters: the cluster centers, with dimensions [numClusters][numCoords]

threshold: the threshold that controls when the iteration stops

loop_iterations: the number of iterations performed

Member functions:

KMEANS(int k): parameterized constructor; initializes the member variables

file_read(char *fn): reads the input file and initializes the objects and membership members

file_write(): writes the results back to the output files

cuda_kmeans(): the entry point of the K-means computation

nextPowerOfTwo(int n): returns the smallest power of two that is greater than or equal to n, e.g. nextPowerOfTwo(100) returns 128 (see the small check after this list)

free_memory(): releases allocated memory

~KMEANS(): destructor
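
The bit-twiddling in nextPowerOfTwo works by smearing the highest set bit of n-1 into every lower bit position and then adding one. A small standalone check (illustrative; this is a free-function copy of the member function, not part of the article's class):

    // Standalone check of the next-power-of-two bit trick.
    #include <cstdio>

    static int nextPowerOfTwo(int n)
    {
        n--;
        n = n >> 1 | n;    // after these five steps, every bit below the
        n = n >> 2 | n;    // highest set bit of the original n-1 is set
        n = n >> 4 | n;
        n = n >> 8 | n;
        n = n >> 16 | n;
        return ++n;        // 0111...1 + 1 = 1000...0, the next power of two
    }

    int main()
    {
        printf("%d %d %d\n", nextPowerOfTwo(1), nextPowerOfTwo(100), nextPowerOfTwo(128));
        // prints: 1 128 128
        return 0;
    }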

The parallel code consists mainly of three functions:

find_nearest_cluster(…)

compute_delta(…)

euclid_dist_2(…)

First, look at euclid_dist_2(…):


    __host__ __device__ inline static
    float euclid_dist_2(int numCoords, int numObjs, int numClusters,
                        float *objects, float *clusters, int objectId, int clusterId)
    {
        int i;
        float ans = 0;
        for (i = 0; i < numCoords; i++)
        {
            ans += (objects[numObjs * i + objectId] - clusters[numClusters * i + clusterId]) *
                   (objects[numObjs * i + objectId] - clusters[numClusters * i + clusterId]);
        }
        return ans;
    }

This function is called by each thread and computes the squared Euclidean distance between the vectors objects[objectId] and clusters[clusterId], i.e. the distance from the objectId-th data point to the clusterId-th cluster center; since every thread handles a different objectId, these distances are computed in parallel across the grid. Both arrays are stored in coordinate-major layout ([numCoords][numObjs] and [numCoords][numClusters]), so consecutive threads read consecutive memory locations.
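
The index arithmetic matches the transposed host arrays built later in cuda_kmeans(): coordinate i of point objectId lives at offset numObjs * i + objectId in the flattened array. A tiny host-side check of that layout (illustrative toy sizes, not part of the article's code):

    // Toy check of the coordinate-major layout used by euclid_dist_2.
    #include <cassert>

    int main()
    {
        const int numCoords = 2, numObjs = 3;
        float flat[numCoords * numObjs] = {
            1.0f, 2.0f, 3.0f,   // coordinate 0 of points 0..2
            4.0f, 5.0f, 6.0f    // coordinate 1 of points 0..2
        };
        // coordinate i of point objectId is flat[numObjs * i + objectId]
        assert(flat[numObjs * 1 + 2] == 6.0f);   // coordinate 1 of point 2
        return 0;
    }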

Next, look at compute_delta(…):


    /*
     * numIntermediates:  the actual number of intermediates
     * numIntermediates2: the next power of two
     */
    __global__ static void compute_delta(int *deviceIntermediates,
                                         int numIntermediates, int numIntermediates2)
    {
        extern __shared__ unsigned int intermediates[];
        intermediates[threadIdx.x] =
            (threadIdx.x < numIntermediates) ? deviceIntermediates[threadIdx.x] : 0;
        __syncthreads();
        // numIntermediates2 *must* be a power of two!
        for (unsigned int s = numIntermediates2 / 2; s > 0; s >>= 1)
        {
            if (threadIdx.x < s)
            {
                intermediates[threadIdx.x] += intermediates[threadIdx.x + s];
            }
            __syncthreads();
        }
        if (threadIdx.x == 0)
        {
            deviceIntermediates[0] = intermediates[0];
        }
    }

This kernel sums, within a single thread block, the per-thread entries of intermediates in shared memory and writes the total to deviceIntermediates[0], which is then copied back to host memory. A more detailed explanation of this pattern can be found here; it is simply an array-summation (reduction) problem, used in this program to add up the per-block counts of membership changes, i.e. the number of points that switched clusters in the current iteration.
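
For clarity, the reduction produces the same value as this serial loop would on the host (an illustrative sketch; sum_intermediates is a made-up name):

    // Serial equivalent of compute_delta's reduction (illustrative, not the article's code).
    // Returns the total number of points that changed cluster in this iteration.
    static int sum_intermediates(const int *intermediates, int numIntermediates)
    {
        int total = 0;
        for (int b = 0; b < numIntermediates; b++)   // one entry per block of find_nearest_cluster
            total += intermediates[b];
        return total;
    }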

Finally, look at find_nearest_cluster(…):


    /*
     * objects:        [numCoords][numObjs]
     * deviceClusters: [numCoords][numClusters]
     * membership:     [numObjs]
     */
    __global__ static void find_nearest_cluster(int numCoords, int numObjs, int numClusters,
                                                float *objects, float *deviceClusters,
                                                int *membership, int *intermediates)
    {
        extern __shared__ char sharedMemory[];
        unsigned char *membershipChanged = (unsigned char *)sharedMemory;
        float *clusters = deviceClusters;
        membershipChanged[threadIdx.x] = 0;
        int objectId = blockDim.x * blockIdx.x + threadIdx.x;
        if (objectId < numObjs)
        {
            int index;
            float dist, min_dist;
            /* find the cluster id that has min distance to object */
            index = 0;
            min_dist = euclid_dist_2(numCoords, numObjs, numClusters, objects, clusters, objectId, 0);
            for (int i = 0; i < numClusters; i++)
            {
                dist = euclid_dist_2(numCoords, numObjs, numClusters, objects, clusters, objectId, i);
                /* no need square root */
                if (dist < min_dist)
                {
                    min_dist = dist;
                    index = i;
                }
            }
            if (membership[objectId] != index)
            {
                membershipChanged[threadIdx.x] = 1;
            }
            // assign the membership to object objectId
            membership[objectId] = index;
            __syncthreads();    // for membershipChanged[]
    #if 1
            // blockDim.x *must* be a power of two!
            for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
            {
                if (threadIdx.x < s)
                {
                    // sum all changed flags; the block's result ends up in membershipChanged[0]
                    membershipChanged[threadIdx.x] += membershipChanged[threadIdx.x + s];
                }
                __syncthreads();
            }
            if (threadIdx.x == 0)
            {
                intermediates[blockIdx.x] = membershipChanged[0];
            }
    #endif
        }
    }//find_nearest_cluster

For data point objectId, this kernel computes the distance to each of the numClusters cluster centers, assigns the point to the nearest one, and updates membership accordingly; each block then reduces its membershipChanged flags in shared memory and writes the number of points that changed cluster to intermediates[blockIdx.x].
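
The kernel is launched with one thread per data point; the block size must be a power of two so the in-block reduction works, and the dynamic shared-memory size provides one unsigned char per thread. A sketch of the launch using the constants defined in cuda_kmeans() below (the helper name launch_assignment_step is made up for illustration):

    // Sketch: launching find_nearest_cluster with one thread per data point.
    static void launch_assignment_step(int numCoords, int numObjs, int numClusters,
                                       float *deviceObjects, float *deviceClusters,
                                       int *deviceMembership, int *deviceIntermediates)
    {
        const unsigned int numThreadsPerClusterBlock = 32;   // must be a power of two
        const unsigned int numClusterBlocks =
            (numObjs + numThreadsPerClusterBlock - 1) / numThreadsPerClusterBlock;
        const unsigned int clusterBlockSharedDataSize =
            numThreadsPerClusterBlock * sizeof(unsigned char);  // one change flag per thread

        find_nearest_cluster<<<numClusterBlocks, numThreadsPerClusterBlock, clusterBlockSharedDataSize>>>
            (numCoords, numObjs, numClusters, deviceObjects, deviceClusters,
             deviceMembership, deviceIntermediates);
        cudaDeviceSynchronize();
    }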

These three functions parallelize the parts of the algorithm that benefit most from it (the assignment step and the counting of membership changes), giving a parallel version of the overall algorithm; the centroid update itself remains on the host.

The complete code follows:

kmeans.h:


    #ifndef _H_KMEANS
    #define _H_KMEANS

    #include <assert.h>

    #define malloc2D(name, xDim, yDim, type) do {                  \
        name = (type **)malloc(xDim * sizeof(type *));             \
        assert(name != NULL);                                      \
        name[0] = (type *)malloc(xDim * yDim * sizeof(type));      \
        assert(name[0] != NULL);                                   \
        for (size_t i = 1; i < xDim; i++)                          \
            name[i] = name[i-1] + yDim;                            \
    } while (0)

    double wtime(void);

    #endif
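
malloc2D allocates an xDim-by-yDim array as one contiguous block: name[0] points to the whole block and the name[i] are row pointers into it, which is why cuda_kmeans() can pass dimObjects[0] directly to cudaMemcpy. A minimal usage sketch (illustrative sizes, not from the article):

    // Minimal usage sketch for malloc2D.
    #include <assert.h>
    #include <stdlib.h>
    #include "kmeans.h"

    int main()
    {
        float **a;
        malloc2D(a, 3, 4, float);        // 3 rows of 4 floats, stored contiguously in a[0]
        a[2][1] = 1.5f;
        assert(a[0][2 * 4 + 1] == 1.5f); // same element through the contiguous block
        free(a[0]);                      // one free for the data block...
        free(a);                         // ...and one for the row-pointer array
        return 0;
    }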

wtime.cu:


    #include <sys/time.h>
    #include <stdio.h>
    #include <stdlib.h>

    double wtime(void)
    {
        double now_time;
        struct timeval etstart;
        struct timezone tzp;

        if (gettimeofday(&etstart, &tzp) == -1)
            perror("Error: calling gettimeofday() not successful.\n");

        now_time = ((double)etstart.tv_sec) +             /* in seconds */
                   ((double)etstart.tv_usec) / 1000000.0; /* in microseconds */
        return now_time;
    }

cuda_kmeans.cu:


    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <assert.h>
    #include <iostream>
    #include <cuda_runtime.h>

    #include "kmeans.h"

    using namespace std;
    const int MAX_CHAR_PER_LINE = 1024;

    class KMEANS
    {
    private:
        int numClusters;
        int numCoords;
        int numObjs;
        int *membership;     // [numObjs]
        char *filename;
        float **objects;     // [numObjs][numCoords] data objects
        float **clusters;    // [numClusters][numCoords] cluster centers
        float threshold;
        int loop_iterations;
    public:
        KMEANS(int k);
        void file_read(char *fn);
        void file_write();
        void cuda_kmeans();
        inline int nextPowerOfTwo(int n);
        void free_memory();
        virtual ~KMEANS();
    };

    KMEANS::~KMEANS()
    {
        delete[] membership;
        free(clusters[0]);
        free(clusters);
        for (int i = 0; i < numObjs; i++)
            delete[] objects[i];
        delete[] objects;
    }
    KMEANS::KMEANS(int k)
    {
        threshold = 0.001;
        numObjs = 0;
        numCoords = 0;
        numClusters = k;
        filename = NULL;
        loop_iterations = 0;
    }

    void KMEANS::file_write()
    {
        FILE *fptr;
        char outFileName[1024];

        // output: the coordinates of the cluster centres
        sprintf(outFileName, "%s.cluster_centres", filename);
        printf("Writing coordinates of K=%d cluster centers to file \"%s\"\n", numClusters, outFileName);
        fptr = fopen(outFileName, "w");
        for (int i = 0; i < numClusters; i++)
        {
            fprintf(fptr, "%d ", i);
            for (int j = 0; j < numCoords; j++)
                fprintf(fptr, "%f ", clusters[i][j]);
            fprintf(fptr, "\n");
        }
        fclose(fptr);

        // output: the closest cluster centre to each of the data points
        sprintf(outFileName, "%s.membership", filename);
        printf("Writing membership of N=%d data objects to file \"%s\"\n", numObjs, outFileName);
        fptr = fopen(outFileName, "w");
        for (int i = 0; i < numObjs; i++)
        {
            fprintf(fptr, "%d %d\n", i, membership[i]);
        }
        fclose(fptr);
    }

    inline int KMEANS::nextPowerOfTwo(int n)
    {
        n--;
        n = n >> 1 | n;
        n = n >> 2 | n;
        n = n >> 4 | n;
        n = n >> 8 | n;
        n = n >> 16 | n;
        //n = n >> 32 | n;    // for 64-bit ints
        return ++n;
    }
    __host__ __device__ inline static
    float euclid_dist_2(int numCoords, int numObjs, int numClusters,
                        float *objects, float *clusters, int objectId, int clusterId)
    {
        int i;
        float ans = 0;
        for (i = 0; i < numCoords; i++)
        {
            ans += (objects[numObjs * i + objectId] - clusters[numClusters * i + clusterId]) *
                   (objects[numObjs * i + objectId] - clusters[numClusters * i + clusterId]);
        }
        return ans;
    }

    /*
     * numIntermediates:  the actual number of intermediates
     * numIntermediates2: the next power of two
     */
    __global__ static void compute_delta(int *deviceIntermediates,
                                         int numIntermediates, int numIntermediates2)
    {
        extern __shared__ unsigned int intermediates[];
        intermediates[threadIdx.x] =
            (threadIdx.x < numIntermediates) ? deviceIntermediates[threadIdx.x] : 0;
        __syncthreads();
        // numIntermediates2 *must* be a power of two!
        for (unsigned int s = numIntermediates2 / 2; s > 0; s >>= 1)
        {
            if (threadIdx.x < s)
            {
                intermediates[threadIdx.x] += intermediates[threadIdx.x + s];
            }
            __syncthreads();
        }
        if (threadIdx.x == 0)
        {
            deviceIntermediates[0] = intermediates[0];
        }
    }

    /*
     * objects:        [numCoords][numObjs]
     * deviceClusters: [numCoords][numClusters]
     * membership:     [numObjs]
     */
    __global__ static void find_nearest_cluster(int numCoords, int numObjs, int numClusters,
                                                float *objects, float *deviceClusters,
                                                int *membership, int *intermediates)
    {
        extern __shared__ char sharedMemory[];
        unsigned char *membershipChanged = (unsigned char *)sharedMemory;
        float *clusters = deviceClusters;
        membershipChanged[threadIdx.x] = 0;
        int objectId = blockDim.x * blockIdx.x + threadIdx.x;
        if (objectId < numObjs)
        {
            int index;
            float dist, min_dist;
            /* find the cluster id that has min distance to object */
            index = 0;
            min_dist = euclid_dist_2(numCoords, numObjs, numClusters, objects, clusters, objectId, 0);
            for (int i = 0; i < numClusters; i++)
            {
                dist = euclid_dist_2(numCoords, numObjs, numClusters, objects, clusters, objectId, i);
                /* no need square root */
                if (dist < min_dist)
                {
                    min_dist = dist;
                    index = i;
                }
            }
            if (membership[objectId] != index)
            {
                membershipChanged[threadIdx.x] = 1;
            }
            // assign the membership to object objectId
            membership[objectId] = index;
            __syncthreads();    // for membershipChanged[]
    #if 1
            // blockDim.x *must* be a power of two!
            for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
            {
                if (threadIdx.x < s)
                {
                    // sum all changed flags; the block's result ends up in membershipChanged[0]
                    membershipChanged[threadIdx.x] += membershipChanged[threadIdx.x + s];
                }
                __syncthreads();
            }
            if (threadIdx.x == 0)
            {
                intermediates[blockIdx.x] = membershipChanged[0];
            }
    #endif
        }
    }//find_nearest_cluster
    void KMEANS::cuda_kmeans()
    {
        int index, loop = 0;
        int *newClusterSize;     // [numClusters]: no. objects assigned in each new cluster
        float delta;             // fraction of objects that change their clusters
        float **dimObjects;      // [numCoords][numObjs]
        float **dimClusters;     // [numCoords][numClusters]
        float **newClusters;     // [numCoords][numClusters]
        float *deviceObjects;    // [numCoords][numObjs]
        float *deviceClusters;   // [numCoords][numClusters]
        int *deviceMembership;
        int *deviceIntermediates;

        // Copy objects given in [numObjs][numCoords] layout to new [numCoords][numObjs] layout
        malloc2D(dimObjects, numCoords, numObjs, float);
        for (int i = 0; i < numCoords; i++)
        {
            for (int j = 0; j < numObjs; j++)
            {
                dimObjects[i][j] = objects[j][i];
            }
        }

        // pick first numClusters elements of objects[] as initial cluster centers
        malloc2D(dimClusters, numCoords, numClusters, float);
        for (int i = 0; i < numCoords; i++)
        {
            for (int j = 0; j < numClusters; j++)
            {
                dimClusters[i][j] = dimObjects[i][j];
            }
        }

        newClusterSize = new int[numClusters];
        assert(newClusterSize != NULL);
        memset(newClusterSize, 0, numClusters * sizeof(int));
        malloc2D(newClusters, numCoords, numClusters, float);
        memset(newClusters[0], 0, numCoords * numClusters * sizeof(float));

        // To support reduction, numThreadsPerClusterBlock *must* be a power of two, and it
        // *must* be no larger than the number of bits that will fit into an unsigned char,
        // the type used to keep track of membership changes in the kernel.
        const unsigned int numThreadsPerClusterBlock = 32;
        const unsigned int numClusterBlocks =
            (numObjs + numThreadsPerClusterBlock - 1) / numThreadsPerClusterBlock;
        const unsigned int numReductionThreads = nextPowerOfTwo(numClusterBlocks);
        const unsigned int clusterBlockSharedDataSize =
            numThreadsPerClusterBlock * sizeof(unsigned char);
        const unsigned int reductionBlockSharedDataSize =
            numReductionThreads * sizeof(unsigned int);

        cudaMalloc(&deviceObjects, numObjs * numCoords * sizeof(float));
        cudaMalloc(&deviceClusters, numClusters * numCoords * sizeof(float));
        cudaMalloc(&deviceMembership, numObjs * sizeof(int));
        cudaMalloc(&deviceIntermediates, numReductionThreads * sizeof(unsigned int));

        cudaMemcpy(deviceObjects, dimObjects[0], numObjs * numCoords * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(deviceMembership, membership, numObjs * sizeof(int), cudaMemcpyHostToDevice);

        do
        {
            cudaMemcpy(deviceClusters, dimClusters[0], numClusters * numCoords * sizeof(float), cudaMemcpyHostToDevice);

            find_nearest_cluster<<<numClusterBlocks, numThreadsPerClusterBlock, clusterBlockSharedDataSize>>>
                (numCoords, numObjs, numClusters, deviceObjects, deviceClusters,
                 deviceMembership, deviceIntermediates);
            cudaDeviceSynchronize();

            compute_delta<<<1, numReductionThreads, reductionBlockSharedDataSize>>>
                (deviceIntermediates, numClusterBlocks, numReductionThreads);
            cudaDeviceSynchronize();

            int d;
            cudaMemcpy(&d, deviceIntermediates, sizeof(int), cudaMemcpyDeviceToHost);
            delta = (float)d;

            cudaMemcpy(membership, deviceMembership, numObjs * sizeof(int), cudaMemcpyDeviceToHost);

            for (int i = 0; i < numObjs; i++)
            {
                // index of the nearest cluster center for object i
                index = membership[i];
                // update new cluster centers: sum of objects located within
                newClusterSize[index]++;
                for (int j = 0; j < numCoords; j++)
                {
                    newClusters[j][index] += objects[i][j];
                }
            }

            // average the sum and replace old cluster centers with newClusters
            for (int i = 0; i < numClusters; i++)
            {
                for (int j = 0; j < numCoords; j++)
                {
                    if (newClusterSize[i] > 0)
                        dimClusters[j][i] = newClusters[j][i] / newClusterSize[i];
                    newClusters[j][i] = 0.0;   // set back to 0
                }
                newClusterSize[i] = 0;         // set back to 0
            }
            delta /= numObjs;
        } while (delta > threshold && loop++ < 500);

        loop_iterations = loop + 1;

        malloc2D(clusters, numClusters, numCoords, float);
        for (int i = 0; i < numClusters; i++)
        {
            for (int j = 0; j < numCoords; j++)
            {
                clusters[i][j] = dimClusters[j][i];
            }
        }

        cudaFree(deviceObjects);
        cudaFree(deviceClusters);
        cudaFree(deviceMembership);
        cudaFree(deviceIntermediates);

        free(dimObjects[0]);
        free(dimObjects);
        free(dimClusters[0]);
        free(dimClusters);
        free(newClusters[0]);
        free(newClusters);
        delete[] newClusterSize;
    }
    void KMEANS::file_read(char *fn)
    {
        FILE *infile;
        char *line = new char[MAX_CHAR_PER_LINE];
        int lineLen = MAX_CHAR_PER_LINE;

        filename = fn;
        infile = fopen(filename, "r");
        assert(infile != NULL);

        /* find the number of objects */
        while (fgets(line, lineLen, infile))
        {
            numObjs++;
        }

        /* find the dimension of each object */
        rewind(infile);
        while (fgets(line, lineLen, infile) != NULL)
        {
            if (strtok(line, " \t\n") != 0)
            {
                while (strtok(NULL, " \t\n"))
                    numCoords++;
                break;
            }
        }

        /* allocate space for objects[][] and read all objects */
        rewind(infile);
        objects = new float*[numObjs];
        for (int i = 0; i < numObjs; i++)
        {
            objects[i] = new float[numCoords];
        }

        int i = 0;
        /* read all objects */
        while (fgets(line, lineLen, infile) != NULL)
        {
            if (strtok(line, " \t\n") == NULL) continue;
            for (int j = 0; j < numCoords; j++)
            {
                objects[i][j] = atof(strtok(NULL, " ,\t\n"));
            }
            i++;
        }
        fclose(infile);
        delete[] line;

        /* membership: the cluster id for each data object */
        membership = new int[numObjs];
        assert(membership != NULL);
        for (int i = 0; i < numObjs; i++)
            membership[i] = -1;
    }
    int main(int argc, char *argv[])
    {
        if (argc < 3)
        {
            fprintf(stderr, "usage: %s <numClusters> <input file>\n", argv[0]);
            return 1;
        }
        KMEANS kmeans(atoi(argv[1]));
        kmeans.file_read(argv[2]);
        kmeans.cuda_kmeans();
        kmeans.file_write();
        return 0;
    }

makefile:


    target:
    	nvcc cuda_kmeans.cu
    	./a.out 4 ./Image_data/color100.txt

All of the code and data files are available here: http://yunpan.cn/cKBZMPAJ8tcAs (access code: 9476)

Running the code:

[Screenshot of the program's output in the original post]

The CUDA implementation of K-means is relatively involved and may take some effort to read through; if you have questions, please leave a comment.

Author:忆之独秀

Email:leaguenew@qq.com

Original source: http://blog.csdn.net/lavorange/article/details/41942323
