Weekly Report: K-means • Ph.D. Student: Leo Lee • Date: Nov. 13, 2009
Outline • K-means • CPU-based algorithm workflow; • Reading Kaiyong’s code; • Some naïve thoughts; • Work plan
K-means • CPU-based algorithm workflow (flowchart): with N data points and K centers of dimension dim, each iteration computes D[N][K] (all point-to-center distances), then MinD[N] (each point's nearest center), then NewCenter[K]; if NewCenter == center the loop stops, otherwise it repeats.
K-means • Pseudocode:
while(!bFlag && ++i <= nIterationsTime)
{
    ComputeDis(&dis, data, centers);      // distance of every point to every center
    FindMinDis(dis, &index);              // nearest center for each point
    ComputeNewCen(&newCen, data, index);  // mean of the points assigned to each center
    if(newCen - centers < b)              // centers moved less than threshold b
        bFlag = true;
    else
        centers = newCen;
}
K-means • Since each iteration relies on the previous one, we run the iterations sequentially but parallelize each function inside an iteration (a sequential reference sketch follows this list): • Compute the distances; • Find the nearest center; • Compute the new centers;
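For concreteness, here is a minimal single-threaded reference of the loop above; it is a sketch, not the project's actual host code. The row-major layouts, the squared-Euclidean distance, and the max-coordinate-shift convergence threshold eps are assumptions for illustration. Each of the three inner steps is exactly what the GPU version parallelizes.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

void kmeans_cpu(const std::vector<float>& data,   // N x dim, row-major
                std::vector<float>& centers,      // K x dim, row-major
                int N, int K, int dim,
                int maxIter, float eps)
{
    std::vector<int> index(N);
    std::vector<float> newCen(K * dim, 0.0f);
    std::vector<int> count(K, 0);

    for (int it = 0; it < maxIter; it++) {
        // steps 1+2: distance to every center, keep the nearest (ComputeDis + FindMinDis)
        for (int i = 0; i < N; i++) {
            float best = INFINITY;
            int bestK = 0;
            for (int c = 0; c < K; c++) {
                float d = 0.0f;
                for (int j = 0; j < dim; j++) {
                    float t = data[i * dim + j] - centers[c * dim + j];
                    d += t * t;
                }
                if (d < best) { best = d; bestK = c; }
            }
            index[i] = bestK;
        }
        // step 3: new centers = mean of the points assigned to each center (ComputeNewCen)
        std::fill(newCen.begin(), newCen.end(), 0.0f);
        std::fill(count.begin(), count.end(), 0);
        for (int i = 0; i < N; i++) {
            count[index[i]]++;
            for (int j = 0; j < dim; j++)
                newCen[index[i] * dim + j] += data[i * dim + j];
        }
        for (int c = 0; c < K; c++)
            if (count[c] > 0)
                for (int j = 0; j < dim; j++)
                    newCen[c * dim + j] /= count[c];

        // convergence test: largest coordinate shift of any center below eps
        float shift = 0.0f;
        for (std::size_t t = 0; t < centers.size(); t++)
            shift = std::max(shift, std::fabs(newCen[t] - centers[t]));
        centers = newCen;
        if (shift < eps) break;
    }
}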
K-means-compute the distance • Data[N][Dim], Centers[Dim][K] • Dis[N][K] • Nearly the same as matrix multiplication • Only replace A[i][k]*B[k][j] with (A[i][k]-B[k][j])² • Using the so-called tiles to increase the compute-to-memory-access ratio • (diagram: Data × Centers → Distances)
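Before the tiled kernel on the next slides, a naive untiled baseline may make the analogy concrete. This is a sketch, not Kaiyong's kernel: one thread per (point, center) pair, with every thread re-reading Data and Centers from global memory; that redundant traffic is exactly what the shared-memory tiles amortize.

__global__ void NaiveDistance(const float* Data,     // N x Dim, row-major
                              const float* Centers,  // Dim x K, row-major
                              float* Dis,            // N x K, row-major
                              int N, int K, int Dim)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // center index
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // data-point index
    if (i >= N || j >= K) return;

    float acc = 0.0f;
    for (int d = 0; d < Dim; d++) {
        float diff = Data[i * Dim + d] - Centers[d * K + j];
        acc += diff * diff;                          // (A[i][d] - B[d][j])^2
    }
    Dis[i * K + j] = acc;
}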
K-means-compute the distance • From Kaiyong's code:
dim3 threads(16, 2, 1);
dim3 grid(k/32, n/32, 1);
ComputeDistance<32,32><<<grid,threads>>>(…);

template <unsigned int B_WIDTH, unsigned int C_HIGH>
__global__ void ComputeDistance(…)
{
}
K-means-compute the distance
float* indexQ = Query + threadIdx.x + (blockIdx.y*C_HIGH + threadIdx.y) * dim;      // this thread's starting row in the query (data) matrix
float* indexR = Ref + blockIdx.x*B_WIDTH + threadIdx.y * blockDim.x + threadIdx.x;  // starting column in the reference (centers) matrix
float* indexC = C + blockIdx.x * C_HIGH + threadIdx.y * blockDim.x + threadIdx.x + blockIdx.y *C_HIGH* wB;  // output position in the distance matrix
K-means-compute the distance
__shared__ float as[16][C_HIGH+1];
do
{
    for(int i = 0; i < C_HIGH; i += 2)
        as[threadIdx.x][threadIdx.y + i] = indexQ[i*dim];
    indexQ += 16;
    __syncthreads();
    for(int i = 0; i < 16; i++, indexR += wB)
        for(int j = 0; j < C_HIGH; j++)
        {
            c_temp = indexR[0] - as[i][j];
            c[j] += c_temp * c_temp;
        }
    __syncthreads();
} while(indexQ < Alast);
K-means-compute the distance
// write the C_HIGH accumulated distances of this thread back to global memory
for(int i = 0; i < C_HIGH; i++, indexC += wB)
{
    indexC[0] = c[i];
}
K-means - compute the distance • Questions • Why template <unsigned int B_WIDTH, unsigned int C_HIGH> instead of ordinary parameters? (A sketch of one possible answer follows.) • Why load the sub-matrix in that way? Something to do with the warp? If we use a 16×16 tile instead of 32×32, should the load method change? • This algorithm is nearly the same as the so-called most efficient matrix multiplication: thread(16, 4), grid(wc/4, hc/16)
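A guess at the first question: template arguments are compile-time constants, so the compiler can size shared memory statically and fully unroll the inner loops, neither of which is possible with ordinary runtime parameters. A toy illustration (not the actual kernel):

template <unsigned int TILE>
__global__ void TiledSum(const float* in, float* out)
{
    __shared__ float tile[TILE];       // legal only because TILE is a compile-time constant
    tile[threadIdx.x] = in[blockIdx.x * TILE + threadIdx.x];
    __syncthreads();

    float acc = 0.0f;
    #pragma unroll                     // trip count known at compile time -> fully unrolled
    for (unsigned int i = 0; i < TILE; i++)
        acc += tile[i];
    out[blockIdx.x * TILE + threadIdx.x] = acc;
}
// launched with the constant baked in, e.g. TiledSum<32><<<grid, 32>>>(d_in, d_out);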
Compute the distance • Very useful in data mining • K-means; • K-nn; • Hierarchical clustering; • …
K-means-Find the nearest center • (diagram: the N×K distance matrix; each of the N rows of length K is reduced to its minimum) • N reductions • Sum, max, min… • Sequential addressing • Completely unroll • n/log n threads, log n steps
K-means-Find the nearest center
dim3 threads_find(16, 1, 1);
dim3 grid_find(1, data_height, 1);

template <unsigned int blockSize>
__global__ void cpu_FindSmallDistance(float* Dist, int* D_index, int k)
{
    __shared__ float sdata[blockSize];
    __shared__ int d_index[blockSize];
    // perform first level of reduction, reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockSize + threadIdx.x;
    float* p_data = Dist + blockIdx.y*k;
    sdata[tid] = p_data[tid];
    d_index[tid] = tid;
    if (i < k)
        if( sdata[tid] > p_data[i] ) {
            sdata[tid] = p_data[i];
            d_index[tid] = i;
        }
    EMUSYNC;
K-means-Find the nearest center
    if( sdata[tid] > sdata[tid + 8] ) { sdata[tid] = sdata[tid + 8]; d_index[tid] = d_index[tid+8]; } EMUSYNC;
    if( sdata[tid] > sdata[tid + 4] ) { sdata[tid] = sdata[tid + 4]; d_index[tid] = d_index[tid+4]; } EMUSYNC;
    if( sdata[tid] > sdata[tid + 2] ) { sdata[tid] = sdata[tid + 2]; d_index[tid] = d_index[tid+2]; } EMUSYNC;
    if( sdata[tid] > sdata[tid + 1] ) { d_index[tid] = d_index[tid+1]; } EMUSYNC;   // last step only needs the index
    // write result for this block to global mem
    if (tid == 0) D_index[blockIdx.y] = d_index[0];
}
• Since K is presumed to be at most 32, this implementation is optimal.
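If K were allowed to exceed one unrolled block, the general sequential-addressing pattern from the earlier bullet points would be needed. A hedged sketch, not Kaiyong's code: one block per row of the distance matrix, blockSize a power of two.

#include <cfloat>

template <unsigned int blockSize>
__global__ void MinReduceRow(const float* Dist, int* D_index, int k)
{
    __shared__ float sval[blockSize];
    __shared__ int   sidx[blockSize];
    unsigned int tid = threadIdx.x;
    const float* row = Dist + blockIdx.x * k;

    // strided load: each thread scans its slice of the row first
    float best = FLT_MAX;
    int bestI = 0;
    for (unsigned int i = tid; i < (unsigned int)k; i += blockSize)
        if (row[i] < best) { best = row[i]; bestI = (int)i; }
    sval[tid] = best;
    sidx[tid] = bestI;
    __syncthreads();

    // sequential addressing: the active half shrinks each step, no bank conflicts
    for (unsigned int s = blockSize / 2; s > 0; s >>= 1) {
        if (tid < s && sval[tid + s] < sval[tid]) {
            sval[tid] = sval[tid + s];
            sidx[tid] = sidx[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) D_index[blockIdx.x] = sidx[0];
}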
K-means-Compute new centers • CPU-based algorithm: for each data point i, C[ Index[i] ] += Data[i] • Not direct addressing (a scatter driven by Index) • Not as beautiful as matrix multiplication or reduction • // sum the points nearest to each center; split the data into 100 groups of 512 points each • dim3 threads_collect(32,1,1); • dim3 grid_collect(100, 1, 1); • (diagram: Index[N] against the N×dim data)
K-means-Compute new centers
__shared__ int index[32];
__shared__ float as[32*34];
int idx = threadIdx.x;
float* p_Data = Data + blockIdx.x * GroupSize*Dim;
int* p_index = D_index + blockIdx.x * GroupSize;
int* p_index_last = p_index + GroupSize;
float* p_centro = Centro + blockIdx.x * k*Dim;
int* p_counter = Counter + blockIdx.x*32;
int index_i = 0;
int centro_count = 0;
// initialize the shared-memory centroids
if(idx < k)
    for( int i = 0; i < Dim; i++)
        as[i*k+idx] = 0;
EMUSYNC;
K-means-Compute new centers
// fetch 32 indices per round, so with 512 data points there are 512/32 = 16 rounds
for(; p_index < p_index_last; p_index += blockDim.x)
{
    // load 32 indices into shared memory so the reads are coalesced
    index[idx] = p_index[idx];
    EMUSYNC;
    // loop 32 times, processing one data point per step
    for(int i = 0; i < 32; i++, p_Data += Dim)
    {
        index_i = index[i];
        // each thread owns one centroid, so increment its counter when a matching data point comes up
        if(idx == index_i)
            centro_count++;
        for(int j = idx; j < Dim; j += blockDim.x)
        {
            as[j*k+index_i] += p_Data[j];
        }
        EMUSYNC;
    }
}
K-means-Compute new centers
// only threads within the centroid range write back; each thread is responsible for one centroid
if(idx < k)
{
    p_counter[idx] = centro_count;
    for(int j = 0; j < Dim; j++, p_centro += k)
    {
        p_centro[idx] = as[j*k+idx];
    }
}
• Now we have 100 Dim×K matrices • A matrix-add reduction, where each element is a matrix • Kaiyong's code reduces the 100 to 10, then gets the final result (a sketch follows).
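A hedged sketch of that last step, collapsing the two-stage 100 -> 10 -> final reduction into a single pass for clarity; the G partial matrices are assumed to use the kernel's as[j*k+centro] layout, i.e. Dim×K row-major with one column per center.

__global__ void ReducePartialCenters(const float* partial,   // G matrices, each Dim x K
                                     const int* counters,    // G x K per-center point counts
                                     float* centers,         // Dim x K result
                                     int G, int K, int Dim)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one (dim, center) slot
    if (idx >= Dim * K) return;
    int c = idx % K;                                  // which center this slot belongs to

    float sum = 0.0f;
    int cnt = 0;
    for (int g = 0; g < G; g++) {
        sum += partial[g * Dim * K + idx];            // matrix-add reduction
        cnt += counters[g * K + c];
    }
    if (cnt > 0)
        centers[idx] = sum / cnt;                     // divide the sums to get the means
}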
Work plan - K-means • Test the program • Each function, GPU vs. CPU; • Compare with other papers.
Work plan • K-means; • Learn data mining and prepare for the final exam; • Continue reading parallel computing books.