Clustering Very Large Multi-dimensional Datasets with MapReduce 蔡跳
INTRODUCTION • Large datasets of moderate-to-high dimensional elements • Serial subspace clustering algorithms cannot handle data at this scale • Data volumes reach TB to PB • e.g., Twitter crawl: > 12 TB; Yahoo! operational data: 5 PB • Approach: take a fast, scalable serial algorithm and make it run efficiently in parallel
INTRODUCTION • Bottlenecks: I/O and network • Best of both Worlds (BoW): automatically spots the bottleneck and picks a good strategy • Uses serial clustering methods as a plugged-in clustering subroutine
RELATED WORK • MapReduce: a simplified distributed programming model for parallel computation over large-scale datasets • mapper, reducer • map stage: reads the input file and outputs (key, value) pairs • shuffle stage: transfers the mappers' output to the reducers based on the key • reduce stage: processes the received pairs and outputs the final result
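The three stages above can be mimicked in a single process. The following is a minimal sketch, not Hadoop code: the function names (`map_stage`, `shuffle_stage`, `reduce_stage`, `run_job`) and the word-count job are illustrative assumptions chosen only to show how (key, value) pairs flow from mappers to reducers.

```python
from collections import defaultdict


def map_stage(records, mapper):
    """Apply the mapper to every input record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_stage(pairs):
    """Group values by key, standing in for the mapper-to-reducer transfer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_stage(groups, reducer):
    """Run the reducer once per key on all values received for that key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Chain the three stages: map, then shuffle, then reduce."""
    return reduce_stage(shuffle_stage(map_stage(records, mapper)), reducer)


# Hypothetical word-count job used only as a demonstration.
def word_mapper(line):
    return [(word, 1) for word in line.split()]


def count_reducer(key, values):
    return sum(values)


counts = run_job(["a b a", "b c"], word_mapper, count_reducer)
print(counts)
```

In a real cluster the shuffle stage moves data across the network, which is why it shows up as a separate term when costs are estimated.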
BoW • ParC: partition the data, then merge the results • SnI: sample first, sacrificing I/O to reduce network cost • trade-off
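ParC (Parallel Clustering) • Partition the data and distribute it to different machines • Each machine clusters the data assigned to it; the resulting clusters are called β-clusters • Merge the β-clusters to obtain the final clusters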
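The three ParC steps can be sketched on one machine as follows. This is an illustration under stated assumptions, not the authors' implementation: β-clusters are represented here as axis-aligned bounding boxes, partitioning is round-robin, overlapping boxes are merged, and `toy_serial_cluster` is a hypothetical stand-in for whatever serial clustering method is plugged in.

```python
def partition(points, num_parts):
    """Split the data into num_parts chunks (round-robin, as an assumption)."""
    parts = [[] for _ in range(num_parts)]
    for i, p in enumerate(points):
        parts[i % num_parts].append(p)
    return parts


def bounding_box(points):
    """Return the (low corner, high corner) box enclosing the points."""
    dims = range(len(points[0]))
    lo = tuple(min(p[d] for p in points) for d in dims)
    hi = tuple(max(p[d] for p in points) for d in dims)
    return (lo, hi)


def toy_serial_cluster(points, gap=2.5):
    """Hypothetical plug-in: split sorted points where the first coordinate jumps by more than gap."""
    if not points:
        return []
    ordered = sorted(points)
    groups = [[ordered[0]]]
    for p in ordered[1:]:
        if p[0] - groups[-1][-1][0] > gap:
            groups.append([p])
        else:
            groups[-1].append(p)
    return [bounding_box(g) for g in groups]


def overlaps(a, b):
    """True if two boxes intersect in every dimension."""
    (alo, ahi), (blo, bhi) = a, b
    return all(alo[d] <= bhi[d] and blo[d] <= ahi[d] for d in range(len(alo)))


def merge_boxes(boxes):
    """Repeatedly fuse any two overlapping boxes until none overlap."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlaps(merged[i], merged[j]):
                    (alo, ahi), (blo, bhi) = merged[i], merged[j]
                    lo = tuple(min(x, y) for x, y in zip(alo, blo))
                    hi = tuple(max(x, y) for x, y in zip(ahi, bhi))
                    merged[i] = (lo, hi)
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged


def parc(points, num_parts, cluster_fn=toy_serial_cluster):
    """Partition, cluster each part into beta-clusters, then merge them."""
    beta_clusters = []
    for part in partition(points, num_parts):
        beta_clusters.extend(cluster_fn(part))
    return merge_boxes(beta_clusters)


points = [(0, 0), (1, 1), (2, 2), (3, 3), (10, 10), (11, 11), (12, 12), (13, 13)]
final_clusters = parc(points, num_parts=2)
print(final_clusters)
```

Each partition sees only half of each group, so it reports a partial β-cluster; the merge step is what recovers the two full clusters.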
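SnI (Sample and Ignore) • Sample the data and cluster the sample to obtain clusters • Ignore the data that fall inside the space of those clusters • Apply ParC to the remaining data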
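A minimal sketch of the Sample-and-Ignore flow, under assumptions: clusters are again represented as axis-aligned boxes, the sampling rate and seed are illustrative parameters, and `cluster_fn` and `parc_fn` are hypothetical callables supplied by the caller (here a trivial `one_box` stand-in plays both roles).

```python
import random


def inside(point, box):
    """True if the point lies within the box in every dimension."""
    lo, hi = box
    return all(lo[d] <= point[d] <= hi[d] for d in range(len(point)))


def sample_and_ignore(points, sample_rate, cluster_fn, parc_fn, seed=0):
    """Cluster a sample, drop points the sample clusters already cover, run ParC on the rest."""
    rng = random.Random(seed)
    sample = [p for p in points if rng.random() < sample_rate]
    sample_clusters = cluster_fn(sample)
    # Points already explained by a sample cluster are ignored (not shipped on).
    remaining = [p for p in points
                 if not any(inside(p, box) for box in sample_clusters)]
    remaining_clusters = parc_fn(remaining)
    return sample_clusters + remaining_clusters, len(remaining)


def one_box(points):
    """Hypothetical stand-in clusterer: one bounding box around all points."""
    if not points:
        return []
    dims = range(len(points[0]))
    lo = tuple(min(p[d] for p in points) for d in dims)
    hi = tuple(max(p[d] for p in points) for d in dims)
    return [(lo, hi)]


points = [(0, 0), (1, 1), (2, 2), (3, 3), (10, 10), (11, 11)]
clusters, shipped = sample_and_ignore(points, 0.5, one_box, one_box)
print(clusters, shipped)
```

The second return value counts the points that still have to be passed to ParC; the extra read of the data for sampling is the I/O price paid for shrinking that number.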
COST-BASED OPTIMIZATION • ParC Cost: • Map Cost: • Shuffle Cost: • Reduce Cost:
BoW • compute the ParC cost -> costC • compute the SnI cost -> costCs • if costC > costCs then clusters = result of SnI • else clusters = result of ParC
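The selection rule above amounts to a single comparison. In this sketch the cost estimators and the two strategies are passed in as hypothetical callables, since the cost formulas themselves are not reproduced in this transcript; the constant costs in the demo are made-up numbers for illustration only.

```python
def bow(data, cost_parc, cost_sni, run_parc, run_sni):
    """Pick the cheaper strategy according to the estimated costs and run it."""
    cost_c = cost_parc(data)    # estimated cost of ParC (costC)
    cost_cs = cost_sni(data)    # estimated cost of SnI (costCs)
    if cost_c > cost_cs:
        return "SnI", run_sni(data)
    return "ParC", run_parc(data)


# Demo with made-up constant costs and placeholder strategies.
chosen, clusters = bow(
    data=[1, 2, 3],
    cost_parc=lambda d: 10.0,
    cost_sni=lambda d: 4.0,
    run_parc=lambda d: "clusters from ParC",
    run_sni=lambda d: "clusters from SnI",
)
print(chosen, clusters)
```

On a tie the sketch falls through to ParC, matching the `else` branch of the rule.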
EXPERIMENTAL RESULTS • Implemented with Hadoop • M45: 1.5 PB storage, 1 TB memory • DISC/Cloud: 512 cores, 64 machines, 1 TB RAM, 256 TB disk storage
Quality of results • Average precision and recall of the clustering • Synthetic data
Scale-up results • Increasing the number of reducers
Scale-up results • Increasing the data size, with r = 128, m = 700
Thanks for your time