Clustering Very Large Multi-dimensional Datasets with MapReduce 蔡跳
INTRODUCTION • Large datasets of moderate-to-high dimensional elements • Serial subspace clustering algorithms do not scale to such data • TB- and PB-scale data; e.g., Twitter crawl: > 12 TB, Yahoo! operational data: 5 PB • Approach: take a fast, scalable serial algorithm and make it run efficiently in parallel
INTRODUCTION • Bottlenecks: I/O and network • Best of both Worlds (BoW) automatically spots the bottleneck and picks a good strategy • Any serial clustering method can be used as a plugged-in clustering subroutine
RELATED WORK • MapReduce: a simplified distributed programming model for parallel computation over large datasets • Two kinds of workers: mappers and reducers • Map stage: reads the input file and outputs (key, value) pairs • Shuffle stage: transfers the mappers' output to the reducers, grouped by key • Reduce stage: processes the received pairs and outputs the final result
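To make the three stages concrete, here is a minimal pure-Python simulation of the map/shuffle/reduce data flow, using word counting as the example job. It mimics the stages on a single machine rather than running on Hadoop.

```python
# A minimal, pure-Python simulation of the three MapReduce stages
# (map, shuffle, reduce), using word counting as the example job.
# This illustrates the data flow only; a real job would run on Hadoop.
from collections import defaultdict

def map_stage(lines):
    """Map: read input records and emit (key, value) pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_stage(pairs):
    """Shuffle: group the mappers' output by key for the reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_stage(groups):
    """Reduce: process each key's values and emit the final result."""
    return {key: sum(values) for key, values in groups.items()}

if __name__ == "__main__":
    input_file = ["the quick brown fox", "the lazy dog"]
    print(reduce_stage(shuffle_stage(map_stage(input_file))))
    # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```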
BoW • ParC: partitions the data, clusters each partition, and merges the results • SnI: samples first, trading extra I/O for a lower network cost • BoW chooses between them based on this trade-off
ParC -- Parallel Clustering • Partition the data and distribute the partitions across machines • Each machine clusters its own partition; the resulting clusters are called β-clusters • Merge the β-clusters to obtain the final clusters (sketched below)
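A hedged, single-machine sketch of ParC's three steps. The plug-in serial clusterer (scikit-learn's DBSCAN here), the random partitioning, and the bounding-box merge rule are illustrative assumptions standing in for the paper's exact choices.

```python
import numpy as np
from sklearn.cluster import DBSCAN  # stand-in for the plugged-in serial algorithm

def parc(data, num_partitions):
    # Step 1: partition the data (random partitioning, for illustration).
    partitions = np.array_split(np.random.permutation(data), num_partitions)

    # Step 2: each "machine" clusters its partition, yielding beta-clusters.
    beta_clusters = []
    for part in partitions:
        labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(part)
        for lab in set(labels) - {-1}:  # -1 marks DBSCAN noise points
            beta_clusters.append(part[labels == lab])

    # Step 3: merge beta-clusters whose bounding boxes overlap (simplified rule).
    merged = []
    for bc in beta_clusters:
        box = (bc.min(axis=0), bc.max(axis=0))
        for cluster in merged:
            lo, hi = cluster["box"]
            if np.all(box[0] <= hi) and np.all(lo <= box[1]):  # boxes overlap
                cluster["points"].append(bc)
                cluster["box"] = (np.minimum(lo, box[0]), np.maximum(hi, box[1]))
                break
        else:
            merged.append({"points": [bc], "box": box})
    return [np.vstack(c["points"]) for c in merged]

if __name__ == "__main__":
    pts = np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + 5])
    print(len(parc(pts, num_partitions=4)), "final clusters")
```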
SnI -- Sample and Ignore • Cluster a small sample of the data to find the major clusters • Ignore (filter out) the data points that fall inside those clusters' regions • Run ParC on the remaining points
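A hedged sketch of SnI's sample-then-ignore idea, again on one machine. The sampling rate and the bounding-box "inside a cluster" test are illustrative assumptions; in the full method the remaining points would be fed to ParC and the two sets of clusters combined.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def sni(data, sample_rate=0.01):
    # Phase 1: cluster a small random sample of the data.
    mask = np.random.rand(len(data)) < sample_rate
    sample = data[mask]
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(sample)
    boxes = [(sample[labels == lab].min(axis=0), sample[labels == lab].max(axis=0))
             for lab in set(labels) - {-1}]

    # Phase 2: ignore points already covered by a sampled cluster's box,
    # so only the leftover points pay the full ParC cost.
    def covered(p):
        return any(np.all(lo <= p) and np.all(p <= hi) for lo, hi in boxes)
    remainder = data[[not covered(p) for p in data]]

    # Phase 3: the remainder would now go through ParC; the sampled
    # clusters and ParC's clusters are then combined (omitted here).
    return boxes, remainder
```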
COST-BASED OPTIMIZATION • ParC cost is the sum of: • Map cost • Shuffle cost • Reduce cost
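As a rough illustration of a cost model in this spirit, the sketch below adds up per-stage costs estimated from data volume and machine throughput. The parameters (disk_rate, net_rate, startup_cost) and the linear form are assumptions for illustration, not the paper's actual equations.

```python
# A hedged sketch of a per-stage cost model. All parameters and the
# linear form are illustrative assumptions, not the paper's equations.
def parc_cost(data_bytes, num_machines, disk_rate, net_rate, startup_cost):
    map_cost = startup_cost + data_bytes / (disk_rate * num_machines)  # parallel read
    shuffle_cost = data_bytes / net_rate  # mappers' output crosses the network
    reduce_cost = data_bytes / (disk_rate * num_machines)  # parallel cluster + write
    return map_cost + shuffle_cost + reduce_cost
```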
BoW • Compute the ParC cost → costC • Compute the SnI cost → costCs • If costC > costCs, then clusters = result of SnI • Else clusters = result of ParC
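The decision rule above translates directly into code. A minimal sketch, assuming the cost estimators and the two pipelines are supplied as callables (run_parc and run_sni are hypothetical stand-ins):

```python
# BoW's decision rule: estimate both plans' costs, run only the cheaper one.
def bow(data, estimate_parc_cost, estimate_sni_cost, run_parc, run_sni):
    cost_c = estimate_parc_cost(data)   # plain ParC
    cost_cs = estimate_sni_cost(data)   # SnI (sample, ignore, then ParC)
    if cost_c > cost_cs:
        return run_sni(data)   # network-bound: sampling pays off
    return run_parc(data)      # I/O-bound: plain partitioning is cheaper
```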
EXPERIMENTAL RESULTS • Implemented on Hadoop • M45 cluster: 1.5 PB storage, 1 TB memory • DISC/Cloud cluster: 512 cores, 64 machines, 1 TB RAM, 256 TB disk storage
Quality of results • Average precision and recall of the clusterings • Measured on synthetic data
Scale-up results • Increasing the number of reducers
Scale-up results • Increasing the data size (r = 128, m = 700)
Thanks for your time!