
DisCo: Distributed Co-clustering with Map-Reduce

DisCo: Distributed Co-clustering with Map-Reduce. 2008 IEEE International Conference on Data Engineering (ICDE). S. Papadimitriou, J. Sun. Tzu-Li Tai, Tse-En Liu, Kai-Wei Chan, He-Chuan Hoh. IBM T.J. Watson Research Center, NY, USA. National Cheng Kung University.



Presentation Transcript

  1. DisCo: Distributed Co-clustering with Map-Reduce. 2008 IEEE International Conference on Data Engineering (ICDE). S. Papadimitriou, J. Sun. Tzu-Li Tai, Tse-En Liu, Kai-Wei Chan, He-Chuan Hoh. IBM T.J. Watson Research Center, NY, USA. National Cheng Kung University, Dept. of Electrical Engineering, HPDS Laboratory.

  2. Agenda • Motivation • Background: Co-Clustering + MapReduce • Proposed Distributed Co-Clustering Process • Implementation Details • Experimental Evaluation • Conclusions • Discussion

  3. Motivation: Fast Growth in Volume of Data • Google processes 20 petabytes of data per day • Amazon and eBay handle petabytes of transactional data every day. Highly variant structure of data • Data sources naturally generate data in impure forms • Unstructured, semi-structured

  4. Motivation: Problems with Big Data Mining for DBMSs • Significant preprocessing costs for the majority of data mining tasks • DBMSs lack performance for large amounts of data

  5. Motivation: Why distributed processing can solve the issues • MapReduce is independent of the schema or form of the input data • Many preprocessing tasks are naturally expressible with MapReduce • Highly scalable with commodity machines

  6. Motivation: Contributions of this paper • Presents the whole process for distributed data mining • Specifically, focuses on the co-clustering mining task and designs a distributed co-clustering method using MapReduce

  7. Background: Co-Clustering • Also named biclustering or two-mode clustering • Input format: a matrix of rows and columns • Output: co-clusters (sub-matrices) whose rows exhibit similar behavior across a subset of columns [Figure: two 4*5 example matrices]

  8. Background: Co-Clustering. Why Co-Clustering? Traditional clustering: [Figure: scores of Students A, B, C, D in Social, Science, Chinese, English, Math; clusters A & C and B & D] Can only know that students A & C / B & D have similar scores

  9. Background: Co-Clustering. Why Co-Clustering? Co-clustering: [Figure: the same score table with rows and columns reordered into Cluster 1 and Cluster 2; student groups A & C and B & D; column groups Science + Math and English + Chinese + Social Studies] One cluster is good at Science + Math, the other at English + Chinese + Social Studies. Co-clusters are rows that have similar properties for a subset of selected columns

  10. Background: Co-Clustering. Another Co-Clustering Example: Animal Data

  11. Background: Co-Clustering. Another Co-Clustering Example: Animal Data

  12. Background: Co-Clustering. Another Co-Clustering Example: Animal Data

  13. Background: MapReduce. The MapReduce Paradigm [Figure: map tasks feeding reduce tasks]
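As a rough illustration of the paradigm, the sketch below runs one map, shuffle and reduce round entirely in memory. The helper `run_mapreduce` and the in-degree example are assumptions made for illustration; they are not part of the slides and do not use Hadoop.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one map -> shuffle -> reduce round in memory (illustrative only)."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # shuffle: group values by key
    result = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            result[out_key] = out_value
    return result


def degree_mapper(src, dsts):
    # Emit one (destination, 1) pair per edge of the adjacency list.
    for dst in dsts:
        yield dst, 1


def sum_reducer(key, values):
    # Sum the counts collected for one destination node.
    yield key, sum(values)


adjacency = [(1, [2, 4, 5]), (2, [1, 3]), (3, [2, 4, 5]), (4, [1, 3])]
in_degree = run_mapreduce(adjacency, degree_mapper, sum_reducer)
print(in_degree)
```

Each mapper call sees a single record and each reducer call sees all values for a single key, which is the property the later slides rely on.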

  14. Distributed Co-Clustering Process: Mining Network Logs to Co-Cluster Communication Behavior

  15. Distributed Co-Clustering Process: Mining Network Logs to Co-Cluster Communication Behavior

  16. Distributed Co-Clustering Process: The Preprocessing Process [Figure: MapReduce jobs that read from and write to HDFS • MapReduce job: extract SrcIP + DstIP and build adjacency matrix (rows: SrcIP, columns: DstIP, binary entries) • MapReduce job: build adjacency list • MapReduce job: build transpose adjacency list]
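The preprocessing jobs can be pictured with a small single-machine sketch: extract (SrcIP, DstIP) pairs, group them into an adjacency list, and group the reversed pairs into the transpose adjacency list. The log records, the IP addresses and the helper `build_adjacency` are hypothetical and only stand in for the MapReduce jobs named on the slide.

```python
from collections import defaultdict

# Hypothetical log records: (timestamp, source IP, destination IP).
log_records = [
    ("t1", "10.0.0.1", "10.0.0.2"),
    ("t2", "10.0.0.1", "10.0.0.3"),
    ("t3", "10.0.0.2", "10.0.0.3"),
    ("t4", "10.0.0.1", "10.0.0.2"),  # repeated pair, stored once
]


def build_adjacency(pairs):
    """Group (node, neighbor) pairs into a sorted adjacency list."""
    adj = defaultdict(set)
    for node, neighbor in pairs:
        adj[node].add(neighbor)
    return {node: sorted(neighbors) for node, neighbors in sorted(adj.items())}


# Step 1: extract SrcIP + DstIP from each record.
edges = [(src, dst) for _, src, dst in log_records]
# Step 2: adjacency list (SrcIP -> DstIPs).
adjacency = build_adjacency(edges)
# Step 3: transpose adjacency list (DstIP -> SrcIPs).
transpose = build_adjacency((dst, src) for src, dst in edges)

print(adjacency)
print(transpose)
```

The adjacency list feeds the row iteration and the transpose feeds the column iteration, so both are built once up front.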

  17. Distributed Co-Clustering Process: Co-Clustering (Generalized Algorithm). Goal: co-cluster into 2x2 = 4 sub-matrices. Random initialization (each label is 1 or 2): row labels r(1) = 1, r(2) = 1, r(3) = 1, r(4) = 2; column labels c(1) = 1, c(2) = 1, c(3) = 1, c(4) = 2, c(5) = 2

  18. Distributed Co-Clustering Process: Co-Clustering (Generalized Algorithm). Fix column labels, iterate through rows: column labels c(1) = 1, c(2) = 1, c(3) = 1, c(4) = 2, c(5) = 2; row labels r(1) = 1, r(2) = 1, r(3) = 1, r(4) = 2; r(2) changes to 2

  19. Distributed Co-Clustering Process: Co-Clustering (Generalized Algorithm). Fix row labels, iterate through columns: column labels c(1) = 1, c(2) = 1, c(3) = 1, c(4) = 2, c(5) = 2; c(2) changes to 2
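The alternating procedure on these three slides can be sketched sequentially as follows. The slides do not state the cost function, so the squared distance between a row's per-cluster density and the block density is an assumption chosen for illustration, and the names `collect_stats` and `reassign` are hypothetical. The matrix is the 4x5 example whose adjacency list appears on the later slides.

```python
def collect_stats(neighbors, other_labels, n_other):
    """Count one row's (or column's) nonzeros per opposite-side cluster."""
    counts = [0] * n_other
    for j in neighbors:
        counts[other_labels[j] - 1] += 1
    return counts


def reassign(adj, labels, other_labels, n_groups, n_other):
    """One pass over rows (or columns) with the opposite labels held fixed."""
    sizes = [max(1, sum(1 for v in other_labels.values() if v == q + 1))
             for q in range(n_other)]
    stats = {i: collect_stats(adj[i], other_labels, n_other) for i in adj}
    # Block densities computed from the labels at the start of the pass.
    density = []
    for p in range(1, n_groups + 1):
        members = [i for i in adj if labels[i] == p]
        density.append([
            sum(stats[i][q] for i in members) / (max(1, len(members)) * sizes[q])
            for q in range(n_other)
        ])

    def cost(i, p):
        # ASSUMED cost: squared distance between row density and block density.
        return sum((stats[i][q] / sizes[q] - density[p - 1][q]) ** 2
                   for q in range(n_other))

    new_labels = {}
    for i in adj:
        best = labels[i]
        for p in range(1, n_groups + 1):
            if cost(i, p) < cost(i, best):  # change label only if cost drops
                best = p
        new_labels[i] = best
    return new_labels


rows = {1: [2, 4, 5], 2: [1, 3], 3: [2, 4, 5], 4: [1, 3]}
cols = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3], 5: [1, 3]}  # transpose
r = {1: 1, 2: 1, 3: 1, 4: 2}
c = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2}

r = reassign(rows, r, c, 2, 2)  # fix column labels, iterate through rows
c = reassign(cols, c, r, 2, 2)  # fix row labels, iterate through columns
print(r, c)
```

With this assumed cost, the first row pass moves row 2 to cluster 2 and the following column pass moves column 2 to cluster 2, the same label changes the slides show.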

  20. Distributed Co-Clustering Process: Co-Clustering with MapReduce. Input adjacency list: 1 -> 2,4,5; 2 -> 1,3; 3 -> 2,4,5; 4 -> 1,3 [Figure: the adjacency list is passed to a MapReduce (MR) job]

  21. Distributed Co-Clustering Process: Co-Clustering with MapReduce. Input adjacency list: 1 -> 2,4,5; 2 -> 1,3; 3 -> 2,4,5; 4 -> 1,3 [Figure: the adjacency list is passed to a MapReduce (MR) job based on parameters]

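  22. Distributed Co-Clustering Process: Mapper (M), row 1. Column labels: c(1) = 1, c(2) = 1, c(3) = 1, c(4) = 2, c(5) = 2. Input records, one per mapper: 1 -> 2,4,5; 2 -> 1,3; 3 -> 2,4,5; 4 -> 1,3. For record 1: if r(1) = 2, the cost becomes higher, so r(1) = 1 and the mapper emits (1, {(1,2), 1}), where (1,2) counts the nonzeros of row 1 in column clusters 1 and 2. Mapper Function: for each K-V input (k, v) • Calculate the row statistics (with v and the column labels c) • Change the row label r(k) if that results in a lower cost • Emit (r(k), (row statistics, k))

  23. Distributed Co-Clustering Process: Mapper (M), row 2. Column labels: c(1) = 1, c(2) = 1, c(3) = 1, c(4) = 2, c(5) = 2. Input records, one per mapper: 1 -> 2,4,5; 2 -> 1,3; 3 -> 2,4,5; 4 -> 1,3. For record 2: if r(2) = 2, the cost becomes lower, so r(2) = 2 and the mapper emits (2, {(2,0), 2}), where (2,0) counts the nonzeros of row 2 in column clusters 1 and 2. Mapper Function: for each K-V input (k, v) • Calculate the row statistics (with v and the column labels c) • Change the row label r(k) if that results in a lower cost • Emit (r(k), (row statistics, k))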
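A minimal sketch of the row-iteration mapper, using the example values from these two slides. The cost function (squared distance between the row's per-cluster density and the block density) and the constant names are assumptions for illustration; the slides only say that the label changes when the cost is lower. The block densities are assumed to come from the previous sync step and to be shared by all mappers.

```python
# Shared state, assumed to be distributed to every mapper before the job runs.
COL_LABELS = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2}  # c, fixed during row iteration
ROW_LABELS = {1: 1, 2: 1, 3: 1, 4: 2}        # r from the previous sync
COL_CLUSTER_SIZES = [3, 2]                   # columns per column cluster
# Block densities for the labels above: nonzeros / (rows * columns) per block.
DENSITY = {1: [4 / 9, 4 / 6], 2: [2 / 3, 0.0]}


def row_mapper(k, neighbors):
    """Map one adjacency-list record (k -> neighbors) to (label, (stats, k))."""
    # Row statistics: nonzeros of row k in each column cluster.
    stats = [0] * len(COL_CLUSTER_SIZES)
    for j in neighbors:
        stats[COL_LABELS[j] - 1] += 1

    def cost(p):
        # ASSUMED cost function; not specified in the slides.
        return sum((stats[q] / COL_CLUSTER_SIZES[q] - DENSITY[p][q]) ** 2
                   for q in range(len(stats)))

    label = ROW_LABELS[k]
    for p in DENSITY:
        if cost(p) < cost(label):  # change the label only if the cost drops
            label = p
    return label, (tuple(stats), k)


print(row_mapper(1, [2, 4, 5]))
print(row_mapper(2, [1, 3]))
```

Under these assumptions the two calls return the pairs shown on the slides: row 1 keeps label 1 with statistics (1, 2), and row 2 moves to label 2 with statistics (2, 0).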

  24. Distributed Co-Clustering Process [Figure: four mappers (M), one per adjacency-list record (1 -> 2,4,5; 2 -> 1,3; 3 -> 2,4,5; 4 -> 1,3), sending their output to two reducers (R)]

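  25. Distributed Co-Clustering Process: Reducer (R). Reducer Function: for each K-V input (row label, list of values) • For each value, accumulate all row statistics into the statistics of that row cluster • Take the union of all row indices • Emit the row label with the accumulated statistics and the union of row indices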
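The reducer step can be sketched as below, assuming each value is a (row statistics, row index) pair such as (1, {(1,2), 1}) from the mapper slides. The function name `row_reducer` and the exact output layout are hypothetical.

```python
def row_reducer(label, values):
    """Combine all (statistics, row index) values emitted for one row label."""
    total = None
    rows = set()
    for stats, k in values:
        # Accumulate the per-column-cluster counts element-wise.
        if total is None:
            total = list(stats)
        else:
            total = [a + b for a, b in zip(total, stats)]
        rows.add(k)  # union of the row indices assigned to this label
    return label, (tuple(total), sorted(rows))


# Values grouped by key, as the shuffle phase would deliver them.
print(row_reducer(1, [((1, 2), 1), ((1, 2), 3)]))
print(row_reducer(2, [((2, 0), 2), ((2, 0), 4)]))
```

Each output pair holds one row of the cluster statistics plus the members of that row cluster, which is what the sync step needs before the next iteration.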

  26. Distributed Co-Clustering Process [Figure: the outputs of the reducers (R) go to Sync Results]

  27. Distributed Co-Clustering Process: Preprocessing and Co-Clustering [Figure: flow chart • Preprocessing: MapReduce job on HDFS, build transpose adjacency list • Co-clustering starts from randomly given labels • MapReduce job: fix column, row iteration • MapReduce job: fix row, column iteration • Sync Results: labels are synced with the best permutation • Final co-clustering result with best permutations is written to HDFS]

  28. Implementation Details: Tuning the number of Reduce Tasks • The number of reduce tasks is related to the number of intermediate keys during the shuffle and sort phase • For the co-clustering row-iteration/column-iteration jobs, the number of intermediate keys is either the number of row clusters or the number of column clusters

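  29. Implementation Details [Figure: four mappers (M), one per adjacency-list record (1 -> 2,4,5; 2 -> 1,3; 3 -> 2,4,5; 4 -> 1,3), and two reducers (R); the (row-iterate) intermediate keys are marked between them]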

  30. Implementation Details: Tuning the number of Reduce Tasks • So, for the row-iteration/column-iteration jobs, 1 reduce task is enough • However, some preprocessing tasks, such as graph construction, produce a lot of intermediate keys and need many more reduce tasks
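The contrast between the two kinds of jobs can be made concrete by counting distinct intermediate keys. The sketch below is illustrative only: `distinct_keys` is a hypothetical helper, the row-iteration pairs are the example values from the mapper slides, and the 1000-node edge list is made up.

```python
def distinct_keys(emitted_pairs):
    """Number of distinct intermediate keys in a list of (key, value) pairs."""
    return len({key for key, _ in emitted_pairs})


# Row-iteration job: keys are row-cluster labels, so there are at most
# as many keys as row clusters (2 in the slides' example).
row_iteration_output = [
    (1, ((1, 2), 1)),
    (2, ((2, 0), 2)),
    (1, ((1, 2), 3)),
    (2, ((2, 0), 4)),
]

# Graph construction: keys are source nodes, so the key count grows with
# the data (hypothetical edge list with 1000 source nodes).
graph_output = [(src, dst) for src in range(1000) for dst in (src + 1, src + 2)]

row_keys = distinct_keys(row_iteration_output)
graph_keys = distinct_keys(graph_output)
print(row_keys, graph_keys)
```

A job with only two keys cannot keep more than two reduce tasks busy, while a job with one key per node can spread its work over many.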

  31. Implementation Details: The Preprocessing Process [Figure: MapReduce jobs that read from and write to HDFS • MapReduce job: extract SrcIP + DstIP and build adjacency matrix (rows: SrcIP, columns: DstIP, binary entries) • MapReduce job: build adjacency list • MapReduce job: build transpose adjacency list]

  32. Experimental Evaluation: Environment • There are 39 nodes in four different blade enclosures • Gigabit Ethernet • Blade server • CPU: two dual-core (Intel Xeon 2.66GHz) • Memory: 8GB • OS: Red Hat Enterprise Linux • Hadoop Distributed File System (HDFS) capacity: 2.4 TB

  33. Experimental Evaluation: Datasets

  34. Experimental Evaluation: Preprocessing ISS Data. Optimal values of each situation • Map tasks number: 6 • Reduce tasks number: 5 • Input split size: 256MB

  35. Experimental Evaluation: Co-Clustering TREC Data. Beyond 25 nodes, the time per iteration is roughly 20 ± 2 seconds, which is better than what can be achieved on a single machine with 48GB RAM.

  36. Conclusion • The authors of the paper shared the lessons they learnt from data mining with vast quantities of data, particularly in the context of co-clustering, and recommend using a distributed approach • Designed a general MapReduce approach for co-clustering algorithms • Showed that the MapReduce co-clustering framework scales well with large real-world datasets (ICC, TREC)

  37. Discussion • Necessity of the global sync action • Questionable scalability of DisCo

  38. Discussion: Necessity of the global sync action [Figure: co-clustering flow chart • Starts from randomly given labels • MapReduce job: fix column, row iteration • MapReduce job: fix row, column iteration • Sync Results: labels are synced with the best permutation • Final co-clustering result with best permutations]

  39. Discussion [Figure: four mappers (M), one per adjacency-list record (1 -> 2,4,5; 2 -> 1,3; 3 -> 2,4,5; 4 -> 1,3), sending their output to two reducers (R)]

  40. Discussion: Questionable Scalability of DisCo • For row-iteration jobs (or column-iteration jobs), the number of intermediate keys is fixed to the number of row clusters (or column clusters) • This implies that for a given number of row and column clusters, as the input matrix gets larger, the reducer size* will increase dramatically • Since a single reducer (a key plus its associated values) is sent to one reduce task, the memory capacity of a computing node will be a severe bottleneck for overall performance. *Reference: Upper Bound and Lower Bound of a MapReduce Computation, 2013 VLDB
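A back-of-the-envelope sketch of the argument: with the number of row clusters fixed, the number of values behind each intermediate key grows linearly with the number of rows. The even spread of rows over clusters, the choice of 4 clusters and the function name `reducer_size` are assumptions made only for this illustration.

```python
def reducer_size(num_rows, num_row_clusters):
    """Average number of values per intermediate key, assuming an even spread."""
    return num_rows / num_row_clusters


# Fixed number of row clusters (4, chosen arbitrarily), growing input matrix.
sizes = {m: reducer_size(m, 4) for m in (10**3, 10**6, 10**9)}
for rows, size in sizes.items():
    print(f"{rows} rows -> about {size:.0f} values per reducer")
```

Adding machines does not shrink these numbers, because the key count, not the cluster size, decides how the values are split.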

  41. Discussion [Figure: four mappers (M), one per adjacency-list record (1 -> 2,4,5; 2 -> 1,3; 3 -> 2,4,5; 4 -> 1,3), sending their output to two reducers (R)]
