Tree and Graph Processing On Hadoop. Ted Malaska. Schedule: Intro, Overview of Hadoop and Eco-System, Summarize Tree Rooting, MR Overview/Implementation Options, HBase Overview/Implementation Options, Giraph Overview/Implementation Options, Spark Overview/Implementation Options, Summary.
Tree and Graph Processing On Hadoop Ted Malaska
Schedule • Intro • Overview of Hadoop and Eco-System • Summarize Tree Rooting • MR Overview/Implementation Options • HBase Overview/Implementation Options • Giraph Overview/Implementation Options • Spark Overview/Implementation Options • Summary • Questions
Intro • Hi there
Overview of Hadoop and Eco-System • [Diagram: components grouped under Machine Learning, NoSql, Search, Batch, Ingestion, Streaming, RTQ and LFP, together with Auditing and Monitoring, Security and Access Controls, and HDFS. Components shown: Map Reduce, Pig, Crunch, Hive, Giraph, Sqoop, Flume, Kafka, NFS, Storm, Spark Streaming, Impala, Spark, Mahout, Oryx, R, Python, Streaming, SAS, HBase, Accumulo, Search, SolR]
In Scope for Tonight • [Diagram: the Hadoop ecosystem map from the previous slide, repeated]
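Summarize Tree Rooting • Basic Tree • [Diagram: a tree whose vertices are labeled with their depth, 0 to 3; callouts mark a Vertex, an Edge, the Branches, the Leaves, the Depth, and the True Root at depth 0]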
Summarize Tree Rooting • More Complex Tree • [Diagram: a tree with depth labels 0 to 3; callouts mark a Circular Link and a vertex with Multiple Parents]
Summarize Tree Rooting • Merging Trees • Borderline True Graph Problem • [Diagram: trees with depth labels 0 to 3 that share vertices; callouts mark two True Roots and a Multi Rooted Vertex]
Summarize Tree Rooting • Know your data
Basic Storage Format • <NodeID>|<EdgeID> • Example • 101 • 101|201 • 101|202 • 201 • 202|301 • 301
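The storage format above can be read into an in-memory adjacency map. The following is a minimal sketch, assuming each record is either `<NodeID>` alone or `<NodeID>|<EdgeID>`; the function name `parse_records` is hypothetical and not from the talk.

```python
from collections import defaultdict


def parse_records(lines):
    """Parse '<NodeID>' and '<NodeID>|<EdgeID>' records.

    Returns (vertices, edges): a set of every ID seen, and a dict that
    maps a node ID to the list of edge IDs it points at.
    """
    vertices = set()
    edges = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank records
        node_id, _, edge_id = line.partition("|")
        vertices.add(node_id)
        if edge_id:
            # A vertex may appear only as an edge target (202 in the example).
            vertices.add(edge_id)
            edges[node_id].append(edge_id)
    return vertices, dict(edges)


example = ["101", "101|201", "101|202", "201", "202|301", "301"]
vertices, edges = parse_records(example)
```

For the example records this yields four vertices and the edges 101 to 201, 101 to 202, and 202 to 301.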
Preprocessing • Trimming Data • Nodes and edges have data • Data has weight • Normally linkage information is under 10% of true data size • Organize Data by Partitioning
Basic Solution • Step 1: Identify Roots • Echo to all edges • Vertices that receive no echoes are roots • Root the root • Step 2: Walk the tree • Echo from the last newly rooted vertex to all edges • If a vertex is not already rooted, then root it.
Example input: • 101 • 101|201 • 101|202 • 201 • 202|301 • 301
After Step 1 (root identified): • 101|R:101 • 101|201|R:101 • 101|202|R:101 • 201|R:Null • 202|301|R:Null • 301|R:Null
After Step 2, first pass: • 101|R:101 • 101|201|R:101 • 101|202|R:101 • 201|R:101 • 202|301|R:101 • 301|R:Null
After Step 2, second pass: • 101|R:101 • 101|201|R:101 • 101|202|R:101 • 201|R:101 • 202|301|R:101 • 301|R:101
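The two steps can be sketched on a single machine before looking at any Hadoop implementation. This is an illustrative sketch only: the name `root_tree` is hypothetical, and the choice that the first parent to reach a vertex wins (for vertices with multiple parents) is an assumption, not something the slides specify.

```python
from collections import defaultdict


def root_tree(lines):
    """Return {vertex: root} for '<NodeID>' / '<NodeID>|<EdgeID>' records."""
    vertices, children = set(), defaultdict(list)
    for line in lines:
        node_id, _, edge_id = line.strip().partition("|")
        vertices.add(node_id)
        if edge_id:
            vertices.add(edge_id)
            children[node_id].append(edge_id)

    # Step 1: every vertex echoes to its edges; a vertex that receives
    # no echo is a root, and a root is rooted to itself.
    echoed = {c for kids in children.values() for c in kids}
    rooted = {v: v for v in vertices if v not in echoed}

    # Step 2: echo from the newly rooted vertices only; root whatever is
    # not rooted yet. Skipping rooted vertices also stops circular links.
    frontier = sorted(rooted)
    while frontier:
        newly_rooted = []
        for v in frontier:
            for c in children.get(v, []):
                if c not in rooted:
                    rooted[c] = rooted[v]
                    newly_rooted.append(c)
        frontier = newly_rooted
    return rooted


result = root_tree(["101", "101|201", "101|202", "201", "202|301", "301"])
```

On the example data every vertex ends up rooted to 101. Vertices that sit only on a cycle with no root never receive an echo from a rooted vertex, so they are simply absent from the result.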
Map Reduce • Massively parallel processing on Hadoop • Based on the Google 2004 MapReduce white paper • Able to process PBs of data
Map Reduce • [Diagram: Data Blocks feed Mappers, whose output goes through Sort & Shuffle]
Map Reduce • Self Joins • Always dumping two outputs: • Newly Rooted • Still Un-Rooted • [Diagram: All Data feeds MR Stage 0 (Root Identifying), followed by MR Stage 1 (Rooting) and MR Stage 2 (Rooting); each stage writes a Newly Rooted output and an Un-Rooted output, and earlier results accumulate as Old Rooted 0 and Old Rooted 1]
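The staged self-join can be simulated in plain Python to show the two outputs per stage. This is a sketch under stated assumptions, not the talk's implementation: it assumes a plain tree (one parent per vertex, no circular links), and the names `mr_stage` and `root_with_mr` and the `ROOT`/`EDGE` tags are hypothetical.

```python
from collections import defaultdict


def mr_stage(newly_rooted, unrooted_edges):
    """One rooting stage: join rooted vertices with un-rooted edges."""
    # Map: tag each record and key it by the vertex ID to join on.
    mapped = [(v, ("ROOT", r)) for v, r in newly_rooted]
    mapped += [(p, ("EDGE", c)) for p, c in unrooted_edges]

    # Sort & shuffle: bring all values for one key together.
    groups = defaultdict(list)
    for key, value in sorted(mapped):
        groups[key].append(value)

    # Reduce: always write two outputs.
    next_rooted, still_unrooted = [], []
    for parent, values in groups.items():
        roots = [r for tag, r in values if tag == "ROOT"]
        kids = [c for tag, c in values if tag == "EDGE"]
        for child in kids:
            if roots:
                next_rooted.append((child, roots[0]))
            else:
                still_unrooted.append((parent, child))
    return next_rooted, still_unrooted


def root_with_mr(vertices, edges):
    # Stage 0: a vertex that is nobody's child is a root.
    children = {c for _, c in edges}
    newly = [(v, v) for v in vertices if v not in children]
    old_rooted, unrooted = [], list(edges)
    while newly:
        old_rooted += newly  # earlier output becomes "old rooted"
        newly, unrooted = mr_stage(newly, unrooted)
    return dict(old_rooted), unrooted


rooted, leftover = root_with_mr(
    ["101", "201", "202", "301"],
    [("101", "201"), ("101", "202"), ("202", "301")],
)
```

Each pass through the loop stands in for one MR job, which is why a deep tree costs one job per level.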
Map Reduce • Great for large batch operations • No memory limit • Not good at iterations
HBase • Largest and most used NoSql implementation in the world • Based on the Google 2006 BigTable white paper • Think of it as a giant HashMap of keys and values • Handles 100k operations a second even on a small 10-node cluster
HBase Getting • [Diagram: a Client, the HBase Master, and three HBase Region Servers, each with a Block Cache]
HBase Putting • [Diagram: a Client, the HBase Master, and three HBase Region Servers, each with a WAL, a MemStore, and an HFile]
HBase • Good for graph traversing • Bad for large batch processing • Scan rate about 8x slower than HDFS • Good for the end of a long tail
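To illustrate why point lookups suit traversal, here is a sketch that uses an in-memory dictionary as a stand-in for an HBase table. Everything here is an assumption for illustration: the row layout (row key = NodeID, value = list of EdgeIDs), the `FakeTable` class, and the `descendants` function are hypothetical, and no real HBase client API is used.

```python
from collections import deque


class FakeTable:
    """In-memory stand-in for a key/value table; counts point lookups."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.gets = 0

    def get(self, row_key):
        self.gets += 1
        return self.rows.get(row_key, [])


def descendants(table, start):
    """Breadth-first walk below `start`, one get per visited vertex."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        vertex = queue.popleft()
        for child in table.get(vertex):
            if child not in seen:  # guards against circular links
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


table = FakeTable({"101": ["201", "202"], "202": ["301"]})
below_root = descendants(table, "101")
```

The walk touches only the rows it needs, so its cost follows the size of the subtree rather than the size of the table, which is the opposite trade-off from a full batch scan.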
Giraph • System built for large batch graph processing • Based on the Pregel 2009 white paper • Hardened by LinkedIn and Facebook • Reported to handle up to a trillion edges
Giraph Loading • [Diagram: a Master, four Workers, and the Data Blocks being loaded]
Giraph (Bulk Synchronous Parallel) • [Diagram: three Workers each perform Local vertex computing, followed by Communication and then Barrier synchronization]
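The three BSP phases can be sketched as a loop of supersteps in plain Python. This is not Giraph code and uses no Giraph API; the name `bsp_root`, the inbox/outbox dictionaries, and the tie-break of taking the smallest root ID among incoming messages are all assumptions made for illustration.

```python
def bsp_root(vertices, edges):
    """Root every vertex using vertex-centric supersteps."""
    out_edges = {v: [] for v in vertices}
    has_parent = set()
    for parent, child in edges:
        out_edges[parent].append(child)
        has_parent.add(child)

    root_of = {v: None for v in vertices}
    # Vertices with no incoming edge start with a message to themselves.
    inbox = {v: [v] for v in vertices if v not in has_parent}
    supersteps = 0
    while inbox:
        outbox = {}
        # Local vertex computing: each vertex only sees its own messages.
        for vertex, messages in inbox.items():
            if root_of[vertex] is None:
                root_of[vertex] = min(messages)
                # Communication: send the root along every out edge.
                for child in out_edges[vertex]:
                    outbox.setdefault(child, []).append(root_of[vertex])
        # Barrier synchronization: messages are delivered only after
        # every vertex has finished this superstep.
        inbox = outbox
        supersteps += 1
    return root_of, supersteps


roots, steps = bsp_root(
    ["101", "201", "202", "301"],
    [("101", "201"), ("101", "202"), ("202", "301")],
)
```

The example tree finishes in three supersteps, one per level, and the whole graph stays in memory between supersteps instead of being rewritten by a new job each time.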
Giraph • Most mature bulk graph processing out there • Of all the solutions, most graph focused
Spark • At Berkeley around 2011, some asked if we could do better than MR • Take advantage of lower cost memory • Building on everything before
Spark • [Diagram: components shown are RDD Objects (rdd1.join(rdd2).groupBy(…).filter(…)), the DAG Scheduler (like a queue planner), the Task Scheduler, the Cluster Manager, and Spark Workers, each with task Threads and a Block Manager]
Spark • Implementations • Onion MR approach with Basic Spark • Pregel approach with Bagel or GraphX • Bagel is a Façade over Generic Spark Functionality • GraphX is an effort to extend Spark • Less code • Learning curve • It's raw and will be changing a lot in the next year