Tree and Graph Processing On Hadoop

Tree and Graph Processing On Hadoop • Ted Malaska

Schedule • Intro • Overview of Hadoop and Eco-System • Summarize Tree Rooting • MR Overview/Implementation Options • HBase Overview/Implementation Options • Giraph Overview/Implementation Options • Spark Overview/Implementation Options • Summary • Questions

Intro • Hi there

Overview of Hadoop and Eco-System • [Diagram] Categories: Machine Learning, NoSQL, Search, Batch, Ingestion, Streaming, RTQ, LFP • Components: Map Reduce, Pig, Crunch, Hive, Giraph, Sqoop, Flume, Kafka, NFS, Storm, Spark Streaming, Impala, Spark, Mahout, Oryx, R, Python Streaming, SAS, HBase, Accumulo, Search, Solr • Also shown: Auditing and Monitoring, Security and Access Controls, HDFS

In Scope for Tonight • [Diagram] Categories: Machine Learning, NoSQL, Search, Batch, Ingestion, Streaming, RTQ, LFP • Components: Map Reduce, Pig, Crunch, Hive, Giraph, Sqoop, Flume, Kafka, NFS, Storm, Spark Streaming, Impala, Spark, Mahout, Oryx, R, Python Streaming, SAS, HBase, Accumulo, Search, Solr • Also shown: Auditing and Monitoring, Security and Access Controls, HDFS

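Summarize Tree Rooting • Basic Tree • [Diagram: a tree whose vertices are labeled with their depth, from 0 at the True Root to 3 at the leaves; callouts mark a Vertex, an Edge, the Branches, and the Leafs]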

Summarize Tree Rooting • More Complex Tree • [Diagram: a tree with vertex depths 0 to 3 that contains a Circular Link and a vertex with Multiple Parents]

Summarize Tree Rooting • Merging Trees • Borderline True Graph Problem • [Diagram: merged trees with vertex depths 0 to 3, more than one True Root at depth 0, and a vertex marked as a Multi Rooted Vertex]

Summarize Tree Rooting • Know your data

Basic Storage Format • <NodeID>|<EdgeID> • Example • 101 • 101|201 • 101|202 • 201 • 202|301 • 301
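The storage format above can be loaded into an adjacency map along the following lines. This is a minimal sketch in Python; the helper name `parse_records` and the dictionary layout are assumptions made for illustration, and it assumes that a record without a `|` simply declares a node that has no edge on that line.

```python
from collections import defaultdict


def parse_records(lines):
    """Parse '<NodeID>' or '<NodeID>|<EdgeID>' records into an adjacency map.

    Hypothetical helper: maps every node ID to the sorted list of node IDs
    its edges point to.
    """
    adjacency = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        node_id, _, edge_id = line.partition("|")
        adjacency[node_id]  # touch the key so edge-less nodes are kept
        if edge_id:
            adjacency[node_id].add(edge_id)
            adjacency[edge_id]  # make sure the edge target exists as a node
    return {node: sorted(targets) for node, targets in adjacency.items()}


# The example records from the slide.
example = ["101", "101|201", "101|202", "201", "202|301", "301"]
graph = parse_records(example)
```

Keeping only this linkage map in memory, and leaving the heavier node and edge payloads on disk, is one way to read the storage format as a lightweight index.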

Preprocessing • Trimming Data • Nodes and edges have data • Data has weight • Linkage information is normally under 10% of true data size • Organize Data by Partitioning

Basic Solution • Step 1: Identify Roots • Echo to all edges • Vertexes that receive no echoes are roots • Root the root • Step 2: Walk the tree • Echo from the last newly rooted vertex to all edges • If a vertex is not already rooted, then root it • Input: 101 • 101|201 • 101|202 • 201 • 202|301 • 301 • After Step 1: 101|R:101 • 101|201|R:101 • 101|202|R:101 • 201|R:Null • 202|301|R:Null • 301|R:Null • After Step 2, first pass: 101|R:101 • 101|201|R:101 • 101|202|R:101 • 201|R:101 • 202|301|R:101 • 301|R:Null • After Step 2, second pass: 101|R:101 • 101|201|R:101 • 101|202|R:101 • 201|R:101 • 202|301|R:101 • 301|R:101
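The two steps above can be sketched in a few lines of single-machine Python. This is an illustrative sketch, not the talk's implementation: the function name `root_tree` is hypothetical, and it assumes each edge points from `<NodeID>` (parent) to `<EdgeID>` (child).

```python
def root_tree(records):
    """Assign a root to every vertex using the two-step echo approach.

    Assumption: each record is '<NodeID>' or '<NodeID>|<EdgeID>', and an
    edge points from NodeID to EdgeID (parent to child).
    Returns (root_of, passes), where passes counts Step 2 iterations
    that rooted at least one vertex.
    """
    children = {}
    for record in records:
        node_id, _, edge_id = record.partition("|")
        children.setdefault(node_id, set())
        if edge_id:
            children[node_id].add(edge_id)
            children.setdefault(edge_id, set())

    # Step 1: every vertex echoes to its edges; a vertex that hears
    # nothing has no parent, so it is a root and is rooted to itself.
    echoed = {child for targets in children.values() for child in targets}
    root_of = {node: None for node in children}
    frontier = []
    for node in sorted(children):
        if node not in echoed:
            root_of[node] = node
            frontier.append(node)

    # Step 2: walk the tree. Only the newly rooted vertices echo, and a
    # vertex that is already rooted ignores the echo (this guards cycles).
    passes = 0
    while frontier:
        newly_rooted = []
        for node in frontier:
            for child in sorted(children[node]):
                if root_of[child] is None:
                    root_of[child] = root_of[node]
                    newly_rooted.append(child)
        frontier = newly_rooted
        if newly_rooted:
            passes += 1
    return root_of, passes


records = ["101", "101|201", "101|202", "201", "202|301", "301"]
root_of, passes = root_tree(records)
```

On the example records, every vertex ends up rooted to 101 after two passes of Step 2. Vertices that sit on a cycle with no path from any root would stay unrooted in this sketch, which is one reason to know your data first.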

Map Reduce • Massively parallel processing on Hadoop • Based on Google's 2004 MapReduce white paper • Able to process PBs of data

Map Reduce • [Diagram: data blocks are read by mappers, and mapper output passes through Sort & Shuffle]

Map Reduce • Self Joins • Always dumping two outputs: Newly Rooted and Still Un-Rooted • [Diagram: All Data feeds MR Stage 0 (Root Identifying), followed by MR Stage 1 (Rooting) and MR Stage 2 (Rooting); each stage emits a Newly Rooted output and an Un-Rooted output, and previously rooted data is kept as Old Rooted]
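One rooting stage of this self-join pattern can be simulated in plain Python as a map step, a shuffle, and a reduce step with two outputs. This is a sketch under stated assumptions rather than real Hadoop code: the function `rooting_stage`, the `"ROOT"`/`"RECORD"` tags, and the tuple layouts are all hypothetical, and picking `min(roots)` when several parents echo is just one deterministic choice.

```python
from collections import defaultdict


def rooting_stage(newly_rooted, unrooted):
    """Simulate one rooting stage as map, shuffle, and reduce.

    newly_rooted: list of (node, root, edges) rooted by the previous stage
    unrooted:     list of (node, edges) still waiting for a root
    Returns the two outputs of the stage: (newly_rooted, still_unrooted).
    """
    # Map: rooted vertices send their root along each edge; un-rooted
    # vertices send their own record keyed by node ID (the self join).
    mapped = []
    for _node, root, edges in newly_rooted:
        for edge in edges:
            mapped.append((edge, ("ROOT", root)))
    for node, edges in unrooted:
        mapped.append((node, ("RECORD", edges)))

    # Sort & shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in sorted(mapped, key=lambda pair: pair[0]):
        groups[key].append(value)

    # Reduce: a key with both a record and a root becomes newly rooted;
    # a key with only a record stays un-rooted for the next stage.
    next_rooted, still_unrooted = [], []
    for key, values in groups.items():
        roots = [payload for tag, payload in values if tag == "ROOT"]
        records = [payload for tag, payload in values if tag == "RECORD"]
        if not records:
            continue  # the echo reached a vertex with no un-rooted record
        edges = records[0]
        if roots:
            next_rooted.append((key, min(roots), edges))
        else:
            still_unrooted.append((key, edges))
    return next_rooted, still_unrooted


# Assumed output of the root-identifying stage for the example tree.
stage0_rooted = [("101", "101", ["201", "202"])]
stage0_unrooted = [("201", []), ("202", ["301"]), ("301", [])]

stage1_rooted, stage1_unrooted = rooting_stage(stage0_rooted, stage0_unrooted)
stage2_rooted, stage2_unrooted = rooting_stage(stage1_rooted, stage1_unrooted)
```

Each call stands in for one full MapReduce job, so a tree of depth d needs roughly d jobs, with the un-rooted data rewritten and reread every time.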

Map Reduce • Great for large batch operations • No memory limit • Not good at iterations

HBase • Largest and most used NoSQL implementation in the world • Based on Google's 2006 Bigtable white paper • Imagine it as a giant HashMap with keys and values • Handles 100k operations a second on even a small 10-node cluster

HBase Getting • [Diagram: a Client, the HBase Master, and three HBase Region Servers, each with its own Block Cache]

HBase Putting • [Diagram: a Client, the HBase Master, and three HBase Region Servers, each with its own WAL, MemStore, and HFile]

HBase • Good for graph traversing • Bad for large batch processing • Scan rate about 8x slower than HDFS • Good for end of a long tail

Giraph • System built for large batch graph processing • Based on the 2009 Pregel white paper • Hardened by LinkedIn and Facebook • Reported to handle up to a trillion edges

Giraph Loading • [Diagram: a Master and four Workers, with Data Blocks being loaded]

Giraph (Bulk Synchronous Parallel) • [Diagram: three Workers, each doing Local vertex computing, with Communication and Barrier synchronization phases]
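The bulk synchronous parallel cycle can be illustrated with a small single-process simulation of root propagation. This is plain Python, not the Giraph API: the function `bsp_root` and its message layout are hypothetical, and it assumes every edge target also appears as a key of `children`.

```python
def bsp_root(children):
    """Propagate root IDs with Pregel-style supersteps (plain Python).

    children: dict mapping every vertex ID to a list of out-edge targets.
    Returns (root_of, supersteps).
    """
    has_parent = {c for targets in children.values() for c in targets}
    root_of = {vertex: None for vertex in children}
    # Input to the first superstep: each parentless vertex gets its own ID.
    inbox = {v: [v] for v in children if v not in has_parent}
    supersteps = 0

    while inbox:  # no messages in flight means every vertex has halted
        outbox = {}
        # Local vertex computing: only vertices with messages are active.
        for vertex, messages in inbox.items():
            if root_of[vertex] is not None:
                continue  # already rooted, ignore late echoes
            root_of[vertex] = min(messages)
            # Communication: queue the root for every out-edge.
            for target in children[vertex]:
                outbox.setdefault(target, []).append(root_of[vertex])
        # Barrier synchronization: queued messages become visible only
        # once every vertex has finished this superstep.
        inbox = outbox
        supersteps += 1
    return root_of, supersteps


tree = {"101": ["201", "202"], "201": [], "202": ["301"], "301": []}
bsp_roots, bsp_steps = bsp_root(tree)
```

Unlike a chain of MapReduce jobs, the graph stays in memory between supersteps and only the messages move, which is the main appeal of this model for iterative graph work.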

Giraph • Most mature bulk graph processing out there • Of all the solutions, most graph focused

Spark • At Berkeley around 2011, some asked if we could do better than MR • Take advantage of lower cost memory • Building on everything before

Spark • [Diagram: RDD Objects, for example rdd1.join(rdd2).groupBy(…).filter(…), a DAG Scheduler (like a queue planner), a Task Scheduler, a Cluster Manager, and Spark Workers, each with task threads and a Block Manager]

Spark • Implementations • Onion MR approach with basic Spark • Pregel approach with Bagel or GraphX • Bagel is a façade over generic Spark functionality • GraphX is an effort to extend Spark • Less code • Learning curve • It's raw and will be changing a lot in the next year
