Benchmarking traversal operations over graph databases
Marek Ciglan¹, Alex Averbuch² and Ladislav Hluchý¹
¹ Institute of Informatics, Slovak Academy of Sciences, Bratislava
² Swedish Institute of Computer Science, Stockholm, Sweden
Overview
• Graph data management
• Graph databases
  • Characteristics
  • Unique features
  • Challenges
• GDB benchmarking
  • Motivation
  • Related work
• Graph traversal benchmark
  • Goals
  • Design
  • Preliminary results
21 November 2011
Graph data management
• Booming area of R&D in recent years
• Reasons:
  • Increased availability and importance of graph data
  • Natural way of modelling various real-world phenomena (networks: social, information, communication)
• Two dominant data management directions:
  • Distributed graph processing frameworks
    • Mining/processing of large graphs
    • Pregel and clones (Golden Orb, Giraph)
  • Graph databases
    • Persistent management of graph data
    • Neo4J, OrientDB, Dex
Graph databases
• Property graph data model
  • Graph structure
  • Elements have properties
[Figure: example property graph with four nodes (K1 to K4), each with attributes I1, I2 and I3, connected by edges labelled L1, L2 and L3]
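The property graph model can be illustrated with a minimal in-memory sketch. The class and method names (`PropertyGraph`, `add_node`, `add_edge`, `neighbours`) are hypothetical and do not correspond to any of the benchmarked systems or to the Blueprints API; the way the four nodes are wired together is also an assumption, since only the node keys, attribute names and edge labels are known from the slide.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Node:
    key: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


class PropertyGraph:
    """Toy in-memory property graph: nodes and edges both carry properties."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.out_edges: Dict[str, List[Edge]] = {}

    def add_node(self, key: str, **properties: Any) -> Node:
        node = Node(key, dict(properties))
        self.nodes[key] = node
        self.out_edges.setdefault(key, [])
        return node

    def add_edge(self, source: str, target: str, label: str, **properties: Any) -> Edge:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        edge = Edge(source, target, label, dict(properties))
        self.out_edges[source].append(edge)
        return edge

    def neighbours(self, key: str, label: Optional[str] = None) -> List[str]:
        """Targets of outgoing edges, optionally filtered by edge label."""
        return [e.target for e in self.out_edges[key] if label is None or e.label == label]


# Assumed wiring of the four example nodes (the original diagram is not available).
g = PropertyGraph()
for k in ("K1", "K2", "K3", "K4"):
    g.add_node(k, I1="val", I2="val", I3="val")
g.add_edge("K1", "K2", "L1")
g.add_edge("K1", "K3", "L2")
g.add_edge("K2", "K4", "L3")
g.add_edge("K3", "K4", "L1")
```

Keeping the topology in an adjacency structure, separate from the property maps, is what makes neighbour lookups cheap in this sketch.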
Graph databases
• Unique feature
  • Graph topology capturing the relations between objects
• A graph database should be
  • Efficient in exploiting topology
  • Able to support fast traversal
• Challenges
  • Traditionally, graph processing/traversal is done in memory
  • Reasons:
    • Data-driven computation
    • Random access pattern for data access
Graph database benchmarking
• Motivation
  • A number of graph data management solutions are emerging.
  • Which is the right one for a specific problem?
  • Fair measurement of performance for distinct use cases.
  • Identify limits: which use cases have good performance.
Graph database benchmarking
• Related work
  • Only a few works directly address graph databases
  • D. Dominguez-Sal et al.:
    • Adoption of an HPC benchmark for graph data processing
    • Design of a benchmark suitable for graph database systems
  • GraphBench: basic benchmarking framework implementation
Graph database benchmarking
• Traversal operation benchmarking
  • Graph topology is the unique feature of graph databases
  • Test the ability to do:
    • Local traversals (exploring the k-hop neighbourhood)
    • Global traversals (traversals of the whole graph)
  • Perform traversals in a memory-constrained environment (can we deal efficiently with data sets exceeding the physical memory?)
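A local traversal that explores the k-hop neighbourhood can be sketched as a depth-bounded breadth-first search. This is an illustration over a plain adjacency dictionary, not the benchmark's own code; the function name `k_hop_neighbourhood` and the example graph are assumptions.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Set


def k_hop_neighbourhood(
    adj: Dict[Hashable, Iterable[Hashable]], start: Hashable, k: int
) -> Set[Hashable]:
    """Return all nodes reachable from `start` in at most k hops (excluding `start`)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond the hop limit
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    seen.discard(start)
    return seen


# Hypothetical example graph: a -> b, c; b -> d; c -> d; d -> e; e -> f
adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": ["f"], "f": []}
three_hops = k_hop_neighbourhood(adj, "a", 3)
```

The cost of such a traversal depends on the size of the neighbourhood rather than on the size of the whole graph, which is what distinguishes it from a global traversal.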
Benchmark design
• Fairness
  • Blueprints API: an effort to provide a common API
    • https://github.com/tinkerpop/blueprints/wiki/
  • Using Blueprints: one implementation of the benchmark for all benchmarked systems
    • Avoids the bias of different benchmark implementations for different systems
  • Execution of the same sequence of operations on the same data
    • Log operations and their parameters in the first run over the defined data
    • Logs are persistent, so benchmarks can be rerun on different versions of a product and the change in performance measured
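The record-then-replay idea behind the fairness goal can be sketched as follows. This is only an illustration: the JSON-lines log format, the function names and the `handlers` mapping are assumptions and are not taken from the benchmark implementation or from the Blueprints API.

```python
import io
import json
import random
from typing import Any, Callable, Dict, List, Sequence, TextIO


def record_operations(node_ids: Sequence[str], count: int, seed: int, log: TextIO) -> None:
    """First run: choose operation parameters once and persist them as JSON lines."""
    rng = random.Random(seed)
    for _ in range(count):
        op = {"op": "bfs", "start": rng.choice(node_ids), "hops": 3}
        log.write(json.dumps(op) + "\n")


def replay_operations(log: TextIO, handlers: Dict[str, Callable[..., Any]]) -> List[Any]:
    """Later runs: execute exactly the logged sequence against one system."""
    results = []
    for line in log:
        op = json.loads(line)
        name = op.pop("op")
        results.append(handlers[name](**op))
    return results


# An in-memory buffer stands in for a persistent log file.
log = io.StringIO()
record_operations(["n1", "n2", "n3"], count=5, seed=42, log=log)

# Two stand-in "systems" that simply echo the parameters they receive.
system_a = {"bfs": lambda start, hops: (start, hops)}
system_b = {"bfs": lambda start, hops: (start, hops)}

log.seek(0)
first = replay_operations(log, system_a)
log.seek(0)
second = replay_operations(log, system_b)
```

Because the parameters are drawn once and stored, every system (and every later version of a system) sees the same start nodes in the same order.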
Benchmark design
• Data
  • Different data properties / distributions affect benchmark results
    • E.g. dense vs. sparse graphs
  • Ideally, data set properties are similar to those of real-world data sets
  • Use: scale-free networks with small-world properties
    • Social networks, the Internet, traffic networks, biological networks, and term co-occurrence networks
  • LFR-Benchmark generator: networks with power-law degree distribution and implanted communities
Benchmark design
• Traversal operations
  • Local traversals
    • Compute local clustering coefficient (2-hop breadth-first traversal)
    • 3-hop breadth-first traversal
  • Global traversals
    • Compute connected components
    • Incoming / outgoing edges
    • k iterations of the HITS algorithm
• Memory-constrained environment
  • Intermediate results for global traversal operations:
    • Kept in memory
    • Kept as properties on nodes
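Three of the operations listed above can be sketched over plain adjacency dictionaries. These are textbook formulations written for illustration, not the benchmark's code; all function names and example graphs are assumptions. The `labels` argument of `connected_components` stands for the place where intermediate results live: a plain dict keeps them in memory, while a mapping that writes through to node properties would keep them in the database.

```python
from collections import deque
from itertools import combinations
from math import sqrt


def local_clustering(adj, v):
    """Fraction of pairs of v's neighbours that are themselves linked (undirected graph)."""
    nbrs = adj[v] - {v}
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (len(nbrs) * (len(nbrs) - 1))


def connected_components(adj, labels):
    """Write a component id for every node into `labels`; return the component count."""
    component = 0
    for root in adj:
        if root in labels:
            continue
        labels[root] = component
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in labels:
                    labels[nxt] = component
                    queue.append(nxt)
        component += 1
    return component


def hits(out_adj, k):
    """Run k iterations of HITS on a directed graph; return (hub, authority) scores."""
    nodes = set(out_adj) | {t for targets in out_adj.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(k):
        # Authority update: sum of hub scores over incoming edges, then normalise.
        auth = {n: 0.0 for n in nodes}
        for src, targets in out_adj.items():
            for t in targets:
                auth[t] += hub[src]
        norm = sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub update: sum of authority scores over outgoing edges, then normalise.
        hub = {n: sum(auth[t] for t in out_adj.get(n, ())) for n in nodes}
        norm = sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth


# Hypothetical undirected graph (adjacency sets) with two components.
undirected = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"},
    "e": {"f"}, "f": {"e"},
}
labels = {}  # intermediate results kept in memory
n_components = connected_components(undirected, labels)

# Hypothetical directed graph: a and b both point to c.
directed = {"a": ["c"], "b": ["c"], "c": []}
hub, auth = hits(directed, k=3)
```

The clustering coefficient only touches a node's neighbours and their links, whereas connected components and HITS visit every node and edge, so the latter two are the ones that stress a memory-constrained setup.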
Benchmark implementation
• Implemented on top of the Blueprints API
• Tests performed on:
  • Neo4J
  • DEX
  • OrientDB
  • Native RDF repository (NativeSail)
  • SGDB (research prototype)
• Challenge: dealing with differences in the underlying systems, e.g.:
  • Triple stores: naming constraints
  • Some implementations do not support properties on some elements
  • Some implementations do not support iteration over nodes/edges
  • Node ID generation: user-provided vs. auto-generated
  • Transaction support / no transactions
Benchmark runs
• Performed on older hardware:
  • 2 GB memory
• Data set sizes:
  • 1K, 10K, 40K, 50K, 100K, 200K, 400K, 800K, 1M
• Most systems were not able to load networks with 400K+ edges (constraint: load 10K edges in less than 60 sec.)
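The loading constraint can be sketched as a batch timer that gives up once a batch of edges takes too long. This is an assumed reading of the constraint (each consecutive batch of 10K edges must finish within 60 seconds); the function name, the `insert_edge` callback and the injectable `clock` are hypothetical and not part of the benchmark.

```python
import time
from typing import Callable, Hashable, Iterable, Tuple


def load_with_time_limit(
    edges: Iterable[Tuple[Hashable, Hashable]],
    insert_edge: Callable[[Hashable, Hashable], None],
    batch_size: int = 10_000,
    limit_s: float = 60.0,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[int, bool]:
    """Insert edges batch by batch; stop as soon as one full batch reaches the limit.

    Returns (edges_loaded, completed). A trailing partial batch is not timed.
    """
    loaded = 0
    batch_start = clock()
    for src, dst in edges:
        insert_edge(src, dst)
        loaded += 1
        if loaded % batch_size == 0:
            now = clock()
            if now - batch_start >= limit_s:
                return loaded, False
            batch_start = now
    return loaded, True


# Stand-in for a database: edges are appended to a Python list.
store = []
loaded, completed = load_with_time_limit(
    ((i, i + 1) for i in range(25)),
    lambda s, d: store.append((s, d)),
    batch_size=10,
)
```

Passing the clock as a parameter keeps the check testable without waiting for real time to elapse.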
Graph loading – elements insertion
Local traversal – BFS 3 hops
Global traversals – connected components
Conclusion
• Extending work on benchmarking graph databases
  • Focusing on graph traversal operations
  • Local/global traversals
• Preliminary results:
  • Even loading larger data sets into GDBs is a problem
  • Stable performance for local traversals with 2-3 hops
    • Suitable for most ego-centric node property analyses
  • Bad performance for global traversal operations on larger networks
Thank you for your attention.
http://ups.savba.sk/~marek/gbench.html
SemSets – activation spreading over network