Survey of Graph Database Models. Byoung Ju Yang 2011. 04. 01. IDS Lab., Seoul National University. Table of contents. Survey of Graph Database Models Renzo Angles, Claudio Gutierrez ACM Computing Surveys, Vol. 40, No. 1, Article 1 (2008)
Survey of Graph Database Models ByoungJu Yang 2011. 04. 01. IDS Lab., Seoul National University
Table of contents • Survey of Graph Database Models • Renzo Angles, Claudio Gutierrez • ACM Computing Surveys, Vol. 40, No. 1, Article 1 (2008) • Data structures, Query languages, and Integrity constraints 1. Introduction 2. Graph Data Modeling 3. Graph Database Models (~2002) • The latest Graph Database Models • Neo4j, FlockDB • Blueprints • Sharding
2-1. What is a Graph Data Model? • Data Structure (Schema) • Represented by a graph, or by a data structure generalizing the notion of a graph (e.g. hypergraph); labeled or unlabeled, directed or undirected • Separation between schema and data in most cases • Data Manipulation (Query languages) • Expressed by graph transformations, or by operations whose main primitives are graph features such as paths, neighborhoods, subgraphs, graph patterns, connectivity, and graph statistics • Integrity constraints • Enforce graph data consistency
2-2. Why a Graph Data Model? • It allows for a more natural modeling of data • Being able to keep all the information about an entity in a single node and showing related information by arcs connected to it. • Queries can refer directly to this graph structure • Such as finding shortest paths, determining certain subgraphs, and so forth. • For implementation, graph databases may provide special graph storage structures and efficient graph algorithms for realizing specific operations.
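As an illustration of a query that refers directly to the graph structure, here is a minimal sketch of a shortest-path search over an adjacency list, using breadth-first search. The function name `shortest_path` and the sample `friends` data are assumptions made for this example, not part of the surveyed models.

```python
from collections import deque

def shortest_path(adjacency, start, goal):
    """Return one shortest path (fewest arcs) from start to goal, or None."""
    if start == goal:
        return [start]
    parent = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor in parent:
                continue
            parent[neighbor] = node
            if neighbor == goal:
                # Walk the parent links back to the start node.
                path = [goal]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None

# Hypothetical directed "knows" graph.
friends = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan"],
    "dan": ["eve"],
    "eve": [],
}
path = shortest_path(friends, "ann", "eve")
```

In a relational setting the same question would need one self-join per hop, with the number of hops unknown in advance.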
2-3. Comparison with other DB Models • Physical DB Models • Hierarchical (1976), network (1976) models • Lack a good abstraction level • Relational DB Models • Introduced a separation between physical and logical levels • Landmark development (mathematical foundation) • Geared toward simple record-type data (schema is known) • Not easy to integrate different schemas • Query language cannot explore the underlying graph of relationships among the data (paths, neighborhoods, patterns)
2-3. Comparison with other DB Models • Semantic DB Models • DB designer can represent objects and their relations in a natural and clear manner by using high-level abstraction concepts (E-R) • Relevant to graph DB (graph-like structures) • Object-oriented DB Models • For data-intensive domains (knowledge bases, engineering applications) • Permit much richer structures but still require a predefined schema • Related to graph DB (use graph structures in definitions) • Semi-structured DB Models • Irregular, implicit, and partial structures
2-4. Motivations and Applications • Motivations • Real-life applications where component interconnectivity is a key feature • Applications • Classical applications • Complex networks - Social networks (people, groups) - Information networks (citation, word thesaurus) - Technological networks (spatial and geographical) - Biological networks (genomics)
3-2. Data Structures [Figure: example records Person1, Person2, Person3 with name and key attributes] • Hypernode • A simple flat graph is not good at presenting information to the user • Hypernodes provide inherent support for nesting (nested graphs) • Hypergraph • Generalization of a graph: an edge may connect any number of nodes • A 2-uniform hypergraph is a graph
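The statement that a 2-uniform hypergraph is a graph can be checked with a small sketch. It assumes a hypergraph is stored simply as a list of node sets (one set per hyperedge); the helper names `is_k_uniform` and `to_graph_edges` are hypothetical.

```python
def is_k_uniform(hyperedges, k):
    """A hypergraph is k-uniform when every hyperedge joins exactly k nodes."""
    return all(len(edge) == k for edge in hyperedges)

def to_graph_edges(hyperedges):
    """Convert a 2-uniform hypergraph into ordinary undirected edges."""
    if not is_k_uniform(hyperedges, 2):
        raise ValueError("only a 2-uniform hypergraph is an ordinary graph")
    return {tuple(sorted(edge)) for edge in hyperedges}

# Every hyperedge has two nodes, so this is just a graph.
pairs = [frozenset({"a", "b"}), frozenset({"b", "c"})]

# One hyperedge joins three nodes at once, which a plain graph edge cannot do.
mixed = [frozenset({"a", "b"}), frozenset({"a", "b", "c"})]
```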
3-3. Integrity Constraints • Schema-instance consistency • The instance should contain only concrete entities and relations whose entity types and relation types were defined in the schema • Schema-instance separation • In most models there is a separation • An exception is the hypernode model (dynamic DB) • Constraints concentrate on the creation of consistent instances and on the correct identification and referencing of entities
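Schema-instance consistency can be sketched as a check that every node and edge of an instance uses a type declared in the schema. The schema layout, the type names (`Person`, `Company`, `KNOWS`, `WORKS_AT`) and the function `violations` are assumptions chosen for this example only.

```python
# Hypothetical schema: allowed node types and allowed (source, label, target) triples.
schema = {
    "node_types": {"Person", "Company"},
    "edge_types": {
        ("Person", "KNOWS", "Person"),
        ("Person", "WORKS_AT", "Company"),
    },
}

def violations(schema, nodes, edges):
    """nodes maps node id -> type name; edges is a list of (src, label, dst)."""
    problems = []
    for node_id, node_type in nodes.items():
        if node_type not in schema["node_types"]:
            problems.append(f"node {node_id}: unknown type {node_type}")
    for src, label, dst in edges:
        if src not in nodes or dst not in nodes:
            # Correct reference of entities: both endpoints must exist.
            problems.append(f"edge {src}-{label}->{dst}: dangling endpoint")
        elif (nodes[src], label, nodes[dst]) not in schema["edge_types"]:
            problems.append(f"edge {src}-{label}->{dst}: not allowed by schema")
    return problems

nodes = {"p1": "Person", "p2": "Person", "c1": "Company"}
good_edges = [("p1", "KNOWS", "p2"), ("p1", "WORKS_AT", "c1")]
bad_edges = [("c1", "KNOWS", "p1"), ("p1", "KNOWS", "p9")]
```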
3-4. Query and Manipulation Languages • There is substantial work focused on query languages, the problem of querying graphs, the visual presentation of results, and graphical query languages • Some graph-oriented object models regard database transformations as graph transformations based on graph-pattern matching • GOOD, GOAL, etc.
NoSQL Databases • Schema-less • Shared nothing architecture • Each server uses only its own local storage (faster) • Elasticity • Able to add servers without downtime • Sharding • Asynchronous replication • BASE instead of ACID
Graph Database Models • Scalability • ACID vs. BASE • Complexity • Relational - pros: no redundancy or information loss (normalization), powerful SQL, optimization by the RDBMS - cons: performance problems in deep queries (many joins), no schema evolution, etc. • Graph - property graph model
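A minimal in-memory sketch of the property graph model follows: nodes and edges both carry key-value properties, and a deep query follows stored adjacency instead of joining tables once per hop. The class `PropertyGraph` and its methods are hypothetical and do not correspond to the API of any product named in these slides.

```python
class PropertyGraph:
    def __init__(self):
        self.nodes = {}   # node id -> property dict
        self.out = {}     # node id -> list of (label, target id, property dict)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props
        self.out.setdefault(node_id, [])

    def add_edge(self, src, label, dst, **props):
        self.out.setdefault(src, []).append((label, dst, props))

    def neighbors(self, node_id, label):
        return [dst for lbl, dst, _ in self.out.get(node_id, ()) if lbl == label]

    def within_hops(self, start, label, depth):
        """All nodes reachable from start in at most `depth` hops along `label`."""
        frontier, seen = {start}, {start}
        for _ in range(depth):
            frontier = {n for f in frontier for n in self.neighbors(f, label)} - seen
            seen |= frontier
        return seen - {start}

g = PropertyGraph()
g.add_node("p1", name="Young")
g.add_node("p2", name="Sang")
g.add_node("p3", name="Yong")
g.add_edge("p1", "KNOWS", "p2", since=2009)
g.add_edge("p2", "KNOWS", "p3", since=2010)
```

Here `g.within_hops("p1", "KNOWS", 2)` touches only the nodes it reaches, whereas the relational equivalent would join the relationship table with itself once per hop.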
The latest Graph Database Models • AllegroGraph RDFStore • HyperGraphDB • InfoGrid • Neo4j • FlockDB • Sones • Virtuoso
The latest Graph Database Models • License • Distribution • The only truly distributed solution is HyperGraphDB • Indexing • In Neo4j, indexing is not the default behavior (indexes via Lucene, Solr) • Storage system • General vs. Special • HyperGraphDB uses Berkeley DB • APIs • Most of them provide Java and web APIs
Neo4j • Fully ACID-transaction-compliant graph DB written in Java • High performance • Handles several billion nodes, relationships and properties • 1 to 2 million traversals per second - constant time (independent of total size) • Example code • Node creation • Find friend
Neo4j • Example code • Traversal • Indexing
FlockDB • Goals • High rate of add/update/remove operations • Complex set arithmetic queries • Paging through query result sets containing millions of entries • Ability to ‘archive’ and later restore archived edges • Horizontal scaling including replication • Non-goals • Multi-hop queries (or graph-walking queries) • Automatic shard migrations • Characteristics • Optimized for very large adjacency lists (no traversal)
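Two of the goals above, set arithmetic over adjacency lists and paging through large result sets, can be sketched in a few lines. This is an illustration on in-memory sorted lists, not FlockDB's query interface; the names `intersect` and `page` and the sample `following` data are assumptions.

```python
from bisect import bisect_right

# Hypothetical sorted adjacency lists: user id -> ids that user follows.
following = {
    1: [2, 3, 5, 8],
    2: [3, 5, 9],
}

def intersect(a, b):
    """Set arithmetic: ids present in both sorted adjacency lists."""
    return sorted(set(a) & set(b))

def page(sorted_ids, cursor, count):
    """Return up to `count` ids after `cursor`, plus the cursor for the next page.

    Pass cursor=None for the first page; a returned cursor of None means the
    result set is exhausted.
    """
    start = 0 if cursor is None else bisect_right(sorted_ids, cursor)
    chunk = sorted_ids[start:start + count]
    has_more = start + count < len(sorted_ids)
    next_cursor = chunk[-1] if chunk and has_more else None
    return chunk, next_cursor

common = intersect(following[1], following[2])
first, cursor = page(following[1], None, 2)
second, end = page(following[1], cursor, 2)
```

Paging by cursor value rather than by offset means each page is found with a single seek into the sorted list, however deep the client has already paged.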
FlockDB - Twitter • Previous models (could not have both) • Relational tables – handling write operations • Key-value storage – paging through giant result sets • Implementation goals • Write the simplest possible thing that could work • Use off-the-shelf MySQL as the storage engine • Allow horizontal partitioning • Allow write operations to arrive out of order or be processed more than once (allow redundant work rather than lost work) • Twitter (April 2010) • More than 13 billion edges, 20k writes/second, 100k reads/second
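One common way to tolerate writes that arrive out of order or more than once is to make each write idempotent and commutative, for example by tagging it with a timestamp and keeping only the newest state per edge. The sketch below shows that general technique; the slides do not describe how FlockDB resolves conflicts, so the tuple layout and the function names `apply_write` and `replay` are assumptions.

```python
def apply_write(edges, write):
    """Apply one write; write = (source, destination, state, timestamp).

    The newest timestamp wins, so replaying or reordering writes is harmless.
    Ties on timestamp are broken by comparing the state strings, which keeps
    the outcome deterministic.
    """
    source, destination, state, timestamp = write
    key = (source, destination)
    candidate = (timestamp, state)
    if key not in edges or candidate > edges[key]:
        edges[key] = candidate

def replay(writes):
    edges = {}
    for write in writes:
        apply_write(edges, write)
    return edges

writes = [
    (1, 2, "normal", 10),
    (1, 2, "removed", 20),
    (1, 3, "normal", 15),
]
in_order = replay(writes)
reversed_order = replay(list(reversed(writes)))
duplicated = replay(writes + writes)
```

All three replays end in the same state, which is what allows redundant work to be preferred over lost work.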
FlockDB - Twitter • Stores graphs as sets of edges • Primary key (a compound key of the source ID, state, and position) • When an edge is deleted, the row is just marked ‘removed’ without being deleted from MySQL • Keep only a compound primary key and a secondary index for each row, and answer all queries from a single index.
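The edge layout described above can be sketched with a single table. This example uses SQLite from the Python standard library instead of MySQL so that it is self-contained; the column names, the integer encoding of `state`, and the choice of a unique constraint on (source, destination) as the secondary index are assumptions, not FlockDB's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE edges (
        source_id      INTEGER NOT NULL,
        destination_id INTEGER NOT NULL,
        state          INTEGER NOT NULL,  -- 0 = normal, 1 = removed (assumed encoding)
        position       INTEGER NOT NULL,  -- sort key, e.g. a creation timestamp
        PRIMARY KEY (source_id, state, position),
        UNIQUE (source_id, destination_id)
    )
""")

def add_edge(source, destination, position):
    conn.execute("INSERT INTO edges VALUES (?, ?, 0, ?)",
                 (source, destination, position))

def remove_edge(source, destination):
    # Soft delete: the row stays in the table and is only marked as removed.
    conn.execute("UPDATE edges SET state = 1 "
                 "WHERE source_id = ? AND destination_id = ?",
                 (source, destination))

def list_edges(source, after_position=-1, limit=10):
    # Equality on (source_id, state) plus a range on position follows the
    # compound primary key, so the listing can be served from that one index.
    rows = conn.execute(
        "SELECT destination_id FROM edges "
        "WHERE source_id = ? AND state = 0 AND position > ? "
        "ORDER BY position LIMIT ?",
        (source, after_position, limit))
    return [row[0] for row in rows]

add_edge(1, 2, 100)
add_edge(1, 3, 101)
add_edge(1, 4, 102)
remove_edge(1, 3)
```

After these calls `list_edges(1)` skips the removed edge, while the table still holds three rows, so the edge could later be restored by flipping its state back.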
Sharding in Graph DB • Especially hard in graph DBs because of traversal • Unless we store the entire graph on a single machine, we are forced to query across machine boundaries (expensive) • Neo4j provides a master/slave structure (which still has limits) • FlockDB (Twitter) does not address this (it is interested only in 1-level relations)
How to shard? • A proposal: gravity • Localizing data leads to greater performance (like cache) • Shard graph data based on gravity
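The benefit of localizing data can be made concrete by counting how many edges cross shard boundaries under different placements, since each such edge forces a traversal to hop between machines. The sketch below only measures that quantity; it does not implement the gravity proposal, whose details are not given in these slides. The sample graph and both placements are made up for illustration.

```python
def cut_edges(edges, shard_of):
    """Number of edges whose two endpoints live on different shards."""
    return sum(1 for a, b in edges if shard_of[a] != shard_of[b])

# Two triangles (a-b-c and d-e-f) joined by a single bridge edge c-d.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]

# Placement that ignores structure: nodes alternate between two shards.
alternating = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}

# Placement that keeps each tightly connected cluster on one shard.
clustered = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

alternating_cut = cut_edges(edges, alternating)
clustered_cut = cut_edges(edges, clustered)
```

With the alternating placement five of the seven edges cross machines; with the clustered placement only the bridge edge does.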
Blueprints • A collection of interfaces, etc. for the property graph DB model • Analogous to JDBC, but for graph DBs • Provides a common set of interfaces to allow developers to plug-and-play their graph DB backend (Pipes, Gremlin, Rexster)
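The plug-and-play idea can be sketched as application code written against an abstract interface, with the concrete backend swapped in behind it. The interface below is hypothetical and written in Python for illustration; it is not the Blueprints API, and the names `GraphBackend`, `InMemoryBackend` and `mutual_neighbors` are assumptions.

```python
from abc import ABC, abstractmethod

class GraphBackend(ABC):
    """Minimal common interface that any graph store could implement."""

    @abstractmethod
    def add_vertex(self, vertex_id, **properties): ...

    @abstractmethod
    def add_edge(self, source, label, target): ...

    @abstractmethod
    def out_neighbors(self, vertex_id, label): ...

class InMemoryBackend(GraphBackend):
    """One possible backend; a wrapper around a real graph DB would be another."""

    def __init__(self):
        self.vertices = {}
        self.edges = {}

    def add_vertex(self, vertex_id, **properties):
        self.vertices[vertex_id] = properties

    def add_edge(self, source, label, target):
        self.edges.setdefault((source, label), set()).add(target)

    def out_neighbors(self, vertex_id, label):
        return set(self.edges.get((vertex_id, label), ()))

def mutual_neighbors(graph: GraphBackend, a, b, label):
    """Application code depends only on the interface, not on the backend."""
    return graph.out_neighbors(a, label) & graph.out_neighbors(b, label)

graph = InMemoryBackend()
for person in ("p1", "p2", "p3"):
    graph.add_vertex(person)
graph.add_edge("p1", "KNOWS", "p3")
graph.add_edge("p2", "KNOWS", "p3")
graph.add_edge("p1", "KNOWS", "p2")
```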