Aules d’empresa 2011 • DEX
Contents • Graph database • Motivation • DEX • Experiments
Graph database • What is a graph database? • Data and schema are represented by graphs. • Nodes, edges, and properties. • Data manipulation is expressed as graph operations. • Integrity constraints enforce graph consistency.
Motivation • Trends in current data sets: • A higher degree of connectivity among entities. • A higher degree of complexity of data models. • Decentralization of data generation. • Users provide contents. • Requirements: • Queries with different flavors: • Structural queries (not based on the schema). • Link analysis. • Manage unstructured data. • Flexible schemas.
Scenarios • Social networks: MySpace, Facebook, Flickr … • Information networks: • Bibliographic databases: DBLP, Scopus … • On-line encyclopedias: Wikipedia … • Technological networks: electric power grids, airline routes, telephone networks … • Biological networks: genomics, chemical structures …
Why not an RDBMS? • Classical relational model • Inefficient for unstructured data or flexible schemas • Predefined, fixed schema based on relations (tables) • Inefficient for structural queries • Intensive use of join operations
DEX, a graph database • DEX is a programming library for managing a graph database. • Focuses on: • Very large datasets. • High-performance query processing.
Basic concepts • Programming library for managing persistent and temporary graphs. • Data model: typed and attributed directed multigraph. • Node and edge instances belong to a type (label). • Node and edge instances have attribute values. • Edges can be directed or undirected. • Multiple edges between two nodes are allowed. • Types of edges: • Materialized: directed and undirected. • Virtual: constrained by the values of two attributes (foreign keys). • Just for navigation.
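The data model on this slide can be illustrated with a small in-memory sketch. This is not the DEX API: the class `Multigraph`, its methods, and the sample node and edge types are hypothetical names chosen for illustration, and virtual edges are not modeled.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Edge:
    oid: int
    etype: str          # edge type (label)
    tail: int
    head: int
    directed: bool
    attrs: dict = field(default_factory=dict)


class Multigraph:
    """Toy typed, attributed multigraph (illustrative only, not the DEX API)."""

    def __init__(self):
        self._ids = count(1)
        self.node_type = {}                  # node id -> type label
        self.node_attrs = defaultdict(dict)  # node id -> attribute values
        self.edges = {}                      # edge id -> Edge
        self.out = defaultdict(list)         # node id -> ids of edges navigable from it

    def add_node(self, ntype, **attrs):
        oid = next(self._ids)
        self.node_type[oid] = ntype
        self.node_attrs[oid].update(attrs)
        return oid

    def add_edge(self, etype, tail, head, directed=True, **attrs):
        # Edges get their own ids, so parallel edges between two nodes coexist.
        oid = next(self._ids)
        self.edges[oid] = Edge(oid, etype, tail, head, directed, attrs)
        self.out[tail].append(oid)
        if not directed:
            # An undirected edge can be navigated from either end.
            self.out[head].append(oid)
        return oid

    def neighbors(self, node, etype):
        """Nodes reachable from `node` in one step over edges of type `etype`."""
        result = set()
        for eid in self.out[node]:
            e = self.edges[eid]
            if e.etype == etype:
                result.add(e.head if e.tail == node else e.tail)
        return result


g = Multigraph()
alice = g.add_node("person", name="Alice")
bob = g.add_node("person", name="Bob")
paper = g.add_node("paper", title="Graphs")
g.add_edge("writes", alice, paper)
g.add_edge("writes", alice, paper, year=2011)    # second, parallel edge
g.add_edge("knows", alice, bob, directed=False)  # undirected edge
```

Giving every edge its own identifier is what makes this a multigraph: the two `writes` edges are distinct objects with different attribute values, yet navigation from `alice` still yields the single neighbor `paper`.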
Software architecture • Java library: jdex.jar public API • Native library • Linux: libjdex.so • Windows: jdex.dll • System requirements: • Java Runtime Environment, v1.5 or higher. • Operating system: • Windows: 32-bit • Linux: 32-bit and 64-bit
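Application architecture • Desktop application: • Presentation: Java Swing application. • Application logic: load and query, on top of the DEX API. • Data: DEX graphs, built from data sources. • Web application: • Presentation: browser (HTML + JavaScript). • Network: Internet. • Application logic: servlet with a query API, on top of the DEX API. • Data: DEX graphs, built from data sources.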
Experiments • Five categories: • Bulk load performance. • Core operations performance and memory usage • Scalability. • Comparison with other approaches. • Relational (MySQL) and OIM. • Query performance analysis • Different datasets: • Wikipedia. • IMDb, the Internet Movie Database. • XMark, a standard and scalable benchmark for XML. • LUBM, a benchmark to evaluate the performance of RDF repositories. • R-MAT, a synthetic scale-free network.
Load performance • Platform: single CPU with 4096 KB of cache, 2 GB of RAM and 80 GB of disk. • Operating system: Linux Debian etch 4.0. • DEX buffer pool: 1.5 GB max.
Operations performance and memory usage Benchmark: Wikipedia with more than 200 million nodes and edges
Scalability XMark over 5 different scale factors ranging from 0.1 (110MB) to 25 (2.78GB)
Comparison with Other Approaches Comparison with a relational database (MySQL) and with an Oriented Incidence Matrix
Comparison with Neo4j • Query 1: max-outdegree + SPT • Query 2: paper recommender (2 hops) • Query 3: pattern matching • Query 4: for each language: number of papers and images • Query 5: for each paper: materialize number of images • Query 6: delete papers with no images
Another comparison with an RDBMS • Datasets: • D1: Synthetic data, generated from R-MAT • Scale factor = 16 (524K edges) • D2: Synthetic data, generated from R-MAT • Scale factor = 18 (2M edges) • D1 and D2 both contain just nodes and edges, no attributes. • R-MAT generates scale-free networks. • Queries: • Q1: 3 hops from a given node.
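For readers unfamiliar with R-MAT, the generator can be sketched as follows. The slides do not give the parameters that were used, so everything here is an assumption: the function name `rmat_edges` is hypothetical, the quadrant probabilities (0.57, 0.19, 0.19, 0.05) are commonly used defaults, and the edge factor of 8 is only inferred from the edge counts above (8 · 2^16 = 524,288 and 8 · 2^18 = 2,097,152).

```python
import random


def rmat_edges(scale, edge_factor=8, probs=(0.57, 0.19, 0.19, 0.05), seed=0):
    """Generate directed (src, dst) pairs over 2**scale nodes with R-MAT.

    Each edge is placed by choosing one quadrant of the adjacency matrix
    per bit of the node ids, `scale` times in a row. Duplicate edges and
    self-loops are not filtered out in this sketch.
    """
    a, b, c, _d = probs
    rng = random.Random(seed)
    edges = []
    for _edge in range(edge_factor * 2 ** scale):
        src = dst = 0
        for _bit in range(scale):
            r = rng.random()
            src <<= 1
            dst <<= 1
            if r < a:
                pass                  # top-left quadrant: both bits stay 0
            elif r < a + b:
                dst |= 1              # top-right quadrant
            elif r < a + b + c:
                src |= 1              # bottom-left quadrant
            else:
                src |= 1              # bottom-right quadrant
                dst |= 1
        edges.append((src, dst))
    return edges


sample = rmat_edges(scale=8)
```

Because the top-left quadrant is favored at every level, low-numbered nodes collect far more edges than the rest, which is what produces the skewed, scale-free-like degree distribution.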
Another comparison with an RDBMS • Test: Execute Q1 for 5 specific nodes. • These query nodes have a significant number of outgoing edges. • Scale factor 16: some tens • Scale factor 18: some hundreds • Results: • Scale factor 16: reached about 160K nodes • Scale factor 18: reached about 600K nodes
Another comparison with an RDBMS • Schema:
CREATE TABLE `edges` (
  `src` int(11) NOT NULL,
  `dst` int(11) NOT NULL,
  INDEX `srcI` (`src`) USING BTREE,
  INDEX `dstI` (`dst`) USING BTREE
) ENGINE=InnoDB;
• Query:
SELECT DISTINCT c.dst
FROM edges AS a, edges AS b, edges AS c
WHERE a.dst = b.src AND b.dst = c.src AND a.src = node;
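The contrast between the join-based query and a graph traversal can be sketched on a toy edge list. This is an illustration under stated assumptions, not the benchmark code: SQLite (via Python's `sqlite3`) stands in for MySQL/InnoDB, the sample edges are made up, and the function names are hypothetical. The traversal is a plain frontier expansion, not the DEX API.

```python
import sqlite3
from collections import defaultdict

# Made-up sample graph as (src, dst) pairs.
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (4, 6), (6, 1)]


def three_hops_sql(edges, node):
    """Run the three-way self-join from the slide, with SQLite instead of MySQL."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE edges (src INTEGER NOT NULL, dst INTEGER NOT NULL)")
    con.execute("CREATE INDEX srcI ON edges (src)")
    con.execute("CREATE INDEX dstI ON edges (dst)")
    con.executemany("INSERT INTO edges VALUES (?, ?)", edges)
    rows = con.execute(
        "SELECT DISTINCT c.dst "
        "FROM edges AS a, edges AS b, edges AS c "
        "WHERE a.dst = b.src AND b.dst = c.src AND a.src = ?",
        (node,),
    )
    result = {row[0] for row in rows}
    con.close()
    return result


def three_hops_traversal(edges, node):
    """Same answer by expanding a frontier of nodes three times."""
    adj = defaultdict(set)
    for src, dst in edges:
        adj[src].add(dst)
    frontier = {node}
    for _hop in range(3):
        # Keeping the frontier as a set deduplicates after every hop,
        # whereas the join enumerates every 3-edge path before DISTINCT.
        frontier = {dst for src in frontier for dst in adj.get(src, ())}
    return frontier


via_sql = three_hops_sql(edges, 1)
via_traversal = three_hops_traversal(edges, 1)
```

Both functions return the set of nodes that end a path of exactly three edges from the start node, matching what the `SELECT DISTINCT c.dst` query computes.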
Results • Test platform • MacBook 2.4 GHz Intel Core 2 Duo (Mac OS X 10.6) • Up to 1 GB of memory for the MySQL buffer pool. • Results
Any questions? DAMA Group Web Site: www.dama.upc.edu Sparsity Web Site: www.sparsity-technologies.com