Aules d’empresa 2011 DEX

Contents • Graph database • Motivation • DEX • Experiments

Graph database • What is a graph database? • Data and schema are represented by graphs. • Nodes, edges, and properties. • Data manipulation is expressed as graph operations. • Integrity constraints enforce graph consistency.

Motivation • Trends in current data sets: • A higher degree of connectivity among entities. • A higher degree of complexity of data models. • Decentralization of data generation. • Users provide contents. • Requirements: • Queries with different flavors: • Structural queries (not based on the schema). • Link analysis. • Manage unstructured data. • Flexible schemas.

Scenarios • Social networks: MySpace, Facebook, Flickr … • Information networks: • Bibliographic databases: DBLP, Scopus … • On-line encyclopedias: Wikipedia … • Technological networks: electric power grids, airline routes, telephone networks … • Biological networks: genomics, chemical structures …

Why not RDBMS? • Classical relational model: • Inefficient for unstructured data or flexible schemas. • Fixed, predefined schema, based on relations (tables). • Inefficient for structural queries. • Intensive use of join operations.

DEX, a graph database • DEX is a programming library for managing a graph database. • Focuses on: • Very large datasets. • High performance query processing.

Basic concepts • Programming library for managing persistent and temporary graphs. • Data model: typed and attributed directed multigraph. • Node and edge instances belong to a type (label). • Node and edge instances have attribute values. • Edges can be directed or undirected. • Multiple edges are allowed between two nodes. • Types of edges: • Materialized: directed and undirected. • Virtual: constrained by the values of two attributes (foreign keys). • Just for navigation.
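The data model above can be illustrated with a small sketch. This is not the DEX API: the names `MultiGraph`, `Edge`, `neighbors` and `virtual_neighbors` are hypothetical and only show what "typed and attributed multigraph" and "virtual edge" could mean in practice, assuming a simple in-memory representation.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    etype: str                 # edge type (label)
    tail: int                  # source node id
    head: int                  # destination node id
    directed: bool = True
    attrs: dict = field(default_factory=dict)


class MultiGraph:
    """Hypothetical in-memory model for illustration; not the DEX API."""

    def __init__(self):
        self.node_type = {}    # node id -> type label
        self.node_attrs = {}   # node id -> attribute dict
        self.edges = []        # materialized edges; parallel edges allowed
        self._next_id = 0

    def add_node(self, ntype, **attrs):
        nid = self._next_id
        self._next_id += 1
        self.node_type[nid] = ntype
        self.node_attrs[nid] = attrs
        return nid

    def add_edge(self, etype, tail, head, directed=True, **attrs):
        self.edges.append(Edge(etype, tail, head, directed, attrs))

    def neighbors(self, nid, etype):
        """Nodes reachable from nid over materialized edges of one type."""
        out = []
        for e in self.edges:
            if e.etype != etype:
                continue
            if e.tail == nid:
                out.append(e.head)
            elif not e.directed and e.head == nid:
                out.append(e.tail)
        return out

    def virtual_neighbors(self, nid, from_attr, to_type, to_attr):
        """Virtual edge: match equal attribute values, like a foreign key."""
        key = self.node_attrs[nid].get(from_attr)
        return [m for m, t in self.node_type.items()
                if t == to_type and self.node_attrs[m].get(to_attr) == key]


g = MultiGraph()
ann = g.add_node("PERSON", name="Ann", city="Barcelona")
bob = g.add_node("PERSON", name="Bob", city="Girona")
bcn = g.add_node("CITY", name="Barcelona")
g.add_edge("KNOWS", ann, bob, directed=False)   # undirected edge
g.add_edge("CALLS", ann, bob, minutes=3)        # directed edge
g.add_edge("CALLS", ann, bob, minutes=7)        # parallel edge, same type
```

In this sketch the virtual edge from PERSON to CITY is never stored: it is resolved at navigation time by comparing `city` with `name`, which matches the "just for navigation" note above.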

A graph model

Software architecture

Software architecture • Java library: jdex.jar public API • Native library • Linux: libjdex.so • Windows: jdex.dll • System requirements: • Java Runtime Environment, v1.5 or higher. • Operating system: • Windows: 32 bits • Linux: 32 and 64 bits

Application architecture (diagram) • Deployments: desktop application and web application. • Presentation: Java Swing application (desktop); browser with HTML + Javascript (web). • Network: Internet. • Application logic: load and query; servlet; query API. • DEX API. • Data: DEX graphs and data sources.

Experiments • Five categories: • Bulk load performance. • Core operations performance and memory usage • Scalability. • Comparison with other approaches. • Relational (MySQL) and OIM. • Query performance analysis • Different datasets: • Wikipedia. • IMDb, the Internet Movie Database. • XMark, a standard and scalable benchmark for XML. • LUBM, a benchmark to evaluate the performance of RDF repositories. • R-MAT, a synthetic scale-free network.

Load performance • Single CPU with 4096 KB of cache, 2 GB of RAM and 80 GB of disk. • Operating system: Linux Debian etch 4.0. • DEX buffer pool: 1.5 GB max.

Operations performance and memory usage Benchmark: Wikipedia with more than 200 million nodes and edges

Scalability XMark over 5 different scale factors ranging from 0.1 (110MB) to 25 (2.78GB)

R-MAT scalability

Comparison with Other Approaches Comparison with a relational database (MySQL) and with an Oriented Incidence Matrix

Comparison with Neo4j • Query 1: max-outdegree + SPT • Query 2: paper recommender (2-hops) • Query 3: pattern matching • Query 4: for each language: number of papers and images • Query 5: for each paper: materialize number of images • Query 6: delete papers with no images

Another comparison with a RDBMS • Datasets: • D1: Synthetic data, generated from R-MAT • Scale factor = 16 (524K edges) • D2: Synthetic data, generated from R-MAT • Scale factor = 18 (2M edges) • D1 and D2 both contain just nodes and edges, no attributes. • R-MAT generates scale-free networks. • Queries: • Q1: 3-hops from a given node.
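As a rough illustration of how such datasets can be produced, the following is a minimal R-MAT sketch. It is not the generator used in the experiments. The quadrant probabilities (a = 0.57, b = 0.19, c = 0.19) are commonly used values and are an assumption here, as is the edge factor of 8, which is inferred from the edge counts above (8 × 2^16 = 524,288 and 8 × 2^18 = 2,097,152).

```python
import random


def rmat_edges(scale, edge_factor=8, a=0.57, b=0.19, c=0.19, seed=0):
    """Return edge_factor * 2**scale directed edges over 2**scale node ids.

    Each edge is placed by descending `scale` levels of the adjacency
    matrix, choosing one of four quadrants per level with probabilities
    a, b, c and d = 1 - a - b - c. Duplicates and self-loops are kept.
    """
    rng = random.Random(seed)
    n_edges = edge_factor * (1 << scale)
    edges = []
    for _ in range(n_edges):
        src = dst = 0
        for _level in range(scale):
            r = rng.random()
            src <<= 1
            dst <<= 1
            if r < a:
                pass            # top-left quadrant: both bits stay 0
            elif r < a + b:
                dst |= 1        # top-right quadrant
            elif r < a + b + c:
                src |= 1        # bottom-left quadrant
            else:
                src |= 1        # bottom-right quadrant
                dst |= 1
        edges.append((src, dst))
    return edges


sample = rmat_edges(scale=10)   # small instance: 1024 node ids, 8192 edges
```

Because the top-left quadrant is favoured at every level, low-numbered nodes accumulate many more edges than the rest, which is what gives the skewed, scale-free-like degree distribution.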

Another comparison with RDBMS • Test: Execute Q1 for 5 specific nodes. • These query nodes have a significant number of out-going edges. • Scale factor 16: a few tens • Scale factor 18: a few hundred • Results: • Scale factor 16: reached about 160K nodes • Scale factor 18: reached about 600K nodes

Another comparison with RDBMS • Schema: CREATE TABLE `edges` ( `src` int(11) NOT NULL, `dst` int(11) NOT NULL, INDEX `srcI` (`src`) USING BTREE, INDEX `dstI` (`dst`) USING BTREE ) ENGINE=InnoDB; • Query: SELECT DISTINCT c.dst FROM edges as a, edges as b, edges as c WHERE (a.dst=b.src AND b.dst=c.src AND a.src=node);
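To make the contrast concrete, the sketch below runs the same three-way self-join next to an adjacency-based traversal. It uses SQLite through Python's `sqlite3` module rather than MySQL, so the `ENGINE=InnoDB` and `USING BTREE` clauses are dropped. The function names and the sample edge list are illustrative assumptions, and the traversal is a plain in-memory stand-in, not DEX.

```python
import sqlite3
from collections import defaultdict


def three_hops_sql(edges, node):
    """Q1 as the self-join from the slide, adapted to SQLite."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE edges (src INTEGER NOT NULL, dst INTEGER NOT NULL)")
    con.execute("CREATE INDEX srcI ON edges (src)")
    con.execute("CREATE INDEX dstI ON edges (dst)")
    con.executemany("INSERT INTO edges VALUES (?, ?)", edges)
    rows = con.execute(
        "SELECT DISTINCT c.dst FROM edges AS a, edges AS b, edges AS c "
        "WHERE a.dst = b.src AND b.dst = c.src AND a.src = ?",
        (node,),
    )
    result = {row[0] for row in rows}
    con.close()
    return result


def three_hops_traversal(edges, node):
    """Q1 as three rounds of neighbor expansion over adjacency sets."""
    out = defaultdict(set)
    for src, dst in edges:
        out[src].add(dst)
    frontier = {node}
    for _ in range(3):
        frontier = {d for s in frontier for d in out.get(s, ())}
    return frontier


sample_edges = [(1, 2), (2, 3), (3, 4), (1, 5), (5, 3), (4, 1), (2, 3)]
via_sql = three_hops_sql(sample_edges, 1)
via_traversal = three_hops_traversal(sample_edges, 1)
```

Both functions return the nodes that end a walk of exactly three edges from the start node, which is what the SQL query computes. The join may enumerate the same destination through several intermediate paths before `DISTINCT` collapses them, while the traversal deduplicates the frontier after each hop.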

Results • Test platform: • MacBook 2.4 GHz Intel Core 2 Duo (Mac OS X 10.6). • Up to 1 GB of memory for the MySQL buffer pool. • Results

Any questions? • DAMA Group Web Site: www.dama.upc.edu • Sparsity Web Site: www.sparsity-technologies.com
