Path Processing using Solid State Storage

Path Processing using Solid State Storage
Manos Athanassoulis, DIAS, EPFL*
Mustafa Canim, IBM Watson Research Labs
Kenneth Ross, IBM Watson Research Labs, Columbia University
Bishwaranjan Bhattacharjee, IBM Watson Research Labs
*Work done during an internship at IBM.

Why Path Processing?
• Applications use linkage information
  • Social
  • Scientific
  • Government
  • Financial
  • Knowledge
  • Watson (Jeopardy Champ)
• Graph processing is not enough
  • Link type modeled by RDF

Why Solid State Storage (SSS)?
• Increasing capacity
  • Exponential increase
  • Follows Moore’s law
• Read performance
  • Orders of magnitude faster than disks
• Random read performance
  • Crucial for path processing
• New technologies
  • Flash already mature
  • Phase Change Memory (PCM)
  • … more technologies are coming

Path processing

Path processing
1) Cannot prefetch
2) Retrieve-data-then-follow-link
3) A lot of useless data is retrieved
How can Solid State Storage help?

Path processing (and Solid State Storage)
1) Small access latency
2) Read mostly useful data
3) Efficient random IO accesses
4) Can we do something better?
Build SSS-aware systems

In the rest of the talk …
• RDF data model and systems
• Solid State Storage for Path Processing
  • Technology
  • Flash vs PCM
• Storing and managing RDF data over Solid State Storage
• Conclusions

Resource Description Framework (RDF) meta-data model
• Data is represented as statements, each consisting of a triple
• Statement: <Subject, Predicate, Object>
• Each statement describes a property of a subject:
  • <“IBM”, “is-a”, “Corporation”>
• or a connection between two objects:
  • <“Manos”, “interned-at”, “IBM”>
• or a value of a property of a subject:
  • <“Manos”, “born-in”, “1984”>
• The notation is more complex:
  • Subjects are Uniform Resource Identifiers (URIs)
  • Predicates are URIs
  • Objects are either URIs or literals
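The statement model above can be sketched in a few lines of Python. This is only an illustration: the `URI`, `Literal`, and `Triple` classes, the helper functions, and the `http://example.org/` prefix are assumptions made for the example, not part of the talk or of any RDF library.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: str


@dataclass(frozen=True)
class Triple:
    subject: URI                 # subjects are always URIs
    predicate: URI               # predicates are always URIs
    obj: Union[URI, Literal]     # objects are URIs or literals


EX = "http://example.org/"  # hypothetical namespace, for illustration only

statements = [
    Triple(URI(EX + "IBM"), URI(EX + "is-a"), URI(EX + "Corporation")),
    Triple(URI(EX + "Manos"), URI(EX + "interned-at"), URI(EX + "IBM")),
    Triple(URI(EX + "Manos"), URI(EX + "born-in"), Literal("1984")),
]


def links(triples):
    """Statements whose object is a URI: edges a path query can follow."""
    return [t for t in triples if isinstance(t.obj, URI)]


def properties(triples):
    """Statements whose object is a literal: plain values, not edges."""
    return [t for t in triples if isinstance(t.obj, Literal)]
```

Only the statements returned by `links` lead to another resource, which is why the distinction between URI objects and literal objects matters for path processing.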

RDF data management
• Two alternatives are used to store data
• Relational RDF storage
  • Use existing relational stores
  • Create relational tables
  • Basic approach: a triple-store
    • One big table with three columns
• Native RDF storage
  • Tailored to the needs of the specific workload
  • No underlying system assumed
Can we take the best of both worlds?
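As a rough illustration of the basic relational approach (one big table with three columns), the sketch below uses Python's built-in `sqlite3` module. The table and column names (`triples`, `s`, `p`, `o`) are assumptions chosen for the example. Each hop of a path becomes a self-join on the triple table.

```python
import sqlite3

# In-memory database holding a single three-column triple table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("Manos", "interned-at", "IBM"),
        ("IBM", "is-a", "Corporation"),
        ("Manos", "born-in", "1984"),
    ],
)

# Two-hop path: follow "interned-at" from Manos, then "is-a" from the result.
# The second hop joins the table with itself on (object of hop 1 = subject of hop 2).
two_hop = conn.execute(
    """
    SELECT t2.o
    FROM triples AS t1
    JOIN triples AS t2 ON t2.s = t1.o
    WHERE t1.s = ? AND t1.p = ? AND t2.p = ?
    """,
    ("Manos", "interned-at", "is-a"),
).fetchall()
```

A path of length k needs k - 1 such self-joins, which is one way to see why long link-following queries are awkward on a plain triple-store.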

Outline
• RDF data model and systems
• Solid State Storage for Path Processing
  • Technology
  • Flash vs PCM
• Storing and managing RDF data over Solid State Storage
• Conclusions

Solid State Storage facts
• We have access to a PCI-based PCM prototype (compared with fusionIO)
• PCM prototype vs Flash state-of-the-art
*Very early Micron PCM prototype

Exploiting Solid State Storage for path processing
• Path processing involves link-following queries
  • Access latency is critical
• Solid State Storage is tailored for path processing:
  • Orders of magnitude lower read latency than traditional storage
  • Very fast random-read performance
  • PCM is expected to outperform Flash in read performance
• Next in this talk:
  • PCM vs Flash when running link-following queries
  • Storing and managing RDF data on Solid State Storage

PCM vs Flash in path processing
• Prototype implementation of link-following queries
• Workload: given a randomly generated graph, execute link-following queries of variable length without buffering
• Graph generation
  • 5GB of synthetic data with a random number of edges (between 3 and 30 edges per vertex)
• Querying parameters
  • Number of threads: 1, 2, 4, 8, 16, 32, 64, 96, 128, 192
  • Page size: 4K, 8K, 16K, 32K
  • Length of the query: 2, 4, 10, 100 accesses per query
• Hypothesis: PCM can offer important performance improvements
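The shape of this workload can be sketched as follows. This is a toy, in-memory stand-in and not the prototype used in the talk: the function names, the vertex count, and the seeds are assumptions, and a list lookup takes the place of a page read from PCM or Flash. What it does show is the dependency between accesses: the next vertex is unknown until the current adjacency list has been read, so the reads cannot be prefetched or issued in parallel within one query.

```python
import random


def generate_graph(num_vertices, min_edges=3, max_edges=30, seed=0):
    """Random graph: every vertex gets between min_edges and max_edges out-edges."""
    rng = random.Random(seed)
    return [
        [rng.randrange(num_vertices) for _ in range(rng.randint(min_edges, max_edges))]
        for _ in range(num_vertices)
    ]


def follow_links(graph, start, hops, rng):
    """Run one link-following query of the given length.

    Each iteration performs one dependent access: the adjacency list of the
    current vertex must be retrieved before the next vertex can be chosen.
    """
    path = [start]
    current = start
    for _ in range(hops):
        neighbours = graph[current]      # stands in for one page read from storage
        current = rng.choice(neighbours)
        path.append(current)
    return path


graph = generate_graph(1000)
path = follow_links(graph, start=0, hops=100, rng=random.Random(1))
```

With several threads each running `follow_links` independently, per-access latency bounds the speed of a single query, while the device's ability to serve concurrent random reads bounds overall throughput.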

PCM vs Flash
Query length: 100 hops
PCM performs consistently better for smaller page granularities

An RDF repository for Solid State Storage Pythia

Building an SSS-aware RDF repository
• We focused on building a graph-based RDF repository
• We need to design a new system which:
  • Takes into account the graph structure of the data
  • Supports any RDF-based query
• We introduce Pythia, a new RDF repository, which uses:
  • The notion of the RDF-tuple
  • New internal structures
  • New data layout

RDF-tuple
<Subject>,
<Predicate1>, {<Object1_1>, <Object1_2>, …},
<Predicate2>, {<Object2_1>, <Object2_2>, …},
…
<PredicateN>, {<ObjectN_1>, <ObjectN_2>, …}
• The RDF-tuple design:
  • allows us to locate within a page the most important information of a Subject
  • allows us to avoid repeating redundant information (Subject and Predicate resources)
• This is further optimized by the URL Dictionary
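A minimal sketch of the grouping idea behind the RDF-tuple, assuming plain Python strings for resources. The function name `build_rdf_tuples` and the sample data (including the extra "Employer" object) are hypothetical and only serve the illustration; this is not Pythia's actual code.

```python
from collections import defaultdict


def build_rdf_tuples(triples):
    """Group <S, P, O> statements into one record per subject.

    The subject is stored once, each predicate once per subject, and all
    objects that share a (subject, predicate) pair are kept in one list.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        grouped[s][p].append(o)
    return {s: dict(preds) for s, preds in grouped.items()}


triples = [
    ("Manos", "interned-at", "IBM"),
    ("Manos", "born-in", "1984"),
    ("IBM", "is-a", "Corporation"),
    ("IBM", "is-a", "Employer"),  # hypothetical extra object for illustration
]

rdf_tuples = build_rdf_tuples(triples)
```

Four statements collapse into two records, and the repeated subject and predicate strings disappear, which is the redundancy the slide refers to.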

Pythia
• DRAM: Query Engine, Literals Dictionary, URL Dictionary, Hash Index, Hash Index
• SSS: Main storage (S, P, O), Aux storage (O, P, S), Repository for Very Large Objects

Data layout on Pythia
Tuple 0
• Tuple Metadata
• Subject Resource
• Predicates: dictionary IDs
• Objects (if literal): Literal dictionary ID
• Objects (else): Object Resource and pageID, tupleID
Tuple 1
Tuple 2
Tuple 3
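One way to picture such a layout is the sketch below. Everything in it is an assumption made for illustration: the field widths, the `struct` formats, the one-byte object tag, and the `Dictionary` class are not taken from Pythia. The point it tries to show is that strings are replaced by dictionary IDs, and that a link object carries the (pageID, tupleID) of the target tuple so a query can jump straight to it.

```python
import struct

PAGE_SIZE = 4096          # 4K page, as in the evaluation
LITERAL, LINK = 0, 1      # hypothetical one-byte object tags


class Dictionary:
    """Maps strings to dense integer IDs and back."""

    def __init__(self):
        self._ids = {}
        self._values = []

    def encode(self, value):
        if value not in self._ids:
            self._ids[value] = len(self._values)
            self._values.append(value)
        return self._ids[value]

    def decode(self, ident):
        return self._values[ident]


def pack_tuple(subject_id, predicates):
    """Serialize one tuple.

    predicates: list of (predicate_id, [(kind, ident, page_id, tuple_id), ...]).
    For LITERAL objects, ident is a literal dictionary ID and the location is unused.
    For LINK objects, ident is the object resource ID and (page_id, tuple_id)
    points at the tuple that describes that resource.
    """
    out = [struct.pack("<IH", subject_id, len(predicates))]
    for pred_id, objects in predicates:
        out.append(struct.pack("<IH", pred_id, len(objects)))
        for kind, ident, page_id, tuple_id in objects:
            out.append(struct.pack("<BIII", kind, ident, page_id, tuple_id))
    return b"".join(out)


uris, literals = Dictionary(), Dictionary()
manos = uris.encode("Manos")
interned_at = uris.encode("interned-at")
born_in = uris.encode("born-in")
ibm = uris.encode("IBM")
year = literals.encode("1984")

# Page 7 / tuple 2 is a made-up location for the IBM tuple.
record = pack_tuple(
    manos,
    [
        (interned_at, [(LINK, ibm, 7, 2)]),
        (born_in, [(LITERAL, year, 0, 0)]),
    ],
)
```

With these assumed widths the record takes a few dozen bytes, so many such tuples fit in one 4K page.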

Storing Yago2 using Pythia
• Yago2 is a semantic knowledge base, introduced by the Max Planck Institute in 2007 and derived from Wikipedia, WordNet, and GeoNames (currently ~10M entries, 460M facts).
Yago2 in Pythia
• Initial data: 2.3GB
• Main DB files: 1.3GB
• Large objects: 192MB
  • Can be aggressively decreased with page-level compression (tuples will move to the main file as well)
• Indexes: 121MB (hash-based, in memory)
• Dictionaries: 569MB
  • Possible optimization: take into account the type of literal (now string)
• More than 99% of the SPO tuples can fit in a single 4K page

Evaluating Pythia (Setup & Dataset)
• Prototype C++ implementation
• System setup
  • 24-core Intel Xeon X560 with Linux x86_64 (2.6.32-28)
  • 32GB of memory
  • 12GB PCM card (Micron prototype card)
  • 74GB Flash card (fusionIO)
• Workload: Yago2
• Queries: a mix of 6 queries with randomized parameters

How often can you ask Pythia?

How fast does Pythia answer?

Pythia vs RDF-3X
• RDF-3X is the de facto research state-of-the-art
• Data is kept in a virtual table and accessed through compressed indexes
• 6 indexes (all permutations of S, P, O) and 3 aggregate indexes
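To make "all permutations of S, P, O" concrete, the sketch below builds six sorted copies of a tiny triple set, one per ordering. It is a simplification for illustration only: it uses plain sorted Python lists, ignores compression and the aggregate indexes, and is not RDF-3X code. The names `ORDERS`, `build_index`, and `indexes` are assumptions.

```python
from itertools import permutations

triples = [
    ("Manos", "interned-at", "IBM"),
    ("IBM", "is-a", "Corporation"),
    ("Manos", "born-in", "1984"),
]

# The six orderings: SPO, SOP, PSO, POS, OSP, OPS.
ORDERS = ["".join(p) for p in permutations("SPO")]
POSITION = {"S": 0, "P": 1, "O": 2}


def build_index(rows, order):
    """Return the triples reordered according to `order` and sorted."""
    return sorted(tuple(t[POSITION[c]] for c in order) for t in rows)


indexes = {order: build_index(triples, order) for order in ORDERS}
```

Whatever combination of subject, predicate, and object a query pattern binds, one of the six orderings has the bound components as a sorted prefix, so matching triples lie in a contiguous range.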

Pythia vs RDF-3X
• Q1: Find all male citizens of Greece.
• Q2: Find all OECD member economies that Switzerland deals with.
• Q3: Find all mafia films that Al Pacino acted in.
• Size on disk for Yago2 (raw data: 2.3GB)
  • Pythia: 2.2GB (no compression)
    • 1.5GB db files (on disk)
    • 0.7GB dictionaries/indexes (loaded in memory during startup)
  • RDF-3X: 2.2GB (aggressive compression)
    • 2.2GB in a single file (on disk)

Conclusions
• Solid State Storage is naturally tailored for path processing
  • PCM, Flash and more new technologies
• PCM's comparative advantage over Flash is lower read latency
  • 1.5x – 2.5x speedup in a workload with dependent reads
• Pythia: a solid-state-storage-aware path-processing system
  • 1.5x – 2.5x higher bandwidth on PCM compared to Flash
  • 1.5x – 2.0x lower response times on PCM compared to Flash
  • Competitive against the state-of-the-art (RDF-3X)

Thank you! Pythia (Greek: Πυθία; IPA pɪθiːɑː), commonly known as the Oracle of Delphi, was the priestess at the Temple of Apollo at Delphi, located on the slopes of Mount Parnassus, delivering prophecies.
