YARS2: A Federated Repository for Querying Graph Structured Data from the Web

YARS2: A Federated Repository for Querying Graph Structured Data from the Web Andreas Harth, Juergen Umbrich, Aidan Hogan, Stefan Decker ISWC 2007, Busan, Korea Wednesday, November 14, 2007

Outline • Motivation • System Architecture • Indexing • Distribution • Query Processing • Conclusion

Problem Statement • Current search technology allows users to locate information resources via keyword searches • Users with complex information needs require the ability to browse information spaces • Browsing unknown information spaces allows for • learning about a subject area • discovering previously unknown associations • viewing data integrated from a number of sources

Challenges • Browsing information spaces requires query processing • System has to scale, scale, scale • Build a system able to answer queries over web-scale data sets • Combined data from a large number of sources • Web data is scruffy • unknown schemas • varying quality • Ad-hoc query answering over combined data from millions of Web sources • Data mining operations over portions of the data • In database speak: build a data warehouse over Web data • Indexing and query processing are at the core of search engines. Hint: how does Google index, and how do they do query processing? You don't know? They haven't published?

Goal: SPARQL query processing • E.g. all people working at DERI:
CONSTRUCT { ?s ?p ?o . }
WHERE {
  ?s rdf:type foaf:Person .
  ?s foaf:workplaceHomepage <http://www.deri.org/> .
  ?s ?p ?o .
}

YARS2 Data Flow

Index Manager

Index File Organisation

Index Organisation • Splitting the index between memory and disk allows lookups to be performed in O(1) disk seeks • binary search on the in-memory data structure • Read-optimised (very fast), updates done in batch processing • Sorting is the most expensive operation, O(n log n), and can be done offline at intervals
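The memory/disk split can be illustrated with a minimal sketch. It assumes a sparse index: the sorted entries are cut into fixed-size blocks, the first key of each block is held in memory, and a binary search over those keys picks the single block to read. The names `SparseIndex`, `block_size` and `block_reads` are hypothetical, and a Python list stands in for the on-disk file; this is not the YARS2 implementation.

```python
from bisect import bisect_right


class SparseIndex:
    """Sorted keys split into blocks; only each block's first key is kept 'in memory'."""

    def __init__(self, sorted_keys, block_size):
        # Each slice plays the role of one on-disk block (assumption: a list
        # of lists stands in for the index file).
        self.blocks = [sorted_keys[i:i + block_size]
                       for i in range(0, len(sorted_keys), block_size)]
        # In-memory part: the first key of every block.
        self.first_keys = [block[0] for block in self.blocks]
        self.block_reads = 0  # counts simulated disk seeks

    def _read_block(self, block_number):
        self.block_reads += 1
        return self.blocks[block_number]

    def contains(self, key):
        # Binary search in memory for the last block whose first key <= key.
        block_number = bisect_right(self.first_keys, key) - 1
        if block_number < 0:
            return False  # key sorts before everything in the index
        # Exactly one block is read, regardless of the index size.
        return key in self._read_block(block_number)


index = SparseIndex([("a",), ("c",), ("e",), ("g",), ("i",)], block_size=2)
found = index.contains(("e",))
```

A smaller `block_size` keeps more first keys in memory but shortens the scan inside the block that is read; a larger one does the opposite.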

Complete Index on Quads • A quad has four positions, each either bound or unbound, giving 2^4 = 16 access patterns • Given prefix lookup capabilities, only 6 indexes are needed to cover all of them
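The claim about six indexes can be checked by enumeration. The sketch below assumes one particular set of six orderings over the quad positions s, p, o, c; the slide names the count but not the orderings, so `ORDERINGS` is an illustrative choice rather than a statement of which indexes the system builds. An ordering serves an access pattern when the bound positions form a prefix of it.

```python
from itertools import combinations

# Assumed orderings (hypothetical choice; any set that covers all patterns works).
ORDERINGS = ["spoc", "pocs", "ocsp", "cspo", "cpso", "ospc"]


def covering_index(bound, orderings=ORDERINGS):
    """Return the first ordering whose leading positions are exactly `bound`."""
    bound = frozenset(bound)
    for order in orderings:
        if frozenset(order[:len(bound)]) == bound:
            return order
    return None


def all_patterns():
    """All subsets of {s, p, o, c}: every way of binding quad positions."""
    return [frozenset(c) for k in range(5) for c in combinations("spoc", k)]


uncovered = [p for p in all_patterns() if covering_index(p) is None]
```

For example, a lookup with subject and object bound is served by the ordering that starts with o and s, and `uncovered` being empty means every pattern has a matching prefix.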

Index Lookups

Discussion • CPU time is high at the smaller block sizes, and disk I/O becomes the bottleneck at the larger block sizes • It is possible to trade memory space for time • smaller block size -> faster lookups, but requires more memory • larger block size -> slower lookups, but uses less memory

Data distribution

Data Distribution • Distributed hash tables offer very good scaling properties • S, O, C are typically well distributed (and hash buckets are about the same size) • P values are not well distributed (rdf:type is a notorious example) • Keywords are also not well distributed • Two distribution strategies: hash-based partitioning, and random partitioning (flooding)
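The skew problem with predicates can be seen in a small sketch of hash-based partitioning. It assumes a quad is placed on a machine by hashing the value at one position; the function names, the use of md5 as a stable hash, and the sample data are all hypothetical.

```python
import hashlib
from collections import Counter


def machine_for(term, num_machines):
    """Map a term to a machine with a stable hash (md5 is an arbitrary choice)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines


def load_per_machine(quads, position, num_machines):
    """Count the quads each machine receives when hashing on one quad position."""
    return Counter(machine_for(q[position], num_machines) for q in quads)


# Hypothetical data: 1000 distinct subjects that all carry an rdf:type statement.
quads = [(f"ex:person{i}", "rdf:type", "foaf:Person", "ex:graph")
         for i in range(1000)]

by_subject = load_per_machine(quads, 0, 4)    # many distinct values to hash
by_predicate = load_per_machine(quads, 1, 4)  # one value, so one machine gets all
```

Hashing on the subject spreads the quads over the machines, while hashing on the predicate sends every rdf:type quad to the same machine, which is why a hash strategy suits S, O and C better than P.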

Pushing joins • The cool thing about index distribution is that it's possible to compute some joins locally • e.g. ocsp >< spoc (where o == s), because both o and s are hashed to the same machine • e.g. keyword >< spoc (because the keyword index is not distributed randomly, but placed on the spoc index machines)
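The ocsp >< spoc case can be sketched as follows. The sketch assumes that the ocsp index is partitioned by hashing the object and the spoc index by hashing the subject, with the same hash function, so that matching entries meet on one machine. All function names and the sample quads are hypothetical.

```python
import hashlib
from collections import defaultdict


def machine_for(term, num_machines):
    """Stable hash placement (md5 is an arbitrary choice for this sketch)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines


def partition(quads, position, num_machines):
    """Group quads by the machine that the value at `position` hashes to."""
    parts = defaultdict(list)
    for quad in quads:
        parts[machine_for(quad[position], num_machines)].append(quad)
    return parts


def local_joins(quads, num_machines):
    """Join on left.object == right.subject using only per-machine data."""
    spoc = partition(quads, 0, num_machines)  # spoc index: placed by subject
    ocsp = partition(quads, 2, num_machines)  # ocsp index: placed by object
    results = set()
    for machine in range(num_machines):
        by_subject = defaultdict(list)
        for quad in spoc.get(machine, []):
            by_subject[quad[0]].append(quad)
        for left in ocsp.get(machine, []):
            for right in by_subject.get(left[2], []):
                results.add((left, right))
    return results


def global_join(quads):
    """Reference result computed over all data in one place."""
    return {(l, r) for l in quads for r in quads if l[2] == r[0]}


quads = [
    ("ex:alice", "foaf:knows", "ex:bob", "ex:g1"),
    ("ex:bob", "foaf:name", "Bob", "ex:g2"),
    ("ex:carol", "foaf:knows", "ex:alice", "ex:g1"),
    ("ex:alice", "foaf:name", "Alice", "ex:g1"),
]
local = local_joins(quads, 4)
```

Because equal terms hash to the same machine, the union of the per-machine joins equals the join over the full data set, with no tuples shipped between machines.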

Network lookup overhead • Overhead to initialise/shut down a connection • The typical model for query processing is tuple-at-a-time (using the iterator pattern) • Not suitable for network communication • Thus: query/result blocking, which ships queries and results in batches
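The effect of blocking on the number of network round trips can be sketched as below. `FakeRemoteIndex` is a hypothetical in-process stand-in for a remote index server, and the default block size of 2000 only echoes the 2k figure from the throughput discussion; neither reflects the actual YARS2 network code.

```python
from itertools import islice


def blocks(items, block_size):
    """Yield successive lists of at most `block_size` items."""
    iterator = iter(items)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        yield block


class FakeRemoteIndex:
    """Stand-in for a remote machine; counts one round trip per call."""

    def __init__(self, data):
        self.data = data
        self.round_trips = 0

    def lookup_block(self, keys):
        self.round_trips += 1
        return [self.data.get(key) for key in keys]


def batched_lookups(remote, keys, block_size=2000):
    """Ship keys in blocks instead of one request per key."""
    results = []
    for block in blocks(keys, block_size):
        results.extend(remote.lookup_block(block))
    return results


remote = FakeRemoteIndex({i: i * i for i in range(5000)})
answers = batched_lookups(remote, range(5000), block_size=2000)
```

With 5000 keys and a block size of 2000, the sketch issues three requests instead of 5000, so the per-connection overhead is paid far less often.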

Network Throughput

Discussion • 2k row blocking seems optimal • Doesn't scale linearly; possible causes: • the network is the bottleneck (our router currently does only 100 Mbit/s) • or a single queue requires synchronisation and creates a single hot-spot

Query Processing

Join Processing • Focus on join processing, because that is the most expensive operation • Index nested loops joins: the left side of the query plan is evaluated, then queries are constructed for the right side and shipped to the remote machine • Multiple threads perform the lookups; coordination is done using queues
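A minimal sketch of an index nested loops join with several lookup threads coordinated through queues is given below. The function name, the `lookup` callable (standing in for a query shipped to a remote index), the worker count and the sample data are assumptions for illustration only.

```python
import queue
import threading


def index_nested_loop_join(left_bindings, lookup, num_workers=4):
    """For each left binding, probe the right-side index in a worker thread."""
    tasks = queue.Queue()
    results = queue.Queue()
    stop = object()  # sentinel telling a worker to finish

    def worker():
        while True:
            binding = tasks.get()
            if binding is stop:
                break
            # `lookup` plays the role of the query sent for the right side.
            for match in lookup(binding):
                results.put((binding, match))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for binding in left_bindings:
        tasks.put(binding)
    for _ in threads:
        tasks.put(stop)  # one sentinel per worker
    for thread in threads:
        thread.join()

    # All workers have finished, so the result queue can be drained safely.
    joined = []
    while not results.empty():
        joined.append(results.get())
    return joined


# Hypothetical right-side index: subject -> names.
right_index = {"ex:alice": ["Alice"], "ex:bob": ["Bob", "Robert"]}
pairs = index_nested_loop_join(
    ["ex:alice", "ex:bob", "ex:carol"],
    lambda subject: right_index.get(subject, []),
)
```

The order of `pairs` depends on thread scheduling, so callers that need a stable order should sort the result.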

Multithreaded Join Processing

Distributed Join Processing

Conclusion • RDF query processing is possible using adaptations of indexing and query processing techniques known from the 1970s to the 1990s • Scale = use basic operations, and optimise them well • Measure, measure, measure
