YARS2: A Federated Repository for Querying Graph Structured Data from the Web Andreas Harth, Juergen Umbrich, Aidan Hogan, Stefan Decker ISWC 2007, Busan, Korea Wednesday, November 14, 2007
Outline • Motivation • System Architecture • Indexing • Distribution • Query Processing • Conclusion
Problem Statement • Current search technology lets users locate information resources via keyword searches • Users with complex information needs require the ability to browse information spaces • Browsing unknown information spaces allows for • learning about a subject area • discovering previously unknown associations • viewing data integrated from a number of sources
Challenges • Browsing information spaces requires query processing • System has to scale, scale, scale • Build a system able to answer queries over web-scale data sets • Combined data from a large number of sources • Web data is scruffy • unknown schemas • varying quality • Ad-hoc query answering over combined data from millions of Web sources • Data mining operations over portions of the data • In database speak: build a data warehouse over Web data • Indexing and query processing are at the core of search engines • Hint: how does Google index, and how does it do query processing? You don’t know? They haven’t published?
Goal: SPARQL query processing • E.g. all people working at DERI:
    CONSTRUCT { ?s ?p ?o . }
    WHERE {
      ?s rdf:type foaf:Person .
      ?s foaf:workplaceHomepage <http://www.deri.org/> .
      ?s ?p ?o .
    }
Index Organisation • Splitting the index between memory and disk allows lookups in O(1) disk seeks • binary search on in-memory data structure • Read-optimised (very fast lookups); updates are applied in batches • Sorting is the most expensive operation, O(n log n), but can be done offline at intervals
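The memory/disk split can be illustrated with a small sparse index. This is a minimal sketch under stated assumptions, not the YARS2 implementation: the class name `SparseIndex`, the fixed `block_size`, and the use of an in-memory list of blocks as a stand-in for the on-disk file are all hypothetical. The point it shows is that a binary search over the in-memory first keys selects exactly one block, so a lookup costs a single block read.

```python
import bisect


class SparseIndex:
    """Sorted entries split into blocks; only each block's first key is kept in memory."""

    def __init__(self, sorted_entries, block_size):
        # Assumption: a list of lists stands in for the sorted on-disk file.
        self.blocks = [sorted_entries[i:i + block_size]
                       for i in range(0, len(sorted_entries), block_size)]
        # The in-memory part: one key per block.
        self.first_keys = [block[0] for block in self.blocks]
        self.block_reads = 0  # counts simulated disk reads

    def _read_block(self, n):
        # In a real system this would be one seek plus one sequential read.
        self.block_reads += 1
        return self.blocks[n]

    def lookup(self, key):
        # Binary search in memory for the last block whose first key <= key.
        n = bisect.bisect_right(self.first_keys, key) - 1
        if n < 0:
            return False
        block = self._read_block(n)
        # Second binary search inside the block that was read.
        i = bisect.bisect_left(block, key)
        return i < len(block) and block[i] == key


quads = sorted(("s%02d" % i, "p", "o", "c") for i in range(20))
index = SparseIndex(quads, block_size=4)
found = index.lookup(("s07", "p", "o", "c"))
```

Because the entries must be sorted before the blocks are written, inserts are naturally collected and applied in batches, which matches the read-optimised design on the slide.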
Complete Index on Quads • Given prefix lookup capabilities, only 6 indexes are needed to cover all access patterns
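The claim can be checked mechanically. A quad has four positions (s, p, o, c), each either bound or unbound in a lookup, which gives 2^4 = 16 access patterns. The sketch below assumes one particular set of six orderings (`ORDERINGS`); the slide text does not list them, so treat that list as an illustrative assumption. A pattern is answerable by prefix lookup when its bound positions form a prefix of some ordering.

```python
from itertools import combinations

POSITIONS = "spoc"
# Assumption: one possible set of six orderings, chosen for illustration.
ORDERINGS = ["spoc", "poc", "ocs", "csp", "cp", "os"]


def covering_index(bound, orderings=ORDERINGS):
    """Return an ordering whose leading positions are exactly the bound ones, or None."""
    bound = frozenset(bound)
    for order in orderings:
        if frozenset(order[:len(bound)]) == bound:
            return order
    return None


# All 16 combinations of bound positions, from "nothing bound" to "all bound".
all_patterns = [c for r in range(len(POSITIONS) + 1)
                for c in combinations(POSITIONS, r)]
uncovered = [p for p in all_patterns if covering_index(p) is None]
```

With these six orderings `uncovered` is empty. Six is also the minimum: there are six two-position patterns, and each ordering has only one two-position prefix.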
Discussion • The measurements show that CPU time is high at the smaller block sizes, and disk I/O becomes the bottleneck at the larger block sizes • Possible to trade memory space for time • smaller block size -> faster lookups, but requires more memory • larger block size -> slower lookups, but uses less memory
Data Distribution • Distributed hash tables offer very good scaling properties • S, O, C are typically well distributed (and hash buckets are about the same size) • P values are not well distributed (rdf:type is a notorious example) • Keywords are also not well distributed • Two distribution strategies: hash-based partitioning, and random partitioning (flooding)
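The two strategies can be contrasted in a few lines. This is a sketch under assumptions: the function names, the use of MD5 as the hash, and the modulo placement are hypothetical choices for illustration and are not taken from the slides. With hash-based partitioning a lookup with a bound key goes to one machine; when the key is unbound, or the data was placed randomly, the lookup has to be flooded to all machines.

```python
import hashlib


def machine_for(term, num_machines):
    """Map a term to a machine with a stable hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines


def place_quad(quad, key_position, num_machines):
    """Hash-based partitioning: the quad is stored where its key term hashes to."""
    return machine_for(quad[key_position], num_machines)


def machines_to_ask(pattern, key_position, num_machines):
    """One machine if the key position is bound, otherwise flood every machine."""
    key = pattern[key_position]
    if key is None:
        return list(range(num_machines))
    return [machine_for(key, num_machines)]


# Skew illustration: partitioning on the predicate (position 1) sends every
# rdf:type quad to the same machine.
type_quads = [("s%d" % i, "rdf:type", "foaf:Person", "c") for i in range(100)]
machines_used = {place_quad(q, 1, 4) for q in type_quads}
```

Here `machines_used` contains a single machine, which is the skew problem the slide attributes to P values.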
Pushing joins • Cool thing about index distribution is that it’s possible to compute some joins locally • e.g. ocsp >< spoc (where o == s), because both o and s are hashed to the same machine • e.g. keyword >< spoc (because the keyword index is not distributed randomly, but placed on the spoc index machines)
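The first case can be sketched as follows. Assumptions: quads are (s, p, o, c) tuples, the spoc index is partitioned by hashing s, the ocsp index by hashing o, and both use the same hypothetical `machine_for` function. Under those assumptions, a quad whose object equals another quad's subject ends up on the same machine in the two indexes, so each machine can join its own partitions without network traffic.

```python
import hashlib
from collections import defaultdict


def machine_for(term, num_machines):
    # Stable hash; MD5 is an arbitrary choice for this sketch.
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines


def partition(quads, key_position, num_machines):
    parts = defaultdict(list)
    for quad in quads:
        parts[machine_for(quad[key_position], num_machines)].append(quad)
    return parts


def local_joins(quads, num_machines):
    """Join ocsp >< spoc on o == s, one machine at a time, with no cross-machine lookups."""
    spoc = partition(quads, 0, num_machines)  # partitioned by subject
    ocsp = partition(quads, 2, num_machines)  # partitioned by object
    results = set()
    for m in range(num_machines):
        by_subject = defaultdict(list)
        for quad in spoc.get(m, []):
            by_subject[quad[0]].append(quad)
        for left in ocsp.get(m, []):
            for right in by_subject.get(left[2], []):
                results.add((left, right))
    return results


quads = [("a", "knows", "b", "c1"), ("b", "knows", "c", "c1"),
         ("c", "name", "Carol", "c2"), ("b", "name", "Bob", "c2")]
joined = local_joins(quads, num_machines=3)
```

The union of the per-machine results equals the join computed over the whole data set, which is what makes pushing this join down to the index machines safe.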
Network lookup overhead • Overhead to initialise/shut down connection • Typical model for query processing is tuple-at-a-time (using iterator pattern) • Not suitable for network communication • Thus: block queries and results, shipping them in batches
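The effect of blocking on the number of round trips can be sketched with a stand-in for the remote index. Everything named here (`FakeRemoteIndex`, `lookup_many`, `lookup_all`) is hypothetical and only counts calls; it does no real networking. A batch size of 1 corresponds to the tuple-at-a-time model.

```python
def batched(items, batch_size):
    """Group an iterable into lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


class FakeRemoteIndex:
    """Stand-in for a remote index server; each call models one network round trip."""

    def __init__(self, data):
        self.data = data
        self.round_trips = 0

    def lookup_many(self, keys):
        self.round_trips += 1
        return [(key, self.data.get(key)) for key in keys]


def lookup_all(remote, keys, batch_size):
    """Ship lookups in blocks and stream the results back to the caller."""
    for batch in batched(keys, batch_size):
        yield from remote.lookup_many(batch)


data = {"k%d" % i: i for i in range(10)}
keys = list(data)

tuple_at_a_time = FakeRemoteIndex(data)
results_single = list(lookup_all(tuple_at_a_time, keys, batch_size=1))

blocked = FakeRemoteIndex(data)
results_blocked = list(lookup_all(blocked, keys, batch_size=4))
```

Both runs return the same results, but the blocked run pays the per-connection overhead 3 times instead of 10.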
Discussion • Blocking at 2k rows seems optimal • Doesn’t scale linearly; possible causes: • network is the bottleneck (our router currently handles only 100 Mbit/s) • or the single queue requires synchronisation and creates a single hot spot
Join Processing • Focus on join processing, because it is the most expensive operation • Index nested loops joins: the left side of the query plan is evaluated, then lookups are constructed for the right side and shipped to the remote machines • multiple lookup threads, coordinated using queues
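An index nested loops join with threaded lookups can be sketched as below. This is an illustration under assumptions, not the system's code: the function name, the `lookup` callable standing in for a remote index lookup, and the worker count are all hypothetical. Rows from the left side go into a task queue, worker threads perform the right-side lookups, and matches are collected through a result queue.

```python
import queue
import threading


def index_nested_loop_join(left_rows, lookup, join_position, num_workers=4):
    """For each left row, look up matching right rows by the join key, using worker threads."""
    tasks = queue.Queue()
    results = queue.Queue()

    def worker():
        while True:
            row = tasks.get()
            if row is None:  # sentinel: no more work for this thread
                break
            for match in lookup(row[join_position]):
                results.put((row, match))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for row in left_rows:
        tasks.put(row)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()

    # All producers have finished, so draining the queue here is safe.
    joined = []
    while not results.empty():
        joined.append(results.get())
    return joined


# Hypothetical right-side index: join key -> matching rows.
right_index = {"a": ["a1", "a2"], "b": ["b1"]}
left = [("a",), ("b",), ("c",)]
pairs = index_nested_loop_join(left, lambda key: right_index.get(key, []), 0)
```

The order of `pairs` depends on thread scheduling, so callers that need a stable order should sort the result. Funnelling all results through one queue is also the kind of single synchronisation point mentioned in the discussion of the scaling measurements.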
Conclusion • RDF query processing is possible using adaptations of indexing and query processing techniques known from the 1970s to the 1990s • Scale = use basic operations, and optimise them well • Measure, measure, measure