Managing Large RDF Graphs Vaibhav Khadilkar Dr. Bhavani Thuraisingham Department of Computer Science, The University of Texas at Dallas December 2008.
Introduction • The Provost of the University of Texas at Dallas, Dr. B. Hobson Wildenthal, in conjunction with the Vice President for Research and Development, Dr. Bruce Gnade, committed the university to becoming a leader in emerging technologies, recognizing that it did not want to compete in legacy technologies. • After a detailed analysis and examination of unsolved problems, the university committed to the Semantic Web and Cloud Computing as research areas. This choice was vetted with a large number of government and industrial clients and resulted in the creation of the Semantic Web Lab.
Our Projects on Semantic Web • Confidentiality, Privacy and Trust for the Semantic Web • Texas Enterprise Funds, 2005; NSF 2007 • Building Geospatial Semantic Web • Raytheon, 2006; NGA, 2007 • Blackbook Experimentation • Texas Enterprise Funds, 2007 • Ontology Mining – part of Text mining project • NASA 2007 • Assured Information Sharing • AFOSR MURI, 2008 • Managing Large RDF Graphs and Ontology Homogenization • IARPA, 2008
Managing Large RDF Graphs • Current problems • Semantic web does not scale • Hinders ability to do reasoning and large graph processing • Current work focuses on load balancing and fault tolerance, but the big bottleneck is memory • Current systems can be broken with even 100,000 triples • We work on load balancing and polynomial reasoning but memory management breaks the systems even before any of the other problems can be addressed
Managing Large RDF Graphs • Solution History • To solve this problem we need only look at history • In the 1960s Dijkstra invented the multiprocess operating system • This gave us general purpose resource management for files and memory • In the 1970s efforts were directed at taking the general purpose OS and placing database applications on top of it • The drawback was that these systems did not scale • In the 1980s Robert Epstein and Michael Stonebraker from UC Berkeley defined specific algorithms for database processing, such as LRU/MRU • These principles are now accepted as a solved problem and led to Oracle, MySQL and others
Managing Large RDF Graphs • Solution History • In 2001 we started with the Semantic Web • Oracle, HP and others tried to apply database algorithms to graph processing • We worked to expand resource management to use specific graph algorithms • The solution is constructed so that memory appears boundless (infinite graph), with deterministic reads that are an order of magnitude slower than pure in-memory solutions • [Diagram: memory management (LRU/MRU) with nodes A, B and C]
Managing Large RDF Graphs • Relevance of problem • This was an unsolved problem • Critical for handling the terabytes of data that are common today • Virtualizes storage from memory space to disk space
Managing Large RDF Graphs • Tools Used • Jena • An open source Semantic Web framework used to build and manipulate large RDF graphs • Also gives the capability to handle RDFS and OWL • Provides a query language SPARQL and a rule based inference engine • Developed by HP Labs • Can represent RDF graphs as a model
Managing Large RDF Graphs • Tools Used • Lucene • A Java based text search engine library • Suitable for any application and platform independent • Does indexing and retrieval in a few milliseconds across terabytes of data • MySQL • An open source RDBMS used with the various database representations in Jena (RDB, SDB and TDB) • An easy to use alternative to other RDBMSs
Managing Large RDF Graphs • In-memory Jena Model • This solution forms the basis of the approach that we will use for the RDB problem • As nodes are added to the in-memory graph, memory fills up • Therefore we can handle only medium sized graphs • Once memory is full we get an out of memory exception, which stops program execution • We want to solve this out of memory problem
Managing Large RDF Graphs • Memory Management Algorithm • Graph representation • [Figure: example RDF graph in which http://www.johnSmith.com is linked by an "author" edge to http://www.johnSmith.com/paper1. Other labels in the figure: Age, Phone, Journal, Society, Journal Name, Time, 35, 123-456-7890, ACM Society, Semantic Web Journal]
Managing Large RDF Graphs • Memory Management Algorithm • Graph Representation • The graph is constructed in Jena by specifying nodes and their properties • Triples are added in a monotonically increasing fashion • Nodes may be accessed at any time (this is a key point in the algorithm) • Data structure used in the algorithm • Create an in-memory LRU based cache • For each node in the graph, store an index number, a timestamp for when it was last accessed, and the number of connections for that node • Each time the node is accessed or a triple is added, update the associated cache entry • This structure is used to determine the candidate node that will be written to disk
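The data structure described on this slide can be sketched roughly as follows. This is a minimal illustration in Python, not the authors' Jena implementation: the names `CacheEntry`, `NodeCache`, `touch`, `add_triple` and `eviction_candidate` are hypothetical, a logical counter stands in for the timestamp, and because the slides do not say how the connection count is weighed against recency, the sketch records the count but picks the candidate by recency alone.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class CacheEntry:
    index: int            # index number assigned when the node is first seen
    last_access: int      # logical timestamp of the most recent access
    connections: int = 0  # number of triples that touch this node


class NodeCache:
    """Per-node bookkeeping used to pick the node to write to disk."""

    def __init__(self):
        self.entries = {}
        self._clock = count(1)       # logical clock, stands in for a real timestamp
        self._next_index = count(0)

    def touch(self, node):
        """Record an access to `node`, creating its entry on first sight."""
        now = next(self._clock)
        entry = self.entries.get(node)
        if entry is None:
            entry = CacheEntry(index=next(self._next_index), last_access=now)
            self.entries[node] = entry
        else:
            entry.last_access = now
        return entry

    def add_triple(self, subject, predicate, obj):
        """Update the entries for both endpoints of a newly added triple."""
        self.touch(subject).connections += 1
        self.touch(obj).connections += 1

    def eviction_candidate(self):
        """Return the least recently accessed node (assumed policy), or None."""
        if not self.entries:
            return None
        return min(self.entries, key=lambda n: self.entries[n].last_access)


cache = NodeCache()
cache.add_triple("http://www.johnSmith.com", "author",
                 "http://www.johnSmith.com/paper1")
cache.touch("http://www.johnSmith.com")  # the author node is accessed again
print(cache.eviction_candidate())        # http://www.johnSmith.com/paper1
```

A policy that also uses connectivity could change only the `key` function in `eviction_candidate`, since the connection count is already maintained for every node.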
Managing Large RDF Graphs • Memory Management Algorithm • Algorithm • We use the LIMIT clause in MySQL to get back only a part of the results at a time • The triples retrieved are added to the revised in-memory Jena model • This leverages the memory management algorithm for the in-memory model • Since the revised in-memory model never runs out of memory, this RDB solution does not run out of memory either
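The idea of fetching results in parts with LIMIT can be illustrated with a short sketch. Several things here are assumptions made for the example: Python's built-in `sqlite3` stands in for MySQL (both accept `LIMIT ... OFFSET ...`), the `triples` table and its columns are hypothetical and do not reflect Jena's actual RDB schema, and `iter_pages` is an illustrative helper name.

```python
import sqlite3


def iter_pages(conn, page_size):
    """Yield successive pages of (subj, pred, obj) rows using LIMIT/OFFSET."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT subj, pred, obj FROM triples ORDER BY id LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not rows:
            break
        yield rows  # the caller would add these triples to the in-memory model
        offset += len(rows)


# Hypothetical table holding five triples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE triples (id INTEGER PRIMARY KEY, subj TEXT, pred TEXT, obj TEXT)"
)
conn.executemany(
    "INSERT INTO triples (subj, pred, obj) VALUES (?, ?, ?)",
    [(f"s{i}", "p", f"o{i}") for i in range(5)],
)

pages = list(iter_pages(conn, page_size=2))
print([len(page) for page in pages])  # [2, 2, 1]
```

Only one page of rows is held by the retrieval loop at a time, so memory use there is bounded by `page_size` rather than by the size of the result set.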
Managing Large RDF Graphs • Conclusions from the In-Memory Jena Model • As the threshold increases, the time required for the calculations decreases • As the memory size increases, the time needed for the calculations increases, since more triples can be stored in memory • Accessing a node in memory takes about 35 ms, whereas accessing one cached to Lucene takes about 300 ms • The goal is for typical usage patterns to be served from memory
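To see why serving accesses from memory matters, the two timings on this slide can be combined into an average cost per access. The weighted-average model and the hit ratios below are assumptions added for illustration; only the 35 ms and 300 ms figures come from the slide, and `expected_access_ms` is a hypothetical helper.

```python
def expected_access_ms(hit_ratio, memory_ms=35.0, disk_ms=300.0):
    """Average access time when `hit_ratio` of accesses are served from memory."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * memory_ms + (1.0 - hit_ratio) * disk_ms


for hit_ratio in (0.5, 0.9, 0.99):
    print(hit_ratio, round(expected_access_ms(hit_ratio), 2))
```

Under this simple model, moving from half of the accesses being served from memory to nine in ten cuts the average from about 167.5 ms to about 61.5 ms.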
Managing Large RDF Graphs • Conclusions from the RDB Jena Model • Database creation times are almost the same as with the original Jena implementation • Database querying times vary depending on the threshold value set in the algorithm • General Conclusions • Implemented an in-memory LRU/Connectivity based memory management algorithm • Solves the out of memory problem for the in-memory and RDB based models in Jena by giving the user the impression of infinite memory
Managing Large RDF Graphs • Future Work • Implement the memory management algorithms for cloud computing • Generalize the algorithm for all models • Try various other memory management algorithms that affect usage