Problems in Semantic Search Krishnamurthy Viswanathan and Varish Mulwad {krishna3, varish1} AT umbc DOT edu 1
Agenda • Introduction • Swoogle • Cool things others do • Swoogle facts/figures • Our ideas • References 2
Swoogle • Swoogle is a search engine for Semantic Web (SW) documents • It offers the following services: • Search SW ontologies and documents • Search SW terms, i.e. URIs that have been defined as classes and properties • Provide metadata of SW documents and support browsing the Semantic Web 4
Swoogle • Swoogle supports two relevant query types: • Ontology: Searches a small collection that consists only of Semantic Web Ontologies • Document: Searches all SW documents. This search space is much larger • Swoogle indexes only the document’s URL, the terms being defined in the document, explicit descriptions about the document, and the namespaces used by the document 5
Swoogle capabilities • Web search: • Basic metadata: e.g. url, desc, ns etc. • Document metadata: hasEncoding, hasLength etc. • RDF metadata: hasGrammar, hasCntTriple etc. • Advanced search using Lucene features • REST based services: Compose an HTTP GET query and retrieve the results in the form of RDF/XML 6
Examples of REST queries • A query is represented as a URL: • REST_QUERY ::= SERVICE_URI ? PARAMS • Example: search SW documents that are classified as ontologies (ontoRatio > 0) • queryType: e.g. search_swd_ontology • searchString: user constructed (see manual) • key: e.g. demo • Resulting URL: http://logos.cs.umbc.edu:8080/swoogle31/q?queryType=search_swd_ontology&searchString=person&key=demo 7
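The REST_QUERY grammar above can be sketched in a few lines of Python. This is only an illustration: the helper name build_rest_query is hypothetical, the service URI and the three parameter names are copied from the slide, and no request is sent, since the endpoint may no longer be reachable.

```python
from urllib.parse import urlencode

# Service URI taken from the slide's example query.
SERVICE_URI = "http://logos.cs.umbc.edu:8080/swoogle31/q"


def build_rest_query(query_type, search_string, key="demo", service_uri=SERVICE_URI):
    """Compose REST_QUERY ::= SERVICE_URI ? PARAMS (hypothetical helper)."""
    params = {
        "queryType": query_type,        # e.g. search_swd_ontology
        "searchString": search_string,  # user-constructed search string
        "key": key,                     # "demo" is the key shown on the slide
    }
    # urlencode percent-escapes the values and joins the pairs with "&".
    return service_uri + "?" + urlencode(params)


url = build_rest_query("search_swd_ontology", "person")
print(url)
```

With these inputs the helper reproduces the example URL on the slide. Issuing an HTTP GET on it would, per the previous slide, return the results as RDF/XML.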
Sindice • Sindice is a Semantic Web search engine created at the Digital Enterprise Research Institute (DERI) • Interesting things to note about Sindice: • Architecture • Indexing 9
Sindice • Sindice builds its architecture on cloud computing paradigms • Sindice uses Hadoop / Nutch to distribute crawling across multiple machines • Collected data is stored in HBase, a distributed column store 10
Sindice • Sindice indexes based on: • Inverse Functional Properties (IFPs) • URIs • Literals (keywords) • IFP: an OWL global cardinality constraint on a property; a value of such a property identifies at most one subject • Benefit: faster retrieval 11
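To make the three index types concrete, here is a minimal in-memory sketch. It is not Sindice's implementation: the sample documents, the prefixed names, the choice of foaf:mbox as the only inverse functional property, and the function build_indexes are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumption: foaf:mbox is treated as the only inverse functional property.
INVERSE_FUNCTIONAL = {"foaf:mbox"}

# Hypothetical documents: URL -> list of (subject, predicate, (kind, value)).
DOCS = {
    "http://example.org/a.rdf": [
        ("ex:alice", "foaf:mbox", ("uri", "mailto:alice@example.org")),
        ("ex:alice", "foaf:name", ("lit", "Alice Smith")),
    ],
    "http://example.org/b.rdf": [
        ("ex:asmith", "foaf:mbox", ("uri", "mailto:alice@example.org")),
        ("ex:asmith", "foaf:name", ("lit", "A. Smith")),
    ],
}


def build_indexes(docs):
    """Build URI, IFP and keyword lookups that map a key to document URLs."""
    uri_index = defaultdict(set)
    ifp_index = defaultdict(set)
    keyword_index = defaultdict(set)
    for doc_url, triples in docs.items():
        for subject, predicate, (kind, value) in triples:
            uri_index[subject].add(doc_url)
            if kind == "uri":
                uri_index[value].add(doc_url)
            else:
                # Literals are split into lowercase keywords.
                for word in value.lower().split():
                    keyword_index[word.strip(".,")].add(doc_url)
            if predicate in INVERSE_FUNCTIONAL:
                # An IFP value identifies one entity, so index the pair itself.
                ifp_index[(predicate, value)].add(doc_url)
    return uri_index, ifp_index, keyword_index


uri_index, ifp_index, keyword_index = build_indexes(DOCS)
print(sorted(ifp_index[("foaf:mbox", "mailto:alice@example.org")]))
print(sorted(keyword_index["smith"]))
```

Both lookups return the two sample documents. The IFP lookup finds them even though they use different subject URIs (ex:alice and ex:asmith), which is why indexing on IFPs helps locate all documents about the same entity with a single dictionary access.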
Watson – A gateway to the Semantic Web • From the Knowledge Media Institute (KMi) at the Open University in the UK • Interesting things to note about Watson: • Considers implicit semantic relationships • Quality of semantic documents • “Rich access” to semantic data 12
Watson • Implicit relationships between Semantic Web documents • Equivalence (duplicate detection) • Quality of semantic documents • “Richer” access to semantic data • Web interface for humans • SPARQL endpoint • Java/SOAP and REST APIs 13
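One simple way to picture equivalence (duplicate) detection is to fingerprint each document by its set of triples and group documents that share a fingerprint. The slides do not say how Watson detects duplicates, so the sketch below is an assumed technique, with hypothetical function names and sample data, and it ignores blank nodes and syntactic variants of the same URI.

```python
import hashlib


def fingerprint(triples):
    """Hash a document's triples independently of their order and repetition."""
    canonical = "\n".join(sorted(" ".join(t) for t in set(triples)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def group_duplicates(docs):
    """Return lists of document URLs whose triple sets are identical."""
    groups = {}
    for url, triples in docs.items():
        groups.setdefault(fingerprint(triples), []).append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical sample: the first two documents hold the same triples in a
# different order, the third one differs.
sample_docs = {
    "http://example.org/copy1.rdf": [
        ("ex:Person", "rdf:type", "owl:Class"),
        ("ex:name", "rdfs:domain", "ex:Person"),
    ],
    "http://example.org/copy2.rdf": [
        ("ex:name", "rdfs:domain", "ex:Person"),
        ("ex:Person", "rdf:type", "owl:Class"),
    ],
    "http://example.org/other.rdf": [
        ("ex:Animal", "rdf:type", "owl:Class"),
    ],
}

duplicates = group_duplicates(sample_docs)
print(duplicates)
```

Sorting the triples before hashing makes the fingerprint independent of serialization order, so copies of the same ontology hosted at different URLs end up in one group.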
Others • Semantic Web Search Engine (SWSE) • Pipelined architecture for crawling and indexing • Improved index and storage structure • Falcons • Class subsumption reasoning • Includes a Triple Store 14
PowerAqua • Multi-ontology-based QA system powered by PowerMap and Watson • Takes input in the form of natural language (NL) queries • Factual queries that can be expressed as one or more linguistic triples • Common wh-questions 15
PowerAqua • Key challenges in answering NL questions: • Locating the ontologies relevant to a particular query • Identifying semantically sound relationships • Combining information from multiple queries 16
Swoogle facts/figures • The search engine components currently run on 4 machines • These machines host the crawler, the Lucene index, the MySQL database, etc. and access the NFS • Approximately 20,000 pages are accessed by Swoogle every day (and get queued) • About 1,731,371 pure SW documents have been discovered 17
Swoogle facts/figures • The Swoogle crawler has a large queue of documents waiting to be crawled and indexed • Swoogle accesses metadata and index files over NFS, which slows down information retrieval 18
Our Ideas: Research and Engineering • Acquire new hardware • Parallelize Swoogle • Focus on a particular domain • Project Swoogle as a search engine for agents 19
Our Ideas: Research and Engineering • Improve Swoogle’s indexing scheme • Analyze Swoogle’s ranking scheme • Use of Swoogle metadata • Improve the usability of the website • Google-like services 20
References • Li Ding et al., "Swoogle: A Search and Metadata Engine for the Semantic Web", Proceedings of the Thirteenth ACM Conference on Information and Knowledge Management, November 2004. • P. Mika, G. Tummarello, "Web Semantics in the Clouds", IEEE Intelligent Systems, Volume 23, Issue 5 (September 2008). • E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, G. Tummarello, "Sindice.com: A document-oriented lookup index for open linked data", International Journal of Metadata, Semantics and Ontologies, 3(1), 2008. • Mathieu d’Aquin et al., "Watson: A Gateway for the Semantic Web", Poster session of the European Semantic Web Conference, ESWC 2007. • Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, "Searching Semantic Web Objects Based on Class Hierarchies", WWW 2008 Workshop on Linked Data on the Web, 2008. 21
Questions? 22