
Crawling, extraction and pre-processing


caspar



  1. Crawling, extraction and pre-processing

  2. Crawling and Extraction
     • Linked Data
       • Similar to HTML crawling, but the crawler needs to parse RDF/XML (and other serializations) to extract URIs to be crawled
       • Crawling of complete datasets can be faster using Semantic Sitemap/VoID descriptions
     • Metadata in HTML
       • Same as HTML crawling, but data is extracted after crawling
     • SPARQL endpoints
       • Endpoints are not linked, so they need to be discovered by other means
       • Semantic Sitemap/VoID can describe the contents of the repository
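The Linked Data point on slide 2 (parse RDF/XML to find the next URIs to crawl) can be sketched with the Python standard library alone. This is a minimal illustration, not a full RDF parser: it only looks at `rdf:about` and `rdf:resource` attributes, and the function name `extract_uris` and the sample document are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def extract_uris(rdfxml):
    """Collect HTTP(S) URIs found in rdf:about / rdf:resource attributes.

    Hypothetical helper: a real crawler would use a complete RDF parser
    and also handle other serializations (Turtle, N-Triples, RDFa, ...).
    """
    root = ET.fromstring(rdfxml)
    uris = set()
    for elem in root.iter():
        for attr in ("about", "resource"):
            value = elem.get(f"{{{RDF_NS}}}{attr}")
            if value and value.startswith(("http://", "https://")):
                # Strip the fragment: "...#me" is served by the same document.
                uris.add(urldefrag(value).url)
    return uris


# Made-up sample document, for illustration only.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="http://example.org/alice#me">
    <foaf:knows rdf:resource="http://example.org/bob#me"/>
    <foaf:homepage rdf:resource="http://example.org/alice/home"/>
  </foaf:Person>
</rdf:RDF>"""

# URIs that would be added to the crawl frontier.
frontier = sorted(extract_uris(SAMPLE))
print(frontier)
```

Stripping fragments keeps the frontier at document level, which is what an HTTP crawler actually fetches.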

  3. Crawling and Extraction tools
     • Crawling services
       • 80legs offers professional crawling as a service
     • Tools and services for microformat/RDFa extraction
       • E.g. Any23, RDFa Distiller, Triplr
       • Google, Yahoo, Facebook offer validators for particular types of markup
     • RDF and OWL APIs for parsing and validation
       • RDF triple stores (Sesame, Jena, Redland, ARC, …) come with readers and writers for RDF/XML and other formats, in different programming languages
       • OWL API is a Java library for OWL reading/writing
       • OWL reasoners such as Pellet perform OWL validation

  4. Practical issues
     • Linked Data
       • Serving issues, syntax
         • Hogan et al. Weaving the Pedantic Web. LDOW 2010.
         • See also pedantic-web.org
       • sameAs misuse
         • Halpin et al. When owl:sameAs isn't the Same: An Analysis of Identity in Linked Data. ISWC 2010.
       • Lack of interlinking, VoID descriptions
         • Bizer et al. State of the LOD Cloud. Oct 19, 2010.
         • http://www4.wiwiss.fu-berlin.de/lodcloud/state/

  5. Practical issues II
     • RDFa
       • Sloppiness
         • Forgetting to declare prefixes
         • Not specifying the subject of triples
       • Some RDFa attributes interact in unexpected ways
         • e.g. rel and typeof on the same HTML element
       • Complexity
         • For large pages, it is difficult to follow the scope of triples
     • Microformats
       • Sloppiness and complexity are less of an issue due to simpler markup
       • Some keyword stuffing is already apparent
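The "forgetting to declare prefixes" problem on slide 5 lends itself to a simple automated check. The sketch below uses only `html.parser` from the standard library; the class `PrefixChecker` and the helper `undeclared_prefixes` are hypothetical names. It makes simplifying assumptions: declarations are treated as document-wide (real RDFa scopes them to the declaring element and its descendants), the prefixes predefined by the RDFa 1.1 initial context are ignored, and only a handful of CURIE-carrying attributes are inspected.

```python
from html.parser import HTMLParser

# Attributes whose values are lists of terms/CURIEs (subset, by assumption).
CURIE_ATTRS = ("property", "typeof", "rel", "rev", "datatype")


class PrefixChecker(HTMLParser):
    """Records declared and used RDFa prefixes, ignoring scoping."""

    def __init__(self):
        super().__init__()
        self.declared = set()
        self.used = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if name.startswith("xmlns:"):
                # RDFa 1.0 style: xmlns:foaf="http://..."
                self.declared.add(name[len("xmlns:"):])
            elif name == "prefix":
                # RDFa 1.1 style: prefix="foaf: http://... dc: http://..."
                tokens = value.split()
                self.declared.update(
                    t[:-1] for t in tokens[0::2] if t.endswith(":")
                )
            elif name in CURIE_ATTRS:
                for token in value.split():
                    prefix, sep, _ = token.partition(":")
                    is_full_uri = token.startswith(("http:", "https:"))
                    if sep and prefix and not is_full_uri:
                        self.used.add(prefix)


def undeclared_prefixes(html):
    """Return prefixes that are used in CURIEs but never declared."""
    checker = PrefixChecker()
    checker.feed(html)
    return checker.used - checker.declared


# Made-up page fragment: "dc" is used but never declared.
PAGE = """<div xmlns:foaf="http://xmlns.com/foaf/0.1/" typeof="foaf:Person">
  <span property="foaf:name">Alice</span>
  <span property="dc:title">Engineer</span>
</div>"""

missing = undeclared_prefixes(PAGE)
print(missing)
```

A check like this catches only the first kind of sloppiness listed on the slide; detecting a missing subject requires tracking the RDFa processing context element by element.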

  6. Data (pre)processing
     • Quality of data determines pre-processing needs
       • Quality ranges from well-curated data sets (e.g. Freebase) to microformats
     • Data integration and reasoning
       • Can be applied offline or at query time
       • It is an open question precisely which kinds of reasoning are useful for retrieval
         • Example: RDF/OWL reasoning may add noise to the data
         • Example: removing duplicate results may reduce the relevance of search results, but improve diversity

  7. Data integration for Semantic Search
     • Ontology matching
       • Widely studied in Semantic Web research, see e.g. the list of publications at ontologymatching.org
       • Unfortunately, not much of it is applicable in a Web context due to the quality of ontologies
     • Entity resolution
       • Logic-based approaches in the Semantic Web
       • Studied as record linkage in the database literature
         • Machine-learning-based approaches, focusing on attributes
       • Graph-based approaches (see e.g. the work of Lise Getoor) are applicable to RDF data
         • Improvements over matching based only on attributes
     • Blending
       • Merging objects that represent the same real-world entity and reconciling information from multiple sources
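To make the entity resolution and blending steps on slide 7 concrete, here is a minimal sketch of the simplest variant mentioned there: attribute-based matching, followed by merging the matched records. It is not one of the graph-based or machine-learning approaches the slide refers to. The token-level Jaccard similarity on a `name` attribute, the 0.5 threshold, the function names, and the sample records are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def resolve_and_blend(records, threshold=0.5):
    """Cluster records whose names are similar, then merge each cluster.

    Entity resolution: pairwise name comparison (quadratic, fine for a demo).
    Blending: per cluster, keep every distinct value seen for each attribute.
    """
    parent = {rid: rid for rid in records}
    for a, b in combinations(records, 2):
        if jaccard(records[a]["name"], records[b]["name"]) >= threshold:
            parent[find(parent, a)] = find(parent, b)

    blended = {}
    for rid, attrs in records.items():
        merged = blended.setdefault(find(parent, rid), {})
        for key, value in attrs.items():
            merged.setdefault(key, set()).add(value)
    return list(blended.values())


# Made-up records from three hypothetical sources.
RECORDS = {
    "a:1": {"name": "Alice B Smith", "city": "Berlin"},
    "b:7": {"name": "Alice Smith", "email": "alice@example.org"},
    "c:3": {"name": "Bob Jones", "city": "Paris"},
}

entities = resolve_and_blend(RECORDS)
for entity in entities:
    print(entity)
```

Keeping all conflicting values as sets defers the reconciliation decision; a real blending step would pick or rank values, for example by source trust.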
