Crawling, extraction and pre-processing
Crawling and Extraction
• Linked Data
  • Similar to HTML crawling, but the crawler needs to parse RDF/XML (and other serializations) to extract the URIs to be crawled
  • Complete datasets can be crawled faster using Semantic Sitemap/VoID descriptions
• Metadata in HTML
  • Same as HTML crawling, but the data is extracted after crawling
• SPARQL endpoints
  • Endpoints are not linked to each other, so they need to be discovered by other means
  • Semantic Sitemap/VoID descriptions can describe the contents of the repository
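The Linked Data crawling loop described above can be sketched roughly as follows. This is only a minimal illustration: it assumes the fetched document is already available as N-Triples (a real crawler would also need parsers for RDF/XML, Turtle, RDFa and so on), and the names `extract_uris`, `update_frontier` and `SAMPLE` are hypothetical, chosen for this example.

```python
import re
from collections import deque

# Matches an absolute HTTP(S) IRI written in angle brackets, as in N-Triples.
# Simplification: this also picks up datatype IRIs and IRIs quoted inside literals.
IRI_PATTERN = re.compile(r"<(https?://[^<>\s]+)>")


def extract_uris(ntriples_text):
    """Return the HTTP(S) document URIs mentioned in N-Triples text, in order, without duplicates."""
    seen = []
    for line in ntriples_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        for uri in IRI_PATTERN.findall(line):
            # Drop the fragment: the crawler fetches documents, not fragments.
            doc_uri = uri.split("#", 1)[0]
            if doc_uri not in seen:
                seen.append(doc_uri)
    return seen


def update_frontier(frontier, visited, ntriples_text):
    """Append newly discovered URIs to the crawl frontier (a deque)."""
    for uri in extract_uris(ntriples_text):
        if uri not in visited and uri not in frontier:
            frontier.append(uri)


# Hypothetical example data.
SAMPLE = """
<http://example.org/alice#me> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob#me> .
<http://example.org/alice#me> <http://xmlns.com/foaf/0.1/name> "Alice" .
"""

frontier = deque()
visited = {"http://example.org/alice"}  # the document we just fetched
update_frontier(frontier, visited, SAMPLE)
print(list(frontier))
```

Unlike an HTML crawler, which follows only `href` links, this sketch treats every URI in subject, predicate or object position as a candidate for dereferencing. Whether to follow predicate URIs (vocabularies) is a policy choice.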
Crawling and Extraction tools
• Crawling services
  • 80legs offers professional crawling as a service
• Tools and services for microformat/RDFa extraction
  • E.g. Any23, RDFa Distiller, Triplr
  • Google, Yahoo, Facebook offer validators for particular types of markup
• RDF and OWL APIs for parsing and validation
  • RDF triple stores (Sesame, Jena, Redland, ARC, …) come with readers and writers for RDF/XML and other syntaxes, in different programming languages
  • OWL API is a Java library for reading and writing OWL
  • OWL reasoners such as Pellet perform OWL validation
Practical issues
• Linked Data
  • Serving issues, syntax
    • Hogan et al. Weaving the Pedantic Web. LDOW 2010.
    • See also pedantic-web.org
  • sameAs misuse
    • Halpin et al. When owl:sameAs isn't the Same: An Analysis of Identity in Linked Data, ISWC 2010.
  • Lack of interlinking and VoID descriptions
    • Bizer et al. State of the LOD Cloud. Oct 19, 2010
    • http://www4.wiwiss.fu-berlin.de/lodcloud/state/
Practical issues II.
• RDFa
  • Sloppiness
    • Forgetting to declare prefixes
    • Not specifying the subject of triples
  • Some RDFa attributes interact in unexpected ways
    • e.g. rel and typeof on the same HTML element
  • Complexity
    • For large pages, it is difficult to follow the scope of triples
• Microformats
  • Sloppiness and complexity are less of an issue due to the simpler markup
  • Some keyword stuffing is already apparent
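One of the RDFa problems listed above, forgetting to declare prefixes, can be detected mechanically. The following is a rough sketch using only Python's standard `html.parser`. It deliberately ignores scoping (a declaration really applies only to its element and descendants) and the prefixes that RDFa 1.1 predefines in its initial context, so it is a lint-style approximation and not a conforming RDFa processor. The class and function names, and the `myvoc` prefix, are hypothetical.

```python
from html.parser import HTMLParser

# Attributes whose values may contain CURIEs of the form prefix:reference.
CURIE_ATTRS = {"property", "typeof", "rel", "rev", "datatype"}


class PrefixChecker(HTMLParser):
    """Collects declared and used RDFa prefixes across a whole page (scope is ignored)."""

    def __init__(self):
        super().__init__()
        self.declared = set()
        self.used = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if name.startswith("xmlns:"):
                # RDFa 1.0 style: xmlns:foaf="http://..."
                self.declared.add(name[len("xmlns:"):])
            elif name == "prefix":
                # RDFa 1.1 style: prefix="foaf: http://... dc: http://..."
                for token in value.split()[0::2]:
                    self.declared.add(token.rstrip(":"))
            elif name in CURIE_ATTRS:
                for token in value.split():
                    prefix, sep, rest = token.partition(":")
                    # Skip plain terms (no colon) and absolute URIs (http://...).
                    if sep and prefix and not rest.startswith("//"):
                        self.used.add(prefix)


def undeclared_prefixes(html):
    """Return the sorted list of prefixes that are used but never declared."""
    checker = PrefixChecker()
    checker.feed(html)
    checker.close()
    return sorted(checker.used - checker.declared)


# Hypothetical page: "myvoc" is used but never declared.
PAGE = """<div xmlns:foaf="http://xmlns.com/foaf/0.1/" typeof="foaf:Person">
  <span property="foaf:name">Alice</span>
  <span property="myvoc:rank">Dr</span>
  <a rel="stylesheet" href="x.css"></a>
</div>"""

print(undeclared_prefixes(PAGE))
```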
Data (pre)processing
• Quality of data determines pre-processing needs
  • Quality ranges from well-curated data sets (e.g. Freebase) to microformats
• Data integration and reasoning
  • Can be applied offline or at query time
  • It is an open question precisely which kinds of reasoning are useful for retrieval
    • Example: RDF/OWL reasoning may add noise to the data
    • Example: removing duplicate results may reduce the relevance of search results, but improve diversity
Data integration for Semantic Search
• Ontology matching
  • Widely studied in Semantic Web research, see e.g. the list of publications at ontologymatching.org
  • Unfortunately, not much of it is applicable in a Web context due to the quality of ontologies
• Entity resolution
  • Logic-based approaches in the Semantic Web
  • Studied as record linkage in the database literature
    • Machine learning based approaches, focusing on attributes
    • Graph-based approaches (see e.g. the work of Lise Getoor) are applicable to RDF data
      • Improvements over attribute-only matching
• Blending
  • Merging objects that represent the same real-world entity and reconciling information from multiple sources
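As a rough illustration of the simplest logic-based route to entity resolution followed by blending, the sketch below takes explicit sameAs pairs as given, groups URIs into equivalence classes with a union-find structure, and merges the attributes of each class. It is an assumption-laden toy: it trusts every sameAs statement (which the sameAs misuse noted earlier makes risky), and it keeps conflicting values side by side instead of reconciling them. The function names and example records are hypothetical.

```python
from collections import defaultdict


def find(parent, x):
    """Return the representative of x's equivalence class (with path halving)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def blend(records, same_as_pairs):
    """Merge records whose URIs are connected by sameAs pairs.

    records: dict mapping URI -> dict of attribute -> value
    same_as_pairs: iterable of (uri_a, uri_b) tuples
    Returns a dict mapping representative URI -> dict of attribute -> sorted list of values.
    """
    parent = {}
    for a, b in same_as_pairs:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            # Use the lexicographically smaller URI as representative (arbitrary but stable).
            parent[max(root_a, root_b)] = min(root_a, root_b)

    merged = defaultdict(lambda: defaultdict(set))
    for uri, attributes in records.items():
        root = find(parent, uri)
        for key, value in attributes.items():
            merged[root][key].add(value)

    return {
        root: {key: sorted(values) for key, values in attrs.items()}
        for root, attrs in merged.items()
    }


# Hypothetical records from two sources describing overlapping entities.
RECORDS = {
    "http://a.example/paris": {"name": "Paris", "country": "France"},
    "http://b.example/Paris": {"name": "Paris", "population": "2100000"},
    "http://b.example/Lyon": {"name": "Lyon"},
}
SAME_AS = [("http://a.example/paris", "http://b.example/Paris")]

blended = blend(RECORDS, SAME_AS)
for uri, attrs in sorted(blended.items()):
    print(uri, attrs)
```

Attribute-based or graph-based matchers would replace the given `SAME_AS` list with pairs inferred from the data; the blending step could stay the same.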