Web Search Environments: Web Crawling Metadata using RDF and Dublin Core
Dave Beckett http://purl.org/net/dajobe/
Slides: http://ilrt.org/people/cmdjb/talks/tnc2002/
Introduction
• Overview of subject gateways (SGs) and web crawling
• Why WSE, and what's new? Novel results
• Future work (or stuff we didn't do) and conclusions
Overview
• Digital Library community
• In the UK, subject-specific gateways (SGs)
• Want to improve: scope (more), timeliness (fresh), cost (less)
• Stay professional – the Quality word
• Compete with web search engines – the Google Test
Human Cataloguing of the Web
• Pros: high quality; selection informed by domain knowledge; subject-specialised; cataloguing done to well-known, well-developed standards
• Cons: expensive; slow; descriptions need regular review to stay relevant
Software running web crawls
• Pros: vastly comprehensive (also a con: too much); can be very up to date
• Cons: cannot distinguish "this page sucks" from "this page rocks"; indiscriminate; subject to spamming; very general (but…)
Combining Web Crawling and High-Quality Description
A solution:
• Seed the web crawl from high-quality records
• Crawl onwards to other (presumably) good-quality pages
• Track the provenance of the crawled pages
• Provenance can then be used for querying and result ranking – see the sketch below
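A minimal sketch of that pipeline in Python, assuming a breadth-first crawl with a page budget. This is an illustration only – WSE used the Combine crawler, not this code – and the provenance map (crawled page URI to seed record URI) is a simplification of the RDF model described later.

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collect the href targets of <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl_from_records(records, max_pages=100):
        """records: iterable of (record_uri, resource_uri) pairs from the SGs.
        Returns {crawled_page_uri: seed_record_uri}, so every crawled page
        keeps its provenance back to the catalogue record that led to it."""
        provenance = {}
        frontier = [(resource, record) for record, resource in records]
        while frontier and len(provenance) < max_pages:
            page_uri, record_uri = frontier.pop(0)
            if page_uri in provenance:
                continue
            provenance[page_uri] = record_uri  # remember which record led here
            try:
                html = urlopen(page_uri, timeout=10).read().decode("utf-8", "replace")
            except (OSError, ValueError):
                continue  # unreachable page or unusable URL: skip it
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                frontier.append((urljoin(page_uri, link), record_uri))
        return provenance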
Web Search Environments (WSE) Project
• Research by ILRT and later the Resource Discovery Network (RDN)
• RDN funds the UK SGs (ILRT also had DutchESS)
WSE Technologies
• Simple Dublin Core (DC) records extracted from the SGs
• OAI protocol used to collect these records in one place (not required) – a harvesting sketch follows
• Combine Web Crawler
• RDF framework to connect the resource descriptions together
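For the optional OAI step, a minimal ListRecords harvest might look like the sketch below. The endpoint URL is invented, and a real harvester must also follow OAI-PMH resumption tokens, which this omits.

    import xml.etree.ElementTree as ET
    from urllib.parse import urlencode
    from urllib.request import urlopen

    # Hypothetical OAI-PMH endpoint; the real SG endpoints are not named here.
    BASE_URL = "http://example.org/oai"

    def list_dc_records(base_url=BASE_URL):
        """Yield (title, identifier) pairs from one ListRecords response."""
        params = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
        with urlopen(f"{base_url}?{params}", timeout=30) as response:
            tree = ET.parse(response)
        ns = {
            "oai": "http://www.openarchives.org/OAI/2.0/",
            "dc": "http://purl.org/dc/elements/1.1/",
        }
        for record in tree.iterfind(".//oai:record", ns):
            yield (
                record.findtext(".//dc:title", default="", namespaces=ns),
                record.findtext(".//dc:identifier", default="", namespaces=ns),
            )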
Simple DC Records
Really simple:
• Title
• Description
• Identifier (URI of resource)
• Source (URI of record)
(an RDF rendering is sketched below)
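One plausible RDF rendering of such a record, sketched with rdflib; the resource and record URIs are invented, and the exact property layout is an assumption rather than the project's published schema.

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DC

    g = Graph()
    resource = URIRef("http://example.org/some-page")      # described resource
    record = URIRef("http://example.org/sg/records/1234")  # hypothetical SG record

    g.add((resource, DC.title, Literal("Example resource")))
    g.add((resource, DC.description, Literal("A short description by a cataloguer.")))
    g.add((resource, DC.identifier, Literal(str(resource))))  # URI of the resource
    g.add((resource, DC.source, record))                      # URI of the record

    print(g.serialize(format="turtle"))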
Information model 1
• DC records describe all the resources
• The web crawler reads these and returns crawled web pages
• Each crawled page generates a new web-crawled resource description

Information model 2
• Link back to the original record(s), plus web-page properties
• The RDF model lets these be connected via the page and record URIs
• Giving one large RDF graph of the total information – sketched below
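Continuing the rdflib sketch, here is how a crawled page's description can join a record's description in one graph. The wse:crawledFrom property and all URIs are invented for illustration – the slides do not name the actual linking vocabulary – and the point is that shared URIs are what connect the two sets of statements.

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DC

    WSE = Namespace("http://example.org/wse#")  # hypothetical vocabulary

    g = Graph()
    record = URIRef("http://example.org/sg/records/1234")   # hypothetical SG record
    seed = URIRef("http://example.org/some-page")           # resource it describes
    page = URIRef("http://example.org/some-page/sub.html")  # page found by crawling

    # Statements derived from the subject gateway record:
    g.add((seed, DC.title, Literal("Example resource")))
    g.add((seed, DC.source, record))

    # Statements derived from the web crawler, keyed on the same URIs:
    g.add((page, DC.title, Literal("A page found by crawling")))
    g.add((page, WSE.crawledFrom, seed))  # provenance link back to the seed

    # Walking crawledFrom and then dc:source recovers the original record.
    for origin in g.objects(page, WSE.crawledFrom):
        for rec in g.objects(origin, DC.source):
            print(page, "traces back to record", rec)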
Novel Outcomes?
It is obvious that:
• Metadata gathering is not new (Harvest)
• Web crawling is not new (Lycos)
• Cataloguing is not new (thousands of years)
So what is new?
WSE – Areas Not Focused On
I digress…
• Gathering the data together – not crucial; Combine is a distributed harvester
• Full-text indexing – not optimised
• The web-crawling algorithm – the routes through the web were not selected in a sophisticated way
WSE – General Benefits
• Connects separate systems (one fewer place to go)
• The RDF graph allows more data mixing (not fragile)
• Leverages existing systems (Combine, Zebra) and standards (RDF, DC)
WSE – Novel Searching
• "game theory napster" – zero hits
• Cross-subject searching in one system – "gmo"
• Can navigate the resulting provenance – see the query sketch below
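To show what navigating that provenance might look like, here is an illustrative SPARQL query over a graph like the one built in the earlier sketches. This is an anachronism – SPARQL postdates the 2002 system – and the wse:crawledFrom property remains the invented one from above.

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DC

    WSE = Namespace("http://example.org/wse#")  # hypothetical vocabulary

    # Rebuild a tiny version of the combined graph from the earlier sketch.
    g = Graph()
    page = URIRef("http://example.org/gmo-news")
    seed = URIRef("http://example.org/some-page")
    record = URIRef("http://example.org/sg/records/1234")
    g.add((page, DC.title, Literal("GMO research news")))
    g.add((page, WSE.crawledFrom, seed))
    g.add((seed, DC.source, record))

    # Find crawled pages whose title mentions a term, and the gateway
    # record each one traces back to.
    QUERY = """
    PREFIX dc:  <http://purl.org/dc/elements/1.1/>
    PREFIX wse: <http://example.org/wse#>
    SELECT ?page ?record WHERE {
        ?page dc:title ?title ;
              wse:crawledFrom ?seed .
        ?seed dc:source ?record .
        FILTER CONTAINS(LCASE(STR(?title)), "gmo")
    }
    """
    for row in g.query(QUERY):
        print(row.page, "via record", row.record)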
WSE – Gains
• Web crawling gains from high-quality human description
• SGs gain from the increase in relevant pages
• Fresher content than a human-catalogued resource
• More focused than a general search engine
WSE as a new tool
• For subject experts
• Which includes cataloguers
• Gives fast, relevant search (no formal precision/recall analysis)
WSE – new areas
• Cross-subject searching is possible in subjects not yet catalogued, or that fall between SGs
• Searching emerging topics is possible ahead of additions to catalogue standards
• Helps indicate where new SGs or thesauri are needed
WSE – deploying
• ILRT WSE
• RDN WSE
• RDN – investigating WSE for its main search system
WSE for SGs
Individual SGs – enhancing subject-specific searches:
• Deep / full web crawling of high-quality sites
• Granularity of cataloguing and cost: it is better for humans to describe entire sites (or large parts of them) and let the software do the detailed work on individual pages
Future
• Improve and target the crawling
• Use the SG information in result ranking
• Add other relevant data to the graph, such as RSS news
• A Semantic Web application
Questions?
Thank you
Slides: http://ilrt.org/people/cmdjb/talks/tnc/2002/
Project: http://wse.search.ac.uk/
References
• Combine Web Crawler: http://www.lub.lu.se/combine/
• Dublin Core: http://dublincore.org/
• ILRT: http://ilrt.org/
• RDF: http://www.w3.org/RDF/
• Semantic Web: http://www.w3.org/2001/sw/