
Web Search Environments Web Crawling Metadata using RDF and Dublin Core


Presentation Transcript


  1. Web Search Environments: Web Crawling Metadata using RDF and Dublin Core • Dave Beckett http://purl.org/net/dajobe/ • Slides: http://ilrt.org/people/cmdjb/talks/tnc2002/

  2. Introduction • Overview of subject gateways (SGs) and web crawling • Why WSE, what’s new? Novel results • Future work (or stuff we didn’t do) and conclusions

  3. Overview • Digital Library community • In UK, subject-specific gateways (SGs) • Want to improve: scope (more), timeliness (fresh), cost (less) • Stay professional – the Quality word • Compete with web search engines – the Google Test

  4. Human Cataloguing of the Web • Pros: High quality, domain knowledge selection, subject-specialised, cataloguing done to well-known and developed standards • Cons: Expensive, slow, descriptions need to be reviewed regularly to keep them relevant

  5. Software running web crawls • Pros: vastly comprehensive (Con: too much), can be very up-to-date • Cons: cannot distinguish “this page sucks” from “this page rocks”, indiscriminate, subject to spamming, very general (but…)

  6. Combining Web Crawling and High-Quality Description – a solution • Seed the web crawl from high-quality records • Crawl to other (presumably) good-quality pages • Track the provenance of the crawled pages • Provenance can be used for querying and result ranking
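A minimal sketch of the seeding idea, in Python. This is illustrative only (WSE used the Combine crawler, not this code): assume each catalogue record supplies the page URI (its dc:identifier) and the record URI (its dc:source), crawl one hop out from each seed page, and remember which record each crawled page traces back to.

```python
# Illustrative one-hop crawl seeded from catalogue records.
# Not the project's code; WSE used the Combine crawler.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_from_records(records):
    """records: iterable of {'identifier': page URI, 'source': record URI}."""
    provenance = {}  # crawled page URI -> catalogue record URI
    for rec in records:
        seed, record_uri = rec["identifier"], rec["source"]
        provenance[seed] = record_uri
        html = urlopen(seed).read().decode("utf-8", errors="replace")
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            # a page one hop from a quality seed inherits its provenance
            provenance.setdefault(urljoin(seed, href), record_uri)
    return provenance
```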

  7. Web Search Environments (WSE) Project • Research by ILRT and later Resource Discovery Network (RDN) • RDN funds UK SGs (ILRT also had DutchESS)

  8. WSE Technologies • Simple Dublin Core (DC) records extracted from the SGs • OAI protocol used to collect these records in one place (not a required part of the design) • Combine Web Crawler • RDF framework to connect the resource descriptions together
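As a rough sketch of the OAI step, the code below pages through an OAI-PMH ListRecords response and pulls out the simple DC elements. The endpoint URL is a placeholder (the slides do not give the gateways' actual endpoints), and repeated DC elements are collapsed to one value for brevity.

```python
# Sketch of harvesting simple DC records over OAI-PMH.
# The base URL below is a placeholder endpoint.
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def list_records(base_url):
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    while True:
        tree = ET.parse(urlopen(base_url + "?" + urlencode(params)))
        for rec in tree.iter(OAI + "record"):
            meta = rec.find(OAI + "metadata")
            if meta is not None:
                # keep one value per DC element (enough for a sketch)
                yield {el.tag.split("}")[1]: el.text
                       for el in meta.iter() if el.tag.startswith(DC)}
        token = tree.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text}

for record in list_records("http://gateway.example.org/oai"):
    print(record.get("title"), record.get("identifier"))
```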

  9. Simple DC Records Really simple: • Title • Description • Identifier (URI of resource) • Source (URI of record)
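Such a record maps directly onto RDF. A minimal sketch with the (modern) rdflib library; the two URIs are invented examples standing in for a real page and a real gateway record:

```python
# One "really simple" DC record as RDF triples about the page.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

page = URIRef("http://example.org/some-resource/")       # the resource
record = URIRef("http://gateway.example.org/record/42")  # the SG record

g = Graph()
g.add((page, DC.title, Literal("Some Resource")))
g.add((page, DC.description, Literal("A high-quality catalogued page.")))
g.add((page, DC.identifier, Literal(str(page))))  # URI of resource
g.add((page, DC.source, record))                  # URI of record

print(g.serialize(format="turtle"))
```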

  10. Information model 1 • DC records describe all the resources • The web crawler reads these and returns crawled web pages • Each crawled page generates a new web-crawled resource description

  11. Information model 2 • Crawled pages link back to the original record(s), plus web page properties • The RDF model lets these be connected via the page and record URIs • Giving one large RDF graph of all the information
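A sketch of how the two models join up in one graph, continuing the rdflib example above. The wse:crawledFrom property is invented here as a stand-in for whatever provenance vocabulary the project actually used:

```python
# Connecting a crawled page back to the seed page (and, through it,
# to the catalogue record). wse:crawledFrom is a hypothetical property.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC

WSE = Namespace("http://example.org/wse/ns#")  # invented namespace

g = Graph()
seed = URIRef("http://example.org/some-resource/")
record = URIRef("http://gateway.example.org/record/42")
crawled = URIRef("http://example.org/some-resource/details.html")

g.add((seed, DC.source, record))                    # human description
g.add((crawled, DC.title, Literal("GMO details")))  # crawler-extracted
g.add((crawled, WSE.crawledFrom, seed))             # provenance link
# page, seed and record URIs now sit in one connected RDF graph
```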

  12. WSE graph

  13. Novel Outcomes? It is obvious that: • Metadata gathering is not new (Harvest) • Web crawling is not new (Lycos) • Cataloguing is not new (1000s of years) So what is new?

  14. WSE – Areas Not Focused On (an aside) • Gathering data together – not crucial; Combine is a distributed harvester • Full-text indexing – not optimised • Web crawling algorithm – the routes through the web were not selected in a sophisticated way

  15. WSE – General Benefits • Connects separate systems (one fewer place users need to go) • RDF graph allows more data to be mixed in (not fragile) • Leverages existing systems (Combine, Zebra) and standards (RDF, DC)

  16. WSE – Novel Searching • “game theory napster” – zero hits • Cross-subject searching in one system – “gmo” • Can navigate resulting provenance
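Navigating the provenance is then an ordinary graph query. Continuing the sketch above (g and the invented wse:crawledFrom property come from the previous code block), a SPARQL query can find crawled pages whose title mentions a term and walk back to the seed page and its catalogue record:

```python
# g and wse:crawledFrom come from the previous sketch.
results = g.query("""
    PREFIX dc:  <http://purl.org/dc/elements/1.1/>
    PREFIX wse: <http://example.org/wse/ns#>
    SELECT ?page ?seed ?record WHERE {
        ?page dc:title ?title ;
              wse:crawledFrom ?seed .
        ?seed dc:source ?record .
        FILTER regex(?title, "gmo", "i")
    }
""")
for page, seed, record in results:
    print(page, "<- crawled from", seed, "<- catalogued in", record)
```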

  17. WSE – Gains • Web crawling gains from high-quality human description • SGs gain from an increase in relevant pages • Fresher content than human-catalogued resources • More focused than a general search engine

  18. WSE as a new tool • For subject experts • Which includes cataloguers • Gives fast, relevant search (no formal precision/recall analysis was done)

  19. WSE – new areas • Cross-subject searching possible in subjects not yet catalogued, or that fall between SGs • Searching emerging topics is possible ahead of additions to catalogue standards • Helps indicate where new SGs, thesauri are needed

  20. WSE - deploying • ILRT WSE • RDN WSE • RDN – investigating for the main search system

  21. WSE for SGs Individual SGs – enhancing subject-specific searches: • Deep / full web crawling of high-quality sites • Granularity of cataloguing and cost: it is better for humans to describe entire sites (or large parts) and let the software do the detailed work of describing individual pages

  22. Future • Improve and target the crawling • Use the SG information with result ranking • Add other relevant data to the graph such as RSS news • A Semantic Web application
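On the RSS point: RSS 1.0 is itself RDF/XML, so a feed can in principle be parsed straight into the same graph. A tiny sketch (the feed URL is a placeholder, and g is the graph from the sketches above):

```python
# RSS 1.0 is RDF/XML, so it merges straight into the existing graph g.
from rdflib import Namespace
from rdflib.namespace import RDF

RSS = Namespace("http://purl.org/rss/1.0/")
g.parse("http://example.org/news.rss", format="xml")  # placeholder feed
for item in g.subjects(RDF.type, RSS.item):
    print(g.value(item, RSS.title))  # news items now part of the graph
```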

  23. Questions? • Thank You • Slides: http://ilrt.org/people/cmdjb/talks/tnc/2002/ • Project: http://wse.search.ac.uk/

  24. References • Combine Web Crawler: http://www.lub.lu.se/combine/ • Dublin Core: http://dublincore.org/ • ILRT: http://ilrt.org/ • RDF: http://www.w3.org/RDF/ • Semantic Web: http://www.w3.org/2001/sw/
