
Kulturarw³

Capturing the web: The Swedish experience (www.kb.se/kw3). Kulturarw³. Content: Background; Kulturarw³ goals and strategy; Sweden on the net?; Harvesting software; Finding links; Problems; Statistics; The Archive (priorities, storage, what we save); Development (IIPC, tools, formats); Conclusion.

tom

Presentation Transcript


  1. Capturing the web: The Swedish experience • Kulturarw³ • www.kb.se/kw3

  2. Content • Background • Kulturarw3 • goals • strategy • Sweden on the net? • Harvesting • Software • Finding links • Problems • Statistics • What have we got? • The Archive • priorities • storage • what we save • Development • IIPC • Tools, format • Conclusion

  3. Background • Legal deposit since 1661 • Latest revision 1993 • Only electronic documents in fixed form • CD-ROM, diskettes • New law • 1 July 2002: exception from the personal privacy law • First Swedish web newspaper lost • Printed newspapers since 1645 • Kulturarw3 started 1996 • Still waiting for a new legal deposit law

  4. Goals • All web pages in Sweden • pictures, video etc. • .se and other top-level domains • Electronic journals

  5. Strategy: two choices • Select what is important • How do we know what will be considered important in the future? • Labour-intensive • Take everything, using automatic software • Gets everything (well, not really) • Less labour-intensive

  6. Strategy • Take snapshots of the Swedish web a few times each year • Gets "all" • Needs less labour • Computer memory is cheap • However, large volumes make quality control difficult • Selective harvesting: about 150 newspapers every day • In the future: events, e.g. elections • With as little human intervention as possible

  7. Sweden on the web? • Only the domain part of a URL (e.g. http://www.kb.se/kbstart.htm) is relevant • .se • .nu (Niue), popular in Sweden: "nu" means "now" in Swedish • Other domains if the server is geographically located in Sweden • Language?
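
The scoping rules on this slide can be sketched as a small decision function. This is only an illustration under stated assumptions, not the Kulturarw3 implementation: `is_in_scope`, `SWEDISH_TLDS` and the `server_in_sweden` callback are hypothetical names, and the callback stands in for whatever geolocation lookup is available.

```python
from urllib.parse import urlsplit

# TLDs treated as Swedish on the slide: .se, plus .nu (Niue), popular in Sweden.
SWEDISH_TLDS = {"se", "nu"}


def is_in_scope(url, server_in_sweden=None):
    """Return True if the URL counts as 'Swedish' under the slide's rules.

    server_in_sweden is a hypothetical callback (host -> bool) standing in
    for a geolocation lookup; it is only consulted for hosts outside the
    Swedish TLDs.
    """
    # Only the domain part of the URL matters; path and query are ignored.
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if not host:
        return False
    tld = host.rsplit(".", 1)[-1]
    if tld in SWEDISH_TLDS:
        return True
    if server_in_sweden is not None:
        return bool(server_in_sweden(host))
    return False
```

For example, `is_in_scope("http://www.kb.se/kbstart.htm")` is true on the TLD alone, while a `.com` host is accepted only if the supplied lookup says its server is in Sweden. The open "Language?" question on the slide is not modelled here.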

  8. Harvesting software • A harvester (crawler, spider) collects web pages by automatically following links and saving pages • Open-source harvester: Heritrix • Main developer: Internet Archive (IA) • Written in Java. Active community. • Designed for archiving, not indexing. • Earlier: modified version of Combine • From NetLab, Lund University. • Important! Indexing isn't archiving and archiving isn't indexing! • Also collects pictures, sound etc.
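
The harvesting loop described on this slide (follow links, save pages) can be sketched in a few lines. This is a minimal illustration, not Heritrix and not the modified Combine: `harvest`, `LinkExtractor`, and the `fetch` and `in_scope` hooks are hypothetical names, and the hooks are injected so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect link targets, including inline resources such as images."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def harvest(seeds, fetch, in_scope, max_pages=1000):
    """Breadth-first harvest: fetch a page, save it, queue its links.

    fetch(url) -> str or None and in_scope(url) -> bool are supplied by the
    caller (hypothetical hooks). Returns a dict of URL -> saved content.
    """
    archive = {}
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            continue
        archive[url] = body  # archiving keeps the content itself
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            target, _fragment = urldefrag(urljoin(url, link))
            if target not in seen and in_scope(target):
                seen.add(target)
                queue.append(target)
    return archive
```

The point the slide stresses shows up in one line: the harvester stores the fetched content as-is, whereas an indexer would typically keep only extracted terms. A production harvester also needs politeness delays, robots handling and binary-safe storage, all omitted here.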

  9. Problems • ...or challenges, if you are an optimist • Scripts • Interactive pages • Password-protected pages • Video/streaming material • Social sites

  10. Statistics: what did we get? Bulk crawls (everything Swedish) • First sweep, 1997, .se only: 6.8 million files, 160 GB of data • A sweep in 2007-2008, .se and other TLDs: 270 million files, 11,500 GB of data
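
As a rough reading aid, the two sweeps can be compared with simple arithmetic. The figures are the ones on the slide; decimal units (1 GB = 10^9 bytes) are an assumption, and the function names are illustrative only.

```python
# Figures from the slide; decimal units assumed (1 GB = 10**9 bytes).
SWEEPS = {
    "1997": {"files": 6.8e6, "bytes": 160e9},
    "2007-2008": {"files": 270e6, "bytes": 11.5e12},  # 11,500 GB
}


def average_file_size_kb(sweep):
    """Average size per harvested file, in kilobytes (10**3 bytes)."""
    return sweep["bytes"] / sweep["files"] / 1e3


def growth_factor(key):
    """How many times larger the 2007-2008 sweep is than the 1997 sweep."""
    return SWEEPS["2007-2008"][key] / SWEEPS["1997"][key]


for name, sweep in SWEEPS.items():
    print(f"{name}: about {average_file_size_kb(sweep):.1f} kB per file")
print(f"files grew about {growth_factor('files'):.0f}x, "
      f"data about {growth_factor('bytes'):.0f}x")
```

Under these assumptions the average file grew from roughly 24 kB to roughly 43 kB, and the data volume grew faster (about 72 times) than the file count (about 40 times).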

  11. Statistics: what did we get? • Periodika (newspapers) • Started June 2002 • 88 million URLs • 4.0 TB • About 40,000 URLs every day

  12. More statistics: bulk (everything Swedish) • 823,100 web servers (including inlines) • 651,700 "Swedish": .se 50%, .nu 21%, others 29% • 1,549 different MIME types found • HTML about 50% • text/html + image/gif + image/jpeg + application/pdf + text/plain make up about 97% of the documents • A lot of garbage, misspellings etc.
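
Statistics like these can be produced by tallying the Content-Type values seen during a crawl. The sketch below is an assumption about how such a tally might look, not the method used for the slide; `normalize_mime` and `mime_shares` are hypothetical names.

```python
from collections import Counter


def normalize_mime(content_type):
    """Reduce a raw Content-Type header to a lowercase 'type/subtype'.

    Returns None for values that do not look like a MIME type at all, which
    is one simple way to set aside some of the garbage the slide mentions.
    """
    if not content_type:
        return None
    # Drop parameters such as "; charset=utf-8".
    value = content_type.split(";", 1)[0].strip().lower()
    main, sep, sub = value.partition("/")
    if not sep or not main or not sub or " " in value:
        return None
    return value


def mime_shares(content_types):
    """Return (shares, garbage_count); shares maps MIME type -> fraction."""
    counts = Counter()
    garbage = 0
    for raw in content_types:
        mime = normalize_mime(raw)
        if mime is None:
            garbage += 1
        else:
            counts[mime] += 1
    total = sum(counts.values())
    shares = {m: n / total for m, n in counts.most_common()} if total else {}
    return shares, garbage
```

A misspelled but well-formed value such as `text/hmtl` still counts as its own type in this sketch, which is one plausible reason a crawl reports well over a thousand distinct MIME types.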

  13. Trends • HTML: stable at 50-60%, increasing lately • JPEG: increasing, 11% (1997) to 27% (2005) • GIF: decreasing, 23% (1997) to 11% (2005) • PDF: increasing, from 9th to 4th position

  14. Accessing the archive • First priority is to access the archive using traditional web technologies • Surf in "space" and time • Free-text search • NB: not using traditional library methods (cataloguing etc.)
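
Surfing in time needs one basic lookup: given a URL and a requested date, find the capture closest to that date. The sketch below shows one way to do it with a sorted list per URL. `CaptureIndex` is a hypothetical class for illustration and does not describe how the archive's own access tools work.

```python
from bisect import bisect_left, insort
from datetime import datetime


class CaptureIndex:
    """Map each URL to a sorted list of capture times."""

    def __init__(self):
        self._captures = {}

    def add(self, url, when):
        """Record that url was captured at the datetime 'when'."""
        stamps = self._captures.setdefault(url, [])
        if when not in stamps:
            insort(stamps, when)

    def closest(self, url, when):
        """Return the capture time nearest to 'when', or None if unknown."""
        stamps = self._captures.get(url)
        if not stamps:
            return None
        pos = bisect_left(stamps, when)
        # Only the neighbours on either side of the insertion point can win.
        candidates = stamps[max(pos - 1, 0):pos + 1]
        return min(candidates, key=lambda s: abs(s - when))
```

With captures from 1997, 2002 and 2008 for the same URL, a request for early 2003 resolves to the 2002 capture. Requests before the first or after the last capture resolve to the nearest end.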

  15. Development • International Internet Preservation Consortium (IIPC) • Started by the Internet Archive and the national libraries of Sweden, Norway, Finland, Denmark, Iceland, the UK, France, Italy, Canada, Australia and the USA (LoC). Now many more members • Develop common standards, tools and methods for web archiving • Raise awareness

  16. Development, standards • Archiving formats • Earlier formats • MIME multipart (Multipurpose Internet Mail Extensions) • ARC • NedLib • WARC (Web ARChive file format) • File format for saving web material • Each web page is one record in a WARC file • A record contains metadata and content • ISO 28500
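
The "one record = metadata + content" idea can be illustrated by serializing a single, simplified WARC record. This is a sketch under assumptions: it writes a `resource` record with a minimal set of header fields, whereas a real harvester would typically write `response` records wrapping the full HTTP exchange and should use a dedicated WARC library. The function names are hypothetical.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri, payload, content_type):
    """Serialize one simplified WARC 'resource' record as bytes.

    Layout: version line, named header fields (the metadata), a blank line,
    the content block, then two CRLFs that terminate the record.
    """
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),  # length of the block in bytes
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def append_record(path, record):
    """Append a record to a WARC file; a file is a plain sequence of records."""
    with open(path, "ab") as handle:
        handle.write(record)
```

Because each record carries its own length and target URI, many harvested documents can be concatenated into one large file and still be read back individually.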

  17. Development, tools • Harvesting: Heritrix • Designed for archiving (NOT a modified indexer) • Open source: Java, Linux etc. • Supported by IIPC • Mainly developed by the Internet Archive, with contributions from others • Supports ARC and MIME; WARC support is planned (and partly in place) • Surfing tools • New Wayback Machine • WERA: surf with a time line • WAXToolbar: support when using the new Wayback Machine • NutchWax • Free-text search (with time line) • Curator tool • Makes it possible for a non-technician to do collection and quality control

  18. Advice • Use open standards and open source → IIPC • Get users for the archive • Think big: hundreds of terabytes, billions of files • Accept that what you do is a best effort

  19. Conclusion • The web is constantly changing → continuous development • It is possible to get a reasonable picture of the web, but never a complete one! • Do something now

  20. Questions? Comments?

  21. Links • IIPC: www.netpreserve.org • Kulturarw3: www.kb.se/kw3 • Internet Archive: www.archive.org
