Background

Archiving the web: does whole-of-domain archiving = information overload? Dr Bob Pymm, Jake Wallis, Charles Sturt University

Background • PANDORA - selective web archiving by the National Library of Australia (NLA) since 1996, c. 20,000 titles • NLA whole-of-domain (.au) web archiving annually since 2005, over 500 million files in 2007 – 19 TB of data • Undertaken by the US-based Internet Archive • Simple keyword index created, plus URL index

Issues related to the crawl • No authorisation for gathering the files – restrictions on use • Complexities arising from the diverse nature of the Web • Australian content on overseas servers, sites not in the “.au” domain (e.g. blogs) • Difficulty in capturing dynamically created content • The size of the resulting dataset and indexes

Indexing and large datasets • Remember – whenever searching the Web or a harvest, it is the index being searched, not the actual contents • Hence the importance of effective indexing, with research under way on how to improve it • Google’s success with ranking and weighting in its indexes • Alternative methods include visual result sets which show links as well as straight results
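To make the point about searching the index rather than the contents concrete, here is a minimal sketch of a keyword index in Python. It is an illustration only, not the indexing actually used for the crawl: the function names, the tokenising rule and the three example pages (with made-up URLs) are all assumptions.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_keyword_index(pages):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs containing every term in the query.

    Only the index is consulted; the page text is never re-read.
    """
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical pages, invented for this example.
pages = {
    "http://example.gov.au/health": "Indigenous health programs and services",
    "http://example.org.au/landcare": "Landcare programs for local communities",
    "http://example.net.au/blog": "A personal blog about health and gardening",
}

index = build_keyword_index(pages)
print(sorted(search(index, "indigenous health")))
```

A page whose terms never made it into the index cannot be found, however relevant its contents. The sketch also returns matches unranked, which is where ranking and weighting come in once the number of hits grows large.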

Research approach • The focus was the 2007 “.au” Web crawl and its accessibility • Two topics from the 20/20 Summit chosen: • Indigenous health • Landcare programs • High profile topics, likely to be of long term interest

Searching • Searched across PANDORA and Crawl dataset • PANDORA – curated titles; multiple indexes • Crawl – keyword and URL index only • Simple search on each term – the top five PANDORA records for each term were treated as key resources on the topic, being important and authoritative due to their careful selection and indexing. All ten were Federal Government sites.

Searching (cont.) • Same terms searched in the Web crawl (note: all PANDORA sites were eventually found in the Crawl)

The Long Tail • Like the Pareto Principle, the Long Tail paradigm suggests a small proportion of the available information meets the vast majority of needs.
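As a rough illustration of the Long Tail idea, the sketch below assumes that the usefulness of ranked search hits falls off following a Zipf-style curve (weight proportional to 1/rank). That distribution, the function name and the figures of 1,000 hits and 50 viewed results are assumptions chosen for the example, not measurements from the crawl.

```python
def zipf_top_share(total_hits, top_k, exponent=1.0):
    """Share of total weight held by the top_k hits.

    Assumes hit number `rank` carries weight 1 / rank**exponent,
    a Zipf-style distribution used here purely for illustration.
    """
    weights = [1.0 / rank ** exponent for rank in range(1, total_hits + 1)]
    return sum(weights[:top_k]) / sum(weights)


# Hypothetical scenario: 1,000 hits, of which the searcher views the first 50.
share = zipf_top_share(total_hits=1000, top_k=50)
print(f"Top 5% of hits carry about {share:.0%} of the total weight")
```

Under these assumptions a small fraction of the hits accounts for well over half of the total weight, with the remainder spread thinly along the tail.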

Discussion • Searchers stop after a small number of pages – the top of the tail – leaving a very long tail of hits not considered • Selective archives such as PANDORA deliver small numbers of pages of high relevance (curated), i.e. the top of the tail • The Web Crawl gives the top and the long tail all together; the indexing decides what the searcher sees first

Discussion (cont.) • Cost and difficulty of creating effective indexes and display mechanisms for huge datasets • Issue of rights and privacy infringements when no permissions are sought for data harvesting • BUT • Curated collections such as PANDORA may reflect a conservative paradigm (the top five sites for each topic were all Federal Government, for instance)

Discussion (cont.) • The Web Crawl captures a broader cultural milieu • Online activities – political activism, public communication, social networking, audio and video sharing • Collection development paradigm changing – mediated vs democratic – PANDORA vs Web Crawl

Conclusion • Mediated or curated selection, as in PANDORA, delivers data of quality and integrity, readily accessible, but in very limited areas • The Web Crawl delivers a huge mass of data, collected without fear or favour, but very hard to access • What will best meet the needs of future researchers has to be a continuing debate
