Search and Access Strategies for Web Archives
Sangchul Song and Joseph JaJa
1. Background
• The Web has become the main publication medium worldwide, covering almost every facet of human activity. However, the Web is an ephemeral medium.
• Web archives offer unique opportunities for knowledge discovery because of the richness of their contents, which extend over significant periods of time.
• We need effective and scalable access strategies for web archives covering significant temporal spans.

2. Our Goals: Development of
• An effective technology to explore and search for information in a web archive. The technology must enable effective information exploration and discovery.
• A framework that ranks and evaluates archived web contents within a temporal context specified by the user.
• Methods to determine the relevance of a group of web objects within the temporal context. A group can consist of a series of temporally contiguous versions of a single URL, or of web objects archived within some time span.
• A framework that allows effective search using keywords and time spans for large-scale web archives.

3. Existing Access Methods
• Directory
• Chronological Listing
• Hybrid Text-Search

4. Problems With Existing Methods
• Inefficient handling of time-constrained search.
  • Example query: “Find web pages that contain ‘September 11th’ before 2001”
  • Search all: returns pages about the 2001 attacks (en.wikipedia.org/wiki/September_11,_2001_attacks, 911digitalarchive.org/, www.jontzen.com/tribute.htm, www.9-11commission.gov/) … and 4 million other pages pertaining to the September 11th attack.
  • Then filter: leaves pages such as en.wikipedia.org/wiki/Ethiopian_calendar and apod.nasa.gov/apod/ap970911.html … and only 560 other pages that are irrelevant to the September 11th attack.
  • Search all, and then filter: very inefficient!
• Ineffective delivery of search results.
• Inadequate relevancy scoring.
  • Scoring is performed over the entire history.
• Ungrouped search results.
  • A URL is not unique in a web archive; it is time dependent.
  • Because different versions of the same URL tend to have similar contents, the first result page is highly likely to be “polluted” by multiple versions of the same URL.
  • Users may want to focus on a specific time period within the results.
• Lack of a group-scoring methodology.
  • Without a group-scoring methodology, it is not clear which group to show at the top.

5. Overview of Our Approach
• Efficient time-constrained search by maintaining separate inverted lists for each time window (see Block 6).
• Scoring within a temporal context by computing term weights as a function of time (see Block 7).
• Grouping similar search results and scoring them as a group (see Blocks 9 and 10).
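The idea of keeping separate inverted lists per time window, instead of searching everything and then filtering, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the class name `WindowedIndex`, the window boundaries, the whitespace tokenizer, and the archive years in the usage lines are all assumptions, and a plain dictionary stands in for the time hierarchy or multi-version tree. It also simplifies "valid within a window" to "archived within a window".

```python
from bisect import bisect_right
from collections import defaultdict


class WindowedIndex:
    """Toy index with one set of inverted lists per time window (hypothetical)."""

    def __init__(self, boundaries):
        # boundaries = [t_0, t_1, ..., t_K]; window k covers [t_k, t_{k+1}).
        self.boundaries = list(boundaries)
        # lists[k][term] -> set of (url, archived_at) pairs indexed in window k
        self.lists = defaultdict(lambda: defaultdict(set))

    def _window(self, t):
        # Locate the window whose half-open interval contains t.
        k = bisect_right(self.boundaries, t) - 1
        if k < 0 or k >= len(self.boundaries) - 1:
            raise ValueError(f"time {t} is outside the indexed range")
        return k

    def add(self, url, archived_at, text):
        # Simplification: each version is indexed only in the window
        # that contains its archive time.
        k = self._window(archived_at)
        for term in text.lower().split():
            self.lists[k][term].add((url, archived_at))

    def search(self, term, start, end):
        """Return versions containing `term` that were archived in [start, end)."""
        term = term.lower()
        hits = set()
        for k in range(len(self.boundaries) - 1):
            lo, hi = self.boundaries[k], self.boundaries[k + 1]
            if hi <= start or lo >= end:
                continue  # window does not overlap the query span: never scanned
            for url, t in self.lists[k].get(term, ()):
                if start <= t < end:
                    hits.add((url, t))
        return sorted(hits)


# Usage with made-up archive years (assumptions, not data from the poster).
index = WindowedIndex([1996, 2000, 2004, 2008])
index.add("apod.nasa.gov/apod/ap970911.html", 1997, "September 11 1997 Mars Global Surveyor")
index.add("911digitalarchive.org/", 2002, "September 11 Digital Archive")
print(index.search("september", 1996, 2001))
```

Only the windows that overlap the requested time span are touched, so a query restricted to an early period does not have to walk the lists built for later periods.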
6. Basic Techniques
• Determine a snapshot of web contents covering a time window: SC_k = { all web objects valid within the time interval [t_k, t_{k+1}) }.
• Given the SCs, we maintain a set of inverted lists for each SC, and build a time hierarchy or a combined multi-version tree.
[Figure: index structures. Panel labels: B-Tree, Multi-version Tree. Snapshots SC1, SC2, …, SCK; terms w1, w2, …, wN; per-snapshot, per-term lists labeled PLSC1-w1, PLSC1-w2, …, PLSCK-w1.]

7. Scoring within a Temporal Context
• Relevancy scoring is based on the time at which a web page was archived.
• The same contents will have different relevancy scores when the temporal contexts are different (e.g., one version was archived several months before the other).
[Figure caption: Same contents, different archive dates, different scores.]

8. Search User Interface
[Figure: search user interface.]

9. Grouping Search Results
[Figure: result views. Panel titles: Grouped by Time; Grouped by URL (expanded); Grouped by URL (collapsed). Label on the ungrouped view: first page polluted by the same URL.]

10. Group-wide Scoring
• Grouping is good, but which group should be placed first on the result page?
• Simple method: use the average or the highest score among the members.
• More effective method: compute a relevancy score for the group as a whole.
  • Instead of tf(t), we use df(t), the document frequency of t in the group.
  • Instead of idf(t), we use igf(t), the inverse group frequency.
• We extend some of the best-known IR techniques to group ranking.
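The substitution described in Block 10 (df in place of tf, igf in place of idf) can be illustrated with a small sketch. The poster does not give the exact formula, so everything specific here is an assumption: the function name `group_scores`, the normalization of df by group size, and the smoothed form igf(t) = log(1 + G / gf(t)), where G is the number of groups and gf(t) the number of groups containing t.

```python
import math


def group_scores(groups, query_terms):
    """Score each group as the sum over query terms of df(t) * igf(t).

    groups: dict mapping a group id to a list of documents,
            where each document is a set of terms.
    The weighting below is an assumed tf-idf analogue, not the authors' formula.
    """
    n_groups = len(groups)

    # gf(t): number of groups in which at least one member contains t.
    gf = {
        term: sum(1 for docs in groups.values() if any(term in d for d in docs))
        for term in query_terms
    }

    scores = {}
    for gid, docs in groups.items():
        total = 0.0
        for term in query_terms:
            # df(t): number of documents in this group that contain t.
            df = sum(1 for d in docs if term in d)
            if df == 0:
                continue
            igf = math.log(1 + n_groups / gf[term])  # assumed smoothing
            total += (df / len(docs)) * igf  # assumed size normalization
        scores[gid] = total
    return scores
```

A group whose members consistently contain a term that is rare across groups scores highest under this sketch; groups could then be ordered on the result page by descending score.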