WADL 2013 July 25 – 26, 2013 Indianapolis, Indiana USA. Temporal Spread In Archived Composite Resources (work in progress). Scott G. Ainsworth Michael L. Nelson Old Dominion University Computer Science. Contents. Motivation Related work Preliminary work Temporal Spread Future work
WADL 2013 July 25–26, 2013 Indianapolis, Indiana USA Temporal Spread in Archived Composite Resources (work in progress) Scott G. Ainsworth Michael L. Nelson Old Dominion University Computer Science
Contents • Motivation • Related work • Preliminary work • Temporal Spread • Future work • Conclusion Scott G. Ainsworth • Michael L. Nelson
A Fable from Wayback
Temporal Spread • Root: 2005-05-14 01:36:08 • Embedded resources: +9 days, +7 months, +18 days, +18 days, +2.1 years
Questions • How much temporal spread exists in composite mementos? • How can temporal spread be minimized? • What factors contribute, positively or negatively, to spread? • Does combining multiple archives produce better results? • Would users with differing goals benefit from different minimization policies and heuristics? • How can temporal coherence be displayed to users simply?
Related Work • Control Crawl Data Quality, Future collections • Spaniol et al. – crawling strategy • Denev et al. – change rates by MIME type and depth • Ben Saad et al. – metadata from crawl used to select best results from archive • Our Focus: Existing Data Quality • Existing collections • Datetime selection policies
Related Work • Use Patterns • AlNoamany et al. – Archive Access Patterns • Humans vs. Robots • Dip, dive, slide, & skim • Identifying Duplicates • Simple identity – images, other binary formats • Direct comparison • Hash comparison • HTML, CSS (text) • Shingling, Jaccard distances, etc. • SimHash (most promising)
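The shingling and Jaccard-distance approach listed for text formats can be illustrated with a minimal sketch. The function names, the shingle size of 3 words, and the sample strings are all assumptions made for illustration; this is not the authors' implementation.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance: 1 - |A ∩ B| / |A ∪ B| (0.0 means identical sets)."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


# Hypothetical page text: the second copy differs only by an added prefix,
# standing in for text an archive might inject into a page.
original = "Old Dominion University Computer Science department home page"
archived = "Archived copy " + original

same = jaccard_distance(shingles(original), shingles(original))
near = jaccard_distance(shingles(original), shingles(archived))
print(same, near)
```

A small distance suggests near-duplicate text even when a byte-for-byte or hash comparison would fail.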
Related work – Memento* • HTTP extension for datetime negotiation • Request: GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1 … Accept-Datetime: Tue, 10 May 2005 11:21:00 GMT … • Response: HTTP/1.1 200 OK … Memento-Datetime: Sat, 14 May 2005 01:36:08 GMT … *https://datatracker.ietf.org/doc/draft-vandesompel-memento/
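As a rough illustration of the datetime negotiation, the sketch below builds an Accept-Datetime header value and parses a Memento-Datetime value using only the Python standard library. No request is sent; the response value is copied from the slide, and the dictionary names are assumptions. The standard library derives the weekday name from the date itself.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Desired datetime from the request example on the slide.
target = datetime(2005, 5, 10, 11, 21, 0, tzinfo=timezone.utc)

# usegmt=True renders the zone as "GMT", the form used in HTTP date headers.
request_headers = {"Accept-Datetime": format_datetime(target, usegmt=True)}

# Response header value copied from the slide (no network call is made).
response_headers = {"Memento-Datetime": "Sat, 14 May 2005 01:36:08 GMT"}
memento_dt = parsedate_to_datetime(response_headers["Memento-Datetime"])

# How far the returned memento is from the requested datetime.
offset = memento_dt - target
print(request_headers["Accept-Datetime"], offset)
```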
Contents • Motivation • Related work • Preliminary work • How much of the Web is archived • Temporal Drift • Temporal Spread • Future work • Conclusion
How Much Is Archived? (JCDL’11) • Sources: Internet Archive, Search Engine, Other • 35–90%: at least one archived copy • 17–49%: 2–5 copies • 1–8%: 6–10 copies • 8–63%: more than 10 copies
Temporal Drift • Comparing two policies • Sliding – target datetime changes • Sticky – target datetime held steady
SLIDING TARGET 2005-05-14 01:36:08
SLIDING TARGET 2005-04-22 00:17:52
SLIDING TARGET 2005-03-31 09:16:10
Temporal Drift • What we expected: 2005-05-14 @ 01:36:08 • What we got: 2005-03-31 @ 09:16:10
Sticky Target • What if the target is held steady? • (Enabled by Memento API)
STICKY TARGET 2005-05-14 01:36:08 (MementoFox Extension, target date 2005-05-14)
STICKY TARGET 2005-04-22 00:17:52
STICKY TARGET 2005-05-14 01:36:08
Drift Comparison
Median Drift by Step (JCDL’13) • Figure: median drift (months) versus step number for the Sliding and Sticky policies
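The difference between the two policies can be sketched with a small simulation. The timemaps below are hypothetical lists built from the datetimes shown on the Sliding Target and Sticky Target slides; `nearest` and `walk` are illustrative names, and picking the closest memento is an assumed selection rule, not necessarily what an archive does.

```python
from datetime import datetime


def nearest(mementos, target):
    """Return the memento datetime closest to the target datetime."""
    return min(mementos, key=lambda m: abs(m - target))


def walk(timemaps, start, sticky):
    """Visit one page per step; return the memento datetime chosen at each step."""
    target, chosen = start, []
    for mementos in timemaps:
        pick = nearest(mementos, target)
        chosen.append(pick)
        if not sticky:
            target = pick  # sliding: the last memento becomes the new target
    return chosen


# Hypothetical timemaps for three pages visited in sequence.
start = datetime(2005, 5, 14, 1, 36, 8)
timemaps = [
    [datetime(2005, 5, 14, 1, 36, 8)],
    [datetime(2005, 4, 22, 0, 17, 52)],
    [datetime(2005, 3, 31, 9, 16, 10), datetime(2005, 5, 14, 1, 36, 8)],
]

sliding = walk(timemaps, start, sticky=False)
sticky = walk(timemaps, start, sticky=True)

# Drift at the last step: distance between the chosen memento and the start target.
sliding_drift = abs(sliding[-1] - start)
sticky_drift = abs(sticky[-1] - start)
print(sliding_drift, sticky_drift)
```

In this toy data the sliding walk ends at 2005-03-31 while the sticky walk returns to 2005-05-14, mirroring the sequence on the slides.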
Temporal Spread
Composite Memento Presentation Structure
Temporal Spread • Root: 2005-05-14 01:36:08 • Embedded resources: +9 days, +7 months, +18 days, +18 days, +2.1 years
Embedded Resources
Representing Spread • Composite memento alongside its temporal spread chart • Chart legend: Root, Embedded, Diff. Domain, Reused
Temporal Spread – ODU CS
First Experiment • 1,000 URIs from DMOZ (Open Directory) • Download all timemaps • Download all composite mementos • Download all embedded resources • Single and Multiple Archives • Four Heuristics
Preliminary Results
Single/Multi & Heuristics
Temporal Coherence 1 Memento, Bracketed Root
Temporal Coherence 1 Memento, Root Not Bracketed
Temporal Coherence 1 Memento, No Last-Modified
Temporal Coherence 1 Memento, Before Root
Temporal Coherence 2 Mementos, Root Not Bracketed
Temporal Coherence 2 Mementos, Use Content – Similarity
Temporal Coherence 2 Mementos, Contents Equal or Equivalent
Temporal Coherence 2 Mementos, Contents Not Equal or Equivalent
Current Experiment • 4,000 URIs from JCDL’11 “How Much…” paper • 1 URI/month instead of all • Temporal coherence patterns • Target WSDM 2013
Current Experiment
Future Work • Timemaps, Redirection, Missing Mementos • Timemaps only tell part of the story • URI-R redirection (302 from source) • URI-M redirection (Archive action) • Mementos in timemaps but not accessible • Policies must consider user needs • Leave it missing • Show “best” substitute
Future Work • Similarity & Duplication • Deltas are currently |root – embedded| • If bracketing mementos are identical, should delta be zero? • HTML is usually modified by the archive • Can’t check for equality • Shingling? SimHash? • Figure labels: +30d, –30d, 0
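The delta question can be made concrete with a short sketch. `delta` follows the |root – embedded| definition on the slide; `bracketed_delta` is a hypothetical rule showing one possible answer to the open question, and the ±30-day values are illustrative.

```python
from datetime import datetime, timedelta


def delta(root, embedded):
    """Definition from the slide: |root - embedded|."""
    return abs(root - embedded)


def bracketed_delta(root, before, after, same_content):
    """Hypothetical rule: zero delta when identical mementos bracket the root."""
    if before <= root <= after and same_content:
        return timedelta(0)
    # Otherwise fall back to the nearer of the two bracketing mementos.
    return min(delta(root, before), delta(root, after))


root = datetime(2005, 5, 14, 1, 36, 8)
before = root - timedelta(days=30)  # embedded memento captured 30 days earlier
after = root + timedelta(days=30)   # embedded memento captured 30 days later

print(delta(root, before))
print(bracketed_delta(root, before, after, same_content=True))
print(bracketed_delta(root, before, after, same_content=False))
```

Deciding `same_content` is the hard part for HTML, since archives usually rewrite it; a similarity measure such as shingling or SimHash would have to stand in for equality.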