Temporal Spread In Archived Composite Resources (work in progress)

WADL 2013 July 25 – 26, 2013 Indianapolis, Indiana USA. Temporal Spread In Archived Composite Resources (work in progress). Scott G. Ainsworth Michael L. Nelson Old Dominion University Computer Science. Contents. Motivation Related work Preliminary work Temporal Spread Future work

Presentation Transcript


  1. WADL 2013 July 25–26, 2013 Indianapolis, Indiana USA Temporal Spread In Archived Composite Resources (work in progress) Scott G. Ainsworth Michael L. Nelson Old Dominion University Computer Science

  2. Contents • Motivation • Related work • Preliminary work • Temporal Spread • Future work • Conclusion

  3. A Fable from Wayback

  4. Temporal Spread 2005-05-14 01:36:08 +9 days +7 months +18 days +18 days +2.1 years

  5. Questions • How much temporal spread exists in composite mementos? • How can temporal spread be minimized? • What factors contribute, positively or negatively, to spread? • Does combining multiple archives produce better results? • Would users with differing goals benefit from different minimization policies and heuristics? • How can temporal coherence be displayed to users simply?

  6. Contents • Motivation • Related work • Preliminary work • Temporal Spread • Future work • Conclusion

  7. Related Work • Control Crawl Data Quality, Future collections • Spaniol et al. – crawling strategy • Denev et al. – change rates by MIME type and depth • Ben Saad et al. – metadata from crawl used to select best results from archive • Our Focus: Existing Data Quality • Existing collections • Datetime selection policies

  8. Related Work • Use Patterns • AlNoamany et al. – Archive Access Patterns • Humans vs. Robots • Dip, dive, slide, & skim • Identifying Duplicates • Simple identity – images, other binary formats • Direct comparison • Hash comparison • HTML, CSS (text) • Shingling, Jaccard distances, etc. • SimHash (most promising)
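
The shingling and Jaccard-distance approach named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the word-level shingle size `k = 3` and the function names `shingles` and `jaccard_distance` are assumptions chosen for the example.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (as tuples) found in text.

    k = 3 is an illustrative choice, not a value taken from the slides.
    """
    words = text.split()
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


doc_a = "the quick brown fox jumps"
doc_b = "the quick brown fox sleeps"
distance = jaccard_distance(shingles(doc_a), shingles(doc_b))
```

A distance of 0 means the two texts share every shingle; a distance of 1 means they share none. Any threshold for calling two mementos "equivalent" would have to be chosen empirically.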

  9. Related work – Memento* • HTTP extension for datetime negotiation • Request: GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1 … Accept-Datetime: Sat, 10 May 2005 11:21:00 GMT … • Response: HTTP/1.1 200 OK … Memento-Datetime: Sat, 14 May 2005 01:36:08 GMT … *https://datatracker.ietf.org/doc/draft-vandesompel-memento/
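
As a rough illustration of the header exchange on this slide, the sketch below formats a target datetime for `Accept-Datetime` and measures how far a returned `Memento-Datetime` lies from that target. It uses only the Python standard library and makes no network request; the helper names `accept_datetime_header` and `memento_offset` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(target: datetime) -> str:
    """Format an aware datetime as an HTTP-date suitable for Accept-Datetime."""
    return format_datetime(target.astimezone(timezone.utc), usegmt=True)


def memento_offset(target: datetime, memento_datetime_header: str) -> timedelta:
    """Return (memento datetime - target) from a Memento-Datetime header value."""
    return parsedate_to_datetime(memento_datetime_header) - target


# Target and response values are the ones shown on the slide.
target = datetime(2005, 5, 10, 11, 21, 0, tzinfo=timezone.utc)
header_value = accept_datetime_header(target)
offset = memento_offset(target, "Sat, 14 May 2005 01:36:08 GMT")
```

The offset is positive when the archive returns a memento captured after the requested datetime, as in the slide's example.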

  10. Contents • Motivation • Related work • Preliminary work • How much of the Web is archived • Temporal Drift • Temporal Spread • Future work • Conclusion

  11. How much is Archived? (JCDL’11) • 35–90%: at least one archived copy • 17–49%: 2–5 copies • 1–8%: 6–10 copies • 8–63%: more than 10 copies • Chart legend: Internet Archive, Search Engine, Other

  12. Contents • Motivation • Related work • Preliminary work • How much of the Web is archived • Temporal Drift • Temporal Spread • Future work • Conclusion

  13. Temporal Drift • Comparing two policies • Sliding – target datetime changes • Sticky – target datetime held steady
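
The difference between the two policies can be sketched with a small simulation. This is an illustration under stated assumptions, not the authors' experimental code: the archive holdings in `page_a` and `page_b` are hypothetical (built from datetimes that appear in the deck), selection is assumed to be "closest memento to the target", and the walk is assumed to go from one page to a second page and back.

```python
from datetime import datetime


def closest(mementos, target):
    """Pick the memento datetime nearest to the target datetime."""
    return min(mementos, key=lambda m: abs(m - target))


def walk(pages, start, sticky):
    """Follow a sequence of pages and return the memento chosen at each step.

    sticky=True  : the target stays at the original start datetime.
    sticky=False : the target slides to the datetime of the last chosen memento.
    """
    target = start
    chosen_per_step = []
    for mementos in pages:
        chosen = closest(mementos, target)
        chosen_per_step.append(chosen)
        if not sticky:
            target = chosen
    return chosen_per_step


# Hypothetical holdings for two pages.
page_a = [datetime(2005, 3, 31, 9, 16, 10), datetime(2005, 5, 14, 1, 36, 8)]
page_b = [datetime(2005, 4, 22, 0, 17, 52)]
start = datetime(2005, 5, 14, 1, 36, 8)

sliding = walk([page_a, page_b, page_a], start, sticky=False)
sticky = walk([page_a, page_b, page_a], start, sticky=True)
drift_sliding = [abs(m - start) for m in sliding]
drift_sticky = [abs(m - start) for m in sticky]
```

With these holdings the sliding walk ends on the 2005-03-31 memento, while the sticky walk returns to the 2005-05-14 memento, so drift accumulates only under the sliding policy.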

  14. SLIDING TARGET 2005-05-14 01:36:08

  15. SLIDING TARGET 2005-04-22 00:17:52

  16. SLIDING TARGET 2005-03-31 09:16:10

  17. Temporal Drift • What we Expected: 2005-05-14 @ 01:36:08 • What we Got: 2005-03-31 @ 09:16:10

  18. Sticky Target • What if the target is held steady? • (Enabled by Memento API)

  19. STICKY TARGET 2005-05-14 01:36:08 2005-05-14 MementoFox Extension

  20. STICKY TARGET 2005-04-22 00:17:52

  21. STICKY TARGET 2005-05-14 01:36:08

  22. Drift Comparison

  23. Median Drift by Step (JCDL’13) [Chart: median drift in months by step number, sliding vs. sticky]

  24. Contents • Motivation • Related work • Preliminary work • How much of the Web is archived • Temporal Drift • Temporal Spread • Future work • Conclusion

  25. Temporal Spread

  26. Composite Memento Presentation Structure

  27. Temporal Spread 2005-05-14 01:36:08 +9 days +7 months +18 days +18 days +2.1 years
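
One way to turn the offsets on this slide into a single number is sketched below. It assumes that "temporal spread" means the gap between the earliest and latest memento datetime in the composite memento, which the slide does not state explicitly. The day counts used for "+7 months" (210 days) and "+2.1 years" (767 days) are rough conversions made up for this example, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def offsets_from_root(root, embedded):
    """Return each embedded memento's datetime relative to the root memento."""
    return [dt - root for dt in embedded]


def temporal_spread(root, embedded):
    """Assumed definition: latest minus earliest datetime in the composite memento."""
    all_datetimes = [root, *embedded]
    return max(all_datetimes) - min(all_datetimes)


root = datetime(2005, 5, 14, 1, 36, 8)  # root memento datetime from the slide
# Approximate offsets: 7 months taken as 210 days, 2.1 years as 767 days (assumptions).
approx_offsets_days = [9, 210, 18, 18, 767]
embedded = [root + timedelta(days=d) for d in approx_offsets_days]

offsets = offsets_from_root(root, embedded)
spread = temporal_spread(root, embedded)
```

Under these assumptions the spread is dominated by the single embedded resource captured roughly 2.1 years after the root.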

  28. Embedded Resources

  29. Representing Spread Composite Memento Temporal Spread Chart Root Embedded Diff. Domain Reused

  30. Temporal Spread – ODU CS

  31. First Experiment • 1,000 URIs from DMOZ (Open Directory) • Download all timemaps • Download all composite mementos • Download all embedded resources • Single and Multiple Archives • Four Heuristics

  32. Preliminary Results

  33. Single/Multi & Heuristics

  34. Temporal Coherence 1 Memento, Bracketed Root

  35. Temporal Coherence 1 Memento, Bracketed Root

  36. Temporal Coherence 1 Memento, Bracketed Root

  37. Temporal Coherence 1 Memento, Root Not Bracketed

  38. Temporal Coherence 1 Memento, Root Not Bracketed

  39. Temporal Coherence 1 Memento, No Last-Modified

  40. Temporal Coherence 1 Memento, Before Root

  41. Temporal Coherence 2 Mementos, Root Not Bracketed

  42. Temporal Coherence 2 Mementos, Root Not Bracketed

  43. Temporal Coherence 2 Mementos, Use Content – Similarity

  44. Temporal Coherence 2 Mementos, Contents Equal or Equivalent

  45. Temporal Coherence 2 Mementos, Contents Not Equal or Equivalent

  46. Current Experiment • 4,000 URIs from JCDL’11 “How Much…” paper • 1 URI/month rather than all • Temporal coherence patterns • Target WSDM 2013

  47. Current Experiment

  48. Contents • Motivation • Related work • Preliminary work • Temporal Spread • Future work • Conclusion

  49. Future Work • Timemaps, Redirection, Missing Mementos • Timemaps only tell part of the story • URI-R redirection (302 from source) • URI-M redirection (Archive action) • Mementos in timemaps but not accessible • Policies must consider user needs • Leave it missing • Show “best” substitute

  50. Future Work • Similarity & Duplication • Deltas are currently |root – embedded| • If bracketing mementos are identical, should delta be zero? • HTML is usually modified by the archive • Can’t check for equality • Shingling? SimHash? [Diagram labels: +30d, –30d, 0]
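
The question on this slide can be made concrete with a small sketch. It shows one possible answer, not a settled one: delta is |root – embedded| by default, but is taken as zero when two byte-identical mementos of the embedded resource bracket the root. The function name `delta`, the tuple layout, and the use of SHA-256 for the equality check are assumptions; as the slide notes, archive-modified HTML would need a similarity measure instead of strict equality.

```python
import hashlib
from datetime import datetime, timedelta


def delta(root, before, after):
    """Return the temporal delta between a root memento and an embedded resource.

    before / after are (memento_datetime, content_bytes) pairs for the embedded
    resource. If they bracket the root and their content hashes match, the
    resource is assumed unchanged across the root's datetime, so delta is zero.
    """
    (t_before, c_before), (t_after, c_after) = before, after
    brackets_root = t_before <= root <= t_after
    same_content = hashlib.sha256(c_before).digest() == hashlib.sha256(c_after).digest()
    if brackets_root and same_content:
        return timedelta(0)
    # Otherwise fall back to the current definition: |root - embedded| for the nearer memento.
    return min(abs(root - t_before), abs(root - t_after))


root = datetime(2005, 5, 14, 1, 36, 8)
earlier = root - timedelta(days=30)
later = root + timedelta(days=30)

unchanged = delta(root, (earlier, b"logo-bytes"), (later, b"logo-bytes"))
changed = delta(root, (earlier, b"logo-v1"), (later, b"logo-v2"))
```

With mementos 30 days either side of the root, identical content yields a delta of zero, while differing content yields 30 days.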
