Thumbnail Summarization Techniques for Web Archives
The 36th European Conference on Information Retrieval, ECIR 2014, Amsterdam, Netherlands, 2014
Ahmed AlSum*, Stanford University Libraries, Stanford, CA, USA, aalsum@stanford.edu
Michael L. Nelson, Old Dominion University, Norfolk, VA, USA, mln@cs.odu.edu
*Ahmed AlSum did this work while he was a PhD student at Old Dominion University.
ECIR 2014 Amsterdam, Netherlands
What is a Web Archive? http://www.cs.odu.edu
Thumbnails in Web Archives: Internet Archive, UK Web Archive
Memento Terminology
• Original Resource (URI-R, R): http://www.amazon.com
• Memento (URI-M, M): http://web.archive.org/web/20110411070244/http://amazon.com
• TimeMap (URI-T, TM)
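As an illustration of the terminology above, the following sketch splits an Internet Archive URI-M into its capture datetime and its URI-R. It relies on the 14-digit YYYYMMDDhhmmss path segment visible in the example URI-M. The function name `split_uri_m` and the regular expression are assumptions made for this example, and other archives may lay out their URI-Ms differently.

```python
import re
from datetime import datetime

# Assumption: Internet Archive style URI-M, i.e. /web/<14-digit timestamp>/<URI-R>.
URI_M_PATTERN = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def split_uri_m(uri_m):
    """Return (capture datetime, URI-R) for an Internet Archive URI-M."""
    match = URI_M_PATTERN.match(uri_m)
    if match is None:
        raise ValueError("not an Internet Archive URI-M: " + uri_m)
    timestamp, uri_r = match.groups()
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S"), uri_r


captured, original = split_uri_m(
    "http://web.archive.org/web/20110411070244/http://amazon.com"
)
print(captured.isoformat(), original)
```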
Thumbnail Creation Challenges
• Scalability in Time
  • The Internet Archive (IA) may need 361 years to create a thumbnail for each memento using one hundred machines.
• Scalability in Space
  • IA will need 355 TB to store one thumbnail per memento.
• Page quality
Thumbnail Usage Challenges
• This is a partial view of the first 700 thumbnails out of 10,500 available mementos for www.apple.com
From 10,500 Mementos to 69 Thumbnails.
How many thumbnails do we need? www.unfi.com on the live Web
40 Thumbnails are good.
Methodology
Visual Similarity and Text Similarity (figure labels: Similar, Different, HTML Text)
Correlation between Visual Similarity and Text Similarity
• Text Similarity
  • SimHash
  • DOM Tree
  • Embedded resources
  • Memento Datetime (capture time)
• Visual Similarity
Text Similarity: SimHash
• Compute 64-bit SimHash fingerprints with k = 4 for two pages
  • Full HTML text ✔
  • The main content from the web page
  • All the text
  • Templates including the text
  • The template excluding the text
• Calculate the difference using Hamming distance
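The fingerprint-and-compare step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: whitespace tokenization, the MD5-derived 64-bit token hash, and the function names are all assumptions, and the slide's k = 4 parameter is not modeled here.

```python
import hashlib


def token_hash64(token):
    """Assumed token hash: the first 8 bytes of MD5, read as a 64-bit integer."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash64(text):
    """64-bit SimHash over whitespace-separated tokens (tokenization is an assumption)."""
    weights = [0] * 64
    for token in text.split():
        h = token_hash64(token)
        for bit in range(64):
            # Each token votes +1 for bits it sets and -1 for bits it clears.
            weights[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


page_a = "<html><body><h1>Welcome</h1><p>Fresh news today</p></body></html>"
page_b = "<html><body><h1>Welcome</h1><p>Fresh news tomorrow</p></body></html>"
print(hamming_distance(simhash64(page_a), simhash64(page_b)))
```

A small Hamming distance suggests the two captures carry nearly the same HTML text; identical inputs always give a distance of 0.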
Text Similarity: DOM Tree
• Transform each web page into a DOM tree
• Calculate the difference using Levenshtein distance
• Levenshtein distance: the number of insert, update, and delete operations needed to turn one into the other
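To make the edit-distance idea concrete, the sketch below computes the classic Levenshtein distance between two sequences. Flattening each page into its sequence of start tags is a simplification introduced here so the example stays short; it is not the tree edit distance of [Pawlik 2011], and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects start-tag names in document order (a flattened view of the DOM)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def levenshtein(a, b):
    """Minimum number of insert, update, and delete operations turning a into b."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        current = [i]
        for j, item_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # delete item_a
                current[j - 1] + 1,                    # insert item_b
                previous[j - 1] + (item_a != item_b),  # update, free if equal
            ))
        previous = current
    return previous[-1]


old_page = "<div><p>a</p></div>"
new_page = "<div><p>a</p><img src='x'></div>"
print(levenshtein(tag_sequence(old_page), tag_sequence(new_page)))
```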
Text Similarity: Embedded resources
• Extract the embedded resources for each page
• Calculate the total number of new resources that have been added and the resources that have been removed.
• For example, the difference between M1 and M2:
  • Addition of 5 resources (2 JavaScript files and 3 images)
  • Removal of 2 resources (1 JavaScript file and 1 image)
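The added/removed count can be sketched with set differences. The resource URLs below are invented placeholders chosen to reproduce the M1/M2 example from the slide, and `resource_difference` is a hypothetical helper name.

```python
def resource_difference(old_resources, new_resources):
    """Return (added, removed, total change) between two sets of resource URLs."""
    old_set, new_set = set(old_resources), set(new_resources)
    added = new_set - old_set
    removed = old_set - new_set
    return added, removed, len(added) + len(removed)


# Placeholder URLs (assumptions) mirroring the slide's example.
m1 = {"app.js", "old.js", "logo.png", "old.png"}
m2 = {"app.js", "logo.png", "new1.js", "new2.js", "img1.png", "img2.png", "img3.png"}

added, removed, total = resource_difference(m1, m2)
print(len(added), len(removed), total)  # 5 2 7
```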
Text Similarity: Memento datetime
• Calculate the difference, in seconds, between the capture times of the two pages.
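A minimal sketch of this feature, assuming the capture times are available as HTTP-date strings such as those carried in a Memento-Datetime response header. The function name `capture_gap_seconds` and the sample values are assumptions for illustration.

```python
from email.utils import parsedate_to_datetime


def capture_gap_seconds(datetime_a, datetime_b):
    """Absolute difference in seconds between two HTTP-date capture times."""
    first = parsedate_to_datetime(datetime_a)
    second = parsedate_to_datetime(datetime_b)
    return abs((second - first).total_seconds())


gap = capture_gap_seconds(
    "Mon, 11 Apr 2011 07:02:44 GMT",
    "Tue, 12 Apr 2011 07:02:44 GMT",
)
print(gap)  # 86400.0
```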
Visual Similarity
• Measurement: the number of different pixels between two thumbnails
• To compare two thumbnails:
  • Resize them into different dimensions: 64x64, 128x128, 256x256, and 600x600.
  • Calculate the Manhattan distance and Zero distance between each pair
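The two pixel distances can be sketched on small grayscale grids. "Zero distance" is read here as the count of pixel positions that differ, which matches the measurement stated on the slide; that reading, the nested-list image representation, and the function names are assumptions. Resizing is omitted because it would require an imaging library.

```python
def _pixel_pairs(image_a, image_b):
    """Yield aligned pixel pairs from two equally sized grayscale grids."""
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same height")
    for row_a, row_b in zip(image_a, image_b):
        if len(row_a) != len(row_b):
            raise ValueError("images must have the same width")
        yield from zip(row_a, row_b)


def manhattan_distance(image_a, image_b):
    """Sum of absolute intensity differences over all pixels."""
    return sum(abs(a - b) for a, b in _pixel_pairs(image_a, image_b))


def zero_distance(image_a, image_b):
    """Number of pixel positions whose values differ."""
    return sum(1 for a, b in _pixel_pairs(image_a, image_b) if a != b)


thumb_a = [[0, 10], [255, 7]]
thumb_b = [[0, 12], [250, 7]]
print(manhattan_distance(thumb_a, thumb_b), zero_distance(thumb_a, thumb_b))  # 7 2
```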
Correlation between Visual Similarity and Text Similarity
(figure panels: SimHash, DOM tree, Embedded resources, Memento Datetime)
SimHash [Charikar 2002], DOM tree [Pawlik 2011], Memento Datetime [Van de Sompel 2013]
Selection algorithms
Threshold Grouping
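The slides do not spell out the threshold grouping procedure, so the following is only one plausible reading: walk the TimeMap in capture order and start a new group, selecting a new thumbnail, whenever a memento's SimHash differs from the last selected one by more than a threshold. The function names, the comparison against the last selected memento, and the threshold value are all assumptions.

```python
def hamming_distance(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def threshold_grouping(fingerprints, threshold):
    """Return indices of mementos selected for thumbnails.

    fingerprints: SimHash values in capture order (assumed input).
    threshold: maximum Hamming distance still treated as the same group.
    """
    selected = []
    last_selected = None
    for index, fingerprint in enumerate(fingerprints):
        if last_selected is None or hamming_distance(fingerprint, last_selected) > threshold:
            selected.append(index)
            last_selected = fingerprint
    return selected


timemap = [0b0000, 0b0001, 0b1111, 0b1110, 0b0000]
print(threshold_grouping(timemap, threshold=2))  # [0, 2, 4]
```

Under this reading, the number of thumbnails is not fixed in advance; it follows from how often the page changes by more than the threshold.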
Clustering technique
• Input:
  • TimeMap with n mementos
  • A set of features, for example F = {SimHash, Memento-Datetime}
• Task:
  • Cluster n mementos into K clusters.
Clustering technique
(figure panels: SimHash and Datetime features; SimHash feature)
Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2, Part 2), 3336–3341.
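A compact K-medoids sketch in the spirit of the cited algorithm, alternating between assigning items to their nearest medoid and re-picking each cluster's medoid. The evenly spaced initial medoids are a simplification chosen here and are not the initialization described by Park and Jun; the function names and the one-dimensional toy data are likewise assumptions. With mementos, `distance` could be a Hamming distance between SimHash fingerprints.

```python
def assign(items, medoids, distance):
    """Map each medoid index to the indices of the items closest to it."""
    clusters = {m: [] for m in medoids}
    for i, item in enumerate(items):
        nearest = min(medoids, key=lambda m: distance(item, items[m]))
        clusters[nearest].append(i)
    return clusters


def k_medoids(items, k, distance, max_iter=100):
    """Return (sorted medoid indices, clusters) for k clusters."""
    n = len(items)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of items")
    # Assumption: seed with k evenly spaced items from the ordered input.
    medoids = [i * n // k for i in range(k)]
    for _ in range(max_iter):
        clusters = assign(items, medoids, distance)
        updated = []
        for medoid, members in clusters.items():
            if not members:
                updated.append(medoid)  # keep a medoid that attracted no items
                continue
            # New medoid: the member with the smallest total distance to the others.
            updated.append(min(
                members,
                key=lambda c: sum(distance(items[c], items[j]) for j in members),
            ))
        updated.sort()
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(items, medoids, distance)


values = [1, 2, 3, 100, 101, 102]
medoids, clusters = k_medoids(values, k=2, distance=lambda a, b: abs(a - b))
print(medoids, clusters)
```

Each medoid is an actual memento, so it can serve directly as the representative thumbnail for its cluster.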
Time Normalization
Selection Algorithms Comparison
Generalization outside the Web Archive
• Get k thumbnails from a website that has n pages
Conclusions
• We explored the similarity between the text and the visual appearance of web pages.
• We found that SimHash and Levenshtein distance have the highest correlation with visual similarity.
• We presented three algorithms to select k thumbnails from n mementos per TimeMap.
aalsum@stanford.edu @aalsum