Efficient Web Browsing on Handheld Devices Using Page and Form Summarization. Orkut Buyukkokten, Oliver Kaljuvee, Hector Garcia-Molina, Andreas Paepcke and Terry Winograd (2002). An Overview by Shrenik Sadalgi.
Efficient Web Browsing on Handheld Devices Using Page and Form Summarization Orkut Buyukkokten, Oliver Kaljuvee, Hector Garcia-Molina, Andreas Paepcke and Terry Winograd (2002) An Overview by Shrenik Sadalgi
Introduction
• A new approach for summarizing and browsing Web pages:
  - Page summarization: each Web page is broken into text units that can each be hidden, partially displayed, made fully visible, or summarized
  - Form summarization: HTML forms are also summarized by displaying just the text labels that prompt the user for input
Page Summarization
• ‘Macro-level’ summarization by structural analysis of Web pages
  - expand & contract pages based on their relative structural nesting
• ‘Micro-level’ summarization uses information retrieval techniques to outline portions of the text for the user
Page Summarization – Macro Level
• Partition page into ‘Semantic Textual Units’ (STUs)
  - STUs are page fragments – DOM elements
• The proxy uses font and other structural information to identify a hierarchy of STUs
  - Nesting of STUs
• Does not require special formatting at the Web sources
  - a significant advantage of this approach over schemes that rely on pages being specially structured for PDAs
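The nesting of STUs can be pictured as a small tree in which each node is either contracted or expanded. The sketch below is only an illustration of that idea: the `STU` class, the `expanded` flag, and the `+`/`-` markers are assumptions made for this example, not the data structure used by the authors' proxy.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class STU:
    """Hypothetical Semantic Textual Unit: a text fragment with nested child STUs."""
    text: str
    children: List["STU"] = field(default_factory=list)
    expanded: bool = False  # assumption: contracted by default


def render(stu, depth=0):
    """Return the visible lines of the STU tree, indented by nesting depth.

    A contracted STU that still has hidden children is marked '+';
    leaves and expanded STUs are marked '-'.
    """
    marker = "-" if stu.expanded or not stu.children else "+"
    lines = ["  " * depth + marker + " " + stu.text]
    if stu.expanded:
        for child in stu.children:
            lines.extend(render(child, depth + 1))
    return lines
```

Setting `expanded = True` on a node reveals one more level of the hierarchy, which mirrors the expand and contract interaction described on the slide.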
Page Summarization – Micro Level
• In this two-level approach to Web browsing, users can initially get a good high-level overview of a Web page, and then “zoom into” the most relevant portions.
• Simple to implement
• Effectiveness is limited
  - the first sentence of a paragraph is not necessarily the best representation
• Five methods for micro-level summarization
Extracting Keywords
• Evaluate each word’s importance
  - a word is important if it occurs frequently within the text and infrequently in the larger collection
• $W_{ij} = tf_{ij} \cdot \log_2(N/n)$, where
  - $W_{ij}$: weight of term $T_j$ in document $D_i$
  - $tf_{ij}$: frequency of term $T_j$ in document $D_i$
  - $N$: number of documents in the collection
  - $n$: number of documents where term $T_j$ occurs at least once
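The weighting formula above can be tried out on a toy collection. This is a minimal sketch, assuming whitespace tokenization and lowercase normalization; the function names `tfidf_weights` and `top_keywords` are made up for this example and do not come from the paper.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Return one {term: weight} dict per document, with W_ij = tf_ij * log2(N / n)."""
    # Assumption: naive whitespace tokenization, case-insensitive.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)  # N: number of documents in the collection
    # n: number of documents in which each term occurs at least once
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # tf_ij: frequency of each term in this document
        weights.append({t: tf[t] * math.log2(n_docs / doc_freq[t]) for t in tf})
    return weights


def top_keywords(doc_weights, k=3):
    """Return the k highest-weighted terms of one document (ties broken alphabetically)."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

For example, in a four-document collection a term that appears twice in one document and nowhere else gets weight 2 * log2(4/1) = 4, while a term that appears in every document gets weight 0.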
Extracting a Summary Sentence
• Each sentence (S) in an STU is assigned a significance factor
  - the S with the highest significance factor becomes the summary sentence
• Mark all the significant words in S
  - a word is significant if its TF/IDF weight is higher than a previously chosen weight cutoff ‘W’
• Find all “clusters” in S
  - the sequence starts and ends with a significant word
  - fewer than ‘D’ (distance cutoff) insignificant words must separate any two neighboring significant words within the sequence
• Add the weights of all significant words within a cluster and divide by the total number of words within the cluster
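The cluster procedure above can be sketched in a few lines. Two points in this sketch are assumptions rather than statements from the slide: a sentence with several clusters takes the score of its best cluster, and sentences are tokenized by whitespace. The parameter names `weight_cutoff` and `distance_cutoff` stand in for ‘W’ and ‘D’.

```python
def significance_factor(words, weights, weight_cutoff, distance_cutoff):
    """Score one sentence (a list of words) by its best cluster of significant words."""
    # Positions of significant words: weight strictly above the cutoff W.
    sig = [i for i, w in enumerate(words) if weights.get(w, 0.0) > weight_cutoff]
    if not sig:
        return 0.0
    best = 0.0
    start = prev = sig[0]
    total = weights[words[start]]
    for pos in sig[1:]:
        gap = pos - prev - 1  # insignificant words between two neighbors
        if gap < distance_cutoff:
            total += weights[words[pos]]  # same cluster continues
        else:
            # Close the current cluster: summed weights / total words in the cluster.
            best = max(best, total / (prev - start + 1))
            start = pos
            total = weights[words[pos]]
        prev = pos
    # Assumption: the sentence score is the maximum over its clusters.
    return max(best, total / (prev - start + 1))


def summary_sentence(sentences, weights, weight_cutoff=1.0, distance_cutoff=2):
    """Return the sentence with the highest significance factor."""
    return max(
        sentences,
        key=lambda s: significance_factor(
            s.lower().split(), weights, weight_cutoff, distance_cutoff
        ),
    )
```

Note that a larger distance cutoff merges more words into one cluster, which raises the summed weight but also the word count in the denominator, so the two cutoffs need to be tuned together.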
Results Task completion times for all methods and all tasks
Results I/O activity required for all methods over all tasks
Results Average completion time for each method across all tasks
Form Summarization Process
• Algorithms for finding a matching label for form input fields from form text
• Chunk Partitioning
  - chunks are small pieces of HTML code that are delimited by HTML tags (not the same as STUs)
• Label Matching
  - N-Gram: “ants” and “grants”
  - Letter/Word: “First Name” and “Fname”
  - Word/Letter: “PhoneWork” and “PhoneW”
  - Substring: “Password” and “pwd”
  - NULL: the algorithm takes the name of the input tag
• Tables
• Previous / Following
• check_box_label
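To make the label-matching idea concrete, here is a rough sketch of two of the listed comparisons. The slide does not give the exact matching rules, so everything here is an assumption for illustration: character bigrams scored with a Dice coefficient for the N-Gram match, an in-order letter check for the “Password”/“pwd” case, a threshold of 0.3, and a fallback to the input field's own name in the spirit of the NULL algorithm.

```python
def ngrams(text, n=2):
    """Set of character n-grams of the text, ignoring case and non-alphanumerics."""
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Dice coefficient over character n-grams (assumed scoring, not the paper's)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def is_subsequence(short, long):
    """True if the letters of `short` appear in `long` in the same order."""
    remaining = iter(long.lower())
    return all(ch in remaining for ch in short.lower())


def best_label(field_name, candidate_labels, threshold=0.3):
    """Pick the candidate label most similar to the field name.

    Falls back to the field name itself when nothing scores above the
    (hypothetical) threshold, similar in spirit to the NULL algorithm.
    """
    scored = [(ngram_similarity(field_name, lab), lab) for lab in candidate_labels]
    score, label = max(scored, default=(0.0, None))
    return label if score >= threshold else field_name
```

With bigrams, “ants” and “grants” share three of their 3 + 5 bigrams, giving a similarity of 0.75, and `is_subsequence("pwd", "Password")` holds because p, w and d occur in that order.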
Results
• Each Algorithm Matching 115 Input Fields
• Matching Performance for Algorithm Combination over 330 Input Elements