Learning to remove Internet advertisements. Nicholas Kushmerick Department of Computer Science, University College Dublin, Ireland. Presented by Bo Zhang Department of Computer Science Michigan Technological University.
Overview • Background • Introduction of ADEATER • Design of ADEATER • Evaluation • Related Work • Conclusion and Future Work
Background • Negative impact of advertisement images on the Internet • Slow down browsing • Consume computer resources • Impose extra costs on users
Introduction of ADEATER • Definition: - A browsing assistant that automatically removes advertisement images from Internet pages. • Property: - Its removal rules are generated by a learning algorithm rather than written by hand
Introduction of ADEATER • Examples
Design of ADEATER • System Architecture
Design of ADEATER • Encoding instance • Fixed-width feature vector • An image enclosed in an anchor tag <A> is a candidate advertisement • Geometric features of an image: - Height <IMG height=90> - Width <IMG width=90> - Aspect ratio (ratio of width to height) • Local feature: - Whether the destination URL and the image URL are in the same Internet domain
Destination URL | Image URL | Same domain?
www.ee.mtu.edu/page.html | www.cs.mtu.edu/image.jpg | Yes
www.dell.com/notebook.html | www.cs.mtu.edu/image.jpg | No
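The geometric and local features on this slide can be sketched in a few lines of Python. This is only an illustration, not ADEATER's code: the function names are hypothetical, and the sketch assumes that "same Internet domain" means the last two labels of the host name match, which is what the mtu.edu example suggests.

```python
from urllib.parse import urlparse

def registered_domain(url: str) -> str:
    """Assumption: the 'domain' is the last two labels of the host name."""
    if "//" not in url:
        url = "//" + url  # lets urlparse find the host when no scheme is given
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def encode_geometry_and_local(height, width, dest_url, img_url):
    """Return the geometric and local features for one candidate image."""
    # Height or width may be missing from the <IMG> tag; keep the ratio unknown then.
    aspect = width / height if height and width else None
    return {
        "height": height,
        "width": width,
        "aspect_ratio": aspect,
        "same_domain": registered_domain(dest_url) == registered_domain(img_url),
    }

features = encode_geometry_and_local(
    90, 90, "www.ee.mtu.edu/page.html", "www.cs.mtu.edu/image.jpg"
)
```

A missing height or width is kept as `None` rather than guessed, since the learner is expected to tolerate missing features.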
Design of ADEATER • Encoding instance • Fixed-width feature vector • Caption feature: - Phrases occurring in the enclosing <A> tag, with phrase length at most K and phrase count at least M - K is the maximum phrase length - M is the minimum phrase count • Alt feature: - Set of "alternate" words in the <IMG> tag (<IMG alt="ad">), with phrase length at most K and phrase count at least M - K is the maximum phrase length - M is the minimum phrase count
Design of ADEATER • Encoding instance • Fixed-width feature vector • Ubase, Udest, Uimg - Phrases occurring in the base URL, destination URL, and image URL, with phrase length at most K and phrase count at least M - K is the maximum phrase length - M is the minimum phrase count • Stop list - Low-information terms (“http”, “www”, “jpg”, etc.)
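A minimal sketch of how phrase features with a maximum length K, a minimum count M, and a stop list might be built. Every name here is hypothetical, and several choices are assumptions rather than details from the slides: words are split on non-alphanumeric characters, stop words are dropped before phrases are formed, a phrase's count is the number of texts it appears in, and the stop list contains only the three terms the slide names.

```python
import re
from collections import Counter

# Only the terms named on the slide; the slide's "etc." is not filled in.
STOP_WORDS = {"http", "www", "jpg"}

def phrases(text, k):
    """All phrases of 1..k consecutive words, joined with '+'."""
    words = [w for w in re.split(r"[^a-z0-9]+", text.lower())
             if w and w not in STOP_WORDS]
    return {"+".join(words[i:i + n])
            for n in range(1, k + 1)
            for i in range(len(words) - n + 1)}

def build_vocabulary(texts, k, m):
    """Keep phrases that occur in at least m of the texts."""
    counts = Counter()
    for text in texts:
        counts.update(phrases(text, k))
    return sorted(p for p, c in counts.items() if c >= m)

def encode(text, vocabulary, k):
    """Fixed-width 0/1 vector: one position per vocabulary phrase."""
    present = phrases(text, k)
    return [1 if p in present else 0 for p in vocabulary]

urls = [
    "http://ads.example.com/redirect.cgi?x",
    "http://www.example.com/redirect.cgi",
    "http://www.example.org/home.html",
]
vocab = build_vocabulary(urls, k=2, m=2)
vector = encode(urls[2], vocab, k=2)
```

The vocabulary is fixed once from the training texts, which is what makes every encoded instance the same width.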
Design of ADEATER • Encoding instance • Samples of HTML page
Design of ADEATER • Encoding of samples
Design of ADEATER • Encoding of samples (cont)
Design of ADEATER • Gathering examples • AD samples are generated by the ADGRABBER browsing assistant • Identify candidate advertisements • Generate vector encodings • NON-AD samples are generated by a custom-built Internet spider • Extract images from randomly generated URLs
Design of ADEATER • Learning rules • Algorithm - C4.5 decision tree learning algorithm • Properties - Quick on-line execution of the classifier - Not overly sensitive to missing features or noise - Scales well and is insensitive to irrelevant features • Examples of rules - If the aspect ratio > 4.5833, alt does not contain “to” but does contain “click+here”, and Udest does not contain “http+www”, then the instance is an AD - If Ubase does not contain “messier” and Udest contains “redirect+cgi”, then the instance is an AD
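The two example rules can be read as a simple predicate over an encoded instance. The sketch below assumes a hypothetical instance layout (a dict holding the aspect ratio and one set of phrases per feature group). It encodes only these two rules; a real learned rule set would contain many more, plus a default class.

```python
def is_ad(instance):
    """Apply the two example rules from the slide to one instance.

    Assumed layout: 'aspect_ratio' is a float or None, and 'alt',
    'ubase', 'udest' are sets of phrases found in that field.
    """
    ratio = instance.get("aspect_ratio")
    alt = instance.get("alt", set())
    ubase = instance.get("ubase", set())
    udest = instance.get("udest", set())

    # Rule 1: wide banner whose alt text says "click here".
    rule1 = (
        ratio is not None
        and ratio > 4.5833
        and "to" not in alt
        and "click+here" in alt
        and "http+www" not in udest
    )
    # Rule 2: destination goes through a redirect script.
    rule2 = "messier" not in ubase and "redirect+cgi" in udest
    return rule1 or rule2

banner = {"aspect_ratio": 7.8, "alt": {"click", "here", "click+here"}}
logo = {"aspect_ratio": 1.0, "alt": {"home"}}
```

Because each rule is a handful of comparisons and set lookups, classifying an image this way is cheap, which fits the requirement for quick on-line execution.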
Design of ADEATER • Removing advertisements • Process - Fetch HTML pages from the Internet - Identify candidate advertisements - Classify instances with the learned rules - Replace the URL of each image classified as an AD with the URL of an inconspicuous low-bandwidth image • Implementation - Removal module runs as a proxy server
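The identify, classify, and replace steps above can be sketched as a single rewriting pass over a fetched page. This is an illustration under stated assumptions, not the proxy's actual code: it uses regular expressions, which are brittle on real-world HTML, and `PLACEHOLDER_URL`, `remove_ads`, and the `is_ad(dest_url, img_url)` callback are hypothetical names.

```python
import re

# Hypothetical address of the inconspicuous low-bandwidth replacement image.
PLACEHOLDER_URL = "http://proxy.invalid/blank.gif"

# An anchor with an href, up to its closing tag (candidate region).
A_BLOCK = re.compile(
    r"<a\b[^>]*\bhref\s*=\s*[\"']([^\"']*)[\"'][^>]*>(.*?)</a>",
    re.IGNORECASE | re.DOTALL,
)
# The src attribute of an <IMG> tag, split so the URL can be swapped.
IMG_SRC = re.compile(r"(<img\b[^>]*?\bsrc\s*=\s*[\"'])([^\"']*)([\"'])", re.IGNORECASE)

def remove_ads(html, is_ad):
    """Replace the src of every anchored image that is_ad() flags."""
    def rewrite_anchor(anchor):
        dest_url = anchor.group(1)

        def rewrite_img(img):
            if is_ad(dest_url, img.group(2)):
                return img.group(1) + PLACEHOLDER_URL + img.group(3)
            return img.group(0)

        return IMG_SRC.sub(rewrite_img, anchor.group(0))

    return A_BLOCK.sub(rewrite_anchor, html)

page = (
    '<a href="http://ads.example.com/redirect.cgi">'
    '<img src="http://img.example.com/banner.gif" width="468" height="60"></a>'
    '<img src="logo.gif">'
)
cleaned = remove_ads(page, lambda dest, src: "redirect" in dest)
```

Images outside an anchor are never candidates, so `logo.gif` is left untouched regardless of what the classifier would say.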
Evaluation • Speed and accuracy • Experiment setting • Total samples - AD: 459 examples - NON-AD: 2820 examples • 10-fold cross-validation - Training set: 90% examples - Test set: 10% examples • Off-line training phase: 5.8 minutes • On-line classification phase: 70 msec/image • Average accuracy: 97.1%
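For readers unfamiliar with 10-fold cross-validation, here is a minimal sketch of how the 3,279 examples (459 AD plus 2,820 NON-AD) would be split. The shuffling seed and the helper name are assumptions. The majority-class figure is plain arithmetic on the sample counts above, not a result reported in the slides.

```python
import random

def k_fold_indices(n_examples, k=10, seed=0):
    """Yield (train, test) index lists; each example is tested exactly once."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

N_AD, N_NON_AD = 459, 2820
N_TOTAL = N_AD + N_NON_AD

splits = list(k_fold_indices(N_TOTAL))

# Accuracy of always answering NON-AD, for comparison with the reported 97.1%.
majority_baseline = N_NON_AD / N_TOTAL
```

With 3,279 examples each test fold holds 327 or 328 images, roughly 10% as stated, and always answering NON-AD would score about 86.0%.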
Evaluation • Learning curves • Simple methodology - Feature set is not recalculated • Realistic methodology - Feature set is recalculated
Evaluation • Alternative encodings
Related Work • Muffin: Filtering web pages • ImageKill Filter: Hand-crafted rules • ImageKill.minheight - Only remove images which are at least n pixels high • ImageKill.minwidth - Only remove images which are at least n pixels wide • ImageKill.ratio - Remove images which are more than n times as wide as they are high • ImageKill.exclude - Don't remove images that match the given string/regexp
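For contrast with learned rules, the hand-crafted ImageKill options listed above amount to a fixed predicate like the one sketched below. The parameter names mirror the option names on the slide, but the function itself is hypothetical, and how Muffin actually combines the options is an assumption here (all conditions must hold, and exclude wins).

```python
import re

def imagekill_removes(width, height, url, minwidth, minheight, ratio, exclude=None):
    """Decide whether a hand-written size filter would remove an image."""
    # exclude: never remove images whose URL matches the given pattern.
    if exclude and re.search(exclude, url):
        return False
    # minheight / minwidth: only images at least this large are considered.
    if height < minheight or width < minwidth:
        return False
    # ratio: remove images more than `ratio` times as wide as they are high.
    return width > ratio * height

banner_removed = imagekill_removes(468, 60, "http://img.example.com/banner.gif",
                                   minwidth=100, minheight=50, ratio=4)
```

Every threshold has to be chosen and maintained by the user, which is the burden a learning approach aims to remove.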
Related Work • WebFilter: Filtering web pages • Solution - User provides a list of URL templates and corresponding filter scripts
Related Work • Junkbuster: Filtering web pages • Solution - User provides a block file
Related Work • Smokey: Detects abusive messages • Solution - Generate rules by training on sample messages - Parse each message and generate a feature vector - Classify the feature vector with the generated rules
Conclusion and Future Work • Conclusion • High accuracy • Modest resource cost (processing time, training samples) • Future Work • Incremental learning algorithm • More efficient feature selection mechanism