Adaptive Web Sites: Automatically Synthesizing Web Pages. Mike Perkowitz and Oren Etzioni www.cs.washington.edu/homes/map/adaptive/. Adaptive Web Sites. Web sites that automatically reconfigure their organization and presentation by learning from user access patterns.
Adaptive Web Sites: Automatically Synthesizing Web Pages Mike Perkowitz and Oren Etzioni www.cs.washington.edu/homes/map/adaptive/
Adaptive Web Sites Web sites that automatically reconfigure their organization and presentation by learning from user access patterns. (Perkowitz & Etzioni, IJCAI’97)
Adaptive Web Sites • Individual Customization: site learns you like sports • Group Transformation: site learns most sports lovers also read “Tank McNamara” and cross-links them
Group Transformations • Our approach: history-based • Previously: Simple transformations (Perkowitz & Etzioni, WWW6) • Goal: change in view
Index Page Synthesis Find groups of related documents at the site and create new pages linking to those documents. • Input: web site, access log • Output: pages of links to related pages
Questions • What links are on the index page? • How are the contents ordered? • What is the title? • How are links labeled? • How do we make the index comprehensive?
Outline • Motivation • Plausible approaches • Clustering • Frequent sets • Our approach: Cluster Mining • Algorithm: PageGather • Evaluation
Clustering (Voorhees-86, Willett-88, Rasmussen-92) • Similarity metric over documents • Cluster: items close together, far from others • Algorithms: Hierarchical Agglomerative Clustering (HAC), K-means clustering
Clustering Visit: set of pages accessed by an individual • Document = page • Similarity = co-occurrence in visits • Cluster → index page contents
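The "similarity = co-occurrence in visits" idea can be sketched as a pair count over visits. This is a minimal illustration, not the authors' code: the function name `cooccurrence_counts` and the page paths are hypothetical, and raw pair counts are only one possible reading of "co-occurrence".

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(visits):
    """Count, for each unordered page pair, how many visits contain both pages."""
    pair_counts = Counter()
    for visit in visits:
        # Sort so each pair is always stored as (smaller, larger).
        for a, b in combinations(sorted(set(visit)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Hypothetical visits: each visit is the set of pages one visitor accessed.
visits = [
    {"/synth/", "/drum/", "/fx/"},
    {"/synth/", "/drum/"},
    {"/fx/", "/mixers/"},
]
counts = cooccurrence_counts(visits)
print(counts[("/drum/", "/synth/")])  # 2
```

A clustering algorithm such as HAC or K-means would then treat these counts (or a normalized form of them) as the similarity between two documents.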
Clustering: Problems • Clustering induces a partition over data • Clustering can be slow
Frequent Sets Agrawal, Imielinski, & Swami-93 • Set of transactions: “basket” of items • Find all frequently-occurring itemsets Algorithm: • A priori
Frequent Sets Visit: set of pages accessed by an individual • Item = page • Transaction = visit • Frequent set → index page contents
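With item = page and transaction = visit, frequent-set discovery can be sketched as a level-wise search in the spirit of the a priori algorithm. This is a simplified illustration under stated assumptions: `frequent_itemsets`, `min_support` (an absolute visit count), and the page names are hypothetical, and no attempt is made at the optimizations a real implementation would need.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single page.
    current = {frozenset([item]) for t in transactions for item in t}
    result = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        # Level k+1 candidates: unions of frequent k-sets whose k-subsets are all frequent.
        k += 1
        current = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in frequent for s in combinations(union, k - 1)
            ):
                current.add(union)
    return result


# Hypothetical visits used as transactions.
visits = [
    {"a.html", "b.html", "c.html"},
    {"a.html", "b.html"},
    {"a.html", "c.html"},
    {"b.html", "c.html"},
    {"a.html", "b.html", "c.html"},
]
frequent = frequent_itemsets(visits, min_support=3)
print(frequent[frozenset({"a.html", "b.html"})])  # 3
```

In this toy data every pair is frequent but the triple is not, which also shows how the method tends to return many similar, overlapping itemsets.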
Frequent Sets: Problems • “Frequent Item Problem” • Finds many similar itemsets • Low minimum frequency → high running time
Idea: Cluster Mining • Find only high-quality clusters • Not a partition • Clusters may overlap
The PageGather Algorithm • Graph-based representation • Nodes: pages • Edges: if P(P1|P2) and P(P2|P1) are both high • Fast and accurate
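The graph construction can be sketched as follows, assuming P(P1|P2) is estimated as the fraction of visits containing P2 that also contain P1. The function name `build_graph`, the `threshold` parameter, and its value are assumptions for illustration; the slides only say both conditional probabilities must be "high".

```python
from collections import Counter
from itertools import combinations


def build_graph(visits, threshold):
    """Link two pages when both conditional co-occurrence estimates reach threshold."""
    page_counts = Counter()
    pair_counts = Counter()
    for visit in visits:
        pages = sorted(set(visit))
        page_counts.update(pages)
        pair_counts.update(combinations(pages, 2))

    graph = {page: set() for page in page_counts}
    for (a, b), both in pair_counts.items():
        p_a_given_b = both / page_counts[b]  # estimate of P(a | b)
        p_b_given_a = both / page_counts[a]  # estimate of P(b | a)
        # Requiring the smaller of the two guards against a popular page
        # being linked to everything that happens to co-occur with it.
        if min(p_a_given_b, p_b_given_a) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical visits over pages A-D.
visits = [{"A", "B"}, {"A", "B"}, {"A", "C"}, {"B"}, {"C", "D"}]
graph = build_graph(visits, threshold=0.6)
print(graph["A"])  # {'B'}
```

Here C and D are not linked even though P(C|D) = 1, because P(D|C) is only 0.5; using both directions is what keeps the edge set selective.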
The PageGather Algorithm [figure] • Stage labels: Log, Visits, Co-occurrence, Graph, Clique/CC, New Page • Sample log input: pipe-delimited access records for www.hyperreal.org, www.hyperreal.com, and www.apache.org dated 1997/07/03 • Example page paths: /96/Autumn/Final/, /97/Winter/Final/, /96/Autumn/Midterm/, /97/Spring/Final/, /97/Spring/Midterm/
PageGather • Implement with Cliques or CCs • Find all candidates, return best • Clique: maximal cliques of size k • Clique and CC versions comparable in time and performance
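Both ways of extracting candidate clusters from a page graph can be sketched on an adjacency-set dictionary. This is an illustrative sketch only: the function names and the example graph are hypothetical, the clique search uses the basic Bron-Kerbosch recursion (an assumption; the slides do not say which clique algorithm is used), and the size limit k and the ranking of candidates are omitted.

```python
def connected_components(graph):
    """Return the connected components of an undirected graph as sets of nodes."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


def maximal_cliques(graph):
    """Return all maximal cliques using the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already explored nodes.
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


# Hypothetical page graph: triangle A-B-C, a tail C-D, and an isolated page E.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
    "E": set(),
}
components = sorted(sorted(c) for c in connected_components(graph))
cliques = sorted(sorted(c) for c in maximal_cliques(graph))
print(components)  # [['A', 'B', 'C', 'D'], ['E']]
print(cliques)     # [['A', 'B', 'C'], ['C', 'D'], ['E']]
```

The example shows the trade-off in miniature: the connected component pulls D into one larger cluster, while the clique version yields smaller, tighter groups.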
Experiments machines.hyperreal.org • Site gets ~1200 visitors/day (10k hits) • Site contains ~2500 distinct documents • Training: a month of access data • Testing: ten days of data
Performance Metric Are index pages helpful to users? • How well do clusters predict user navigation? • Q(C) = Given that a user visits one page in cluster C, how likely is she to visit any other?
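One way to estimate Q(C) from test visits is sketched below. It rests on an interpretation that the slide does not spell out: among visits that touch at least one page of C, Q(C) is taken as the fraction that touch at least two. The function name `cluster_quality` and the sample data are hypothetical.

```python
def cluster_quality(cluster, visits):
    """Estimate Q(C): of visits touching cluster C, the share touching 2+ of its pages."""
    cluster = set(cluster)
    overlaps = [len(cluster & set(visit)) for visit in visits]
    touched = sum(1 for n in overlaps if n >= 1)
    if touched == 0:
        return 0.0  # No visit reached the cluster, so there is no evidence either way.
    return sum(1 for n in overlaps if n >= 2) / touched


# Hypothetical held-out visits and a candidate cluster.
test_visits = [{"A", "B"}, {"A"}, {"C", "D"}, {"D"}, {"B", "C", "E"}]
q = cluster_quality({"A", "B", "C"}, test_visits)
print(q)  # 0.5
```

Four of the five visits touch the cluster and two of those touch it more than once, giving 0.5; the visit containing only D is ignored because it never reaches the cluster.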
Cluster Mining vs. Clustering PageGather using • Clique 10 clusters 1:05 min • HAC 10 clusters 48+ hours • K-means 10 clusters 3:35 min • HAC* 8 clusters 21:55 min (threshold, less data, mining)
Cluster Mining vs. Clustering PageGather using • Clique 10 clusters 1:05 min • HAC 10 clusters 48+ hours • K-means 10 clusters 3:35 min • HAC* 7 clusters 293:08 min (threshold, less data, mining)
Cluster Mining vs. Clustering [chart: Q, top 10 clusters]
PageGather vs. Frequent Sets • PG/Clique 10 clusters 1:05 min • A priori 10 frequent sets 1:41 min
PageGather vs. Frequent Sets [chart: Q, top 10 clusters]
Contributions • Motivating problem: Web page synthesis • Method: Cluster mining • well suited for discovery of coherent sets • comparison to clustering, frequent sets • Algorithm: PageGather • graph-based, fast and accurate
Clique vs. Conn-component [chart: Q, top 10 clusters]
Clique vs. Conn-component • Comparable accuracy • Clique finds fewer, smaller clusters than CC • Clique: more accurate (at first) • Comparable running time (in practice)
Future Directions • Meta-Information to improve coherence • Conceptual clustering • Improve coherence • Naming pages • Cluster mining to generate association rules