This article discusses web page classification techniques and user behavior modeling strategies for effective web mining. It covers topics such as supervised learning, anchor text analysis, and user access prediction.
Overview
Diagram labels: WWW, Web Mining, Web Service
• Web Page Mining: Web Page Classification, Information Extraction
• Web User Mining: User Behavior Clustering, User Pattern Discovery, User Behavior Modeling
• Searching: Simple Search, Meta-search, Refined Search
• Querying: Web Manipulation Languages, Web Definition Languages
• Filtering: Content-based Filtering, Collaborative Filtering, Hybrid Filtering
• Prefetching: Cache Management, Access Prediction
• Browsing: Automatic Crawlers, Navigation Guides
Web Page Classification
• Supervised Learning
Figure labels: WWW; Sampling; Classifying; feedback; Refining rule base; Mining metadata
Web Page Classification
• A Hypertext-based approach
• Use the anchor text to determine the genres of the web pages that the anchors point to
• Employ rough set theory to eliminate dispensable words and derive compact rules
• Indiscernibility relation

Example table over the words a, b, c, d:

       a  b  c  d  class
  1    0  0  1  0  1
  2    1  0  1  1  1
  3    0  0  0  1  2
  4    0  0  1  1  2

The same table restricted to a and d:

       a  d  class
  1    0  0  1
  2    1  1  1
  3    0  1  2
  4    0  1  2
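As a minimal sketch of the indiscernibility relation, the snippet below groups the four example objects by their values on a chosen attribute subset. The names `TABLE` and `indiscernibility_classes` are illustrative and do not come from the slides.

```python
from collections import defaultdict

# Example table from the slide: object id -> attribute values and class.
TABLE = {
    1: {"a": 0, "b": 0, "c": 1, "d": 0, "class": 1},
    2: {"a": 1, "b": 0, "c": 1, "d": 1, "class": 1},
    3: {"a": 0, "b": 0, "c": 0, "d": 1, "class": 2},
    4: {"a": 0, "b": 0, "c": 1, "d": 1, "class": 2},
}

def indiscernibility_classes(table, attrs):
    """Group object ids that share the same value on every attribute in attrs."""
    groups = defaultdict(list)
    for obj, row in table.items():
        groups[tuple(row[a] for a in attrs)].append(obj)
    return sorted(groups.values())

print(indiscernibility_classes(TABLE, ["a", "b", "c", "d"]))  # [[1], [2], [3], [4]]
print(indiscernibility_classes(TABLE, ["a", "d"]))            # [[1], [2], [3, 4]]
```

With only a and d, objects 3 and 4 fall into one group: they cannot be told apart, yet they share a class, so no classification information is lost.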
Web Page Classification
Figure labels: page A, page B (class X); page C, page D (class Y); page E, page F (class Z)

Anchor words and the resulting decision table:

       words    a  b  c  d  class
  1    c        0  0  1  0  X
  2    a c d    1  0  1  1  X
  3    d        0  0  0  1  X
  4    c d      0  0  1  1  X
  5    b c d    0  1  1  1  Y
  6    a b c    1  1  1  0  Y
  7    a b c    1  1  1  0  Y
  8    a d      1  0  0  1  Z
  9    a b      1  1  0  0  Z
  10   a b d    1  1  0  1  Z
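The decision table can be derived mechanically from the word sets. The sketch below assumes each row is a set of anchor words plus the class of the target page; `VOCAB`, `ANCHORS` and `to_row` are hypothetical names used only for illustration.

```python
VOCAB = ["a", "b", "c", "d"]

# (anchor words, class) pairs in the order listed on the slide.
ANCHORS = [
    ("c", "X"), ("a c d", "X"), ("d", "X"), ("c d", "X"),
    ("b c d", "Y"), ("a b c", "Y"), ("a b c", "Y"),
    ("a d", "Z"), ("a b", "Z"), ("a b d", "Z"),
]

def to_row(text, vocab=VOCAB):
    """Turn a whitespace-separated word list into a 0/1 vector over vocab."""
    words = set(text.split())
    return tuple(int(w in words) for w in vocab)

decision_table = [(to_row(text), label) for text, label in ANCHORS]
for i, (row, label) in enumerate(decision_table, start=1):
    print(i, *row, label)
```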
Web Page Classification

merge {6,7}

        a  b  c  d  class
  1     0  0  1  0  X
  2     1  0  1  1  X
  3     0  0  0  1  X
  4     0  0  1  1  X
  5     0  1  1  1  Y
  6,7   1  1  1  0  Y
  8     1  0  0  1  Z
  9     1  1  0  0  Z
  10    1  1  0  1  Z

eliminate [a]

        b  c  d  class
  1     0  1  0  X
  2     0  1  1  X
  3     0  0  1  X
  4     0  1  1  X
  5     1  1  1  Y
  6,7   1  1  0  Y
  8     0  0  1  Z
  9     1  0  0  Z
  10    1  0  1  Z

eliminate [d]

        a  b  c  class
  1     0  0  1  X
  2     1  0  1  X
  3     0  0  0  X
  4     0  0  1  X
  5     0  1  1  Y
  6,7   1  1  1  Y
  8     1  0  0  Z
  9     1  1  0  Z
  10    1  1  0  Z
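One way to read the two elimination tables: an attribute is dispensable if dropping it never leaves two rows that agree on the remaining attributes but differ in class. The sketch below applies that check to the merged table; the function names are illustrative, not taken from the slides.

```python
# Rows after merging the duplicate objects 6 and 7: (a, b, c, d) -> class.
ROWS = [
    ((0, 0, 1, 0), "X"), ((1, 0, 1, 1), "X"), ((0, 0, 0, 1), "X"),
    ((0, 0, 1, 1), "X"), ((0, 1, 1, 1), "Y"), ((1, 1, 1, 0), "Y"),
    ((1, 0, 0, 1), "Z"), ((1, 1, 0, 0), "Z"), ((1, 1, 0, 1), "Z"),
]
ATTRS = ["a", "b", "c", "d"]

def is_consistent(rows, keep):
    """True if no two rows agree on the kept columns but differ in class."""
    seen = {}
    for values, label in rows:
        key = tuple(values[i] for i in keep)
        if seen.setdefault(key, label) != label:
            return False
    return True

def is_dispensable(rows, attrs, name):
    """True if the table stays consistent after dropping the named attribute."""
    keep = [i for i, a in enumerate(attrs) if a != name]
    return is_consistent(rows, keep)

for name in ATTRS:
    print(name, is_dispensable(ROWS, ATTRS, name))
```

On this table only d passes the check. Dropping a, for example, makes rows 3 and 8 identical on b, c, d while their classes are X and Z.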
Web Page Classification

merge {1,4}, {9,10}

         a  b  c  class
  1,4    0  0  1  X
  2      1  0  1  X
  3      0  0  0  X
  5      0  1  1  Y
  6,7    1  1  1  Y
  8      1  0  0  Z
  9,10   1  1  0  Z

core set

         a  b  c  class
  1,4    -  0  -  X
  2      -  0  1  X
  3      0  -  -  X
  5      -  1  -  Y
  6,7    -  1  1  Y
  8      1  -  0  Z
  9,10   -  -  0  Z

reduct set

         a  b  c  class
  1,4    0  0  -  X
  1,4    -  0  1  X
  2      -  0  1  X
  3      0  0  -  X
  3      0  -  0  X
  5      0  1  -  Y
  5      -  1  1  Y
  6,7    -  1  1  Y
  8      1  -  0  Z
  9,10   1  -  0  Z
  9,10   -  1  0  Z
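The core and reduct tables can be reproduced with a brute-force sketch. It assumes a row's reducts are the minimal attribute subsets that still separate it from every row of another class, and that its core is the intersection of those reducts. The slides do not spell out the procedure, so treat this as one plausible reading; all names are illustrative.

```python
from itertools import combinations

# Rows after merging {1,4} and {9,10}: label -> ((a, b, c), class).
ROWS = {
    "1,4": ((0, 0, 1), "X"), "2": ((1, 0, 1), "X"), "3": ((0, 0, 0), "X"),
    "5": ((0, 1, 1), "Y"), "6,7": ((1, 1, 1), "Y"),
    "8": ((1, 0, 0), "Z"), "9,10": ((1, 1, 0), "Z"),
}

def discerns(values, cls, subset, rows):
    """True if values differs from every other-class row on some index in subset."""
    return all(
        any(values[i] != other[i] for i in subset)
        for other, other_cls in rows.values()
        if other_cls != cls
    )

def row_reducts(values, cls, rows):
    """All minimal index subsets that still separate this row from other classes."""
    found = []
    for size in range(1, len(values) + 1):
        for subset in combinations(range(len(values)), size):
            if any(set(f) <= set(subset) for f in found):
                continue  # contains a smaller reduct, so it is not minimal
            if discerns(values, cls, subset, rows):
                found.append(subset)
    return found

def show(values, subset):
    """Render a row with '-' for attributes outside the subset."""
    return " ".join(str(v) if i in subset else "-" for i, v in enumerate(values))

# Assumes a consistent table, so every row has at least one reduct.
for name, (values, cls) in ROWS.items():
    reducts = row_reducts(values, cls, ROWS)
    core = set.intersection(*(set(r) for r in reducts))
    print(name, "| core:", show(values, core),
          "| reducts:", [show(values, r) for r in reducts], "|", cls)
```

For row 1,4 this yields the reducts `0 0 -` and `- 0 1` with core `- 0 -`, matching the tables above.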
Web Page Classification

merge duplicates

           a  b  c  class
  1,4,3    0  0  -  X
  1,4,2    -  0  1  X
  3        0  -  0  X
  5        0  1  -  Y
  5,6,7    -  1  1  Y
  8,9,10   1  -  0  Z
  9,10     -  1  0  Z

  rule                   support
  ~a^~b => class X       3/10
  ~b^c  => class X       3/10
  ~a^~c => class X       1/10
  ~a^b  => class Y       1/10
  b^c   => class Y       3/10
  a^~c  => class Z       3/10
  b^~c  => class Z       2/10
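The support figures follow from merging reduct rows that share a pattern and class, then counting the original objects they cover. A small sketch of that bookkeeping, with hypothetical names (`REDUCT_ROWS`, `merge_rules`, `format_rule`):

```python
from collections import defaultdict

ATTRS = ("a", "b", "c")
TOTAL_OBJECTS = 10  # rows in the original decision table

# Reduct rows: (object ids, pattern over a, b, c, class); None stands for "-".
REDUCT_ROWS = [
    ((1, 4), (0, 0, None), "X"), ((1, 4), (None, 0, 1), "X"),
    ((2,), (None, 0, 1), "X"), ((3,), (0, 0, None), "X"),
    ((3,), (0, None, 0), "X"), ((5,), (0, 1, None), "Y"),
    ((5,), (None, 1, 1), "Y"), ((6, 7), (None, 1, 1), "Y"),
    ((8,), (1, None, 0), "Z"), ((9, 10), (1, None, 0), "Z"),
    ((9, 10), (None, 1, 0), "Z"),
]

def merge_rules(rows):
    """Merge rows with the same pattern and class, collecting their object ids."""
    merged = defaultdict(set)
    for ids, pattern, cls in rows:
        merged[(pattern, cls)].update(ids)
    return merged

def format_rule(pattern, cls):
    """Write a pattern in the slide's notation, e.g. ~a^~b => class X."""
    terms = [name if v else "~" + name
             for name, v in zip(ATTRS, pattern) if v is not None]
    return "^".join(terms) + " => class " + cls

rules = {format_rule(p, c): len(ids)
         for (p, c), ids in merge_rules(REDUCT_ROWS).items()}
for text, count in rules.items():
    print(f"{text}  {count}/{TOTAL_OBJECTS}")
```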
Web Page Prefetching
• User Pattern Discovery
• User Access Prediction
Web Page Prefetching
• User Pattern Discovery
  • Generation of access paths
  • Storage of access paths
  • Discovery of access patterns
• User Access Prediction
  • Indexing of access patterns
  • Prediction of user requests
User Pattern Discovery
• Path Generation
  • A time-based heuristic
    • upper bound, variation limit
  • Enumeration of suffix paths

Example access log:

  URL   Time
  A     1
  B     3
  C     5
  D     9
  E     16
  F     21
  G     28

Generated paths:
  variation limit = 3: ABCD, EF, G
  upper bound = 6: ABCDEF, G
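The suffix-path enumeration step can be sketched as follows, assuming a suffix path is any non-empty tail of a generated path. The function name `suffix_paths` is illustrative, and the time-based segmentation itself is not modeled here.

```python
def suffix_paths(path):
    """Return every non-empty suffix of a path, longest first."""
    return [path[i:] for i in range(len(path))]

for path in ["ABCD", "EF", "G"]:
    print(path, "->", suffix_paths(path))
```

For the path ABCD this gives ABCD, BCD, CD and D.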
User Pattern Discovery
• A Tree-based Approach
  • Construct a path tree for each web site
    • Attach a counter to each node (for a specific path)
  • Select the important paths from the path tree
    • Support and confidence
User Pattern Discovery
Total: 200; minimum support = 5%; minimum confidence = 10%

  path    count
  ABCD    15
  ABC     15
  ADEFG   1
  ADE     9
  AD      110
  BCA     20
  BC      5
  B       25
  DEFG    2
  DEF     2
  DEG     3
  DE      2
  EG      1

Path tree (node:count):

  A:150
    B:30
      C:30
        D:15
    D:120
      E:10
        F:1
          G:1
  B:50
    C:25
      A:20
  D:9
    E:9
      F:4
        G:2
      G:3
  E:1
    G:1
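A minimal sketch of the path tree and the selection step. The slides do not define support and confidence, so this assumes support = node count / total and confidence = node count / parent count (with the total as the parent count for first-level nodes), and it stops descending once a node fails either threshold. `Node`, `build_path_tree` and `select_paths` are hypothetical names.

```python
PATH_COUNTS = {
    "ABCD": 15, "ABC": 15, "ADEFG": 1, "ADE": 9, "AD": 110,
    "BCA": 20, "BC": 5, "B": 25,
    "DEFG": 2, "DEF": 2, "DEG": 3, "DE": 2, "EG": 1,
}
TOTAL = 200  # as stated on the slide

class Node:
    def __init__(self):
        self.count = 0
        self.children = {}

def build_path_tree(path_counts):
    """Insert each path into a trie, adding its count to every node it visits."""
    root = Node()
    for path, count in path_counts.items():
        node = root
        for url in path:
            node = node.children.setdefault(url, Node())
            node.count += count
    return root

def select_paths(root, total, min_support, min_confidence):
    """Collect prefixes whose support and confidence both reach the thresholds."""
    selected = {}

    def visit(node, prefix, parent_count):
        for url, child in sorted(node.children.items()):
            support = child.count / total            # assumed definition
            confidence = child.count / parent_count  # assumed definition
            if support >= min_support and confidence >= min_confidence:
                selected[prefix + url] = (support, confidence)
                visit(child, prefix + url, child.count)

    visit(root, "", total)
    return selected

tree = build_path_tree(PATH_COUNTS)
patterns = select_paths(tree, TOTAL, min_support=0.05, min_confidence=0.10)
print(sorted(patterns))
```

Under these assumptions the node counts match the tree on the slide, and the surviving paths are A, AB, ABC, ABCD, AD, B, BC and BCA. The D and E subtrees fall below the support threshold, and ADE fails the confidence threshold.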
User Pattern Discovery
• Path Tree
  • PV-B+ tree
User Access Prediction
Figure labels: Newcome Request; Request Dispatcher; Path Collector; Past Request; Access Log; Storage Manager; Path Tree; Index Generator; Pattern Tree; User Monitor; User Log
Access Log record: IP = 1, URL = A, Time = 8:00, Size = 20K
User Access Prediction
• Pattern Tree
  • Bucket
  • Candidate
    • URL, confidence, b-pointer
Figure (pattern tree candidates with confidence): A 75%, B 25%, D 80%, C 50%, B 20%, C 20%, D 50%, A 12%
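A rough sketch of how a pattern tree could serve predictions. It models only the URL and confidence of each candidate, not buckets or b-pointers. The nesting below is an assumption about the figure (A and B at the root, D and B under A, C under B), and the fallback to shorter suffixes of the current path is an illustrative choice rather than something the slides state.

```python
# Pattern tree: URL -> (confidence, children). The nesting is assumed.
PATTERN_TREE = {
    "A": (0.75, {"D": (0.80, {}), "B": (0.20, {})}),
    "B": (0.25, {"C": (0.50, {})}),
}

def candidates(tree, path):
    """Follow path from the root and return next-URL candidates, best first."""
    children = tree
    for url in path:
        if url not in children:
            return []
        children = children[url][1]
    ranked = [(url, conf) for url, (conf, _) in children.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

def predict(tree, path):
    """Try the full path first, then shorter suffixes, until something matches."""
    for start in range(len(path)):
        found = candidates(tree, path[start:])
        if found:
            return found
    return []

print(predict(PATTERN_TREE, "A"))   # [('D', 0.8), ('B', 0.2)]
print(predict(PATTERN_TREE, "CB"))  # [('C', 0.5)]
```

A prefetcher could then fetch the top-ranked candidate, or every candidate above some confidence cutoff.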
User Access Prediction
• Prediction of User Requests
  • User logs

Access Log:

  IP   URL   Time
  1    A     8:00
  2    B     10:00
  1    C     11:00
  2    C     14:00
  1    B     18:00

Figure: the User Monitor matches each newcome request against user log entries for IP 1 and IP 2. Each entry carries a URL, a time and a weight. Times shown: 8:00, 10:00, 11:00, 14:00. Weights shown: 0.6, 0.8, 1.6, 1.8.