Web classification Ontology and Taxonomy
References • Using Ontologies to Discover Domain-Level Web Usage Profiles. H. Dai, B. Mobasher, DePaul University. {hdai,mobasher}@cs.depaul.edu • Learning to Construct Knowledge Bases from the World Wide Web. M. Craven, D. DiPasquo, T. Mitchell, K. Nigam, S. Slattery, Carnegie Mellon University, Pittsburgh, USA; D. Freitag, A. McCallum, Just Research, Pittsburgh, USA
Definitions • Ontology: an explicit formal specification of how to represent the objects, concepts, and other entities that are assumed to exist in some area of interest, and the relationships that hold among them. • Taxonomy: a classification of organisms into groups based on similarities of structure, origin, etc.
Goal • Capture and model the behavioral patterns and profiles of users interacting with a web site. • Why? • Collaborative filtering • Personalization systems • Improving the organization and structure of the site • Providing dynamic recommendations (www.recommend-me.com)
Algorithm 0 (by Rafa’s brother: Gabriel) • Recommend pages viewed by other users with similar page rankings. • Problems: • New-item problem • Doesn’t consider content similarity or item-to-item relationships
User session • A user session s is a vector ⟨w(p1,s), w(p2,s), …, w(pn,s)⟩ • w(pi,s) is the weight associated with page pi in session s • Session clusters {cl1, cl2, …} • Each cli is a subset of the set of sessions • Usage profile pr_cl = {⟨p, weight(p, pr_cl)⟩ : weight(p, pr_cl) ≥ μ} • weight(p, pr_cl) = (1/|cl|) · ∑s∈cl w(p,s)
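A minimal sketch of how a usage profile could be computed from a session cluster under the definitions above. Sessions are assumed to be dicts mapping page to weight; the threshold value MU and the function name usage_profile are illustrative, not from the paper.

```python
from collections import defaultdict

MU = 0.5  # significance threshold mu; an assumed example value


def usage_profile(cluster, mu=MU):
    """Aggregate a cluster of sessions into a usage profile.

    weight(p, pr_cl) = (1/|cl|) * sum over sessions s in cl of w(p, s);
    only pages whose aggregated weight reaches mu are kept.
    """
    totals = defaultdict(float)
    for session in cluster:
        for page, w in session.items():
            totals[page] += w
    n = len(cluster)
    return {p: t / n for p, t in totals.items() if t / n >= mu}


# Example: two sessions over three pages
cluster = [{"/index.html": 1.0, "/course.html": 0.8},
           {"/course.html": 0.6, "/staff.html": 0.2}]
print(usage_profile(cluster))  # /course.html survives; /staff.html is filtered out
```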
Algorithm 1 • For every session, create a vector containing the viewed pages and a weight for each page. • Each vector represents a point in an N-dimensional space, so we can identify clusters. • For a new session, check which cluster its vector/point belongs to, and recommend the high-scoring pages of that cluster (see the sketch below). • Problems: • New-item problem • Doesn’t consider content similarity or item-to-item relationships
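One way to realize Algorithm 1 is k-means over the session vectors. A hedged sketch, where the page universe, the example sessions, and the choice of k = 2 are all assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

pages = ["/index.html", "/course.html", "/staff.html"]  # assumed page universe


def to_vector(session):
    # Represent a session as a point in N-dimensional page space.
    return np.array([session.get(p, 0.0) for p in pages])


sessions = [{"/index.html": 1.0, "/course.html": 0.8},
            {"/course.html": 0.6, "/staff.html": 0.2},
            {"/index.html": 0.9, "/course.html": 0.7}]
X = np.vstack([to_vector(s) for s in sessions])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# For a new session, find its cluster and recommend the cluster's
# high-weight pages that the user has not seen yet.
new_session = {"/index.html": 1.0}
cluster_id = km.predict(to_vector(new_session).reshape(1, -1))[0]
centroid = km.cluster_centers_[cluster_id]
recs = sorted((p for p in pages if p not in new_session),
              key=lambda p: -centroid[pages.index(p)])
print(recs)
```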
Algorithm 2: keyword search • Solves the new-item problem. • Still not good enough: • A page can contain information about more than one object. • Essential data may only be linked from the page rather than contained in it. • What exactly counts as a keyword? • Solution: domain ontologies for objects
Domain Ontologies • Domain-level aggregate profile: a set of pseudo objects, each characterizing objects of different types that occur commonly across the user sessions. • Class: C • Attribute a: ⟨Da, Ta, ≤a, Ψa⟩ • Da: domain of the values of a (red, blue, …) • Ta: type of the attribute • ≤a: ordering relation on Da • Ψa: combination function
Example – movie web site • Classes: movies, actors, directors, etc. • Attributes: • Movies: title, genre, starring actors • Actors: name, filmography, gender, nationality • Combination functions: • Ψactor(⟨{S,0.7; T,0.2; U,0.1}, 1⟩, ⟨{S,0.5; T,0.5}, 0.7⟩) = ∑i(wi·wo) / ∑i(wi) per actor • Ψyear({1991}, {1994}) = {1991, 1994} • Ψis_a({person, student}, {person, TA}) = {person}
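A sketch of what these combination functions might look like in code. Reading wi as object i's profile significance and wo as the actor's weight inside object i (so Ψactor normalizes by the total object weight) is an assumption about the slide's notation, and the function names are invented.

```python
from collections import defaultdict


def combine_actors(*objects):
    """Psi_actor: per-actor weighted average,
    combined[actor] = sum_i(w_i * w_o) / sum_i(w_i),
    where w_i is object i's profile weight and w_o is the
    actor's weight inside object i (an assumed reading)."""
    num, den = defaultdict(float), 0.0
    for actor_weights, w_i in objects:
        for actor, w_o in actor_weights.items():
            num[actor] += w_i * w_o
        den += w_i
    return {a: round(v / den, 3) for a, v in num.items()}


def combine_years(*year_sets):        # Psi_year: plain union
    return set().union(*year_sets)


def combine_is_a(*ancestor_sets):     # Psi_is_a: common ancestors (intersection)
    return set.intersection(*map(set, ancestor_sets))


print(combine_actors(({"S": 0.7, "T": 0.2, "U": 0.1}, 1.0),
                     ({"S": 0.5, "T": 0.5}, 0.7)))
# -> {'S': 0.618, 'T': 0.324, 'U': 0.059}
print(combine_years({1991}, {1994}))                          # {1991, 1994}
print(combine_is_a({"person", "student"}, {"person", "TA"}))  # {'person'}
```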
Creating an Aggregated Representation of a Usage Profile • pr = {⟨o1, w_o1⟩, …, ⟨on, w_on⟩} • oi: an object; w_oi: its significance in the profile pr • Assume all the objects are instances of the same class • Create a new virtual object o′ with attributes a′i = Ψi(o1, …, on)
Algorithm 2 • Do not just recommend items viewed by other users; recommend items similar to the class representative. • Advantages: • Higher accuracy • Needs fewer examples • No new-item problem • Also considers content similarity (item-to-item relationships)
Final Algorithm • Given a web site: • Classify its contents into classes and attributes. • Merge the objects of each user profile into a pseudo object. • Recommend according to this pseudo object (see the sketch below).
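A hedged sketch of the recommendation step: rank catalog items by similarity to the profile's pseudo object rather than by raw co-occurrence. The attribute vectors and the use of cosine similarity are illustrative assumptions.

```python
import numpy as np


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Pseudo object and candidate items as vectors over shared attribute terms
# (e.g. genre and actor weights), built by combination functions like the
# ones sketched above.
pseudo_object = np.array([0.9, 0.6, 0.1])
catalog = {"movie_a": np.array([0.8, 0.7, 0.0]),   # similar -> recommended
           "movie_b": np.array([0.0, 0.1, 0.9])}   # dissimilar, even if popular

ranked = sorted(catalog, key=lambda m: -cosine(catalog[m], pseudo_object))
print(ranked)  # content similarity lets brand-new items rank high too
```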
Problems • A per-topic solution • The discovered patterns can be incomplete • User patterns may change over time (for movies): the “I loved ET” problem • Needs cookies and other methods to identify users • How is the weight calculated? May need many examples: the “I loved American Beauty” problem • How can the web pages be grouped automatically?
Constructing a Knowledge Base from the WWW • Goal: automatically create a computer-understandable knowledge base from the web. • Why? • To use in the previously described work and similar systems • “Find all universities that offer Java programming courses” • “Make me hotel and flight arrangements for the upcoming Linux conference”
…Constructing a Knowledge Base from the WWW • How? • Use machine learning to create an information-extraction method for each desired type of knowledge • Apply them to extract symbolic, probabilistic statements directly from the web: Student-of(Rafa, sdbi) = 99% • Method used: • Provide an initial ontology (classes and relations) • Training examples: 3 out of 4 university sites (8,000 web pages, 1,400 web-page pairs)
Example of web pages • Jim’s Home Page: “I teach several courses: Fundamentals of CS, Intro to AI. My research includes intelligent web agents.” • Fundamentals of CS Home Page: “Instructors: Jim, Tom” • Classes: Faculty, Research-project, Student, Staff, (Person), Course, Department, Other • Relations: instructor-of, members-of-project, department-of
[Diagram: the initial ontology is matched against the Web to produce knowledge-base instances]
Problem Assumptions • Ideally, one class instance per web page; in practice: • Multiple instances may appear in one web page • One instance may span multiple linked/related web pages (the “Elvis problem”) • A relation R(A,B) is represented by: • Hyperlinks A→B, or a path A→C→D→…→B • Inclusion in a particular context (“I teach Intro2cs”) • A statistical model of typical words
To Learn • Recognizing class instances by classifying bodies of hypertext • Recognizing relation instances by classifying chains of hyperlinks • Extracting text fields
Recognizing class instances by classifying bodies of hypertext • Statistical bag-of-words approach, applied to: • Full text • Hyperlinks • Title/head • Learning first-order rules • Combining the previous four methods
Statistical bag-of-words approach • Context-less classification • Given a set of classes C = {c1, c2, …, cN} • Given a document consisting of words {w1, w2, …, wn}, with the vocabulary restricted to 2,000 words • c* = argmaxc Pr(c | w1, …, wn)
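A minimal naive Bayes sketch of this classifier, computing c* = argmaxc Pr(c) · ∏i Pr(wi|c) in log space with Laplace smoothing. The toy training documents are assumptions; the paper's actual estimator details may differ.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (class_label, list_of_words)."""
    class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for label, words in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify(words, class_counts, word_counts, vocab):
    n_docs, best = sum(class_counts.values()), None
    for c in class_counts:
        total = sum(word_counts[c].values())
        score = math.log(class_counts[c] / n_docs)  # log prior Pr(c)
        for w in words:
            # Laplace smoothing keeps unseen words from zeroing the product.
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        best = max(best, (score, c)) if best else (score, c)
    return best[1]


docs = [("course", "syllabus assignment textbook".split()),
        ("faculty", "professor research publications".split())]
model = train(docs)
print(classify("assignment textbook deadline".split(), *model))  # -> course
```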
[Table: confusion matrix of actual vs. predicted classes]
[Table: the most indicative words per class, ranked by the score Pr(wi|c) · log(Pr(wi|c) / Pr(wi|¬c))]
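The ranking score itself is easy to compute once the per-class word probabilities are estimated; a small sketch with made-up probability values:

```python
import math


def word_score(p_w_given_c, p_w_given_not_c):
    """score(w, c) = Pr(w|c) * log(Pr(w|c) / Pr(w|~c))."""
    return p_w_given_c * math.log(p_w_given_c / p_w_given_not_c)


# e.g. "assign" makes up 6% of course-page words but 0.5% elsewhere:
print(word_score(0.06, 0.005))   # large positive -> indicative of the class
print(word_score(0.01, 0.012))   # near zero -> uninformative
```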
Learning first-order rules • The previous method doesn’t consider relations between pages • Example: a page is a course home page if it contains the words “textbook” and “TA” and points to a page containing the word “assignment” • FOIL is a learning system that constructs Horn-clause programs from examples
Relations • has_word(Page): words are stemmed (computer = computing = comput); a word is kept if it occurs at least 200 times but in less than 30% of the pages of other classes • link_to(Page, Page) • Rules are scored by m-estimate accuracy = (nc + m·p) / (n + m) • nc: # of instances correctly classified by the rule • n: total # of instances classified by the rule • m = 2 • p: proportion of training-set instances that belong to the class • Predict each class with confidence = best_match / total_#_of_matches
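The m-estimate is a one-liner; a direct sketch using the slide's m = 2, with made-up counts:

```python
def m_estimate(n_correct, n_covered, p_class, m=2):
    """(n_c + m*p) / (n + m): rule accuracy smoothed toward the class prior."""
    return (n_correct + m * p_class) / (n_covered + m)


# A rule covering 9/10 instances, where 30% of training pages are in the class:
print(m_estimate(9, 10, 0.30))   # 0.8 -- pulled slightly toward the prior
```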
New learned rules • student(A) :- not(has_data(A)), not(has_comment(A)), link_to(B,A), has_jame(B), has_paul(B), not(has_mail(B)). • faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B). • course(A) :- has_instructor(A), not(has_good(A)), link_to(A,B), not(link_to(B, 1)),has_assign(B).
Boosting • Which classifier predicts best depends on the class • Combine the predictions using the confidence measure
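A hedged sketch of confidence-based combination: each of the four methods emits a (class, confidence) pair and the most confident prediction wins. The method labels and numbers are illustrative.

```python
def combine(predictions):
    """predictions: list of (class_label, confidence) from the four methods."""
    return max(predictions, key=lambda pc: pc[1])[0]


print(combine([("course", 0.9),    # full text
               ("student", 0.4),   # hyperlinks
               ("course", 0.7),    # title/head
               ("other", 0.5)]))   # first-order rules  -> "course"
```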
[Plot: accuracy/coverage tradeoff for combined classifiers (2,000-word vocabulary)]
Boosting • Disappointing: the combination is not uniformly better • Possible solutions: • Use reduced-size dictionaries (next) • Use other methods for combining predictions (e.g., voting instead of best_match / total_#_of_matches)
[Plot: accuracy/coverage tradeoff for combined classifiers (200-word vocabulary)]
Multi-Page segments • A group is defined by the longest matching URL prefix (indicated in parentheses): • (@/{user,faculty,people,home,projects}/*)/*.{html,htm} • (@/{cs???,www/,*})/*.{html,htm} • (@/{cs???,www/,*})/ • … • A primary page is any page whose URL matches: • @/index.{html,htm} • @/home.{html,htm} • @/%1/%1.{html,htm} • … • If no page in the group matches one of these patterns, the page with the highest score for any non-Other class becomes the primary page. • Every non-primary page is tagged as Other. (A loose sketch of the primary-page test follows.)
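A loose sketch of the primary-page test, translating the slide's patterns into regular expressions over URL paths ("@" read as the site prefix, "%1" as a repeated name). The translations are assumptions, not the paper's exact matcher.

```python
import re

PRIMARY_PATTERNS = [
    re.compile(r"/index\.html?$"),        # @/index.{html,htm}
    re.compile(r"/home\.html?$"),         # @/home.{html,htm}
    re.compile(r"/([^/]+)/\1\.html?$"),   # @/%1/%1.{html,htm}
]


def is_primary(url_path):
    return any(p.search(url_path) for p in PRIMARY_PATTERNS)


print(is_primary("/users/jim/index.html"))   # True
print(is_primary("/users/jim/jim.html"))     # True  (%1/%1 pattern)
print(is_primary("/users/jim/pubs.html"))    # False -> tagged as Other
```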
[Plot: accuracy/coverage tradeoff for the full-text classifier after the URL grouping heuristics]
Conclusion – Recognizing Classes • Hypertext provides redundant information • We can classify using several methods: • Full text • Heading/title • Hyperlinks • Text in neighboring pages • + grouping pages • No single method is good enough • Combining the predictions of several classification methods gives better results
Learning to Recognize Relation Instances • Assumption: relations are represented by hyperlinks • Given the following background relations: • class(Page) • link_to(Hyperlink, P1, P2) • has_word(H): the word is part of the hyperlink text • all_words_capitalized(H) • has_alphanumeric_word(H): e.g. “I teach CS2765” • has_neighborhood_word(H): the neighborhood is the surrounding paragraph
…Learning to Recognize Relation Instances • Relations to learn: • members_of_project(P1, P2) • instructors_of_course(P1, P2) • department_of_person(P1, P2)
Learned relations • instructors_of(A,B) :- course(A), person(B), link_to(C,B,A). • Test set: 133 pos, 5 neg • department_of(A,B) :- person(A), department(B), link_to(C,D,A), link_to(E,F,D), link_to(G,B,F), has_neighborhood_word_graduate(E). • Test set: 371 pos, 4 neg • members_of_project(A,B) :- research_project(A), person(B), link_to(C,A,D), link_to(E,D,B), has_neighborhood_word_people(C). • Test set: 18 pos, 0 neg
Learning to Extract Text Fields • Sometimes we want only a small fragment of text (a name such as Jon or Peter), not a whole web page or class • Example: “Make me hotel and flight arrangements for the upcoming Linux conference”
Predefined predicates • Let F = w1, w2, …, wj be a fragment of text • length({<, >, =, …}, N): the number of tokens in F satisfies the relation with N • some(Var, Path, Feat, Value): some token in F, after following Path, has feature Feat equal to Value, e.g. some(A, [next_token, next_token], numeric, true) • position(Var, From, Relop, N): the position of token Var within F satisfies Relop N • relpos(Var1, Var2, Relop, N): the distance between tokens Var1 and Var2 satisfies Relop N
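A hedged sketch of how these predicates could be evaluated over a token fragment. The Path argument of some is dropped for brevity, and the feature functions and relational operators shown are illustrative assumptions.

```python
import operator

RELOPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}


def length(fragment, relop, n):               # length({<,>,=}, N)
    return RELOPS[relop](len(fragment), n)


def some(fragment, feat, value):              # some token in F has feat = value
    return any(feat(tok) == value for tok in fragment)


def position(fragment, token, relop, n):      # token's offset from the start
    return RELOPS[relop](fragment.index(token), n)


def relpos(fragment, tok1, tok2, relop, n):   # distance between two tokens
    return RELOPS[relop](fragment.index(tok2) - fragment.index(tok1), n)


F = ["Dr", ".", "Jim", "Smith"]
print(length(F, "<", 5))                          # True
print(some(F, lambda t: t[0].isupper(), True))    # some capitalized token
print(position(F, "Jim", "=", 2))                 # True
print(relpos(F, "Jim", "Smith", "=", 1))          # True
```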