APPLICATIONS OF DATA MINING IN INFORMATION RETRIEVAL
Nearest Neighbor Classifiers
• Basic intuition: similar documents should have the same class label
• We can use the vector space model and cosine similarity
• Training is simple:
  • Remember the class value of the initial documents
  • Index them using an inverted index structure
• Testing is also simple:
  • Use each test document dt as a query
  • Fetch the k training documents most similar to it
  • Use majority voting to determine the class of dt
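• A minimal sketch of this train/test procedure in Python (illustrative, not from the slides): documents are assumed to be sparse term→weight dictionaries such as TF-IDF vectors, and a brute-force scan stands in for the inverted index lookups.

  import math
  from collections import Counter

  def cosine(a, b):
      # Cosine similarity between two sparse term->weight vectors.
      dot = sum(w * b.get(t, 0.0) for t, w in a.items())
      na = math.sqrt(sum(w * w for w in a.values()))
      nb = math.sqrt(sum(w * w for w in b.values()))
      return dot / (na * nb) if na and nb else 0.0

  def knn_classify(test_doc, training_docs, labels, k=5):
      # "Training" is just remembering the labeled vectors; at test
      # time, fetch the k most similar documents and take a majority vote.
      nearest = sorted(range(len(training_docs)),
                       key=lambda i: cosine(test_doc, training_docs[i]),
                       reverse=True)[:k]
      return Counter(labels[i] for i in nearest).most_common(1)[0][0]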
Nearest Neighbor Classifiers
• Instead of pure counts of classes, we can weight the votes by similarity:
  • If training document d has label cd, then cd accumulates a score of s(dq, d)
  • The class with maximum score is selected
• Per-class offsets bc could be used and tuned later on:

  score(c, dq) = bc + Σ{d ∈ kNN(dq), cd = c} s(dq, d)
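• A sketch of the similarity-weighted variant, reusing cosine from the sketch above; the offsets argument stands in for the per-class offsets bc and is purely illustrative.

  from collections import defaultdict

  def knn_classify_weighted(test_doc, training_docs, labels, k=5, offsets=None):
      # Each of the k nearest neighbors d with label c_d adds s(d_q, d)
      # to the score of c_d; per-class offsets b_c are added before
      # picking the class with the maximum score.
      sims = sorted(((cosine(test_doc, d), y)
                     for d, y in zip(training_docs, labels)), reverse=True)[:k]
      scores = defaultdict(float)
      for s, y in sims:
          scores[y] += s
      for c, b in (offsets or {}).items():
          scores[c] += b
      return max(scores, key=scores.get)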
Nearest Neighbor Classifiers
• Choosing the value of k:
  • Try various values of k, using a held-out portion of the documents for validation
  • Or cluster the documents and choose a value of k proportional to the size of the small clusters
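• A possible validation loop for the first option, reusing knn_classify from the first sketch; the candidate values of k are arbitrary.

  def choose_k(train_docs, train_labels, val_docs, val_labels,
               candidates=(1, 3, 5, 10, 20)):
      # Hold out a validation split and keep the k with the best accuracy.
      best_k, best_acc = None, -1.0
      for k in candidates:
          correct = sum(knn_classify(d, train_docs, train_labels, k) == y
                        for d, y in zip(val_docs, val_labels))
          acc = correct / len(val_docs)
          if acc > best_acc:
              best_k, best_acc = k, acc
      return best_k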
Nearest Neighbor Classifiers
• kNN is a lazy strategy compared to eager learners such as decision trees
• Advantages:
  • No training is needed to construct a model
  • When properly tuned for k and the per-class offsets bc, accuracy is comparable to other classifiers
• Disadvantages:
  • Classification may involve many inverted index lookups; scoring, sorting, and picking the best k results takes time (since k is small compared to the number of retrieved documents, such queries are called iceberg queries)
Nearest Neighbor Classifiers
• For better performance, some effort is spent during training:
  • Documents are clustered, and only a few statistical parameters are stored per cluster
  • A test document is first compared with the cluster representatives, then with the individual documents from the appropriate clusters
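• A sketch of this two-stage scheme, again reusing cosine and defaultdict from the sketches above; plain centroids stand in for the "few statistical parameters" per cluster, which is an assumption.

  def centroid(docs):
      # Mean vector of a cluster, stored as its representative.
      agg = defaultdict(float)
      for d in docs:
          for t, w in d.items():
              agg[t] += w
      return {t: w / len(docs) for t, w in agg.items()}

  def pruned_neighbors(test_doc, clusters, k=5, n_clusters=2):
      # clusters: list of (member_docs, member_labels) pairs.
      # First rank cluster representatives, then score only the members
      # of the top n_clusters clusters instead of the whole training set.
      reps = [centroid(docs) for docs, _ in clusters]
      best = sorted(range(len(clusters)),
                    key=lambda i: cosine(test_doc, reps[i]),
                    reverse=True)[:n_clusters]
      candidates = [(d, y) for i in best
                    for d, y in zip(clusters[i][0], clusters[i][1])]
      candidates.sort(key=lambda dy: cosine(test_doc, dy[0]), reverse=True)
      return candidates[:k]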
Measures of Accuracy
• We may have one of the following settings:
  • Each document is associated with exactly one class
  • Each document is associated with a subset of classes
• The ideas of precision and recall can be used to measure the accuracy of the classifier:
  • Calculate the average precision and recall over all the classes
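• A sketch of macro-averaged precision and recall for the single-label setting; all names are illustrative.

  def macro_precision_recall(true_labels, predicted_labels, classes):
      # Average the per-class precision and recall over all classes.
      precisions, recalls = [], []
      for c in classes:
          tp = sum(1 for t, p in zip(true_labels, predicted_labels) if t == c and p == c)
          fp = sum(1 for t, p in zip(true_labels, predicted_labels) if t != c and p == c)
          fn = sum(1 for t, p in zip(true_labels, predicted_labels) if t == c and p != c)
          precisions.append(tp / (tp + fp) if tp + fp else 0.0)
          recalls.append(tp / (tp + fn) if tp + fn else 0.0)
      return sum(precisions) / len(classes), sum(recalls) / len(classes)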
Hypertext Classification
• An HTML document can be thought of as a hierarchy of regions represented by a tree-structured Document Object Model (DOM): www.w3.org/DOM
• A DOM tree consists of:
  • Internal nodes: markup elements (tags)
  • Leaf nodes: segments of text
  • Hyperlinks to other nodes
Hypertext Classification
• An example DOM in XML format:

  <resume>
    <publication>
      <title> Statistical models for web-surfing </title>
    </publication>
    <hobbies>
      <item> Wind-surfing </item>
    </hobbies>
  </resume>

• It is important to distinguish the two occurrences of the term "surfing", which can be achieved by prefixing the term with the sequence of tags on its path in the DOM tree:
  • resume.publication.title.surfing
  • resume.hobbies.item.surfing
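• A small sketch of how such tag-path features could be generated with Python's standard xml.etree.ElementTree; the letters-only tokenizer is an arbitrary choice.

  import re
  import xml.etree.ElementTree as ET

  def tag_path_terms(xml_text):
      # Prefix each text token with the sequence of tags on its DOM path,
      # so the two occurrences of "surfing" become distinct features.
      features = []
      def walk(node, path):
          path = path + [node.tag]
          if node.text:
              for tok in re.findall(r"[a-z]+", node.text.lower()):
                  features.append(".".join(path) + "." + tok)
          for child in node:
              walk(child, path)
      walk(ET.fromstring(xml_text), [])
      return features

  doc = ("<resume><publication><title>Statistical models for web-surfing"
         "</title></publication><hobbies><item>Wind-surfing</item></hobbies></resume>")
  print(tag_path_terms(doc))
  # ['resume.publication.title.statistical', ..., 'resume.publication.title.surfing',
  #  'resume.hobbies.item.wind', 'resume.hobbies.item.surfing']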
Hypertext Classification
• Use relations to give meaning to textual features, such as:
  • contains-text(domNode, term)
  • part-of(domNode1, domNode2)
  • tagged(domNode, tagName)
  • links-to(srcDomNode, dstDomNode)
  • contains-anchor-text(srcDomNode, dstDomNode, term)
  • classified(domNode, label)
• Discover rules from collections of relations, such as:
  • classified(A, facultyPage) :- contains-text(A, professor), contains-text(A, phd), links-to(B, A), contains-text(B, faculty)
  • where ":-" means "if" and the comma stands for conjunction
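• A toy illustration of evaluating such a rule, with the relations stored as Python sets; the page names and facts are made up.

  # Rule: classified(A, facultyPage) :- contains-text(A, professor),
  #   contains-text(A, phd), links-to(B, A), contains-text(B, faculty)
  contains_text = {("pageA", "professor"), ("pageA", "phd"),
                   ("pageB", "faculty")}
  links_to = {("pageB", "pageA")}

  def is_faculty_page(a):
      # The rule body is a conjunction; links-to(B, A) is satisfied if
      # some page B that links to A also contains the term "faculty".
      return ((a, "professor") in contains_text and
              (a, "phd") in contains_text and
              any((b, "faculty") in contains_text
                  for (b, tgt) in links_to if tgt == a))

  print(is_faculty_page("pageA"))  # True for this toy data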
Hypertext Classification
• Rule induction in a two-class setting:
• FOIL (First Order Inductive Learner; Quinlan, 1993):
  • A greedy algorithm that learns rules to distinguish positive examples from negative ones
  • Repeatedly searches for the current best rule and removes all the positive examples covered by it, until all the positive examples in the data set are covered
  • Tries to maximize the gain of adding literal p to rule r
  • P is the set of positive and N the set of negative examples satisfying r; when p is added to r, there are P* positive and N* negative examples satisfying the new rule:

  gain(p, r) = |P*| × ( log2( |P*| / (|P*| + |N*|) ) − log2( |P| / (|P| + |N|) ) )
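• The gain computation as code, following the formula above; how candidate literals are enumerated and how the counts are obtained is left out.

  import math

  def foil_gain(p_count, n_count, p_star, n_star):
      # Gain of adding a literal: the p_star positives still covered,
      # weighted by how much the positive proportion improved.
      if p_star == 0:
          return 0.0
      before = math.log2(p_count / (p_count + n_count))
      after = math.log2(p_star / (p_star + n_star))
      return p_star * (after - before)

  print(foil_gain(100, 100, 60, 10))  # keeps 60 positives, sheds most negatives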
Hypertext Classification
• The FOIL covering loop:

  Let R be the set of rules learned, initially empty
  while D+ != EmptySet do  // learn a new rule
      Let r be the new rule, initially "true" (the empty conjunction)
      while some d in D- satisfies r
          // add a new, possibly negated, literal to r to specialize it
          Add the "best possible" literal p as a conjunct to r
      endwhile
      R <- R U {r}
      Remove from D+ all instances for which r evaluates to true
  endwhile
  return R
Hypertext Classification
• Types of literals explored by FOIL:
  • Xi = Xj, Xi = c, Xi > Xj, etc., where Xi and Xj are variables and c is a constant
  • Q(X1, X2, ..., Xk), where Q is a relation and the Xi are variables
  • not(L), where L is a literal of one of the above forms
Hypertext Classification
• With relational learning, we can learn class labels for individual pages, as well as relationships between them:
  • member(homePage, department)
  • teaches(homePage, coursePage)
  • advises(homePage, homePage)
• We can also incorporate other classifiers, such as naïve Bayes, into rule learning
Retrieval Utilities
• Relevance feedback
• Clustering
• Passage-based retrieval
• Parsing
• N-grams
• Thesauri
• Semantic networks
• Regression analysis
Relevance Feedback
• Do the retrieval in multiple steps
• The user refines the query at each step based on the results of the previous queries
• The user tells the IR system which documents are relevant
• New terms are added to the query based on the feedback
• Term weights may be updated based on the user feedback
Relevance Feedback
• Bypass the user entirely (pseudo-relevance feedback):
  • Assume the top-k results in the ranked list are relevant
  • Modify the original query as before
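• A sketch of this pseudo-relevance feedback step over sparse term→weight vectors; k, m, and the mixing weight are arbitrary choices.

  from collections import defaultdict

  def pseudo_feedback(query_vec, ranked_docs, k=10, m=5, weight=0.5):
      # Assume the top-k retrieved documents are relevant, pick the m
      # highest-weight terms from them, and add those terms to the query.
      agg = defaultdict(float)
      for d in ranked_docs[:k]:
          for t, w in d.items():
              agg[t] += w
      expansion = sorted(agg, key=agg.get, reverse=True)[:m]
      new_q = dict(query_vec)
      for t in expansion:
          new_q[t] = new_q.get(t, 0.0) + weight * agg[t] / k
      return new_q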
Relevance Feedback
• Example: "find information surrounding the various conspiracy theories about the assassination of John F. Kennedy" (example from the textbook)
• If a highly ranked document contains the term "Oswald", then this term should be added to the initial query
• If the term "assassination" appears in a top-ranked document, then its weight should be increased
Relevance Feedback in Vector Space Model
• Q is the original query
• R is the set of relevant and S the set of irrelevant documents selected by the user
• |R| = n1, |S| = n2
• In general, the modified query is

  Q' = α·Q + (β/n1)·Σ{d ∈ R} d − (γ/n2)·Σ{d ∈ S} d

• The weights α, β, γ are referred to as Rocchio weights
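• A sketch of the Rocchio update over sparse term→weight vectors; the default weights α=1, β=0.75, γ=0.15 are commonly cited values, not from the slides.

  from collections import defaultdict

  def rocchio(query_vec, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
      # Q' = alpha*Q + (beta/n1) * sum of relevant docs
      #              - (gamma/n2) * sum of irrelevant docs
      new_q = defaultdict(float)
      for t, w in query_vec.items():
          new_q[t] += alpha * w
      for d in relevant:
          for t, w in d.items():
              new_q[t] += beta * w / len(relevant)
      for d in irrelevant:
          for t, w in d.items():
              new_q[t] -= gamma * w / len(irrelevant)
      # Negative weights are usually clipped to zero.
      return {t: w for t, w in new_q.items() if w > 0}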
Relevance Feedback in Vector Space Model
• What if the original query retrieves only non-relevant documents (as determined by the user)?
• Then increase the weight of the most frequently occurring term in the document collection
Relevance Feedback in Vector Space Model
• Result-set clustering can be used as a utility for relevance feedback
• Hierarchical clustering can be used for that purpose, with the distance defined by cosine similarity
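• A sketch of such clustering using SciPy's hierarchical clustering with cosine distance, assuming a dense TF-IDF matrix for the retrieved result set.

  from scipy.cluster.hierarchy import fcluster, linkage
  from scipy.spatial.distance import pdist

  def cluster_results(doc_matrix, n_clusters=3):
      # doc_matrix: one dense TF-IDF row per retrieved document.
      # Average-link hierarchical clustering with cosine distance
      # (distance = 1 - cosine similarity); returns a cluster id per doc.
      dists = pdist(doc_matrix, metric="cosine")
      tree = linkage(dists, method="average")
      return fcluster(tree, t=n_clusters, criterion="maxclust")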