
APPLICATIONS OF DATA MINING IN INFORMATION RETRIEVAL


Presentation Transcript


  1. APPLICATIONS OF DATA MINING IN INFORMATION RETRIEVAL

  2. Nearest Neighbor Classifiers
  • Basic intuition: similar documents should have the same class label
  • We can use the vector space model and cosine similarity
  • Training is simple:
    • Remember the class labels of the training documents
    • Index them using an inverted index structure
  • Testing is also simple (a small sketch follows below):
    • Use each test document dt as a query
    • Fetch the k training documents most similar to it
    • Use majority voting among their labels to determine the class of dt
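A minimal sketch of this train/test cycle, assuming scikit-learn's TfidfVectorizer for the vector space model and cosine similarity; the toy documents, labels, and library choice are illustrative assumptions, not part of the slides:

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# "Training": vectorize the documents and remember their class labels (toy data).
train_docs = ["stock market crash", "league championship game", "bond yields rise"]
train_labels = ["finance", "sports", "finance"]
vectorizer = TfidfVectorizer()
train_vecs = vectorizer.fit_transform(train_docs)

def knn_classify(test_doc, k=2):
    # Use the test document as a query against the training collection.
    query_vec = vectorizer.transform([test_doc])
    sims = cosine_similarity(query_vec, train_vecs)[0]
    # Fetch the k training documents most similar to the query.
    top_k = sims.argsort()[::-1][:k]
    # Majority voting over their class labels decides the class of the test document.
    votes = Counter(train_labels[i] for i in top_k)
    return votes.most_common(1)[0][0]

print(knn_classify("market yields"))  # expected to print "finance"
```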

  3. Nearest Neighbor Classifiers
  • Instead of pure counts of classes, we can weight the votes by similarity:
    • If training document d has label cd, then cd accumulates a score of s(dq, d)
    • The class with the maximum score is selected
  • Per-class offsets bc can be added and tuned later on, giving a score of the form score(c, dq) = bc + Σ s(dq, d), where the sum runs over the near neighbors d of dq whose label is c (a sketch follows below)
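A sketch of the similarity-weighted vote with per-class offsets; the neighbor list and the offset values bc are illustrative assumptions (in practice the neighbors would come from a kNN lookup like the one sketched above):

```python
from collections import defaultdict

def weighted_classify(neighbors, offsets):
    """neighbors: (similarity, class_label) pairs for the k nearest training docs.
    offsets: per-class offsets b_c, assumed to be tuned on held-out data."""
    scores = defaultdict(float, offsets)
    for sim, label in neighbors:
        scores[label] += sim            # class c_d accumulates the score s(d_q, d)
    return max(scores, key=scores.get)  # the class with the maximum score wins

print(weighted_classify([(0.9, "finance"), (0.4, "sports"), (0.3, "sports")],
                        {"finance": 0.0, "sports": 0.1}))  # "finance"
```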

  4. Nearest Neighbor Classifiers
  • Choosing the value of k:
    • Try various values of k and use a portion of the documents for validation (see the sketch below)
    • Cluster the documents and choose a value of k proportional to the size of small clusters
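A small sketch of the validation-based choice of k; the classify argument is assumed to be a kNN routine like the one sketched earlier, and the candidate values of k are arbitrary:

```python
def choose_k(train, validation, classify, candidate_ks=(1, 3, 5, 9)):
    """train/validation: lists of (document, label) pairs.
    classify(doc, train, k): assumed kNN routine returning a predicted label."""
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        correct = sum(classify(doc, train, k) == label for doc, label in validation)
        accuracy = correct / len(validation)
        if accuracy > best_acc:
            best_k, best_acc = k, accuracy
    return best_k
```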

  5. Nearest Neighbor Classifiers
  • kNN is a lazy strategy compared to eager learners such as decision trees
  • Advantages
    • No training is needed to construct a model
    • When k and the per-class offsets bc are properly tuned, accuracy is comparable to other classifiers
  • Disadvantages
    • Classification may involve many inverted index lookups, and scoring, sorting, and picking the best k results takes time (since k is small compared to the number of documents retrieved, such queries are called iceberg queries)

  6. Nearest Neighbor Classifiers
  • For better performance, some effort can be spent during training
    • Documents are clustered, and only a few statistical parameters are stored per cluster
    • A test document is first compared with the cluster representatives, and then with the individual documents from the most appropriate clusters (a sketch follows below)
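A rough sketch of this two-stage search, using k-means centroids as the per-cluster statistics; the number of clusters and the restriction to the single best cluster are simplifying assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def build_clusters(train_vecs, n_clusters=10):
    # "Training" effort: cluster the documents and keep only the centroids
    # (the statistical parameters stored per cluster) plus the cluster membership.
    return KMeans(n_clusters=n_clusters, n_init=10).fit(train_vecs)

def two_stage_neighbors(query_vec, train_vecs, km, k=5):
    # Stage 1: compare the test document with the cluster representatives.
    sims_to_centroids = cosine_similarity(query_vec, km.cluster_centers_)[0]
    members = np.where(km.labels_ == int(np.argmax(sims_to_centroids)))[0]
    # Stage 2: compare only with the individual documents in the chosen cluster.
    sims = cosine_similarity(query_vec, train_vecs[members])[0]
    return members[sims.argsort()[::-1][:k]]
```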

  7. Measures of accuracy
  • Two settings are common:
    • Each document is associated with exactly one class
    • Each document is associated with a subset of classes
  • The notions of precision and recall can be used to measure the accuracy of the classifier
    • Calculate precision and recall per class and average them over all classes (a sketch follows below)
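A sketch of class-averaged (macro) precision and recall for the one-class-per-document setting; scikit-learn's metrics are one convenient way to compute this, and the labels below are toy data:

```python
from sklearn.metrics import precision_score, recall_score

y_true = ["finance", "sports", "finance", "politics"]
y_pred = ["finance", "finance", "finance", "politics"]

# "macro" averaging: compute precision/recall per class, then average over classes.
print(precision_score(y_true, y_pred, average="macro", zero_division=0))
print(recall_score(y_true, y_pred, average="macro", zero_division=0))
```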

  8. Hypertext Classification
  • An HTML document can be thought of as a hierarchy of regions represented by a tree-structured Document Object Model (DOM), see www.w3.org/DOM
  • A DOM tree consists of:
    • Internal nodes: HTML elements (tags)
    • Leaf nodes: segments of text
    • Hyperlinks to other nodes

  9. Hypertext Classification
  • An example DOM in XML format:

    <resume>
      <publication>
        <title> Statistical models for web-surfing </title>
      </publication>
      <hobbies>
        <item> Wind-surfing </item>
      </hobbies>
    </resume>

  • It is important to distinguish the two occurrences of the term "surfing", which can be achieved by prefixing the term with the sequence of tags on its path in the DOM tree (a sketch follows below):
    • resume.publication.title.surfing
    • resume.hobbies.item.surfing
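A small sketch of producing tag-path-prefixed terms from the XML above with Python's standard ElementTree; the whitespace tokenization is a simplifying assumption:

```python
import xml.etree.ElementTree as ET

xml = """<resume>
  <publication><title>Statistical models for web-surfing</title></publication>
  <hobbies><item>Wind-surfing</item></hobbies>
</resume>"""

def path_prefixed_terms(node, path=()):
    # Prefix every text term with the sequence of tags on its DOM path.
    path = path + (node.tag,)
    if node.text and node.text.strip():
        for term in node.text.lower().split():
            yield ".".join(path) + "." + term
    for child in node:
        yield from path_prefixed_terms(child, path)

for feature in path_prefixed_terms(ET.fromstring(xml)):
    print(feature)  # e.g. resume.publication.title.web-surfing, resume.hobbies.item.wind-surfing
```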

  10. Hypertext Classification
  • Use relations to give meaning to textual features, such as:
    • contains-text(domNode, term)
    • part-of(domNode1, domNode2)
    • tagged(domNode, tagName)
    • links-to(srcDomNode, dstDomNode)
    • contains-anchor-text(srcDomNode, dstDomNode, term)
    • classified(domNode, label)
  • Discover rules from collections of such relations, for example:
    • classified(A, facultyPage) :- contains-text(A, professor), contains-text(A, phd), links-to(B, A), contains-text(B, faculty)
    • where ":-" means "if" and the comma stands for conjunction

  11. Hypertext Classification
  • Rule induction in the two-class setting
  • FOIL (First Order Inductive Learner, Quinlan, 1993)
    • A greedy algorithm that learns rules to distinguish positive examples from negative ones
    • Repeatedly searches for the current best rule and removes all positive examples covered by that rule, until all positive examples in the data set are covered
    • Tries to maximize the gain of adding a literal p to rule r
      • P is the set of positive and N is the set of negative examples satisfying r
      • When p is added to r, there are P* positive and N* negative examples satisfying the new rule (a sketch of the gain computation follows below)
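The slide's gain formula is not reproduced in the transcript; the commonly cited FOIL gain, stated here as an assumption, is gain(p) = |P*| · (log2(|P*| / (|P*| + |N*|)) − log2(|P| / (|P| + |N|))). A small sketch:

```python
from math import log2

def foil_gain(p_before, n_before, p_after, n_after):
    """FOIL gain of adding literal p to rule r.
    p_before, n_before: positive/negative examples satisfying r.
    p_after,  n_after:  positive/negative examples satisfying r AND p."""
    if p_after == 0 or p_before == 0:
        return float("-inf")   # a specialization covering no positives is useless
    return p_after * (log2(p_after / (p_after + n_after))
                      - log2(p_before / (p_before + n_before)))

print(foil_gain(p_before=10, n_before=10, p_after=6, n_after=1))  # about 4.67
```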

  12. Hypertext Classification

    Let R be the set of rules learned, initially empty
    while D+ != EmptySet do            // learn a new rule
        let r be a new rule, initialized to true
        while some d in D- satisfies r do
            // add a new, possibly negated, literal to r to specialize it
            add the "best possible" literal p as a conjunct to r
        endwhile
        R <- R U {r}
        remove from D+ all instances for which r evaluates to true
    endwhile
    return R

  13. Hypertext Classification
  • The covering loop on the previous slide specializes each rule by adding literals; the types of literals explored are:
    • Xi = Xj, Xi = c, Xi > Xj, etc., where Xi and Xj are variables and c is a constant
    • Q(X1, X2, ..., Xk), where Q is a relation and the Xi are variables
    • not(L), where L is a literal of the above forms
  • A runnable sketch of the covering loop follows below
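A runnable sketch of the covering loop from slide 12, reusing the foil_gain function sketched above; literals are represented as plain predicate functions over propositional examples, and the toy data are assumptions (a full FOIL implementation would search relational literals with variables, as listed above):

```python
def covers(rule, example):
    # An example satisfies a rule if every literal in the conjunction holds.
    return all(literal(example) for literal in rule)

def learn_rules(positives, negatives, candidate_literals):
    """Greedy covering: keep learning rules until every positive example is covered."""
    rules, pos_left = [], list(positives)
    while pos_left:                       # learn a new rule
        rule, pos, neg = [], list(pos_left), list(negatives)
        while neg:                        # specialize while some negative satisfies r
            best = max(candidate_literals,
                       key=lambda lit: foil_gain(len(pos), len(neg),
                                                 sum(lit(x) for x in pos),
                                                 sum(lit(x) for x in neg)))
            rule.append(best)             # add the "best possible" literal as a conjunct
            pos = [x for x in pos if best(x)]
            neg = [x for x in neg if best(x)]
        rules.append(rule)
        pos_left = [x for x in pos_left if not covers(rule, x)]  # drop covered positives
    return rules

# Toy usage: the target concept is "even number"; one one-literal rule should suffice.
is_even, greater_than_2 = (lambda x: x % 2 == 0), (lambda x: x > 2)
rules = learn_rules([4, 6, 8], [1, 3, 5, 7], [is_even, greater_than_2])
print(len(rules), len(rules[0]))  # 1 1
```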

  14. Hypertext Classification
  • With relational learning, we can learn class labels for individual pages as well as relationships between them, e.g.:
    • Member(homePage, department)
    • Teaches(homePage, coursePage)
    • Advises(homePage, homePage)
  • We can also incorporate other classifiers, such as naïve Bayes, into rule learning

  15. RETRIEVAL UTILITIES

  16. Retrieval Utilities
  • Relevance feedback
  • Clustering
  • Passage-based Retrieval
  • Parsing
  • N-grams
  • Thesauri
  • Semantic Networks
  • Regression Analysis

  17. Relevance Feedback
  • Do the retrieval in multiple steps
  • User refines the query at each step with respect to the results of the previous queries
  • User tells the IR system which documents are relevant
  • New terms are added to the query based on the feedback
  • Term weights may be updated based on the user feedback

  18. Relevance Feedback
  • Bypass the user for relevance feedback by:
    • Assuming the top-k results in the ranked list are relevant
    • Modifying the original query as before (a sketch follows below)
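A small sketch of this step using simple frequency-based query expansion; the toy ranked documents, the choice of k, and the number of expansion terms are assumptions:

```python
from collections import Counter

def pseudo_relevance_feedback(query_terms, ranked_docs, k=3, n_expansion_terms=5):
    """Assume the top-k retrieved documents are relevant and add their most
    frequent terms to the query before re-running the retrieval."""
    counts = Counter()
    for doc in ranked_docs[:k]:
        counts.update(doc.lower().split())
    expansion = [t for t, _ in counts.most_common() if t not in query_terms]
    return list(query_terms) + expansion[:n_expansion_terms]

print(pseudo_relevance_feedback(["kennedy", "assassination"],
                                ["oswald shot kennedy in dallas",
                                 "the kennedy assassination and oswald",
                                 "conspiracy theories about dallas"]))
```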

  19. Relevance Feedback
  • Example: "find information surrounding the various conspiracy theories about the assassination of John F. Kennedy" (example from your textbook)
  • If a highly ranked document contains the term "Oswald", then this term should be added to the initial query
  • If the term "assassination" appears in a top-ranked document, then its weight should be increased

  20. Relevance Feedback in Vector Space Model
  • Q is the original query
  • R is the set of relevant and S is the set of irrelevant documents selected by the user
  • |R| = n1, |S| = n2

  21. Relevance Feedback in Vector Space Model
  • Q is the original query
  • R is the set of relevant and S is the set of irrelevant documents selected by the user
  • |R| = n1, |S| = n2
  • In general, the modified query has the form Q' = α·Q + (β/n1)·Σ_{d∈R} d − (γ/n2)·Σ_{d∈S} d, where the weights α, β, γ are referred to as Rocchio weights (a sketch follows below)
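A sketch of the general Rocchio update on term-weight vectors; the α, β, γ values are illustrative defaults and NumPy arrays stand in for the vector space representation:

```python
import numpy as np

def rocchio(query_vec, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Q' = alpha*Q + (beta/n1) * sum(relevant) - (gamma/n2) * sum(non_relevant)."""
    q_new = alpha * query_vec
    if len(relevant):
        q_new = q_new + (beta / len(relevant)) * np.sum(relevant, axis=0)
    if len(non_relevant):
        q_new = q_new - (gamma / len(non_relevant)) * np.sum(non_relevant, axis=0)
    return np.maximum(q_new, 0.0)   # negative term weights are commonly clipped to zero

q = np.array([1.0, 0.0, 0.5])                       # original query Q
rel = np.array([[0.8, 0.6, 0.0], [0.9, 0.4, 0.1]])  # R, n1 = 2
nonrel = np.array([[0.0, 0.0, 1.0]])                # S, n2 = 1
print(rocchio(q, rel, nonrel))
```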

  22. Relevance Feedback in Vector Space Model
  • What if the original query retrieves only non-relevant documents (as determined by the user)?
  • Then increase the weight of the most frequently occurring term in the document collection

  23. Relevance Feedback in Vector Space Model
  • Result-set clustering can be used as a utility for relevance feedback
  • Hierarchical clustering can be used for this purpose, with the distance between documents derived from cosine similarity (a sketch follows below)
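A sketch of hierarchically clustering a retrieved result set with cosine-based distance, using SciPy's agglomerative clustering; the TF-IDF representation, linkage method, and cut threshold are assumptions:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.feature_extraction.text import TfidfVectorizer

results = ["kennedy assassination conspiracy", "oswald kennedy dallas",
           "kennedy space center launch", "apollo launch kennedy space"]

vecs = TfidfVectorizer().fit_transform(results).toarray()
# Agglomerative clustering where distance = 1 - cosine similarity.
tree = linkage(vecs, method="average", metric="cosine")
labels = fcluster(tree, t=0.8, criterion="distance")
print(labels)  # documents sharing a label end up in the same cluster
```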
