Information Filtering

LBSC 796/INFM 718R, Session 9, October 26, 2011


Presentation Transcript


  1. Information Filtering LBSC 796/INFM 718R Session 9 October 26, 2011

  2. Information Access Problems [Diagram: Retrieval, Filtering, and Data Mining arranged by whether the Information Need and the Collection are each Stable or Different Each Time]

  3. Agenda • Information Filtering • Classification for Filtering • Netflix Prize • Project

  4. Information Filtering [Diagram with components: New Documents, Matching, Recommendation, Rating, User Profile]

  5. Standing Queries
     • Use any information retrieval system
       • Boolean, vector space, probabilistic, …
     • Have the user specify a “standing query”
       • This will be the profile
     • Limit the standing query by date
       • Each use, show what arrived since the last use
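
A minimal sketch of the standing-query idea, assuming a simple Boolean AND matcher in place of a real retrieval system. The `Document` class, the function names, and the sample documents are all hypothetical and only illustrate "show what arrived since the last use":

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Document:
    doc_id: str
    arrived: datetime
    text: str


def matches(query_terms, text):
    """Boolean AND match: every query term must appear in the text."""
    words = set(text.lower().split())
    return all(term in words for term in query_terms)


def run_standing_query(query_terms, documents, last_use):
    """Return ids of matching documents that arrived after the last use."""
    return [d.doc_id for d in documents
            if d.arrived > last_use and matches(query_terms, d.text)]


# Made-up documents; the standing query (the profile) stays fixed between uses.
docs = [
    Document("d1", datetime(2011, 10, 20), "information filtering with profiles"),
    Document("d2", datetime(2011, 10, 25), "adaptive information filtering"),
    Document("d3", datetime(2011, 10, 25), "movie ratings"),
]
new_hits = run_standing_query(["information", "filtering"], docs,
                              last_use=datetime(2011, 10, 22))
print(new_hits)  # ['d2']
```

Here d1 matches the query but arrived before the last use, so only d2 is shown.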

  6. Information Filtering
     • An abstract problem in which:
       • The information need is stable
         • Characterized by a “profile”
       • A stream of documents is arriving
         • Each must either be presented to the user or not
     • Introduced by Luhn in 1958
       • As “Selective Dissemination of Information”
     • Named “Filtering” by Denning in 1975

  7. What’s Wrong With That?
     • Unnecessary indexing overhead
       • Indexing only speeds up retrospective searches
     • Every profile is treated separately
       • The same work might be done repeatedly
     • Forming effective queries by hand is hard
       • The computer might be able to help
     • It is OK for text, but what about audio, video, …
       • Are words the only possible basis for filtering?

  8. Stream Search: “Fast Data Finder”
     • Boolean filtering using custom hardware
       • Up to 10,000 documents per second (in 1996!)
     • Words pass through a pipeline architecture
       • Each element looks for one word
     [Figure: pipeline elements for the words good, great, party, aid, combined with NOT, OR, AND]
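
A software analogy for the pipeline, offered only as a sketch: each detector watches the passing word stream for one word, and a Boolean expression combines the resulting flags. The function name and the particular expression below are assumptions; the slide's figure lists the words and operators but not how they were combined.

```python
def stream_match(words, detector_words, combine):
    """Pass each word through the detectors once, then combine their flags."""
    seen = {w: False for w in detector_words}
    for word in words:          # single pass over the document stream
        if word in seen:        # the detector for this word fires
            seen[word] = True
    return combine(seen)


# Hypothetical profile: (good OR great) AND party AND NOT aid
def profile(seen):
    return (seen["good"] or seen["great"]) and seen["party"] and not seen["aid"]


detectors = ["good", "great", "party", "aid"]
hit = stream_match("a great party last night".split(), detectors, profile)
miss = stream_match("good party aid".split(), detectors, profile)
print(hit, miss)  # True False
```

No index is built: the document is read once and discarded, which is the point of stream search.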

  9. Profile Indexing (SIFT)
     • Build an inverted file of profiles
       • Postings are profiles that contain each term
     • RAM can hold 5 million profiles/GB
       • And several machines could run in parallel
     • Both Boolean and vector space matching
       • User-selected threshold for each ranked profile
       • Hand-tuned on a web page using today’s news
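
A minimal sketch of profile indexing: the inverted file maps each term to the profiles that contain it, so an arriving document only touches profiles that share a term with it. The scoring rule (fraction of profile terms found in the document), the function names, and the sample profiles are assumptions for illustration, not SIFT's actual matching.

```python
from collections import defaultdict


def build_profile_index(profiles):
    """Map each term to the set of profile ids that contain it."""
    index = defaultdict(set)
    for profile_id, terms in profiles.items():
        for term in terms:
            index[term].add(profile_id)
    return index


def match_document(index, profiles, text, threshold):
    """Return profiles whose share of terms found in the document meets the threshold."""
    overlap = defaultdict(int)
    for term in set(text.lower().split()):
        for profile_id in index.get(term, ()):
            overlap[profile_id] += 1
    return sorted(pid for pid, n in overlap.items()
                  if n / len(profiles[pid]) >= threshold)


# Made-up profiles for two hypothetical users.
profiles = {"alice": {"netflix", "prize"}, "bob": {"svm", "margin", "kernel"}}
index = build_profile_index(profiles)
matched = match_document(index, profiles, "the netflix prize was awarded", 0.5)
print(matched)  # ['alice']
```

The threshold here plays the role of the user-selected cutoff; as a later slide notes, a good value is hard to pick by hand.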

  10. Profile Indexing Limitations
     • Privacy
       • Central profile registry, associated with known users
     • Usability
       • Manual profile creation is time consuming
       • May not be kept up to date
       • Threshold values vary by topic and lack “meaning”

  11. Vector space example: query “canine” (1) (Source: Fernando Díaz)

  12. Similarity of docs to query “canine” (Source: Fernando Díaz)
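
The figures for this example did not survive in the transcript. As a stand-in, here is a small sketch of vector space similarity using cosine over raw term counts. The three documents are made up, and cosine with term-frequency weights is an assumption about the similarity measure, not something the slide states.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent text as a sparse term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = vectorize("canine")
docs = {
    "d1": vectorize("canine teeth"),
    "d2": vectorize("feline behavior"),
    "d3": vectorize("canine canine training"),
}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # ['d3', 'd1', 'd2']
```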

  13. User feedback: Select relevant documents (Source: Fernando Díaz)

  14. Results after relevance feedback (Source: Fernando Díaz)

  15. Adaptive Content-Based Filtering [Flow diagram. Steps: Make Document Vectors, Compute Similarity, Select and Examine (user), Assign Ratings (user), Update User Model, Make Profile Vector. Data: New Documents, Initial Profile Terms, Documents, Vectors, Rank Order, Ratings]
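
One way to picture the "Update User Model" step is a Rocchio-style update, sketched below. The slide does not name an update rule, so Rocchio is a swapped-in technique here, and the function name, the `rate` parameter, and the sample vectors are assumptions.

```python
def update_profile(profile, doc_vector, rating, rate=0.5):
    """Move the profile toward a liked document (rating=+1) or away from a disliked one (rating=-1)."""
    terms = set(profile) | set(doc_vector)
    updated = {t: profile.get(t, 0.0) + rate * rating * doc_vector.get(t, 0.0)
               for t in terms}
    # Drop non-positive weights, a common simplification in Rocchio-style feedback.
    return {t: w for t, w in updated.items() if w > 0}


profile = {"canine": 1.0}                       # initial profile terms
profile = update_profile(profile, {"canine": 1.0, "dog": 1.0}, rating=+1)
profile = update_profile(profile, {"canine": 1.0, "teeth": 1.0}, rating=-1)
print(profile == {"canine": 1.0, "dog": 0.5})   # True
```

After one positive and one negative rating, the profile has picked up "dog" from the liked document and never admits "teeth" from the disliked one.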

  16. Latent Semantic Indexing [Flow diagram. Steps: Make Document Vectors, Reduce Dimensions, Compute Similarity, Select and Examine (user), Assign Ratings (user), Update User Model, Make Profile Vector, SVD. Data: New Documents, Representative Documents, Initial Profile Terms, Sparse Vectors, Dense Vectors, Matrix, Rank Order, Ratings]

  17. Singular Value Decomposition
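
Only the title of this slide survives, so the following is the standard textbook form rather than the slide's own notation. All symbols are assumptions: $A$ is the term-by-document matrix built from the representative documents, $k$ is the number of dimensions kept, and $d$ is the sparse vector of a new document or profile.

```latex
A = U \Sigma V^{\mathsf{T}}, \qquad
A \approx A_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad
\hat{d} = \Sigma_k^{-1} U_k^{\mathsf{T}} d
```

The last expression is the usual "fold-in" step: it maps a sparse vector into the $k$-dimensional dense space, which corresponds to the "Reduce Dimensions" boxes in the Latent Semantic Indexing diagram.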

  18. Agenda • Information Filtering • Classification for Filtering • Netflix Prize • Project

  19. Statistical Classification
     • Represent documents as vectors
       • Usual approach based on TF, IDF, Length
     • Build statistical models of relevant and non-relevant documents
       • e.g., (mixture of) Gaussian distributions
     • Find a surface separating the distributions
       • e.g., a hyperplane
     • Rank documents by distance from that surface
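
A minimal sketch of the last step, ranking by distance from a separating hyperplane. The hyperplane `w`, `b` and the document vectors are made-up numbers; in practice they would come from a trained classifier and a TF/IDF weighting.

```python
import math


def signed_distance(w, b, x):
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


# Hypothetical hyperplane; the positive side is taken to be "relevant".
w, b = [3.0, 4.0], -5.0
docs = {"d1": [3.0, 4.0], "d2": [1.0, 0.5], "d3": [0.0, 0.0]}
ranked = sorted(docs, key=lambda d: signed_distance(w, b, docs[d]), reverse=True)
print(ranked)  # ['d1', 'd2', 'd3']
```

Documents deep on the relevant side rank first, those on the boundary sit in the middle, and those on the non-relevant side rank last.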

  20. Linear Separators • Which of the linear separators is optimal? Original from Ray Mooney

  21. Maximum Margin Classification • Implies that only support vectors matter; other training examples are ignorable. Original from Ray Mooney

  22. Soft Margin SVM [Figure labels: slack variables ξi] Original from Ray Mooney
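
The figure is lost, but the slack variables $\xi_i$ it labels belong to the standard soft-margin formulation, given here for reference. The symbols $w$, $b$, $C$, $x_i$, and $y_i \in \{-1, +1\}$ follow the usual textbook convention and are assumptions, since the slide text shows only $\xi_i$.

```latex
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0
```

Each $\xi_i$ measures how far training example $i$ falls short of the margin, and $C$ trades margin width against those violations.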

  23. Non-linear SVMs Φ: x→φ(x) Original from Ray Mooney
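
A small sketch of what the mapping φ buys: one-dimensional points that no single threshold can separate become linearly separable after mapping x to (x, x²). The data points, the particular map, and the separating line are made up for illustration.

```python
def phi(x):
    """Map a 1-D point into 2-D feature space: x -> (x, x^2)."""
    return (x, x * x)


# Made-up 1-D data: the positive class lies on both sides of the negative class.
points = {-2.0: 1, -0.5: -1, 0.5: -1, 2.0: 1}


def predict(x):
    """Linear separator in the mapped space: the line x2 = 1."""
    _, x2 = phi(x)
    return 1 if x2 - 1.0 > 0 else -1


all_correct = all(predict(x) == label for x, label in points.items())
print(all_correct)  # True
```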

  24. Training Strategies
     • Overtraining can hurt performance
       • Performance on training data rises and plateaus
       • Performance on new data rises, then falls
     • One strategy is to learn less each time
       • But it is hard to guess the right learning rate
     • Usual approach: Split the training set
       • Training, DevTest for finding “new data” peak
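
A minimal sketch of using a DevTest split to find the "new data" peak: train one pass at a time, score on DevTest, and keep the model from the best pass. The function names, the `patience` parameter, and the toy score curve are assumptions; the stand-in "model" is just the epoch number.

```python
def train_with_early_stopping(train_one_epoch, devtest_score, max_epochs, patience=2):
    """Keep the model with the best DevTest score; stop once it has not improved for `patience` epochs."""
    best_score, best_epoch, best_model = float("-inf"), -1, None
    for epoch in range(max_epochs):
        model = train_one_epoch(epoch)
        score = devtest_score(model)
        if score > best_score:
            best_score, best_epoch, best_model = score, epoch, model
        elif epoch - best_epoch >= patience:
            break
    return best_model, best_epoch, best_score


# Toy stand-ins: a made-up DevTest curve that rises, then falls.
devtest_curve = [0.60, 0.70, 0.75, 0.74, 0.72, 0.65]
model, epoch, score = train_with_early_stopping(
    train_one_epoch=lambda e: e,
    devtest_score=lambda m: devtest_curve[m],
    max_epochs=6,
)
print(epoch, score)  # 2 0.75
```

Training stops at the fifth pass, two passes after the peak, and the model from the third pass (epoch 2) is the one kept.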

  25. Agenda • Information Filtering • Classification for Filtering • Netflix Prize • Project

  26. Ratings Per User • 10% of users rated 16 or fewer movies • 25% rated 36 or fewer • The median is 93 • Two users rated over 17,000 of the 17,700 movies

  27. Effect of Inventory Costs

  28. Netflix Challenge

  29. Agenda • Information Filtering • Classification for Filtering • Netflix Prize • Project

  30. Project Milestone 3 I
     • In this milestone you will be further developing your use cases and refining your plan for evaluating your system.
     • For each of 3 or 4 use cases
       • Manually do a search of the corpus using Lucene
       • Review the search results
       • Explain how you'll automate the search to employ search terms from the user profiles or other sources of information

  31. Project Milestone 3 II
     • Explain how you'll evaluate the performance in general
       • Do the use cases require high precision and/or high recall?
       • How you'll select documents to evaluate the performance
       • How you'll do relevance judgments
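
For the precision and recall question, a small sketch of how the two measures follow from a set of relevance judgments. The function name and the document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: 4 documents shown to the user, 3 judged relevant overall.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p)  # 0.5
```

Two of the four documents shown are relevant (precision 0.5), and two of the three relevant documents were shown (recall 2/3).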

  32. Project Milestone 3 III
     • Summarize and present initial use cases, search results and performance on Nov 9th
       • 10-15 minute presentations, 5-10 slides
     • Each project team member should speak about the part they worked on
       • Talk to me BEFOREHAND about alternate arrangements if you can't
     • Each member should understand what's presented by the other members
       • OK to pass questions to the expert

  33. Remaining Project Milestones
     • Part 3: Go through once manually
       • Present in 2 weeks
     • Part 4: Automate and establish a baseline
       • Discuss at meetings on Mon 11-21 or Tue 11-22
     • Part 5: Iterate and attempt to improve upon baseline
       • Present as part of project reports on 12-7
