CS533 Information Retrieval. Dr. Michal Cutler Lecture #21 April 25, 2000. Information Filtering. The filtering problem The users and the user profiles The relationship between information retrieval and filtering
CS533 Information Retrieval Dr. Michal Cutler Lecture #21 April 25, 2000
Information Filtering • The filtering problem • The users and the user profiles • The relationship between information retrieval and filtering • Implementations (cluster based, information retrieval based, knowledge based, semi-automatic, collaborative)
The filtering problem • The task of a filtering system is to send interesting and useful data items to users • Basic assumptions: • Users must receive information in a timely fashion • There is a large number of users • New data items arrive frequently.
The filtering problem • Other terms used: • Information dissemination systems, • Alerting systems, • Routing systems
The users • Researchers distinguish between proactive and casual users. • A proactive user has specific information needs, which may be expressed in a profile • A casual user may not have specific needs and thus finds it difficult to provide a profile
The users • Proactive users also vary between those who • need high recall - cannot afford to miss a relevant item (army analysts), • need high precision - prefer very few and good items
User profile • Who will create it? • What will it contain? • How much data?
User profile - who will create it? • “Experts” interviewing new users • Users experimenting and updating • What help should be provided to users for generating good profiles? • choosing good terms • making the profile more specific and narrow to enhance precision • making the profile more general to enhance recall
User profile - who will create it? • Semi-automatic • The user supplies one or more seed documents, which the system uses to generate a profile • The user's behavior is monitored and used to create a profile
User profile - who will create it? Automatic • Users may click on some pages and ignore others • Spend more time reading some pages than others • Save/print certain clicked pages • Follow links on clicked pages to reach more pages • This behavior can be used to automatically learn and update a user’s profile
Morita et al. • Monitor user behavior to identify which papers are interesting and which are uninteresting • Time spent reading is used as the evidence • When a user spends more than t seconds reading an article, the system concludes that the user considers the article interesting
Morita et al. • Checked the relation between an article and its follow-up articles, and found that follow-ups are usually interesting if the original article was, and uninteresting if it was not
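The reading-time rule above can be sketched in a few lines. This is a minimal illustration, not the implementation used by Morita et al.: the function name, the log format, and the value chosen for t are all assumptions made for the example.

```python
from typing import Dict


def label_by_reading_time(reading_seconds: Dict[str, float], t: float) -> Dict[str, bool]:
    """Mark an article as interesting when it was read for more than t seconds."""
    return {article: seconds > t for article, seconds in reading_seconds.items()}


# Hypothetical reading log: article id -> seconds spent reading.
log = {"a1": 45.0, "a2": 3.0, "a3": 20.0}
labels = label_by_reading_time(log, t=20.0)
print(labels)
```

With t = 20, only `a1` is labeled interesting; `a3` was read for exactly t seconds, which is not "more than t".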
User profile - what will it contain? • Boolean “queries” and/or natural language descriptions, with or without importance weights • One or more seed documents • Search domain knowledge + text pattern rules with evidence values • Menus
User profile - how much information? • Filtering systems give users some ability to control the amount of information they receive • Number of docs - N • Because the database keeps changing, there may be many good documents at one time and very few at another. • Returning N documents to the user in both cases is not a good solution
User profile - how much information? • A similarity threshold - T • It is not clear how the user should decide on a threshold. • Why 0.5 and not 0.3? • Is a document with similarity 0.48 not important?
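The trade-off between the two controls can be seen on toy data. The sketch below assumes documents have already been scored against a profile; the scores, the document ids, and both function names are made up for illustration.

```python
from typing import Dict, List


def top_n(scores: Dict[str, float], n: int) -> List[str]:
    """Return the n best-scoring documents, however good or bad they are."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


def above_threshold(scores: Dict[str, float], t: float) -> List[str]:
    """Return every document whose similarity is at least t, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [doc for doc in ranked if scores[doc] >= t]


# Hypothetical similarity scores on a day with many good documents and a quiet day.
busy = {"d1": 0.9, "d2": 0.8, "d3": 0.7, "d4": 0.6}
quiet = {"d5": 0.48, "d6": 0.1}

print(top_n(busy, 2), top_n(quiet, 2))
print(above_threshold(busy, 0.5), above_threshold(quiet, 0.5))
```

With N = 2 the busy day loses two good documents and the quiet day delivers a poor one; with T = 0.5 the quiet day delivers nothing, even though one document scored 0.48.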
Comparison with IR • Filtering systems assume repeated use of queries, versus a one time query assumed in IR • Creating a good profile is essential • Since user interests change, profile modification is also very important in filtering
Comparison with IR • The timeliness issue is more important for filtering than for IR • Parallelism can help meet timeliness requirements
Comparison with IR • IR assumes a relatively static database. • Filtering is mainly interested in selecting text from a dynamic data stream
Comparison with IR • IR takes advantage of collection statistics to generate stop words, an indexing vocabulary, and to compute good weights for document and query terms (tf*idf) • These statistics may not be available for filtering systems
Comparison with IR • Filtering systems tend to build the inverted index over the user profiles rather than over the document database
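A minimal sketch of this reversal, assuming each profile is just a set of terms: the index maps a term to the users whose profiles contain it, and each incoming document is used as the lookup key. The function names and sample profiles are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_profile_index(profiles: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each profile term to the set of users whose profile contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for user, terms in profiles.items():
        for term in terms:
            index[term].add(user)
    return index


def candidate_users(index: Dict[str, Set[str]], document_terms: Iterable[str]) -> Set[str]:
    """Return users whose profile shares at least one term with the document."""
    users: Set[str] = set()
    for term in set(document_terms):
        users |= index.get(term, set())
    return users


profiles = {
    "ann": {"filtering", "profile"},
    "bob": {"recipes"},
    "cy": {"profile", "privacy"},
}
index = build_profile_index(profiles)
print(sorted(candidate_users(index, ["user", "profile", "update"])))
```

Only the candidate users then need a full similarity computation, so the work per arriving document depends on the document's vocabulary rather than on the total number of profiles.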
Comparison with IR • Users of filtering systems may not have a specific purpose (entertainment) • Both IR and filtering systems deal with the query/profile vocabulary issues • In filtering some users may not be motivated or able to specify a profile
Types of filtering systems • User profile used for filtering • Profile provided by users • Profile learned automatically or semi automatically from user behavior • User profile and opinions of other users used for filtering
Implementations • Profile provided by users: • Cluster based (NetNews) • IR based (SIFT, Individual) • Knowledge based (Rubric, Topic by Verity)
Implementations • Profile learned from behavior (LSI, Autodesk) • Collaborative filtering and recommendation systems
Cluster based filtering (NetNews) • http://www.switch.ch/netnews/ • News items are classified into categories • A user subscribes to some categories, and from then on receives copies of all new items in them • Millions of users • Users may be interested in a much finer filtering capability
SIFT (Garcia Molina) • Based on WAIS (free software on the Internet) • The database of user profiles is indexed • A profile is a list of (term, importance) pairs plus a relevance threshold
SIFT (Garcia Molina) • Term weights use (0.5 + 0.5 tf / max tf) and idf; similarity is computed as an inner product • The threshold is used to increase efficiency • http://sift.stanford.edu, www.reference.com • Offers two ways to assist users with profile construction
Assistance provided by SIFT for creating a profile • The user can apply a candidate profile against present-day articles, and iteratively refine the profile to force good documents to the top • To help maintain profiles over time, the words which contributed to the selection of an article are highlighted. Users can select additional words which should not appear together with the profile word
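The weighting and matching step can be sketched as follows. This is an illustration of the formula on the slide, not SIFT's code: the idf values are passed in as a plain dictionary (an assumption, since the slides do not say where SIFT obtains them), and all names and numbers are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def document_weights(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Weight each term by (0.5 + 0.5 * tf / max_tf) * idf."""
    if not tokens:
        return {}
    tf = Counter(tokens)
    max_tf = max(tf.values())
    return {
        term: (0.5 + 0.5 * count / max_tf) * idf.get(term, 0.0)
        for term, count in tf.items()
    }


def inner_product(profile: Dict[str, float], doc: Dict[str, float]) -> float:
    """Similarity between a (term, importance) profile and a weighted document."""
    return sum(weight * doc.get(term, 0.0) for term, weight in profile.items())


# Hypothetical idf table, profile, and threshold.
idf = {"filtering": 2.0, "news": 1.0, "profile": 1.5}
profile = {"filtering": 1.0, "profile": 0.5}
threshold = 2.0

weights = document_weights(["filtering", "filtering", "news", "profile"], idf)
score = inner_product(profile, weights)
deliver = score >= threshold
print(weights, score, deliver)
```

Here "filtering" has tf = max tf = 2, so its weight is 1.0 * 2.0 = 2.0; the other two terms get the factor 0.75. The inner product with the profile is 2.0 + 0.5 * 1.125 = 2.5625, which clears the threshold.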
Individual (Commercial) • http://www.individual.com/ • Based on SMART • Domain and SMART experts manage • customer profiles • the company's extensive Topic Library collection.
Filtering stages • Thesaurus • The core SMART engine • The Post Processor.
Thesaurus stage • Adds semantic equivalents of important profile words • Can recognize highly relevant words that may be used infrequently in a story and give them more weight • Thesaurus represents hundreds of thousands of person-hours of data entry and analysis
The core SMART engine • The database is a set of vectors associated with query topics • Each document is sent to SMART as a query. • The similarity of the document to the query topics is computed • Relevance feedback is used to improve new query topics
Post Processing. • Subject specialists add fuzzy Boolean rules to customer profiles • P-Norm is used to compare story to fuzzy Boolean rules in customer’s profile
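For reference, the P-Norm model scores fuzzy OR and AND clauses as sketched below, where each w is a term's weight in the story (between 0 and 1) and each a is the importance of that term in the rule. The slides do not say how Individual configures it, so the value of p, the weights, and the function names here are assumptions for illustration.

```python
from typing import List, Optional


def pnorm_or(weights: List[float], p: float, importance: Optional[List[float]] = None) -> float:
    """P-norm OR: (sum(a^p * w^p) / sum(a^p)) ** (1/p)."""
    a = importance or [1.0] * len(weights)
    num = sum((ai ** p) * (wi ** p) for ai, wi in zip(a, weights))
    den = sum(ai ** p for ai in a)
    return (num / den) ** (1.0 / p)


def pnorm_and(weights: List[float], p: float, importance: Optional[List[float]] = None) -> float:
    """P-norm AND: 1 - (sum(a^p * (1 - w)^p) / sum(a^p)) ** (1/p)."""
    a = importance or [1.0] * len(weights)
    num = sum((ai ** p) * ((1.0 - wi) ** p) for ai, wi in zip(a, weights))
    den = sum(ai ** p for ai in a)
    return 1.0 - (num / den) ** (1.0 / p)


# A story that contains the first rule term fully and the second not at all.
story = [1.0, 0.0]
for p in (1, 2, 50):
    print(p, round(pnorm_or(story, p), 3), round(pnorm_and(story, p), 3))
```

At p = 1 both operators reduce to the average of the weights; as p grows, OR approaches the maximum and AND approaches the minimum, which is the behavior of strict Boolean logic.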
Learning user profile from relevance feedback • Profile is learned from: • A set of old queries and a user’s selection of good documents • A collection of old documents • Profile is generated by using relevance feedback
Learning user profile from relevance feedback • In relevance feedback terms in good and possibly bad documents and old queries are used to generate a new query • In LSI a weighted sum of relevant documents is used as the user profile (expanded query) • The smaller number of concepts used in LSI helps in the feedback process
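One standard way to build a new query from an old query plus good and bad documents is the Rocchio formula. The slides do not name a specific method, so treat this as one possible instance; the coefficient values and the sample vectors are assumptions.

```python
from collections import defaultdict
from typing import Dict, List

Vector = Dict[str, float]


def rocchio(query: Vector, relevant: List[Vector], non_relevant: List[Vector],
            alpha: float = 1.0, beta: float = 0.75, gamma: float = 0.15) -> Vector:
    """new = alpha * query + beta * mean(relevant) - gamma * mean(non_relevant)."""
    new_query: Dict[str, float] = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for docs, coeff in ((relevant, beta), (non_relevant, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                new_query[term] += coeff * weight / len(docs)
    # Terms pushed to zero or below are dropped from the profile.
    return {term: weight for term, weight in new_query.items() if weight > 0}


old_query = {"filter": 1.0}
good_docs = [{"filter": 1.0, "profile": 1.0}]
bad_docs = [{"sports": 1.0}]
new_profile = rocchio(old_query, good_docs, bad_docs, alpha=1.0, beta=0.5, gamma=0.5)
print(new_profile)
```

The term "profile" enters the new query because it occurs in a good document, while "sports" ends with a negative weight and is dropped.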
Recommendation systems • Systems that recommend restaurants, movies, etc. • Here the recommendation systems will recommend: • good documents, • good URLs, or • authors of documents, etc.
Recommendation systems • From (Miller 96): Collaborative filtering systems make use of the reactions and opinions of people who have already seen a piece of information to make predictions about the value of that piece of information for people who have not yet seen it.
Recommendation systems • Collaborative filtering systems often recommend to a user the documents that similar users (e.g., users with similar profiles) liked or found useful. • Likewise, for a query they can recommend documents found useful for similar queries.
Contents of a recommendation • Can be a numeric value assigned by users to rate a document (explicit) • Mention of a person, a URL, or a citation of a document (mining) • Value derived automatically by observing user behavior (monitoring)
Learning interesting documents by monitoring • When many users read, save, or print a document, there is evidence that it is interesting • When many users ignore a document, or click on it and spend only a short time on it, this indicates an uninteresting document
The privacy issue • A lot can be learned about users by observing their behavior • Users may not want other users to know which material they read • Users may not want authors to know who evaluated their work • Some systems allow the use of pseudonyms
The privacy issue • The credibility of a recommendation can be enhanced by including the names of the users who recommended or rejected the material • In this case recommendations are attributed
The use of a recommendation • Some systems display the recommendations alongside articles • Other systems use the recommendations in order to select the documents which will be returned to a user
Aggregation of recommendations • Combining multiple recommendations into a useful measure. • Personalized weighting based on past agreement among recommenders • Personalized weighting combined with content analysis • Count the number of recommenders, or the frequency of mention of URLs or documents
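Personalized weighting by past agreement can be sketched as below. The scheme is a deliberately simple one chosen for illustration (votes are 1 for liked and 0 for disliked, and agreement is the fraction of shared items with the same vote); it is not taken from any particular system, and all names and data are hypothetical.

```python
from typing import Dict, Optional

Votes = Dict[str, int]  # item id -> 1 (liked) or 0 (disliked)


def agreement(mine: Votes, theirs: Votes) -> float:
    """Fraction of commonly rated items on which the two users voted the same."""
    shared = set(mine) & set(theirs)
    if not shared:
        return 0.0
    return sum(mine[item] == theirs[item] for item in shared) / len(shared)


def personalized_score(item: str, mine: Votes, others: Dict[str, Votes]) -> Optional[float]:
    """Average the votes on an item, weighting each recommender by past agreement."""
    num = den = 0.0
    for votes in others.values():
        if item in votes:
            weight = agreement(mine, votes)
            num += weight * votes[item]
            den += weight
    return num / den if den else None


me = {"a": 1, "b": 0}
others = {
    "ann": {"a": 1, "b": 0, "c": 1},  # always agreed with me so far
    "bob": {"a": 0, "b": 1, "c": 0},  # never agreed with me so far
}
print(personalized_score("c", me, others))
```

An unweighted average of the two votes on item "c" would be 0.5. The personalized score is 1.0, because only the recommender with a record of agreement carries weight.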
Collaborative (Tapestry) • First system to use the notion of collaboration for filtering • Developed at Xerox Palo Alto to control volume of email sent to users • Innovation is in the use of user reactions to messages (stored as annotations) for selecting messages for other users
Collaborative (Tapestry) • Messages are stored in a relational database • A user knows that Smith keeps track of documents in some area of interest • The system allows the user to filter on “documents replied to by Smith” • This means that outgoing email messages become part of the selection process • Filtering becomes an iterative process
Collaborative (Tapestry) • A filter can contain some keywords with the added condition of 3 or more endorsements • Users can write ad-hoc queries or filter queries to receive data • A user can ask to use someone else's filter • Uses its own query language, which is similar to first-order logic
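The two kinds of filter mentioned above can be sketched as plain predicates over annotated messages. This is not Tapestry's query language: the `Message` record, its field names, and the sample data are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Message:
    text: str
    endorsed_by: Set[str] = field(default_factory=set)  # users who endorsed it
    replied_by: Set[str] = field(default_factory=set)   # users who replied to it


def keyword_with_endorsements(message: Message, keywords: Set[str],
                              min_endorsements: int = 3) -> bool:
    """Keyword match plus the added condition of enough endorsements."""
    words = set(message.text.lower().split())
    return bool(words & keywords) and len(message.endorsed_by) >= min_endorsements


def replied_to_by(message: Message, user: str) -> bool:
    """True for documents the given user replied to."""
    return user in message.replied_by


inbox = [
    Message("new filtering paper", {"ann", "bob", "cy"}, {"smith"}),
    Message("filtering seminar moved", {"ann"}),
    Message("lunch menu", {"ann", "bob", "cy"}),
]
popular = [m.text for m in inbox if keyword_with_endorsements(m, {"filtering"})]
via_smith = [m.text for m in inbox if replied_to_by(m, "smith")]
print(popular, via_smith)
```

Both predicates depend on annotations left by other users, which is what makes the filtering collaborative rather than purely content based.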
PHOAKS (People Helping One Another Know Stuff) • Recommends URLs. • Mining: Mention of a URL in a news article is used except for: • URLs in headers and quoted sections. • Articles posted to too many newsgroups. • URLs in announcements or ads. • Aggregation: number of distinct recommenders of each URL.
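A rough sketch of the mining and aggregation rules follows. It assumes each article is a dictionary holding only the author, the newsgroup list, and the body text (so header URLs are excluded by construction), treats lines starting with ">" as quoted, and uses an arbitrary cross-posting limit. Detecting announcements or ads is not attempted. None of these details come from PHOAKS itself.

```python
import re
from collections import defaultdict
from typing import Dict, List

URL_PATTERN = re.compile(r"https?://[^\s<>]+")


def mine_recommendations(articles: List[dict], max_newsgroups: int = 3) -> Dict[str, int]:
    """Count distinct authors who mention each URL outside quoted text."""
    recommenders = defaultdict(set)
    for article in articles:
        if len(article["newsgroups"]) > max_newsgroups:
            continue  # posted to too many newsgroups
        for line in article["body"].splitlines():
            if line.lstrip().startswith(">"):
                continue  # quoted section
            for url in URL_PATTERN.findall(line):
                recommenders[url.rstrip(".,;")].add(article["author"])
    return {url: len(people) for url, people in recommenders.items()}


articles = [
    {"author": "ann", "newsgroups": ["rec.food.recipes"],
     "body": "Try http://example.org/bread for a start."},
    {"author": "bob", "newsgroups": ["rec.food.recipes"],
     "body": "> Try http://example.org/bread for a start.\nAgreed, http://example.org/bread is good."},
    {"author": "ann", "newsgroups": ["rec.food.recipes"],
     "body": "Again: http://example.org/bread"},
    {"author": "spam", "newsgroups": ["a", "b", "c", "d"],
     "body": "Visit http://example.org/ad"},
]
counts = mine_recommendations(articles)
print(counts)
```

The URL is counted twice, not four times: the second post by the same author adds no new recommender, the quoted mention is skipped, and the widely cross-posted article is ignored entirely.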
GroupLens • Collaborative filtering for Usenet news • Used for rec.humor, rec.food.recipes, rec.arts.movies.current-films, etc. • Recommendations are both explicit, given as a rating of 1-5, and implicit, derived by monitoring reading time • Recommendations are displayed alongside each article reference
GroupLens • Pseudonyms are used • Selects a group of people to act as personal moderators • The moderators are users with whom you have substantial agreement on past articles • When a user fetches articles from a newsgroup, evaluation predictions are displayed • The user may enter ratings • The ratings serve as input for predicting the value for other users and for correlating the user with other users
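A common formulation of this kind of prediction correlates users on the articles both have rated (Pearson correlation), then adjusts the user's own average rating by the correlation-weighted deviations of other users. The sketch below follows that idea; the data layout, function names, and sample ratings are assumptions for illustration.

```python
from math import sqrt
from typing import Dict, List

Ratings = Dict[str, Dict[str, float]]  # user -> {article: rating on a 1-5 scale}


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def pearson(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Correlation of two users over the articles both have rated."""
    shared = sorted(set(a) & set(b))
    if len(shared) < 2:
        return 0.0
    mean_a = mean([a[i] for i in shared])
    mean_b = mean([b[i] for i in shared])
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in shared)
    var_a = sum((a[i] - mean_a) ** 2 for i in shared)
    var_b = sum((b[i] - mean_b) ** 2 for i in shared)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def predict(user: str, article: str, ratings: Ratings) -> float:
    """User's mean rating plus correlation-weighted deviations of other users."""
    mine = ratings[user]
    base = mean(list(mine.values()))
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or article not in theirs:
            continue
        r = pearson(mine, theirs)
        num += r * (theirs[article] - mean(list(theirs.values())))
        den += abs(r)
    return base + num / den if den else base


ratings = {
    "me": {"a1": 5, "a2": 1, "a3": 4},
    "ann": {"a1": 5, "a2": 2, "a3": 4, "a4": 5},
    "bob": {"a1": 1, "a2": 5, "a3": 2, "a4": 1},
}
prediction = predict("me", "a4", ratings)
print(round(prediction, 2))
```

In this data "ann" agrees with "me" and liked `a4`, while "bob" consistently disagrees and disliked it. Both facts push the prediction above the user's own average, which shows why negatively correlated users are informative too.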