UMass Amherst CS646 Lecture • Personal Search: Retrieval Model and Evaluation • Jinyoung Kim
Outline • Personal Search Overview • Retrieval Models for Personal Search • Evaluation Methods for Personal Search • Associative Browsing Model for Personal Info. • Experimental Results (optional)
Personal Search • What • Searching over a user’s own personal information • Desktop search is the most common form • Why • Personal information has grown over the years, in both amount and heterogeneity • Search can help users access their information • Q : Is it the only option? How about browsing?
Typical Scenarios • I'm looking for an email about my last flight • I want to retrieve everything I've read about the Apple iPad • I need to find a slide I wrote for an IR seminar • Q : Anything else?
Personal Search Example • Query : James Registration
Personal Search Example • User-defined ranking for type-specific results • Can't we do better than this?
Characteristics & Related Problems • People mostly do ‘re-finding’ • Known-item search • Many document types • Federated Search (Distributed IR) • Unique metadata for each type • Semi-structured document retrieval
Research Issues • How can we exploit the document structure (e.g. metadata) for retrieval? • How can we evaluate personal search algorithms overcoming privacy concerns? • What are other methods for personal information access? • e.g. Associative Browsing Model
Design Considerations • Each type has different characteristics • How can we exploit type-specific features? • e.g. email has a thread structure • Knowing the document type the user is looking for will be useful • How can we make this prediction? • Users want to see the combined result • How would you present the result?
Retrieval-Merge Strategy • Type-specific Ranking • Use the most suitable algorithm for each type • Type Prediction • Predict which document type the user is looking for • Combine into the Final Result • Rank list merging
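As a rough illustration of how these three steps might be wired together (the component functions are passed in as placeholders; concrete versions are sketched after the corresponding slides below, and nothing here is the lecture's exact implementation):

```python
# High-level sketch of the retrieval-merge pipeline described on this slide.
# The three components are injected as callables so the skeleton stays
# self-contained; e.g. PRM-S for rank_in_type, FQL for score_type,
# CORI-style merging for merge.

def personal_search(query, collections, rank_in_type, score_type, merge):
    """collections: {doc_type: index}. Returns one merged ranked list."""
    # 1. Type-specific ranking inside each collection
    per_type_rankings = {t: rank_in_type(query, idx) for t, idx in collections.items()}
    # 2. Predict which document type the user is likely looking for
    type_scores = {t: score_type(query, idx) for t, idx in collections.items()}
    # 3. Merge the per-type rank lists into the final result
    return merge(per_type_rankings, type_scores)
```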
Type-specific Ranking • Document-based Retrieval Model • Score each document as a whole • Field-based Retrieval Model • Combine evidence from each field • [Diagram: document-based scoring matches query terms q1..qm against the document as a whole; field-based scoring matches them against fields f1..fn combined with weights w1..wn]
Type-specific Ranking • Document-based Methods • Document Query-likelihood (DQL) • Field-based Methods • Mixture of Field Language Models (MFLM) • wj is trained to maximize retrieval performance • e.g. <subject> : 1 / <content> : 0.5 / ...
Type-specific Ranking • Example • Query : james registration • Document fields : <subject> <content> <to> • Term distribution (counts table omitted) • DQL vs. MFLM: DQL1 = (1+1)/112 × (5+1)/112 vs. DQL2 = 5/112 × 20/112, so DQL1 (0.105) < DQL2 (0.877); MFLM1 = (1/100 + 1/2) × (1/10 + 5/100) vs. MFLM2 = 5/100 × 20/100, so MFLM1 (0.077) > MFLM2 (0.01)
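A toy sketch of the two scoring functions compared above. The field weights and term counts below are made-up illustration values (not the numbers behind the slide), and smoothing is omitted for brevity:

```python
# Toy comparison of document query-likelihood (DQL) vs. a mixture of field
# language models (MFLM). Counts, lengths and weights are illustrative only.

def dql_score(query, doc_term_counts, doc_length):
    """Score the document as one bag of words: product of P(q_i | D)."""
    score = 1.0
    for term in query:
        score *= doc_term_counts.get(term, 0) / doc_length
    return score

def mflm_score(query, field_term_counts, field_lengths, field_weights):
    """For each query term, mix per-field likelihoods with fixed weights w_j."""
    score = 1.0
    for term in query:
        term_prob = 0.0
        for field, counts in field_term_counts.items():
            p_field = counts.get(term, 0) / field_lengths[field]
            term_prob += field_weights[field] * p_field
        score *= term_prob
    return score

if __name__ == "__main__":
    query = ["james", "registration"]
    # Hypothetical email: 'james' appears in <to>, 'registration' in <subject>.
    fields = {"subject": {"registration": 1}, "content": {}, "to": {"james": 1}}
    lengths = {"subject": 4, "content": 100, "to": 2}
    weights = {"subject": 0.5, "content": 0.3, "to": 0.2}
    doc_counts = {"james": 1, "registration": 1}
    print("DQL :", dql_score(query, doc_counts, sum(lengths.values())))
    print("MFLM:", mflm_score(query, fields, lengths, weights))
```

Because the matching terms sit in short fields (<to>, <subject>), the per-field likelihoods are much larger than the whole-document likelihood, which is why MFLM ranks this document higher than DQL does.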
Type-specific Ranking • Probabilistic Retrieval Model for Semi-structured Data (PRM-S) [KXC09] • Basic Idea • Use the probabilistic mapping between query words and document fields for weighting • [Diagram: each query term qi is mapped to fields f1..fn with probability P(Fj|qi)]
Type-specific Ranking • PRM-S Model [KXC09] • Estimate the implicit mapping of each query word to document fields: P(Fj|qi), computed from the collection-level field language models (Fj: field of the collection) • Combine field-level evidence based on the mapping probabilities: score(D, Q) = Πi Σj P(Fj|qi) · P(qi|fj) (fj: field of each document)
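A rough sketch of the PRM-S idea as summarized above: the mapping probability P(Fj|qi) is estimated from collection-level field language models, and per-document field likelihoods are then combined with those weights. This is a simplified reading of [KXC09] (no smoothing, uniform field prior), not the paper's exact formulation:

```python
# Sketch of PRM-S-style scoring.
# collection_field_lms / doc_field_lms: {field_name: {term: probability}}

def mapping_probs(term, collection_field_lms):
    """P(F_j | q_i) proportional to P(q_i | F_j) under a uniform field prior."""
    likelihoods = {f: lm.get(term, 0.0) for f, lm in collection_field_lms.items()}
    total = sum(likelihoods.values())
    if total == 0:                         # term unseen in every field
        n = len(likelihoods)
        return {f: 1.0 / n for f in likelihoods}
    return {f: p / total for f, p in likelihoods.items()}

def prms_score(query, doc_field_lms, collection_field_lms):
    """Product over query terms of sum_j P(F_j|q_i) * P(q_i | f_j of this doc)."""
    score = 1.0
    for term in query:
        p_map = mapping_probs(term, collection_field_lms)
        score *= sum(p_map[f] * doc_field_lms[f].get(term, 0.0) for f in p_map)
    return score
```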
Type-specific Ranking • MFLM vs. PRM-S • [Diagram: MFLM weights each field fj with a fixed, query-independent weight wj; PRM-S weights each field per query term qi with the mapping probability P(Fj|qi)]
Type-specific Ranking • Why does PRM-S work? • A relevant document has query terms in many different fields • PRM-S boosts the field query-likelihood P(q|f) when a query term is found in the ‘correct’ field(s)
Type-specific Ranking • PRM-S Model [KXC09] • Performance in the TREC Email Search Task • W3C mailing list collection • 150 known-item queries • Q : Will it work for other document types? • e.g. webpages and office documents • [Results chart omitted; metric: Mean Reciprocal Rank]
Predicting Document Type • A look at Federated Search (aka Distributed IR) • There are many information silos (resources) • Users want to search over all of them • Three major problems • Resource representation • Resource selection • Result merging
Predicting Document Type • Query-likelihood of Collection (CQL) [Si02] • Get a query-likelihood score from each collection's language model • Treat each collection as one big bag of words • Best performance in a recent evaluation [Thomas09] • Q : Can we exploit the field structure here?
Predicting Document Type • Field-based Collection Query-Likelihood (FQL) [KC10] • Calculate a QL score for each field of a collection • Combine field-level scores into a collection score • Why does it work? • Terms from shorter fields are better represented • e.g. ‘James’ from <to>, ‘registration’ from <subject> • Recall why MFLM worked better than DQL
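A minimal sketch contrasting the two type-prediction scores above. The exact field combination used in [KC10] may differ; here field-level likelihoods are mixed uniformly, which is already enough to let short fields like <to> contribute strongly:

```python
# Sketch of collection query-likelihood (CQL) vs. a field-based variant (FQL).
# collection_lm: {term: prob}; collection_field_lms: {field: {term: prob}}

def cql(query, collection_lm):
    """Collection as one big bag of words: product of P(q_i | C)."""
    score = 1.0
    for term in query:
        score *= collection_lm.get(term, 1e-9)      # tiny floor instead of smoothing
    return score

def fql(query, collection_field_lms):
    """Per-term uniform mixture of field-level collection language models."""
    fields = list(collection_field_lms)
    score = 1.0
    for term in query:
        score *= sum(collection_field_lms[f].get(term, 1e-9) for f in fields) / len(fields)
    return score

def predict_type(query, per_type_field_lms):
    """Rank document types by their FQL score, highest first."""
    return sorted(per_type_field_lms,
                  key=lambda t: fql(query, per_type_field_lms[t]),
                  reverse=True)
```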
Merging into Final Rank List • What we have for each collection • Type-specific ranking • Type score • CORI Algorithm for Merging [Callan95] • Use normalized collection and document scores
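A sketch of CORI-style merging under the setup above: each document's within-collection score is combined with its collection's (type) score after min-max normalization. The weighting constants follow the commonly cited CORI heuristic and may differ from the lecture's exact configuration:

```python
# CORI-style result merging over per-type rank lists.

def minmax_normalize(scores):
    """Min-max normalize a {key: score} dict into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in scores.items()}

def cori_merge(per_type_rankings, type_scores):
    """per_type_rankings: {type: {doc_id: score}}, type_scores: {type: score}."""
    norm_type = minmax_normalize(type_scores)
    merged = []
    for doc_type, ranking in per_type_rankings.items():
        norm_docs = minmax_normalize(ranking)
        for doc_id, d in norm_docs.items():
            # Classic CORI merge heuristic: boost documents from
            # highly scored collections.
            final = (d + 0.4 * d * norm_type[doc_type]) / 1.4
            merged.append((final, doc_type, doc_id))
    return sorted(merged, reverse=True)
```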
Challenges in Personal Search Evaluation • Hard to create a ‘test-collection’ • Each user has different documents and habits • Privacy concerns • People will not donate their documents and queries for research • Q : Can’t we just do some user study?
Problems with User Studies • It's costly • A ‘working’ system has to be implemented • Participants have to use it for a long time • Big barrier for academic researchers • Data is not reusable by third parties • The findings cannot be repeated by others • Q : How can we perform a cheap & repeatable evaluation?
Pseudo-desktop Method [KC09] • Collect documents of reasonable size and variety • Generate queries automatically • Randomly select a target document • Take terms from the document • Validate generated queries with manual queries • Collected by showing each document and asking: • ‘What is the query you might use to find this one?’
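A sketch of the automatic known-item query generation described above: pick a random target document, then sample a few terms from it. The term-selection heuristic here (sampling weighted by term frequency in the target) is an assumption for illustration, not necessarily the one used in [KC09]:

```python
# Automatic known-item query generation for a pseudo-desktop collection.
import random
from collections import Counter

def generate_query(documents, query_length=2, rng=random):
    """documents: list of token lists. Returns (target_index, query_terms)."""
    target_idx = rng.randrange(len(documents))
    counts = Counter(documents[target_idx])
    terms, weights = zip(*counts.items())
    k = min(query_length, len(terms))        # don't ask for more unique terms than exist
    query = []
    while len(query) < k:
        term = rng.choices(terms, weights=weights)[0]   # TF-weighted sampling
        if term not in query:
            query.append(term)
    return target_idx, query
```

The (target document, query) pairs produced this way serve directly as known-item relevance judgments, which is what makes the method cheap and repeatable.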
DocTrack Game [KC10] • Basic Idea • The user is shown a target document • The user is asked to find the document • Score is given based on user’s search result
DocTrack Game [KC10] • Benefits • Participants are motivated to contribute the data • Resulting queries and logs are reusable • Free from privacy concerns • Much cheaper than doing a traditional user study • Limitations • Artificial data & task
Experimental Setting • Pseudo-desktop Collections • Crawl of W3C mailing list & documents • Automatically generated queries • 100 queries / average length 2 • CS Collection • UMass CS department webpages, emails, etc. • Human-formulated queries from the DocTrack game • 984 queries / average length 3.97 • Other details • Mean Reciprocal Rank was used for evaluation
Collection Statistics • Pseudo-desktop Collections • CS Collection • [Statistics table omitted: #Docs (Length) per document type]
Type Prediction Performance • Pseudo-desktop Collections • CS Collection • FQL improves performance over CQL • Combining features improves the performance further • [Results tables omitted; metric: % of queries with correct prediction]
Retrieval Performance • Pseudo-desktop Collections • CS Collection • Best : use the best type-specific retrieval method • Oracle : predict the correct type perfectly • [Results tables omitted; metric: Mean Reciprocal Rank]
Motivation • Keyword search doesn’t always work • Sometimes you don’t have a ‘good’ keyword • Browsing can help here, yet • Hierarchical folder structure is restrictive • You can’t tag ‘all’ your documents • Associative browsing as a solution • Our minds seem to work by association • Let’s use a similar model for personal info!
Building the Model • Concepts are extracted from metadata • e.g. senders and receivers of email • Concept occurrences are found in documents • This gives the links between concepts and documents • We still need to find the links between concepts and between documents • There are many ways to do that • Let’s build a feature-based model whose weights are adjusted by the user’s click feedback
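A minimal sketch of what such a feature-based link model could look like. The feature representation and the update rule (a simple perceptron-style bump for clicked links) are illustrative assumptions, not the exact model of [KBSC10]:

```python
# Feature-based scoring of associative links, with click-feedback updates.

def link_score(features, weights):
    """features, weights: {feature_name: value}. Linear scoring function."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def update_on_click(weights, clicked_features, skipped_features_list, lr=0.1):
    """Nudge weights so the clicked link outranks links that were shown but skipped."""
    for name, value in clicked_features.items():
        weights[name] = weights.get(name, 0.0) + lr * value
    for skipped in skipped_features_list:
        for name, value in skipped.items():
            weights[name] = weights.get(name, 0.0) - lr * value
    return weights
```

Candidate links (e.g. term overlap, shared concepts, temporal proximity) are scored with link_score for display, and each click gradually reshapes the weights toward the associations a particular user actually follows.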
Summary • Retrieval Model • Retrieval-merge strategy works for personal search • Exploiting field structure is helpful both for retrieval and type prediction • Evaluation Method • Evaluation itself is a challenge for personal search • Reasonable evaluation can be done by simulation or game-based user study • Associative Browsing Model • Search can be combined with other interaction models to enable better information access
More Lessons • Modeling the user’s mental process is key to the design of a retrieval model • The ‘mapping’ assumption of the PRM-S model • Language models are useful for many tasks • e.g. Document LM / Field LM / Collection LM / ... • Each domain requires a specialized retrieval model and evaluation method • Search is never a solved problem!
Major References • [KXC09] • A Probabilistic Retrieval Model for Semi-structured Data • Jinyoung Kim, Xiaobing Xue and W. Bruce Croft in ECIR'09 • [KC09] • Retrieval Experiments using Pseudo-Desktop Collections • Jinyoung Kim and W. Bruce Croft in CIKM'09 • [KC10] • Ranking using Multiple Document Types in Desktop Search • Jinyoung Kim and W. Bruce Croft in SIGIR'10 • [KBSC10] • Building a Semantic Representation for Personal Information • Jinyoung Kim, Anton Bakalov, David A. Smith and W. Bruce Croft in CIKM'10
Further References • My webpage • http://www.cs.umass.edu/~jykim • Chapters in [CMS] (Croft, Metzler and Strohman, Search Engines: Information Retrieval in Practice) • Retrieval Models (Ch7) / Evaluation (Ch8) • Chapters in [MRS] (Manning, Raghavan and Schütze, Introduction to Information Retrieval) • XML Retrieval (Ch10) / Language Model (Ch12)