
Search and Ye Shall Find (maybe)


Presentation Transcript


  1. Search and Ye Shall Find (maybe). Seminar on Emergent Information Technology, August 20, 2007. Douglas W. Oard

  2. What Do People Search For? • Searchers often don’t clearly understand • The problem they are trying to solve • What information is needed to solve the problem • How to ask for that information • The query results from a clarification process • Dervin’s “sense making”: Need → Gap → Bridge

  3. Process/System Co-Design

  4. Design Strategies • Foster human-machine synergy • Exploit complementary strengths • Accommodate shared weaknesses • Divide-and-conquer • Divide task into stages with well-defined interfaces • Continue dividing until problems are easily solved • Co-design related components • Iterative process of joint optimization

  5. Human-Machine Synergy • Machines are good at: • Doing simple things accurately and quickly • Scaling to larger collections in sublinear time • People are better at: • Accurately recognizing what they are looking for • Evaluating intangibles such as “quality” • Both are pretty bad at: • Mapping consistently between words and concepts

  6. Supporting the Search Process [Diagram: the search process as a chain of stages. Source Selection (predict, nominate, choose) → Query Formulation → Query → Search (inside the IR system) → Ranked List → Selection → Document → Examination → Document → Delivery, with feedback loops for Query Reformulation and Relevance Feedback and for Source Reselection.]

  7. Supporting the Search Process [Diagram: the same chain — Source Selection → Query Formulation → Query → Search → Ranked List → Selection → Examination → Delivery — with the system side added: Document Acquisition → Collection → Indexing → Index, which the Search stage consults inside the IR system.]

  8. Taylor’s Model of Question Formation [Diagram: Q1 Visceral Need → Q2 Conscious Need → Q3 Formalized Need → Q4 Compromised Need (the query); the diagram contrasts end-user search with intermediated search, which works from the formalized need.]

  9. Search Component Model [Diagram: an information need leads to query formulation, producing a query; query processing applies a representation function to yield a query representation. In parallel, a document passes through document processing and its representation function to yield a document representation. A comparison function matches the two representations and produces a retrieval status value; human judgment of utility closes the loop.]

  10. Relevance • Relevance relates a topic and a document • Duplicates are equally relevant, by definition • Constant over time and across users • Pertinence relates a task and a document • Accounts for quality, complexity, language, … • Utility relates a user and a document • Accounts for prior knowledge

  11. “Bag of Terms” Representation • Bag = a “set” that can contain duplicates • “The quick brown fox jumped over the lazy dog’s back” → {back, brown, dog, fox, jump, lazy, over, quick, the, the} • Vector = values recorded in any consistent order • {back, brown, dog, fox, jump, lazy, over, quick, the, the} → [1 1 1 1 1 1 1 1 2]
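
  A minimal Python sketch of this representation, assuming simple whitespace tokenization (the slide’s example also stems words such as “jumped” → “jump”, which this sketch omits):

    from collections import Counter

    def bag_of_terms(text):
        # Lowercase, split on whitespace, strip surrounding punctuation.
        # (A real system would also stem, e.g. "jumped" -> "jump".)
        terms = (t.strip(".,!?\"'").lower() for t in text.split())
        return Counter(t for t in terms if t)

    bag = bag_of_terms("The quick brown fox jumped over the lazy dog's back")
    # bag["the"] == 2; every other surviving term has count 1

    # The vector is just the counts recorded in a consistent
    # (here alphabetical) order.
    vector = [bag[t] for t in sorted(bag)]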

  12. Bag of Terms Example
  Document 1: “The quick brown fox jumped over the lazy dog’s back.”
  Document 2: “Now is the time for all good men to come to the aid of their party.”
  Stopword list: for, is, of, the, to

  Term     Document 1   Document 2
  aid          0            1
  all          0            1
  back         1            0
  brown        1            0
  come         0            1
  dog          1            0
  fox          1            0
  good         0            1
  jump         1            0
  lazy         1            0
  men          0            1
  now          0            1
  over         1            0
  party        0            1
  quick        1            0
  their        0            1
  time         0            1

  13. Advantages of Ranked Retrieval • Closer to the way people think • Some documents are better than others • Enriches browsing behavior • Decide how far down the list to go as you read it • Allows more flexible queries • Long and short queries can produce useful results

  14. Counting Terms • Terms tell us about documents • If “rabbit” appears a lot, it may be about rabbits • Documents tell us about terms • “the” is in every document -- not discriminating • Documents are most likely described well by rare terms that occur in them frequently • Higher “term frequency” is stronger evidence • Low “document frequency” makes it stronger still
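
  As a concrete sketch in Python (the toy collection is hypothetical): term frequency is a count within one document, document frequency a count across the collection:

    from collections import Counter

    # Hypothetical toy collection: doc id -> list of already-tokenized terms.
    docs = {
        1: ["quick", "brown", "fox", "jump", "fox"],
        2: ["now", "time", "all", "good", "men"],
    }

    # Term frequency: how often each term occurs within one document.
    tf = {doc_id: Counter(terms) for doc_id, terms in docs.items()}
    # tf[1]["fox"] == 2  -- stronger evidence that document 1 is about foxes

    # Document frequency: in how many documents a term occurs at all.
    df = Counter(term for terms in docs.values() for term in set(terms))
    # df["fox"] == 1  -- a rare term, so matches on it are more discriminating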

  15. Document Length Normalization • Long documents have an unfair advantage • They use a lot of terms • So they get more matches than short documents • And they use the same words repeatedly • So they have much higher term frequencies • Normalization seeks to remove these effects

  16. “Okapi” Term Weights [Formula figure, labeled with a TF component and an IDF component.]
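
  The formula itself was an image on the slide. A standard form of the Okapi BM25 term weight, consistent with the slide’s TF and IDF labels, is:

    w_{i,j} = \frac{tf_{i,j}}{k_1\left((1-b) + b\,\frac{dl_j}{avdl}\right) + tf_{i,j}}
              \;\times\;
              \log\frac{N - df_i + 0.5}{df_i + 0.5}

  Here tf_{i,j} is the frequency of term i in document j, df_i the number of documents containing term i, N the collection size, dl_j the document length, avdl the average document length, and k_1, b are tuning constants. The first factor is the TF component (which also applies the length normalization of slide 15); the second is the IDF component.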

  17. Summary • Goal: find documents most similar to the query • Compute normalized document term weights • Some combination of TF, DF, and Length • Optionally, get query term weights from the user • Estimate of term importance • Compute inner product of query and doc vectors • Multiply corresponding elements and then add
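
  A minimal sketch of that recipe in Python, with term-weight vectors held as dictionaries keyed by term (the weights below are illustrative):

    def inner_product(query_weights, doc_weights):
        # Multiply corresponding elements (shared terms) and add.
        return sum(w * doc_weights[t]
                   for t, w in query_weights.items() if t in doc_weights)

    query = {"fox": 1.0, "dog": 1.0}             # query term weights
    doc = {"fox": 0.4, "lazy": 0.3, "dog": 0.2}  # normalized document weights
    score = inner_product(query, doc)            # 1.0*0.4 + 1.0*0.2 = 0.6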

  18. Some Questions for Indexing • How long will it take to find a document? • Is there any work we can do in advance? • If so, how long will that take? • How big a computer will I need? • How much disk space? How much RAM? • What if more documents arrive? • How much of the advance work must be repeated? • Will searching become slower? • How much more disk space will be needed?

  19. The Indexing Process
  Term     Doc1 Doc2 Doc3 Doc4 Doc5 Doc6 Doc7 Doc8   Postings
  aid       0    0    0    1    0    0    0    1     4, 8
  all       0    1    0    1    0    1    0    0     2, 4, 6
  back      1    0    1    0    0    0    1    0     1, 3, 7
  brown     1    0    1    0    1    0    1    0     1, 3, 5, 7
  come      0    1    0    1    0    1    0    1     2, 4, 6, 8
  dog       0    0    1    0    1    0    0    0     3, 5
  fox       0    0    1    0    1    0    1    0     3, 5, 7
  good      0    1    0    1    0    1    0    1     2, 4, 6, 8
  jump      0    0    1    0    0    0    0    0     3
  lazy      1    0    1    0    1    0    1    0     1, 3, 5, 7
  men       0    1    0    1    0    0    0    1     2, 4, 8
  now       0    1    0    0    0    1    0    1     2, 6, 8
  over      1    0    1    0    1    0    1    1     1, 3, 5, 7, 8
  party     0    0    0    0    0    1    0    1     6, 8
  quick     1    0    1    0    0    0    0    0     1, 3
  their     1    0    0    0    1    0    1    0     1, 5, 7
  time      0    1    0    1    0    1    0    0     2, 4, 6
  [The inverted file organizes these terms under letter-prefix labels (A, AI, AL, B, BA, BR, C, D, F, G, J, L, M, N, O, P, Q, T, TH, TI) for lookup.]

  20. The Finished Product
  Term     Postings
  aid      4, 8
  all      2, 4, 6
  back     1, 3, 7
  brown    1, 3, 5, 7
  come     2, 4, 6, 8
  dog      3, 5
  fox      3, 5, 7
  good     2, 4, 6, 8
  jump     3
  lazy     1, 3, 5, 7
  men      2, 4, 8
  now      2, 6, 8
  over     1, 3, 5, 7, 8
  party    6, 8
  quick    1, 3
  their    1, 5, 7
  time     2, 4, 6

  21. Building an Inverted Index • Simplest solution is a single sorted array • Fast lookup using binary search • But sorting large files on disk is very slow • And adding one document means starting over • Tree structures allow easy insertion • But the worst case lookup time is linear • Balanced trees provide the best of both • Fast lookup and easy insertion • But they require 45% more disk space
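
  A minimal Python sketch of index construction, using a hash table for term lookup rather than the sorted array or balanced tree discussed above; note that adding a document only extends the affected postings lists:

    from collections import defaultdict

    def build_inverted_index(docs):
        # Map each term to a sorted postings list of document ids.
        index = defaultdict(set)
        for doc_id, terms in docs.items():
            for term in terms:
                index[term].add(doc_id)
        return {term: sorted(ids) for term, ids in index.items()}

    docs = {1: ["quick", "brown", "fox"],
            3: ["lazy", "dog", "fox"]}
    index = build_inverted_index(docs)
    # index["fox"] == [1, 3]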

  22. (Uncompressed) Index Size • Very compact for Boolean retrieval • About 10% of the size of the documents • If an aggressive stopword list is used • Not much larger for ranked retrieval • Perhaps 20% • Enormous for proximity operators • Sometimes larger than the documents!

  23. Index Compression • CPUs are much faster than disks • A disk can transfer 1,000 bytes in ~20 ms • The CPU can do ~10 million instructions in that time • Compressing the postings file is a big win • Trades decompression time for fewer disk reads • Key idea: reduce redundancy • Trick 1: store relative offsets (some will be the same) • Trick 2: use an optimal coding scheme
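
  A Python sketch of both tricks: convert the postings list to gaps (relative offsets), then encode the gaps with a variable-byte code. Variable-byte is one simple byte-aligned scheme chosen here for illustration; the slide’s “optimal coding scheme” could instead be, e.g., a gamma code.

    def vbyte_encode(gaps):
        # Variable-byte code: 7 data bits per byte; the high bit
        # marks the final byte of each encoded number.
        out = bytearray()
        for n in gaps:
            chunk = []
            while True:
                chunk.append(n & 0x7F)
                n >>= 7
                if n == 0:
                    break
            chunk[0] |= 0x80  # flag the lowest-order byte as the last one
            out.extend(reversed(chunk))
        return bytes(out)

    postings = [1, 3, 5, 7, 8]
    # Trick 1: store relative offsets -- small, repetitive numbers.
    gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
    # gaps == [1, 2, 2, 2, 1]
    encoded = vbyte_encode(gaps)  # one byte per gap here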

  24. Problems with “Free Text” Search • Homonymy • Terms may have many unrelated meanings • Polysemy (related meanings) is less of a problem • Synonymy • Many ways of saying (nearly) the same thing • Anaphora • Alternate ways of referring to the same thing

  25. Two Ways of Searching [Diagram: the author writes the document using terms to convey meaning. A free-text searcher constructs a query from terms that may appear in documents, and query terms are matched against document terms (content-based query-document matching). An indexer chooses appropriate concept descriptors, a controlled vocabulary searcher constructs a query from the available concept descriptors, and query descriptors are matched against document descriptors (metadata-based query-document matching). Either path yields a retrieval status value.]

  26. Supervised Learning [Diagram: labelled training examples — feature vectors (v1 v2 v3 v4 … vN) and (w1 w2 w3 w4 … wN) over features f1 f2 f3 f4 … fN, with class labels Cv and Cw — feed a Learner, which produces a Classifier; a new example (x1 x2 x3 x4 … xN) is assigned a class Cx by the classifier.]

  27. Example: kNN Classifier
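
  The slide’s figure is not reproduced; a minimal kNN sketch in Python, following the supervised-learning setup of slide 26 (labelled feature vectors in, predicted class out — the training data here is made up):

    import math
    from collections import Counter

    def knn_classify(train, x, k=3):
        # train: list of (feature_vector, label) pairs.
        # Return the majority label among the k nearest neighbors of x.
        nearest = sorted(train, key=lambda ex: math.dist(ex[0], x))[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    train = [((0.0, 0.0), "Cv"), ((0.1, 0.2), "Cv"),
             ((1.0, 1.0), "Cw"), ((0.9, 1.1), "Cw")]
    print(knn_classify(train, (0.2, 0.1)))  # -> "Cv"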

  28. Problems with Controlled Vocabulary • New concepts • Users and indexers may think differently • Using thesauri effectively requires training

  29. Index Spam • Goal: Manipulate rankings of an IR system • Multiple strategies: • Create bogus user-assigned metadata • Add invisible text (font in background color, …) • Alter your text to include desired query terms • “Link exchanges” create links to your page

  30. Some Observable Behaviors [Table of observable user behaviors; the figure is not reproduced.]

  31. Behavior Category [The same table, with the behaviors grouped by behavior category.]

  32. Minimum Scope Behavior Category [The table extended with a minimum-scope dimension alongside the behavior categories.]

  33. Some Examples • Read/Ignored, Saved/Deleted, Replied to • (Stevens, 1993) • Reading time • (Morita & Shinoda, 1994; Konstan et al., 1997) • Hypertext Link • (Brin & Page, 1998)

  34. Estimating Authority from Links [Diagram: a hub page links to multiple authority pages.]
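
  The hub/authority vocabulary suggests Kleinberg’s HITS algorithm; a minimal sketch of its iteration, assuming a hypothetical link graph:

    def hits(links, iterations=20):
        # links: page -> list of pages it links to.
        # Authority score: sum of hub scores of pages pointing in.
        # Hub score: sum of authority scores of pages pointed to.
        pages = set(links) | {q for targets in links.values() for q in targets}
        hub = dict.fromkeys(pages, 1.0)
        auth = dict.fromkeys(pages, 1.0)
        for _ in range(iterations):
            auth = {p: sum(hub[q] for q in links if p in links[q]) for p in pages}
            hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
            # Normalize so the scores stay bounded.
            for scores in (auth, hub):
                norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
                for p in scores:
                    scores[p] /= norm
        return hub, auth

    hub, auth = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
    # a1 is pointed to by both hubs, so it gets the highest authority score.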

  35. Collecting Click Streams • Browsing histories are easily captured • Make all links initially point to a central site • Encode the desired URL as a parameter • Build a time-annotated transition graph for each user • Cookies identify users (when they use the same machine) • Redirect the browser to the desired page • Reading time is correlated with interest • Can be used to build individual profiles • Used to target advertising by doubleclick.com
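
  A minimal sketch of such a redirect logger using only the Python standard library; the ?url= parameter name, port, and log format are illustrative assumptions, not the system described on the slide:

    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs

    class ClickLogger(BaseHTTPRequestHandler):
        def do_GET(self):
            # The desired URL is encoded as a parameter on the central site.
            params = parse_qs(urlparse(self.path).query)
            dest = params.get("url", ["/"])[0]
            # A cookie identifies the user (when they use the same machine).
            user = self.headers.get("Cookie", "anonymous")
            # One time-stamped edge of the user's transition graph.
            print(f"{time.time()}\t{user}\t{dest}")
            # Redirect the browser to the desired page.
            self.send_response(302)
            self.send_header("Location", dest)
            self.end_headers()

    # HTTPServer(("", 8080), ClickLogger).serve_forever()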

  36. [Bar chart: reading time in seconds (y-axis, 0–180) by interest rating (No Interest, Low Interest, Moderate Interest, High Interest) for full-text telecommunications articles; the plotted values are 32, 43, 50, and 58 seconds, rising with interest.]

  37. Problems with Observed Behavior • Protecting privacy • What absolute assurances can we provide? • How can we make remaining risks understood? • Scalable rating servers • Is a fully distributed architecture practical? • Non-cooperative users • How can the effect of spamming be limited?

  38. Putting It All Together
