WIRED Week 3 • Key Concepts in IR • Mozilla & Firefox • Projects & Papers
Key Concepts in IR • Understanding the System • Can’t read users’ minds • Can’t know “about” documents • Evaluation is key • Information Needs • “More like this” • Starting points, guides • Topics, Subjects • Documents • Images • Text, Natural Languages • A query as a text • Not just (simple) question answering
Aboutness & Subject Indexing • What is “aboutness”? • Meaning of a document • Abstract or Topic(s) of a document • How you (or someone else) use the document • Kinds of questions the document can answer • How can we uncover aboutness? • Author(s), Time, Date, Location, Format • Relationships, Structures, Markup, Metadata • Use, Recall, Popularity
Subtleties of Aboutness • Can we characterize a whole document? • Parts of a document • Each part, different descriptions (& uses) • Do you need the document if you’ve got a good summary? • Not just text summaries • Use & origination data • How do you extract key information? • Understand the context • Frequency & Rarity • NLP, Genres, Keyword indicators • Sentence diagrams to the extreme? • Novelty of information, expectations for education • Politics of description
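The “Frequency & Rarity” bullet can be made concrete with a TF-IDF-style weight: a term counts for more when it is frequent inside one document and rare across the collection. The sketch below is one common way to combine the two signals, not necessarily the method the lecture has in mind; the function name `tf_idf` and the sample documents are illustrative assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Weight = (term frequency in the document) * log(N / document frequency).
    Tokenization is a naive whitespace split (an assumption for brevity).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical three-document collection.
docs = [
    "the thesaurus lists related terms",
    "the index lists documents",
    "the query matches the index",
]
weights = tf_idf(docs)
```

A term such as “the” that appears in every document gets weight zero, while “thesaurus”, which appears in only one document, outweighs “lists”, which appears in two.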
Aboutness & the Web • Rapid & broad analysis • Let users define aboutness • Different users = more descriptions • Lots of users, lots to select from • A system to average & rank aboutness descriptions? • World Wide - means different cultures • More older documents with many more very new documents • Differences and “it’s like that one” • Internal consistency vs. flexibility & context
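One simple reading of “a system to average & rank aboutness descriptions” is to pool the tags that many users assign to the same document and rank each tag by the share of users who chose it. This is a minimal sketch under that assumption; `rank_descriptions` and the sample tags are hypothetical.

```python
from collections import Counter

def rank_descriptions(user_tags):
    """Rank tags by the share of users who applied them to one document.

    user_tags: one list of tags per user. Tags are compared case-insensitively,
    and a tag repeated by the same user is counted once.
    """
    n_users = len(user_tags)
    counts = Counter()
    for tags in user_tags:
        counts.update({tag.lower() for tag in tags})
    return sorted(
        ((tag, count / n_users) for tag, count in counts.items()),
        key=lambda pair: (-pair[1], pair[0]),  # highest share first, then A-Z
    )

# Hypothetical descriptions of one document from three users.
user_tags = [
    ["search", "indexing"],
    ["Search", "thesaurus"],
    ["search", "indexing", "web"],
]
ranking = rank_descriptions(user_tags)
```

Here “search” ranks first because all three users applied it; tags used by a single user fall to the bottom, which is where culturally specific or idiosyncratic descriptions would also land.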
Testing Index Language Devices • What are the different ways to represent documents? • Systems are faster, but designs differ • Can you represent them in more than one way? • At once? • By audience? • Not just terms, but relationships between terms • What language do you use to represent docs? • Structured & Flexible • Consistent & Understandable (human & computer) • Dewey, LoC, Dublin Core • Data structures, XML, Situational-Temporal • What if you indexed documents by terms & queries? • Can you get too complex? • Good for the user vs. good for the system
Indexers & Issues • Staff for evaluation • How is the system used? • Card catalogs • Search engine results pages • Natural language queries & NL answers • Vocabulary of document, index or user impacts? • Syntactic indexing • “use of headings which display the relationship between the various elements, as distinct from those which merely show existence of several attributes relevant to the subject indexed.” p 98
Preparation of an Index • Assess document subject • Related to users • Concepts & keywords • Translate assessment into index language • Add to index • Make concept analysis for answering questions • Will users understand & find document • How helpful (ranking) • Match concepts to index (to document) • Rebuild & enable updated index
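The preparation steps above (assess the subject, translate into the index language, add to the index, match concepts back to documents) can be sketched as a small inverted index. The controlled vocabulary `INDEX_LANGUAGE`, the helper names, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical controlled vocabulary: free-text keyword -> index-language term.
INDEX_LANGUAGE = {
    "car": "automobiles",
    "cars": "automobiles",
    "autos": "automobiles",
    "engine": "engines",
    "engines": "engines",
}

def assess(text):
    """Step 1: pick out the keywords the index language recognizes."""
    return [word for word in text.lower().split() if word in INDEX_LANGUAGE]

def translate(keywords):
    """Step 2: translate the assessment into index-language terms."""
    return {INDEX_LANGUAGE[word] for word in keywords}

def build_index(documents):
    """Step 3: add each document to the index under its terms."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in translate(assess(text)):
            index[term].add(doc_id)
    return index

def lookup(index, query):
    """Step 4: match query concepts to the index, requiring every term."""
    matches = [index.get(term, set()) for term in translate(assess(query))]
    return set.intersection(*matches) if matches else set()

documents = {
    "d1": "Repairing car engines",
    "d2": "Used cars for sale",
    "d3": "Thesaurus construction",
}
index = build_index(documents)
```

With this sample data, `lookup(index, "autos")` finds both `d1` and `d2` even though neither document contains the word “autos”, because query and documents pass through the same index language. Adding a document means rebuilding or updating `index`, the last step on the slide.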
Index Language parts • Controlled vocabulary (p 99) • Specific terms for relevance (p100) • Measuring for performance • Precision • Recall • (Relevance) • With the Web, we don’t know how many total documents for a subject or how many are correct • With the Web, we don’t know how documents are described or indexed • Metatags • Keywords • Indexing databases • Crawling & updating
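Precision and recall can be computed directly once the set of relevant documents is known. The sketch below uses the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant); the document IDs are made up.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    Both arguments are collections of document IDs. An empty denominator
    yields 0.0 rather than raising an error.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result list and relevance judgments.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
precision, recall = precision_recall(retrieved, relevant)
```

Two of the four retrieved documents are relevant, so precision is 0.5; two of the three relevant documents were found, so recall is 2/3. The recall figure depends on knowing the full `relevant` set, which is exactly what the slide says is unavailable on the Web.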
Thesaurus • “Theory of Clumps” • Treasury of words • How deep are the relationships? • Can relationships & relevance be measured? • How specific can one be? • Not just alphabetical, topical • Purposes of a Thesaurus (p 112) • Which are most important? • What’s missing?
Variety of Thesaurus formats • Roget’s • Alphabetical with cross indexing • Subject categories (as numbers) • Ordering • Sub-ordering • Relationships • Language issues, syntax & completeness (phrases) • Shifted, inverted & rotated • “complications -- IR systems”
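One reading of the “shifted, inverted & rotated” formats is a rotated (permuted) index, where a multi-word term is filed once under each of its words so it can be found from any of them. The sketch below assumes simple cyclic rotation; the function names and sample terms are illustrative, and printed thesauri differ in the exact layout they use.

```python
def rotations(phrase):
    """Return every cyclic rotation of a multi-word term, one per leading word."""
    words = phrase.split()
    return [" ".join(words[i:] + words[:i]) for i in range(len(words))]

def rotated_index(terms):
    """Build alphabetized (rotated entry, original term) pairs."""
    entries = []
    for term in terms:
        for rotation in rotations(term):
            entries.append((rotation.lower(), term))
    return sorted(entries)

# Hypothetical thesaurus terms.
entries = rotated_index(["Subject indexing", "Information retrieval"])
```

Each two-word term yields two entries, so “Subject indexing” can be reached under either S or I. The cost is an index that grows with the number of words per term, one of the “complications” for IR systems.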
Terms • Number of terms • Singular, plural • Phrases, quotes, cliches • Descriptive, contextual • Symbols • Homographs & Thesaurofacets • Just a few ways to impose formats & structure • What are some other methods?
Layouts & Display of Thesauri • Most dynamic area • Making it easier to build thesauri • Get whole or specific picture • Expose structure to users • For understanding • For approval • Graphical displays • Browsing • Trees, Flowcharts, Maps • Colors, shapes, sizes
Revising, Adding & Relations • Most issues in reading minor in systems now • New problems in issues of scale • Generate new vs. add to existing? • Where do the experts fit in? • Building a set of rules • Beyond formats • Testing for internal consistency • How do you link or merge two thesauri? • Little merges into larger? • More detailed encompasses less? • Can you ever get agreement?
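The questions about merging thesauri and testing internal consistency can be explored with a small sketch. It assumes each thesaurus maps a term to sets of broader terms (`BT`) and narrower terms (`NT`), merges by taking the union of relationships, and then checks that every broader-term link has a reciprocal narrower-term link. The data layout, function names, and sample terms are assumptions.

```python
def merge_thesauri(larger, smaller):
    """Union two thesauri; each maps term -> {"BT": set, "NT": set}."""
    merged = {}
    for thesaurus in (larger, smaller):
        for term, relations in thesaurus.items():
            entry = merged.setdefault(term, {"BT": set(), "NT": set()})
            for kind in ("BT", "NT"):
                entry[kind] |= relations.get(kind, set())
    return merged

def inconsistencies(thesaurus):
    """List (term, broader) pairs whose broader term lacks the reciprocal NT link."""
    problems = []
    for term, relations in thesaurus.items():
        for broader in relations["BT"]:
            if term not in thesaurus.get(broader, {}).get("NT", set()):
                problems.append((term, broader))
    return sorted(problems)

# Hypothetical general thesaurus and a more detailed one.
general = {
    "vehicles": {"BT": set(), "NT": {"cars"}},
    "cars": {"BT": {"vehicles"}, "NT": set()},
}
detailed = {
    "cars": {"BT": {"vehicles"}, "NT": {"sedans"}},
    "sedans": {"BT": {"cars"}, "NT": set()},
    "trucks": {"BT": {"vehicles"}, "NT": set()},
}
merged = merge_thesauri(general, detailed)
problems = inconsistencies(merged)
```

The merge itself is mechanical, but the consistency check flags “trucks”: the detailed thesaurus places it under “vehicles”, while the general one never lists it as a narrower term. Deciding how to resolve such conflicts is where the experts and the agreed set of rules come in.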
Problem Structures & HCI • A call to make IR systems more usable • Let users search systems themselves • Make systems work more like users think they should (for what year?) • Is a search like a dialogue? • Person to person • Person to machine • Multiple questions & answers to get to the point • Understanding language & behavior • “Do what I mean, not what I say” • Identifying the problem • Focusing the question (related to the available documents) • User familiarity with system
Interaction, step 1 for Evaluation • Benchmarks for evaluation • How would a person ask this question? • What kind of answers are received? • How are subtle expectations met? • How long or comprehensive is the question or the answer? • How is this different for Web IR? • What advantages do both physical & virtual search systems have?
Relevance: Review & Framework • Finding the needle in a haystack • A few documents in a collection • Possible that no documents are perfectly relevant • Not just a content match • Dependent on the user & situation
Relevance & the system • Relevance as a point of measurement • Different fields gauge relevance differently • Scientific communication • Communication (Theory) • Psychology • Information Systems • False Drops vs. Completeness • Rarity & value of information • Precision & Recall probabilities of finding relevance • Tests were numerical, binary & structured
Relevance is “no good”? • Very hard to define, should be ignored? • Too human centered • A gradual process moving towards the correct information • Cooper & Utility • Quality, novelty, importance, credibility • Wilson’s Situational Relevance • Psychological & Logical relevance • Matching vs. Satisfying • Situational • “Relevance numbers”
Relevance Future Work • Knowledge and (the) knower • Selection • Inference • Mapping • Dynamics • Association • Redundancy • p161
How can (Web) IR be better? • Better IR models • Better User Interfaces • More to find vs. easier to find • Scriptable applications • New interfaces for applications • New datasets for applications • Projects and/or Papers Overview