Information Access I Interactive Information Search. GSLT, Göteborg, October 2003. Barbara Gawronska, Högskolan i Skövde. 2nd intensive week: Interactivity (Th 8-12 BG, 13-15 MM) Multilingual systems and resources (Fr 8-10 MM, 10-12 BG) Evaluation (Fr 13-15 BG).
Some repetition: Data Retrieval vs. IR (2) (the German IR Research Group) IR systems have to handle ”uncertain knowledge” (”unsicheres Wissen”): • Vague queries; reformulation frequently required • The problem of the user’s own understanding of his/her information need • Limitations of knowledge representations This implies a need for interaction.
A document from different perspectives (Meghini et al. 91, modified)
How to diagnose the need for interaction refinement? • User studies (still too sparse): • Users in contact with existing systems: • Free task choice • Predefined tasks • Wizard-of-Oz experiments • Relevance feedback (”real” and ”pseudo”)
Wizard-of-Oz experiments (Dahlbäck, Jönsson...) • Users tend to spontaneously produce a kind of ”controlled” language: • written-language syntax (complete sentences, ellipsis avoided) • ”repairs” not frequent • pronominal anaphora less frequent than in human-human communication
Wizard-of-Oz experiments (3) • ”Controlled” language in users (3) • A psycholinguistic reflection: it is not unlike ”baby-talk” (i.e. the way of talking to young children or to unskilled/unidiomatic speakers of a language) • This can make human-computer NLP dialogue a less complicated task than e.g. translating human-human dialogue • There seem to be age-related differences in the way of interacting with computer systems
But: • If the system makes an impression of being too smart, the user normally becomes more natural in his/her linguistic behaviour, which causes problems for the system... Should the system’s responses remain a little ”stupid”???
Now, back from wizards to existing systems. Let’s think about IR-models again.
Information request level: Common Problems: • Spelling errors (recall Hercules’ lecture) • Connector interpretation: natural-language conjunctions vs. logical connectors; conjunction symbols in IR systems may be ambiguous: ”Food for cats and dogs”
Information request level (2) • Negation (examples inspired by Fuhr 1995): ”Drugs and sedatives without relation to aging” ”Drugs and sedatives, not related to aging” ”Drugs and sedatives, no aging” ”Drugs and sedatives, not age”
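The negation readings above retrieve different document sets. A toy sketch (invented documents and a naive word-match index, not from the original lecture) makes the difference concrete:

```python
# Invented mini-collection; a real system would use an inverted index.
docs = {
    1: "drugs and sedatives for young adults",
    2: "sedative drugs and the aging process",
    3: "sedatives in intensive care",
    4: "drugs sedatives dosage guidelines",
}

def postings(term):
    """Doc ids whose text contains the exact word (no stemming)."""
    return {d for d, text in docs.items() if term in text.split()}

# Reading 1: drugs AND sedatives AND NOT aging
strict = (postings("drugs") & postings("sedatives")) - postings("aging")
# Reading 2: (drugs OR sedatives) AND NOT aging -- NL "and" read as union
loose = (postings("drugs") | postings("sedatives")) - postings("aging")

print(sorted(strict))  # [1, 4]
print(sorted(loose))   # [1, 3, 4]
```

Document 3 is retrieved only under the second reading, so the two interpretations of the same natural-language query are observably different to the user.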
Information request (3) • What kind of feedback would be useful on this level? Feedback, definition (Meadow et al. 2000: 246, McGraw-Hill 1971): Feedback = information derived from the output of a process and used to control the process in the future
Possible feedback format on the information request level (?) • Predicate logic? for(food, cat) & for(food, dog), or for(food, cat) ∨ for(food, dog), or for(food, cat) & dog • Generate NLP questions? • Leave everything to the user? Or? • How to present the feedback? Menu choice?
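As a minimal illustration (invented product data, not from the lecture) of how the first two predicate-logic readings of ”food for cats and dogs” yield different answer sets:

```python
# Hypothetical products, each tagged with the species it is food for.
products = {
    "A": {"cat"},          # cat food only
    "B": {"dog"},          # dog food only
    "C": {"cat", "dog"},   # food for both species
}

# for(food,cat) & for(food,dog): products suitable for both species
conj = {p for p, species in products.items() if {"cat", "dog"} <= species}
# for(food,cat) v for(food,dog): products suitable for either species
disj = {p for p, species in products.items() if species & {"cat", "dog"}}

print(sorted(conj))  # ['C']
print(sorted(disj))  # ['A', 'B', 'C']
```

Presenting both candidate sets (or the paraphrased readings) back to the user is one possible feedback format; which reading the user intended cannot be decided from the query string alone.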
Between information request level and formal query level Meadow et al. 2000: 179ff: examples from Dialog: • SSELECT CAT interpreted as: SS (=SELECT SETS) CAT • SELECTiON (wrongly used instead of the standard command SELECT) interpreted as: S (=SELECT) ION What kind of feedback would be useful on this level?
Between the information request/formal query level and database objects If the request/query is ambiguous: • Give some feedback and try to resolve the ambiguity before searching the database, or after the search, before presenting the documents (”Delayed disambiguation”) ? • What search stage is most suitable for feedback/dialog? What factors should be taken into account?
Search stages, or ”states” of searchers (Penniman & Dominick 1980, Chapman 1981) • Database selection • Exploration of individual terms (looking up terms in a thesaurus or an inverted file in order to decide which terms are to be used in the query) • Record search by term combinations • Record browsing and display • Record evaluation (for possible iteration)
Levels of search activities (Bates 1990, Fuhr 1995) • Strategy (= a plan for an entire information search, e.g. find relevant literature for a course in IA) • Stratagem: e.g. journal run, citation search... • Tactic: one or several moves made to further the search • Move: a single action
Levels of system involvement (Bates 1990) • No system involvement: all search activities human generated and executed • Displays possible activities: system lists search activities when asked. Some of the activities may be executable by the system, some may not. • Monitors search and recommends search activities: • only when the searcher asks for suggestions • always when it identifies a need • Executes desired actions automatically
Query modification by relevance feedback (picture from M.A. Hearst, http://www.sims.berkeley.edu/courses/is202/f98/Lecture25/sld005.htm)
How to utilize terms extracted from relevant documents? • The extracted terms may be added to the query • They may be presented to the user, who makes the decision about modification • They can be used for re-weighting the terms in the query
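The first option (adding extracted terms to the query automatically) can be sketched as follows; the whitespace tokenization, stopword list, and cutoff k are illustrative assumptions, not part of the lecture:

```python
from collections import Counter

# Illustrative stopword list (a real system would use a larger one).
STOPWORDS = frozenset({"the", "and", "of", "in", "a", "with"})

def expand_query(query_terms, relevant_docs, k=3):
    """Add the k most frequent new non-stopword terms found in the
    relevant documents to the query (naive whitespace tokenization)."""
    counts = Counter(
        term
        for doc in relevant_docs
        for term in doc.lower().split()
        if term not in STOPWORDS and term not in query_terms
    )
    return list(query_terms) + [t for t, _ in counts.most_common(k)]

expanded = expand_query(
    ["cat", "food"],
    ["premium cat food with salmon salmon",
     "dry cat food and salmon treats"],
)
print(expanded)  # ['cat', 'food', 'salmon', ...]
```

The second option on the slide differs only in that the candidate terms would be shown to the user for approval before being appended; the third leads to re-weighting schemes such as Rocchio’s algorithm below.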
A standard method for re-weighting: Rocchio’s Algorithm (Rocchio 1971) • Goal: to achieve an optimal query. An optimal query maximizes the difference between the average relevant vector and the average nonrelevant vector
A standard method for re-weighting: Rocchio’s Algorithm (Rocchio 1971; many modifications, e.g. Salton & McGill 1983; picture from Srinivasan 2003, http://mingo.info-science.uiowa.edu:16080/courses/230/Lectures/Vector.html#1c) Q_new = α · Q_old + β · (average relevant vector) − γ · (average nonrelevant vector)
Rocchio’s Algorithm (2) (Rocchio 1971; many modifications, e.g. Salton & McGill 1983; a more formal way of expressing the same thing – Meadow et al. 2000: 258) QW′ = QW + β · Σ_relevant (DW / R) − γ · Σ_nonrelevant (DW / N) QW: the initial query vector QW′: the vector of the modified query R = the number of relevant retrieved documents N = the number of nonrelevant retrieved documents DW = the document vector β, γ = coefficients that must be determined experimentally (β often about 0.75, γ about 0.25)
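A minimal sketch of the re-weighting formula in Python (NumPy assumed; the clipping of negative weights to zero is a common practical choice, not part of the original formulation):

```python
import numpy as np

def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Modified query: alpha*q + beta*mean(relevant) - gamma*mean(nonrelevant).
    Negative term weights are clipped to zero afterwards."""
    q = np.asarray(q, dtype=float)
    rel = np.mean(relevant, axis=0) if len(relevant) else np.zeros_like(q)
    non = np.mean(nonrelevant, axis=0) if len(nonrelevant) else np.zeros_like(q)
    return np.clip(alpha * q + beta * rel - gamma * non, 0.0, None)

# Invented 3-term example: term 2 is boosted by the relevant documents,
# term 3 is pushed below zero by the nonrelevant one and clipped.
q_new = rocchio([1, 0, 0],
                relevant=[[1, 1, 0], [1, 1, 0]],
                nonrelevant=[[0, 0, 1]])
print(q_new)  # [1.75 0.75 0.  ]
```

With the default coefficients the query drifts toward the centroid of the relevant documents while being pushed away, more weakly, from the nonrelevant ones.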
Future? According to several studies, Machine Learning methods perform better than different variants of Rocchio’s algorithm. Your experience?
Future? Future users – a preliminary case study (age: 12-13) First observations: • most frequent search goals: to DO things, not to read documents: ”Download movies”, ”Subscribe to X”, ”Translate X”, etc.
Future? (young users) • Queries in English dominate (specific to Swedish kids, or? What does it mean for multilinguality?) • Narrow terms dominate; specific terms more frequent than general ones • Quite aware of the danger of information overload • Short queries, 2-3 words per query • ”No point in searching for subcategories” (!)
Future? (young users) Consequences for system design and feedback planning?