Explore the functions, components, and challenges of Information Retrieval systems, including vocabulary, relevance, and evaluation. Learn about different IR system types and problems to overcome in designing them. Discover the dimensions of variety in databases, documents, and queries.
INFS 276: INFORMATION RETRIEVAL SYSTEMS
Week 1: Vocabulary, relevance, and evaluation
Jonathan Furner
COURSE THEME
• user-oriented focus ...
• ... on system design
TODAY’S THEMES
• what is the function of an IR system?
• what are the components of an IR system?
• what kinds of IR system are there?
• what kinds of problem need to be overcome when designing an IR system?
FUNCTION: 1
[Diagram; region labels: World 1, World 2, World 3]
• THE SITUATION
  - domain, tasks, needs, goals
• THE INFO-SEEKER
  1 knowledge structure = K(S)
    - about World 1
      - domain, situation
      - task, need, goal
    - about World 2
    - about World 3
      - info sources
      - info services
  2 cognitive abilities
  3 cognitive styles
• INFO-SOURCES (DOCUMENTS)
  - info-as-thing
• INFO SERVICES
  - info workers
  - ref/bib sources
  - IR systems
• info-as-process
• INFO-as-knowledge = ΔI
• K(S) + ΔI = K(S + ΔS)
FUNCTION: 2
• to identify all and only those documents in a collection that (individually or collectively) satisfy the needs of the information seeker
• i.e., to identify relevant documents
• needs are key
COMPONENTS: 1
• people
  - seekers
  - authors
  - intermediaries
  - catalogers
  - indexers
  - designers
  - funders
• things
  - needs
  - documents & collections
  - queries
  - records & databases
  - terms
  - systems
  - money
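COMPONENTS: 2
[Diagram: flow through an IR system]
• seeker side: SEEKER, NEED, REQUEST, (query formulation) QUERY, (query analysis) QUERY REP
• author side: AUTHORS, INFORMATION, DOCUMENTS, (database creation) RECORDS, (document analysis) DOCUMENT REPS
• matching / ranking
  - input: QUERY REP, DOCUMENT REPS
  - output: RECORDS, passed to display
• relevance? (feedback)
• other labels: subj., obj.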
DIMENSIONS OF VARIETY: 1
document collection
• data type: text vs. image vs. multimedia
• coverage: subject matter, language, currency, etc.
• size: hundreds vs. billions of records
• location: congregated vs. distributed
DIMENSIONS OF VARIETY: 2
database
• storage location: remote vs. local
• storage medium: optical vs. magnetic
• record type: full-text vs. bibliographic vs. numeric
• field structure: unstructured vs. semi-structured vs. highly structured
• representation of inter-document relationships: implicit vs. explicit
DIMENSIONS OF VARIETY: 3
document analysis (indexing) mechanism
• agent: automatic vs. manual
• unit: word vs. phrase
• origin: derived vs. assigned
• vocabulary: controlled vs. natural language
• coverage: full-text vs. field limitation
• normalization: term vs. document vs. collection weighting
• syntagmatic co-ordination: at index time vs. at search time
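One combination of the indexing choices above (automatic agent, word unit, derived terms, natural-language vocabulary, full-text coverage) can be sketched as follows. This is only an illustrative sketch: the function names, the tiny stoplist, and the choice of a tf-idf weight are assumptions made for the example, not something the slides prescribe.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stoplist, kept tiny for illustration.
STOPWORDS = {"a", "an", "and", "of", "the", "to", "in"}

def derive_terms(text):
    """Automatic, derived, word-unit indexing: lowercase tokens minus stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def build_index(docs):
    """Return {term: {doc_id: weight}} for a dict of {doc_id: text}.

    The weight is term frequency times log(N / document frequency),
    an assumed weighting scheme chosen only for this sketch.
    """
    tf = {doc_id: Counter(derive_terms(text)) for doc_id, text in docs.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    n = len(docs)
    index = defaultdict(dict)
    for doc_id, counts in tf.items():
        for term, freq in counts.items():
            index[term][doc_id] = freq * math.log(n / df[term])
    return dict(index)

docs = {
    "d1": "The retrieval of information",
    "d2": "Information systems and evaluation",
}
index = build_index(docs)
```

In this toy collection, a term that occurs in every document ("information") gets weight zero, while a term found in only one document ("retrieval") gets a positive weight; this is one way collection-level statistics can feed into term weighting.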
DIMENSIONS OF VARIETY: 4
query analysis mechanism
• query type: Boolean vs. unstructured vs. natural language
• support for user profiling
DIMENSIONS OF VARIETY: 5
matching / ranking (retrieval) mechanism
• status: operational vs. experimental
• model: exact-match (Boolean search) vs. best-match (similarity or ranked-output search)
• type of similarity measure
• type of ranking algorithm
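The contrast between exact-match and best-match retrieval can be made concrete with a small sketch. Everything here is assumed for illustration: the whitespace tokenizer, the Boolean AND interpretation of the query, and cosine similarity over raw term counts as the similarity measure.

```python
import math
from collections import Counter

def tokens(text):
    """Naive whitespace tokenizer (an assumption made for this sketch)."""
    return text.lower().split()

def boolean_and(query, docs):
    """Exact match: return the unranked set of docs containing every query term."""
    q = set(tokens(query))
    return [doc_id for doc_id, text in docs.items() if q <= set(tokens(text))]

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_match(query, docs):
    """Best match: rank docs by similarity to the query, dropping zero scores."""
    q = Counter(tokens(query))
    scored = [(cosine(q, Counter(tokens(text))), doc_id) for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1])) if score > 0]

docs = {
    "d1": "image retrieval systems",
    "d2": "text retrieval evaluation",
    "d3": "library funding",
}
exact = boolean_and("text retrieval", docs)
ranked = best_match("text retrieval", docs)
```

For the query "text retrieval", the Boolean search returns only d2, whereas the ranked search returns d2 first and then d1, which matches the query only partially.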
DIMENSIONS OF VARIETY: 6
user interface
• degree and type of support for query formulation ...
• ... and for query re-formulation / expansion
  - using a thesaurus, or user feedback, or “blind”
  - automatic or interactive
• mode of presentation of search results
  - supportive of relevance judgment?
DIMENSIONS OF VARIETY: 7
refinements
• natural language processing (NLP) techniques
  - e.g., for phrase identification
• passage retrieval
• relevance feedback and query expansion techniques
  - query-by-example / “more like this” / “related records”
• social feedback
  - e.g., recommender systems
• hypertext / bibliometric techniques
  - link / citation analysis, of document relationships
• data fusion: multiple sources of evidence
PROBLEMS: 1
the vocabulary (objective relevance) problem
• different people use different terms to refer to the same things
• solution? vocabulary control and thesauri
  - encoding semantic knowledge (i) about terms, and (ii) about paradigmatic relationships between terms
  - providing user and indexer access to this knowledge base
  - providing interactive/automatic support for thesaurus-based query expansion
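Automatic thesaurus-based query expansion can be sketched as below. The thesaurus entries and the function name are hypothetical; a real thesaurus would encode richer relationships (broader, narrower, related terms) than the flat equivalence sets assumed here.

```python
# Hypothetical thesaurus: each term maps to terms treated as equivalent.
THESAURUS = {
    "car": {"automobile", "motor vehicle"},
    "film": {"movie", "motion picture"},
}

def expand_query(terms, thesaurus):
    """Return the query terms plus their thesaurus equivalents.

    Original order is preserved and duplicates are dropped, so the
    seeker's own vocabulary always comes first.
    """
    expanded = []
    for term in terms:
        for candidate in [term, *sorted(thesaurus.get(term, ()))]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

expanded = expand_query(["car", "rental"], THESAURUS)
```

Here a seeker who types "car" would also match records indexed under "automobile" or "motor vehicle", which is the point of vocabulary control: different terms for the same thing are brought together.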
PROBLEMS: 2
the (subjective) relevance problem
• subjective relevance doesn’t depend simply on semantic content, but on other characteristics ...
  - ... of the document: e.g., perceived ‘quality’
  - ... of contexts of need and use
  - ... of search history, due to dynamic nature of information need
• solutions?
  - use of descriptive (non-topical) metadata
  - providing interactive/automatic support for feedback-based query expansion
  - automatic clustering (classification): clustered docs are assumed to be equally relevant
  - plus all the refinements listed a couple of slides ago
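Feedback-based query expansion can be illustrated with a deliberately simple heuristic: take the documents the seeker has judged relevant and add their most frequent new terms to the query. The function name, the parameter `k`, and the frequency heuristic itself are assumptions made for this sketch; operational systems typically use more refined term-selection and weighting methods.

```python
from collections import Counter

def expand_from_feedback(query_terms, relevant_docs, k=2):
    """Add the k most frequent new terms from docs the seeker judged relevant."""
    counts = Counter(
        term
        for text in relevant_docs
        for term in text.lower().split()
        if term not in query_terms
    )
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return list(query_terms) + ranked[:k]

judged_relevant = ["jaguar cat habitat", "jaguar cat diet"]
new_query = expand_from_feedback(["jaguar"], judged_relevant)
```

In this example the ambiguous query "jaguar" is expanded with "cat" and "diet", reflecting what the seeker's judgments reveal about the need as it develops during the search.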
PROBLEMS: 3
the evaluation problem
• finding alternatives to ...
  - precision: ratio of number of relevant records retrieved to total number of records retrieved
  - recall: ratio of number of relevant records retrieved to total number of relevant records
• outside laboratory setting, recall can only be estimated
• different people use different criteria of ‘goodness’
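The two ratios defined above can be computed directly when the full set of relevant records is known, as in a laboratory test collection. The function name and the record identifiers below are illustrative assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    precision = relevant records retrieved / all records retrieved
    recall    = relevant records retrieved / all relevant records
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search: 4 records retrieved, 5 relevant records in the collection.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7", "d8", "d9"],
)
```

With two relevant records among the four retrieved, precision is 0.5 and recall is 0.4. The `relevant` argument is exactly what is unavailable outside the laboratory, which is why recall can then only be estimated.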