Information Access I: An Introduction to Information Retrieval. GSLT, Göteborg, September 2003. Barbara Gawronska, Högskolan i Skövde. Topics: 1st intensive week: Introduction, Knowledge representations for IA, Text categorization, indexing (theory + labs), Text summarization, User aspects
Information Access I: An Introduction to Information Retrieval. GSLT, Göteborg, September 2003. Barbara Gawronska, Högskolan i Skövde
Topics: 1st intensive week: • Introduction • Knowledge representations for IA • Text categorization, indexing (theory + labs) • Text summarization • User aspects 2nd intensive week: • Interactivity • Multilingual systems and resources • Evaluation
Schedule and content Thursday 11/9 8-10 BG: An Introduction to Information Retrieval Central notions: • Information/knowledge/data/metadata • Information Retrieval vs. Data Retrieval • Information Extraction, abstracting, summarization A survey of the history of Information Retrieval (Standard IR-models)
Schedule and content 2 10-12 BG: Representation of information and identification of significant text features (Standard IR-models) Different types of information and knowledge representation: • classical knowledge representation methods (top-down and bottom-up hierarchical classifications, thesauri) • weighting techniques and co-occurrence-based techniques The notion of Retrieval Status Value (RSV) and methods for RSV evaluation.
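To make the notions of term weighting and Retrieval Status Value concrete, here is a minimal sketch. It assumes one common weighting scheme (term frequency times inverse document frequency) and defines the RSV as a plain sum over query terms; the slides do not say which scheme the course uses, and the function names and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def idf_table(docs):
    """Map each term to log(N / df), where df is the number of documents containing it."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log(n_docs / count) for term, count in df.items()}

def rsv(query, doc, idf):
    """Assumed RSV: sum of tf * idf over the distinct query terms."""
    tf = Counter(tokenize(doc))
    return sum(tf[t] * idf.get(t, 0.0) for t in set(tokenize(query)))

def rank(query, docs):
    """Return (rsv, document index) pairs, best match first."""
    idf = idf_table(docs)
    scored = [(rsv(query, d, idf), i) for i, d in enumerate(docs)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical toy collection, not taken from the course material.
DOCS = [
    "stock exchange and share market news",
    "banking and finances",
    "birds and trees",
]

if __name__ == "__main__":
    for score, index in rank("share market", DOCS):
        print(f"{score:.3f}  {DOCS[index]}")
```

A term such as "and", which occurs in every document, gets weight log(1) = 0 and so cannot influence the ranking; this is the intuition behind weighting by inverse document frequency.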
Schedule and content 3 15-17 HD: User aspects in information retrieval and text categorization • presentation of search results using KWIC, Key-word-in-context, • marking up words in documents, • automatic dynamic spell checking of the search query, • term expansion and synonym search • categorization and clustering of texts
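As an illustration of the KWIC presentation mentioned above, the following sketch lists each occurrence of a search word with a few words of context on either side. The function names and the `width` parameter are assumptions made for this example, and the tokenization is naive whitespace splitting.

```python
def kwic(text, keyword, width=3):
    """Return one (left context, keyword, right context) triple per occurrence."""
    tokens = text.split()
    hits = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively and ignore trailing punctuation.
        if token.lower().strip(".,;:!?") == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

def format_kwic(hits):
    """Right-align the left context so the keywords line up in one column."""
    pad = max((len(left) for left, _, _ in hits), default=0)
    return [f"{left:>{pad}}  {kw}  {right}" for left, kw, right in hits]

if __name__ == "__main__":
    for line in format_kwic(kwic("A big book is a big disaster", "big", width=2)):
        print(line)
```

Aligning the keyword in a fixed column is what lets a user scan many hits quickly and judge relevance from the immediate context.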
Schedule and content 4 Friday 12/9 8-10 HD: Information extraction and automatic text summarization • information extraction techniques for text summarization (statistics, linguistics and heuristics) • a demonstration of SweSum • evaluation of automatic text summarization systems. 10-12, 13-15 ED: Testing automatic indexing with predefined categories (labs)
Information Access • Information Recovery • Information Retrieval (Information Recovery ≠ Information Retrieval; Information Retrieval includes a selection process) • Information Refinement: • Multilingual Information Retrieval • Information Extraction • Abstracting, summarization
Data Retrieval vs. IR (2) (the German IR Research Group) IR systems have to handle ”uncertain knowledge” (”unsicheres Wissen”): • Vague queries; reformulation frequently required • The problem of the user’s own understanding of his/her information need • Limitations of knowledge representations
IR – Main Issues How to • Represent • Interpret • Categorize • Evaluate
The history of IR in brief • Early IR – ”a history of how indexes were created and searched” (Meadow et al. 2000:20) • Index – a broad definition: a systematic scheme that places like material together. Thus, books arranged in alphabetical order can also be seen as an index
The history of IR in brief (2) • Pre-alphabetical systems: • ca 2700 BC, the Sumerian culture: grouping by similarity among initial ideograms (birds, bowls, trees...) • ca 1500 BC – first phonetic-based systems (syllable-based, later – phoneme-based)
The history of IR in brief (3) • First attempts to utilize letter frequency in search: Arabs, 9th century • Categorization of documents in ancient libraries: • Babylonian ”libraries” 12th-7th century BC: categories like astronomy, geography, history, mathematics, natural science, laws and...linguistics • The Alexandrian library; Callimachus (310-240 BC): 8 categories; subject matter and genre as criteria (history, laws, medicine, philosophy, lyric poetry, oratory, tragedy – a catalogue in 120 scrolls – pinakes)
The history of IR in brief (4) • CALLIMACHUS in Áitia: ”A big book is a big disaster...”
The history of IR in brief (5) • 1876 – Melvil Dewey, USA – Dewey Decimal Classification (DDC) • Universal Decimal Classification (UDC, Otlet & La Fontaine) • 10 main classes • hierarchical organization • max 10 branches from 1 node • today 130 000 classes
The history of IR in brief (6) • Universal Decimal Classification, an example: • 3 Social science, laws, administration • 33 National economics • 336 Finances • 336.7 Banking • 336.76 Stock exchange • 336.763 Share market
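The example shows that each UDC code carries its position in the hierarchy: dropping the last digit gives the parent class. A small sketch of that idea follows. The table holds only the six classes from the slide, the function names are hypothetical, and the dot handling assumes codes of at most six digits, as in the example.

```python
# The six classes from the slide; a real UDC table is far larger.
UDC = {
    "3": "Social science, laws, administration",
    "33": "National economics",
    "336": "Finances",
    "336.7": "Banking",
    "336.76": "Stock exchange",
    "336.763": "Share market",
}

def ancestors(code):
    """Return the chain of class codes from the main class down to `code`."""
    digits = code.replace(".", "")
    chain = []
    for length in range(1, len(digits) + 1):
        prefix = digits[:length]
        # Re-insert the dot after the third digit, following the slide's notation.
        chain.append(prefix if length <= 3 else prefix[:3] + "." + prefix[3:])
    return chain

def path(code):
    """Return the class labels along the chain, skipping codes missing from the table."""
    return [UDC[c] for c in ancestors(code) if c in UDC]

if __name__ == "__main__":
    print(" > ".join(path("336.763")))
```

Because one extra digit selects a child, a node can have at most ten branches, which is the "max 10 branches from 1 node" property of decimal classification.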
The history of IR in brief (7) • Card catalogues – 18th/19th century • How to represent the content of a document in an index? Precoordination vs. postcoordination of index terms
The history of IR in brief (8) • 1950 – • M. Taube – the Uniterm system • W.E. Batten – the optical coincidence system (in both systems, the TERM serves as the starting point) • C. Mooers – the Zatocoding system (cards represent DOCUMENTS and are provided with descriptors, coded as series of holes at the edge of the card)
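The contrast between term-oriented and document-oriented cards can be sketched in a few lines. This is only an abstraction of the two layouts under assumed data: it ignores the physical coding (punched holes, light shining through cards), and the document numbers, descriptors and function names are hypothetical.

```python
from collections import defaultdict

# Document-oriented layout: one card per document, carrying its descriptors.
DOCUMENT_CARDS = {
    1: {"banking", "finance"},
    2: {"banking", "stock exchange"},
    3: {"finance", "stock exchange", "banking"},
}

def build_term_cards(document_cards):
    """Term-oriented layout: one card per term, listing document numbers."""
    term_cards = defaultdict(set)
    for doc_id, descriptors in document_cards.items():
        for term in descriptors:
            term_cards[term].add(doc_id)
    return dict(term_cards)

def search_term_cards(term_cards, *terms):
    """Postcoordinate search: intersect the number lists of the requested term cards."""
    postings = [term_cards.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

def search_document_cards(document_cards, *terms):
    """Scan every document card and keep those carrying all requested descriptors."""
    wanted = set(terms)
    return sorted(d for d, desc in document_cards.items() if wanted <= desc)

if __name__ == "__main__":
    cards = build_term_cards(DOCUMENT_CARDS)
    print(search_term_cards(cards, "banking", "stock exchange"))
    print(search_document_cards(DOCUMENT_CARDS, "banking", "stock exchange"))
```

Both searches return the same documents; the difference is whether the searcher pulls a few term cards and compares them, or passes the whole deck of document cards through a selection step. The term-oriented layout corresponds to what is now called an inverted index.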
The history of IR in brief (9) • 1950- • First computerized IR systems (special purpose computers): • The Western Reserve Rapid Searching Selector (1957, Shera, Kent & Berry) • Based on human-created telegraphic abstracts • Aimed at technical texts • Semantic categories like product, process, material... • The Minicard Selector (1959, Kessel & DeLucia)
The history of IR in brief (10) • Late 1950s - early 1970s: • first IR-systems on general purpose computers (Bracken & Tillitt 1957) • computerized IR should become more than simple string matching: the idea of utilizing word frequencies and inverse document frequency (idf) – Luhn, Bar-Hillel • first online IR services (MAC at MIT, MEDLINE, Lexis/Nexis) • The Internet's precursor, ARPANET (4 hosts in 1969)
IR today and in the future • From simple string matching towards NLP-techniques (statistics/heuristics/morphology/semantics/pragmatics) • Natural language in queries • Integrating speech technology • Multilingual retrieval and extraction • Multimedia retrieval
A document from different perspectives (Meghini et al. 91, modified)