Electronic Environments
  Cost trends: hardware cost < software cost < information cost < people time
  Virtuality (transcend space)
  Timeliness (minimize time)
  Interactivity
  Multimedia
  Trends: resource sharing, collaboration, dynamic representation, the WWW
  Critical need for text and multimedia management systems!
Information Seeking Perspective
  Information seeking is a human-centered process
  Analytical <------------------> Browse: a continuum of strategies and tactics
  Close coupling of queries, results, and usage
  Interactive, iterative process
  Information retrieval has focused on documents (not concepts or answers)
Electronic Text Retrieval
  1. Text retrieval is more complex than data retrieval from a DBMS.
  2. Distinguish searching for word matches from searching for concept matches.
  3. Distinguish subject search from keyword search:
       Subject: search on a controlled vocabulary (e.g., LC subject headings). The results point to documents.
       Keyword: search all words in particular fields or text fragments. The results point to documents.
  4. Distinguish exact-match from partial-match retrieval.
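The distinctions in points 2 to 4 can be sketched in a few lines of Python. This is a minimal illustration, not any particular system's method: the three records, their subject headings, and the function names are all hypothetical, and partial matching is approximated here by counting shared query words.

```python
# Hypothetical toy records: free text plus indexer-assigned subject headings.
RECORDS = {
    "Doc1": {"text": "Training neural networks for speech",
             "subjects": {"Machine learning"}},
    "Doc2": {"text": "Learning to drive a machine shop lathe",
             "subjects": {"Machine-tools"}},
    "Doc3": {"text": "Statistical pattern recognition",
             "subjects": {"Machine learning", "Statistics"}},
}


def _words(text):
    """Lowercase, whitespace-separated words of a text."""
    return set(text.lower().split())


def subject_search(heading):
    """Subject search: look up a controlled-vocabulary heading."""
    return sorted(d for d, r in RECORDS.items() if heading in r["subjects"])


def keyword_search(words):
    """Exact-match keyword search: every query word must occur in the text."""
    query = {w.lower() for w in words}
    return sorted(d for d, r in RECORDS.items() if query <= _words(r["text"]))


def partial_match(words):
    """Partial match: rank by number of query words found, drop non-matches."""
    query = {w.lower() for w in words}
    scored = [(len(query & _words(r["text"])), d) for d, r in RECORDS.items()]
    return [d for score, d in sorted(scored, key=lambda p: (-p[0], p[1]))
            if score > 0]


print(subject_search("Machine learning"))                   # concept-level lookup
print(keyword_search(["machine", "learning"]))              # word-level lookup
print(partial_match(["statistical", "pattern", "speech"]))  # ranked, partial
```

In this toy data the subject heading "Machine learning" finds the two documents about that concept, while the keyword query for the same two words finds only the unrelated lathe document, which is the word-match versus concept-match gap the slide describes.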
Approaches to Text Retrieval
  1. Surrogate Search: search a set of predefined words that point to related documents. Requires indexing via some controlled vocabulary.
       Pros: natural transition from paper systems; computationally cheap
       Cons: limited access; human indexing required
  2. Full-Text Search: search every word in every document.
       Pros: broadens access; indexing can be automated
       Cons: computationally expensive; matches words rather than concepts
  3. Knowledge-Based Search: search a set of concepts that are related to concepts in documents.
       Pros: improved retrieval
       Cons: computationally expensive; theoretical at present
Full-Text Search
  Search every word (or variant) in the document except stop words.
  Methods:
    Text scanning
    Indexes (inverted files)
    Vectors
    Signatures
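Text scanning, the first method listed, can be sketched as a linear pass over every document at query time. The stop-word list, the sample documents, and the function names below are assumptions made for illustration; word variants (stemming) are not handled in this sketch.

```python
import re

# Assumed, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}

# Hypothetical sample collection.
DOCS = {
    "Doc1": "The aardvark is a nocturnal mammal.",
    "Doc2": "An abacus is used to count.",
    "Doc3": "The zygote and the aardvark.",
}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOP_WORDS]


def scan(documents, query):
    """Scan every document; keep those containing all non-stop query words."""
    terms = set(tokenize(query))
    hits = []
    for doc_id, text in documents.items():
        if terms and terms <= set(tokenize(text)):
            hits.append(doc_id)
    return hits


print(scan(DOCS, "the aardvark"))
```

No index is built, so each query costs time proportional to the size of the whole collection, which is why the indexed methods in the list exist.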
Inverted File
  Words point to word number, offset, surrogate, or document:
    aardvark   *Doc 3, Doc 7, Doc 45, Doc 67, ...
    abacus     Doc 2, Doc 16, Doc 33, Doc 45, Doc 67, ...
    .
    .
    zygote     Doc 7, Doc 33, Doc 67, Doc 123, ...
  Find all documents for each query word, and then apply logical operators to combine them.
  A query either matches or does not match.
  * actually Doc 3, Para 5, Word 45
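A minimal sketch of an inverted file with Boolean combination follows. The document texts are made up so that the postings resemble a few of the entries on the slide; the function names are hypothetical, and positions are recorded as simple word offsets rather than the paragraph and word numbers the footnote mentions.

```python
from collections import defaultdict

# Hypothetical texts chosen so the postings echo the slide's example.
DOCS = {
    "Doc7": "aardvark zygote",
    "Doc33": "abacus zygote",
    "Doc45": "aardvark abacus",
    "Doc67": "aardvark abacus zygote",
}


def build_inverted_file(documents):
    """Map each word to {doc_id: [word offsets]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for offset, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(offset)
    return index


def postings(index, word):
    """Set of documents containing the word (empty if the word is unknown)."""
    return set(index.get(word.lower(), {}))


index = build_inverted_file(DOCS)

# Logical operators are plain set operations on the postings.
both = postings(index, "aardvark") & postings(index, "zygote")      # AND
either = postings(index, "aardvark") | postings(index, "abacus")    # OR
only = postings(index, "aardvark") - postings(index, "abacus")      # AND NOT

print(sorted(both), sorted(either), sorted(only))
```

Each query word costs one lookup, and the result is a plain set: a document is either in it or not, with no ranking among the matches.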
Vectors
  Each document (or surrogate) is represented by a vector defined by every word in the collection.
    Doc 1    0 0 1 1 0 0 ..... 0
    Doc 2    0 0 0 0 1 1 ..... 0
    .
    Doc 7    1 0 0 1 0 0 ..... 1   (has aardvark and zygote)
    .
    Doc 33   0 1 0 0 0 0 ..... 1   (has abacus and zygote)
    .
    Doc 67   1 1 0 0 0 0 ..... 1   (has aardvark, abacus, and zygote)
    .
    Doc N
  Queries are expressed as vectors and matched to document vectors. Degrees of matching are possible.
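The vector representation can be sketched as follows. The slide does not name a matching function, so cosine similarity is used here as one common choice, stated as an assumption; the three-word vocabulary and the function names are also illustrative only.

```python
import math

# Illustrative vocabulary and documents, reduced to the three named words.
VOCABULARY = ["aardvark", "abacus", "zygote"]
DOC_WORDS = {
    "Doc7": ["aardvark", "zygote"],
    "Doc33": ["abacus", "zygote"],
    "Doc67": ["aardvark", "abacus", "zygote"],
}


def to_vector(words, vocabulary):
    """Binary vector: 1 where the vocabulary word is present, else 0."""
    present = set(words)
    return [1 if term in present else 0 for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query_words, doc_words, vocabulary):
    """Score every document against the query; best match first."""
    q = to_vector(query_words, vocabulary)
    scored = [(doc_id, cosine(q, to_vector(words, vocabulary)))
              for doc_id, words in doc_words.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


for doc_id, score in rank(["aardvark", "abacus"], DOC_WORDS, VOCABULARY):
    print(doc_id, round(score, 3))
```

Unlike the Boolean case, every document receives a score, so a document that contains only one of the two query words still appears, ranked below the one that contains both.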
Document Alternatives
  Paragraphs, passages
  SGML codes
  Related problems:
    text summarization / auto-abstracting
    auto-categorization
Multimedia
  Linguistic surrogates
  Images: color, texture, luminosity, shape
  Video: same as stills, but add motion
  Sound: speaker attributes, pitch, duration
Retrieval Trends
  1. More full-text databases (e.g., the Web!)
  2. More statistical engines for ranking results (e.g., PLS, Inquiry, RetrievalWare, Topic)
  3. Evolution in traditional markets (e.g., Dialog's Target, West's WIN, Mead's Freestyle)
  4. WWW engines and services (Yahoo, Alta Vista, etc.)
  5. Relevance feedback added
  6. Multimedia developments