Explore the evolution of Information Retrieval from manual catalog systems to modern digital libraries. Discover key search models, vendors, and primary users. Learn about advancements in text understanding techniques and Boolean search operations. Understand the transition to full-text retrieval and the changing role of IR in satisfying user information needs.
Information Retrieval (1955-1992)
• Primary Users
  • Law clerks
  • Reference librarians
  • (Some) news organizations, product research, congressional committees, medical/chemical abstract searches
• Primary Search Models
  • Boolean keyword searches on abstract, title, keywords
• Vendors
  • Mead Data Central (Lexis-Nexis)
  • Dialog
  • Westlaw
• Total searchable online data: O(10 terabytes)
Information Retrieval (1993+)
• Primary Users
  • First-time computer users
  • Novices
• Primary Search Models
  • Still Boolean keyword searches, with limited probabilistic models
  • But FULL TEXT retrieval
• Vendors
  • Lycos, Infoseek, Yahoo, Excite, AltaVista, Google
• Total online data: ???
Growth of the Web
[Chart: number of web sites / volume of web traffic, 1992-1998, showing exponential growth following Mosaic and Netscape; volume doubling every 6 months]
Observation
• Early IR systems basically extended library catalog systems, allowing
  • Keyword searches
  • Limited abstract searches, in addition to Author/Title/Subject, with Boolean combination functionality
• IR was seen as reference retrieval (full documents still had to be ordered/delivered by hand)
In Contrast
• Today, IR has a much wider role in the age of digital libraries
  • Full document retrieval (hypertext, PostScript, or optical image (TIFF) representations)
  • Question answering
Old View
• Function of IR:
  • Map queries to relevant documents
New View
• Satisfy the user's information need
• Infer goals/information need from:
  • the query itself
  • past user query history
  • user profiling (aol.com vs. a CS department)
  • collective analysis of other users' feedback on similar queries
• In addition, return information in a format useful/intelligible to the user
  • weighted orderings
  • clusterings of documents by different attributes
  • visualization tools
• Text understanding techniques to extract the answer to a question, or at least the relevant subregion of text
  • "Who is the current mayor of Columbus, Ohio?"
  • We don't need the full AP/CNN article on city scandals, just the answer (and an available source for proof)
Boolean Systems
Function #1: Provide a fast, compact index into the database (of documents or references)
• Data structure: inverted file (e.g., posting lists for the terms "Chihuahua" and "Nanny")
• Index options (granularity)
  • Document number
  • Page number within document
  • Actual word offset
Boolean Operations
• Chihuahua AND Nanny → intersection (∩) of the posting lists
• Chihuahua OR Nanny → union (∪) of the posting lists
• Proximity searches: Chihuahua W/3 Nanny (terms within 3 words of each other)
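As a concrete illustration, here is a minimal sketch of an inverted file at word-offset granularity supporting AND, OR, and W/n queries. The helper names and the sample documents are invented for illustration, not taken from any vendor's system.

```python
from collections import defaultdict

def build_index(docs):
    """Inverted file at word-offset granularity:
    term -> {doc_id: [offsets where the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for offset, word in enumerate(text.lower().split()):
            index[word][doc_id].append(offset)
    return index

def boolean_and(index, t1, t2):
    """t1 AND t2 -> intersection of the doc-id sets."""
    return set(index[t1]) & set(index[t2])

def boolean_or(index, t1, t2):
    """t1 OR t2 -> union of the doc-id sets."""
    return set(index[t1]) | set(index[t2])

def within(index, t1, t2, n):
    """t1 W/n t2 -> docs where the terms occur within n words."""
    hits = set()
    for doc_id in boolean_and(index, t1, t2):
        for o1 in index[t1][doc_id]:
            if any(abs(o1 - o2) <= n for o2 in index[t2][doc_id]):
                hits.add(doc_id)
                break
    return hits

docs = {1: "the chihuahua needs a nanny",
        2: "nanny wanted for beagle; no chihuahua experience required",
        3: "chihuahua breeding club newsletter"}
index = build_index(docs)
print(boolean_and(index, "chihuahua", "nanny"))  # {1, 2}
print(boolean_or(index, "chihuahua", "nanny"))   # {1, 2, 3}
print(within(index, "chihuahua", "nanny", 3))    # {1}
```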
Vector IR Model
• Map each document di, and the query Q, through a function f() to vectors: f(d1) = V1, f(d2) = V2, f(Q) = VQ
• Find the optimal f() such that similarity in vector space tracks true similarity:
  • Sim(Vi, VQ) = Sim′(di, Q)
  • Sim(V1, V2) ≈ Sim′(d1, d2)
• Typical measure: cosine distance
Vector Models
• Each document Di is mapped to a vector Vi: a bit vector capturing the essence/meaning of Di
• The query Q1 is mapped to a vector in the same way
• Retrieval: find the document maximizing Sim(Vi, Q1)
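A minimal sketch of this retrieval loop, assuming a bag-of-words f() and cosine similarity; the vocabulary and documents are invented for illustration:

```python
import numpy as np

VOCAB = ["chihuahua", "nanny", "scuba", "club"]

def f(text):
    """Toy f(): map text to a term-count vector over a fixed vocabulary."""
    words = text.lower().split()
    return np.array([words.count(t) for t in VOCAB], dtype=float)

def cosine_sim(v, q):
    """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    denom = np.linalg.norm(v) * np.linalg.norm(q)
    return float(v @ q / denom) if denom else 0.0

docs = ["chihuahua nanny wanted", "scuba club meets friday", "chihuahua club news"]
vq = f("nanny for my chihuahua")
vectors = [f(d) for d in docs]
best = max(range(len(docs)), key=lambda i: cosine_sim(vectors[i], vq))
print(docs[best])  # "chihuahua nanny wanted"
```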
Dimensionality Reduction
• d1 → f() → V1: initial (term) vector representation
• V1 → dimensionality reduction (SVD/LSI) → V̂1: a more compact, reduced-dimensionality model of d1
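A sketch of the SVD/LSI step using NumPy; the toy term-document matrix and the choice of k = 2 latent dimensions are illustrative only:

```python
import numpy as np

# Rows = terms, columns = documents (toy term-document count matrix).
A = np.array([[5., 4., 0.],   # "japan"
              [3., 2., 0.],   # "nippon"
              [0., 1., 4.],   # "scuba"
              [0., 0., 3.]])  # "diving"

k = 2  # number of latent (LSI) dimensions to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Each row below is the reduced-dimensionality representation
# V̂i of document di in the k-dimensional latent space.
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T
print(doc_vectors.shape)  # (3, 2): one k-dim vector per document
```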
Raw Term Vector vs. Condensed Vector
• Raw term vector V1 for document D1: one dimension per surface word, e.g. Japanese: 5, Japan: 3, Nippon: 1, Nihon: 1, The: 192, ...
• Condensed vector: cluster related words into a single dimension (Japanese/Japan/Nippon/Nihon → one "Japan" dimension)
• Vector offset K for word w can be computed as:
  • hash(w)
  • hash(cluster(w))
  • hash(cluster(stem(w)))
• Stemming examples: books → book; computer → comput; computation → comput
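A sketch of the hash(cluster(stem(w))) idea; the suffix-stripping stemmer and the cluster table below are deliberately tiny stand-ins for real morphological analysis and word clustering:

```python
# Toy cluster table: surface words -> a canonical cluster label.
CLUSTERS = {"japan": "japan", "japanese": "japan",
            "nippon": "japan", "nihon": "japan"}

def stem(w):
    """Crude suffix-stripping stand-in for a real stemmer
    (books -> book, computer/computation -> comput)."""
    for suffix in ("ation", "er", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[:len(w) - len(suffix)]
    return w

def offset(w, dims=1024):
    """Vector offset K = hash(cluster(stem(w))) mod dims.
    (Python's str hash is stable within a run, randomized across runs.)"""
    s = stem(w.lower())
    return hash(CLUSTERS.get(s, s)) % dims

def condensed_vector(text, dims=1024):
    v = [0] * dims
    for w in text.lower().split():
        v[offset(w, dims)] += 1
    return v

# All four surface forms land in the same dimension:
print(len({offset(w) for w in ["Japan", "Japanese", "Nippon", "Nihon"]}))  # 1
```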
Collocation (Phrasal Term)
• Give the phrasal term "soap opera" its own vector dimension, distinct from the individual words:
  • d1 (contains the collocation "soap opera"): soap = 0, opera = 0, "soap opera" = 1
  • d2 (contains "soap" and "opera" separately): soap = 1, opera = 1, "soap opera" = 0
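A sketch of indexing with a phrasal-term dictionary: bigrams found in the dictionary become a single token (hence a single vector dimension) and suppress their component words. The dictionary here is an invented stand-in:

```python
PHRASES = {("soap", "opera")}  # toy phrasal-term dictionary

def tokenize_with_phrases(text):
    """Merge dictionary bigrams into single phrasal tokens."""
    words = text.lower().split()
    tokens, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in PHRASES:
            tokens.append(words[i] + " " + words[i + 1])
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens

print(tokenize_with_phrases("my favorite soap opera"))
# ['my', 'favorite', 'soap opera']
print(tokenize_with_phrases("buy soap at the opera"))
# ['buy', 'soap', 'at', 'the', 'opera']
```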
Meaning Vectors
• f() maps a document to a meaning or context vector: m1 = f(d1), m2 = f(d2)
• Abstractly, the vector is a compressed (meaning-preserving) document
• Compression: m1 = m2 iff d1 = d2 (f() must be invertible)
• Summarization: m1 = m2 iff d1 and d2 are about the same thing (mean the same thing)
What is the optimal method for meaning-preserving compression?
• Issues
  • Size of representation (ideally size(Vi) << size(Di))
  • Cost of computing the vectors: a one-time cost at model creation
  • Cost of the similarity function: must be computed for each query, so minimizing it is crucial to speed
Header Processing
• Retain/model cross-references to cited (Xref) articles/mail
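For instance, cross-reference headers can be pulled out with Python's standard email module; the sample message below is fabricated:

```python
import email

raw = """\
Message-ID: <42@news.example.org>
References: <17@news.example.org> <29@news.example.org>
Subject: Re: chihuahua nannies
Xref: news.example.org rec.pets.dogs:3141

Body text here.
"""

msg = email.message_from_string(raw)
# Cited articles this message cross-references:
cited = (msg.get("References") or "").split()
print(cited)             # ['<17@news.example.org>', '<29@news.example.org>']
print(msg.get("Xref"))   # 'news.example.org rec.pets.dogs:3141'
```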
Body Processing / Term Weighting
• 1. Remove (most) function words (but words such as "NOT" must be treated specially)
• 2. Downweight terms by frequency
• 3. Use text analysis to decide which function words carry meaning:
  • "Mr. Dong Ving The of the Hanoi branch"
  • "The Who"
  • (use named entity analysis and phrasal dictionaries)
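A sketch of steps 1 and 2 combined: stopword removal plus inverse-document-frequency downweighting. The stopword list and corpus are toy examples; a real system would also apply the named-entity exceptions above:

```python
import math
from collections import Counter

STOPWORDS = {"the", "of", "a", "and", "in"}  # toy list; note we keep "not"

def tfidf_vector(doc, corpus):
    """Term weights: raw count * log(N / document frequency)."""
    n = len(corpus)
    words = [w for w in doc.lower().split() if w not in STOPWORDS]
    weights = {}
    for w, count in Counter(words).items():
        df = sum(1 for d in corpus if w in d.lower().split())
        weights[w] = count * math.log(n / df)
    return weights

corpus = ["the chihuahua club of hanoi",
          "the scuba club newsletter",
          "chihuahua nanny wanted"]
print(tfidf_vector(corpus[0], corpus))
# The frequent term "club" gets a lower weight than the rarer "hanoi".
```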
Supervised Learning / Training: Collective Discrimination
[Diagram: an input data stream flows through a bank of trained recognizers, one per category (A: Project #1, B: Chihuahua Breeding Club, C: Personal, J: Junk mail), producing labeled (routed) output in real time; training is ongoing]
Other Related Problems: Mail/News Routing and Filtering
• Route an incoming data stream into inboxes (prioritize/partition mail reading): Project #1 at work, Project #2 at work, Chihuahua breeding, Scuba club, Personal, Junk mail
• These systems typically model long-term information needs (people put effort into training and user feedback that they aren't willing to invest in single query-based IR)
Features for Classification (a toy classifier over these features is sketched below)
• Subject line
• Source/sender
• X-annotations
• Date/time
• Length
• Other recipients
• Message content (regions weighted differently)
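A minimal sketch of routing by such features, here a hand-rolled multinomial Naive Bayes over subject and sender tokens. The training messages and labels are fabricated, and a real router would use all of the features listed above:

```python
import math
from collections import Counter, defaultdict

def featurize(msg):
    """Tokens from the subject line and the sender, tagged by field."""
    return (["subj:" + w for w in msg["subject"].lower().split()] +
            ["from:" + msg["sender"].lower()])

class NaiveBayesRouter:
    def __init__(self):
        self.counts = defaultdict(Counter)  # label -> feature counts
        self.totals = Counter()             # label -> message count

    def train(self, msg, label):
        self.totals[label] += 1
        self.counts[label].update(featurize(msg))

    def route(self, msg):
        def score(label):
            feats = self.counts[label]
            denom = sum(feats.values()) + 1
            s = math.log(self.totals[label])
            for f in featurize(msg):
                s += math.log((feats[f] + 1) / denom)  # crude add-one smoothing
            return s
        return max(self.totals, key=score)

router = NaiveBayesRouter()
router.train({"subject": "chihuahua stud available",
              "sender": "club@dogs.example"}, "chihuahua")
router.train({"subject": "cheap meds now",
              "sender": "spam@junk.example"}, "junk")
print(router.route({"subject": "chihuahua show photos",
                    "sender": "ann@dogs.example"}))  # -> "chihuahua"
```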
Probabilistic IR Models - Intermediate Topic Models/Detectors
[Diagram: documents d1, d2 are mapped by f() to vectors V1, V2, which are fed through a bank of topic detectors (topic models) TDA, TDB, ..., TDE; each document gets a topic bit vector (e.g., TV1 = 0 1 0 1 0 0), and the query Q gets one too (e.g., 1 0 0 0 0 0), so matching can be done in topic space]
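A sketch of the topic-detector idea: each detector fires when a document's words overlap its topic's keyword set, yielding a topic bit vector like TV1 above. The topics and keywords here are invented:

```python
# Toy topic detectors: each is a keyword set; a detector "fires" (bit = 1)
# when enough of its keywords appear in the text.
TOPICS = {
    "dogs":    {"chihuahua", "breeding", "nanny"},
    "finance": {"stock", "bond", "market"},
    "scuba":   {"dive", "reef", "tank"},
}

def topic_vector(text, threshold=1):
    words = set(text.lower().split())
    return [int(len(words & kw) >= threshold) for kw in TOPICS.values()]

doc = "chihuahua breeding club market report"
query = "nanny for a chihuahua"
print(topic_vector(doc))    # [1, 1, 0]
print(topic_vector(query))  # [1, 0, 0]
# Matching in topic space: the shared "dogs" bit links query to document.
```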