Modern Information Retrieval Chapter 1: Introduction. Ricardo Baeza-Yates, Berthier Ribeiro-Neto.
Motivation • Example of a user information need • Topic: NCAA college tennis team • Description: Find all the pages (documents) containing information on college tennis teams which (1) are maintained by a university in the USA and (2) participate in the NCAA tennis tournament. • Narrative: To be relevant, the page must include information on the national ranking of the team in the last three years and the email or phone number of the team coach.
IR Research • Information retrieval vs Data retrieval • Research • information search • information filtering (routing) • document classification and categorization • user interfaces and data visualization • cross-language retrieval
IR History • 1970 • 1990, WWW
The User Task • Retrieval (Searching) • classic information search process where clear objectives are defined • Browsing • a process where one’s main objectives are not clearly defined and might change during the interaction with the system
Logical View of the Documents • Text Operations • reduce the complexity of the document representation • a full text → a set of index terms • Steps 1. Stopword removal 2. Stemming 3. Noun groups 4. ...
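The first two steps can be sketched as follows. This is only an illustration: the stopword list and the suffix-stripping rule are hypothetical stand-ins (a real system would use a full stopword list and a proper stemmer), and noun-group detection is left out.

```python
import re

# Hypothetical, deliberately tiny stopword list (not from the slides).
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to", "which"}

# Crude suffix stripping; an assumed stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Reduce a full text to a sorted set of index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted({crude_stem(t) for t in tokens if t not in STOPWORDS})


terms = index_terms("Find pages containing information on college teams")
print(terms)
```

The full sentence shrinks to a handful of index terms, which is the reduction in representation complexity the slide refers to.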
Past, Present, and Future • Early Development • Index • Library • Author name, title, subject headings, keywords • The Web and Digital Libraries • Hyperlinks
Resources • Journals • Journal of the American Society for Information Science • ACM Transactions on Information Systems • Information Processing and Management • Information Systems (Elsevier) • Knowledge and Information Systems (Springer) • Conferences • ACM SIGIR, DL, CIKM, CHI, etc. • Text Retrieval Conference (TREC)
Conventional Text-Retrieval Systems • Automatic Text Processing, G. Salton, Addison-Wesley, 1989 (Chapter 9)
Data Retrieval • A specified set of attributes is used to characterize each record. EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO) • Exact match between the attributes used in query formulations and those attached to the document. SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = 'John Smith'
Text-Retrieval Systems • Content identifiers (keywords, index terms, descriptors) characterize the stored texts. • Degrees of coincidence between the sets of identifiers attached to queries and to documents • [Figure: content analysis, query formulation]
Possible Representation • Document representation • unweighted index terms (term vectors) • weighted index terms • … • Query • unweighted or weighted index terms • Boolean combinations (or, and, not) • … • Search operation must be effective
File Structures • Main requirements • fast-access for various kinds of searches • large number of indices • Alternatives • Inverted Files • Signature Files • PAT trees
Inverted Files • File is represented as an array of indexed documents.
Inverted-file process • The document-term array is inverted (transposed).
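A minimal sketch of the inversion step, assuming a small made-up 0/1 document-term array (the values are illustrative and not taken from the slides):

```python
# Rows are documents, columns are terms (1 = the term is assigned to the document).
# The values are invented for illustration.
doc_term = [
    [1, 0, 1],  # D1
    [1, 1, 0],  # D2
    [0, 1, 1],  # D3
]

# Transposing yields one row per term, marking the documents it occurs in.
term_doc = [list(row) for row in zip(*doc_term)]

# Equivalent postings-list form: term index -> 1-based document numbers.
postings = {
    t: [d + 1 for d, bit in enumerate(row) if bit]
    for t, row in enumerate(term_doc)
}

print(term_doc)
print(postings)
```

The postings form stores only the non-zero entries, which is how inverted files are usually kept when the array is sparse.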
Inverted-file process (Continued) • Take two or more rows of an inverted term-document array, and produce a single combined list of document identifiers. • Ex: Query = (term2 and term3)
term2   1 1 0 0
term3   0 1 1 1
---------------
          1       <-- D2
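The row combination for an "and" query can be sketched as a position-wise AND over the term rows. The rows below are the ones from the example; labelling the columns D1 to D4 is an assumption made here for readability.

```python
# Rows of the inverted term-document array from the example.
term2 = [1, 1, 0, 0]
term3 = [0, 1, 1, 1]


def and_rows(*rows):
    """Combine term rows position by position; 1 only where every row has 1."""
    return [int(all(bits)) for bits in zip(*rows)]


combined = and_rows(term2, term3)

# Assumed labelling: column i corresponds to document D(i+1).
matches = [f"D{i + 1}" for i, bit in enumerate(combined) if bit]

print(combined, matches)
```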
List-merging for two ordered lists • The inverted-index operations to obtain answers are based on a list-merging process. • Example
T1: {D1, D3}
T2: {D1, D2}
Merged(T1, T2): {D1, D1, D2, D3}
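A sketch of the merge on the example lists, with document identifiers written as integers. How the merged list is then read for "and" and "or" queries is an assumption here: a duplicated identifier is taken to mean the document appears in both lists, and removing duplicates gives the union. Each input list is assumed to be sorted and free of internal duplicates.

```python
def merge(a, b):
    """Merge two sorted lists into one sorted list, keeping duplicates."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def and_from_merged(merged):
    """Identifiers that occur twice in a row were present in both lists."""
    return [x for x, y in zip(merged, merged[1:]) if x == y]


def or_from_merged(merged):
    """Dropping adjacent duplicates leaves the union of both lists."""
    out = []
    for x in merged:
        if not out or out[-1] != x:
            out.append(x)
    return out


t1 = [1, 3]  # T1: {D1, D3}
t2 = [1, 2]  # T2: {D1, D2}
merged = merge(t1, t2)
print(merged, and_from_merged(merged), or_from_merged(merged))
```

Because both inputs are ordered, the merge takes a single pass over each list.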
Extensions of Inverted Index Operations (Distance Constraints) • Distance Constraints • (A within sentence B): terms A and B must co-occur in a common sentence • (A adjacent B): terms A and B must occur adjacently in the text
Extensions of Inverted Index Operations (Distance Constraints) • Implementation • include term-location in the inverted indexes
information: {P345, P348, P350, …}
retrieval: {P123, P128, P345, …}
• include sentence-location in the indexes
information: {P345, 25; P345, 37; P348, 10; P350, 8; …}
retrieval: {P123, 5; P128, 25; P345, 37; P345, 40; …}
Extensions of Inverted Index Operations (Distance Constraints) • Include paragraph numbers in the indexes, sentence numbers within paragraphs, and word numbers within sentences
information: {P345, 2, 3, 5; …}
retrieval: {P345, 2, 3, 6; …}
• Query examples
(information adjacent retrieval)
(information within five words retrieval)
• Cost: the size of indexes
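A minimal sketch of how such positional entries could answer the two query examples. Each posting is read as (document, paragraph, sentence, word). The P345 entries come from the example; the P400 entries are invented to show a non-match. Treating "adjacent" as order-independent and requiring both terms to share a sentence are assumptions of this sketch.

```python
# term -> list of (document, paragraph number, sentence number, word number)
index = {
    "information": [("P345", 2, 3, 5), ("P400", 1, 1, 2)],  # P400 is made up
    "retrieval": [("P345", 2, 3, 6), ("P400", 1, 1, 9)],    # P400 is made up
}


def within(index, a, b, max_gap):
    """Documents where a and b occur at most max_gap words apart in one sentence."""
    hits = set()
    for doc_a, par_a, sen_a, word_a in index.get(a, []):
        for doc_b, par_b, sen_b, word_b in index.get(b, []):
            same_sentence = (doc_a, par_a, sen_a) == (doc_b, par_b, sen_b)
            if same_sentence and 0 < abs(word_a - word_b) <= max_gap:
                hits.add(doc_a)
    return sorted(hits)


def adjacent(index, a, b):
    """Adjacency is the special case of a one-word gap."""
    return within(index, a, b, 1)


print(adjacent(index, "information", "retrieval"))
print(within(index, "information", "retrieval", 5))
```

Every extra level of location detail makes more constraints answerable from the index alone, at the price of larger postings.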
Term Weights • Term Weights: Di = {Ti1, 0.2; Ti2, 0.5; Ti3, 0.6} • Issues • How to generate the term weights? • How to apply the term weights? • Sum the weights of all document terms that match the given query. • Rank the output documents in descending order of the summed term weight.
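One way to apply the weights, sketched under the assumption that the query is a plain set of terms and that the document vectors below are illustrative values:

```python
# Illustrative weighted document vectors: term -> weight.
documents = {
    "D1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "D2": {"T1": 0.7, "T2": 0.2, "T3": 0.1},
}


def rank(documents, query_terms):
    """Score each document by the summed weights of its matching terms."""
    scores = {
        doc: sum(w for term, w in weights.items() if term in query_terms)
        for doc, weights in documents.items()
    }
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = rank(documents, {"T2", "T3"})
for doc, score in ranked:
    print(doc, round(score, 2))
```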
Boolean Query with Term Weights • Transform a Boolean expression into disjunctive normal form: T1 and (T2 or T3) = (T1 and T2) or (T1 and T3) • For each conjunct, compute the minimum term weight of any document term in that conjunct. • The document weight is the maximum of all the conjunct weights.
Boolean Query with Term Weights • Example: Q = (T1 and T2) or T3
Document vectors                  Conjunct weights       Query weight
                                  (T1 and T2)   (T3)     (T1 and T2) or T3
D1 = (T1,0.2; T2,0.5; T3,0.6)     0.2           0.6      0.6
D2 = (T1,0.7; T2,0.2; T3,0.1)     0.2           0.1      0.2
D1 is preferred.
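The min/max rule can be sketched as below, using the document vectors and query from the example. The query is assumed to be already in disjunctive normal form, written as a list of conjuncts; giving a missing term a weight of 0.0 is an assumption of this sketch.

```python
def document_weight(doc, dnf_query):
    """Minimum weight within each conjunct, maximum over all conjuncts."""
    return max(
        min(doc.get(term, 0.0) for term in conjunct)  # absent term counts as 0.0
        for conjunct in dnf_query
    )


# Q = (T1 and T2) or T3, as a list of conjuncts.
query = [("T1", "T2"), ("T3",)]

d1 = {"T1": 0.2, "T2": 0.5, "T3": 0.6}
d2 = {"T1": 0.7, "T2": 0.2, "T3": 0.1}

w1 = document_weight(d1, query)
w2 = document_weight(d2, query)
print(w1, w2)
```

D1 scores higher because its T3 weight alone satisfies the second conjunct strongly, while D2 is limited by its weak T2 in the first conjunct.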
Stemming • Term Truncation • Remove suffixes and/or prefixes from context terms. • Example
PSYCH*: psychiatrist, psychiatry, psychiatric, psychology, psychological, …
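Right-hand truncation as in the PSYCH* example can be sketched as a prefix match against the index vocabulary. The vocabulary below is a hypothetical sample, and only a trailing `*` is handled.

```python
# Hypothetical sample of index vocabulary terms.
vocabulary = [
    "physics",
    "psychiatric",
    "psychiatrist",
    "psychiatry",
    "psychological",
    "psychology",
]


def expand_truncation(pattern, vocabulary):
    """Return the vocabulary terms matched by a pattern with an optional trailing '*'."""
    pattern = pattern.lower()
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        return [term for term in vocabulary if term.lower().startswith(prefix)]
    # Without '*', fall back to an exact match.
    return [term for term in vocabulary if term.lower() == pattern]


expanded = expand_truncation("PSYCH*", vocabulary)
print(expanded)
```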