Text-retrieval Systems NDBI010 Lecture Slides KSI MFF UK http://www.ms.mff.cuni.cz/~kopecky/teaching/ndbi010/ Version 10.05.12.13.30.en
Literature (textbooks) • Introduction to Information Retrieval • Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze • Cambridge University Press, 2008 • http://informationretrieval.org/ • Dokumentografické informační systémy • Pokorný J., Snášel V., Kopecký M.: • Nakladatelství Karolinum, UK Praha, 2005 • Pokorný J., Snášel V., Húsek D.: • Nakladatelství Karolinum, UK Praha, 1998 • Textové informační systémy • Melichar B.: • Vydavatelství ČVUT, Praha, 1997 NDBI010 - DIS - MFF UK
Further links (books) • Computer Algorithms - String Pattern Matching Strategies, • Jun Ichi Aoe, • IEEE Computer Society Press 1994 • Concept Decomposition for Large Sparse Text Data using Clustering • Inderjit S. Dhillon, Dharmendra S. Modha • IBM Almaden Research Center, 1999
Further links (articles) • The IGrid Index: Reversing the Dimensionality Curse For Similarity Indexing in High Dimensional Space • Charu C. Aggarwal, Philip S. Yu • IBM T. J. Watson Research Center • The Pyramid Technique: Towards Breaking the Curse of Dimensionality • S. Berchtold, C. Böhm, H.-P. Kriegel: • ACM SIGMOD Conference Proceedings, 1998
Further links (articles) • Affinity Rank: A New Scheme for Efficient Web Search • Yi Liu, Benyu Zhang, Zheng Chen, Michael R. Lyu, Wei-Ying Ma • 2004 • Improving Web Search Results Using Affinity Graph • Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, Wei-Ying Ma • Efficient computation of pagerank • T.H. Haveliwala • Technical report, Stanford University, 1999
Further links (older) • Introduction to Modern Information Retrieval • Salton G., McGill M. J.: • McGraw-Hill 1981 • Výběr informací v textových bázích dat • Pokorný J.: • OVC ČVUT Praha 1989
Lecture No. 1 Introduction Overview of the problem informativeness measurement
Retrieval system origin • 1950s • The gradual automation of the procedures used in libraries • Now a separate subsection of information systems (IS) • Factual IS • Processing of information with a defined internal structure (usually in the form of tables) • Bibliographic IS • Processing of information in the form of text written in natural language without strict internal structure.
Interaction with TRS (DIS) • 1. Query formulation • 2. Comparison • 3. Hit-list obtaining • 4. Query tuning/reformulation • 5. Document request • 6. Document obtaining
TRS Structure • I) Document disclosure system (steps 1-4) • Returns secondary information • Author • Title • ... • II) Document delivery system (steps 5-6) • Need not be supported by the SW
Query Evaluation • Direct comparison is time-consuming [Diagram: the query is compared directly with each document]
Query Evaluation • Document model is used to compare • Lossy process, usually based on presence of words in documents • Produces structured data suitable for effective comparison [Diagram: indexation turns document Doc1 into its model X1]
Query Evaluation • Query is processed to obtain the needed form • Processed query is compared against the index [Diagram: the processed query is compared with the document models X1, ...]
Text preprocessing • Searching is more effective using a created (structured) model of documents, but it can use only information stored in the model, not in the documents. • The goal is to create a model that preserves as much information from the original documents as possible. • Problem: a lot of ambiguity in text. • Many tasks concerning document understanding are still unresolved.
Text understanding • Writer: • Text = sequence of words in natural language. • Each word stands for some idea/imagination of the writer. • Ideas represent a real subject, activity, etc. • Reader: follows (not necessarily exactly the same) mappings from left to right ...
Text understanding • Synonymy of words • Several words can have the same meaning for the writer • car = automobile • sick = ill ...
Text understanding • Homonymy of words • One word can have more than one meaning • fluke: fish, anchor, … • crown: currency, treetop, jewel, … • class: year of studies, category in set theory, … ...
Text understanding • Word meanings need not be exactly the same. • Hierarchical overlapping • animal > horse > stallion • Associativity among meanings • calculator ~ computer ~ processor ...
Text understanding • Mapping between subjects, ideas and words can depend on individual persons (readers and writers). • Two people can assign a partly or completely different meaning to a given term. • Two people can imagine different things for the same word. • mother, room, ... • As a result, two different readers of the same text can obtain different information • From each other • In comparison with the author’s intention
Text understanding • Homonymy and ambiguity grow with the transition from words/terms to sentences and bigger parts of the text. • Example of an English sentence with several grammatically correct meanings (in this case a human reader probably eliminates the nonsense meaning) • See Podivné fungování gramatiky, http://www.scienceworld.cz/sw.nsf/lingvistika • In the sentence „Time flies like an arrow“ either flies (fly) or like can be chosen for the predicate, which produces two significantly different meanings.
Text preprocessing • Inclusion of linguistic analysis into the text processing can partially solve the problem • Disambiguation • Selection of the correct meaning of the term in the sentence • According to grammar (verb versus noun etc.) • According to context (more complicated, can distinguish between two verbs, two nouns, etc.).
Text preprocessing • Inclusion of linguistic analysis into the text processing can partially solve the problem • Lemmatization • For each term/word in the text, after its proper meaning is found, assigns • Type of word, plural vs. singular, present tense vs. preterite, etc. • Base form (singular for nouns, infinitive for verbs, …) • Information obtained by sentence analysis (subject, predicate, object, ...)
Text preprocessing • Other problems that can be more or less solved are • Identification of collocations • World War Two, ... • Assigning nouns to the pronouns used in the text (very complex and hard to solve, sometimes even for a human reader)
Precision and Recall • As a result of ambiguities there exists no optimal text retrieval system • After the answer to the query is obtained, the following values can be evaluated • Number of returned documents in the list: Nv • The system supposed them to be relevant (useful) according to their match with the query • Number of returned relevant documents: Nvr • The questioner finds them to be really relevant, as they fulfill his or her information needs • Number of all relevant documents in the system: Nr • Very hard to guess for large and unknown collections
Precision and Recall • Two TRS’s can (and do) return two different results for the same query, which can be partly or completely unique. • How to compare the quality of those systems? [Diagram: documents in the database with three overlapping sets: relevant documents, returned by TRS1, returned by TRS2]
Precision and Recall • Two questioners can consider different documents to be relevant for their identically formulated queries • How to meet the subjective expectations of both questioners? [Diagram: documents in the database with the set of returned documents overlapping two different sets of relevant documents]
Precision and Recall • Quality of the result set of documents is usually evaluated according to the numbers Nv, Nr, Nvr • Precision • P = Nvr/Nv • Probability that a returned document is relevant to the user • Recall • R = Nvr/Nr • Probability that a relevant document is returned to the user
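The two ratios can be illustrated with a minimal Python sketch. The function name, the use of sets of document identifiers, and the convention of returning 0.0 for an empty denominator are assumptions made for this illustration, not part of the slides.

```python
def precision_recall(returned, relevant):
    """Compute (P, R) from two sets of document ids.

    returned -- documents in the answer (their count is Nv)
    relevant -- all relevant documents in the system (their count is Nr)
    """
    n_v = len(returned)
    n_r = len(relevant)
    n_vr = len(returned & relevant)  # returned AND relevant documents (Nvr)
    # Assumed convention: an empty denominator yields 0.0 instead of an error.
    precision = n_vr / n_v if n_v else 0.0
    recall = n_vr / n_r if n_r else 0.0
    return precision, recall


# Hypothetical example: 4 returned documents, 5 relevant ones, 2 in common.
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7})
print(p, r)  # P = 2/4, R = 2/5
```

In practice Nr is rarely known exactly, so the recall value computed this way is only as good as the estimate of the relevant set.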
Precision and Recall • Both coefficients depend on the feeling of the questioner • The same document can fulfill the information needs of one questioner while failing to meet them for another. • Each user determines different values of the Nr and Nvr coefficients • Both measures P and R depend on them
Precision and Recall • In the optimal case P=R=1 • The response of the system contains all and only the relevant documents • Usually • The answer in the first iteration is neither precise nor complete [Plot: P and R axes from 0 to 1, with the optimal answer and a typical initial answer marked]
Precision and Recall • Query tuning • Iterative modification of the query targeted to increase the quality of the response • Theoretically it is possible to reach the optimum sooner or later … [Plot: R against P, both from 0 to 1, with the optimum marked]
Precision and Recall • … due to (not only) ambiguities, the two measures depend inversely on each other, i.e. P*R ≈ const. < 1 • In order to increase P, the absolute number of relevant documents in the response is decreased. • In order to increase R, the number of irrelevant documents rapidly grows. • The probability of reaching quality above the limit is low. [Plot: R against P, both from 0 to 1, with the optimum marked]
Prediction Criterion • At the time of query formulation the questioner has to guess the correct terms (words) the author used to express a given idea • Problems are caused e.g. by • Synonyms (the author could use a different synonym not remembered by the user) • Overlapping meanings of terms • Colorful poetical hyperboles • …
Prediction Criterion • The problem can be partly suppressed by inclusion of thesaurus, containing • Hierarchies of terms and their meanings • Sets of synonyms • Definitions of associations between terms • Questioner can use it during query formulation • System can use it during query evaluation
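One way a system might use the synonym sets during query evaluation is to expand each query term into a group of alternatives. The sketch below is only an illustration: the `THESAURUS` dictionary, its contents (taken from the synonym examples earlier in the slides), and the function `expand_query` are hypothetical, and hierarchies and associations are not modelled.

```python
# Hypothetical thesaurus: each term maps to its set of synonyms.
THESAURUS = {
    "car": {"automobile"},
    "automobile": {"car"},
    "sick": {"ill"},
    "ill": {"sick"},
}


def expand_query(terms, thesaurus=THESAURUS):
    """Replace each query term by the sorted group {term} + its synonyms.

    A document would then match a group if it contains any member of it.
    """
    expanded = []
    for term in terms:
        group = {term} | thesaurus.get(term, set())
        expanded.append(sorted(group))
    return expanded


print(expand_query(["car", "repair"]))
```

Terms missing from the thesaurus are kept as single-member groups, so the expansion never removes anything the questioner typed.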
Prediction Criterion • The user often tends to tune his or her own query in a conservative way • He/she tends to fix the terms used in the first iteration (they must be the best because I remembered them immediately) and vary only additional terms at the end of the query • It is useful to help the user (semi)automatically eliminate wrong terms and replace them with useful ones that describe really relevant documents
Maximal Criterion • The questioner is usually not able or willing to go through an exhaustive number of hits in the response to find the relevant ones • Usually max. 20-50 documents, depending on their length • It is necessary not only to sort out documents not matching the query, but also to order the answer list by supposed relevancy in descending order, with the supposedly best documents at the beginning
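The ordering requirement can be sketched in a few lines of Python. The relevance scores are assumed to be given (the slides in this part do not define how they are computed), and the function name `top_hits`, the default limit of 20 and the tie-breaking by document id are choices made for this illustration only.

```python
def top_hits(scores, limit=20):
    """Return at most `limit` document ids, best supposed relevancy first.

    scores -- dict mapping a document id to an assumed relevance score;
              documents with score 0 are treated as not matching the query.
    """
    matching = [(doc, score) for doc, score in scores.items() if score > 0]
    # Descending by score; ties broken by document id to keep output stable.
    matching.sort(key=lambda pair: (-pair[1], pair[0]))
    return [doc for doc, _ in matching[:limit]]


# Hypothetical scores for four documents.
print(top_hits({"d1": 0.2, "d2": 0.9, "d3": 0.0, "d4": 0.5}, limit=2))
```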
Maximal Criterion • Due to the maximal criterion, the user usually tries to increase the precision of the answer • A small number of resulting documents in the answer, containing as high a ratio of relevant documents as possible • Some problematic domains require both high precision and recall • Lawyers, especially in territories having case law based on precedents (need to find as many similar cases as possible) [Diagram: a „better“ and a „worse“ answer, each showing the overlap of returned (Vr.) and relevant (Rel.) documents]
Why Search for Patterns in Text • To index documents or queries • To involve only a given set of terms (lemmas) • To omit a given set of meaningless terms (lemmas) such as conjunctions, numerals, pronouns, … • To highlight given terms in documents presented to users • …
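Omitting meaningless terms during indexing can be illustrated as follows. This is a rough sketch under stated assumptions: the `STOP_TERMS` set is a tiny hypothetical list, tokenization is a simple lowercase ASCII-letter match, and no lemmatization is performed.

```python
import re

# Hypothetical list of terms to omit (conjunctions, pronouns, numerals, ...).
STOP_TERMS = {"a", "an", "and", "he", "it", "of", "or", "she", "the", "two"}


def index_terms(text, stop_terms=STOP_TERMS):
    """Split text into lowercase words and drop the terms to be omitted."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_terms]


print(index_terms("The car and the automobile"))
```

Restricting the index to a given set of allowed terms would be the mirror image: keep a token only if it is in the set.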
Algorithms classification by preprocessing • I - Brute-force algorithm • II - Others (suitable for TRS) • Further divided according to • Number of simultaneously matched patterns • 1, N, • Direction of comparison • Left to right • Right to left
Class II Algorithms
Exact Pattern Matching Searching of One Pattern Within Text
Brute-force Algorithm
Text        a b c c b a b c a b b c a a b c c b a b c b b b a b c c
Bef. shift  a b c c b a b c b b b
Aft. shift    a b c c b a b c b b b
• Let m denote the length of text t, let n denote the length of pattern p. • If the i-th position in the text doesn’t match the j-th position in the pattern • Shift the pattern one position to the right, restart comparison at the first (leftmost) position in the pattern • Worst-case time complexity: O(m*n), e.g. when searching for „a^(n-1)b“ in „a^(m-1)b“ • For natural language text/pattern about m*const operations, i.e. O(m); const is a small number (<10), dependent on the language
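The brute-force procedure can be sketched in Python as follows. The function name, the returned 1-based position (chosen to match the numbering used in the slides) and the comparison counter are additions made for this illustration.

```python
def brute_force_search(t, p):
    """Return (1-based position of the first occurrence or None, comparisons)."""
    m, n = len(t), len(p)
    comparisons = 0
    for start in range(m - n + 1):       # each shift moves one position right
        j = 0
        while j < n:                     # restart at the leftmost pattern char
            comparisons += 1
            if t[start + j] != p[j]:
                break
            j += 1
        if j == n:
            return start + 1, comparisons
    return None, comparisons


text = "abccbabcabbcaabccbabcbbbabcc"
pattern = "abccbabcbbb"
print(brute_force_search(text, pattern)[0])

# Unfavourable input: a^(n-1)b searched in a^(m-1)b with m = 10, n = 3.
print(brute_force_search("a" * 9 + "b", "aab"))
```

For the second input every alignment costs n comparisons, which gives (m-n+1)*n comparisons in total and shows where the O(m*n) bound comes from.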
Lecture no. 2 Knuth-Morris-Pratt Algorithm • Left to right searching for one pattern • In comparison with the brute-force algorithm, KMP eliminates repeated comparison of already successfully compared characters of the text • The pattern is shifted as little as possible, so that a proper prefix of the examined part of the pattern is aligned below an equal fragment of the text
KMP Algorithm
Brute-force algorithm
Text        a b c c b a b c a b b c a a b c c b a b c b b b a b c c
Bef. shift  a b c c b a b c b b b
Aft. shift    a b c c b a b c b b b
KMP
Text        a b c c b a b c a b b c a a b c c b a b c b b b a b c c
Bef. shift  a b c c b a b c b b b
Aft. shift            a b c c b a b c b b b
KMP Algorithm
Text        a b c c b a b c a b b c a a b c c b a b c b b b a b c c
Bef. shift  a b c c b a b c b b b
Aft. shift            a b c c b a b c b b b
• After the shift, a proper prefix of the already examined part of the pattern is left in front of the mismatch position • It has to be equal to a postfix (suffix) of the already examined part of the pattern • The longest such prefix determines the smallest shift
KMP Algorithm • If • the j-th position of pattern p doesn’t match the i-th position of text t • the longest proper prefix of the already examined part of the pattern that equals a postfix of that part has length k • then • After the shift, k characters remain before the mismatch position • Comparison restarts from the (k+1)-st position of the pattern • Restart positions are pre-computed and stored in auxiliary array A • In this case A[j] = k+1
KMP Algorithm
begin {KMP}
  m := length(t); n := length(p);
  i := 1; j := 1;
  while (i <= m) and (j <= n) do
  begin
    while (j > 0) and (p[j] <> t[i]) do j := A[j];
    inc(i); inc(j);
  end; {while}
  if (j > n)
    then {pattern found at position i-j+1}
    else {not found}
end; {KMP}
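The pseudocode above can be transcribed into runnable Python. This is a sketch, not the lecture's reference implementation: the function names are made up, indices are kept 1-based to mirror the slides (so `a[0]` is unused and string accesses subtract 1), and the array A is obtained here by a deliberately naive method taken directly from the definition A[j] = k+1.

```python
def restart_table_naive(p):
    """Array A from the definition: A[1] = 0 and A[j] = k + 1, where k is the
    length of the longest proper prefix of p[1..j-1] equal to its postfix."""
    n = len(p)
    a = [0] * (n + 1)                    # a[0] is unused (1-based indexing)
    for j in range(2, n + 1):
        examined = p[:j - 1]             # already examined part p[1..j-1]
        k = 0
        for length in range(len(examined) - 1, 0, -1):
            if examined[:length] == examined[-length:]:
                k = length               # longest match is found first
                break
        a[j] = k + 1
    return a


def kmp_search(t, p):
    """Return the 1-based position of the first occurrence of p in t, or None."""
    a = restart_table_naive(p)
    m, n = len(t), len(p)
    i = j = 1
    while i <= m and j <= n:
        while j > 0 and p[j - 1] != t[i - 1]:
            j = a[j]                     # shift the pattern, keep k characters
        i += 1
        j += 1
    return i - j + 1 if j > n else None


print(kmp_search("abccbabcabbcaabccbabcbbbabcc", "abccbabcbbb"))
```

The text index `i` never moves backwards, which is the property that distinguishes this search from the brute-force one.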
Obtaining of array A for KMP search • A[1] = 0 • If all values are known for positions 1 .. j-1, it is easy to compute the correct value for the j-th position • Let A[j-1] contain the correction for the (j-1)-st position, i.e., the first A[j-1]-1 chars of the pattern are the same as the equal number of chars immediately before the (j-1)-st position
Obtaining of array A for KMP search [Figure: the pattern and the corresponding array A]
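The incremental idea (derive the value for position j from the values already known for positions 1 .. j-1) is commonly completed as in the Python sketch below. The slides' own derivation is not fully contained in this part, so the loop structure, the variable names and the function name are assumptions; indices are again 1-based with `a[0]` unused.

```python
def build_restart_table(p):
    """Compute A with A[1] = 0 and A[j] = k + 1 in one left-to-right pass."""
    n = len(p)
    a = [0] * (n + 1)                    # a[0] unused, a[1] = 0
    j, k = 1, 0                          # invariant: a[j] == k
    while j < n:
        # Try to extend the matched prefix by p[j]; on failure fall back to
        # the next shorter candidate using the values computed so far.
        while k > 0 and p[j - 1] != p[k - 1]:
            k = a[k]
        j += 1
        k += 1
        a[j] = k
    return a


print(build_restart_table("abccbabcbbb"))
```

For the example pattern from the slides, position 9 receives the value 4: the examined part „abccbabc“ starts and ends with „abc“, so k = 3 and comparison restarts at position k+1.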