
The Term Vocabulary and Postings Lists


Presentation Transcript


  1. The Term Vocabulary and Postings Lists Chap. 2 Manning et al., Introduction to Information Retrieval

  2. Contents

  3. Document delineation and character sequence decoding • Obtaining the character sequence in a document • Decoding: byte sequence → character sequence • Consider the file format (e.g., PDF, DOC, ZIP, etc.) • Extract the textual parts from the document • Then, is it so simple?
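
The decoding step above can be sketched in a few lines: try a short list of likely encodings in order and fall back to a lossless one. The function name and candidate list are illustrative choices, not from the slides.

```python
# Byte sequence -> character sequence: try likely encodings in order.
def decode_bytes(raw: bytes, candidates=("utf-8", "cp1252")) -> str:
    for enc in candidates:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value, so this fallback always succeeds
    return raw.decode("latin-1")

text = decode_bytes("résumé".encode("utf-8"))
```

In practice the candidate order matters: UTF-8 rejects most non-UTF-8 byte streams, so trying it first rarely misclassifies legacy 8-bit text.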

  4. Document delineation and character sequence decoding • Choosing a document unit • Indexing granularity • A collection of books? • Chapter or paragraph as mini-document? • Sentences as mini-documents? • Precision / recall tradeoff • Short document: High precision, low recall • Long document: Low precision, high recall

  5. Determining the Vocabulary of Terms • Tokenization • The task of chopping a character sequence into pieces (tokens) • Token/Type/Term • Token: an instance of a sequence of characters grouped together as a useful semantic unit • Type: the class of all tokens containing the same character sequence • Term: a type that is included in the IR system’s dictionary • “to sleep perchance to dream” • Five tokens • Four types (“to” occurs twice)
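
The token/type counts on this slide can be verified with naive whitespace tokenization:

```python
# Naive whitespace tokenization of the slide's example phrase.
phrase = "to sleep perchance to dream"
tokens = phrase.split()   # every occurrence is a token
types = set(tokens)       # distinct character sequences are types
```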

  6. Determining the Vocabulary of Terms • Problems • Apostrophe • Desired tokens of O’Neill? • neill, oneill, o’neill, o’ + neill, o + neill • aren’t? • aren’t, arent, are + n’t, aren + t • Terms in current use • Programming languages: C++, C# • Aircraft: B-52 • Title of TV show: M*A*S*H • Email address: ida89@snu.ac.kr • Web URL: http://www.snu.ac.kr • IP address: 147.46.161.29 • Package tracking number: 1Z9999W99845399981 • Language-specific issues • Hyphenation • co-education: splitting up vowels in words • Hewlett-Packard: joining nouns as a name • the hold-him-back-and-drag-him-away maneuver: word grouping

  7. Determining the Vocabulary of Terms • White space • Generally split on white space • But not always: “San Francisco”, “Los Angeles” • Foreign phrases: “au fait” • Sometimes written either way: “white space” vs. “whitespace” • Phone number: 02 766-3421 • Date: Mar 11, 1983 • “York University” as one token • Different from “New York University” or “New” | “York” | “University” • White space and hyphenation interact • “over-eager”, “over eager”, “overeager” • Definite article in French: la, le, l’ensemble • Imperative with clitic pronoun in French: “donne-moi” (give me) • Compounds in German • Computerlinguistik (computational linguistics) • Lebensversicherungsgesellschaftsangestellter (life insurance company employee) • A compound splitter is necessary for word segmentation • Not just German, but Chinese, Japanese, … • N-gram based indexing is an alternative
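
When no compound splitter or word segmenter is available, character n-gram indexing sidesteps segmentation entirely: every overlapping character window becomes an index term. A minimal sketch (function name illustrative):

```python
# Overlapping character n-grams: language-independent index terms.
def char_ngrams(text: str, n: int = 3) -> list[str]:
    return [text[i:i + n] for i in range(len(text) - n + 1)]

grams = char_ngrams("Computerlinguistik", 4)
```

A query compound then matches a document compound whenever enough of their n-grams overlap, with no language-specific rules at all.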

  8. Determining the Vocabulary of Terms • Dropping common terms: stop words • Stop word: extremely common word appearing frequently in a collection • Usually discarded during indexing • Stop list: list of stop words to discard • “a”, “an”, “and”, “are”, “as”, “at”, “be”, “by”, “for”, “from”, “he”, “in”, “is”, “it”, “its”, “of”, “on”, “that”, “the”, “to”, “was”, “were”, “will”, “with” • But, should they always be discarded? • “President of the United States” • “Flight to London” • “To be or not to be” • “Let it be” • Web search engines generally do not use stop lists
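
Filtering against the slide's stop list is a one-line comprehension; the example also shows why discarding stop words is risky, since a phrase like "to be or not to be" is nearly emptied:

```python
# The slide's stop list and a simple filter over tokens.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
              "he", "in", "is", "it", "its", "of", "on", "that", "the", "to",
              "was", "were", "will", "with"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

kept = remove_stop_words("to be or not to be".split())  # only "or", "not" survive
```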

  9. Determining the Vocabulary of Terms • Other issues in English • “color” vs. “colour”, “theater” vs. “theatre” • Other languages • French • Definite article: la, le, l’, les • German • Schütze = Schuetze • Japanese • No white space in a sentence • Mixture of Hiragana and Katakana • Sometimes Chinese characters are used instead of Hiragana

  10. Determining the Vocabulary of Terms • The same announcement in three renderings, Japanese with kanji and kana mixed, Japanese in kana only, and Korean, illustrating why tokenization and normalization are script-dependent. (English gloss: as part of the MOTTAINAI campaign, whose honorary chair is Nobel Peace Prize laureate Wangari Maathai, the Mainichi Newspapers and Magazine House invite essays of up to 800 characters on "my mottainai", with a photo, illustration, or figure attached, by October 20; the grand prize is a travel voucher worth 500,000 yen plus two eco-products.) • ノーベル平和賞を受賞したワンガリ․マータイさんが名誉会長を務めるMOTTAINAIキャンペーンの一環として、毎日新聞社とマガジンハウスは「私の、もったいない」を募集します。皆様が日ごろ「もったいない」と感じて実践していることや、それにまつわるエピソードを800字以内の文章にまとめ、簡単な写真、イラスト、図などを添えて10月20日までにお送りください。大賞受賞者には、50万円相当の旅行券とエコ製品2点の副賞が贈られます。 • ノーベルへいわしょうをじゅしょうしたワンガリ․マータイさんがめいよかいちょうをつとめるMOTTAINAIキャンペーンのいっかんとして、まいにちしんぶんしゃとマガジンハウスは「私の、もったいない」をぼしゅうします。みなさまがひごろ「もったいない」とかんじてじっせんしていることや、それにまつわるエピソードを800じいないのぶんしょうにまとめ、かんたんなしゃしん、イラスト、ずなどを添えて10月20日までにおおくりください。たいしょうじゅしょうしゃには、50まんえんそうとうのりょこうけんとエコせいひん2てんのふくしょうがおくられます。 • 노벨평화상을 수상한 왕가리 마타이씨가 명예회장을 맡은 MOTTAINAI 캠페인의 일환으로, 마이니치신문사와 매거진하우스는, ‘나의, 모타이나이’를 모집합니다. 여러분이 평소 ‘모타이나이’라고 느끼고 실천하고 있는 것이나, 그에 관련된 에피소드를 800자 이내의 문장에 모아, 간단한 사진, 일러스트, 그림 등을 붙여 10월 20일까지 보내주십시오. 대상 수상자에게는, 50만엔 상당의 여행권과 에코제품 2개를 부상으로 드립니다.

  11. Determining the Vocabulary of Terms • Stemming and lemmatization • Goal: to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form • am, are, is → be • car, cars, car’s, cars’ → car • the boy’s cars are different colors → the boy car be differ color • Stemming • A crude heuristic process that chops off the ends of words • Often includes removal of derivational affixes • Lemmatization • Uses a vocabulary and morphological analysis of words • Returns the base or dictionary form of a word (the lemma)

  12. Determining the Vocabulary of Terms • Commonly used stemming algorithms • Porter’s algorithm • Composed of 5 phases of word reductions • http://www.tartarus.org/~martin/PorterStemmer/ • Lovins stemmer • http://www.cs.waikato.ac.nz/~eibe/stemmers/ • Paice/Husk stemmer • http://www.comp.lancs.ac.uk/computing/research/stemming/ • Benefit from lemmatization over stemming: unremarkable for English IR • Is stemming or lemmatization effective? • More a pragmatic issue than a formal issue of linguistic morphology
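
To make the flavor of suffix stripping concrete, here is a toy stemmer implementing only rules in the spirit of Porter's first phase. This is an illustrative sketch, not the real five-phase algorithm linked above.

```python
# Ordered suffix rules: the first (longest) matching rule wins,
# mirroring how Porter applies the rule with the longest matching suffix.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def toy_stem(word: str) -> str:
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word
```

Note how rule order matters: without the ("ss", "ss") identity rule, "caress" would lose its final "s" via the bare ("s", "") rule.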

  13. Determining the Vocabulary of Terms • Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation • Lovins stemmer: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and access to interpres • Porter stemmer: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret • Paice stemmer: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

  14. Faster postings list intersection via skip pointers • Recall the postings lists and the merge-based intersection from Chap. 1: • Brutus: 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174 • Calpurnia: 2 → 31 → 54 → 101 • Intersection: 2 → 31
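
The Chapter 1 merge walks both sorted lists in lockstep; a Python rendering over the slide's lists:

```python
# O(m+n) intersection of two sorted postings lists.
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            i += 1          # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
common = intersect(brutus, calpurnia)
```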

  15. Faster postings list intersection via skip pointers • Skip list: postings list augmented with skip pointers • Shortcuts to points further along the postings list • Goal: reduce the cost of intersection below O(m + n) • m: length of the first list, n: length of the second list • [Figure: two postings lists with skip pointers] • Brutus: 2 → 4 → 8 → 16 → 19 → 23 → 28 → 43, with skip pointers (arc labels 16, 28, 72 in the figure) • Caesar: 1 → 2 → 3 → 5 → 8 → 41 → 51 → 60 → 71, with skip pointers (arc labels 5, 51, 98 in the figure) • During the intersection, the skip pointer from 16 to 28 lets 19 and 23 be skipped

  16. Faster postings list intersection via skip pointers • Trade-off • More skip pointers mean shorter skip spans and more frequent successful skips, but also more comparisons against skip pointers and more space to store them • Fewer skip pointers mean longer spans and fewer comparisons, but fewer successful skips • Common heuristic: for a postings list of length P, place √P evenly spaced skip pointers • Skip lists are effective only when the postings list is relatively static, since updates disturb the skip placement

  17. IntersectWithSkips(p1, p2)
      1   answer ← ⟨ ⟩
      2   while p1 ≠ NIL and p2 ≠ NIL
      3   do if docID(p1) = docID(p2)
      4        then Add(answer, docID(p1))
      5             p1 ← next(p1)
      6             p2 ← next(p2)
      7        else if docID(p1) < docID(p2)
      8               then if hasSkip(p1) and (docID(skip(p1)) ≤ docID(p2))
      9                      then while hasSkip(p1) and (docID(skip(p1)) ≤ docID(p2))
     10                           do p1 ← skip(p1)
     11                      else p1 ← next(p1)
     12               else if hasSkip(p2) and (docID(skip(p2)) ≤ docID(p1))
     13                      then while hasSkip(p2) and (docID(skip(p2)) ≤ docID(p1))
     14                           do p2 ← skip(p2)
     15                      else p2 ← next(p2)
     16  return answer
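
A Python sketch of the pseudocode above, over plain lists rather than linked nodes. Skip pointers are stored as a dict mapping index to target index, spaced about √P apart as the previous slide suggests; this representation is an illustrative choice, not prescribed by the pseudocode.

```python
import math

def place_skips(postings):
    """Evenly spaced skip pointers: index -> target index, ~sqrt(P) apart."""
    step = round(math.sqrt(len(postings))) or 1
    return {i: i + step for i in range(0, len(postings) - step, step)}

def intersect_with_skips(p1, p2):
    s1, s2 = place_skips(p1), place_skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            if i in s1 and p1[s1[i]] <= p2[j]:
                while i in s1 and p1[s1[i]] <= p2[j]:
                    i = s1[i]          # follow skips while they don't overshoot
            else:
                i += 1
        else:
            if j in s2 and p2[s2[j]] <= p1[i]:
                while j in s2 and p2[s2[j]] <= p1[i]:
                    j = s2[j]
            else:
                j += 1
    return answer
```

Skipping is safe because every docID jumped over is strictly smaller than the skip target, which is itself no larger than the docID we are trying to catch up to, so no match can be missed.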

  18. Positional postings and phrase queries • Phrase queries • Double quotes mark phrase queries, e.g. “Stanford University” • “The inventor Stanford Ovshinsky never went to university.” should not match • A search engine should not only support phrase queries, but implement them efficiently • Biword indexes • Index every pair of consecutive terms as a phrase • “Friends, Romans, Countrymen” yields the biwords “friends romans” and “romans countrymen” • Nouns and noun phrases have special status in queries • Tokenize the text and perform POS tagging • Group terms into nouns (N) and function words (X) • Any sequence of the form NX*N is treated as an extended biword
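
Generating the biword index terms from a token stream takes one line; the helper name is illustrative:

```python
# Every pair of consecutive tokens becomes one biword "phrase term".
def biwords(tokens):
    return [f"{tokens[i]} {tokens[i + 1]}" for i in range(len(tokens) - 1)]

pairs = biwords(["friends", "romans", "countrymen"])
```

A two-word phrase query then becomes a lookup of a single biword term; longer phrases are answered by intersecting the postings of their constituent biwords.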

  19. Positional postings and phrase queries • Positional indexes • For each term, store postings of the form docID (, term frequency in the document): ⟨position1, position2, …⟩ • For “to be”, proximity can be judged by comparing the position values of the two terms within the same document • to: ⟨…; 4: ⟨…, 429, 433⟩; …⟩ • be: ⟨…; 4: ⟨…, 430, 434⟩; …⟩ • The same scheme supports k-word proximity searches
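
The docID-to-positions structure on this slide maps naturally onto nested dictionaries; a minimal build sketch, assuming the documents are already tokenized:

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: {docID: token list} -> {term: {docID: sorted position list}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term][doc_id].append(pos)
    return index

idx = build_positional_index({4: ["to", "be", "or", "not", "to", "be"]})
```

Term frequency in a document falls out for free as the length of that document's position list.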

  20. Positional postings and phrase queries • Positional index size • The number of items to check is no longer the number of documents N, but the total number of tokens T in the document collection • So the complexity of a Boolean query becomes Θ(T) rather than Θ(N)

  21. PositionalIntersect(p1, p2, k)
      1   answer ← ⟨ ⟩
      2   while p1 ≠ NIL and p2 ≠ NIL
      3   do if docID(p1) = docID(p2)
      4        then l ← ⟨ ⟩
      5             pp1 ← positions(p1)
      6             pp2 ← positions(p2)
      7             while pp1 ≠ NIL
      8             do while pp2 ≠ NIL
      9                do if |pos(pp1) − pos(pp2)| ≤ k
     10                     then Add(l, pos(pp2))
     11                     else if pos(pp2) > pos(pp1)
     12                            then break
     13                   pp2 ← next(pp2)
     14                while l ≠ ⟨ ⟩ and |l[0] − pos(pp1)| > k
     15                do Delete(l[0])
     16                for each ps ∈ l
     17                do Add(answer, ⟨docID(p1), pos(pp1), ps⟩)
     18                pp1 ← next(pp1)
     19             p1 ← next(p1)
     20             p2 ← next(p2)
     21        else if docID(p1) < docID(p2)
     22               then p1 ← next(p1)
     23               else p2 ← next(p2)
     24  return answer
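
A simplified Python rendering of the proximity intersection above: p1 and p2 map docID to sorted position lists, and the inner scan restarts for each position rather than maintaining the pseudocode's sliding window l, which keeps the sketch easier to read at some cost in worst-case work.

```python
def positional_intersect(p1, p2, k):
    """Return (docID, pos1, pos2) triples where the terms are within k words."""
    answer = []
    for doc_id in sorted(set(p1) & set(p2)):   # docIDs common to both terms
        for pos1 in p1[doc_id]:
            for pos2 in p2[doc_id]:
                if abs(pos1 - pos2) <= k:
                    answer.append((doc_id, pos1, pos2))
                elif pos2 > pos1 + k:
                    break                      # positions sorted: no later match
    return answer

# The slide's "to be" example in document 4, with k = 1:
hits = positional_intersect({4: [429, 433]}, {4: [430, 434]}, 1)
```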

  22. Thank You !
