
N-gram Search Engine on Wikipedia




  1. N-gram Search Engine on Wikipedia
     Satoshi Sekine (NYU), Kapil Dalwani (JHU)

  2. Hammer: Fast and multi-functional n-gram search engine
     Search ngram: FAST
     INPUT: token, POS, chunk, NE
     OUTPUT: frequency to text
     ngrams
     Lexical Knowledge from Ngrams

  3. Characteristics
     • Search up to 7-grams with wildcards
     • Multi-level input
       • Token, POS, chunk, NE, combinations
       • NOT, OR for POS, chunk, NE
     • Multi-level output
       • Token, POS, chunk, NE
       • Document information
       • Original sentences, KWIC, ngram
     • Display
       • Show the results in order of frequency
     • Running environment
       • Single CPU, PC-Linux, 400MB process, 500GB disk

  4. Demo: http://linserv1.cims.nyu.edu:23232/ngram_wikipedia2

  5. Available for you
     Web system
       At NYU: http://nlp.cs.nyu.edu/nsearch
       At JHU?
     USB hard drive

  6. Implementation: Overview
     Steps: 1. Search candidates, 2. Filtering, 3. Display
     Components: Search request; Inverted index for n-gram data; N-gram data; POS, chunk, NE for N-gram data; Suffix array for text; Wikipedia text; Wikipedia POS, chunk, NE

  7. Implementation: Overview, step 1: Search candidates

  8. From n-gram to Inverted Index
     • Example: 3-grams
     • Posting lists: A pos=1, A pos=2, B pos=1, B pos=2, B pos=3, C pos=3
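A minimal sketch of how such a positional inverted index could be built, assuming each posting list is keyed by a (token, position) pair and holds the IDs of the n-grams that contain that token at that position. The three sample 3-grams and the function name are hypothetical; they are chosen only so that the resulting keys match the six posting lists named on the slide.

```python
from collections import defaultdict

def build_positional_index(ngrams):
    """Map (token, position) -> ascending list of n-gram IDs.

    Positions are 1-based, as on the slide. The n-gram ID is simply the
    index of the n-gram in the input list (an assumption of this sketch).
    """
    index = defaultdict(list)
    for ngram_id, tokens in enumerate(ngrams):
        for pos, token in enumerate(tokens, start=1):
            index[(token, pos)].append(ngram_id)
    return dict(index)

# Hypothetical 3-grams; not taken from the original slides.
ngrams = [("A", "B", "C"), ("B", "A", "C"), ("A", "A", "B")]
index = build_positional_index(ngrams)

for key in sorted(index):
    print(key, index[key])
```

Because IDs are appended in increasing order, every posting list comes out sorted, which is what a later intersection step relies on.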

  9. Posting list
     Wide variation of posting list size (in 7-gram: 1.27B)
       “#EOS#” (100,906,888), “,” (55,644,989), “the” (33,762,672)
       conscipcuous, consiety, Mizuk (1)
     3 types for faster speed and smaller index size
       Bitmap (freq > 1%): for #EOS#, 1.27B bits (bitmap) <-> 3.2B bits (list)
       List of ngramID
       Encoded into pointer (freq = 1)
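The size comparison on this slide can be checked with a little arithmetic. The sketch below assumes 32 bits per ngramID, which is not stated on the slide but reproduces its "3.2B bits (list)" figure for "#EOS#". The selection rule follows the three types listed, and the function names are hypothetical.

```python
TOTAL_7GRAMS = 1_270_000_000  # "1.27B" on the slide (rounded)
ID_BITS = 32                  # assumption: one ngramID takes 32 bits

def choose_representation(freq, total=TOTAL_7GRAMS):
    """Pick one of the three posting-list types named on the slide."""
    if freq == 1:
        return "pointer"      # the single ngramID is stored in the pointer itself
    if freq > total * 0.01:
        return "bitmap"       # token occurs in more than 1% of the n-grams
    return "list"             # plain list of ngramIDs

def size_in_bits(freq, total=TOTAL_7GRAMS, id_bits=ID_BITS):
    """Rough storage cost of a posting list under the chosen type."""
    kind = choose_representation(freq, total)
    if kind == "bitmap":
        return total          # one bit per n-gram
    if kind == "list":
        return freq * id_bits
    return 0                  # no separate list is needed

eos_freq = 100_906_888        # frequency of "#EOS#" from the slide
list_bits = eos_freq * ID_BITS
print(choose_representation(eos_freq), size_in_bits(eos_freq), list_bits)
```

Under these assumptions a bitmap for "#EOS#" costs 1.27B bits, while a list of 32-bit IDs would cost about 3.2B bits.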

  10. Search
      Given an n-gram request (A B C)
      Get posting lists for A, B and C
      Search intersections of posting lists
      Use “look ahead” to speed up the search
        Look-ahead size = sqrt(size of posting list)
        Moffat and Zobel (1996) SKIP
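The search steps above can be sketched as follows. This is a simplified illustration and not the Hammer implementation: it jumps through the longer list in blocks of sqrt(length), whereas the scheme of Moffat and Zobel stores skip pointers inside compressed lists. The function names and sample posting lists are hypothetical, and every list is assumed to be sorted with no duplicate IDs.

```python
import math

def intersect_with_skips(short, long):
    """Intersect two sorted ID lists, looking ahead in `long` by sqrt-sized jumps."""
    skip = max(1, math.isqrt(len(long)))
    result = []
    j = 0
    for target in short:
        # Look ahead: jump a whole block while the entry one block ahead
        # is still not past the target.
        while j + skip < len(long) and long[j + skip] <= target:
            j += skip
        # Finish with a short linear scan inside the current block.
        while j < len(long) and long[j] < target:
            j += 1
        if j == len(long):
            break
        if long[j] == target:
            result.append(target)
    return result

def intersect_all(posting_lists):
    """Intersect several posting lists, starting from the shortest one."""
    ordered = sorted(posting_lists, key=len)
    result = ordered[0]
    for plist in ordered[1:]:
        result = intersect_with_skips(result, plist)
    return result

# Hypothetical posting lists for A at pos=1, B at pos=2 and C at pos=3.
a = [0, 2, 5, 9, 12]
b = [2, 3, 5, 8, 9, 10, 12, 15, 20]
c = [1, 2, 9, 12, 30]
matches = intersect_all([a, b, c])
print(matches)
```

Starting from the shortest list keeps the intermediate result small, so each later pass does fewer look-ups in the longer lists.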

  11. Implementation: Overview, step 2: Filtering

  12. Filtering
      Not all candidate ngramIDs match the request
      We need frequency and sentence information for matched n-grams
      POS, chunk and NE information is represented as IDs
        Reduces the index by more than 200GB
      A B Freq=123 NN PERSON VB Freq=10 LOC Freq=5
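One way to picture the filtering step, as a sketch under stated assumptions: each candidate ngramID carries a list of annotation variants (here only POS tags, written as strings rather than the numeric IDs the slide mentions) with their frequencies, and the request gives one constraint per position. The record layout, the constraint encoding and all names are hypothetical and are not taken from the slides.

```python
def tag_ok(tag, constraint):
    """Check one tag against one constraint.

    None           -> wildcard, anything matches
    {"NN", "VB"}   -> OR: the tag must be one of these
    ("NOT", {...}) -> NOT: the tag must be none of these
    """
    if constraint is None:
        return True
    if isinstance(constraint, tuple) and constraint[0] == "NOT":
        return tag not in constraint[1]
    return tag in constraint

def filter_candidates(candidates, pos_request):
    """Keep candidates with a matching variant; return (ngramID, freq) by descending freq."""
    results = {}
    for ngram_id, variants in candidates.items():
        freq = sum(
            v["freq"]
            for v in variants
            if all(tag_ok(t, c) for t, c in zip(v["pos"], pos_request))
        )
        if freq:
            results[ngram_id] = freq
    return sorted(results.items(), key=lambda kv: -kv[1])

# Hypothetical candidates returned by the token-level search.
candidates = {
    1: [{"pos": ("NN", "VB"), "freq": 10}, {"pos": ("JJ", "VB"), "freq": 5}],
    2: [{"pos": ("NN", "NN"), "freq": 7}],
}
strict = filter_candidates(candidates, ({"NN"}, ("NOT", {"NN"})))
loose = filter_candidates(candidates, (None, {"VB", "NN"}))
print(strict, loose)
```

The frequency reported for an n-gram is the sum over its matching variants only, so a candidate whose variants all fail the request drops out.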

  13. Implementation: Overview, step 3: Display

  14. Display
      N-grams are displayed in descending order of frequency
        N-gram IDs are ordered by frequency
      Sentences are searched using the suffix array
      POS, chunk, NE are displayed with sentence, KWIC, ngram
      Doc ID and Wikipedia title (and possible features of the document) are displayed with sentences and KWIC
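To illustrate how a suffix array can locate the text for a matched n-gram and produce KWIC lines, here is a toy sketch over a token list. The naive construction by sorting suffixes is only workable for small inputs, the real system's on-disk layout is not described on the slides, and the sample text and function names are hypothetical.

```python
def build_suffix_array(tokens):
    """Naive token-level suffix array: start positions sorted by their suffix."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_occurrences(tokens, sa, query):
    """Return all start positions of `query` via two binary searches."""
    n = len(query)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose prefix is >= query
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] < query:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # first suffix whose prefix is > query
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

def kwic(tokens, pos, n, width=2):
    """Format one keyword-in-context line with `width` tokens on each side."""
    left = " ".join(tokens[max(0, pos - width):pos])
    key = " ".join(tokens[pos:pos + n])
    right = " ".join(tokens[pos + n:pos + n + width])
    return f"{left} [{key}] {right}".strip()

tokens = "the cat sat on the mat near the cat".split()
sa = build_suffix_array(tokens)
query = ["the", "cat"]
hits = find_occurrences(tokens, sa, query)
lines = [kwic(tokens, p, len(query)) for p in hits]
print(hits, lines)
```

All suffixes that begin with the query sit next to each other in the suffix array, so two binary searches are enough to find every occurrence.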

  15. Size of data
      8 GB
      Text: 1.7 G words, 200M sentences, 2.4M articles
      Ngram: 1: 8M, 2: 93M, 3: 377M, 4: 733M, 5: 1.00B, 6: 1.17B, 7: 1.27B
      Total 530GB
      Suffix array for text 260 GB N-gram data 108 GB Inverted index for n-gram data 8 GB Wikipedia text 100 GB POS, chunk, NE for N-gram data 6 GB Wikipedia POS, chunk, NE 40 GB Others

  16. Future Work
      Other information (ex: parse, coref, relation, genre, discourse…)
      Longer n-grams
      Compress index, dictionary
      Ease the indexing load
        Now we need a big memory machine
        Distributed indexing
      Union operation for tokens

  17. Available for you
      Web demo
        At NYU: http://nlp.cs.nyu.edu/nsearch
        At JHU?
      USB hard drive
