N-gram Search Engine on Wikipedia
Satoshi Sekine (NYU)
Kapil Dalwani (JHU)
Hammer: Fast and multi-functional n-gram search engine
• Search ngram: FAST
• INPUT: token, POS, chunk, NE
• OUTPUT: frequency to text
• ngrams
Lexical Knowledge from Ngrams
Characteristics
• Search up to 7-grams with wildcards
• Multi-level input
  • Token, POS, chunk, NE, combinations
  • NOT, OR for POS, chunk, NE
• Multi-level output
  • Token, POS, chunk, NE
  • Document information
  • Original sentences, KWIC, ngram
• Display
  • Show the results in the order of frequency
• Running Environment
  • Single CPU, PC-Linux, 400MB process, 500GB disk
Demo
http://linserv1.cims.nyu.edu:23232/ngram_wikipedia2
Available for you
• Web system
  • At NYU: http://nlp.cs.nyu.edu/nsearch
  • At JHU?
• USB Hard drive
Implementation: Overview
• Stages: 1. Search candidates, 2. Filtering, 3. Display
• Input: Search request
• Data: Inverted index for n-gram data; N-gram data; POS, chunk, NE for N-gram data; Suffix array for text; Wikipedia text; Wikipedia POS, chunk, NE
Implementation: Overview
• 1. Search candidates
From n-gram to Inverted Index
• Example: 3-grams
• Posting list
  • A pos=1
  • A pos=2
  • B pos=1
  • B pos=2
  • B pos=3
  • C pos=3
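The mapping from n-grams to posting lists can be sketched as follows. This is a minimal illustration, not the Hammer implementation: the three sample 3-grams, the function name `build_positional_index`, and the use of the list position as the ngramID are all assumptions, chosen so that the resulting keys are exactly the six (token, position) pairs listed on the slide.

```python
from collections import defaultdict

def build_positional_index(ngrams):
    """Map each (token, position) pair to a sorted list of ngramIDs.

    Assumption: the ngramID is simply the index of the n-gram in `ngrams`,
    and positions are 1-based as on the slide.
    """
    index = defaultdict(list)
    for ngram_id, ngram in enumerate(ngrams):
        for pos, token in enumerate(ngram, start=1):
            # IDs are appended in increasing order, so each list stays sorted.
            index[(token, pos)].append(ngram_id)
    return dict(index)

# Hypothetical 3-grams (not taken from the slides).
sample_ngrams = [("A", "B", "C"), ("B", "A", "B"), ("A", "A", "C")]
index = build_positional_index(sample_ngrams)

for key in sorted(index):
    print(key, index[key])
```

With such an index, a wildcard request like (A * C) only needs the posting lists for (A, pos=1) and (C, pos=3); the wildcard position contributes no list.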
Posting list
• Wide variation of posting list size (in 7-gram: 1.27B)
  • “#EOS#” (100,906,888), “,” (55,644,989), “the” (33,762,672)
  • conscipcuous, consiety, Mizuk, (1)
• 3 types for faster speed and smaller index size
  • Bitmap (freq > 1%): #EOS#: 1.27B bits (bitmap) <-> 3.2B bits (list)
  • List of ngramID
  • Encoded into pointer (freq=1)
[Figure: C pos=3, 5]
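The choice among the three posting-list types, and the bitmap-versus-list size comparison for “#EOS#”, can be checked with a short sketch. The function names are hypothetical, and the 32-bit width per ngramID is an assumption: the slide does not state it, but it reproduces the quoted 3.2B bits.

```python
def choose_representation(freq, total_ngrams, bitmap_threshold=0.01):
    """Pick a posting-list type following the three cases on the slide."""
    if freq == 1:
        return "pointer"   # the single ngramID is stored in the pointer itself
    if freq / total_ngrams > bitmap_threshold:
        return "bitmap"    # one bit per n-gram in the collection
    return "list"          # explicit list of ngramIDs

def sizes_in_bits(freq, total_ngrams, id_bits=32):
    """Return (bitmap_bits, list_bits); id_bits=32 is an assumption."""
    return total_ngrams, freq * id_bits

TOTAL_7GRAMS = 1_270_000_000   # 1.27B 7-grams
EOS_FREQ = 100_906_888         # posting list size of "#EOS#"

print(choose_representation(EOS_FREQ, TOTAL_7GRAMS))
print(sizes_in_bits(EOS_FREQ, TOTAL_7GRAMS))
```

Under these assumptions the bitmap for “#EOS#” takes 1.27B bits while the explicit list takes about 3.23B bits, which matches the figures on the slide.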
Search
• Given an n-gram request (A B C)
• Get posting lists for A, B and C
• Search intersections of posting lists
• Use “look ahead” to speed up the search
  • Look ahead size = Sqrt(size of posting list)
  • Moffat and Zobel (1996)
[Figure: SKIP]
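The intersection step with look-ahead can be sketched as below. This is one plausible reading of the slide, not the actual Hammer code: posting lists are assumed to be sorted Python lists of ngramIDs, the look-ahead jumps over a block of sqrt(len(list)) entries whenever the entry at the end of that block is still not larger than the ID being sought, and the function names are hypothetical.

```python
from math import isqrt

def intersect_with_lookahead(probe, target):
    """Intersect two sorted ID lists, skipping ahead in `target`."""
    skip = max(1, isqrt(len(target)))   # look-ahead size = sqrt(list size)
    result = []
    j = 0
    for x in probe:
        # Look ahead: jump a whole block while its last entry is still <= x.
        while j + skip < len(target) and target[j + skip] <= x:
            j += skip
        # Finish with a short linear scan inside the current block.
        while j < len(target) and target[j] < x:
            j += 1
        if j == len(target):
            break
        if target[j] == x:
            result.append(x)
    return result

def intersect_all(posting_lists):
    """Intersect several posting lists, starting from the shortest."""
    ordered = sorted(posting_lists, key=len)
    result = ordered[0]
    for other in ordered[1:]:
        result = intersect_with_lookahead(result, other)
    return result

# Hypothetical posting lists for A, B and C.
candidates = intersect_all([
    [1, 3, 5, 7, 9, 11],
    [3, 4, 5, 9, 10],
    [0, 3, 9, 12, 15, 18, 21],
])
print(candidates)
```

Starting from the shortest list keeps the intermediate result small, so the long lists (such as the one for “the”) are mostly skipped over rather than scanned.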
Implementation: Overview
• 2. Filtering (after 1. Search candidates)
Filtering
• Not all candidate ngramIDs match the request
• We need frequency and sentence information for the matched n-grams
• POS, chunk and NE information is represented as IDs
  • This reduces the index by more than 200GB
[Figure: A B, Freq=123; NN, PERSON, VB, Freq=10; LOC, Freq=5]
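A filtering pass over candidate ngramIDs might look like the sketch below. Everything here is an assumption made for illustration: the ID tables `POS_IDS` and `NE_IDS`, the `Variant` record (one POS/NE labeling of an n-gram with its frequency), and the request format, in which each position holds either `None` (wildcard) or a set of allowed IDs. OR is a set with several IDs, and NOT is the complement of a set.

```python
from dataclasses import dataclass

# Hypothetical ID tables; the slides only say labels are stored as IDs.
POS_IDS = {"NN": 1, "VB": 2, "DT": 3}
NE_IDS = {"O": 0, "PERSON": 1, "LOC": 2}

@dataclass(frozen=True)
class Variant:
    pos: tuple   # POS ID for each token position
    ne: tuple    # NE ID for each token position
    freq: int    # how often the n-gram occurs with this labeling

def not_ids(table, *names):
    """NOT: every ID in the table except the named ones."""
    return set(table.values()) - {table[n] for n in names}

def satisfies(labels, request):
    """True if each position is a wildcard (None) or holds an allowed ID."""
    return all(allowed is None or label in allowed
               for label, allowed in zip(labels, request))

def filter_candidates(candidates, variants_by_id, pos_request, ne_request):
    """Keep candidates with at least one matching variant; sum its frequency."""
    matched = {}
    for ngram_id in candidates:
        freq = sum(v.freq for v in variants_by_id.get(ngram_id, ())
                   if satisfies(v.pos, pos_request) and satisfies(v.ne, ne_request))
        if freq:
            matched[ngram_id] = freq
    return matched

variants_by_id = {
    5: [Variant((1, 2), (1, 0), 10), Variant((1, 2), (2, 0), 5)],
    7: [Variant((3, 1), (0, 0), 4)],
}
nn_first = [{POS_IDS["NN"]}, None]
print(filter_candidates([5, 7], variants_by_id, nn_first, [{NE_IDS["PERSON"]}, None]))
```

Comparing small integer IDs instead of label strings keeps the per-candidate check cheap, which matters when the candidate list is long.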
Implementation: Overview
• 3. Display (after 1. Search candidates and 2. Filtering)
Display
• N-grams are displayed in descending order of frequency
  • N-gram IDs are ordered by frequency
• Sentences are searched using the suffix array
• POS, chunk, NE are displayed with sentence, KWIC, ngram
• Doc ID and Wikipedia title (and possibly other features of the doc) are displayed with sentences and KWIC
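Looking up the occurrences of an n-gram with a suffix array, and rendering them as KWIC lines, can be sketched as follows. This is a toy, token-level version under stated assumptions: the slides do not say whether the real suffix array is built over tokens or characters, the naive construction below is only suitable for tiny inputs, and all function names are hypothetical.

```python
def build_suffix_array(tokens):
    """Naive token-level suffix array: start positions sorted by suffix."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_occurrences(tokens, suffix_array, query):
    """Return all start positions of `query` (a list of tokens) via binary search."""
    n = len(query)
    lo, hi = 0, len(suffix_array)
    while lo < hi:   # first suffix whose n-token prefix is >= query
        mid = (lo + hi) // 2
        if tokens[suffix_array[mid]:suffix_array[mid] + n] < query:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(suffix_array)
    while lo < hi:   # first suffix whose n-token prefix is > query
        mid = (lo + hi) // 2
        if tokens[suffix_array[mid]:suffix_array[mid] + n] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])

def kwic(tokens, pos, n, width=2):
    """Render a keyword-in-context line with `width` tokens on each side."""
    left = " ".join(tokens[max(0, pos - width):pos])
    key = " ".join(tokens[pos:pos + n])
    right = " ".join(tokens[pos + n:pos + n + width])
    return f"{left} [{key}] {right}".strip()

tokens = "the cat sat on the mat and the cat ran".split()
sa = build_suffix_array(tokens)
hits = find_occurrences(tokens, sa, ["the", "cat"])
for pos in hits:
    print(kwic(tokens, pos, 2))
```

Because all suffixes that begin with the query are contiguous in the sorted order, two binary searches are enough to find every occurrence without scanning the text.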
Size of data
• Text: 8 GB, 1.7 G words, 200M sentences, 2.4M articles
• Ngram
  n   count
  1   8M
  2   93M
  3   377M
  4   733M
  5   1.00B
  6   1.17B
  7   1.27B
• Total: 530GB
  • Suffix array for text
  • 260 GB: N-gram data
  • 108 GB: Inverted index for n-gram data
  • 8 GB: Wikipedia text
  • 100 GB: POS, chunk, NE for N-gram data
  • 6 GB: Wikipedia POS, chunk, NE
  • 40 GB: Others
Future Work
• Other information (ex: parse, coref, relation, genre, discourse…)
• Longer n-gram
• Compress index, dictionary
• Ease the indexing load
  • Now we need a big memory machine
  • Distributing indexing
• Union operation for tokens
Available for you
• Web demo
  • At NYU: http://nlp.cs.nyu.edu/nsearch
  • At JHU?
• USB Hard drive