Recognition and Classification of Noun Phrases in Queries for Effective Retrieval. Wei Zhang 1 Shuang Liu 2 Clement Yu 1 wzhang@cs.uic.edu shuang.liu@ask.com yu@cs.uic.edu
Recognition and Classification of Noun Phrases in Queries for Effective Retrieval • Wei Zhang (1), wzhang@cs.uic.edu • Shuang Liu (2), shuang.liu@ask.com • Clement Yu (1), yu@cs.uic.edu • Chaojing Sun (3), chaojing@gmail.com • Fang Liu (4), fangliu@microsoft.com • Weiyi Meng (5), meng@cs.binghamton.edu • (1) Department of Computer Science, University of Illinois at Chicago • (2) Ask.com • (3) Broadcom Corporation • (4) Microsoft • (5) Department of Computer Science, Binghamton University • CIKM 2007
Outline • Motivation • Our definitions of the phrases • Proper noun and dictionary phrase recognition • Simple and complex phrase recognition • Experimental results
Motivation • Terms in a query are related semantically • E.g. “John Smith” • Recognize this relationship • Partition the query terms into groups (phrases) • Document retrieval using phrases • Add phrases to searching and ranking
Types of Noun Phrases • Phrases that have fixed writing formats • Names of locations, people, companies, … • Well-defined concepts, e.g. “computer science” • Freely written phrases • Not formally defined but used in real language
Four Types of Noun Phrases • Proper Noun (PN) • A noun phrase that names a specific person, place or thing • First letters of the content words are capitalized • E.g. “John Smith”, “Atlantic Ocean” • Dictionary Phrase (DP) • A phrase that has a definition in a dictionary, excluding PN • These two types may overlap • E.g. “Atlantic Ocean” • They cannot replace each other • E.g. “Lina’s Pizza” (a PN that is not in a dictionary), “public transportation” (a DP that is not a PN)
Four Types of Noun Phrases • Simple Noun Phrase (SNP) • A grammatically valid noun phrase other than PN and DP • 2 words • E.g. “white car”, “good hotel” • Complex Noun Phrase (CNP) • A grammatically valid noun phrase other than PN and DP • 3 or more words • May contain PN/DP/SNP • E.g. “small white car”, “city public transportation”
Noun Phrase Recognition • General procedure • Recognize PN and dictionary phrases first • Then simple and complex noun phrases • An n-word query • Check the original query • Check the 2 (n-1)-word sub-phrases • … • Check the (n-1) 2-word sub-phrases • In total, n*(n-1)/2 candidates • E.g. “World Trade Organization” • Check the full query, then “World Trade” and “Trade Organization”
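The candidate enumeration on this slide can be sketched as follows. This is a minimal illustration, assuming candidates are the contiguous word sequences of length 2 to n; the function name `candidate_phrases` is hypothetical and not from the talk.

```python
def candidate_phrases(query: str) -> list[str]:
    """Return every contiguous sub-sequence of at least two words, longest first."""
    words = query.split()
    n = len(words)
    candidates = []
    # Lengths n, n-1, ..., 2: one n-word candidate, two (n-1)-word candidates, etc.
    for length in range(n, 1, -1):
        for start in range(n - length + 1):
            candidates.append(" ".join(words[start:start + length]))
    return candidates


cands = candidate_phrases("World Trade Organization")
print(cands)       # the full query first, then the two 2-word sub-phrases
print(len(cands))  # n*(n-1)/2 = 3 for n = 3
```

Listing the longest candidates first mirrors the order on the slide: the original query is checked before any shorter sub-phrase.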
Noun Phrase Recognition • Tools for phrase recognition • Dictionaries (Wikipedia, WordNet) • Large text corpus (Google for experiments) • Parsers (Minipar, Collins parser) and POS tagger
PN and DP Recognition • Wikipedia • For proper nouns and dictionary phrases • DP: existence of the entry page • PN: content words in the first instance of the phrase in the main text should be capitalized
PN and DP Recognition • WordNet • For PN and DP recognition • DP: defined in a dictionary • PN: has a hypernym of city, province, country, organization, geographic area, person, syndrome, region, building, or nation.
PN and DP Recognition • Minipar • For PN recognition only • “PN” label in the parse tree • Semantic label of person, country, corpname, location, corpdesig, fname, gname, or date
PN and DP Recognition • List of first names, last names and rules • First_initial Last_name • First_initial Middle_initial Last_name • First_name Middle_initial Last_name • First_name Last_name
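The four name rules above could be checked roughly as below. This is a sketch under stated assumptions: `FIRST_NAMES` and `LAST_NAMES` are tiny hypothetical stand-ins for the name lists mentioned on the slide, and an initial is assumed to be a single letter with an optional period.

```python
import re

# Hypothetical sample lists; the talk's actual name lists are not given.
FIRST_NAMES = {"john", "mary"}
LAST_NAMES = {"smith", "kennedy"}

INITIAL = re.compile(r"^[A-Za-z]\.?$")  # assumption: "J" or "J."


def is_initial(token: str) -> bool:
    return bool(INITIAL.match(token))


def is_first_name(token: str) -> bool:
    return token.lower() in FIRST_NAMES


def is_last_name(token: str) -> bool:
    return token.lower() in LAST_NAMES


def matches_name_rule(phrase: str) -> bool:
    """True if the phrase fits one of the four person-name patterns."""
    tokens = phrase.split()
    if len(tokens) == 2:
        # First_initial Last_name, or First_name Last_name
        first, last = tokens
        return (is_initial(first) or is_first_name(first)) and is_last_name(last)
    if len(tokens) == 3:
        # First_initial Middle_initial Last_name, or First_name Middle_initial Last_name
        first, middle, last = tokens
        return ((is_initial(first) or is_first_name(first))
                and is_initial(middle) and is_last_name(last))
    return False


for p in ["John Smith", "J. Smith", "John F. Kennedy", "white car"]:
    print(p, matches_name_rule(p))
```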
PN and DP Recognition • Text corpus • For less well-known PNs • Three instances, first letters of the content words capitalized • Not a sub-phrase of a longer PN • “if you choose windows by Vista Window Company, …” • “if you choose windows by Super Vista Window Company, …”
PN and DP Recognition • Overlapped phrases • Search all words together • Count the instances of each phrase in the returned documents • e.g. “Native American Casino” • “Native American” and “American Casino” • Compare ( Count(“Native American”), Count(“American Casino”) )
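A minimal sketch of the count comparison for overlapped phrases follows. The document list is toy data standing in for the documents returned by the search, and keeping the first phrase on a tie is an assumption; the slide only says the counts are compared.

```python
def count_instances(phrase: str, documents: list[str]) -> int:
    """Count case-insensitive occurrences of the phrase across the documents."""
    needle = phrase.lower()
    return sum(doc.lower().count(needle) for doc in documents)


def resolve_overlap(phrase_a: str, phrase_b: str, documents: list[str]) -> str:
    """Keep the overlapping phrase seen more often (ties go to phrase_a, an assumption)."""
    count_a = count_instances(phrase_a, documents)
    count_b = count_instances(phrase_b, documents)
    return phrase_a if count_a >= count_b else phrase_b


# Toy stand-in for documents returned when searching all query words together.
returned_docs = [
    "Native American casino revenue grew.",
    "Native American tribes operate the casino.",
    "A casino owned by Native American nations.",
]
print(resolve_overlap("Native American", "American Casino", returned_docs))
```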
SNP and CNP Recognition • Only check the phrase candidates that • are not sub-phrases of a recognized PN/DP • do not overlap with a recognized PN/DP
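One way to read this filter is sketched below, using word-index spans `(start, end)` with an exclusive end. The interpretation is an assumption: a candidate is kept if it is disjoint from every recognized PN/DP or strictly contains it (since a CNP may contain a PN/DP), and dropped if it lies inside one or only partially overlaps one. The function names are hypothetical.

```python
def keep_candidate(span: tuple[int, int], recognized: tuple[int, int]) -> bool:
    """Keep a span that is disjoint from, or strictly contains, a recognized PN/DP span."""
    start, end = span
    r_start, r_end = recognized
    disjoint = end <= r_start or r_end <= start
    strictly_contains = start <= r_start and r_end <= end and span != recognized
    return disjoint or strictly_contains


def remaining_candidates(n_words: int, recognized: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Enumerate spans of 2+ words and filter them against all recognized PN/DP spans."""
    kept = []
    for length in range(n_words, 1, -1):
        for start in range(n_words - length + 1):
            span = (start, start + length)
            if all(keep_candidate(span, r) for r in recognized):
                kept.append(span)
    return kept


# "city public transportation", with "public transportation" (words 1..2) recognized as a DP.
print(remaining_candidates(3, [(1, 3)]))
```

In this example only the full three-word span survives: “city public” partially overlaps the recognized DP, and “public transportation” is the DP itself.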
SNP and CNP Recognition • Implicit phrases • “and” / “or” • “main and contributing factor” • “main factor” • “contributing factor”
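The expansion of implicit phrases can be illustrated with a simplified sketch. It assumes a single-word modifier on each side of “and” / “or” and treats everything after the second modifier as the shared head; real queries would need a parser, and `expand_implicit_phrases` is a hypothetical name.

```python
def expand_implicit_phrases(phrase: str) -> list[str]:
    """Split 'A and B head' into 'A head' and 'B head'; return the phrase unchanged otherwise."""
    words = phrase.split()
    for i, word in enumerate(words):
        # Require a modifier before the conjunction and a modifier plus head after it.
        if word.lower() in ("and", "or") and 0 < i < len(words) - 2:
            prefix = words[:i - 1]
            left, right = words[i - 1], words[i + 1]
            head = words[i + 2:]
            return [" ".join(prefix + [left] + head),
                    " ".join(prefix + [right] + head)]
    return [phrase]


print(expand_implicit_phrases("main and contributing factor"))
```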
SNP and CNP Recognition • Head word replacement • Replace the whole phrase by its head word • Collins parser • Labels the noun phrases and their head words • E.g. parse of “Best Compact Sedan”: NP/sedan (head word) > NP/sedan (head word) > Best/JJS Compact/JJ Sedan/NN
SNP and CNP Recognition • Phrase verification • Verify that a phrase is used in the real world • For a CNP, this also means finding all of its words within a text window • E.g. “Colin Farrell wallpaper” and “wallpaper of Colin Farrell”
SNP and CNP Recognition • Overlapped phrases • Two potential SNP/CNP: search all words, compare the numbers of instances • E.g. “sony dvd handycam”: “sony dvd” and “dvd handycam”
Document Retrieval Using Phrases • Search a phrase in a document • Exact match: PN/DP • Search all words in a text window: SNP/CNP
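The two matching modes might look like the sketch below. The tokenizer and the default window size of 5 tokens are assumptions; the slide does not give the window size used.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens (a simple assumption, not the talk's tokenizer)."""
    return re.findall(r"\w+", text.lower())


def exact_match(phrase: str, document: str) -> bool:
    """PN/DP: the phrase words must appear adjacent and in order."""
    p, t = tokenize(phrase), tokenize(document)
    return any(t[i:i + len(p)] == p for i in range(len(t) - len(p) + 1))


def window_match(phrase: str, document: str, window: int = 5) -> bool:
    """SNP/CNP: all phrase words must fall inside some window of `window` tokens."""
    needed = set(tokenize(phrase))
    tokens = tokenize(document)
    # max(1, ...) ensures a document shorter than the window is still checked once.
    for start in range(max(1, len(tokens) - window + 1)):
        if needed <= set(tokens[start:start + window]):
            return True
    return False


doc = "Download a wallpaper of Colin Farrell here"
print(exact_match("Colin Farrell wallpaper", doc))   # words are not adjacent in this order
print(window_match("Colin Farrell wallpaper", doc))  # all three words within 5 tokens
```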
Document Retrieval Using Phrases • Sim(Query, Doc) = <Sim_P, Sim_T> • Phrase similarity • Sim_P(P_i) = idf(P_i) for each query phrase P_i found in the document • Sim_P = sum_i Sim_P(P_i) • Term similarity • Sim_T = Okapi BM25 similarity • Document ranking • D1 is ranked higher than D2 if • (Sim_P1 > Sim_P2) OR (Sim_P1 = Sim_P2 AND Sim_T1 > Sim_T2)
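The ranking rule is a lexicographic comparison of the pair (phrase similarity, term similarity), which the sketch below illustrates. The idf values, the BM25 scores and the document records are made-up toy inputs; computing idf and Okapi BM25 themselves is outside this sketch.

```python
def phrase_similarity(matched_phrases: list[str], idf: dict[str, float]) -> float:
    """Sum of idf over the query phrases found in the document."""
    return sum(idf[p] for p in matched_phrases)


def rank_documents(docs: list[dict], idf: dict[str, float]) -> list[str]:
    """Order by phrase similarity first; term similarity only breaks ties."""
    scored = [(phrase_similarity(d["phrases"], idf), d["sim_t"], d["id"]) for d in docs]
    scored.sort(key=lambda s: (s[0], s[1]), reverse=True)
    return [doc_id for _, _, doc_id in scored]


# Toy inputs (assumptions): precomputed idf values and BM25 term scores.
idf = {"john smith": 2.0, "white car": 1.5}
docs = [
    {"id": "D1", "phrases": ["john smith"], "sim_t": 0.4},
    {"id": "D2", "phrases": [], "sim_t": 9.0},
    {"id": "D3", "phrases": ["john smith"], "sim_t": 0.7},
]
print(rank_documents(docs, idf))
```

D2 has the highest term score but matches no phrase, so it ranks last; D3 beats D1 only because their phrase similarities are equal and its term score is higher.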
Experimental Results • Phrase recognition experiments • Tuned by using TREC queries
Experimental Results • Phrase recognition experiments • Tested by using Web queries
Experimental Results • Performance of individual tools • Wikipedia is better than WordNet and Minipar • Need for a complete dictionary • Collins parser alone is not enough for SNP/CNP recognition • Lack of real world usage information
Experimental Results • Document retrieval experiments • Ad-hoc TREC 6, 7 and 8, robust TREC 12, 13 and 14 • (1) Retrieval without using phrases • (2) Using Wikipedia for PN/DP and just the Collins parser for SNP/CNP • (3) Using phrases from the full recognition algorithm • 33% MAP increase and 44.27% GMAP increase from (1) to (2) • 5.8% MAP increase and 12.58% GMAP increase from (2) to (3)
Conclusions • Our algorithm can effectively recognize the four types of phrases in short Web queries • The recognized phrases help improve retrieval effectiveness
Questions? • wzhang@cs.uic.edu • http://www.cs.uic.edu/~wzhang/