This paper discusses the recognition and classification of noun phrases in queries for effective retrieval, outlining the motivation, definitions of phrases, and experimental results. It covers proper noun and dictionary phrase recognition, as well as simple and complex phrase recognition. The study categorizes four types of noun phrases and explains the recognition process using various tools such as dictionaries, text corpora, and parsers. The paper also delves into noun phrase verification and document retrieval using recognized phrases.
Recognition and Classification of Noun Phrases in Queries for Effective Retrieval Wei Zhang1 Shuang Liu2 Clement Yu1 wzhang@cs.uic.edu shuang.liu@ask.com yu@cs.uic.edu Chaojing Sun3 Fang Liu4 Weiyi Meng5 chaojing@gmail.com fangliu@microsoft.com meng@cs.binghamton.edu 1 Department of Computer Science, University of Illinois at Chicago 2 Ask.com 3 Broadcom Corporation 4 Microsoft 5 Department of Computer Science, Binghamton University CIKM 2007
Outline • Motivation • Our definitions of the phrases • Proper noun and dictionary phrase recognition • Simple and complex phrase recognition • Experimental results
Motivation • Terms in a query are related semantically • “John Smith” • Recognize this relationship • Partition the query terms into groups (phrases) • Document retrieval using phrases • Add phrases to searching and ranking
Types of Noun Phrases • Phrases that have fixed writing formats • Names of locations, people, companies, … • Well-defined concepts, e.g. “computer science” • Freely written phrases • Not formally defined, but used in real language
Four Types of Noun Phrases • Proper Noun (PN) • A noun phrase that names a specific person, place or thing. • First letters of the content words are capitalized • E.g. “John Smith”, “Atlantic Ocean” • Dictionary Phrase (DP) • A phrase that has a definition in a dictionary, excluding PN • These two types may overlap • “Atlantic Ocean” • They cannot replace each other • E.g. “Lina’s Pizza”, “public transportation”
Four Types of Noun Phrases • Simple Noun Phrase (SNP) • A grammatically valid noun phrase other than PN and DP • 2 words • E.g. “white car”, “good hotel” • Complex Noun Phrase (CNP) • A grammatically valid noun phrase other than PN and DP • 3 or more words • May contain PN/DP/SNP • E.g. “small white car”, “city public transportation”
Noun Phrase Recognition • General procedure • Recognize PN and dictionary phrases first • Then simple and complex noun phrases • For an n-word query • Check the original query • Check the 2 (n-1)-term sub-sequences • … • Check the (n-1) 2-term sub-sequences • n*(n-1)/2 candidates in total • E.g. “World Trade Organization” • “World Trade” and “Trade Organization”
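The candidate enumeration on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `phrase_candidates` and the whitespace tokenization are assumptions. It lists every contiguous sub-sequence of two or more query words, longest first, which gives n*(n-1)/2 candidates for an n-word query.

```python
def phrase_candidates(query: str) -> list[str]:
    """Return all contiguous sub-sequences of 2+ query words, longest first."""
    words = query.split()  # assumption: whitespace tokenization is enough
    n = len(words)
    candidates = []
    # Lengths n, n-1, ..., 2: one n-word candidate, two (n-1)-word ones, etc.
    for length in range(n, 1, -1):
        for start in range(n - length + 1):
            candidates.append(" ".join(words[start:start + length]))
    return candidates


if __name__ == "__main__":
    for candidate in phrase_candidates("World Trade Organization"):
        print(candidate)
```

For “World Trade Organization” this yields the full query plus “World Trade” and “Trade Organization”, matching the example on the slide.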
Noun Phrase Recognition • Tools for phrase recognition • Dictionaries (Wikipedia, WordNet) • Large text corpus (Google for experiments) • Parsers (Minipar, Collins parser) and POS tagger
PN and DP Recognition • Wikipedia • For proper nouns and dictionary phrases • DP: existence of the entry page • PN: content words in the first instance of the phrase in the main text should be capitalized
PN and DP Recognition • WordNet • For PN and DP recognition • DP: defined in a dictionary • PN: has a hypernym of city, province, country, organization, geographic area, person, syndrome, region, building, or nation.
PN and DP Recognition • Minipar • For PN recognition only • “PN” label in the parse tree • Semantic label of person, country, corpname, location, corpdesig, fname, gname, or date
PN and DP Recognition • List of first names, last names and rules • First_initial last_name • First_initial mid_initial last_name • First_name middle_initial last_name • First_name last_name
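The four name rules above could be checked roughly as below. This is a sketch under stated assumptions: the name lists are tiny hypothetical samples, an initial is taken to be a single letter with an optional period, and `is_person_name` is an illustrative name rather than anything from the paper.

```python
import re

# Hypothetical sample lists; a real system would load full name lists.
FIRST_NAMES = {"john", "mary"}
LAST_NAMES = {"smith", "jones"}

# Assumption: an initial is one letter, optionally followed by a period.
INITIAL = re.compile(r"^[A-Za-z]\.?$")


def is_initial(token: str) -> bool:
    return bool(INITIAL.match(token))


def is_first(token: str) -> bool:
    return token.lower() in FIRST_NAMES


def is_last(token: str) -> bool:
    return token.lower() in LAST_NAMES


# One tuple of token checks per rule on the slide.
RULES = [
    (is_initial, is_last),              # First_initial last_name
    (is_initial, is_initial, is_last),  # First_initial mid_initial last_name
    (is_first, is_initial, is_last),    # First_name middle_initial last_name
    (is_first, is_last),                # First_name last_name
]


def is_person_name(phrase: str) -> bool:
    """True if the phrase matches any of the four name rules."""
    tokens = phrase.split()
    return any(
        len(rule) == len(tokens)
        and all(check(token) for check, token in zip(rule, tokens))
        for rule in RULES
    )
```

With the sample lists, `is_person_name("John Q. Smith")` holds while `is_person_name("white car")` does not.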
PN and DP Recognition • Text corpus • For less well-known PNs • Three instances, first letters of the content words capitalized • Not a sub-phrase of a longer PN • “if you choose windows by Vista Window Company, …” • “if you choose windows by Super Vista Window Company, …”
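A rough sketch of the corpus test above, with several assumptions labeled: “three instances” is read as a minimum threshold, the stopword list is a hypothetical stand-in for non-content words, and an instance is treated as part of a longer PN when the neighbouring word is also capitalized (a simplification that ignores sentence-initial capitals).

```python
import re

# Hypothetical list of non-content words that need not be capitalized.
STOPWORDS = {"of", "the", "and", "for", "in"}


def is_capitalized(word: str) -> bool:
    return word[:1].isupper()


def count_pn_instances(phrase: str, sentences: list[str]) -> int:
    """Count capitalized instances that are not inside a longer capitalized run."""
    target = phrase.lower().split()
    k = len(target)
    count = 0
    for sentence in sentences:
        tokens = re.findall(r"[A-Za-z0-9']+", sentence)
        for i in range(len(tokens) - k + 1):
            window = tokens[i:i + k]
            if [t.lower() for t in window] != target:
                continue
            # Every content word of the instance must be capitalized.
            if not all(is_capitalized(t) for t in window
                       if t.lower() not in STOPWORDS):
                continue
            before = tokens[i - 1] if i > 0 else ""
            after = tokens[i + k] if i + k < len(tokens) else ""
            # Assumption: a capitalized neighbour signals a longer PN.
            if is_capitalized(before) or is_capitalized(after):
                continue
            count += 1
    return count


def is_corpus_pn(phrase: str, sentences: list[str], min_instances: int = 3) -> bool:
    return count_pn_instances(phrase, sentences) >= min_instances
```

Under these assumptions, three sentences containing “by Vista Window Company” accept “Vista Window Company”, while three containing “by Super Vista Window Company” reject it and accept the longer phrase instead.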
PN and DP Recognition • Overlapped phrases • Search all words together • Count the instances of each phrase in the returned documents • e.g. “Native American Casino” • “Native American” and “American Casino” • Compare ( Count(“Native American”), Count(“American Casino”) )
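The comparison for overlapped phrases might look like the sketch below. The in-memory `documents` list stands in for search results, and keeping the phrase with the higher count (with ties going to the first phrase) is an assumed reading of the “Compare” step, not a rule stated on the slide.

```python
import re


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9']+", text.lower())


def count_occurrences(phrase_tokens: list[str], doc_tokens: list[str]) -> int:
    """Count exact, contiguous occurrences of the phrase in one document."""
    k = len(phrase_tokens)
    return sum(
        doc_tokens[i:i + k] == phrase_tokens
        for i in range(len(doc_tokens) - k + 1)
    )


def resolve_overlap(query: str, phrase_a: str, phrase_b: str,
                    documents: list[str]) -> tuple[str, dict[str, int]]:
    """Pick the overlapping phrase seen more often in documents matching the query."""
    query_words = set(tokenize(query))
    counts = {phrase_a: 0, phrase_b: 0}
    for doc in documents:
        tokens = tokenize(doc)
        # Emulate "search all words together": keep only documents
        # that contain every query word.
        if not query_words <= set(tokens):
            continue
        for phrase in counts:
            counts[phrase] += count_occurrences(tokenize(phrase), tokens)
    # Assumption: the more frequent phrase wins; ties go to phrase_a.
    return max(counts, key=counts.get), counts
```

For the query “Native American Casino”, documents that mention “Native American” more often than “American Casino” lead to “Native American” being kept.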
SNP and CNP Recognition • Only check the phrase candidates that • are not sub-phrases of a recognized PN/DP • do not overlap with a recognized PN/DP
SNP and CNP Recognition • Implicit phrases • “and” / “or” • “main and contributing factor” • “main factor” • “contributing factor”
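The expansion of implicit phrases can be illustrated with a small sketch. It assumes the simplest pattern only: one single-word modifier on each side of “and”/“or”, with everything after the second modifier treated as the shared head. A real implementation would presumably rely on POS tags or a parse; the function name is hypothetical.

```python
def expand_implicit_phrases(phrase: str) -> list[str]:
    """Split 'X and/or Y head' into 'X head' and 'Y head'."""
    tokens = phrase.split()
    for i, token in enumerate(tokens):
        # Need one word before the conjunction and at least two after it.
        if token.lower() in {"and", "or"} and 0 < i < len(tokens) - 2:
            prefix = tokens[:i - 1]   # words before the first modifier
            left_modifier = tokens[i - 1]
            right_modifier = tokens[i + 1]
            head = tokens[i + 2:]     # assumption: the rest is the shared head
            return [
                " ".join(prefix + [left_modifier] + head),
                " ".join(prefix + [right_modifier] + head),
            ]
    return [phrase]  # no conjunction pattern found


if __name__ == "__main__":
    print(expand_implicit_phrases("main and contributing factor"))
```

Applied to “main and contributing factor”, this produces “main factor” and “contributing factor”, as on the slide.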
SNP and CNP Recognition • Head word replacement • Replace the whole phrase by its head word • Collins parser • Label the noun phrases • E.g. Best/JJS Compact/JJ Sedan/NN is labeled NP/sedan (head word)
SNP and CNP Recognition • Phrase verification • Verify that a phrase is actually used in real-world text • For a CNP, an instance also counts when all of its words appear within a text window • “Colin Farrell wallpaper” and “wallpaper of Colin Farrell”
SNP and CNP Recognition • Overlapped phrases • Two potential SNP/CNP: Search all words, compare the numbers of the instances. • “sony dvd handycam”: “sony dvd” and “dvd handycam”
Document Retrieval Using Phrases • Search a phrase in a document • Exact match: PN/DP • Search all words in a text window: SNP/CNP
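The two matching modes can be sketched as follows. The window size is not given on the slides, so `window=5` is a hypothetical default, and the function names are illustrative.

```python
import re


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9']+", text.lower())


def exact_match(phrase: str, document: str) -> bool:
    """PN/DP: the phrase words must appear contiguously and in order."""
    p, d = tokenize(phrase), tokenize(document)
    return any(d[i:i + len(p)] == p for i in range(len(d) - len(p) + 1))


def window_match(phrase: str, document: str, window: int = 5) -> bool:
    """SNP/CNP: all phrase words must fall inside one window, in any order."""
    needed = set(tokenize(phrase))
    d = tokenize(document)
    # max(1, ...) still checks a document shorter than the window once.
    return any(
        needed <= set(d[i:i + window])
        for i in range(max(1, len(d) - window + 1))
    )


def phrase_in_document(phrase: str, phrase_type: str, document: str,
                       window: int = 5) -> bool:
    if phrase_type in {"PN", "DP"}:
        return exact_match(phrase, document)
    return window_match(phrase, document, window)
```

With this sketch, the CNP “Colin Farrell wallpaper” is found in a document that says “wallpaper of Colin Farrell”, whereas an exact match on the same text fails.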
Document Retrieval Using Phrases • Sim(Query, Doc) = <Sim_P, Sim_T> • Phrase similarity • Sim_P(P_i) = idf(P_i) • Sim_P = sum ( Sim_P(P_i) ) • Term similarity • Okapi/BM-25 similarity • Document ranking • D1 is ranked higher than D2, if • (Sim_P1 > Sim_P2) OR (Sim_P1 = Sim_P2 AND Sim_T1 > Sim_T2)
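The ranking rule above is a lexicographic comparison of the pair <Sim_P, Sim_T>, which can be sketched with tuple sorting. The idf values and BM25 scores are taken as given inputs here; `ScoredDoc` and the function names are hypothetical.

```python
from typing import NamedTuple


class ScoredDoc(NamedTuple):
    doc_id: str
    sim_p: float  # phrase similarity: sum of idf over phrases found in the doc
    sim_t: float  # term similarity, e.g. an Okapi/BM25 score computed elsewhere


def phrase_similarity(matched_phrases: list[str], idf: dict[str, float]) -> float:
    """Sim_P = sum of idf(P_i) over the query phrases matched in the document."""
    return sum(idf[p] for p in matched_phrases)


def rank(docs: list[ScoredDoc]) -> list[ScoredDoc]:
    """Sort by Sim_P first; Sim_T only breaks ties in Sim_P."""
    return sorted(docs, key=lambda d: (d.sim_p, d.sim_t), reverse=True)
```

Because the term score only breaks ties, a document that matches a higher-idf phrase outranks one with a larger BM25 score but a lower phrase score. In practice, equality of floating-point Sim_P values may need a tolerance.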
Experimental Results • Phrase recognition experiments • Tuned by using TREC queries
Experimental Results • Phrase recognition experiments • Tested by using Web queries
Experimental Results • Performance of individual tools • Wikipedia is better than WordNet and Minipar • Need for a complete dictionary • Collins parser alone is not enough for SNP/CNP recognition • Lack of real world usage information
Experimental Results • Document retrieval experiments • Ad-hoc TREC 6, 7 and 8, robust TREC 12, 13 and 14 • (1) Retrieval without using phrases • (2) Using Wikipedia for PN/DP and just the Collins parser for SNP/CNP • (3) Using phrases from the full recognition algorithm • 33% MAP increase and 44.27% GMAP increase from 1 to 2 • 5.8% MAP increase and 12.58% GMAP increase from 2 to 3
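For readers unfamiliar with the two measures, the sketch below shows how MAP and GMAP are commonly computed from per-query average precision values. It is an illustration only: the small floor applied before taking logarithms is an assumption (a common way to handle zero scores), and the inputs in the usage note are made-up numbers, not results from the paper.

```python
import math


def mean_average_precision(ap_per_query: list[float]) -> float:
    """MAP: arithmetic mean of per-query average precision."""
    return sum(ap_per_query) / len(ap_per_query)


def geometric_map(ap_per_query: list[float], floor: float = 1e-5) -> float:
    """GMAP: geometric mean of per-query average precision.

    Assumption: values below `floor` are raised to it so log() is defined.
    """
    logs = [math.log(max(ap, floor)) for ap in ap_per_query]
    return math.exp(sum(logs) / len(logs))


def relative_increase(old: float, new: float) -> float:
    """Percentage increase of `new` over `old`."""
    return 100.0 * (new - old) / old
```

Because the geometric mean is pulled down by low-scoring queries, GMAP rewards improvements on hard queries more than MAP does; for example, per-query values of 0.4 and 0.1 give a MAP of 0.25 but a GMAP of about 0.2.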
Conclusions • Our algorithm can effectively recognize the four types of phrases in short Web queries • The recognized phrases help improve retrieval effectiveness
Questions? • wzhang@cs.uic.edu • http://www.cs.uic.edu/~wzhang/