Using term informativeness for named entity detection Advisor: Dr. Hsu Reporter: Chun Kai Chen Author: Jason D. M. Rennie and Tommi Jaakkola SIGIR 2005, pp. 353-360
Outline • Motivation • Objective • Introduction • Mixture Models • Experiment • Summary
Motivation • Informal communication (e-mail, bulletin boards) poses a difficult learning environment • because traditional grammatical and lexical cues are noisy • and timely information can be difficult to extract • We are interested in the problem of extracting information from informal, written communication.
Objective • The paper introduces a new informativeness score, the Mixture score, that directly uses mixture-model likelihood to identify informative words.
Mixture Models • Informative words are identified • by looking at the difference in log-likelihood between a mixture model and a simple unigram model • The simplest model is the unigram, with ni the number of flips in document i, hi the number of heads, and θ = 0.5 • The mixture model combines two unigrams with parameters θ1 and θ2 • The Mixture score is the log-odds of the two likelihoods (see the formulas below)
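The formulas themselves appear as images in the original slides and are missing here; the following is a reconstruction from the notation above and the worked examples that follow, assuming equal mixing weights of 1/2 as used in the examples:

```latex
% Unigram likelihood over documents i, with n_i flips and h_i heads, theta = 0.5
L_{\mathrm{uni}} = \prod_i \theta^{h_i} (1-\theta)^{n_i - h_i}

% Mixture of two unigrams with equal mixing weights
L_{\mathrm{mix}} = \prod_i \left[ \tfrac{1}{2}\,\theta_1^{h_i}(1-\theta_1)^{n_i - h_i}
                                + \tfrac{1}{2}\,\theta_2^{h_i}(1-\theta_2)^{n_i - h_i} \right]

% Mixture score: log-odds of the two likelihoods
\mathrm{Mixture} = \log L_{\mathrm{mix}} - \log L_{\mathrm{uni}}
```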
Mixture Models (example 1) • Example • Keyword "fish": D1 = {fish fish fish}, D2 = {I am student} (H = the word occurs, T = it does not) • four short "documents": {{HHH},{TTT},{HHH},{TTT}} • simple unigram model: {{HHH},{TTT},{HHH},{TTT}} = {0.5^3 (1-0.5)^(3-3)} × {0.5^0 (1-0.5)^(3-0)} × {0.5^3 (1-0.5)^(3-3)} × {0.5^0 (1-0.5)^(3-0)} = 0.5^3 × 0.5^3 × 0.5^3 × 0.5^3 = 0.000244140625 = 2^-12 • mixture model (θ1 = 1, θ2 = 0): {HHH} = 0.5 × 1^3 × (1-1)^(3-3) + (1-0.5) × 0^3 × (1-0)^(3-3) = 0.5 + 0; {TTT} = 0.5 × 1^0 × (1-1)^(3-0) + (1-0.5) × 0^0 × (1-0)^(3-0) = 0 + 0.5; {{HHH},{TTT},{HHH},{TTT}} = 0.5 × 0.5 × 0.5 × 0.5 = 0.0625 = 2^-4
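These numbers are easy to check mechanically; a minimal sketch (not from the paper) that recomputes example 1's unigram and mixture likelihoods for the {{HHH},{TTT},{HHH},{TTT}} data:

```python
# Verify example 1: unigram vs. two-component mixture likelihood
# for the coin-flip "documents" {{HHH},{TTT},{HHH},{TTT}}.
docs = [(3, 3), (3, 0), (3, 3), (3, 0)]  # (n_i flips, h_i heads) per document

def unigram_likelihood(docs, theta=0.5):
    p = 1.0
    for n, h in docs:
        p *= theta**h * (1 - theta)**(n - h)
    return p

def mixture_likelihood(docs, theta1, theta2, w=0.5):
    p = 1.0
    for n, h in docs:
        p *= (w * theta1**h * (1 - theta1)**(n - h)
              + (1 - w) * theta2**h * (1 - theta2)**(n - h))
    return p

print(unigram_likelihood(docs))            # 0.000244140625 = 2**-12
print(mixture_likelihood(docs, 1.0, 0.0))  # 0.0625 = 2**-4
```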
Mixture Models (example 2) • Example • four short "documents": {{HTT},{TTT},{HTT},{TTT}} • simple unigram model: {{HTT},{TTT},{HTT},{TTT}} = {0.5^1 (1-0.5)^(3-1)} × {0.5^0 (1-0.5)^(3-0)} × {0.5^1 (1-0.5)^(3-1)} × {0.5^0 (1-0.5)^(3-0)} = 0.5^3 × 0.5^3 × 0.5^3 × 0.5^3 = 2^-12 • mixture model (θ1 ≈ 0.33, θ2 ≈ 0.66): {HTT} = 0.5 × 0.33^1 × (1-0.33)^(3-1) + (1-0.5) × 0.66^1 × (1-0.66)^(3-1) = (0.5 × 0.33 × 0.66^2) + (0.5 × 0.66 × 0.33^2) = 0.071874 + 0.035937 = 0.107811 • {{HTT},{TTT},{HTT},{TTT}} = 0.107811 × 0.5 × 0.107811 × 0.5 = 0.0029058 (the {TTT} documents are kept at 0.5, as in example 1)
Mixture Models (example 3) • Example • four short "documents": {{HTTTT},{TTT},{HTT},{TTT}} • simple unigram model: {{HTTTT},{TTT},{HTT},{TTT}} = {0.5^1 (1-0.5)^(5-1)} × {0.5^0 (1-0.5)^(3-0)} × {0.5^1 (1-0.5)^(3-1)} × {0.5^0 (1-0.5)^(3-0)} = 0.5^5 × 0.5^3 × 0.5^3 × 0.5^3 = 2^-14 • mixture model (θ1 = 0.2, θ2 = 0.8): {HTTTT} = 0.5 × 0.2^1 × (1-0.2)^(5-1) + (1-0.5) × 0.8^1 × (1-0.8)^(5-1) = (0.5 × 0.2 × 0.8^4) + (0.5 × 0.8 × 0.2^4) = 0.04096 + 0.00064 = 0.0416 • {{HTTTT},{TTT},{HTT},{TTT}} = 0.0416 × 0.5 × 0.107811 × 0.5 = 0.0011212344
Mixture Models (Mixture score) • The Mixture score is the log-odds of the mixture and unigram likelihoods • {{HHH},{TTT},{HHH},{TTT}}: log2(0.0625 / 2^-12) = 8 • {{HTT},{TTT},{HTT},{TTT}}: log2(0.0029058 / 2^-12) ≈ 3.57 • {{HTTTT},{TTT},{HTT},{TTT}}: log2(0.0011212344 / 2^-14) ≈ 4.20
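A minimal sketch of the log-odds computation for the three example datasets, using the likelihood values given on the previous slides (base-2 logarithms are chosen here so example 1 comes out to a round number; the slides do not fix the log base):

```python
import math

# Mixture score = log-odds of mixture vs. unigram likelihood.
def mixture_score(mixture_lik, unigram_lik):
    return math.log2(mixture_lik / unigram_lik)

print(mixture_score(0.0625,       2**-12))  # example 1 -> 8.0
print(mixture_score(0.0029058,    2**-12))  # example 2 -> ~3.57
print(mixture_score(0.0011212344, 2**-14))  # example 3 -> ~4.20
```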
Introduction (1/4) • The web is filled with information, but even more information is available in the informal communication people send and receive on a day-to-day basis • We call this communication informal because its structure is not explicit and the writing is not fully grammatical • We are interested in the problem of extracting information from informal, written communication.
Introduction (2/4) • Newspaper text is comparatively easy to deal with • Newspaper articles have proper grammar with correct punctuation and capitalization • Part-of-speech taggers show high accuracy on newspaper text • Informal communication • even these basic cues are noisy: grammar rules are bent, capitalization may be ignored or used haphazardly, and punctuation use is creative
Introduction (3/4) • Restaurant bulletin boards • contain information about new restaurants almost immediately after they open • and report temporary closures, new management, better service, or a drop in food quality • This timely information can be difficult to extract • An important sub-task of extracting information from restaurant bulletin boards is identifying restaurant names.
Introduction (4/4) • If we had a good measure of how topic-oriented, or "informative," a word is, we would be better able to identify named entities • It is well known that informative words have "peaked" or "heavy-tailed" frequency distributions • Many informativeness scores have been introduced (the first two are sketched below) • Inverse Document Frequency (IDF) • Residual IDF • xI • the z-measure • Gain
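For reference, IDF and Residual IDF are standard corpus statistics; a minimal sketch of both using their standard definitions (Residual IDF compares observed IDF against the IDF predicted by a Poisson model). This is illustrative code, not from the paper:

```python
import math

def idf(num_docs, doc_freq):
    """Inverse Document Frequency: log(N / df)."""
    return math.log(num_docs / doc_freq)

def residual_idf(num_docs, doc_freq, collection_freq):
    """Observed IDF minus the IDF a Poisson model would predict.

    Under a Poisson model with rate cf/N, the expected fraction of
    documents containing the word at least once is 1 - exp(-cf/N).
    """
    expected_df_fraction = 1.0 - math.exp(-collection_freq / num_docs)
    predicted_idf = -math.log(expected_df_fraction)
    return idf(num_docs, doc_freq) - predicted_idf

# Hypothetical counts: a word appearing 40 times across 1000 documents,
# but concentrated in only 10 of them -> high Residual IDF (informative).
print(idf(1000, 10), residual_idf(1000, 10, 40))
```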
Mixture Models • Informative words exhibit two modes of operation: • a high-frequency mode, when the document is relevant to the word • a low (or zero) frequency mode, when the document is irrelevant • Informative words are identified by looking at the difference in log-likelihood between a mixture model and a simple unigram model
Mixture Models • Example • Consider the following four short "documents": {{HHH},{TTT},{HHH},{TTT}} • The simplest model for sequential binary data is the unigram: with ni the number of flips in document i and hi the number of heads, the likelihood is ∏i θ^hi (1-θ)^(ni-hi), with θ = 0.5 • The unigram is a poor model for the above data • It has no capability to model the switching nature of the data • The data likelihood is 2^-12
Mixture Models • Example • Consider the following four short "documents": {{HHH},{TTT},{HHH},{TTT}} • The likelihood for a mixture of two unigrams is ∏i [½ θ1^hi (1-θ1)^(ni-hi) + ½ θ2^hi (1-θ2)^(ni-hi)] • each component gets half the mixing weight • A mixture is a composite model • The data likelihood is 2^-4
Mixture Models • The two extra parameters of the mixture allow for much better modeling of the data • The Mixture score is the log-odds of the two likelihoods: Mixture = log L_mix - log L_uni • We are interested in the comparative improvement of the mixture model over the simple unigram • EM is used to maximize the likelihood of the mixture model
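The slides do not show the EM updates; below is a minimal sketch of how the two unigram parameters could be fit by EM for a single word, assuming equal, fixed mixing weights of 1/2 as in the examples. This is a simplification for illustration, not the paper's exact procedure:

```python
def fit_mixture_em(docs, iters=100):
    """EM for a two-component binomial mixture with fixed 1/2 mixing weights.

    docs: list of (n_i, h_i) pairs -- flips and heads per document.
    Returns (theta1, theta2).
    """
    theta1, theta2 = 0.75, 0.25          # asymmetric start to break symmetry
    for _ in range(iters):
        # E-step: responsibility of component 1 for each document
        acc1_h = acc1_n = acc2_h = acc2_n = 0.0
        for n, h in docs:
            p1 = theta1**h * (1 - theta1)**(n - h)
            p2 = theta2**h * (1 - theta2)**(n - h)
            r1 = p1 / (p1 + p2)
            acc1_h += r1 * h;       acc1_n += r1 * n
            acc2_h += (1 - r1) * h; acc2_n += (1 - r1) * n
        # M-step: weighted head rates
        theta1 = acc1_h / acc1_n if acc1_n > 0 else theta1
        theta2 = acc2_h / acc2_n if acc2_n > 0 else theta2
    return theta1, theta2

docs = [(3, 3), (3, 0), (3, 3), (3, 0)]   # example 1's data
print(fit_mixture_em(docs))               # converges toward (1.0, 0.0)
```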
Experimental Evaluation • The Restaurant Data • The task is identifying restaurant names in posts to a restaurant discussion bulletin board • Six sets of threads of approximately 100 posts each were collected and labeled from a single board • Adwait Ratnaparkhi's MXPOST and MXTERMINATOR software was used to determine sentence boundaries, tokenize the text and assign part-of-speech tags • Each token was hand-labeled as being part of a restaurant name or not • Of 56,018 tokens, 1,968 were labeled as part of a restaurant name • There were 5,956 unique tokens; of those, 325 were used at least once as part of a restaurant name
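One way an informativeness score can be assessed as a restaurant-word filter is precision and recall over unique tokens ranked by the score; the sketch below is purely illustrative (the token list, names, and threshold sweep are hypothetical, not the paper's evaluation code):

```python
def precision_recall_at_k(scored_tokens, restaurant_tokens, k):
    """Precision/recall when the top-k tokens by score are kept by the filter.

    scored_tokens: list of (token, score) pairs, one per unique token.
    restaurant_tokens: set of tokens that appear in restaurant names.
    """
    ranked = sorted(scored_tokens, key=lambda ts: ts[1], reverse=True)
    kept = {tok for tok, _ in ranked[:k]}
    hits = len(kept & restaurant_tokens)
    return hits / k, hits / len(restaurant_tokens)

# Hypothetical usage: scores might come from the Mixture score, IDF,
# or their product (the combination the summary slide recommends).
scores = [("chacarero", 7.2), ("the", 0.1), ("oishii", 6.5), ("menu", 1.3)]
names = {"chacarero", "oishii"}
print(precision_recall_at_k(scores, names, k=2))  # -> (1.0, 1.0) on this toy data
```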
Summary • Introduced a new informativeness measure, the Mixture score, and compared it against a number of other informativeness criteria • Found the Mixture score to be an effective restaurant-word filter • The IDF×Mixture score is a more effective filter than either score individually.
Personal Opinion • Advantage • Disadvantage