NYU
Cross-Domain Bootstrapping for Named Entity Recognition
Ang Sun, Ralph Grishman
New York University
July 28, 2011, Beijing
EOS, SIGIR 2011
Outline
1. Named Entity Recognition (NER)
2. Domain Adaptation Problem for NER
3. Cross-Domain Bootstrapping
   3.1 Feature Generalization with Word Clusters
   3.2 Instance Selection Based on Multiple Criteria
4. Conclusion
1. Named Entity Recognition (NER)
• Two tasks, illustrated on "U.S. Defense Secretary Donald H. Rumsfeld discussed the resolution …"
  • Identification: find the name spans (U.S., Defense, Donald H. Rumsfeld)
  • Classification: assign each name a type (U.S. → GPE, Defense → ORG, Donald H. Rumsfeld → PERSON)
2. Domain Adaptation Problem for NER
• The NYU NER system performs well on in-domain data (F-measure 83.08)
• But it performs poorly on out-of-domain data (F-measure 65.09)
• Source domain (news articles): George Bush, Donald H. Rumsfeld, …, Department of Defense, …
• Target domain (reports on terrorism): Abdul Sattar al-Rishawi, Fahad bin Abdul Aziz bin Abdul Rahman Al-Saud, …, Al-Qaeda in Iraq, …
2. Domain Adaptation Problem for NER
• No annotated data from the target domain
• Many words are out-of-vocabulary
• Naming conventions are different:
  • Length: short vs. long
    • source: George Bush; Donald H. Rumsfeld
    • target: Abdul Sattar al-Rishawi; Fahad bin Abdul Aziz bin Abdul Rahman Al-Saud
  • Capitalization: weaker in the target
  • Name variation occurs often in the target: Shaikh, Shaykh, Sheikh, Sheik, …
• Goal: automatically adapt the source-domain tagger to the target domain without annotating target-domain data
3. Cross-domain Bootstrapping
• Train a tagger from labeled source data
• Tag all unlabeled target data with the current tagger
• Select good tagged words (e.g., President Assad) and add them to the labeled data
• Re-train the tagger and repeat
[Diagram: labeled source data and unlabeled target data feed the trained tagger; feature generalization and multi-criteria instance selection close the loop]
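The loop above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `train_tagger`, `tag`, and `select_instances` are hypothetical callbacks standing in for MEMM training, decoding, and the multi-criteria selection described in Section 3.2.

```python
def cross_domain_bootstrap(labeled_source, unlabeled_target,
                           train_tagger, tag, select_instances,
                           n_rounds=5):
    """Iteratively grow the labeled set with selected target instances."""
    labeled = list(labeled_source)
    tagger = train_tagger(labeled)
    for _ in range(n_rounds):
        # Tag all unlabeled target data with the current tagger.
        tagged_target = [tag(tagger, sent) for sent in unlabeled_target]
        # Select good tagged instances by the multiple criteria
        # (novelty, confidence, density, diversity).
        selected = select_instances(tagged_target)
        if not selected:
            break
        labeled.extend(selected)
        tagger = train_tagger(labeled)  # re-train on the enlarged set
    return tagger
```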
3.1 Feature Generalization with Word Clusters
• The source model
  • Sequential model, assigning name classes to a sequence of tokens
  • One name type is split into two classes:
    • B_PER (beginning of PERSON)
    • I_PER (continuation of PERSON)
  • Maximum Entropy Markov Model (McCallum et al., 2000)
  • Customary features
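The B_PER/I_PER split can be illustrated on a short token sequence; the "O" string for the not-a-name class is an assumed label, not necessarily the one used in the paper.

```python
# One label per token: B_PER starts a PERSON name, I_PER continues it,
# and "O" (assumed label) marks tokens outside any name.
tokens = ["Secretary", "Donald", "H.", "Rumsfeld", "discussed"]
labels = ["O", "B_PER", "I_PER", "I_PER", "O"]
```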
3.1 Feature Generalization with Word Clusters
• The source/seed model
  • Customary features
  • Extracted from the context window (ti-2, ti-1, ti, ti+1, ti+2)
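Extracting features from the five-token window can be sketched like this; the feature names and the exact feature set are illustrative, not the paper's.

```python
def window_features(tokens, i):
    """Customary features from the context window t[i-2] .. t[i+2].
    Feature names here are illustrative, not the paper's exact set."""
    feats = {}
    for offset in range(-2, 3):
        j = i + offset
        # Pad positions outside the sentence with a sentinel token.
        feats[f"token[{offset}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    tok = tokens[i]
    feats["isCapitalized"] = tok[:1].isupper()
    feats["prefix3"] = tok[:3]
    feats["suffix3"] = tok[-3:]
    return feats
```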
3.1 Feature Generalization with Word Clusters
• Build a word hierarchy from a 10M-word corpus (source + target), using the Brown word clustering algorithm
• Represent each word as a bit string
3.1 Feature Generalization with Word Clusters
• Add an additional layer of features that includes word clusters
  • currentToken = John
  • currentPrefix3 = 100 (this cluster-prefix feature also fires for target words)
• To avoid commitment to a single cluster: cut the word hierarchy at different levels
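Cutting the hierarchy at several depths can be implemented by emitting one bit-string prefix feature per depth, so the model never commits to a single level. The bit strings and prefix lengths below are toy values, not real Brown clusters.

```python
# Toy word -> Brown-cluster bit-string mapping (illustrative values only).
CLUSTERS = {"John": "1000110", "Abdul": "1000111", "the": "0110"}

def cluster_features(word, prefix_lengths=(4, 6, 10)):
    """One feature per prefix length; prefixes shared by source and
    target words let the same feature fire in both domains."""
    bits = CLUSTERS.get(word)
    if bits is None:
        return {}
    return {f"clusterPrefix{n}": bits[:n] for n in prefix_lengths}
```

Note how `John` and `Abdul` differ as full bit strings but share the prefix `1000`, so a weight learned for that prefix on source names transfers to target names.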
3.1 Feature Generalization with Word Clusters
• Performance on the target domain
  • Test set contains 23K tokens
  • PERSON/ORGANIZATION/GPE: 771/585/559 instances
  • All other tokens belong to the not-a-name class
  • F-measure improves by 4 points
3.2 Instance Selection Based on Multiple Criteria
• Single-domain bootstrapping uses a confidence measure as the single selection criterion
• In a cross-domain setting, the most confidently labeled instances
  • are highly correlated with the source domain
  • contain little information about the target domain
• We propose multiple criteria
• Criterion 1: Novelty – prefer target-specific instances
  • Promote Abdul instead of John
3.2 Instance Selection Based on Multiple Criteria
• Criterion 2: Confidence – prefer confidently labeled instances
• Local confidence: based on local features
  • I := instance, v := feature vector for I, ci := name class i
  • LocalConf(I) = −Σi p(ci|v) log p(ci|v)
  • Minimum 0, when one name class is predicted with probability 1, e.g., p(ci|v) = 1
  • Maximum when the predictions are evenly distributed over all the name classes
  • The lower the value, the more confident the instance is
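Under the standard entropy reading of this description (0 when one class gets probability 1, maximal when the distribution is uniform), local confidence can be computed as:

```python
import math

def local_confidence(probs):
    """Entropy of the class distribution p(c_i | v).
    0 when one class has probability 1; maximal (log K over K classes)
    when uniform. Lower values mean a more confident instance."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```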
3.2 Instance Selection Based on Multiple Criteria
• Criterion 2: Confidence
  • Global confidence: based on corpus statistics
  • e.g., P(Abdul is a PER) = 0.9
3.2 Instance Selection Based on Multiple Criteria
• Criterion 2: Confidence
  • Global confidence
  • Combined confidence: product of local and global confidence
  • The lower the entropy, the more confident the instance is
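One plausible reading of "product of local and global confidence" treats both as entropies, one over the model's class distribution for the mention and one over the corpus-level class distribution of the same word; the exact combination in the paper may differ, so treat this as a sketch.

```python
import math

def entropy(probs):
    """Entropy of a probability distribution; lower = more peaked."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def combined_confidence(local_probs, global_probs):
    """Sketch: product of local confidence (entropy of the model's class
    distribution for this mention) and global confidence (entropy of the
    word's class distribution over the whole corpus). Lower = better."""
    return entropy(local_probs) * entropy(global_probs)
```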
3.2 Instance Selection Based on Multiple Criteria
• Criterion 3: Density – prefer representative instances, which can be seen as centroid instances
  • Density(i) = (1 / (N − 1)) Σj≠i Sim(i, j): the average similarity between i and all other instances j
  • Sim(i, j): Jaccard similarity between the feature vectors of the two instances
  • N: the total number of instances in the corpus
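Density as average Jaccard similarity can be sketched directly from the definition; instances are represented here simply as sets of feature values.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two feature sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def density(i, instances):
    """Average similarity between instance i and all other instances j,
    over the N instances in the corpus (N - 1 comparisons)."""
    n = len(instances)
    if n < 2:
        return 0.0
    return sum(jaccard(instances[i], instances[j])
               for j in range(n) if j != i) / (n - 1)
```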
3.2 Instance Selection Based on Multiple Criteria
• Criterion 4: Diversity – prefer a set of diverse instances instead of similar ones
  • Example context: “, said * in his”
    • A highly confident instance
    • A high-density, representative instance
    • But continuing to promote such instances would not gain additional benefit
  • diff(i, j) := difference between instances i and j
  • Require only a small value of diff(i, j): dense instances still have a higher chance to be selected, while a certain degree of diversity is achieved at the same time
3.2 Instance Selection Based on Multiple Criteria
• Putting all criteria together
  • Novelty: filter out source-dependent instances
  • Confidence: rank instances by confidence; the top-ranked instances form a candidate set
  • Density: rank instances in the candidate set in descending order of density
  • Diversity:
    • accept the first instance (with the highest density) in the candidate set
    • select other candidates based on the diff measure
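The density and diversity steps combine into a greedy pass over the candidate set: take the densest candidate first, then accept each further candidate only if it differs enough from everything already selected. `density_of`, `diff`, `min_diff`, and `k` are illustrative names, not the paper's.

```python
def select_diverse(candidates, density_of, diff, min_diff=0.05, k=100):
    """Greedy selection from a confidence-filtered candidate set:
    rank by density, accept the first (densest) instance, then accept a
    candidate only if diff to every selected instance is >= min_diff."""
    ranked = sorted(candidates, key=density_of, reverse=True)
    selected = []
    for cand in ranked:
        if all(diff(cand, s) >= min_diff for s in selected):
            selected.append(cand)
        if len(selected) >= k:
            break
    return selected
```

Because `min_diff` is small, dense instances are still favored while near-duplicates of already-selected instances are skipped.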
3.2 Instance Selection Based on Multiple Criteria
• Results [results table not captured in this transcript]
4. Conclusion
• Proposed a general cross-domain bootstrapping algorithm for adapting a model trained only on a source domain to a target domain
• Improved the source model’s F score by around 7 points
• This is achieved
  • without using any annotated data from the target domain
  • without explicitly encoding any target-domain-specific knowledge into our system
• The improvement is largely due to
  • the feature generalization of the source model with word clusters
  • the multi-criteria-based instance selection method