NYU. Cross-Domain Bootstrapping for Named Entity Recognition. Ang Sun Ralph Grishman New York University July 28, 2011 Beijing, EOS, SIGIR 2011. NYU. Outline. Named Entity Recognition (NER) Domain Adaptation Problem for NER Cross-domain Bootstrapping
NYU Cross-Domain Bootstrapping for Named Entity Recognition Ang Sun Ralph Grishman New York University July 28, 2011 Beijing, EOS, SIGIR 2011
Outline
1. Named Entity Recognition (NER)
2. Domain Adaptation Problem for NER
3. Cross-domain Bootstrapping
   3.1 Feature Generalization with Word Clusters
   3.2 Instance Selection Based on Multiple Criteria
4. Conclusion
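1. Named Entity Recognition (NER)
• Two missions: identification (marking which spans are names) and classification (assigning a type to each name)
• Example: U.S. Defense Secretary Donald H. Rumsfeld discussed the resolution …
  • Identification: three spans are marked as NAME
  • Classification: the names are typed as GPE, ORG, and PERSON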
2. Domain Adaptation Problem for NER
• The NYU NER system performs well on in-domain data (F-measure 83.08)
• But it performs poorly on out-of-domain data (F-measure 65.09)
• Source domain (news articles): George Bush; Donald H. Rumsfeld; … Department of Defense …
• Target domain (reports on terrorism): Abdul Sattar al-Rishawi; Fahad bin Abdul Aziz bin Abdul Rahman Al-Saud; … Al-Qaeda in Iraq …
2. Domain Adaptation Problem for NER
• No annotated data from the target domain
• Many words are out-of-vocabulary
• Naming conventions are different:
  • Length: short vs. long
    • source: George Bush; Donald H. Rumsfeld
    • target: Abdul Sattar al-Rishawi; Fahad bin Abdul Aziz bin Abdul Rahman Al-Saud
  • Capitalization: weaker in target
  • Name variation occurs often in target
    • Shaikh, Shaykh, Sheikh, Sheik, …
We want to automatically adapt the source-domain tagger to the target domain without annotating target-domain data.
3. Cross-domain Bootstrapping
• Train a tagger from labeled source data
• Tag all unlabeled target data with the current tagger
• Select good tagged words and add these to the labeled data
• Re-train the tagger
[Diagram labels: Labeled source data; Unlabeled target data; Trained tagger; Feature Generalization; Instance Selection (Multiple Criteria); example instance: President Assad]
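The four steps above can be sketched as a generic self-training loop. This is a minimal illustration, not the authors' implementation: `bootstrap`, `train_lookup`, and `select_names` are hypothetical names, and the toy tagger (a lookup table with a capitalization fallback) only stands in for a real NER model.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (token, label); the labels here are toy values


def bootstrap(labeled: List[Example], unlabeled: List[str],
              train: Callable, select: Callable, iterations: int = 3):
    """Train, tag the target pool, select instances, re-train."""
    labeled = list(labeled)
    pool = list(unlabeled)
    tagger = train(labeled)                              # step 1
    for _ in range(iterations):
        tagged = [(tok, tagger(tok)) for tok in pool]    # step 2
        chosen = select(tagged)                          # step 3
        if not chosen:
            break                                        # nothing left to promote
        labeled.extend(chosen)
        promoted = {tok for tok, _ in chosen}
        pool = [tok for tok in pool if tok not in promoted]
        tagger = train(labeled)                          # step 4
    return tagger, labeled


def train_lookup(labeled: List[Example]) -> Callable[[str], str]:
    """Toy 'training' (assumption): memorize labels, fall back on capitalization."""
    table = dict(labeled)

    def tagger(token: str) -> str:
        if token in table:
            return table[token]
        return "NAME" if token[:1].isupper() else "O"

    return tagger


def select_names(tagged: List[Example]) -> List[Example]:
    """Toy selection (assumption): promote at most one NAME per round."""
    return [(tok, lab) for tok, lab in tagged if lab == "NAME"][:1]
```

In this sketch the selection function is the only place where the cross-domain criteria would enter; swapping `select_names` for a multi-criteria selector leaves the loop unchanged.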
3.1 Feature Generalization with Word Clusters
• The source model
  • Sequential model, assigning name classes to a sequence of tokens
  • One name type is split into two classes
    • B_PER (beginning of PERSON)
    • I_PER (continuation of PERSON)
  • Maximum Entropy Markov Model (McCallum et al., 2000)
  • Customary features
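To make the split of one name type into two classes concrete, here is a small sketch that turns name spans into per-token classes. The function name `encode_classes`, the span format, and the `"O"` label for the not-a-name class are assumptions made for illustration.

```python
from typing import List, Tuple


def encode_classes(tokens: List[str],
                   spans: List[Tuple[int, int, str]]) -> List[str]:
    """Map (start, end, type) spans, end exclusive, to B_/I_ classes per token."""
    classes = ["O"] * len(tokens)          # "O" = not-a-name (assumed label)
    for start, end, name_type in spans:
        classes[start] = f"B_{name_type}"  # beginning of the name
        for k in range(start + 1, end):
            classes[k] = f"I_{name_type}"  # continuation of the name
    return classes


tokens = ["Donald", "H.", "Rumsfeld", "discussed"]
print(encode_classes(tokens, [(0, 3, "PER")]))
# ['B_PER', 'I_PER', 'I_PER', 'O']
```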
3.1 Feature Generalization with Word Clusters
• The source/seed model
  • Customary features
    • Extracted from the context window (t_{i-2}, t_{i-1}, t_i, t_{i+1}, t_{i+2})
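A context window of two tokens on each side can be read off a sentence as in the sketch below. Only the token identities are extracted; the feature names, the padding symbol, and the function name `window_features` are illustrative assumptions, and the slide's full feature set is not reproduced here.

```python
from typing import Dict, List


def window_features(tokens: List[str], i: int, pad: str = "<PAD>") -> Dict[str, str]:
    """Return the tokens at offsets -2..+2 around position i."""
    feats = {}
    for offset in (-2, -1, 0, 1, 2):
        j = i + offset
        # Positions outside the sentence get a padding symbol.
        feats[f"t[{offset:+d}]"] = tokens[j] if 0 <= j < len(tokens) else pad
    return feats


sentence = ["Secretary", "Donald", "H.", "Rumsfeld", "discussed"]
print(window_features(sentence, 1))
```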
3.1 Feature Generalization with Word Clusters
• Build a word hierarchy from a 10M-word corpus (Source + Target), using the Brown word clustering algorithm
• Represent each word as a bit string
3.1 Feature Generalization with Word Clusters
• Add an additional layer of features that include word clusters
  • currentToken = John
  • currentPrefix3 = 100 (fires also for target words)
• To avoid commitment to a single cluster: cut the word hierarchy at different levels
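Cutting the hierarchy at several levels amounts to taking prefixes of a word's bit string at several lengths. The sketch below assumes a hypothetical `bit_strings` lookup (the bit strings shown are invented for illustration) and prefix lengths of 3 and 5; the deck itself only shows a prefix of length 3.

```python
from typing import Dict, Tuple


def cluster_features(token: str, bit_strings: Dict[str, str],
                     prefix_lengths: Tuple[int, ...] = (3, 5)) -> Dict[str, str]:
    """Lexical feature plus one cluster-prefix feature per cut level."""
    feats = {"currentToken": token}
    bits = bit_strings.get(token)
    if bits is not None:
        for n in prefix_lengths:
            feats[f"currentPrefix{n}"] = bits[:n]
    return feats


# Invented bit strings: both words fall under the same length-3 prefix.
bit_strings = {"John": "100110", "Abdul": "100101"}
print(cluster_features("John", bit_strings))
print(cluster_features("Abdul", bit_strings))
```

With these toy values, `currentToken` differs between the two words while `currentPrefix3` is shared, which is how a feature learned on a source-domain name can also fire for a target-domain name.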
3.1 Feature Generalization with Word Clusters
• Performance on the target domain
  • Test set contains 23K tokens
  • PERSON/ORGANIZATION/GPE: 771/585/559 instances
  • All other tokens belong to the not-a-name class
• F-measure improves by 4 points
3.2 Instance Selection Based on Multiple Criteria
• Single-domain bootstrapping uses a confidence measure as the single selection criterion
• In a cross-domain setting, the most confidently labeled instances
  • are highly correlated with the source domain
  • contain little information about the target domain
• We propose multiple criteria
• Criterion 1: Novelty - prefer target-specific instances
  • Promote Abdul instead of John
3.2 Instance Selection Based on Multiple Criteria
• Criterion 2: Confidence - prefer confidently labeled instances
• Local confidence: based on local features, measured as the entropy of the predicted name-class distribution, $-\sum_i p(c_i \mid v)\,\log p(c_i \mid v)$
  • I := instance
  • v := feature vector for I
  • c_i := name class i
  • Minimum: 0, when one name class is predicted with probability 1, e.g., p(c_i|v) = 1
  • Maximum: when the predictions are evenly distributed over all the name classes
• The lower the value, the more confident the instance is.
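The minimum and maximum described above are the usual properties of Shannon entropy, which the following sketch checks numerically. The function name `local_confidence` and the use of the natural logarithm are assumptions.

```python
import math
from typing import Iterable


def local_confidence(probs: Iterable[float]) -> float:
    """Entropy of a predicted class distribution; lower means more confident."""
    # Terms with p == 0 contribute nothing (the limit of p*log(p) is 0).
    return -sum(p * math.log(p) for p in probs if p > 0)


print(local_confidence([1.0, 0.0, 0.0, 0.0]))      # one class with probability 1
print(local_confidence([0.25, 0.25, 0.25, 0.25]))  # evenly distributed
```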
3.2 Instance Selection Based on Multiple Criteria
• Criterion 2: Confidence
• Global confidence: based on corpus statistics
  • Example: P(Abdul is a PER) = 0.9
3.2 Instance Selection Based on Multiple Criteria
• Criterion 2: Confidence
• Global confidence
• Combined confidence: product of local and global confidence
• The lower the entropy, the more confident the instance is.
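One way to read "corpus statistics" is to collect every label the tagger assigned to a token across the target corpus and take the entropy of that label distribution. That reading, and the names `global_confidence` and `combined_confidence`, are assumptions of this sketch; only the product rule comes from the slide.

```python
import math
from collections import Counter
from typing import Iterable, List


def entropy(probs: Iterable[float]) -> float:
    """Shannon entropy with natural logarithm; zero-probability terms are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def global_confidence(labels_for_token: List[str]) -> float:
    """Entropy of the labels a token received across the corpus (assumed definition)."""
    counts = Counter(labels_for_token)
    total = sum(counts.values())
    return entropy(c / total for c in counts.values())


def combined_confidence(local: float, global_: float) -> float:
    """Product of local and global confidence; lower means more confident."""
    return local * global_


# Toy statistics: "Abdul" tagged PER in 9 of 10 occurrences.
g = global_confidence(["PER"] * 9 + ["O"])
print(combined_confidence(0.2, g))
```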
3.2 Instance Selection Based on Multiple Criteria
• Criterion 3: Density - prefer representative instances, which can be seen as centroid instances
  • Density of i: the average similarity between i and all other instances j
  • Similarity: Jaccard similarity between the feature vectors of the two instances
  • Normalization: the total number of instances in the corpus
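A density score of this kind can be sketched with feature vectors represented as sets of active features. Averaging over the other N - 1 instances is an assumption (the exact normalization is not recoverable from this transcript), as are the function names.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: shared features over all features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def density(i: int, instances: List[Set[str]]) -> float:
    """Average similarity between instance i and every other instance."""
    sims = [jaccard(instances[i], f) for j, f in enumerate(instances) if j != i]
    return sum(sims) / len(sims) if sims else 0.0


instances = [
    {"prev=said", "next=in"},
    {"prev=said", "next=in"},
    {"prev=the", "next=of"},
]
print([density(i, instances) for i in range(len(instances))])
# [0.5, 0.5, 0.0]
```

This pairwise version is quadratic in the number of instances, so a real corpus would likely need sampling or indexing.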
3.2 Instance Selection Based on Multiple Criteria
• Criterion 4: Diversity - prefer a set of diverse instances instead of similar instances
• Example context: ", said * in his"
  • Highly confident instance
  • High-density, representative instance
  • BUT continuing to promote such instances would not bring additional benefit
• diff(i, j) := difference between instances i and j
• Use a small value for diff(i, j)
  • Dense instances still have a higher chance of being selected, while a certain degree of diversity is achieved at the same time.
3.2 Instance Selection Based on Multiple Criteria
• Putting all criteria together
  • Novelty: filter out source-dependent instances
  • Confidence: rank instances by confidence; the top-ranked instances form a candidate set
  • Density: rank instances in the candidate set in descending order of density
  • Diversity:
    • accept the first instance (with the highest density) in the candidate set
    • select other candidates based on the diff measure
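The four stages above can be chained as in the following sketch. Everything about the interface is assumed for illustration: the instance representation, the callables `is_novel`, `confidence`, `density`, and `diff`, and the parameters `top_k` and `min_diff`. In particular, requiring a candidate to differ from every already selected instance by at least `min_diff` is one plausible reading of "based on the diff measure".

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def select_instances(instances: List[T],
                     is_novel: Callable[[T], bool],
                     confidence: Callable[[T], float],
                     density: Callable[[T], float],
                     diff: Callable[[T, T], float],
                     top_k: int, min_diff: float) -> List[T]:
    # Novelty: drop source-dependent instances.
    novel = [x for x in instances if is_novel(x)]
    # Confidence: lower entropy is more confident; keep the top_k as candidates.
    candidates = sorted(novel, key=confidence)[:top_k]
    # Density: most representative candidates first.
    candidates.sort(key=density, reverse=True)
    # Diversity: the first candidate is always accepted (no prior selections);
    # later ones must differ enough from everything already selected.
    selected: List[T] = []
    for cand in candidates:
        if all(diff(cand, s) >= min_diff for s in selected):
            selected.append(cand)
    return selected
```

Keeping `min_diff` small means only near-duplicates of an accepted instance are skipped, so dense instances still dominate the selection.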
3.2 Instance Selection Based on Multiple Criteria
• Results
4. Conclusion
• Proposed a general cross-domain bootstrapping algorithm for adapting a model trained only on a source domain to a target domain
• Improved the source model's F-measure by around 7 points
• This is achieved
  • without using any annotated data from the target domain
  • without explicitly encoding any target-domain-specific knowledge into our system
• The improvement is largely due to
  • the feature generalization of the source model with word clusters
  • the multi-criteria-based instance selection method