Name Extraction from Chinese Novels
CS224n, Spring 2008
Jing Chen and Raylene Yung
Problem
• Given a Chinese novel, extract the names of people and locations
• Different from English NER: no whitespace within sentences, no capitalization
• Can use other characteristics since the domain is limited
System Outline
• Extract bigrams, trigrams, and quadrigrams from text
• Run logistic regression on extracted features to learn feature weights
• Use weights to compute a score for each n-gram
• Apply thresholding to limit the number of guessed names
• Use word lists from word segmenter and dictionary
• Compare output list to correct list for F1 score
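The first, third, and last steps of the outline can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the punctuation set used to split sentences, and the dictionary-based feature representation are all assumptions made for the example.

```python
import math
from collections import Counter

# Assumed sentence delimiters; the slides do not say how text was split.
DELIMITERS = "。，！？；：、「」『』“”\n"

def extract_ngrams(text, sizes=(2, 3, 4)):
    """Count character bigrams, trigrams, and quadrigrams within each segment."""
    for ch in DELIMITERS:
        text = text.replace(ch, " ")
    counts = Counter()
    for segment in text.split():
        for n in sizes:
            for i in range(len(segment) - n + 1):
                counts[segment[i:i + n]] += 1
    return counts

def score_ngram(features, weights, bias=0.0):
    """Logistic score from learned weights (hypothetical feature names)."""
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def f1_score(guessed, correct):
    """F1 between the guessed name list and the correct name list."""
    guessed, correct = set(guessed), set(correct)
    true_positives = len(guessed & correct)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(guessed)
    recall = true_positives / len(correct)
    return 2 * precision * recall / (precision + recall)
```

For example, `extract_ngrams("张三丰说。张三丰笑")` counts the trigram `张三丰` twice, and no n-gram spans the sentence break.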
Features
• N-gram and segmented word counts
• Ratio of count of n-gram to (n-1)-gram
• Transliterated characters
• Prefixes and suffixes
• Segmented words and dictionary
• Mutual information
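Two of the features listed above, the count ratio and mutual information, can be illustrated with a short sketch. The slides do not say which (n-1)-gram the ratio uses or how mutual information was defined for longer n-grams, so this example assumes the leading (n-1)-gram and pointwise mutual information over character bigrams; all function names are hypothetical.

```python
import math
from collections import Counter

def count_ngrams(text, n):
    """Count all character n-grams of length n in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def prefix_ratio(ngram, counts_n, counts_prev):
    """count(ngram) / count(leading (n-1)-gram); 0.0 if the prefix is unseen."""
    prev = counts_prev.get(ngram[:-1], 0)
    return counts_n.get(ngram, 0) / prev if prev else 0.0

def pmi(bigram, bigram_counts, char_counts):
    """Pointwise mutual information: log p(xy) / (p(x) p(y))."""
    total_bigrams = sum(bigram_counts.values())
    total_chars = sum(char_counts.values())
    if total_bigrams == 0 or total_chars == 0:
        return float("-inf")
    p_xy = bigram_counts.get(bigram, 0) / total_bigrams
    p_x = char_counts.get(bigram[0], 0) / total_chars
    p_y = char_counts.get(bigram[1], 0) / total_chars
    if p_xy == 0 or p_x == 0 or p_y == 0:
        return float("-inf")  # unseen events carry no evidence of association
    return math.log(p_xy / (p_x * p_y))
```

A ratio near 1 suggests the last character almost always follows its prefix, and a high PMI suggests two characters co-occur more often than chance, both of which are plausible signals for a name.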
Thresholding
• Otsu’s method:
    • Often used in image processing
    • Separates data into two classes, minimizing the variance within the classes
    • Does not depend on training data
• F1 Maximization
    • Find the threshold on training data that maximizes F1 score
    • Use same threshold on test data
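Both thresholding strategies can be sketched in a few lines. This is an illustrative version under stated assumptions: the Otsu variant below works directly on the sorted raw scores rather than on a histogram as is common in image processing, and the function names and the score dictionary layout are hypothetical.

```python
def otsu_threshold(scores):
    """Return the cut that minimizes the total within-class variance."""
    values = sorted(scores)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two scores")
    total = sum(values)
    total_sq = sum(v * v for v in values)
    best_cut, best_within = None, float("inf")
    left_sum = left_sq = 0.0
    for k in range(1, n):  # the first k values form the low class
        v = values[k - 1]
        left_sum += v
        left_sq += v * v
        if values[k] == values[k - 1]:
            continue  # cannot place a cut between equal scores
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        # Sum of squared deviations per class: sum(v^2) - (sum v)^2 / size
        within = (left_sq - left_sum ** 2 / k) \
            + (right_sq - right_sum ** 2 / (n - k))
        if within < best_within:
            best_within = within
            best_cut = (values[k - 1] + values[k]) / 2
    if best_cut is None:
        raise ValueError("all scores are identical")
    return best_cut

def best_f1_threshold(scored, gold):
    """Pick the training-set threshold with the highest F1.

    scored: dict mapping n-gram -> score; gold: set of true names.
    """
    best_t, best_f1 = None, 0.0
    for t in sorted(set(scored.values())):
        guessed = {g for g, s in scored.items() if s >= t}
        true_positives = len(guessed & gold)
        # 2PR / (P + R) simplifies to 2 * TP / (|guessed| + |gold|)
        f1 = 2 * true_positives / (len(guessed) + len(gold))
        if best_t is None or f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

The Otsu cut needs only the scores themselves, while the F1-maximizing threshold needs labeled training names and is then reused unchanged on test data.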
Results
• No validation set, so a baseline set of features was chosen
• Ablation tests show that the baseline chosen was non-optimal
• Best individual scores:
Conclusion
• Most useful features:
    • N-gram counts / frequency ratios (0.46 F1 alone)
    • Varies depending on type of n-gram
• Thresholding
    • Otsu’s method yielded better overall performance
    • Both methods had drawbacks
• Future work
    • More rigorous feature set testing
    • Larger / cleaner data sets