Name Extraction from Chinese Novels

CS224n, Spring 2008
Jing Chen and Raylene Yung

Problem
• Given a Chinese novel, extract the names of people and locations
• Different from English NER: no whitespace within sentences, no capitalization
• Can use other characteristics since the domain is limited

System Outline
• Extract bigrams, trigrams, and quadrigrams from text
• Run logistic regression on extracted features to learn feature weights
• Use weights to compute a score for each n-gram
• Apply thresholding to limit the number of guessed names
• Use word lists from word segmenter and dictionary
• Compare output list to correct list for F1 score
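The first and last steps of the outline can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `extract_ngrams` and `f1_score` are hypothetical, and the sketch assumes n-grams are taken over the raw character stream without splitting at punctuation, which the slides do not specify.

```python
from collections import Counter


def extract_ngrams(text, sizes=(2, 3, 4)):
    """Count every character n-gram of the given sizes in `text`.

    Assumption: the text is treated as one character stream, so
    n-grams may span punctuation.
    """
    counts = Counter()
    for n in sizes:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def f1_score(guessed, correct):
    """F1 between a guessed name list and the correct name list."""
    guessed, correct = set(guessed), set(correct)
    true_positives = len(guessed & correct)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(guessed)
    recall = true_positives / len(correct)
    return 2 * precision * recall / (precision + recall)


# Illustrative input only; "张三" is a placeholder name.
counts = extract_ngrams("张三说张三来了")
print(counts["张三"])                       # 2
print(f1_score({"张三", "三说"}, {"张三"}))  # precision 0.5, recall 1.0
```

Counting all three n-gram sizes in one pass keeps the candidate list in a single table, which is convenient when each candidate later receives a score.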

Features
• N-gram and segmented word counts
• Ratio of count of n-gram to (n-1)-gram
• Transliterated characters
• Prefixes and suffixes
• Segmented words and dictionary
• Mutual information
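Two of the count-based features above can be illustrated with a short sketch. The slides do not give exact definitions, so the following are assumptions: the ratio is computed against the n-gram's leading (n-1)-gram, and mutual information is shown as pointwise mutual information for bigrams only. The names `count_ngrams`, `prefix_ratio`, and `bigram_pmi` are hypothetical.

```python
import math
from collections import Counter


def count_ngrams(text, n):
    """Count character n-grams of a single size n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def prefix_ratio(ngram, counts_n, counts_prev):
    """count(n-gram) / count(its leading (n-1)-gram).

    Assumption: the (n-1)-gram is the prefix of the n-gram.
    """
    prev = counts_prev[ngram[:-1]]
    return counts_n[ngram] / prev if prev else 0.0


def bigram_pmi(bigram, unigrams, bigrams):
    """Pointwise mutual information log(p(xy) / (p(x) p(y)))."""
    if bigrams[bigram] == 0:
        return float("-inf")  # unseen bigram
    p_xy = bigrams[bigram] / sum(bigrams.values())
    total = sum(unigrams.values())
    p_x = unigrams[bigram[0]] / total
    p_y = unigrams[bigram[1]] / total
    return math.log(p_xy / (p_x * p_y))


text = "abab"
unigrams, bigrams = count_ngrams(text, 1), count_ngrams(text, 2)
print(prefix_ratio("ab", bigrams, unigrams))  # 1.0: every "a" is followed by "b"
print(prefix_ratio("ba", bigrams, unigrams))  # 0.5
print(bigram_pmi("ab", unigrams, bigrams))    # log(8/3), positive
```

A ratio near 1 suggests the n-gram almost always completes its prefix, which is the kind of signal that can indicate a fixed unit such as a name.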

Thresholding
• Otsu’s method:
  • Often used in image processing
  • Separates data into two classes, minimizing the variance within the classes
  • Does not depend on training data
• F1 Maximization
  • Find the threshold on training data that maximizes F1 score
  • Use same threshold on test data
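Both thresholding strategies can be sketched in a few lines. This is an illustration under stated assumptions rather than the system's actual code: Otsu's method is usually applied to an image histogram, whereas this version works directly on a list of raw n-gram scores, and the names `otsu_threshold` and `best_f1_threshold` are hypothetical.

```python
def _sum_sq_dev(xs):
    """Sum of squared deviations from the mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)


def otsu_threshold(scores):
    """Cut that minimizes the weighted within-class variance.

    The weighted within-class variance equals (SS_low + SS_high) / n,
    so minimizing the summed squared deviations is equivalent.
    Returns None if all scores are identical.
    """
    values = sorted(scores)
    best_cut, best_ss = None, float("inf")
    for k in range(1, len(values)):
        if values[k] == values[k - 1]:
            continue  # no cut can separate equal scores
        ss = _sum_sq_dev(values[:k]) + _sum_sq_dev(values[k:])
        if ss < best_ss:
            best_ss = ss
            best_cut = (values[k - 1] + values[k]) / 2
    return best_cut


def best_f1_threshold(scored, correct):
    """Threshold on training data that maximizes F1.

    scored: dict mapping n-gram -> score; correct: set of true names.
    An n-gram is guessed as a name when its score is >= the threshold.
    """
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scored.values())):
        guessed = {g for g, s in scored.items() if s >= t}
        tp = len(guessed & correct)
        if tp == 0:
            f1 = 0.0
        else:
            precision, recall = tp / len(guessed), tp / len(correct)
            f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


print(otsu_threshold([0.1, 0.2, 0.15, 0.9, 0.95]))  # about 0.55
print(best_f1_threshold({"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.1}, {"a", "b"}))
```

The first function needs only the scores themselves, which matches the point that Otsu's method does not depend on training data; the second needs a labeled list and its threshold is then reused unchanged on test data.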

Results
• No validation set, so chose a baseline set
• Ablation tests show that the baseline chosen was non-optimal
• Best individual scores:

Conclusion
• Most useful features:
  • N-gram counts / frequency ratios (0.46 F1 alone)
  • Varies depending on type of n-gram
• Thresholding
  • Otsu’s method yielded better overall performance
  • Both methods had drawbacks
• Future work
  • More rigorous feature set testing
  • Larger / cleaner data sets
