Using N-gram and Word Network Features for Native Language Identification

Using N-gram and Word Network Features for Native Language Identification ShibamouliLahiriRadaMihalcea Computer Science and Engineering, University of North Texas, Denton, TX 76207, USA shibamoulilahiri@my.unt.edu, rada@cs.unt.edu • Native Language Identification • Native speakers of L1 speaking L2. • Classify L2 samples according to L1. • Related to author profiling and authorship attribution. Results on Training Set (10-fold Cross-validation Accuracy (%)) A Word Network brown jumped fox Dataset TOEFL11 corpus contains 12,100 English essays (L2), of which 9,900 are for training, 1,100 are for test, and 1,100 are for validation. There are 11 L1’s (Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu and Turkish). over the quick lazy dog • Word Network Features • Degree, coreness, neighborhood size and clustering coefficient of words. • Explored variations including directed and undirected versions of the above. • Most frequent 100, 200, 500, 1000 words on the train + development set. • N-gram Features • Word, POS, and character n-grams • (n = 1, 2, 3). • Explored variations including and excluding punctuations and spaces. • Most frequent 100, 200, 500, 1000 n-grams on the train + development set. Information Gain Ranking of Word Network Features on Training Set

Using N-gram and Word Network Features for Native Language Identification

Using N-gram and Word Network Features for Native Language Identification

Presentation Transcript

N-gram Models

N-Gram Language Models

Identification Of Gram Positive Bacilli

n-gram analysis

Smoothing N-gram Language Models

Statistical language modeling combining n-gram and dependency grammar

Topic-Dependent-Class-Based N-Gram Language Model

MERGING SEGMENTAL AND RHYTHMIC FEATURES FOR AUTOMATIC LANGUAGE IDENTIFICATION

Language and Speaker Identification using Gaussian Mixture Model

Discriminative n-gram language modeling

Using Word Based Features for Word Clustering

N-gram Models

Real-Word Spelling Correction using Google Web 1T n -gram with Backoff

Language Identification

N-gram Tokenization for Indian Language Text Retrieval

Word Identification

Word Identification

Using Fingerprints in n-Gram Indices

N-gram Models

CS 388: Natural Language Processing: N-Gram Language Models