1 / 1

Using N-gram and Word Network Features for Native Language Identification

Using N-gram and Word Network Features for Native Language Identification Shibamouli Lahiri Rada Mihalcea Computer Science and Engineering, University of North Texas, Denton, TX 76207, USA shibamoulilahiri@my.unt.edu, rada@cs.unt.edu. Native Language Identification

kaloni
Download Presentation

Using N-gram and Word Network Features for Native Language Identification

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Using N-gram and Word Network Features for Native Language Identification ShibamouliLahiriRadaMihalcea Computer Science and Engineering, University of North Texas, Denton, TX 76207, USA shibamoulilahiri@my.unt.edu, rada@cs.unt.edu • Native Language Identification • Native speakers of L1 speaking L2. • Classify L2 samples according to L1. • Related to author profiling and authorship attribution. Results on Training Set (10-fold Cross-validation Accuracy (%)) A Word Network brown jumped fox Dataset TOEFL11 corpus contains 12,100 English essays (L2), of which 9,900 are for training, 1,100 are for test, and 1,100 are for validation. There are 11 L1’s (Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu and Turkish). over the quick lazy dog • Word Network Features • Degree, coreness, neighborhood size and clustering coefficient of words. • Explored variations including directed and undirected versions of the above. • Most frequent 100, 200, 500, 1000 words on the train + development set. • N-gram Features • Word, POS, and character n-grams • (n = 1, 2, 3). • Explored variations including and excluding punctuations and spaces. • Most frequent 100, 200, 500, 1000 n-grams on the train + development set. Information Gain Ranking of Word Network Features on Training Set

More Related