Classifying Reading Levels with Statistical Language Models
Johnson Hsieh, Sameer Shariff
Problem
• Given a passage of English text, can we classify the text at the appropriate reading level?
• Dataset: English novels from various reading lists
  • 5th-7th grade: 15 books
    • Tom Sawyer, Black Beauty, etc.
  • 8th-10th grade: 16 books
    • A Tale of Two Cities, The Call of the Wild, etc.
  • 11th-12th grade: 17 books
    • Pride and Prejudice, The Awakening, etc.
Approach
• Build a language model for each class
• Classify new text by the model most likely to have generated it (generative model)
• Model 1
  • Classify text based purely on these language models, with some interesting smoothing techniques
• Model 2
  • Build a discriminative multinomial logistic regression model that uses the language models as just one of many features
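The generative idea behind Model 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes unigram models with add-alpha smoothing (the slides do not say which model order or smoothing was used), and the training snippets, class labels, and function names are all made up for the example.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Return (token counts, total token count) for one reading-level class."""
    counts = Counter(tokens)
    return counts, sum(counts.values())


def log_likelihood(tokens, model, vocab_size, alpha=1.0):
    """Log-probability of tokens under an add-alpha smoothed unigram model.

    Add-alpha smoothing is an assumption; the slides only mention
    "interesting smoothing techniques" without naming them.
    """
    counts, total = model
    denom = total + alpha * vocab_size
    return sum(math.log((counts[t] + alpha) / denom) for t in tokens)


def classify(tokens, models, vocab_size):
    """Pick the class whose language model gives the text the highest likelihood."""
    return max(models, key=lambda c: log_likelihood(tokens, models[c], vocab_size))


# Toy, invented training text per class (not the authors' dataset).
training = {
    "5-7": "the dog ran to the boy and the boy ran home".split(),
    "8-10": "the storm howled across the frozen wilderness that night".split(),
    "11-12": "her disposition betrayed an acute sensibility toward propriety".split(),
}

vocab = {t for toks in training.values() for t in toks}
vocab_size = len(vocab) + 1  # one extra slot reserved for unseen words
models = {c: train_unigram(toks) for c, toks in training.items()}

predicted = classify("the boy ran to the dog".split(), models, vocab_size)
print(predicted)
```

With real books, each class model would be trained on the concatenated text of that class's training books, and a whole held-out book would be scored at once.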
Data Separability • A hard problem
Language Model Results
Accuracy = (# of books predicted correctly) / (total # of books)
Weighted Accuracy = ((# predicted correctly) + 0.5 * (# off by one)) / (total # of books)
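The two metrics can be computed as in the sketch below. It assumes that "off by one" means the predicted class is adjacent to the true class in the ordering of reading levels; the label strings and example predictions are invented for illustration.

```python
# Reading levels in increasing order; the label strings are hypothetical.
LEVELS = ["5-7", "8-10", "11-12"]


def accuracy(true_labels, predicted_labels):
    """Fraction of books whose reading level was predicted exactly."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def weighted_accuracy(true_labels, predicted_labels):
    """Full credit for exact predictions, half credit for adjacent levels."""
    score = 0.0
    for t, p in zip(true_labels, predicted_labels):
        distance = abs(LEVELS.index(t) - LEVELS.index(p))
        if distance == 0:
            score += 1.0
        elif distance == 1:
            score += 0.5
    return score / len(true_labels)


# Made-up example: two exact hits, one off by one, one off by two.
true = ["5-7", "8-10", "11-12", "5-7"]
pred = ["5-7", "11-12", "11-12", "11-12"]

print(accuracy(true, pred))           # 0.5
print(weighted_accuracy(true, pred))  # 0.625
```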
Multinomial Logistic Regression Results • Without language model: • With language model:
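A multinomial logistic regression that takes language model scores as input features might look like the sketch below. Everything specific here is an assumption: the slides do not list the features used, so the feature vector (a bias term plus one normalized language model score per class) and the toy data are hypothetical, and plain batch gradient descent stands in for whatever optimizer the authors used.

```python
import math


def softmax(scores):
    """Convert raw class scores to probabilities (shifted for numerical stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def predict_proba(weights, x):
    """Class probabilities for feature vector x; weights holds one vector per class."""
    return softmax([sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights])


def train(X, y, n_classes, lr=0.5, epochs=300):
    """Fit softmax regression by batch gradient descent on the cross-entropy loss."""
    n_features = len(X[0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(epochs):
        grads = [[0.0] * n_features for _ in range(n_classes)]
        for x, label in zip(X, y):
            probs = predict_proba(weights, x)
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == label else 0.0)
                for j in range(n_features):
                    grads[k][j] += err * x[j]
        for k in range(n_classes):
            for j in range(n_features):
                weights[k][j] -= lr * grads[k][j] / len(X)
    return weights


def predict(weights, x):
    """Index of the most probable class."""
    probs = predict_proba(weights, x)
    return probs.index(max(probs))


# Hypothetical features per book: [bias, LM score for class 0, class 1, class 2].
# The values are invented; real scores would come from the per-class language models.
X = [
    [1.0, 0.8, 0.1, 0.1],
    [1.0, 0.7, 0.2, 0.1],
    [1.0, 0.2, 0.7, 0.1],
    [1.0, 0.1, 0.8, 0.1],
    [1.0, 0.1, 0.2, 0.7],
    [1.0, 0.1, 0.1, 0.8],
]
y = [0, 0, 1, 1, 2, 2]  # 0 = 5th-7th, 1 = 8th-10th, 2 = 11th-12th

weights = train(X, y, n_classes=3)
train_predictions = [predict(weights, x) for x in X]
print(train_predictions)
```

Dropping the three language model columns from each row would give the "without language model" configuration, with other features (not specified in the slides) taking their place.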
Conclusions and Future Work
• Statistical language models capture information that helps distinguish reading levels, and do so better than traditional measures such as Flesch-Kincaid
• Multinomial logistic regression models with additional features outperform the pure language model approach, though using the language model as a feature greatly improves performance
• Future Work
  • Explore higher-order language models
  • Investigate language model overfitting