This study explores the classification of individuals based on their writing style in political blogs. It investigates whether people under 25 use different punctuation and words, and whether political ideologies can be determined by analyzing writing using probabilistic methods. The study includes classification, feature vector generation, and clustering results.
Political Party, Gender, and Age Classification Based on Political Blogs Michelle Hewlett and Elizabeth Lingg
Introduction • Can individuals be classified by their writing style? • Do people under 25 use different punctuation than those over 25? • Do they use different words and phrases? • Can a writer’s political ideology be inferred from their writing using probabilistic methods?
Classifier • Hold-out cross-validation • 80% of data in training set • 20% of data in test set • Classify bloggers using a feature vector • Features generated from the training data only
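The 80/20 hold-out split described on this slide could be sketched as follows. This is a minimal illustration, not the authors' code: the function name `hold_out_split`, the fixed random seed, and the blogger IDs are all assumptions made for the example.

```python
import random

def hold_out_split(items, train_fraction=0.8, seed=0):
    """Shuffle the items and split them into (train, test) lists."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible (assumption: any seed works).
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical blogger IDs standing in for the real data set.
bloggers = [f"blogger_{i}" for i in range(10)]
train, test = hold_out_split(bloggers)
```

Features would then be generated from `train` only, so the test bloggers do not influence which features are picked.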
Features • Most frequent unigrams, bigrams, trigrams • “Bush”, “troops in Iraq”, “McCain” • Sentence length, Word length • Punctuation • Pronoun usage
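One way the feature families listed on this slide might be extracted from a blog post is sketched below. The tokenization rules, the pronoun list, and the name `extract_features` are assumptions for illustration; the slides do not say how the authors tokenized text or which pronouns they counted.

```python
import re
import string
from collections import Counter

# Hypothetical pronoun list; the original feature set is not given in the slides.
PRONOUNS = {"i", "me", "my", "myself", "we", "us", "our", "you", "he", "she", "they"}

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(text):
    """Compute n-gram counts and simple stylistic statistics for one text."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "unigrams": Counter(ngrams(words, 1)),
        "bigrams": Counter(ngrams(words, 2)),
        "trigrams": Counter(ngrams(words, 3)),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
        "punctuation": Counter(ch for ch in text if ch in string.punctuation),
        "pronoun_rate": sum(w in PRONOUNS for w in words) / max(len(words), 1),
    }

features = extract_features("We support the troops in Iraq. Do you?")
```

Summing the n-gram counters over all training posts and taking the most common entries would give the "most frequent" unigrams, bigrams, and trigrams.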
Features • Compute feature probabilities based on frequency in the training data • If women use the word “myself” three times as often as men do, then P(female|myself) = 3/(3+1) = 75% (treating the two classes as equally likely beforehand) • Keep only features that are not split 50/50 between male and female or between Republican and Democrat
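The “myself” example can be reproduced with a small calculation. The sketch below assumes equal class priors and compares per-word usage rates; the counts, the `margin` cutoff for discarding near-50/50 features, and both function names are hypothetical, since the slides give no exact selection rule.

```python
def feature_probability(count_a, total_a, count_b, total_b):
    """Estimate P(class A | feature) from usage rates, assuming equal class priors."""
    rate_a = count_a / total_a
    rate_b = count_b / total_b
    if rate_a + rate_b == 0:
        return 0.5  # feature never seen: no evidence either way
    return rate_a / (rate_a + rate_b)

def is_informative(probability, margin=0.1):
    """Keep a feature only if it is not close to a 50/50 split (margin is an assumption)."""
    return abs(probability - 0.5) >= margin

# Hypothetical counts: women use "myself" 30 times per 10,000 words, men 10 times.
p_female = feature_probability(30, 10_000, 10, 10_000)  # approximately 0.75
keep = is_informative(p_female)
```

Normalizing by the total word count of each class matters when the training data contains more text from one group than the other.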
Classification • Bloggers were classified using the feature vector • Writers with a low probability of being Republican were classified as Democrat • Writers with a high probability of being Republican were classified as Republican • Writers with a moderate probability were left unclassified (“Unknown”)
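The three-way decision on this slide might look like the sketch below. The slides do not say how individual feature probabilities were combined or where the cutoffs lay, so the simple average and the `low` and `high` thresholds here are assumptions chosen only to illustrate the idea.

```python
def classify(feature_probs, low=0.4, high=0.6):
    """Label a writer from per-feature P(Republican | feature) values.

    Assumption: the score is the plain average of the probabilities of the
    features found in the writer's text; thresholds are illustrative.
    """
    if not feature_probs:
        return "Unknown"
    score = sum(feature_probs) / len(feature_probs)
    if score >= high:
        return "Republican"
    if score <= low:
        return "Democrat"
    return "Unknown"

labels_demo = [
    classify([0.8, 0.7, 0.9]),   # strongly Republican-leaning features
    classify([0.2, 0.3]),        # strongly Democrat-leaning features
    classify([0.5, 0.55]),       # ambiguous evidence
]
```

Widening the gap between `low` and `high` trades coverage for accuracy: more writers fall into “Unknown”, but the remaining labels rest on stronger evidence.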
Clustering • K-means clustering algorithm run on the entire data set • Used the sum of absolute differences (L1 distance) instead of Euclidean distance because the differences between feature values were so small • Initialized centroids to a reasonable guess
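A k-means loop that assigns points by sum of absolute differences could be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the data points and initial centroids are made up, and the centroid update uses the ordinary mean as in standard k-means (the per-dimension median is the usual choice when strictly minimizing L1 distance).

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point under L1 distance."""
    return min(range(len(centroids)), key=lambda k: l1_distance(point, centroids[k]))

def kmeans_l1(points, initial_centroids, iterations=20):
    """Cluster points with L1 assignment and mean centroid updates."""
    centroids = [list(c) for c in initial_centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Empty clusters keep their previous centroid.
        updated = [
            [sum(dim) / len(members) for dim in zip(*members)] if members else centroids[k]
            for k, members in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments have stabilized
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]

# Hypothetical two-feature vectors with small differences, as on the slide.
points = [(0.10, 0.12), (0.11, 0.10), (0.30, 0.31), (0.29, 0.33)]
initial = [(0.1, 0.1), (0.3, 0.3)]  # "reasonable guess" starting centroids
centroids, labels = kmeans_l1(points, initial)
```

Starting from a reasonable guess rather than random centroids makes the result deterministic and reduces the chance of converging to a poor local optimum.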
Clustering Results • [Figure: political party clusters. Legend: o = Cluster 1, * = Cluster 2, plotted separately for Democrat, Republican, and Unknown bloggers]
Clustering Results • [Figure: gender clusters. Legend: o = Cluster 1, * = Cluster 2, plotted separately for Male, Female, and Unknown bloggers]
Conclusion • It is possible to identify the characteristics of a writer based on writing style, words and phrases! • Political Party gave the best results, followed by Gender, then Age
Future Work • Generalize results with a larger data set and a greater number of features • Test whether the results carry over to a different domain • Possibly implement linear regression, logistic regression, or an SVM