Political Party, Gender, and Age Classification Based on Political Blogs

This study explores whether bloggers can be classified by political party, gender, and age from their writing style in political blogs. It asks whether people under 25 use different punctuation, words, and phrases than those over 25, and whether a writer's political ideology can be inferred from their writing using probabilistic methods. The presentation covers feature vector generation, classification, and clustering results.

Presentation Transcript


  1. Political Party, Gender, and Age Classification Based on Political Blogs • Michelle Hewlett and Elizabeth Lingg

  2. Introduction • Can individuals be classified by their writing style? • Do people under 25 use different punctuation than those over 25? • Do they use different words and phrases? • Can you figure out someone’s political ideologies by analyzing their writing using probabilistic methods?

  3. Classifier • Hold Out Cross Validation • 80% of Data in Training Set • 20% of Data in Test Set • Classify Bloggers using a Feature Vector • Features generated from training data
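The 80/20 hold-out split on this slide can be sketched as below. This is a minimal illustration, not the authors' code: the function name `hold_out_split`, the fixed `seed`, and the placeholder blogger identifiers are all assumptions made for the example.

```python
import random


def hold_out_split(bloggers, train_fraction=0.8, seed=0):
    """Shuffle the items once and split them into (train, test).

    `train_fraction` and `seed` are illustrative defaults, not values
    taken from the slides beyond the stated 80/20 ratio.
    """
    shuffled = list(bloggers)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data: ten placeholder blogger identifiers.
all_bloggers = [f"blogger_{i}" for i in range(10)]
train, test = hold_out_split(all_bloggers)
print(len(train), len(test))  # 8 2
```

Features would then be generated from `train` only, so that `test` stays unseen until evaluation.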

  4. Features • Most frequent unigrams, bigrams, trigrams • “Bush”, “troops in Iraq”, “McCain” • Sentence length, Word length • Punctuation • Pronoun usage
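One way to extract the kinds of features listed on this slide is sketched below. The tokenizer, the pronoun list, and the name `extract_features` are assumptions chosen for illustration; the slides do not say how the authors tokenized text or which pronouns they counted.

```python
import re
from collections import Counter

# Assumed, deliberately small pronoun list for illustration only.
PRONOUNS = {"i", "me", "my", "myself", "we", "us", "our",
            "you", "your", "he", "she", "they", "them"}


def ngrams(tokens, n):
    """Return all contiguous n-token phrases as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text):
    """Compute n-gram counts and simple style statistics for one text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for n in (1, 2, 3):  # unigrams, bigrams, trigrams
        # Simplification: n-grams may span sentence boundaries here.
        counts.update(ngrams(tokens, n))
    return {
        "ngram_counts": counts,
        "avg_sentence_length": len(tokens) / max(len(sentences), 1),
        "avg_word_length": sum(map(len, tokens)) / max(len(tokens), 1),
        "punctuation_counts": Counter(ch for ch in text if ch in ".,;:!?"),
        "pronoun_count": sum(1 for t in tokens if t in PRONOUNS),
    }


features = extract_features("We support the troops in Iraq. Do you?")
print(features["ngram_counts"]["troops in iraq"])
print(features["avg_sentence_length"])
```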

  5. Features • Compute feature probabilities based on frequency in the training data • If women use the word “myself” three times as often as men do, then (treating the two groups as equally represented) P(female|myself) = 3 / (3 + 1) = 75% • Pick features which are not 50/50 male/female or 50/50 Republican/Democrat
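The “myself” example can be worked through in a few lines. This sketch assumes the two classes are equally represented, so the probability reduces to a ratio of usage rates; the helper names and the `margin` used to decide that a feature is "not 50/50" are hypothetical, since the slides give no cutoff.

```python
def class_probability(rate_a, rate_b):
    """P(class A | feature) from per-class usage rates, assuming equal priors."""
    total = rate_a + rate_b
    if total == 0:
        return 0.5  # feature never seen: no evidence either way
    return rate_a / total


def is_informative(probability, margin=0.1):
    """Keep features whose probability is not close to 50/50.

    `margin` is an assumed threshold, not a value from the slides.
    """
    return abs(probability - 0.5) >= margin


# Women use "myself" three times as often as men (relative rates 3 and 1).
p_female_given_myself = class_probability(3, 1)
print(p_female_given_myself)  # 0.75
print(is_informative(p_female_given_myself))  # True
```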

  6. Classification • Using the feature vector, bloggers with a low probability of being a Republican were classified as Democrat • Writers with a high probability of being a Republican were classified as Republican • Writers with a moderate probability were left unclassified (“Unknown”)
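The three-way decision rule on this slide might look like the sketch below. The slides do not state the cutoffs or how per-feature probabilities were combined into one score, so the `low` and `high` thresholds and the function name `classify_party` are assumptions, and the input is taken to be an already-computed probability.

```python
def classify_party(p_republican, low=0.4, high=0.6):
    """Map a probability of being Republican to a label.

    `low` and `high` are hypothetical cutoffs chosen for illustration.
    """
    if p_republican <= low:
        return "Democrat"
    if p_republican >= high:
        return "Republican"
    return "Unknown"  # moderate probability: leave unclassified


for p in (0.2, 0.5, 0.9):
    print(p, classify_party(p))
```

Widening the gap between `low` and `high` would trade coverage for confidence: more writers fall into "Unknown", but the remaining labels rest on stronger evidence.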

  7. Classifier Results

  8. Classifier Results

  9. Classifier Results

  10. Clustering • K-means clustering algorithm used with entire data set • Used sum of absolute differences instead of Euclidean distance because our differences were so small • Initialized centroids to a reasonable guess
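A minimal sketch of k-means with the sum of absolute differences (L1 distance) in place of Euclidean distance follows. It keeps the usual mean update for centroids, as the slide's wording suggests; a median update (k-medians) is the variant that strictly minimizes L1 cost, and the slides do not say which was used. The sample points, initial centroids, and iteration count are made up for the example.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def kmeans_l1(points, initial_centroids, iterations=20):
    """Cluster points using L1 distance for assignment and means for updates."""
    centroids = [list(c) for c in initial_centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: nearest centroid under L1 distance.
        assignments = [
            min(range(len(centroids)),
                key=lambda k: l1_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Hypothetical 2-D feature vectors and a "reasonable guess" for the centroids.
points = [(0.10, 0.20), (0.12, 0.18), (0.80, 0.90), (0.82, 0.88)]
assignments, centroids = kmeans_l1(points, [(0.0, 0.0), (1.0, 1.0)])
print(assignments)  # [0, 0, 1, 1]
```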

  11. Clustering Results • Plot legend: o = Cluster 1, * = Cluster 2, shown separately for Democrat, Republican, and Unknown bloggers

  12. Clustering Results • Plot legend: o = Cluster 1, * = Cluster 2, shown separately for Male, Female, and Unknown bloggers

  13. Conclusion • It is possible to identify the characteristics of a writer based on writing style, words and phrases! • Political Party gave the best results, followed by Gender, then Age

  14. Future Work • Generalize results with a larger data set and a greater number of features • Generalize results to a different domain • Possibly implement linear regression, logistic regression, or SVMs