This presentation outlines a process for clustering entities in leaked U.S. diplomatic cables: identify the entities, extract features around them (co-occurring entities, sentiment scores, adjectives), and cluster them with K-Means and a Multinomial Mixture Model.
Unsupervised Clustering of People, Places & Organizations in U.S. Diplomatic Cables Xuwen Cao Beyang Liu
Process Outline • Identify entities in 3891 leaked U.S. diplomatic cables published by WikiLeaks • Extract features from window around entities • Sentiment scores • Co-occurring entities • Adjectives in some fixed-size window • Cluster entities in feature space
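The feature-extraction step above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the cable is already tokenized and POS-tagged, that entity spans are known, and that a small adjective-to-score lexicon is available. The function name `window_features`, the `lexicon` argument, and the toy sentence are all hypothetical.

```python
from collections import Counter, defaultdict

def window_features(tagged, entities, window=5, lexicon=None):
    """Collect adjective and co-occurrence features around each entity.

    tagged   : list of (token, POS tag) pairs for one cable
    entities : list of (start, end, name) token spans, end exclusive
    lexicon  : hypothetical {adjective: sentiment score} lookup
    """
    lexicon = lexicon or {}
    feats = defaultdict(Counter)      # entity -> feature counts
    sentiment = defaultdict(float)    # entity -> summed adjective scores
    for start, end, name in entities:
        lo = max(0, start - window)
        hi = min(len(tagged), end + window)
        for i in range(lo, hi):
            if start <= i < end:
                continue  # skip the entity's own tokens
            word, tag = tagged[i]
            if tag.startswith("JJ"):  # adjectives: JJ, JJR, JJS
                feats[name]["adj=" + word.lower()] += 1
                sentiment[name] += lexicon.get(word.lower(), 0.0)
        for s2, e2, other in entities:
            # another entity whose span overlaps the window co-occurs
            if other != name and s2 < hi and e2 > lo:
                feats[name]["cooc=" + other] += 1
    return feats, sentiment

# Toy input (made up for illustration only).
tagged = [("The", "DT"), ("corrupt", "JJ"), ("government", "NN"),
          ("in", "IN"), ("Cairo", "NNP"), ("met", "VBD"),
          ("friendly", "JJ"), ("Qatar", "NNP")]
entities = [(4, 5, "cairo"), (7, 8, "qatar")]
toy_lexicon = {"corrupt": -0.6, "friendly": 0.5}

feats, sentiment = window_features(tagged, entities, window=3,
                                   lexicon=toy_lexicon)
```

Each entity ends up with a bag of `adj=` and `cooc=` counts plus a running sentiment total, which is the kind of feature vector the clustering steps can consume.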
K-Means Clustering • Stanford NLP (NER + POS) • Extract locations (LOCATION & NN) • e.g., London, Africa, China, Caucasus • Sentiment analysis on JJ (SentiWordNet) • Calibrate using sentiment towards US • Frequency counting
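As a rough sketch of the clustering step, the code below runs plain Lloyd-style k-means on two features per entity, frequency and sentiment score. Several things here are assumptions rather than the authors' setup: the features are z-scored so that frequency does not dominate, the centers are seeded with a deterministic farthest-first rule so the example is reproducible, and the entity names and numbers are invented.

```python
from statistics import mean, pstdev

def standardize(points):
    """Z-score each feature column (assumption: equal feature weight)."""
    cols = list(zip(*points))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # avoid division by zero
    return [tuple((x - m) / s for x, m, s in zip(p, mus, sds))
            for p in points]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100):
    # Farthest-first seeding: start from the first point, then keep
    # adding the point farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(
            points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = tuple(mean(col) for col in zip(*members))
    return labels, centers

# Hypothetical (frequency, sentiment) pairs, invented for illustration.
entity_stats = {
    "entity_a": (1, -0.5), "entity_b": (2, -0.4), "entity_c": (1, -0.6),
    "entity_d": (50, 0.3), "entity_e": (55, 0.4), "entity_f": (60, 0.2),
}
names = list(entity_stats)
labels, centers = kmeans(standardize([entity_stats[n] for n in names]), k=2)
clusters = dict(zip(names, labels))
```

On this toy input the rare, negatively scored entities and the frequent, positively scored ones fall into separate clusters.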
K-means Results • [Figure: entity frequency and sentiment score]
Multinomial Mixture Model • Model many features as (probabilistic) function of cluster assignment • Naïve Bayes independence assumption • Maximize expected log-likelihood objective with EM • [Figure: graphical model with nodes Cluster Label and Features]
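A minimal sketch of EM for a mixture of multinomials is given below. It simplifies the model on the slide to a single bag of feature counts per entity, and it is not the authors' implementation. The smoothing constant `alpha`, the caller-supplied `init_labels`, and the toy count matrix are assumptions made for illustration.

```python
import math

def em_multinomial_mixture(counts, k, init_labels, iters=50, alpha=0.1):
    """EM for a mixture of multinomials over feature-count vectors.

    counts      : list of per-entity count vectors (equal length)
    init_labels : initial hard cluster assignment, one label per entity
    alpha       : additive smoothing (an assumption, not from the slides)
    """
    n, v = len(counts), len(counts[0])
    # Start from one-hot responsibilities given by the initial labels.
    resp = [[1.0 if init_labels[i] == j else 0.0 for j in range(k)]
            for i in range(n)]
    for _ in range(iters):
        # M-step: smoothed cluster priors and per-cluster feature
        # distributions, weighted by the current responsibilities.
        pi = [(alpha + sum(resp[i][j] for i in range(n))) / (n + k * alpha)
              for j in range(k)]
        theta = []
        for j in range(k):
            w = [alpha + sum(resp[i][j] * counts[i][f] for i in range(n))
                 for f in range(v)]
            total = sum(w)
            theta.append([x / total for x in w])
        # E-step: posterior over clusters, computed in log space.
        for i in range(n):
            logp = [math.log(pi[j])
                    + sum(c * math.log(theta[j][f])
                          for f, c in enumerate(counts[i]))
                    for j in range(k)]
            m = max(logp)
            unnorm = [math.exp(lp - m) for lp in logp]
            s = sum(unnorm)
            resp[i] = [u / s for u in unnorm]
    labels = [max(range(k), key=lambda j: resp[i][j]) for i in range(n)]
    return labels, pi, theta

# Toy counts over four hypothetical features (invented data).
counts = [[5, 4, 0, 0], [6, 3, 0, 0], [4, 5, 1, 0],
          [0, 0, 5, 6], [0, 1, 4, 5], [0, 0, 6, 4]]
init = [i % 2 for i in range(len(counts))]  # round-robin start
labels, pi, theta = em_multinomial_mixture(counts, k=2, init_labels=init)
```

Passing the initial assignment in explicitly makes it easy to compare different starting points, since EM only finds a local optimum that depends on where it starts.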
EM Initialization Issues • [Figure: histograms of cluster sizes (k = 100)]
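One simple way to inspect this kind of issue is to tabulate how many clusters end up with each size after a run. The helper below is a hypothetical sketch of that diagnostic. The labels in the example are invented, not results from the cables.

```python
from collections import Counter

def cluster_size_histogram(labels, k):
    """Map each cluster size to the number of clusters with that size."""
    sizes = Counter(labels)
    # Include clusters that received no entities at all (size 0).
    per_cluster = [sizes.get(j, 0) for j in range(k)]
    return Counter(per_cluster)

# Invented assignment of 8 entities to k = 5 clusters.
example_labels = [0, 0, 0, 0, 1, 2, 2, 3]
hist = cluster_size_histogram(example_labels, k=5)
```

A histogram dominated by sizes 0 and 1 alongside a few very large clusters would suggest that the starting point left most clusters unused.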
Sample Clusters from Multinomial Mixture Model • Examples • Good • cairo, iran, saudi arabia, west bank, palestinian authority, qatar, middle east, karachi, maliki • tripoli, dutch, france, abuja, muammar al-qadhafi, icc (international criminal court) • Bad • atmar, ben ali, saleh, european union, eu, icrc (red cross), wto, ahmadinejad • helmand, karzai, seoul, brown, williams, tadic • Many other clusters very small or heterogeneous • Model seems to rely mostly on co-occurrence features
Future Direction • More advanced features, targeted toward sentiment • E.g. n-gram adjective phrases • Better model: mixture of CRF clustering, rather than Naïve Bayes