Unsupervised Clustering of People, Places & Organizations in U.S. Diplomatic Cables

Xuwen Cao, Beyang Liu

Process Outline
• Identify entities in 3891 leaked U.S. diplomatic cables published by WikiLeaks
• Extract features from a window around each entity
  • Sentiment scores
  • Co-occurring entities
  • Adjectives in a fixed-size window
• Cluster entities in feature space
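
The feature-extraction step can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes tokens already carry a Penn Treebank POS tag and an NER tag (with "O" marking non-entities), and the names `window_features`, `TOKENS`, and `TOY_LEXICON` are hypothetical. The lexicon is a stand-in for scores that would come from a resource such as SentiWordNet.

```python
from collections import Counter

def window_features(tokens, index, window=5, lexicon=None):
    """Collect features from a fixed-size window around the entity at `index`.

    tokens  -- list of (word, pos_tag, ner_tag) triples (assumed pre-tagged)
    lexicon -- hypothetical mapping adjective -> sentiment score
    Returns (feature counts, summed adjective sentiment).
    """
    lexicon = lexicon or {}
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    feats = Counter()
    sentiment = 0.0
    for i in range(lo, hi):
        if i == index:
            continue  # skip the target entity itself
        word, pos, ner = tokens[i]
        w = word.lower()
        if ner != "O":
            feats["cooc=" + w] += 1   # co-occurring entity
        if pos.startswith("JJ"):
            feats["adj=" + w] += 1    # adjective in the window
            sentiment += lexicon.get(w, 0.0)
    return feats, sentiment

# Toy sentence with made-up tags, for illustration only.
TOKENS = [
    ("Officials", "NNS", "O"), ("in", "IN", "O"), ("Cairo", "NNP", "LOCATION"),
    ("praised", "VBD", "O"), ("the", "DT", "O"), ("helpful", "JJ", "O"),
    ("talks", "NNS", "O"), ("with", "IN", "O"), ("Iran", "NNP", "LOCATION"),
]
TOY_LEXICON = {"helpful": 0.5}

narrow = window_features(TOKENS, 2, window=3, lexicon=TOY_LEXICON)
wide = window_features(TOKENS, 2, window=6, lexicon=TOY_LEXICON)
```

With the narrow window only the adjective is picked up; widening the window also brings in the second entity as a co-occurrence feature.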

K-Means Clustering
• Stanford NLP (NER + POS)
• Extract locations (LOCATION & NN)
  • e.g. London, Africa, China, Caucasus
• Sentiment analysis on JJ (SentiWordNet)
• Calibrate using sentiment towards the U.S.
• Frequency counting
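
A minimal sketch of the clustering step, assuming each entity has been reduced to a two-dimensional point (scaled frequency, sentiment score). The function names and the toy points are hypothetical, and this is plain Lloyd's algorithm in standard-library Python rather than whatever implementation was actually used.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise from k distinct data points
    assign = None
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        new_assign = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assign == assign:
            break  # assignments stable, so centers would not change either
        assign = new_assign
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # leave an empty cluster's center where it is
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign, centers

# Made-up (scaled frequency, sentiment) points for six hypothetical entities.
POINTS = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centers = kmeans(POINTS, k=2)
```

In practice the two features would likely need rescaling to comparable ranges first, since k-means distances are otherwise dominated by the larger-valued feature.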

K-means Results
[Figure: K-means clusters plotted by entity frequency and sentiment score]

Multinomial Mixture Model
• Model many features as a (probabilistic) function of cluster assignment
• Naïve Bayes independence assumption
• Maximize expected log-likelihood objective with EM
[Figure: graphical model, cluster label → features]
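
To make the EM procedure concrete, here is a small sketch of a multinomial mixture fitted with EM under the Naïve Bayes assumption. It is an illustration under stated assumptions, not the authors' code: each entity is a vector of feature counts, responsibilities are initialised at random, and `alpha` is an additive smoothing constant introduced here to keep all probabilities positive.

```python
import math
import random

def em_multinomial_mixture(counts, k, iters=100, alpha=0.1, seed=0):
    """Fit a k-component multinomial mixture to rows of feature counts."""
    rng = random.Random(seed)
    n, v = len(counts), len(counts[0])
    # Random soft initialisation of responsibilities resp[i][c].
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])
    for _ in range(iters):
        # M-step: smoothed mixing weights and per-cluster feature distributions.
        pi = [(alpha + sum(resp[i][c] for i in range(n))) / (n + k * alpha)
              for c in range(k)]
        theta = []
        for c in range(k):
            w = [alpha + sum(resp[i][c] * counts[i][j] for i in range(n))
                 for j in range(v)]
            total = sum(w)
            theta.append([x / total for x in w])
        # E-step: posterior over clusters, computed in log space for stability.
        for i in range(n):
            logp = [
                math.log(pi[c])
                + sum(counts[i][j] * math.log(theta[c][j]) for j in range(v))
                for c in range(k)
            ]
            m = max(logp)
            z = [math.exp(lp - m) for lp in logp]
            total = sum(z)
            resp[i] = [x / total for x in z]
    return pi, theta, resp

# Toy count vectors: the first three rows use features 0-1, the rest features 2-3.
COUNTS = [[4, 3, 0, 0], [5, 2, 0, 0], [3, 4, 0, 0],
          [0, 0, 4, 3], [0, 0, 2, 5], [0, 0, 3, 3]]
pi, theta, resp = em_multinomial_mixture(COUNTS, k=2)
hard = [max(range(2), key=lambda c: r[c]) for r in resp]
```

Because EM only finds a local optimum, the result depends on the random initialisation; different seeds can give different cluster sizes.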

EM Initialization Issues
[Figure: histograms of cluster sizes (k = 100)]
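
A cluster-size histogram of the kind named above can be computed from hard cluster labels roughly as follows. The function name and the toy labels are hypothetical; the sketch only shows the bookkeeping, not the plotting.

```python
from collections import Counter

def cluster_size_histogram(labels, k):
    """Map cluster size -> number of clusters with that size.

    Clusters that received no points are counted with size 0, which is
    one way a poor initialisation shows up.
    """
    per_cluster = Counter(labels)
    sizes = [per_cluster.get(c, 0) for c in range(k)]
    return Counter(sizes)

# Toy labels for six items and k = 4 clusters (cluster 3 stays empty).
hist = cluster_size_histogram([0, 0, 0, 1, 2, 2], k=4)
```

Comparing such histograms across random seeds gives a quick view of how strongly the final clustering depends on initialisation.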

Sample Clusters from Multinomial Mixture Model
• Examples
  • Good
    • cairo, iran, saudi arabia, west bank, palestinian authority, qatar, middle east, karachi, maliki
    • tripoli, dutch, france, abuja, muammar al-qadhafi, icc (international criminal court)
  • Bad
    • atmar, ben ali, saleh, european union, eu, icrc (red cross), wto, ahmadinejad
    • helmand, karzai, seoul, brown, williams, tadic
• Many other clusters very small or heterogeneous
• Model seems to be cueing most on co-occurrence features

Future Direction
• More advanced features, targeted toward sentiment
  • E.g. n-gram adjective phrases
• Better model: mixture of CRF clustering, rather than Naïve Bayes
