Unsupervised Clustering of People, Places & Organizations in U.S. Diplomatic Cables
Xuwen Cao, Beyang Liu
Process Outline
• Identify entities in 3891 leaked U.S. diplomatic cables published by WikiLeaks
• Extract features from a window around each entity
  • Sentiment scores
  • Co-occurring entities
  • Adjectives in some fixed-size window
• Cluster entities in feature space
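The window-based feature extraction in the outline can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `window_features`, the `(token, POS tag, NER tag)` triple format, the single-token entity assumption, and the window size are all hypothetical choices, and the toy sentence is made up.

```python
from collections import Counter, defaultdict

def window_features(tagged, window=5):
    """Collect co-occurring entities and adjectives around each entity.

    tagged: list of (token, pos_tag, ner_tag) triples for one cable,
            where ner_tag == "O" means "not an entity" (assumed format).
    Returns {entity: Counter of feature counts}.
    """
    feats = defaultdict(Counter)
    for i, (tok, _pos, ner) in enumerate(tagged):
        if ner == "O":
            continue  # only entities get a feature vector
        lo = max(0, i - window)
        hi = min(len(tagged), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            word, pos, other_ner = tagged[j]
            if other_ner != "O":
                feats[tok.lower()]["ent=" + word.lower()] += 1
            elif pos.startswith("JJ"):
                feats[tok.lower()]["adj=" + word.lower()] += 1
    return feats

# Toy input (illustrative only, not taken from the cables).
tagged = [
    ("Stable", "JJ", "O"),
    ("Cairo", "NNP", "LOCATION"),
    ("and", "CC", "O"),
    ("hostile", "JJ", "O"),
    ("Iran", "NNP", "LOCATION"),
]
feats = window_features(tagged, window=3)
```

Sentiment scores would be a further feature computed from the collected adjectives; that step is left out here because it depends on the sentiment lexicon used.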
K-Means Clustering
• Stanford NLP (NER + POS)
• Extract locations (LOCATION & NN)
  • e.g. London, Africa, China, Caucasus
• Sentiment analysis on JJ (SentiWordNet)
• Calibrate using sentiment towards the US
• Frequency counting
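As a rough sketch of the clustering step, the snippet below runs plain Lloyd's-algorithm k-means over two-dimensional points of the form (entity frequency, sentiment score). The function name `kmeans`, the seeding scheme, and the toy values are assumptions made for illustration; the slides do not say which implementation or feature scaling was used.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples.

    Returns (labels, centers). Initial centers are k distinct points
    sampled with a fixed seed (an illustrative choice).
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers

# Toy (frequency, sentiment) pairs, made up for illustration.
points = [(1.0, 0.1), (2.0, 0.2), (1.5, 0.0),
          (50.0, -0.5), (52.0, -0.4), (51.0, -0.6)]
labels, centers = kmeans(points, k=2)
```

Because raw frequency has a much larger range than sentiment, it dominates the distance here; in practice one would probably rescale the features before clustering.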
K-means Results
[Figure: K-means results; labels: entity frequency, sentiment score]
Multinomial Mixture Model
• Model many features as a (probabilistic) function of cluster assignment
• Naïve Bayes independence assumption
• Maximize expected log-likelihood objective with EM
[Diagram: (Cluster Label) → (Features)]
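One way to make the EM procedure concrete is the sketch below, which fits a mixture of multinomials to per-entity feature count vectors. It is an illustration under stated assumptions, not the authors' model: the function name `em_multinomial_mixture`, the random soft initialization, the smoothing constant `alpha`, and the toy count matrix are all hypothetical.

```python
import math
import random

def em_multinomial_mixture(X, k, iters=100, alpha=1e-2, seed=0):
    """EM for a mixture of multinomials.

    X: list of count vectors, one per entity (all of the same length).
    Returns (pi, theta, resp): mixing weights, per-cluster feature
    distributions, and per-entity cluster responsibilities.
    """
    rng = random.Random(seed)
    n, v = len(X), len(X[0])
    # Random soft initialization of responsibilities (rows sum to 1).
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])
    for _ in range(iters):
        # M-step: re-estimate mixing weights and feature distributions,
        # with additive smoothing so no probability is exactly zero.
        pi = [(alpha + sum(resp[i][c] for i in range(n))) / (n + k * alpha)
              for c in range(k)]
        theta = []
        for c in range(k):
            counts = [alpha + sum(resp[i][c] * X[i][j] for i in range(n))
                      for j in range(v)]
            total = sum(counts)
            theta.append([x / total for x in counts])
        # E-step: posterior over clusters, computed in log space.
        for i in range(n):
            logp = [math.log(pi[c])
                    + sum(X[i][j] * math.log(theta[c][j]) for j in range(v))
                    for c in range(k)]
            m = max(logp)
            w = [math.exp(lp - m) for lp in logp]
            total = sum(w)
            resp[i] = [x / total for x in w]
    return pi, theta, resp

# Toy count matrix (illustrative only): rows are entities, columns features.
X = [[9, 1, 0, 0], [8, 2, 0, 0], [10, 0, 1, 0],
     [0, 0, 9, 1], [0, 1, 8, 2], [1, 0, 10, 0]]
pi, theta, resp = em_multinomial_mixture(X, k=2)
hard = [max(range(2), key=lambda c: r[c]) for r in resp]
```

The result depends on the random initialization, so a real run would likely compare several seeds and keep the one with the highest log-likelihood.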
EM Initialization Issues
[Figure: histograms of cluster sizes (k = 100)]
Sample Clusters from Multinomial Mixture Model
• Examples
  • Good
    • cairo, iran, saudi arabia, west bank, palestinian authority, qatar, middle east, karachi, maliki
    • tripoli, dutch, france, abuja, muammar al-qadhafi, icc (international criminal court)
  • Bad
    • atmar, ben ali, saleh, european union, eu, icrc (red cross), wto, ahmadinejad
    • helmand, karzai, seoul, brown, williams, tadic
• Many other clusters very small or heterogeneous
• Model seems to be cueing off the co-occurrence features the most
Future Direction
• More advanced features, targeted toward sentiment
  • E.g. n-gram adjective phrases
• Better model: mixture of CRF clustering, rather than Naïve Bayes