U.S. SENATE BILL CLASSIFICATION & VOTE PREDICTION Alessandra Paulino Rick Pocklington Serhat Selcuk Bucak
GOALS • Determine the topic of a bill from the keywords it contains • Predict how a senator will vote on a bill from the content of the bill and the characteristics of the senator • Possible votes: Aye, Nay, No Vote (the last looks irrelevant, but some New York senators seem to use it quite often) • On the same bill, senators who share characteristics (age, party, region) are expected to vote similarly • A senator is likely to vote the same way on bills with the same content
CHALLENGES • Although “No vote” follows a pattern in some cases (e.g. the New York senators, or Obama and McCain), it is sometimes just random (personal problems, etc.) • Only approved bills are included, so there is a large imbalance between the number of Aye and Nay votes • The keywords are known but the content is not precisely known: e.g. for the topic “Iraq war policy”, does the bill propose a retreat or an attack? • One bill may belong to more than one topic, which complicates prediction • e.g. a bill can be labeled as a military, education and energy bill at the same time; even if a senator's behavior on military topics can be predicted, it would be difficult to predict her/his behavior on energy-related topics • So senator characteristics become more significant
OUTLINE • Related Work • Data Collection & Preprocessing • Keyword extraction and selection • Bill Classification (Multi-class multi-label classification) • Information about the Senators • Data Mining • Data and task description (Multi-class single-label classification) • Methods used: SVM & RIPPER • Evaluations • Visualization
RELATED WORK • Related work is limited • Political scientists use simple statistical measures to draw meaningful interpretations: • “Is John Kerry A Liberal?” [Poole] • Analyzing (clustering) the votes of Republicans and Democrats, as well as Southern and Northern states, on certain topics, e.g. race-related ones • Recently some statisticians have shown interest in the topic • How influential a senator is w.r.t. the outcome of a vote [Jakulin] (mutual information, information theory) • Similarity among senators' votes • Per-issue analysis, e.g. on which issues Rs and Ds differ the most • Unlike previous work, we try to predict the votes
DATA COLLECTION • Votes on each bill approved between 2005 and 2008 are taken from www.govtrack.us • Senator information and the keywords for each bill are obtained from http://www.opencongress.org/ • Infrequent keywords are eliminated and a total of 175 words are selected, so each bill is represented as a binary vector (bag-of-words) of dimension 175 • The senator characteristics used are: • State • Gender • Party (Democrat, Independent, Republican) • Age, divided into five groups: 1. x<55, 2. 54<x<65, 3. 64<x<70, 4. 69<x<80, 5. 79<x • The grouping is chosen so that each group contains approximately the same number of members
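The bill and age encodings described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five-word `VOCABULARY` is a hypothetical stand-in for the 175 selected keywords (which the slides do not list), the function names are made up for this sketch, and ages are assumed to be whole years so that "54<x<65" means 55 to 64.

```python
from typing import Iterable, List

# Hypothetical stand-in for the 175 selected keywords (not given in the slides).
VOCABULARY = ["energy", "iraq", "education", "taxes", "health"]


def bill_to_vector(keywords: Iterable[str],
                   vocabulary: List[str] = VOCABULARY) -> List[int]:
    """Binary bag-of-words: 1 if the vocabulary word is among the bill's keywords."""
    present = {k.lower() for k in keywords}
    return [1 if word in present else 0 for word in vocabulary]


def age_group(age: int) -> int:
    """Map an integer age to groups 1..5 using the boundaries on the slide."""
    if age < 55:
        return 1
    if age < 65:
        return 2
    if age < 70:
        return 3
    if age < 80:
        return 4
    return 5
```

A bill tagged with "Iraq" and "taxes" would then be encoded as a vector with ones in the second and fourth positions, and a 67-year-old senator would fall into group 3.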
BILL CLASSIFICATION • Based on the keywords it contains, predict the topic(s) of a bill • This is a multi-class multi-label classification problem: one bill may belong to more than one topic • There are 9 topics: Economy, Education, Energy, Environment, Government, Health, Law, Military, Social • Labeling is done heuristically based on the keywords • There are about 110 bills, which are divided into a training set and a test set • A one-versus-all (OvA) SVM (Support Vector Machine) approach is implemented using LIBSVM [Lin et al] in MATLAB
ONE-versus-ALL (OvA) SVM • i is the class index (1, 2, …, 9) • j is the index of the instance being processed • Find the classifier $w_i$ and bias term $b_i$ for each label i • For a single-label problem: $\hat{y}_j = \arg\max_i \,(w_i^\top x_j + b_i)$ • In our case (multi-label problem): assign label i to instance $x_j$ whenever $w_i^\top x_j + b_i > T$, where T is a threshold (T = 0 works fine)
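The difference between the single-label and multi-label decision rules can be sketched as follows. This is a minimal illustration, not the authors' MATLAB/LIBSVM code: it assumes linear scores of the form w_i · x + b_i (with an RBF kernel the score would instead be a kernel expansion), and the function names and toy weights are hypothetical.

```python
from typing import List, Sequence


def decision_scores(W: Sequence[Sequence[float]],
                    b: Sequence[float],
                    x: Sequence[float]) -> List[float]:
    """Score of instance x under each one-versus-all classifier: w_i . x + b_i."""
    return [sum(w_k * x_k for w_k, x_k in zip(w_i, x)) + b_i
            for w_i, b_i in zip(W, b)]


def predict_single_label(scores: Sequence[float]) -> int:
    """Single-label rule: pick the one class with the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def predict_multi_label(scores: Sequence[float],
                        threshold: float = 0.0) -> List[int]:
    """Multi-label rule: keep every class whose score exceeds the threshold T."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Toy example with three classes and two features (values are made up).
W = [[1.0, -1.0], [0.5, 0.5], [-1.0, 0.0]]
b = [0.0, -0.2, 0.1]
scores = decision_scores(W, b, [1.0, 0.0])
```

With these toy values the single-label rule returns only class 0, while the multi-label rule with T = 0 returns classes 0 and 1, since both have positive scores.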
BILL CLASSIFICATION RESULTS • Successful results by OvA SVM with RBF kernel:
Training set (% of data)   Classification accuracy
20%                        88%
40%                        89.5%
50%                        90%
70%                        91.5%
VOTE PREDICTION • Rule-based classification (RIPPER) • JRip implementation in Weka • OvA SVM (weighted) is also implemented • 10-fold cross-validation • Key points of RIPPER [P. Tan et al] • Class-based ordering w.r.t. increasing class prevalence • Start from the smallest class (No vote) and label it positive • SVM • Positive and negative samples are weighted differently to overcome the imbalance issue => no improvement: the prediction is always “Aye”
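The two ways of handling class imbalance mentioned on this slide can be sketched as below. The class ordering follows the slide (rarest class first). The weighting scheme is an assumption: the slides do not say which weights were used for the SVM, so the common inverse-frequency heuristic n / (k · count) is shown purely as an illustration, and the function names and toy vote counts are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def ripper_class_order(labels: Sequence[str]) -> List[str]:
    """Order classes by increasing prevalence; rules are learned for the rarest first."""
    counts = Counter(labels)
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(counts, key=lambda c: (counts[c], c))


def inverse_frequency_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Assumed weighting heuristic: n / (k * count), so rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}


# Toy vote distribution (made up) with the same kind of imbalance as in the slides.
votes = ["Aye"] * 8 + ["Nay"] * 3 + ["No Vote"] * 1
order = ripper_class_order(votes)
weights = inverse_frequency_weights(votes)
```

On this toy data the ordering starts with "No Vote" and ends with "Aye", which then acts as the default class, and the single "No Vote" sample receives the largest weight.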
RESULTS – w.r.t. PARTY Info. • Performance for predicting “Aye” votes is quite good for all groups (D, R, I) • Performance is lower for “Nay” votes, as there are significantly fewer of them than “Aye” votes • “No vote” performance is unacceptable, which means that, as far as our classifier can tell, it is random • Even though there are more Republicans, prediction performance is better for Democrats: they are more predictable • Independents make up a small percentage, and performance for them is not good
RESULTS – w.r.t. STATE Info. • “No vote” is not a surprise for New York, so we get better results for “No vote” prediction there • Texas senators always vote. Although the votes of Democrats are generally more predictable, prediction performance is higher for the TX senators (R) than for those of blue states • Michigan has almost the same statistics as TX, but its real data contains one “No vote”. Although no “No vote” is predicted in this category, it still degrades the performance of “Nay” vote prediction compared to TX
Conclusions • Since we only consider accepted amendments, there is a large imbalance between the number of “Aye” and “Nay” votes • SVM suffers from this imbalance • RIPPER gives better (though not perfect) results: despite the imbalance, it can still extract some rules for “Nay” votes if there are enough samples • Predicting whether someone will not vote (“No vote”) does not seem to be a sensible task, but in some cases it may follow some rules • Obama, Clinton and McCain probably did not vote because of their campaigns • For others, is it a kind of protest, or is there some other reason? • The attributes can also be weighted • Party information is probably the most important attribute • For each topic, the contribution of the keywords related to that topic could be increased • More information should be added (who sponsored the bill, more personal information about the senators, etc.) • As future work, one could predict whether a bill will be accepted (e.g. a bill is more likely to be accepted if Obama and Clinton agree)
VISUALIZATION • http://www.cse.msu.edu/~paulinoa/Rules.html