Text as Data in the Social Sciences
Introduction to Computing for Complex Systems (Session XVI)
ICPSR – August 11, 2010
Abe Gong
agong@umich.edu
www-personal.umich.edu/~agong
• Big Picture • The field of NLP • Automated text classification • A census of the political web Agenda
1. Language is the root of conscious thought, culture, and shared meaning.
2. Artificial and human intelligence are complementary tools for scientific inquiry.
3. Computers are surprisingly good at understanding human language.
Supervised learning Using a large set of labeled data, the computer learns to mimic humans on some task Applications • Handwriting, speech, and pattern recognition • Spam filtering • Bioinformatics • … Learning Modes
Supervised learning Using a large set of labeled data, the computer learns to mimic humans on some task Strengths • Very flexible • Easy to adapt to existing theory Weaknesses • Specifying ontologies can be time-consuming • Requires substantial training data Learning Modes
Unsupervised learning Using raw, unlabeled data, the computer looks for patterns and regularities Applications • Clustering • Neural networks • Algorithmic stock trading • Data-driven marketing • … Learning Modes
Unsupervised learning Using raw, unlabeled data, the computer looks for patterns and regularities Strengths • Does not require labeled data • Discovers new patterns Weaknesses • Often difficult to relate to existing theory Learning Modes
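To make "looks for patterns and regularities" concrete, here is a minimal sketch of clustering, one of the unsupervised applications listed above. It uses k-means, which is only one of many clustering methods and is not necessarily what the talk had in mind. The function names, the toy points, and the choice to start from the first k points are all assumptions made for illustration.

```python
def nearest_center(point, centers):
    """Index of the center with the smallest squared Euclidean distance to point."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    return distances.index(min(distances))

def kmeans(points, k, iterations=20):
    """Cluster points into k groups without using any labels."""
    # Assumption: deterministic start from the first k points (real code usually randomizes).
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_center(p, centers)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    assignments = [nearest_center(p, centers) for p in points]
    return centers, assignments

# Toy data: two obvious groups in a 2-dimensional feature space.
points = [(0, 0), (0, 1), (5, 5), (6, 5)]
centers, assignments = kmeans(points, k=2)
```

No labels are supplied anywhere: the grouping comes entirely from distances between points, which is also why the resulting clusters can be hard to relate to existing theory.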
Active learning Supervised learning, but the computer selects or generates training examples • Optimal experimental design • Performance boost for supervised learning Semi-supervised learning Blend of supervised and unsupervised learning • Algorithmic forecasting, stock trading • Topic maps • Machine summarization Learning Modes
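One common way for the computer to "select" its own training examples is uncertainty sampling: ask a human to label the documents the current classifier is least sure about. The sketch below assumes a binary task and a classifier that returns a probability. The function name `select_for_labeling` and the toy scores are hypothetical.

```python
def select_for_labeling(pool, predict_proba, k=2):
    """Return the k unlabeled documents whose predicted probability is closest to 0.5."""
    ranked = sorted(pool, key=lambda doc: abs(predict_proba(doc) - 0.5))
    return ranked[:k]

# Toy stand-in for a trained classifier: document id -> P(document is political).
toy_scores = {"doc_a": 0.97, "doc_b": 0.52, "doc_c": 0.10, "doc_d": 0.41}

# The two documents the classifier is least certain about go to the human coder next.
queries = select_for_labeling(list(toy_scores), toy_scores.get, k=2)
```

After the human labels the selected documents, the classifier is retrained and the selection step repeats.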
In all of these applications, a large degree of control is turned over to the computer. • “Data Mining” is not always a dirty word. Bad: Re-run statistical models until p < .05 Good: Tap all the data available for patterns and inference “Data Mining”
Google Image Search: “data mining books” “Data Mining”
Topic tracking and sentiment analysis Track trends in attention and opinion over time. http://www.google.com/trends http://memetracker.org http://textmap.com http://www.ccs.neu.edu/home/amislove/twittermood/ Current applications
Data visualization Clever ways to make data accessible http://manyeyes.alphaworks.ibm.com/manyeyes http://flowingdata.com http://morningside-analytics.com Current applications
Machine translation Translate text from one language to another. http://babelfish.yahoo.com/ Machine summarization Summarize the most important points from a document or group of related documents. http://newsblaster.cs.columbia.edu/ http://www.newsinessence.com/ Current applications
Miscellaneous • Language detection http://www.google.com/uds/samples/language/detect.html • Part-of-speech tagging • Word-sense disambiguation • Probabilistic parsing • Spell checking • Grammar checking • Spam filtering Current applications
Speeches • Legislation • Amendments • Hearings • Rules • Floor debate • Public comments • Judicial opinions • Legal Briefs • Party Manifestos • Media coverage • Blogs • Treaties • Reports • Anything on the public record… Data sources
http://bulk.resource.org/ Data sources
Two options
• Out-of-the-box software
  • Nice for getting started
  • Methodology is constrained
  • Lags the development curve
• Build it yourself
  • High overhead
  • Requires skill development
  • Extremely flexible
Make sure to use existing libraries!
Software
Ex: Provalis WordStat • Out of Box, Plug and Play • Software Package Developed by Provalis • http://www.provalisresearch.com/ • Booth at Midwest & APSA -- 2008, 2009 • The Full Package: WordStat, QDA Miner, SimStat Software
Programming languages
Perl, C++, Java, Ruby… Python
If you’re going to learn a language, make it Python
• Free, open source
• Intuitive syntax
• Enormous code and user base
• Well-documented, with excellent references
• Multiplatform, mature distribution
• Strong NLP capability
  • Ex: nltk, lxml, numpy, scipy, scikits libraries
Software
5-minute demo Train a classifier to recognize the difference between Twain’s Huck Finn and Stoker’s Dracula. Get python here: http://www.python.org/download/ Download the script here: http://www-personal.umich.edu/~agong/temp/text_classifier_demo.zip Download the books here: http://www.gutenberg.org/files/32325/32325-h/32325-h.htm http://www.gutenberg.org/files/345/345-h/345-h.htm Demo
Goal: Sort documents into predefined categories, based on their text. • Task • Document • Corpus • Token • Feature • Feature string • Feature vector • Bag-of-words classifiers Terminology
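The terms above can be tied together in a few lines of code. This is a minimal sketch using only the standard library. The tokenizer regex, the function names, and the two-sentence corpus are illustrative assumptions, not the contents of the demo script.

```python
import re
from collections import Counter

def tokenize(document):
    """Document -> list of tokens (lowercased words)."""
    return re.findall(r"[a-z']+", document.lower())

def build_vocabulary(corpus):
    """Corpus -> mapping from each feature (here, a distinct token) to a column index."""
    tokens = sorted({tok for doc in corpus for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def feature_vector(document, vocabulary):
    """Document -> list of token counts, one entry per vocabulary feature.

    Word order is thrown away, which is what makes this a bag-of-words model.
    """
    counts = Counter(tokenize(document))
    return [counts.get(tok, 0) for tok in vocabulary]

# Toy corpus of two documents (made up for illustration).
corpus = ["The raft drifted down the river.", "The count rose from the coffin."]
vocabulary = build_vocabulary(corpus)
vectors = [feature_vector(doc, vocabulary) for doc in corpus]
```

A bag-of-words classifier then works on `vectors` alone, so two documents with the same word counts are indistinguishable to it.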
Naïve Bayes Classifiers Assume words are drawn independently, conditional on document class. Infer each document’s class from its words. Strengths • Clear statistical foundation • Fast to train and implement • Lightweight Weaknesses • Noticeably less effective than other approaches • Statistical foundation is based on false assumptions Algorithms and Estimators
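A minimal sketch of a multinomial Naïve Bayes classifier shows how little machinery the independence assumption requires. Add-one (Laplace) smoothing and the skipping of unseen words are choices made here for illustration. The function names and the toy training documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs. Returns the counts needed to predict."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def predict(tokens, model):
    """Return the class with the highest log posterior, treating words as independent."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocabulary:
                continue  # words never seen in training carry no information
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the product.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set (made up for illustration).
training = [
    (["raft", "river", "jim"], "twain"),
    (["river", "raft", "widow"], "twain"),
    (["count", "castle", "blood"], "stoker"),
    (["blood", "night", "castle"], "stoker"),
]
model = train_naive_bayes(training)
guess = predict(["raft", "river"], model)
```

Training is a single pass of counting, which is why the method is fast and lightweight even though the independence assumption is false for real text.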
Support Vector Machines (SVM) Vectorize documents, then find the maximum-margin separating hyperplane. Strengths • High accuracy • Intuitive explanation • Work with little training data Weaknesses • No explicit statistical foundation • Training is slow with large data sets Algorithms and Estimators
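As a rough sketch of the idea, a linear SVM can be trained by stochastic subgradient descent on the regularized hinge loss. Production libraries use more sophisticated solvers, and this is not the method of any particular package. The hyperparameters (`lam`, `lr`, `epochs`) and the toy vectors are assumptions chosen for illustration.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """X: list of feature vectors; y: labels in {-1, +1}. Returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The regularization term shrinks w on every step, which favors a wide margin.
            w = [wj - lr * lam * wj for wj in w]
            if margin < 1:
                # Point is misclassified or inside the margin: move the hyperplane away from it.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def svm_predict(x, w, b):
    """Classify by which side of the hyperplane w.x + b = 0 the vector falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy document vectors: counts of two words, with class +1 favoring the first word.
X = [[2, 0], [3, 1], [0, 2], [1, 3]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [svm_predict(x, w, b) for x in X]
```

Only points with a margin below 1 trigger an update, which reflects the fact that the final hyperplane depends on the examples nearest the boundary.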
Logistic regression Maximum likelihood estimator Algorithms and Estimators
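To illustrate what "maximum likelihood estimator" means in practice, the sketch below fits a logistic regression by plain gradient ascent on the log-likelihood. Real packages use faster optimizers and usually add regularization. The learning rate, step count, and toy data are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=500):
    """X: list of feature vectors; y: labels in {0, 1}. Returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = yi - p  # gradient of the log-likelihood with respect to the linear score
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # Step uphill on the log-likelihood.
        w = [wj + lr * g for wj, g in zip(w, grad_w)]
        b += lr * grad_b
    return w, b

def predict_proba(x, w, b):
    """Estimated probability that x belongs to class 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy document vectors: counts of two words, with class 1 favoring the first word.
X = [[2, 0], [3, 1], [0, 2], [1, 3]]
y = [1, 1, 0, 0]
w, b = fit_logistic(X, y)
```

Unlike the SVM, the output is a probability, which fits naturally into the statistical models social scientists already use.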
Decision Trees Like playing 20 questions. Strengths • Able to capture subtle details Weaknesses • Require large amounts of training data • Classification is often “brittle” Algorithms and Estimators
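The "20 questions" analogy can be sketched directly: each node asks whether a document contains a particular word, and the answer picks a branch. This minimal version chooses the question that leaves the fewest misclassified training documents. Real implementations usually split on information gain or Gini impurity. All names and the toy data are hypothetical.

```python
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(docs, labels, depth=0, max_depth=3):
    """docs: list of token sets. Returns a label (leaf) or a (word, yes_tree, no_tree) tuple."""
    if len(set(labels)) == 1 or depth == max_depth:
        return majority(labels)
    best = None
    for word in sorted(set().union(*docs)):
        yes = [lab for doc, lab in zip(docs, labels) if word in doc]
        no = [lab for doc, lab in zip(docs, labels) if word not in doc]
        if not yes or not no:
            continue  # this question does not split the documents
        # Errors made if each branch simply predicts its majority label.
        errors = sum(len(side) - Counter(side).most_common(1)[0][1] for side in (yes, no))
        if best is None or errors < best[0]:
            best = (errors, word)
    if best is None:
        return majority(labels)
    word = best[1]
    yes_idx = [i for i, doc in enumerate(docs) if word in doc]
    no_idx = [i for i, doc in enumerate(docs) if word not in doc]
    return (
        word,
        build_tree([docs[i] for i in yes_idx], [labels[i] for i in yes_idx], depth + 1, max_depth),
        build_tree([docs[i] for i in no_idx], [labels[i] for i in no_idx], depth + 1, max_depth),
    )

def classify(tree, doc):
    """Answer the questions from the root down until a leaf label is reached."""
    while isinstance(tree, tuple):
        word, yes_tree, no_tree = tree
        tree = yes_tree if word in doc else no_tree
    return tree

# Toy training documents as sets of tokens (made up for illustration).
docs = [{"raft", "river"}, {"river", "jim"}, {"castle", "blood"}, {"blood", "night"}]
labels = ["twain", "twain", "stoker", "stoker"]
tree = build_tree(docs, labels)
```

The brittleness noted above is visible here: a single word decides the outcome, so a document that happens to lack that word falls into the other branch regardless of everything else it says.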
Percent agreement Precision Recall F-measure Cohen’s kappa Krippendorff’s alpha Evaluation
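Most of these measures are short formulas over two parallel lists of codes. The sketch below covers percent agreement, precision, recall, F-measure (the balanced F1 variant), and Cohen's kappa. Krippendorff's alpha is omitted because it takes more bookkeeping. The function names and the eight toy codes are made up for illustration.

```python
from collections import Counter

def percent_agreement(a, b):
    """Share of items on which two coders give the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def precision_recall_f1(truth, predicted, positive=1):
    """Precision, recall, and F1 for one class, treating `truth` as the gold standard."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, predicted))
    fp = sum(t != positive and p == positive for t, p in zip(truth, predicted))
    fn = sum(t == positive and p != positive for t, p in zip(truth, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def cohens_kappa(a, b):
    """Agreement corrected for chance. Undefined when chance agreement equals 1."""
    n = len(a)
    observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy codes for eight documents (1 = political, 0 = not political).
truth = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
agreement = percent_agreement(truth, predicted)
precision, recall, f1 = precision_recall_f1(truth, predicted)
kappa = cohens_kappa(truth, predicted)
```

On this toy data the coders agree on 6 of 8 items, yet kappa is much lower than the raw agreement because both coders assign "not political" most of the time, so a good deal of agreement is expected by chance.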
Bias plot and difficulty curve Evaluation
Why study politics online?
• Impact of new technology on politics
  • Barack Obama did 60% of his record-breaking fundraising online
  • Trent Lott, Dan Rather, Howard Dean
• New data on age-old political behavior
  • Examples to follow shortly
Motivation
“No complete index of political websites exists.” • Unable to use sampling theory • Size, representativeness, generalizability, etc. • Possible bias, error in existing methods Motivation
Web site http://domain Web page http://domain/path Examples (3 sites and 1 page) http://www.yahoo.com http://www.yahoo.com/politics http://www.dailykos.com http://abegong.dailykos.com Web sites v. web pages
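The site/page distinction above maps directly onto URL parsing. This sketch uses the standard library's `urllib.parse` and treats each distinct host as its own site, which matches the slide's count of three sites. The helper names `site_of` and `is_site_root` are hypothetical.

```python
from urllib.parse import urlsplit

def site_of(url):
    """Reduce any URL to its site: scheme plus host, with the path dropped."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc.lower()}"

def is_site_root(url):
    """True when the URL names a site itself rather than a page within it."""
    return urlsplit(url).path in ("", "/")

urls = [
    "http://www.yahoo.com",
    "http://www.yahoo.com/politics",
    "http://www.dailykos.com",
    "http://abegong.dailykos.com",
]
sites = sorted({site_of(u) for u in urls})
pages = [u for u in urls if not is_site_root(u)]
```

Note that under this rule a subdomain such as abegong.dailykos.com counts as a separate site from www.dailykos.com.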
• Sites correspond with human beings • Feasibility: ~230 million websites, ~30 billion web pages Why web sites?
Train an automated text classifier to recognize political content.
• Start from a seed batch of political sites.
• Download and classify each site in the batch.
• For political sites:
  • Harvest all outbound hyperlinks.
  • Add previously unvisited links to the next batch.
• Repeat until no new links are found.
Automated snowball census
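The snowball procedure above can be sketched as a batch-by-batch loop. Downloading, classifying, and link extraction are passed in as functions so the control flow stays visible. `fetch`, `is_political`, and `extract_links` are hypothetical placeholders, and the five-site toy web stands in for real downloads.

```python
def snowball_census(seed_sites, fetch, is_political, extract_links):
    """Return the set of sites classified as political, starting from a seed batch."""
    visited = set()
    political = set()
    batch = set(seed_sites)
    while batch:  # repeat until no new links are found
        visited.update(batch)
        next_batch = set()
        for site in batch:
            content = fetch(site)  # download the site
            if not is_political(content):
                continue  # links from non-political sites are not followed
            political.add(site)
            for link in extract_links(content):  # harvest outbound hyperlinks
                if link not in visited:
                    next_batch.add(link)  # only previously unvisited sites
        batch = next_batch
    return political

# Toy web (made up): each site has a political flag and a list of outbound links.
toy_web = {
    "a": {"political": True, "links": ["b", "c"]},
    "b": {"political": True, "links": ["a", "d"]},
    "c": {"political": False, "links": ["e"]},
    "d": {"political": True, "links": []},
    "e": {"political": True, "links": []},
}
census = snowball_census(
    seed_sites=["a"],
    fetch=lambda site: toy_web[site],
    is_political=lambda content: content["political"],
    extract_links=lambda content: content["links"],
)
```

Site "e" is political but is never found, because its only inbound link comes from a non-political site. That is a limitation of the snowball design itself: coverage depends on the link structure and on classifier accuracy at every step.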
How can we know if the automated classifier is working properly? The same way we know if a human coder is working properly: compare its coding with that of other coders.
• Hand-code a training set (n=1,000 x 1)
• Train the classifier
• Hand-code a testing set (n=200 x 4)
• Compare results
  • Human-human
  • Human-computer
Evaluation
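The comparison can be sketched as two averages: agreement among pairs of human coders, and agreement between the computer and each human. This assumes the testing set is coded by several humans per document, which is one reading of "n=200 x 4". The codes below are invented toy data, and simple percent agreement stands in for whichever reliability statistic is preferred.

```python
from itertools import combinations

def agreement(a, b):
    """Share of documents on which two coders give the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def mean_human_human(humans):
    """Average agreement over every pair of human coders."""
    pairs = list(combinations(humans, 2))
    return sum(agreement(a, b) for a, b in pairs) / len(pairs)

def mean_human_computer(humans, computer):
    """Average agreement between the classifier and each human coder."""
    return sum(agreement(h, computer) for h in humans) / len(humans)

# Invented codes for six documents from four human coders and one classifier.
humans = [
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 1],
    [1, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 0],
]
computer = [1, 1, 0, 0, 0, 0]

human_human = mean_human_human(humans)
human_computer = mean_human_computer(humans, computer)
```

If the human-computer figure is close to the human-human figure, the classifier is performing about as reliably as one more human coder would.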
Intuitive definition • Minimal training Amazon Mechanical Turk Coding protocol