Abe Gong agong@umich www-personal.umich/~agong

Text as Data in the Social Sciences
Introduction to Computing for Complex Systems (Session XVI)
ICPSR – August 11, 2010
Abe Gong
agong@umich.edu
www-personal.umich.edu/~agong

Agenda
• Big Picture
• The field of NLP
• Automated text classification
• A census of the political web

Big Picture…

1. Language is the root of conscious thought, culture, and shared meaning.

2. Artificial and human intelligence are complementary tools for scientific inquiry.

3. Computers are surprisingly good at understanding human language.

4. Suddenly, huge amounts of digitized text are available.

The field of NLP…

NLP and Related fields

Learning Modes
Supervised learning
Using a large set of labeled data, the computer learns to mimic humans on some task.
Applications
• Handwriting, speech, and pattern recognition
• Spam filtering
• Bioinformatics
• …

Learning Modes
Supervised learning
Using a large set of labeled data, the computer learns to mimic humans on some task.
Strengths
• Very flexible
• Easy to adapt to existing theory
Weaknesses
• Specifying ontologies can be time-consuming
• Requires substantial training data

Learning Modes
Unsupervised learning
Using raw, unlabeled data, the computer looks for patterns and regularities.
Applications
• Clustering
• Neural networks
• Algorithmic stock trading
• Data-driven marketing
• …

Learning Modes
Unsupervised learning
Using raw, unlabeled data, the computer looks for patterns and regularities.
Strengths
• Does not require labeled data
• Discovers new patterns
Weaknesses
• Often difficult to relate to existing theory

Learning Modes
Active learning
Supervised learning, but the computer selects or generates training examples.
• Optimal experimental design
• Performance boost for supervised learning
Semi-supervised learning
Blend of supervised and unsupervised learning.
• Algorithmic forecasting, stock trading
• Topic maps
• Machine summarization

“Data Mining”
In all of these applications, a large degree of control is turned over to the computer.
• “Data Mining” is not always a dirty word.
  Bad: Re-run statistical models until p < .05
  Good: Tap all the data available for patterns and inference

“Data Mining”
Google Image Search: “data mining books”

Current applications
Topic tracking and sentiment analysis
Track trends in attention and opinion over time.
• http://www.google.com/trends
• http://memetracker.org
• http://textmap.com
• http://www.ccs.neu.edu/home/amislove/twittermood/

Current applications
Data visualization
Clever ways to make data accessible.
• http://manyeyes.alphaworks.ibm.manyeyes
• http://flowingdata.com
• http://morningside-analytics.com

Current applications
Machine translation
Translate text from one language to another.
• http://babelfish.yahoo.com/
Machine summarization
Summarize the most important points from a document or group of related documents.
• http://newsblaster.cs.columbia.edu/
• http://www.newsinessence.com/

Current applications
Miscellaneous
• Language detection: http://www.google.com/uds/samples/language/detect.html
• Part-of-speech tagging
• Word-sense disambiguation
• Probabilistic parsing
• Spell checking
• Grammar checking
• Spam filtering

Data sources
• Speeches
• Legislation
• Amendments
• Hearings
• Rules
• Floor debate
• Public comments
• Judicial opinions
• Legal briefs
• Party manifestos
• Media coverage
• Blogs
• Treaties
• Reports
• Anything on the public record…

Data sources
http://bulk.resource.org/


Software
Two options
• Out-of-the-box software
  • Nice for getting started
  • Methodology is constrained
  • Lags the development curve
• Build it yourself
  • High overhead
  • Requires skill development
  • Extremely flexible
Make sure to use existing libraries!

Software
Ex: Provalis WordStat
• Out of the box, plug and play
• Software package developed by Provalis
• http://www.provalisresearch.com/
• Booth at Midwest & APSA (2008, 2009)
• The full package: WordStat, QDA Miner, SimStat

Software
Programming languages
Perl, C++, Java, Ruby… Python
If you’re going to learn a language, make it Python.
• Free, open source
• Intuitive syntax
• Enormous code and user base
• Well-documented, with excellent references
• Multiplatform, mature distribution
• Strong NLP capability
  • Ex: nltk, lxml, numpy, scipy, scikits libraries

Demo
5-minute demo
Train a classifier to recognize the difference between Twain’s Huck Finn and Stoker’s Dracula.
Get Python here:
• http://www.python.org/download/
Download the script here:
• http://www-personal.umich.edu/~agong/temp/text_classifier_demo.zip
Download the books here:
• http://www.gutenberg.org/files/32325/32325-h/32325-h.htm
• http://www.gutenberg.org/files/345/345-h/345-h.htm


Automated text classification

Terminology
Goal: Sort documents into predefined categories, based on their text.
• Task
• Document
• Corpus
• Token
• Feature
• Feature string
• Feature vector
• Bag-of-words classifiers
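To make the terminology concrete, here is a minimal sketch of how a document becomes a bag-of-words feature vector. The function names (`tokenize`, `build_vocabulary`, `feature_vector`) and the two-sentence corpus are hypothetical, chosen only for illustration; real projects would more likely lean on an existing library such as nltk.

```python
import re
from collections import Counter


def tokenize(document):
    """Split a document (a string) into lowercase word tokens."""
    return re.findall(r"[a-z']+", document.lower())


def build_vocabulary(corpus):
    """Map every distinct token in the corpus to a column index."""
    tokens = sorted({tok for doc in corpus for tok in tokenize(doc)})
    return {tok: index for index, tok in enumerate(tokens)}


def feature_vector(document, vocabulary):
    """Count each vocabulary token in the document, ignoring word order."""
    counts = Counter(tokenize(document))
    return [counts.get(tok, 0) for tok in vocabulary]


# A toy corpus of two made-up documents.
corpus = ["The river was wide.", "The castle was dark, dark, dark."]
vocabulary = build_vocabulary(corpus)
vectors = [feature_vector(doc, vocabulary) for doc in corpus]

print(list(vocabulary))
for vector in vectors:
    print(vector)
```

Each feature here is a single token count. Because word order is discarded, the second document is represented only by the fact that "dark" occurs three times, which is what "bag of words" means.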

Algorithms and Estimators
Naïve Bayes Classifiers
Assume words are drawn independently, conditional on document class. Infer each document’s class from its words.
Strengths
• Clear statistical foundation
• Fast to train and implement
• Lightweight
Weaknesses
• Noticeably less effective than other approaches
• Statistical foundation is based on false assumptions
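The independence assumption can be sketched in a few lines. This is an illustrative multinomial Naïve Bayes with add-one smoothing, not the demo script from the talk; the class name, the training sentences, and the labels are all made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for document, label in zip(documents, labels):
            self.word_counts[label].update(tokenize(document))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_score(self, document, label):
        """log P(class) plus the sum of log P(word | class) over known words."""
        n_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / n_docs)
        total_words = sum(self.word_counts[label].values())
        for token in tokenize(document):
            if token not in self.vocab:
                continue  # words never seen in training carry no evidence
            count = self.word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def predict(self, document):
        return max(self.class_counts, key=lambda c: self.log_score(document, c))


# Hypothetical training sentences, not quotations from either novel.
train_docs = [
    "the raft drifted down the river",
    "we fished from the raft at night",
    "the count left the castle at night",
    "blood and bats in the castle",
]
train_labels = ["twain", "twain", "stoker", "stoker"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("a raft on the river"))
print(model.predict("the castle"))
```

Summing log probabilities word by word is exactly where the "words are drawn independently" assumption enters: each token contributes evidence without regard to its neighbors.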

Algorithms and Estimators
Support Vector Machines (SVM)
Vectorize documents, then find the maximum-margin separating hyperplane.
Strengths
• High accuracy
• Intuitive explanation
• Works with little training data
Weaknesses
• No explicit statistical foundation
• Training is slow with large data sets
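As a rough sketch of the idea, the code below trains a linear SVM by sub-gradient descent on the regularized hinge loss. This is one common way to approximate the maximum-margin hyperplane, chosen here because it fits in plain Python; it is not necessarily the solver any particular package uses. The two features (hypothetical counts of "raft" and "castle") and the hyperparameter values are assumptions made for the example.

```python
def train_linear_svm(X, y, learning_rate=0.1, reg=0.01, epochs=200):
    """Fit weights and bias with sub-gradient steps on the hinge loss.

    X is a list of feature vectors; y holds labels coded as +1 or -1.
    """
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            score = bias + sum(w * x for w, x in zip(weights, features))
            # L2 regularization shrinks the weights; smaller weights
            # correspond to a wider margin.
            weights = [w - learning_rate * reg * w for w in weights]
            if label * score < 1:
                # The example is inside the margin or misclassified,
                # so push the hyperplane away from it.
                weights = [
                    w + learning_rate * label * x
                    for w, x in zip(weights, features)
                ]
                bias += learning_rate * label
    return weights, bias


def classify(features, weights, bias):
    """Return +1 or -1 depending on the side of the hyperplane."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score >= 0 else -1


# Hypothetical feature vectors: [count of "raft", count of "castle"].
X = [[2, 0], [1, 0], [0, 2], [0, 1]]
y = [1, 1, -1, -1]  # +1 = Twain, -1 = Stoker (labels invented for the sketch)

weights, bias = train_linear_svm(X, y)
print(classify([3, 0], weights, bias))
print(classify([0, 3], weights, bias))
```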


Algorithms and Estimators
Logistic regression
Maximum likelihood estimator
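The slide says only that logistic regression is a maximum likelihood estimator. One simple way to see what that means is to climb the log-likelihood by gradient ascent, as in the sketch below. The function names, learning rate, epoch count, and toy data are assumptions for illustration; statistical packages typically use faster optimizers.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """P(label = 1 | features) under the fitted model."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def fit_logistic(X, y, learning_rate=0.1, epochs=2000):
    """Maximize the log-likelihood by batch gradient ascent.

    X is a list of feature vectors; y holds labels coded as 0 or 1.
    """
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(X, y):
            # The gradient of the log-likelihood is (label - p) * features.
            error = label - predict_proba(features, weights, bias)
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w + learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias += learning_rate * grad_b / n
    return weights, bias


# Hypothetical feature vectors: [count of "raft", count of "castle"].
X = [[2, 0], [1, 0], [0, 2], [0, 1]]
y = [1, 1, 0, 0]  # 1 = Twain, 0 = Stoker (labels invented for the sketch)

weights, bias = fit_logistic(X, y)
print(round(predict_proba([3, 0], weights, bias), 3))
print(round(predict_proba([0, 3], weights, bias), 3))
```

Unlike the SVM, the output is a probability, which is convenient when the classifier's confidence matters as much as its label.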

Algorithms and Estimators
Decision Trees
Like playing 20 questions.
Strengths
• Able to capture subtle details
Weaknesses
• Require large amounts of training data
• Classification is often “brittle”


Evaluation
• Percent agreement
• Precision
• Recall
• F-measure
• Cohen’s kappa
• Krippendorff’s alpha
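Most of these measures take only a few lines to compute. The sketch below covers percent agreement, precision, recall, the F-measure (in its balanced F1 form), and Cohen's kappa; Krippendorff's alpha is left out because it needs more machinery. The function names and the two label lists are invented for the example.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of items on which two coders assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def precision_recall_f1(truth, predicted, positive=1):
    """Precision, recall, and balanced F-measure for one target class."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def cohens_kappa(a, b):
    """Agreement between two coders, corrected for chance agreement."""
    n = len(a)
    observed = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(a) | set(b)) / (n * n)
    if expected == 1:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes for eight documents (1 = political, 0 = not political).
human = [1, 1, 1, 0, 0, 0, 1, 0]
machine = [1, 1, 0, 0, 0, 1, 1, 0]

print(percent_agreement(human, machine))    # 0.75
print(precision_recall_f1(human, machine))  # (0.75, 0.75, 0.75)
print(cohens_kappa(human, machine))         # 0.5
```

The gap between 0.75 agreement and 0.5 kappa shows why chance correction matters: with two equally common labels, coders guessing at random would already agree half the time.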

Evaluation
Bias plot and difficulty curve

A Census of the Political Web

Motivation
Why study politics online?
• Impact of new technology on politics
  • Barack Obama did 60% of his record-breaking fundraising online
  • Trent Lott, Dan Rather, Howard Dean
• New data on age-old political behavior
  • Examples to follow shortly

Motivation
“No complete index of political websites exists.”
• Unable to use sampling theory
  • Size, representativeness, generalizability, etc.
• Possible bias, error in existing methods

Goal: A complete census of the political web

Web sites v. web pages
Web site: http://domain
Web page: http://domain/path
Examples (3 sites and 1 page)
• http://www.yahoo.com
• http://www.yahoo.com/politics
• http://www.dailykos.com
• http://abegong.dailykos.com
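The site/page distinction above maps directly onto URL parsing. A minimal sketch using Python's standard `urllib.parse` module follows; the helper names `site_of` and `is_site_root` are hypothetical.

```python
from urllib.parse import urlparse


def site_of(url):
    """Reduce any URL to its site: scheme plus domain, with the path dropped."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc.lower()}"


def is_site_root(url):
    """True when the URL names a site itself rather than a page within it."""
    return urlparse(url).path in ("", "/")


examples = [
    "http://www.yahoo.com",
    "http://www.yahoo.com/politics",
    "http://www.dailykos.com",
    "http://abegong.dailykos.com",
]
for url in examples:
    kind = "site" if is_site_root(url) else "page"
    print(f"{url} -> {kind} on {site_of(url)}")
```

Under this definition a subdomain such as abegong.dailykos.com counts as its own site, matching the "3 sites and 1 page" tally in the examples.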

Why web sites?
• Sites correspond with human beings
• Feasibility: ~230 million web sites vs. ~30 billion web pages

Automated snowball census
Train an automated text classifier to recognize political content.
• Start from a seed batch of political sites.
• Download and classify each site in the batch.
• For political sites:
  • Harvest all outbound hyperlinks.
  • Add previously unvisited links to the next batch.
• Repeat until no new links are found.
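The snowball procedure above can be sketched as a batch-by-batch loop. To keep the example runnable offline, downloading and classification are passed in as functions, and a small in-memory "web" with a keyword test stands in for real sites and a trained classifier. `TOY_WEB`, `fetch_site`, `is_political`, and the `.example` domains are all hypothetical.

```python
def snowball_census(seed_sites, fetch_site, is_political):
    """Crawl outward from the seeds, following links only from political sites.

    fetch_site(site) must return (text, outbound_links);
    is_political(text) must return True or False.
    """
    visited = set()
    political = set()
    batch = set(seed_sites)
    while batch:  # repeat until no new links are found
        next_batch = set()
        for site in batch:
            visited.add(site)
            text, outbound_links = fetch_site(site)
            if is_political(text):
                political.add(site)
                next_batch.update(outbound_links)  # harvest hyperlinks
        batch = next_batch - visited  # keep only previously unvisited sites
    return political


# A toy stand-in for the web: site -> (text, outbound links).
TOY_WEB = {
    "http://a.example": ("election senate vote", ["http://b.example", "http://c.example"]),
    "http://b.example": ("campaign vote", ["http://a.example", "http://d.example"]),
    "http://c.example": ("recipes and gardening", ["http://e.example"]),
    "http://d.example": ("senate hearing", []),
    "http://e.example": ("election news", []),
}
POLITICAL_WORDS = {"election", "senate", "vote", "campaign"}


def fetch_site(site):
    return TOY_WEB.get(site, ("", []))


def is_political(text):
    # Keyword matching is only a placeholder for a trained text classifier.
    return any(word in POLITICAL_WORDS for word in text.split())


census = snowball_census(["http://a.example"], fetch_site, is_political)
print(sorted(census))
```

In this toy web, e.example is political but is linked only from a non-political site, so the census never reaches it. That is a property of snowball designs generally: coverage depends on political sites being reachable from the seeds through other political sites.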

Evaluation
How can we know if the automated classifier is working properly?
The same way we know if a human coder is working properly: compare its coding with others’.
• Hand-code a training set (n=1,000 x 1)
• Train the classifier
• Hand-code a testing set (n=200 x 4)
• Compare results
  • Human-human
  • Human-computer

Coding protocol
• Intuitive definition
• Minimal training
Amazon Mechanical Turk
