
Abe Gong agong@umich.edu www-personal.umich.edu/~agong

Text as Data in the Social Sciences Introduction to Computing for Complex Systems (Session XVI) ICPSR – August 11, 2010. Abe Gong agong@umich.edu www-personal.umich.edu/~agong. Big Picture The field of NLP Automated text classification A census of the political web. Agenda.


Presentation Transcript


  1. Text as Data in the Social Sciences Introduction to Computing for Complex Systems (Session XVI) ICPSR – August 11, 2010 Abe Gong agong@umich.edu www-personal.umich.edu/~agong

  2. Big Picture • The field of NLP • Automated text classification • A census of the political web Agenda

  3. Big Picture…

  4. 1. Language is the root of conscious thought, culture, and shared meaning.

  5. 2. Artificial and human intelligence are complementary tools for scientific inquiry.

  6. 3. Computers are surprisingly good at understanding human language.

  7. 4. Suddenly, huge amounts of digitized text are available.

  8. The field of NLP…

  9. NLP and Related fields

  10. Supervised learning Using a large set of labeled data, the computer learns to mimic humans on some task Applications • Handwriting, speech, and pattern recognition • Spam filtering • Bioinformatics • … Learning Modes

  11. Supervised learning Using a large set of labeled data, the computer learns to mimic humans on some task Strengths • Very flexible • Easy to adapt to existing theory Weaknesses • Specifying ontologies can be time-consuming • Requires substantial training data Learning Modes

  12. Unsupervised learning Using raw, unlabeled data, the computer looks for patterns and regularities Applications • Clustering • Neural networks • Algorithmic stock trading • Data-driven marketing • … Learning Modes

  13. Unsupervised learning Using raw, unlabeled data, the computer looks for patterns and regularities Strengths • Does not require labeled data • Discovers new patterns Weaknesses • Often difficult to relate to existing theory Learning Modes

  14. Active learning Supervised learning, but the computer selects or generates training examples • Optimal experimental design • Performance boost for supervised learning Semi-supervised learning Blend of supervised and unsupervised learning • Algorithmic forecasting, stock trading • Topic maps • Machine summarization Learning Modes

  15. In all of these applications, a large degree of control is turned over to the computer. • “Data Mining” is not always a dirty word. Bad: Re-run statistical models until p < .05 Good: Tap all the data available for patterns and inference “Data Mining”

  16. Google Image Search: “data mining books” “Data Mining”

  17. Topic tracking and sentiment analysis Track trends in attention and opinion over time. http://www.google.com/trends http://memetracker.org http://textmap.com http://www.ccs.neu.edu/home/amislove/twittermood/ Current applications

  18. Data visualization Clever ways to make data accessible http://manyeyes.alphaworks.ibm.manyeyes http://flowingdata.com http://morningside-analytics.com Current applications

  19. Machine translation Translate text from one language to another. http://babelfish.yahoo.com/ Machine summarization Summarize the most important points from a document or group of related documents. http://newsblaster.cs.columbia.edu/ http://www.newsinessence.com/ Current applications

  20. Miscellaneous • Language detection http://www.google.com/uds/samples/language/detect.html • Part-of-speech tagging • Word-sense disambiguation • Probabilistic parsing • Spell checking • Grammar checking • Spam filtering Current applications

  21. Speeches • Legislation • Amendments • Hearings • Rules • Floor debate • Public comments • Judicial opinions • Legal Briefs • Party Manifestos • Media coverage • Blogs • Treaties • Reports • Anything on the public record… Data sources

  22. http://bulk.resource.org/ Data sources

  23. Data sources

  24. Two options • Out-of-the-box software • Nice for getting started • Methodology is constrained • Lags the development curve • Build it yourself • High overhead • Requires skill development • Extremely flexible → Make sure to use existing libraries! Software

  25. Ex: Provalis WordStat • Out of Box, Plug and Play • Software Package Developed by Provalis • http://www.provalisresearch.com/ • Booth at Midwest & APSA -- 2008, 2009 • The Full Package: WordStat, QDA Miner, SimStat Software

  26. Programming languages Perl, C++, Java, Ruby… Python If you’re going to learn a language, make it Python • Free, open source • Intuitive syntax • Enormous code and user base • Well-documented, with excellent references • Multiplatform, mature distribution • Strong NLP capability • Ex: nltk, lxml, numpy, scipy, scikits libraries Software

  27. 5-minute demo Train a classifier to recognize the difference between Twain’s Huck Finn and Stoker’s Dracula. Get Python here: http://www.python.org/download/ Download the script here: http://www-personal.umich.edu/~agong/temp/text_classifier_demo.zip Download the books here: http://www.gutenberg.org/files/32325/32325-h/32325-h.htm http://www.gutenberg.org/files/345/345-h/345-h.htm Demo

  28. Demo

  29. Demo

  30. Automated text classification

  31. Goal: Sort documents into predefined categories, based on their text. • Task • Document • Corpus • Token • Feature • Feature string • Feature vector • Bag-of-words classifiers Terminology
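The terminology on this slide can be made concrete with a small sketch. It is an illustration only, not code from the talk: the function names `tokenize`, `build_vocabulary`, and `feature_vector` and the two-sentence corpus are assumptions made for the example. Each document is split into tokens, the corpus defines the set of features, and each document becomes a feature vector of token counts (a bag of words, since word order is discarded).

```python
import re
from collections import Counter


def tokenize(document):
    """Split a document into lowercase word tokens."""
    return re.findall(r"[a-z']+", document.lower())


def build_vocabulary(corpus):
    """Map each distinct token in the corpus to a column index (one feature per token)."""
    tokens = sorted({tok for doc in corpus for tok in tokenize(doc)})
    return {tok: index for index, tok in enumerate(tokens)}


def feature_vector(document, vocabulary):
    """Count how often each vocabulary token occurs in the document."""
    counts = Counter(tokenize(document))
    return [counts.get(tok, 0) for tok in vocabulary]


# Invented two-document corpus, for illustration only.
corpus = ["The river was wide.", "The castle was dark, dark."]
vocabulary = build_vocabulary(corpus)
vectors = [feature_vector(doc, vocabulary) for doc in corpus]
print(sorted(vocabulary))
print(vectors)
```

Any classifier that consumes these vectors sees only word counts, which is what makes it a bag-of-words classifier.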

  32. Naïve Bayes Classifiers Assume words are drawn independently, conditional on document class. Infer each document’s class from its words. Strengths • Clear statistical foundation • Fast to train and implement • Lightweight Weaknesses • Noticeably less effective than other approaches • Statistical foundation is based on false assumptions Algorithms and Estimators
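A minimal sketch of the idea, assuming a multinomial model with add-one (Laplace) smoothing; the slide does not say which variant the talk used, and this is not the demo script linked earlier. The class name `NaiveBayes`, its methods, and the toy training sentences are invented for illustration and are not quotations from either novel. Words are treated as independent given the class, so a document's score for a class is the log prior plus the sum of per-word log likelihoods.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with additive smoothing (illustrative sketch)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, document):
        """Return log P(class) + sum of log P(word | class) for each class."""
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_label in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n_label / n_docs)
            for tok in tokenize(document):
                if tok not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][tok]
                score += math.log(
                    (count + self.smoothing) / (total + self.smoothing * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, document):
        scores = self.log_scores(document)
        return max(scores, key=scores.get)


# Invented training sentences, for illustration only.
train_docs = [
    "the raft drifted down the river",
    "we fished from the raft",
    "the count slept in the castle crypt",
    "blood and wolves near the castle",
]
train_labels = ["twain", "twain", "stoker", "stoker"]
model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("a raft on the river"))
```

Working in log space avoids numerical underflow when documents are long, and the smoothing term keeps a single unseen word from driving a class probability to zero.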

  33. Support Vector Machines (SVM) Vectorize documents, then find the maximum-margin separating hyperplane. Strengths • High accuracy • Intuitive explanation • Work with little training data Weaknesses • No explicit statistical foundation • Training is slow with large data sets Algorithms and Estimators

  34. Support Vector Machines (SVM) Vectorize documents, then find the maximum-margin separating hyperplane. Algorithms and Estimators

  35. Logistic regression Maximum likelihood estimator Algorithms and Estimators

  36. Decision Trees Like playing 20 questions. Strengths • Able to capture subtle details Weaknesses • Require large amounts of training data • Classification is often “brittle” Algorithms and Estimators

  37. Goal: Sort documents into predefined categories, based on their text. • Task • Document • Corpus • Token • Feature • Feature string • Feature vector • Bag-of-words classifiers Terminology

  38. Percent agreement Precision Recall F-measure Cohen’s kappa Krippendorff’s alpha Evaluation
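Several of these measures are short enough to sketch directly. The following is an illustration under stated assumptions, not the talk's evaluation code: the function names and the eight example labels are invented, and Krippendorff's alpha is left out because it needs more machinery than fits here. Precision, recall, and F-measure compare predictions against a gold standard for one positive class; percent agreement and Cohen's kappa compare two coders symmetrically, with kappa correcting for the agreement expected by chance.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Share of items on which the two coders assign the same label."""
    return sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)


def precision_recall_f1(gold, predicted, positive):
    """Precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def cohens_kappa(coder_a, coder_b):
    """Agreement between two coders, corrected for chance agreement."""
    n = len(coder_a)
    observed = percent_agreement(coder_a, coder_b)
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels, for illustration only.
human = ["pol", "pol", "pol", "other", "other", "other", "pol", "other"]
machine = ["pol", "pol", "other", "other", "other", "pol", "pol", "other"]
print(percent_agreement(human, machine))
print(precision_recall_f1(human, machine, positive="pol"))
print(cohens_kappa(human, machine))
```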

  39. Bias plot and difficulty curve Evaluation

  40. A Census of the Political Web

  41. Why study politics online? • Impact of new technology on politics • Barack Obama did 60% of his record-breaking fundraising online • Trent Lott, Dan Rather, Howard Dean • New data on age-old political behavior • Examples to follow shortly Motivation

  42. “No complete index of political websites exists.” • Unable to use sampling theory • Size, representativeness, generalizability, etc. • Possible bias, error in existing methods Motivation

  43. Goal: A complete census of the political web

  44. Web site http://domain Web page http://domain/path Examples (3 sites and 1 page) http://www.yahoo.com http://www.yahoo.com/politics http://www.dailykos.com http://abegong.dailykos.com Web sites v. web pages

  45. Sites correspond with human beings • Feasibility. ~ 230 million websites ~ 30 billion web pages Why web sites?

  46. Train an automated text classifier to recognize political content. • Start from a seed batch of political sites. • Download and classify each site in the batch. • For political sites: • Harvest all outbound hyperlinks. • Add previously unvisited links to the next batch. • Repeat until no new links are found. Automated snowball census
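The snowball procedure on this slide can be sketched as a batch-by-batch loop. This is a rough outline, not the crawler used in the study: `snowball_census`, `fetch`, and `is_political` are hypothetical names, and the five-site "web" is an invented stand-in for real downloading and for the trained classifier. The caller supplies `fetch(site)`, which returns the site's text and its outbound links, and `is_political(text)`, which plays the role of the classifier.

```python
def snowball_census(seed_sites, fetch, is_political):
    """Crawl outward from seed sites, following links only from political sites.

    fetch(site) -> (text, outbound_links)   # hypothetical downloader
    is_political(text) -> bool              # hypothetical trained classifier
    """
    visited = set()
    political = set()
    batch = set(seed_sites)
    while batch:
        next_batch = set()
        for site in batch:
            visited.add(site)
            text, links = fetch(site)
            if is_political(text):
                political.add(site)
                next_batch.update(links)  # harvest outbound hyperlinks
        # Keep only links that have not been visited yet.
        batch = next_batch - visited
    return political, visited


# Invented toy web, for illustration only.
toy_web = {
    "http://a.example": ("election senate vote", ["http://b.example", "http://c.example"]),
    "http://b.example": ("recipes and cooking", ["http://d.example"]),
    "http://c.example": ("congress vote bill", ["http://a.example", "http://e.example"]),
    "http://d.example": ("election coverage", []),
    "http://e.example": ("sports scores", []),
}
political, visited = snowball_census(
    ["http://a.example"],
    fetch=toy_web.__getitem__,
    is_political=lambda text: "vote" in text or "election" in text,
)
print(sorted(political))
print(sorted(visited))
```

In the toy web, `http://d.example` is never visited because the only link to it comes from a non-political site. That shows a property of the design: the census can only reach political sites that are linked, directly or through a chain, from other political sites.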

  47. How can we know if the automated classifier is working properly? The same way we know if a human coder is working properly: compare coding with others • Hand-code a training set (n=1,000 x 1) • Train the classifier • Hand-code a testing set (n=200 x 4) • Compare results • Human-human • Human-computer Evaluation

  48. Intuitive definition • Minimal training Amazon Mechanical Turk Coding protocol
