Fourth-Generation Content Analysis Computational Linguistics for the Social Sciences

Fourth-Generation Content Analysis: Computational Linguistics for the Social Sciences • Douglas W. Oard • Joint work with Ping Wang, Ken Fleischmann, Tiffany Chao, An-Shou Cheng, Chia-jung Tsui, and Lidan Wang

Outline • Content analysis • (some) Computational linguistics • Putting them together • An example: adoption of IT concepts • Collaboration opportunities

Insight through Triangulation • Think aloud • Observation notes • Interviews • Surveys • Content analysis • Citation analysis

Content Analysis • “… any technique for making inferences by objectively and systematically identifying specified characteristics of messages …” (Holsti, 1969) • “… the study of recorded human communications such as books, Web sites, paintings, and laws …” (Babbie, 1975) • “… a summarizing, quantitative analysis of messages that relies on the scientific method …” (Neuendorf, 2002)

Four Generations of Content Analysis • Read and understand something • Manually infer something, then count it • Directly observe something, then count it • Automatically infer something, then count it

Second-Generation Content Analysis • [Process diagram with boxes: Problem identification, Data selection, Conceptualization, Operationalization, Coding frame design, Analysis, Content acquisition, Manual coding]

Third-Generation Content Analysis (“Computer Assisted Text Analysis”) • “Dictionary-based” word counting • Person or organization names • Positive and negative sentiment terms • Vastly more scalable than manual coding • Alias list can accommodate synonymy • Focused domain can limit homonymy effects • Regression models some context-dependent effects
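
A minimal sketch of dictionary-based word counting with an alias list, as described on the slide above. The alias table, the concept names in it, and the sample text are all hypothetical and chosen only for illustration; a real study would draw its dictionary from the coding frame.

```python
import re
from collections import Counter

# Hypothetical alias list: each canonical concept maps to its surface forms.
ALIASES = {
    "SOA": ["soa", "service-oriented architecture", "service oriented architecture"],
    "VoIP": ["voip", "voice over ip"],
}

def count_concepts(text, aliases=ALIASES):
    """Count mentions of each concept, folding all aliases into one tally."""
    lowered = text.lower()
    counts = Counter()
    for concept, forms in aliases.items():
        for form in forms:
            # Word boundaries keep "soa" from matching inside "soap".
            pattern = r"\b" + re.escape(form) + r"\b"
            counts[concept] += len(re.findall(pattern, lowered))
    return counts

sample = ("Vendors pitch service-oriented architecture as the future. "
          "SOA pilots are common, and VoIP is spreading. Soap is unrelated.")
print(count_concepts(sample))
```

The word-boundary match is a simple guard against accidental substring hits; it does nothing for true homonyms, which is why the slide points to a focused domain as the practical remedy.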

(Some of) Computational Linguistics • Transducers • Document image processing (e.g., OCR) • Speech processing (e.g., ASR) • Machine translation • “Text Mining” • Segmentation • Clustering • Classification

Transducer Capabilities • [Figure: transducer capability curve showing searchable fraction for OCR, MT, handwriting, and speech]

Segmentation • Find mentions of specific types of items in a sequence • Equivalently, learn to mark start and end points • Applicable at many scales • Coherent passages • Multi-word expressions • Named entities (e.g., people or organizations) • Noun phrases • Chinese words • Stems
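
One common way to "mark start and end points" is per-token B/I/O tagging (Begin, Inside, Outside). The sketch below assumes that encoding, which the slides do not specify, and converts a tagged token sequence into mention spans. The tag names SOFTWARE and ORG are illustrative.

```python
def bio_to_spans(tokens, tags):
    """Convert per-token B/I/O tags into (start, end, type, text) spans.

    start is inclusive and end is exclusive, both counted in tokens.
    """
    spans = []
    start, kind = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a final span
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, kind, " ".join(tokens[start:i])))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans

tokens = ["Firefox", "is", "published", "by", "the", "Mozilla", "Foundation"]
tags = ["B-SOFTWARE", "O", "O", "O", "O", "B-ORG", "I-ORG"]
print(bio_to_spans(tokens, tags))
```

The same conversion applies at any of the scales listed above; only the unit being tagged changes (characters for Chinese words, tokens for named entities, sentences for coherent passages).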

Clustering • Aggregate related items • Documents, based on topical similarity • Entities, based on detected relationships
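
As an illustration of grouping documents by topical similarity, here is a small sketch using bag-of-words cosine similarity and single-pass "leader" clustering. Both the algorithm and the 0.3 threshold are assumptions made for this example; the slides do not say which clustering method was used.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a document as lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def leader_cluster(docs, threshold=0.3):
    """Put each document in the first cluster whose leader is similar enough."""
    clusters = []  # list of (leader_vector, [document indices])
    for i, doc in enumerate(docs):
        vec = vectorize(doc)
        for leader, members in clusters:
            if cosine(vec, leader) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))  # no match: start a new cluster
    return [members for _, members in clusters]

docs = [
    "open source browser firefox release",
    "firefox browser gains open source users",
    "rfid tags track warehouse inventory",
    "warehouse inventory tracking with rfid tags",
]
print(leader_cluster(docs))
```

Leader clustering depends on document order and on the threshold, so it is best read as a quick baseline rather than a recommended method.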

Classification • Associate each item with a category • Document → topic • Passage → sentiment • Entity → type • [Process diagram with boxes: Content acquisition, Feature extraction, Model learning, Classification evaluation]
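
The feature extraction, model learning, and evaluation steps can be sketched with a tiny multinomial Naive Bayes classifier. This is a stand-in chosen for brevity, not the classifier used in the project (later slides mention an SVM), and the training and test passages are invented.

```python
import math
from collections import Counter, defaultdict

def extract_features(text):
    """Feature extraction: lowercase bag of words."""
    return text.lower().split()

def train(labeled_docs):
    """Model learning: collect class and per-class word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for w in extract_features(text):
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify(model, text):
    """Pick the label with the highest add-one-smoothed log probability."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in extract_features(text):
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

# Invented passages for a passage -> sentiment task.
train_set = [
    ("great promising tool", "positive"),
    ("exciting great growth", "positive"),
    ("costly failure risk", "negative"),
    ("disappointing costly delays", "negative"),
]
test_set = [("great growth ahead", "positive"), ("costly delays again", "negative")]

model = train(train_set)
# Classification evaluation: accuracy on held-out passages.
accuracy = sum(classify(model, t) == y for t, y in test_set) / len(test_set)
print(accuracy)
```

Keeping the test passages out of training is the essential part of the evaluation step; with a coding frame, the held-out labels would come from manual coding.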

Automating the Annotation Process • “There has been a lot of buzz over the arrival of Firefox, the open-source browser published by the Mozilla Foundation… Sun Microsystems Inc. hopes that open-source Solaris will draw in new users and new growth opportunities.” • Segmentation: Firefox, Mozilla Foundation, Sun Microsystems, Solaris • Classification: Firefox → Open Source Software; Mozilla Foundation → Organization; Sun Microsystems → Organization; Solaris → Open Source Software • Association and clustering (company, software): (Mozilla Foundation, Firefox); (Sun Microsystems, Solaris)
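
The three annotation steps on this slide can be imitated with a hand-built lookup table. The sketch below is only a toy: the gazetteer is hypothetical, the two sentences were split by hand at the ellipsis, and pairing every organization with every software item in the same sentence is an assumed association rule. A trained tagger would replace the lookup in practice.

```python
import re

# Hypothetical gazetteer, hand-built for this one example.
GAZETTEER = {
    "Firefox": "Open Source Software",
    "Solaris": "Open Source Software",
    "Mozilla Foundation": "Organization",
    "Sun Microsystems": "Organization",
}

def annotate(sentence):
    """Segment and classify: return sorted (start, end, mention, type) tuples."""
    found = []
    for name, kind in GAZETTEER.items():
        for m in re.finditer(re.escape(name), sentence):
            found.append((m.start(), m.end(), name, kind))
    return sorted(found)

def associate(mentions):
    """Pair every organization with every software item in one sentence."""
    orgs = [m[2] for m in mentions if m[3] == "Organization"]
    software = [m[2] for m in mentions if m[3] == "Open Source Software"]
    return [(o, s) for o in orgs for s in software]

sentences = [
    "There has been a lot of buzz over the arrival of Firefox, the "
    "open-source browser published by the Mozilla Foundation.",
    "Sun Microsystems Inc. hopes that open-source Solaris will draw in "
    "new users and new growth opportunities.",
]
pairs = [p for s in sentences for p in associate(annotate(s))]
print(pairs)
```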

Interdisciplinary Collaboration

PopIT: Scalable Computational Analysis • Objectives • Describe, explain, and predict trends in technological fields (IT, biotech, nanotech…) • Develop methodology for domain-specific computational analysis of large-scale textual data from multiple sources • Advance theory development in social sciences • Period: September 2007-August 2010 • Sponsor: National Science Foundation http://www.wam.umd.edu/~pwang/PopIT/

Theory-Based Iterative Inquiry

Process Integration • [Process diagram with boxes: Problem identification, Data selection, Content acquisition, Conceptualization, Feature extraction, Model learning, Operationalization, Classification evaluation, Coding frame design, Analysis]

Rethinking “Operationalization” • Coupled models as “boundary object” • Input representation • Transformation • Output representation • Layered uncertainty • Meaning of the text • Meaning of the coding frame • Purpose of the coding frame

Why do some innovations become popular, but others don’t? • SaaS, Chatbots, Portable Personality, Ajax, RFID, Ultramobile Devices, BPO, Application Quality Dashboards, SOA, VoIP, Mashup, DRM, Identity Management, OSS, Thin Provisioning, Business Intelligence, Semantic Web, Web 2.0, SCM, Tera-architectures, CRM, Distributed Encryption, iSCSI

IT Innovation Life Cycle • [Figure: hype cycle, performance S-curve, and adoption curve plotted against time] • Management Fashion Theory: Knowledge entrepreneurs create a transitory collective belief that an innovation is at the forefront of progress. • Linden & Fenn, 2003

Gartner Hype Cycle for Emerging Technologies, 2007 • [Figure] • Fenn et al., 2007

Conceptual and Material Innovation

Fluctuation Hypothesis • An innovation will be more prevalent when its environmental cues are more prevalent. (Abrahamson & Fairchild, 1999)

Sentiment Hypothesis • Emotional and positive discourse characterizes the upswing of an innovation’s hype cycle, whereas reasoned, negative discourse characterizes the downswing. (Abrahamson & Fairchild, 1999)

Competition Hypothesis • The discourse volume of an old concept is negatively associated with the discourse volume of a new and related concept. (Wang, 2007)
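
One simple way to look for the negative association named in the competition hypothesis is a Pearson correlation between two discourse-volume series. This is an illustrative assumption rather than the authors' stated test, and the monthly counts below are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)

# Invented monthly article counts for an old concept and a newer rival.
old_concept = [40, 35, 30, 22, 15, 10]
new_concept = [5, 9, 16, 24, 30, 38]

r = pearson(old_concept, new_concept)
print(round(r, 3))  # a value near -1 would be consistent with the hypothesis
```

Both series here trend over time, so a strong correlation could reflect the shared trend alone; a real analysis would need to control for that, for example through the regression models mentioned earlier in the talk.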

Concept Popularity Evolution Model

Manual Content Acquisition

Automating Content Acquisition • Manually identify source(s): paid content, blogs • Locally cache Web pages: FlashGet • Parse HTML, build XML: Perl • Read XML, write tool’s format: LingPipe, SVMlight
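
The last two pipeline steps (parse HTML and build XML, then read XML and write a tool's input format) can be sketched as follows. The slides name Perl for the parsing step; Python's standard library is swapped in here purely for illustration. The element names `doc`, `title`, and `body`, the sample page, and the vocabulary are all assumptions. The output line follows the SVMlight input layout of a label followed by ascending `index:value` pairs.

```python
from collections import Counter
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class ArticleParser(HTMLParser):
    """Collect the <title> text and the text of every <p> element."""
    def __init__(self):
        super().__init__()
        self.current = None
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self.current = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current == "title":
            self.title += data
        elif self.current == "p":
            self.paragraphs[-1] += data

def html_to_xml(html):
    """Parse a cached page and build a small XML record for it."""
    parser = ArticleParser()
    parser.feed(html)
    doc = ET.Element("doc")
    ET.SubElement(doc, "title").text = parser.title.strip()
    ET.SubElement(doc, "body").text = " ".join(p.strip() for p in parser.paragraphs)
    return doc

def to_svmlight(doc, label, vocab):
    """Write '<label> <index>:<count> ...' with ascending 1-based indices."""
    counts = Counter(doc.findtext("body").lower().split())
    feats = sorted((vocab[w], c) for w, c in counts.items() if w in vocab)
    return " ".join([str(label)] + [f"{i}:{c}" for i, c in feats])

html = ("<html><head><title>Open source</title></head><body>"
        "<p>Firefox is open source</p><p>Solaris is open too</p></body></html>")
vocab = {"firefox": 1, "open": 2, "solaris": 3, "source": 4}  # hypothetical

doc = html_to_xml(html)
print(ET.tostring(doc, encoding="unicode"))
print(to_svmlight(doc, 1, vocab))
```

Keeping XML as the intermediate form means each downstream tool only needs its own small writer, which matches the "read XML, write tool's format" step on the slide.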

Collections • 6-month Pilot Study Collection (used to date) • Computerworld: 1 January 2005 – 30 June 2005; 1,193 documents; 26 issues • 10-year Trade Press Collection (now available) • Computerworld: 1 January 1998 – 9 June 2008; 25,278 documents; 534 issues • Information Week: 1 January 1998 – 30 June 2008; 31,112 documents; 527 issues

Manual Document Classification • Coding frame: ProQuest innovation labels

Automatic Document Classification • SVM, 6-month Computerworld pilot collection

10-Year Subject Label Distribution • 10-year Computerworld collection

Automatic Annotation of Mentions • LingPipe, 6-month Computerworld pilot collection • [Figure: two panels reporting precision, recall, and F1]
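
For reference, precision, recall, and F1 for mention annotation can be computed as below. The sketch assumes exact-match scoring, where a predicted mention counts only if its start, end, and type all agree with a manually coded one; the slides do not say which matching rule was used, and the spans are invented.

```python
def prf1(gold, predicted):
    """Exact-match precision, recall, and F1 over sets of mention spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented (start, end, type) spans: manual coding vs. system output.
gold = {(0, 1, "SOFTWARE"), (5, 7, "ORG"), (9, 11, "ORG")}
pred = {(0, 1, "SOFTWARE"), (5, 7, "ORG"), (12, 13, "ORG"), (14, 15, "ORG")}

precision, recall, f1 = prf1(gold, pred)
print(precision, recall, f1)
```

Here two of four predictions are correct (precision 0.5) and two of three gold mentions are found (recall about 0.67), giving an F1 of 4/7.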

Manual Selective Acquisition • In ABI/Inform, search co-occurrence of two innovations

Manual Co-occurrence Analysis
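
An automated counterpart to this manual search would count, for each pair of innovations, how many documents mention both. The sketch below is a hedged illustration over invented documents and concept names; it uses case-insensitive substring matching, which would over-count short acronyms in real text.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs, concepts):
    """Count documents in which each pair of concepts appears together."""
    counts = Counter()
    for text in docs:
        lowered = text.lower()
        # Sorting makes each pair appear in one canonical order.
        present = sorted(c for c in concepts if c.lower() in lowered)
        counts.update(combinations(present, 2))
    return counts

# Invented documents and concept list.
docs = [
    "SOA and Web services drive ERP upgrades",
    "ERP vendors embrace SOA",
    "CRM rollouts stall",
]
concepts = ["SOA", "ERP", "CRM"]

pairs = cooccurrence(docs, concepts)
print(pairs)
```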

Next Steps for PopIT • Additional content types • Academic papers • Blogs • Interviews • Classification • Cross-domain (e.g., trade press : blogs) • Non-topical (e.g., sentiment) • Social network (e.g., opinion leaders) • Extraction • Non-entity (e.g., values)

Build Tools, not “Solutions”

Interdisciplinary Innovation Cycle • [Diagram labels: Language Technology, Transducer Technology, Application Systems, Application Technology]

Some Collaboration Opportunities • Content acquisition • API access to content providers: “Sandbox” collections, focused access • Diverse sources: Email, speech, multiple languages • Interaction trails: Query logs, clickstreams • Computational linguistics • Cross-language co-reference • Applications • Global diffusion of innovation

References • Oard, D., “A Whirlwind Tour of Automated Language Processing for the Humanities and Social Sciences,” CLIR-NEH Symposium on Cyberinfrastructure for the Humanities and Social Sciences, Washington, DC, September 2008. • Cheng, A.-S., Fleischmann, K.R., Wang, P., and Oard, D., “Advancing Social Science Research by Applying Computational Linguistics,” Annual Conference of the American Society for Information Science and Technology, Columbus, OH, October 2008. • Wang, P., “Chasing the Hottest IT: Effects of Information Technology Fashion on Organizations,” Best Paper Proceedings of the Academy of Management Annual Meeting, Philadelphia, PA, August 2007.

The PopIT Team at Maryland • Ping Wang, pwang@umd.edu, +1-301-593-4518 • Doug Oard, oard@umd.edu, +1-301-405-7590 • Ken Fleischmann, kfleisch@umd.edu, +1-301-405-3989 • College of Information Studies, University of Maryland, Room 4105 Hornbake Bldg, South Wing, College Park, MD 20742-4325, USA • http://www.wam.umd.edu/~pwang/PopIT/
