Fourth-Generation Content Analysis: Computational Linguistics for the Social Sciences
Douglas W. Oard
Joint work with Ping Wang, Ken Fleischmann, Tiffany Chao, An-Shou Cheng, Chia-jung Tsui and Lidan Wang
Outline • Content analysis • (some) Computational linguistics • Putting them together • An example: adoption of IT concepts • Collaboration opportunities
Insight through Triangulation • Think aloud • Observation notes • Interviews • Surveys • Content analysis • Citation analysis
Content Analysis
• “… any technique for making inferences by objectively and systematically identifying specified characteristics of messages …” (Holsti, 1969)
• “… the study of recorded human communications such as books, Web sites, paintings, and laws …” (Babbie, 1975)
• “… a summarizing, quantitative analysis of messages that relies on the scientific method …” (Neuendorf, 2002)
Four Generations of Content Analysis • Read and understand something • Manually infer something, then count it • Directly observe something, then count it • Automatically infer something, then count it
Second-Generation Content Analysis
[Process diagram with boxes: Problem identification; Data selection; Conceptualization; Operationalization; Coding frame design; Analysis; Content acquisition; Manual coding]
Third-Generation Content Analysis (“Computer-Assisted Text Analysis”)
• “Dictionary-based” word counting
  • Person or organization names
  • Positive and negative sentiment terms
• Vastly more scalable than manual coding
• Alias list can accommodate synonymy
• Focused domain can limit homonymy effects
• Regression models some context-dependent effects
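As a rough illustration of dictionary-based counting with an alias list, the sketch below counts whole-word, case-insensitive matches for each concept. The `ALIASES` table and the `count_concepts` name are hypothetical, invented for this example; they are not the dictionaries used in the work described here.

```python
import re
from collections import Counter

# Hypothetical alias list: each canonical concept maps to its surface forms.
ALIASES = {
    "open source software": ["open source", "open-source", "oss"],
    "voip": ["voip", "voice over ip"],
}


def count_concepts(text, aliases=ALIASES):
    """Count whole-word, case-insensitive alias matches per concept."""
    lowered = text.lower()
    counts = Counter()
    for concept, forms in aliases.items():
        for form in forms:
            # Lookarounds keep "oss" from matching inside words such as "loss".
            pattern = r"(?<!\w)" + re.escape(form) + r"(?!\w)"
            counts[concept] += len(re.findall(pattern, lowered))
    return counts
```

Grouping aliases under one concept is what lets the count absorb synonymy; homonymy (for example, "OSS" meaning something else in another domain) is not handled and would need a narrower collection or extra context rules.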
(Some of) Computational Linguistics
• Transducers
  • Document image processing (e.g., OCR)
  • Speech processing (e.g., ASR)
  • Machine translation
• “Text Mining”
  • Segmentation
  • Clustering
  • Classification
Transducer Capabilities
[Figure: Transducer Capability Curve; labels: Searchable Fraction, OCR, MT, Handwriting, Speech]
Segmentation
• Find mentions of specific types of items in a sequence
• Equivalently, learn to mark start and end points
• Applicable at many scales
  • Coherent passages
  • Multi-word expressions
  • Named entities (e.g., people or organizations)
  • Noun phrases
  • Chinese words
  • Stems
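The "mark start and end points" view of segmentation can be sketched with a simple name-list lookup that returns character offsets. This is only a stand-in: a trained segmenter learns boundaries from annotated examples rather than consulting a fixed list, and `find_mentions` is a hypothetical helper written for this example.

```python
import re


def find_mentions(text, names):
    """Return (start, end, surface) character spans for each known name."""
    if not names:
        return []
    # Try longer names first so "Sun Microsystems" wins over "Sun".
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in ordered))
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]
```

Each span satisfies `text[start:end] == surface`, so later steps (classification, association) can work from the offsets alone.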
Clustering
• Aggregate related items
  • Documents, based on topical similarity
  • Entities, based on detected relationships
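One simple way to aggregate documents by topical similarity is cosine similarity over bag-of-words vectors combined with a greedy single-pass grouping. The sketch below assumes whitespace tokenization and an arbitrary threshold of 0.3; both are illustrative choices, not settings taken from this work.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering; returns lists of document indices."""
    vectors = [Counter(doc.lower().split()) for doc in docs]
    clusters = []  # the first index in each cluster acts as its seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            # No existing seed was similar enough, so start a new cluster.
            clusters.append([i])
    return clusters
```

The result depends on document order and on the threshold, which is the usual trade-off of single-pass methods against slower approaches that compare all pairs.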
Classification
• Associate each item with a category
  • Document topic
  • Passage sentiment
  • Entity type
[Process diagram with boxes: Content acquisition; Feature extraction; Model learning; Classification evaluation]
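The feature extraction, model learning, and evaluation steps can be sketched end to end with a small multinomial naive Bayes classifier. Naive Bayes is swapped in here only to keep the example dependency-free; the pilot study described later in this deck used an SVM. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Feature extraction: lowercase whitespace tokens."""
    return text.lower().split()


def train(labeled_docs):
    """Model learning from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labeled_docs:
        tokens = features(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def classify(model, text):
    """Pick the label with the highest log posterior (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in features(text):
            if token in vocab:  # tokens unseen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, labeled_docs):
    """Evaluation: fraction of held-out documents labeled correctly."""
    hits = sum(classify(model, text) == label for text, label in labeled_docs)
    return hits / len(labeled_docs)
```

In practice the evaluation set must be kept separate from the training set; scoring a model on its own training documents overstates how well it will label new text.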
Automating the Annotation Process
“There has been a lot of buzz over the arrival of Firefox, the open-source browser published by the Mozilla Foundation… Sun Microsystems Inc. hopes that open-source Solaris will draw in new users and new growth opportunities.”
• Segmentation: Firefox; Mozilla Foundation; Sun Microsystems; Solaris
• Classification:
  • Firefox: Open Source Software
  • Mozilla Foundation: Organization
  • Sun Microsystems: Organization
  • Solaris: Open Source Software
• Association and clustering (company, software):
  • (Mozilla Foundation, Firefox)
  • (Sun Microsystems, Solaris)
PopIT: Scalable Computational Analysis
• Objectives
  • Describe, explain, and predict trends in technological fields (IT, biotech, nanotech…)
  • Develop methodology for domain-specific computational analysis of large-scale textual data from multiple sources
  • Advance theory development in social sciences
• Period: September 2007 – August 2010
• Sponsor: National Science Foundation
http://www.wam.umd.edu/~pwang/PopIT/
Process Integration
[Process diagram with boxes: Problem identification; Data selection; Content acquisition; Conceptualization; Feature extraction; Model learning; Operationalization; Classification evaluation; Coding frame design; Analysis]
Rethinking “Operationalization”
• Coupled models as “boundary object”
  • Input representation
  • Transformation
  • Output representation
• Layered uncertainty
  • Meaning of the text
  • Meaning of the coding frame
  • Purpose of the coding frame
Why do some innovations become popular, but others don’t?
[Figure: IT innovation names: SaaS, Chatbots, Portable Personality, Ajax, RFID, Ultramobile Devices, BPO, Application Quality Dashboards, SOA, VoIP, Mashup, DRM, Identity Management, OSS, Thin Provisioning, Business Intelligence, Semantic Web, Web 2.0, SCM, Tera-architectures, CRM, Distributed Encryption, iSCSI]
IT Innovation Life Cycle
[Figure: Hype Cycle, Performance S-curve, and Adoption Curve over Time]
Management Fashion Theory: Knowledge entrepreneurs create a transitory collective belief that an innovation is at the forefront of progress.
Source: Linden & Fenn, 2003
[Figure: Gartner Hype Cycle, Emerging Technologies 2007 (Fenn et al., 2007)]
Fluctuation Hypothesis
An innovation will be more prevalent when its environmental cues are more prevalent. (Abrahamson & Fairchild, 1999)
Sentiment Hypothesis
Emotional and positive discourse characterizes the upswing of an innovation’s hype cycle, whereas reasoned, negative discourse characterizes the downswing. (Abrahamson & Fairchild, 1999)
Competition Hypothesis
The discourse volume of an old concept is negatively associated with the discourse volume of a new and related concept. (Wang, 2007)
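One possible way to check for a negative association between two discourse-volume series is a Pearson correlation over per-period mention counts. The deck does not say which statistic was used, so treat this as an illustrative assumption; `pearson` is a hypothetical helper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

With per-issue counts for an older and a newer concept as the two series, a value near -1 would be consistent with the hypothesis, while a value near 0 would not support it. Correlation alone does not separate competition from shared trends, so a regression with controls may be the better fit for real data.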
Automating Content Acquisition
• Manually identify source(s): paid content, blogs
• Locally cache Web pages: FlashGet
• Parse HTML, build XML: Perl
• Read XML, write tool’s format: LingPipe, SVMlight
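The "parse HTML, build XML" step was done in Perl in this project; the sketch below shows the same idea using only the Python standard library. The XML element names (`document`, `title`, `body`, `p`) and the `source` attribute are assumptions made for this example, not the project's actual schema.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class ArticleParser(HTMLParser):
    """Collect the <title> text and the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current = None  # tag whose text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            elif text:
                self.paragraphs.append(text)
            self._current = None


def html_to_xml(html, source):
    """Convert one cached HTML page into a small XML document string."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    doc = ET.Element("document", {"source": source})
    ET.SubElement(doc, "title").text = parser.title
    body = ET.SubElement(doc, "body")
    for paragraph in parser.paragraphs:
        ET.SubElement(body, "p").text = paragraph
    return ET.tostring(doc, encoding="unicode")
```

Inline markup such as `<b>` inside a paragraph is flattened to plain text, which is usually what downstream text-mining tools expect.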
Collections
6-month Pilot Study Collection (used to date)
• Computerworld: 1 January 2005 – 30 June 2005
  • 1,193 documents
  • 26 issues
10-year Trade Press Collection (now available)
• Computerworld: 1 January 1998 – 9 June 2008
  • 25,278 documents
  • 534 issues
• Information Week: 1 January 1998 – 30 June 2008
  • 31,112 documents
  • 527 issues
Manual Document Classification (coding frame: ProQuest innovation labels)
Automatic Document Classification (SVM, 6-month Computerworld pilot collection)
10-Year Subject Label Distribution (10-year Computerworld collection)
Automatic Annotation of Mentions (LingPipe, 6-month Computerworld pilot collection)
[Figure: charts reporting Precision, Recall, and F1]
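Precision, recall, and F1 for mention annotation can be computed by comparing predicted spans with manually annotated ones. The sketch below assumes exact-match scoring over (start, end) offsets; the deck does not state the matching criterion used, and `prf1` is a hypothetical helper.

```python
def prf1(gold, predicted):
    """Exact-match precision, recall, and F1 over sets of spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

F1 is the harmonic mean of the two, so it stays low unless both precision and recall are reasonably high.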
Manual Selective Acquisition
In ABI/Inform, search for co-occurrences of two innovations
Next Steps for PopIT
• Additional content types
  • Academic papers
  • Blogs
  • Interviews
• Classification
  • Cross-domain (e.g., trade press : blogs)
  • Non-topical (e.g., sentiment)
  • Social network (e.g., opinion leaders)
• Extraction
  • Non-entity (e.g., values)
Interdisciplinary Innovation Cycle
[Cycle diagram with labels: Language Technology; Transducer Technology; Application Systems; Application Technology]
Some Collaboration Opportunities
Content acquisition
• API access to content providers
  • “Sandbox” collections, focused access
• Diverse sources
  • Email, speech, multiple languages
• Interaction trails
  • Query logs, clickstreams
Computational linguistics
• Cross-language co-reference
Applications
• Global diffusion of innovation
References
• Oard, D., “A Whirlwind Tour of Automated Language Processing for the Humanities and Social Sciences,” CLIR-NEH Symposium on Cyberinfrastructure for the Humanities and Social Sciences, Washington, DC, September 2008.
• Cheng, A.-S., Fleischmann, K.R., Wang, P. and Oard, D., “Advancing Social Science Research by Applying Computational Linguistics,” Annual Conference of the American Society for Information Science and Technology, Columbus, OH, October 2008.
• Wang, P., “Chasing the Hottest IT: Effects of Information Technology Fashion on Organizations,” Best Paper Proceedings of the Academy of Management Annual Meeting, Philadelphia, PA, August 2007.
The PopIT Team at Maryland
Ping Wang, pwang@umd.edu, +1-301-593-4518
Doug Oard, oard@umd.edu, +1-301-405-7590
Ken Fleischmann, kfleisch@umd.edu, +1-301-405-3989
College of Information Studies, University of Maryland
Room 4105 Hornbake Bldg, South Wing
College Park, MD 20742-4325, USA
http://www.wam.umd.edu/~pwang/PopIT/