Topic Discovery in Text-Driven Social Science Research. Philip Resnik University of Maryland resnik@umd.edu. June 29, 2017. Thanks. National Science Foundation DARPA IARPA Bloomberg. Jordan Boyd-Graber Viet-An Nguyen Kris Miler Will Armstrong Leo Claudino Thang Nguyen
Topic Discovery in Text-Driven Social Science Research Philip Resnik University of Maryland resnik@umd.edu June 29, 2017
Thanks • National Science Foundation • DARPA • IARPA • Bloomberg • Jordan Boyd-Graber • Viet-An Nguyen • Kris Miler • Will Armstrong • Leo Claudino • Thang Nguyen • Deborah Cai • Amber Boydstun • Noah Smith • Justin Gross • Weiwei Yang • Vlad Eidelman • Daniel Argyle • Andrew Stavisky • Chris Musialek
http://www.businessinsider.com/social-medias-big-data-future--from-deep-learning-to-predictive-marketing-2014-2
Traditional content analysis http://si.wsj.net/public/resources/images/RV-AN417_CUBICL_G_20140509201519.jpg
Brendan O’Connor, Ramnath Balasubramanyan, Bryan R. Routledge, Noah A. Smith, From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series, Proceedings of the International AAAI Conference on Weblogs and Social Media, Washington, DC, May 2010.
This jogger might be adequate for use on smooth pavement and well-maintained paths at walking speeds, but on any other surface, the lack of suspension and the flexy design of the frame makes for a rough ride for baby and rather unpredictable steering response for mom or dad.
Topic Modeling A topic model is a kind of statistical model that discovers latent (not directly observable) “topics” in a collection of documents. A document can be any unit of text: an open-ended survey response, a speaker’s turn in a focus group, a product review, …
In statistical topic models, a topic resembles a thematically focused word cloud. Adapted from V-A Nguyen, Guided Probabilistic Topic Models for Agenda-Setting and Framing, UMD dissertation (slides), Feb 2015.
Every document is a mixture of these latent topics. In probabilistic topic models, a topic resembles a thematically focused word cloud. This is very similar in spirit to PCA or factor analysis. Adapted from V-A Nguyen, Guided Probabilistic Topic Models for Agenda-Setting and Framing, UMD dissertation (slides), Feb 2015.
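To make "a document is a mixture of topics" concrete, here is a minimal sketch. The topic names, vocabulary, and all probabilities are hypothetical toy values, not taken from any model in this talk. The probability of a word in a document is the topic-weighted average of that word's probability under each topic.

```python
# Hypothetical toy numbers: two topics over a four-word vocabulary.
# Each topic is a probability distribution over words (it sums to 1).
topics = {
    "health": {"doctor": 0.5, "nurse": 0.3, "bill": 0.1, "pay": 0.1},
    "billing": {"doctor": 0.1, "nurse": 0.1, "bill": 0.4, "pay": 0.4},
}


def word_probability(word, doc_mixture, topics):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topics[name].get(word, 0.0)
        for name, weight in doc_mixture.items()
    )


# A document that is 75% "health" and 25% "billing" (hypothetical mixture).
doc_mixture = {"health": 0.75, "billing": 0.25}
p_bill = word_probability("bill", doc_mixture, topics)
```

As with factor analysis, each document is summarized by a handful of weights, here non-negative and summing to one.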
Latent Dirichlet Allocation (LDA), Blei et al. (2003) Prior probabilities for topic-word distributions. Prior probabilities for document-topic distributions. For every document, pick the mixture of topics that will be used. For every word position, pick a topic for that word. Generate a word associated with that topic.
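The generative story on this slide can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the reference implementation: the function names, the symmetric prior values `alpha` and `beta`, and the toy vocabulary are all hypothetical. It only generates synthetic documents; fitting LDA to real text means inverting this process with an inference algorithm, which is not shown.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index in proportion to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(vocab, num_topics, num_docs, doc_length,
                    alpha=0.5, beta=0.5, seed=0):
    """Generate synthetic documents following the LDA generative story."""
    rng = random.Random(seed)
    # One word distribution per topic, drawn from the topic-word prior.
    topic_word = [sample_dirichlet([beta] * len(vocab), rng)
                  for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # For every document, pick the mixture of topics that will be used.
        doc_topic = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_length):
            # For every word position, pick a topic for that word...
            z = sample_index(doc_topic, rng)
            # ...then generate a word associated with that topic.
            w = sample_index(topic_word[z], rng)
            words.append(vocab[w])
        corpus.append(words)
    return corpus


# Hypothetical toy vocabulary.
VOCAB = ["pain", "doctor", "nurse", "bill", "insurance", "pay", "room"]
corpus = generate_corpus(VOCAB, num_topics=3, num_docs=4, doc_length=10)
```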
pain, doctor, nurse, told, medication, meds, gave • nurse, room, bed, hours, minutes, nurses, hour • insurance, bill, pay, cost, charge, visit, hospital • insurance, billing, bill, hospital, department, company, paid • blood, test, doctor, results, tests, lab, work • place, big, hand, time, TV, watch, cool • front, security, desk, room, friend, back, waiting Example adapted from Benjamin L. Ranard, Rachel M. Werner, Tadas Antanavicius, H. Andrew Schwartz, Robert J. Smith, Zachary F. Meisel, David A. Asch, Lyle H. Ungar and Raina M. Merchant, Yelp Reviews Of Hospital Care Can Supplement And Inform Traditional Surveys Of The Patient Experience Of Care, Health Affairs 35, no. 4 (2016): 697-705, doi: 10.1377/hlthaff.2015.1030
Benjamin L. Ranard, Rachel M. Werner, Tadas Antanavicius, H. Andrew Schwartz, Robert J. Smith, Zachary F. Meisel, David A. Asch, Lyle H. Ungar and Raina M. Merchant, Yelp Reviews Of Hospital Care Can Supplement And Inform Traditional Surveys Of The Patient Experience Of Care, Health Affairs 35, no. 4 (2016): 697-705, doi: 10.1377/hlthaff.2015.1030
Supervised LDA (sLDA), Blei & McAuliffe (2008) Prior probabilities for topic-word distributions. Prior probabilities for document-topic distributions. For every document, pick the mixture of topics that will be used. For every word position, pick a topic for that word. Generate a word associated with that topic. Observed response variable for this document: y.
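What the response variable adds can be sketched as follows, assuming the Gaussian-response form of sLDA: the document's response y is drawn from a normal distribution whose mean is a weighted sum of the document's empirical topic proportions. The coefficient values, topic assignments, and function names below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import random
from collections import Counter


def empirical_topic_proportions(topic_assignments, num_topics):
    """Fraction of the document's word positions assigned to each topic."""
    counts = Counter(topic_assignments)
    n = len(topic_assignments)
    return [counts[k] / n for k in range(num_topics)]


def expected_response(topic_assignments, eta):
    """Mean of y: dot product of regression weights and topic proportions."""
    zbar = empirical_topic_proportions(topic_assignments, len(eta))
    return sum(e * z for e, z in zip(eta, zbar))


def sample_response(topic_assignments, eta, sigma, rng):
    """Draw y from a normal distribution centered on the expected response."""
    return rng.gauss(expected_response(topic_assignments, eta), sigma)


# Hypothetical five-word document: two words from topic 0, two from topic 1,
# one from topic 2, with hypothetical per-topic regression weights.
assignments = [0, 0, 1, 1, 2]
eta = [-1.0, 2.0, 0.5]
mean_y = expected_response(assignments, eta)
y = sample_response(assignments, eta, sigma=0.1, rng=random.Random(0))
```

Here the topic proportions are 2/5, 2/5, and 1/5, so the expected response is -0.4 + 0.8 + 0.1 = 0.5. Topics with large positive or negative weights are the ones most associated with the response.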
“Unfortunately, its simple plot was too dull to be watchable” -15.1 -28.5 16.2 -14.2 2/5 2/5 1/5 Adapted from Blei & McAuliffe (2007)
Rude, Stephanie, Eva-Maria Gortner, and James Pennebaker. "Language use of depressed and depression-vulnerable college students." Cognition & Emotion, 18.8 (2004): 1121-1133.
COLLEGE IS GREAT AS LONG AS I DO NOT HAVE TO GO TO CLASS OR LEAVE MY ROOM. I DO NOT LIKE GOING OUT ANYMORE EVEN THOUGH I USED TO LOVE IT. NOW I JUST WANT TO SIT IN MY ROOM AND PLAY ON MY COMPUTER OR SLEEP. I DO NOT EVEN LIKE TALKING ON THE PHONE. THINGS I USED TO ENJOY, LIKE PEOPLE, I DO NOT ANYMORE. THEN THERE ARE THE CLASSES. I HATE ALL OF MINE. I FEEL LIKE SUCH A FAILURE. EVERYONE TOLD ME THEY WOULD BE HARD, BUT THIS IS RIDICULOUS. I CANNOT BELIEVE ANYONE CAN PASS THESE. I TRY MY HARDEST BUT THAT NEVER SEEMS TO BE ENOUGH. I KNOW I COULD SPEND MORE TIME ON MY HOMEWORK BUT WHEN I AM WORKING ON IT I GET SO WORN OUT I CANNOT THINK ANYMORE. THEN I REGRET NOT DOING IT. BUT IT IS LIKE A VICIOUS CYCLE. I AM SO EXHAUSTED I CANNOT THINK SO I SLEEP, THEN I WAKE UP EXHAUSTED AND I DO NOT HAVE ENOUGH ENERGY TO GO TO CLASS. THEN I DO NOT KNOW HOW TO DO MY HOMEWORK AND I GET DISCOURAGED AND IT TAKES ME TWICE AS LONG TO DO, SO I GET SO EXHAUSTED THAT I CANNOT THINK! THIS IS SO FRUSTRATING I FEEL LIKE THERE IS NO ONE IN THIS UNIVERSITY THAT CARES THAT I HATE IT HERE. …
-NE +NE Supervised LDA topics from undergraduate stream-of-consciousness essays identified by a clinician as most relevant for assessing depression. Supervision (regression) is based on Z-scored Big-5 scores for emotional instability (neuroticism). Resnik et al., Beyond LDA: Exploring Supervised Topic Modeling for Depression-Related Language in Twitter. NAACL Workshop on Computational Linguistics and Clinical Psychology, Denver, CO, June 2015.
Language evidence for behavioral symptoms • watch movie time episode read write season totally book favorite
True positives (sensitivity) False positives (1 – specificity) Resnik et al., “The University of Maryland CLPsych Shared Task System”, Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 54–60, Denver, Colorado, June 5, 2015.
Hierarchical modeling "[The] press may not be successful much of the time in telling people what to think, but it is stunningly successful in telling its readers what to think about. The world will look different to different people depending on the map that is drawn for them by writers, editors, and publishers of the paper they read." Cohen, B.C. (1963). The press and foreign policy. Princeton. (emph. added) Entman, R.M. (1993). "Framing: Toward clarification of a fractured paradigm". Journal of Communication 43 (4): 51–58. What framing does is to "select some aspects of a perceived reality and make them more salient in a communicating text, in such a way as to promote a particular problem definition, causal interpretation, moral evaluation, and/or treatment recommendation for the item described."
“Some of the children who have come to this country may not have a valid legal basis to remain, but some will. Yet, it is virtually impossible for a child to assert a valid claim under immigration law in the absence of legal representation. … It is a fantasy to believe that unrepresented children have a fair shot in an immigration proceeding” --Rep. Hakeem Jeffries, D-N.Y. Immigration Legal process Immigration The swine flu finding only fuels fears from law enforcement along the border who say the illegal immigrants are not being properly screened for diseases and contagious sicknesses before moving along to other facilities for holding across the nation. Healthcare
Supervised Hierarchical LDA (SHLDA) Supervised Nested LDA (SNLDA) Lexical and Hierarchical Topic Regression, Viet-An Nguyen, Jordan Boyd-Graber, Philip Resnik, Advances in Neural Information Processing Systems (NIPS 2013), Lake Tahoe, NV
Number of stars DW-NOMINATE Mean squared error (lower is better)
Selected SNLDA topics from Twitter training data Philip Resnik et al. (2015), “Beyond LDA: Exploring Supervised Topic Modeling for Depression-Related Language in Twitter”, Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality.
Sequential modeling: agenda setting and influence “The ability to change topical focus, especially given strong cultural and social pressure to be relevant, means having enough interpersonal power to take charge of the agenda.” Palmer MT (1989) Controlling conversations: Turns, topics and interpersonal control. Communication Monographs 56(1):1–18
Speaker Identity for Topic Segmentation For every turn in the conversation, we know who the author is and what words they used. We have some mix of topics from the previous turn. For this new turn, we decide whether to keep the old mix of topics, or generate a new mix. That choice depends on the author of this new turn and their tendency to change the topic. Once we have the topic mix for this turn, we generate words according to that mixture of topics. Then we do it all over again with the next turn. V. Nguyen, J. Boyd-Graber, and P. Resnik. SITS: A Hierarchical Nonparametric Model using Speaker Identity for Topic Segmentation in Multiparty Conversations. ACL 2012
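The turn-by-turn story above can be sketched as a small simulation. This is a deliberate simplification and not the SITS model itself: it assumes a fixed number of topics (SITS is nonparametric), treats the first turn as always starting a new topic mix, and uses hypothetical speaker names, shift tendencies, and topic-word probabilities.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_conversation(turns, shift_tendency, topic_word, vocab,
                          alpha=0.5, seed=0):
    """Simulate turns; each turn keeps or replaces the previous topic mix."""
    rng = random.Random(seed)
    num_topics = len(topic_word)
    mixture = None  # no previous turn yet
    output = []
    for speaker, num_words in turns:
        # Keep the old mix, or generate a new one? The choice depends on
        # this speaker's tendency to change the topic.
        shifted = mixture is None or rng.random() < shift_tendency[speaker]
        if shifted:
            mixture = sample_dirichlet([alpha] * num_topics, rng)
        # Generate the turn's words from the current topic mix.
        words = []
        for _ in range(num_words):
            z = rng.choices(range(num_topics), weights=mixture, k=1)[0]
            w = rng.choices(range(len(vocab)), weights=topic_word[z], k=1)[0]
            words.append(vocab[w])
        output.append((speaker, shifted, words))
    return output


# Hypothetical setup: two topics, and a moderator who always changes topic.
vocab = ["taxes", "budget", "war", "troops"]
topic_word = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
shift_tendency = {"moderator": 1.0, "candidate": 0.0}
turns = [("moderator", 5), ("candidate", 8), ("candidate", 6), ("moderator", 4)]
conversation = generate_conversation(turns, shift_tendency, topic_word, vocab)
```

In the model proper, each speaker's shift tendency is inferred from transcripts rather than set by hand, which is what makes it usable as a measure of agenda control.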
Ifill, moderator: Terrible. Yes, she was constrained by the agreed debate rules. But she gave not the slightest sign of chafing against them or looking for ways to follow up the many unanswered questions or self-contradictory answers. This was the big news of the evening. Katie Couric, and for that matter Jim Lehrer, have never looked so good.
Agenda setting Influence Modeling Topic Control to Detect Influence in Conversations using Nonparametric Topic Models Viet-An Nguyen, Jordan Boyd-Graber, Philip Resnik, Deborah A. Cai, Jennifer E. Midberry, Yuanxin Wang Machine Learning Journal, October 2013
Modeling framing and decisions Ideal point model Text not used Logistic or Gaussian Martin and Quinn (2002), Bafumi et al. (2005), Gerrish and Blei (2011), slide adapted from Viet-An Nguyen
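As a rough illustration of the logistic variant, a one-dimensional ideal point model is commonly written so that the probability of a "yes" vote is a logistic function of the legislator's ideal point times the bill's polarity, plus the bill's popularity. The function and parameter names below are hypothetical labels for those quantities, and the exact parameterization varies across the cited papers.

```python
import math


def vote_probability(ideal_point, polarity, popularity):
    """P(yes vote) = logistic(ideal_point * polarity + popularity)."""
    score = ideal_point * polarity + popularity
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical legislators on opposite sides of a bill with polarity 1.0
# and neutral popularity.
p_right = vote_probability(ideal_point=2.0, polarity=1.0, popularity=0.0)
p_left = vote_probability(ideal_point=-2.0, polarity=1.0, popularity=0.0)
p_center = vote_probability(ideal_point=0.0, polarity=1.0, popularity=0.0)
```

Note that only votes enter this model, which is the sense of "text not used" on the slide.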
Modeling framing and decisions Multidimensional ideal point model Lauderdale and Clark (2014). Slide adapted from Viet-An Nguyen
Hierarchical ideal point topic model (HIPTM) Bill text generated using distribution over topics Issues Speech text (new!) generated using distribution over issue-specific frames Frames Issue-specific ideal point uses a weighted combination based on how much author talks about each frame. Ideal points Votes Nguyen et al., Tea Party in the House: A Hierarchical Ideal Point Topic Model and Its Application to Republican Legislators in the 112th Congress. Association for Computational Linguistics, Beijing, July 2015. Slide adapted from Viet-An Nguyen
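One plausible reading of the "weighted combination" on this slide is sketched below: a legislator's ideal point on an issue is the average of per-frame values, weighted by the share of that legislator's speech on the issue devoted to each frame. The frame names, token counts, and frame values are hypothetical, and the full model treats this combination as the mean of a distribution rather than a fixed value.

```python
def issue_ideal_point(frame_counts, frame_positions):
    """Weight each frame's position by the share of speech using that frame."""
    total = sum(frame_counts.values())
    if total == 0:
        raise ValueError("legislator has no speech on this issue")
    return sum(
        (count / total) * frame_positions[frame]
        for frame, count in frame_counts.items()
    )


# Hypothetical legislator: 30 tokens on a "repeal" frame, 10 on a "cost" frame.
frame_counts = {"repeal": 30, "cost": 10}
# Hypothetical per-frame positions on the issue's ideal point scale.
frame_positions = {"repeal": 1.0, "cost": -1.0}
point = issue_ideal_point(frame_counts, frame_positions)
```

With these numbers the result is 0.75 * 1.0 + 0.25 * (-1.0) = 0.5, so the legislator sits closer to the frame they talk about more.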
Hierarchical Ideal-Point Topic Model Establishment Tea Party Nguyen et al., Tea Party in the House: A Hierarchical Ideal Point Topic Model and Its Application to Republican Legislators in the 112th Congress. Proc. Association for Computational Linguistics, Beijing, July 2015.
Hierarchical Ideal-Point Topic Model Analysis of Republican voting in the 112th Congress on 60 FreedomWorks “key votes”, comparing Establishment and Tea Party ideal points and issue framing. Issues where establishment Republicans have a “cheap” way to take Tea Party friendly positions without going against their preferred ideal points. (Example: Obamacare) Tea Party “wins” on these issues: 83%
Hierarchical Ideal-Point Topic Model Disagreement on preferred outcomes, but low rhetorical polarization, which renders the issue less memorable for voters. (Example: reforming the budget process) Tea Party “wins” on these issues: 38%
Interactive modeling • Topic models are promising in qualitative research because they offer an automatic analysis into intuitive, latent thematic units. • But they’re very difficult to use unless you’ve got a data analytics specialist on your team. • Data wrangling and programming skills • The black art of “tuning” the model • Limited ways to exploit subject matter expertise • Unfamiliar or unintuitive visualization tools
Interactive Modeling: The business of trust "I don't trust [existing] software to give me all the relevant insights, and my client doesn't trust just me. I need something that can fill that gap." - Polling industry qualitative analyst
Musialek et al. (AAPOR 2016). Inspired by Hu et al. (2011, 2014), Interactive Topic Models.
Validation with FAA Data 8 – 10 person-hours 1-2 days 100 – 150 person-hours Several weeks 93.5% of GAO codes covered, capturing key high-level themes
Take-aways • Bayesian topic models provide a way to uncover latent structure in text • Integrating supervision makes it possible to predict response variables of interest • Sequential models can capture conversational dynamics • Hierarchical models can capture not only topics, but how those topics are framed • Interactive models combine the best of machine and human capabilities