This study explores the use of machine learning to assess sentiment towards HPV vaccination using Twitter data. The hierarchical classification approach improves prediction performance on the highly unbalanced dataset and identifies associations between real-world outcomes and social media discussion. Limitations include limited keywords and features, separated training and prediction corpus, and poor performance on classes with limited samples. Future work includes uncovering discussion topics using topic modeling, evaluating deep learning approaches, and predicting demographic information.
Leveraging Machine Learning Based Approaches to Assess HPV Vaccination Sentiment with Twitter Jingcheng Du, B.S., Jun Xu, Ph.D., Hsingyi Song, MPH, Cui Tao, Ph.D. Ontology Research Group School of Biomedical Informatics University of Texas Health Science Center at Houston (UTHealth) IRB Number: HSC-SBMI-16-0291
Social Media for Public Health • An important medium for public, patients and health professionals to communicate about health-related issues • 90% of respondents (age 18 to 24) said they would trust medical information shared by others on their social media networks • Studies show that information shared on social media is able to alter vaccine acceptance and decision-making 1. Moorhead SA, Hazlett DE, Harrison L, et al. A new dimension of health care: systematic review of the uses, benefits, and limitations of social media for health communication. J Med Internet Res 2013;15:e85. 2. https://getreferralmd.com/2013/09/healthcare-social-media-statistics/
HPV and HPV vaccine • Nearly all sexually active men and women get HPV at some point in their lives • Leads to 25-30% of oral and throat cancers; 90% of anal cancers; 40% of penile cancers; nearly 100% of cervical cancers • Haven’t started the HPV vaccine series • 4 out of 10 adolescent girls • 6 out of 10 adolescent boys https://www.cdc.gov/std/hpv/stdfact-hpv.htm https://www.cdc.gov/media/releases/2015/p0730-hpv.html
Tweets Collection & Annotation • Tweets collection • Training corpus: July 15, 2015 to August 17, 2015 • 33,228 tweets collected • Prediction corpus: November 2, 2015 to March 28, 2016 • 184,214 tweets collected • Keywords • hpv, human papillomavirus, cervarix and gardasil • Annotation • 6,000 tweets randomly sampled from the training corpus • Three annotators
Sentiment Classification Overview of the scheme for HPV vaccine sentiment classification on Twitter
Sentiment Distribution • Kappa inter-rater value: 0.851 • Highly unbalanced class distribution Sentiment distribution in gold standard
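The slide reports a kappa value but does not say which variant was used with three annotators. As a minimal sketch, assuming a pairwise Cohen's kappa between two annotators (the function name and the toy labels are illustrative, not from the study), agreement can be computed like this:

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels (hypothetical, not the study's annotations).
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
print(cohen_kappa(annotator_1, annotator_2))
```

With more than two annotators, a multi-rater statistic such as Fleiss' kappa or an average of pairwise values would be the usual alternatives.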
Machine Learning System • Pre-processing • Remove URLs, hashtags and Twitter user names • Collapse repeated letters, “wooooow” -> “woow” • Convert text to lowercase • Feature extraction • Word n-grams: contiguous 1- and 2-grams of words • Word cluster features: map tweet tokens into 1000 clusters • POS tags: extracted by TweeboParser • Classification algorithm • Support vector machines (SVM), RBF kernel • Evaluation • 10-fold cross-validation on gold standard
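The pre-processing and word n-gram steps can be sketched with the Python standard library. This is an illustration under stated assumptions, not the authors' code: the regular expressions, the function names, and the choice to drop the whole hashtag token (rather than only the `#` sign) are all assumptions.

```python
import re

# Assumed patterns; the slides do not specify the exact rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_OR_USER_RE = re.compile(r"[#@]\w+")   # whole hashtag or user-name token
REPEAT_RE = re.compile(r"(.)\1{2,}")      # any character repeated 3+ times


def preprocess(tweet: str) -> str:
    """Apply the three pre-processing steps listed on the slide."""
    text = URL_RE.sub(" ", tweet)
    text = TAG_OR_USER_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1\1", text)   # "wooooow" -> "woow"
    text = text.lower()
    return " ".join(text.split())         # normalize whitespace


def word_ngrams(text: str, n_values=(1, 2)):
    """Return contiguous word n-grams (unigrams and bigrams by default)."""
    tokens = text.split()
    grams = []
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


clean = preprocess("Wooooow @user check http://t.co/abc #HPV vaccine")
print(clean)
print(word_ngrams(clean))
```

The word cluster and POS features, the RBF-kernel SVM, and the 10-fold cross-validation are not shown here; they depend on external resources such as TweeboParser and a trained clustering.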
Baseline Model Baseline model: uses n-gram features only and treats all classes equally
Hierarchical Classification Scheme Level 1 Level 2 Level 3 Hierarchical classification scheme for HPV vaccine sentiment classification on Twitter
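A three-level cascade of this kind can be sketched as follows. The class names ("Unrelated", "Negative", "Safety", and so on) and the keyword-based toy classifiers are hypothetical stand-ins; in the study each level would be a trained classifier, and the actual class definitions are given on the classification definition slide.

```python
from typing import Callable

Classifier = Callable[[str], str]


def hierarchical_predict(tweet: str,
                         level1: Classifier,
                         level2: Classifier,
                         level3: Classifier) -> str:
    """Route a tweet through three classifiers, stopping early when possible."""
    # Level 1: filter out tweets that are not about the topic.
    if level1(tweet) == "Unrelated":
        return "Unrelated"
    # Level 2: coarse sentiment for related tweets.
    sentiment = level2(tweet)
    if sentiment != "Negative":
        return sentiment
    # Level 3: finer-grained category, only for negative tweets.
    return "Negative/" + level3(tweet)


# Toy keyword rules standing in for trained models (assumptions only).
def toy_level1(t: str) -> str:
    return "Related" if "hpv" in t or "gardasil" in t else "Unrelated"


def toy_level2(t: str) -> str:
    if "unsafe" in t or "useless" in t:
        return "Negative"
    return "Positive" if "protects" in t else "Neutral"


def toy_level3(t: str) -> str:
    return "Safety" if "unsafe" in t else "Others"


for example in ["nice weather today", "gardasil is unsafe"]:
    print(example, "->",
          hierarchical_predict(example, toy_level1, toy_level2, toy_level3))
```

Because each level sees only the tweets passed down from the level above, the rare fine-grained classes are separated among negative tweets only, rather than competing with the much larger classes at the top.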
Performance Improvement Performance comparison of the baseline model and hierarchical classification model
Performance improvement through optimization Du, Jingcheng, et al. "Optimization on machine learning based approaches for sentiment analysis on HPV vaccines related tweets." Journal of Biomedical Semantics 8.1 (2017): 9.
Evaluation on Unlabeled Tweets Dataset Evaluation on 500 randomly selected samples from the prediction corpus (184,214 tweets)
Trends for different sentiments (figure; date label: Feb 22, 16)
The association of different days of the week with the relative proportions of tweets containing Negative, Neutral and Positive opinions
Summary • Contributions: • Apply hierarchical classification to improve prediction performance on a highly unbalanced tweet dataset • Identify the interaction of real-world outcomes with Twitter discussion and discover the association of HPV-related tweeting behaviors with different opinions • Limitations: • Limited keywords, limited features • Training and prediction corpora are separate • Poor performance on classes with very limited samples • Ongoing projects: • Uncover discussion topics using topic models • Evaluate deep learning (e.g., CNN) on this task • Predict demographic information for these social media users
Acknowledgments • UTHealth SBMI • Dr. Cui Tao’s research group • Dr. Hua Xu’s research group • UTHealth CPRIT Fellowship • Dr. Roberta Ness • Dr. Patricia Dolan Mullen • Dr. David Loose • All the fellows • Grants: • National Institutes of Health under Award Number R01LM011829 • National Institutes of Health under Award Number R01AI130460 • Cancer Prevention and Research Institute of Texas grant # RP160015
cui.tao@uth.tmc.edu jingcheng.du@uth.tmc.edu @jingchengdu Disclaimer The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health and the Cancer Prevention and Research Institute of Texas.
Classification Definition Detailed definition of different classes