Explore the data and insights behind popular TED Talks, leveraging text analytics and predictive modeling. Discover influential factors that drive talk popularity and learn how R libraries and Shiny facilitate data transformations and tool deployment.
Words that will inspire
Eduardo Contreras Cortes
www.speakthedata.com | @edco_one
The Motivation
"I have a dream that one day…"
"We choose to go to the moon in this decade…"
"We shall fight on the beaches…"
But we need more data!
• A sufficient number of talks
• Ideally the same format and style
• Transcripts that can be scraped
• A way to track popularity over time
Eureka! https://www.kaggle.com/rounakbanik/ted-talks
The approach • I. Data extraction and feature engineering • II. Data analysis and model ensemble • III. Model insights
I. Data Extraction and Feature Engineering
[Tidy text mining workflow diagram] Source: www.tidytextmining.com/
I. Data Extraction and Feature Engineering
• Dataset
  • More than 2,500 TED Talks from all TED events
  • Full transcript of each talk
  • Available data: number of views, comments and ratings
  • Enriched dataset: filmed date, published date and duration
• Building features (a sketch of these features in R follows below)
  • Word counts: number of sentences and words per minute, average words per sentence
  • Audience reaction: laughs, questions, applause
  • N-gram word analysis: frequency of words and phrases such as "I", "You", "Going to", "Want"
• To predict: binary classification of whether a talk is among the top most-viewed TED Talks
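As a concrete illustration of the feature engineering described above, here is a minimal sketch in R. It assumes the two files that ship with the Kaggle dataset (ted_main.csv and transcripts.csv, joined on url) and their column names; the top-25% cut-off for "most viewed" is an assumption for illustration, not necessarily the definition used in the talk.

```r
library(tidyverse)
library(tidytext)

# Kaggle dataset: talk metadata plus one transcript per talk, joined on the talk URL
ted <- read_csv("ted_main.csv") %>%
  inner_join(read_csv("transcripts.csv"), by = "url")

features <- ted %>%
  mutate(
    minutes        = duration / 60,                       # duration is in seconds
    n_words        = str_count(transcript, "\\S+"),
    n_sentences    = str_count(transcript, "[.!?]+"),
    words_per_min  = n_words / minutes,
    words_per_sent = n_words / pmax(n_sentences, 1),
    # Audience reactions are annotated inside the transcripts
    laughs         = str_count(transcript, fixed("(Laughter)")),
    applause       = str_count(transcript, fixed("(Applause)")),
    questions      = str_count(transcript, fixed("?")),
    # Assumed target definition: top 25% most-viewed talks
    top_talk       = as.integer(views >= quantile(views, 0.75))
  )

# Example n-gram feature: how often each talk says "going to"
going_to <- ted %>%
  unnest_tokens(bigram, transcript, token = "ngrams", n = 2) %>%
  count(url, bigram) %>%
  filter(bigram == "going to")
```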
II. Data Analysis and Model Ensemble
1. Correlation analysis: remove variables that were correlated (libraries: cor, vif)
2. Descriptive analysis: check that the duration of the talks was similar; analyse the frequency of n-grams; standardise views by how long each talk had been shown on the website (libraries: tidyverse, tidytext, smbinning, wordcloud)
3. Additional feature engineering: combine n-grams to reduce the number of features (libraries: cor, vif)
4. Model assessment: go from simple to complex models; understand the most relevant variables of each model; produce an explainable model! (libraries: ROCR, glm, randomForest, xgboost, h2o)
A sketch of steps 1 and 4 in R follows below.
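A minimal sketch of the correlation check and the model assessment in R, assuming the features data frame and top_talk target from the previous sketch; the variable list is illustrative rather than the final 7-variable set.

```r
library(ROCR)

# Correlation analysis: inspect pairwise correlations and drop redundant predictors
num_vars <- features %>%
  select(minutes, words_per_min, words_per_sent, laughs, applause, questions)
round(cor(num_vars, use = "pairwise.complete.obs"), 2)
# e.g. drop words_per_sent if it largely tracks words_per_min

# Model assessment: start simple with a logistic regression ...
model <- glm(top_talk ~ minutes + words_per_min + laughs + applause + questions,
             data = features, family = binomial())
summary(model)   # which variables matter, and in which direction

# ... and score it with ROCR to get the AUC
pred <- prediction(predict(model, type = "response"), features$top_talk)
performance(pred, "auc")@y.values[[1]]
```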
III. Model Insights
• The selected model: a logistic regression scorecard with 7 variables
• AUC: 76%, accuracy: 73%
• Shorter talks were more effective (2x more effective)
• Speak slowly: fewer words per minute is better (2x more effective)
• Ask questions, the more the better (1.5x more effective)
• Make your audience laugh (1.5x more effective)
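Multipliers like "2x more effective" are the kind of number a logistic regression scorecard yields as odds ratios. A hedged sketch, reusing the glm from the sketch above; the real scorecard binned variables with smbinning, so these raw coefficients only approximate that approach.

```r
# Exponentiated logistic-regression coefficients are odds ratios:
# a value around 2 means a one-unit change in that feature roughly
# doubles the odds of being a top talk
round(exp(coef(model)), 2)

# Wald-style 95% confidence intervals on the same odds-ratio scale
round(exp(confint.default(model)), 2)
```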
III. Model Insights
• Libraries and tools used
  • Developed in Shiny, deployed on shinyapps.io
  • shinydashboard for the layout
  • tidyverse for data transformations
  • plotly for graphics
www.speakthedata.com
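A minimal skeleton of such an app, assuming only the packages named above; the dashboard title, the input and the plot are placeholders rather than the real tool's contents.

```r
library(shiny)
library(shinydashboard)
library(plotly)

ui <- dashboardPage(
  dashboardHeader(title = "Speak the Data"),
  dashboardSidebar(
    sliderInput("wpm", "Words per minute", min = 60, max = 220, value = 140)
  ),
  dashboardBody(
    box(plotlyOutput("score_plot"), width = 12)
  )
)

server <- function(input, output, session) {
  output$score_plot <- renderPlotly({
    # Placeholder: the real tool would show the model's view of the chosen inputs
    plot_ly(x = input$wpm, y = 1, type = "scatter", mode = "markers")
  })
}

shinyApp(ui, server)

# Once rsconnect is configured with a shinyapps.io account, deployment is:
# rsconnect::deployApp("path/to/app")
```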
Final remarks
• Text analytics and predictive modelling revealed influential factors that predict the popularity of talks
• R libraries eased the work of data transformation and modelling
• Shiny and shinyapps.io facilitated the deployment of the tool
Thank you! www.speakthedata.com @edco_one