Session 7, Wharton Summer Tech Camp: Scrapy and Big Data in Empirical Business Research
What's Scrapy? And Why? • An application framework for crawling websites and scraping & extracting data using APIs • Basically a set of pre-defined classes and instructions for efficiently writing scraping code • It's in Python • Simple once you know the framework • Fast and extensible, with many built-in functions and a good-sized online support community • Some companies use this commercially. It's that powerful.
Scrapy Components • Engine • The main engine that passes items and requests around the framework • Scheduler • Receives requests from the engine and enqueues them for later crawling • Downloader • Downloads the raw HTTP responses and feeds them to the spiders • Spiders • Receive the downloaded responses and extract information • Item Pipeline • Collects extracted items from the spiders and post-processes them • Built-in modules & typical uses • Clean HTML • Validate data & check for duplicates • Store the data in a database
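To make the item pipeline concrete, here is a minimal sketch (not from the slides; the 'price' and 'url' field names are assumptions) that validates data and drops duplicates, two of the typical uses listed above. A pipeline class is activated by adding it to ITEM_PIPELINES in settings.py.

# pipelines.py -- illustrative sketch
from scrapy.exceptions import DropItem

class ValidateAndDedupePipeline:
    """Validate items and drop duplicates before they are stored."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # Validation: drop items that arrived without a price
        if not item.get('price'):
            raise DropItem('Missing price in %s' % item)
        # Duplicate check: keep only the first item seen for each URL
        if item['url'] in self.seen_urls:
            raise DropItem('Duplicate item found: %s' % item['url'])
        self.seen_urls.add(item['url'])
        return item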
Scrapy Framework • Use the command line to create the project folder • Define items.py • Define a Spider for crawling the website • (Optional) Write the item pipeline
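As a sketch of the "define items.py" step (the item and field names here are illustrative, not from the slides), an item simply declares the fields the spider will extract:

# items.py -- illustrative sketch
import scrapy

class WebsiteItem(scrapy.Item):
    # Each Field() declares one piece of data the spider will extract
    name = scrapy.Field()
    url = scrapy.Field()
    description = scrapy.Field()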
Scrapy commands Usage: scrapy <command> [options] [args] Available commands: bench Run quick benchmark test crawl Run a spider fetch Fetch a URL using the Scrapy downloader runspider Run a self-contained spider (without creating a project) settings Get settings values shell Interactive scraping console startproject Create new project version Print Scrapy version view Open URL in browser, as seen by Scrapy [ more ] More commands available when run from project directory
Example & Tutorial • scrapy startproject <foldername> • scrapy startproject tutorial

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
Fully working example • git clone https://github.com/scrapy/dirbot.git • Or download the zip version and extract it into your working directory • https://github.com/scrapy/dirbot
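For flavor, here is a minimal self-contained spider sketch in the spirit of the dirbot example (the class, domain, and field names are illustrative, and selector methods like .get() assume a recent Scrapy version):

# spiders/example_spider.py -- illustrative sketch
from scrapy.spiders import Spider

class ExampleSpider(Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # The downloader fetches each start URL and hands the response here
        for link in response.xpath('//a'):
            yield {
                'name': link.xpath('text()').get(),
                'url': link.xpath('@href').get(),
            }

Run it from inside the project directory with: scrapy crawl example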
1-min HTML • Hyper Text Markup Language • Describes a webpage, mostly with tags

<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
HTML Parts • Each element can have attributes • <a href="http://www.w3schools.com">This is a link</a> • href is an attribute • You can have • Class • ID • Style • etc
XPath • A language for finding information in an XML (or HTML) document
XPath

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
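A short sketch of querying the bookstore document above with Scrapy's Selector (the same expressions work on response objects inside a spider; .get()/.getall() assume a recent Scrapy version, older ones use .extract()):

from scrapy.selector import Selector

xml = """
<bookstore>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>
"""

sel = Selector(text=xml, type='xml')
print(sel.xpath('//book/title/text()').getall())             # ['Harry Potter', 'Learning XML']
print(sel.xpath('//book[price>35.00]/title/text()').get())   # 'Learning XML'
print(sel.xpath('//title/@lang').getall())                    # ['en', 'en']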
Snipplr • http://snipplr.com/all/tags/scrapy/
Videos about Big Data • Some videos • Joy of Stats - Hans Rosling! • http://www.youtube.com/watch?v=CiCQepmcuj8 • http://www.ted.com/playlists/56/making_sense_of_too_much_data.html • http://www.intel.com/content/www/us/en/big-data/big-data-101-animation.html (more relevant; stop at 2:10) • http://motherboard.vice.com/blog/big-data-explained-brilliantly-in-one-short-video (broad; stop at 4:26) • https://www.youtube.com/watch?v=LrNlZ7-SMPk (many interesting stats) • http://blog.varonis.com/10-big-data-videos-watch-right-now/
• "Data is the new oil. Data is just like crude. It's valuable, but if unrefined it cannot really be used." – Clive Humby, Dunnhumby • "The goal is to turn data into information, and information into insight." – Carly Fiorina, former chief executive of Hewlett-Packard (HP) • "You can have data without information, but you cannot have information without data." – Daniel Keys Moran • "We live in the caveman era of Big Data." – Rick Smolan • "From the beginning of recorded time until 2003, we created 5 exabytes of data" (5 billion gigabytes) – Eric Schmidt • In 2011, the same amount was generated in 2 days • In 2013, the same amount was expected to be generated in 10 minutes • Fun fact: the 2014 World Cup final match was estimated to generate 4.3 exabytes of internet traffic
Big data embodies new data characteristics created by today's digitized marketplace • Characteristics of big data (Source: IBM methodology) • Big companies (IBM, Intel, etc.), computer scientists, statisticians, and us!!!
Big data: This is just the beginning • [Chart: data volume in exabytes rises steeply from 2010 to 2015 (axis up to 9,000 exabytes), layered by enterprise data, VoIP, social media, and sensors & devices, while the percentage of uncertain data also climbs (axis up to 100%); the labeled dimensions are Volume, Variety, and Veracity.] • Source: IBM Global Technology Outlook 2012. IBM source data is based on analysis done by the IBM Market Intelligence Department and is provided for illustrative purposes; it is not intended to be a guarantee of future growth rates or market opportunity.
Current Stage of Big Data • "This is the caveman era of big data" • Much of what looks cool is cool because we are looking at these data for the first time, and even a plain correlation can be striking! Mashing up different big data sources can also get scary (e.g., the CMU face-recognition app) • The scientific process always begins with correlation and then moves on to causality as it matures
Big Data: Predictive Analytics vs Causal Inference • Agenda • What's the deal here? • Why should you be aware? • What kind of development is going on right now? • "Big Data and You"
Predictive vs Causal • Statistics underlies both camps • Causal inference is rooted in econometrics • Predictive analytics is rooted in machine learning
The Rise of Predictive Models • Statistics & computer science (logical AI -> statistical AI) • Overflowing data + computational power • Better prediction • Model-free – no theory backing • Blackbox algorithms • Statistical algorithms • Goal: predict well (with big enough data, it works) • Techniques: MANY • Take CIS 520: Machine Learning for a basic intro. At least audit it! It will open up your eyes • Stat 9XX – Statistical Learning Theory, if offered! Also great – will be a lot of probability/stat theory (Sasha Rakhlin) • Online courses: Andrew Ng's course, the Johns Hopkins Data Science course, etc.
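A minimal flavor of the "predict well" goal above, using scikit-learn (assumed installed) on synthetic data; the model is judged purely on held-out prediction error, with no theory behind the features:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))                                # five arbitrary features
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print('held-out RMSE:', rmse)   # low RMSE = good prediction; no causal story needed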
Good Old Causal Inference • Statistics & econometrics • Explore -> Develop theory -> Test with statistical inference models (linear models, graphical models, etc.) • Requirements for "X causes Y": • X must temporally come before Y (not required in predictive models) • X must have a statistically significant relation to Y • The association between X and Y must not be due to an omitted variable (not required in predictive models) • The theory comes from economics, sociology, psychology, etc.
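To make the omitted-variable requirement concrete, a minimal sketch with statsmodels and pandas (both assumed installed): on synthetic data where Z drives both X and Y, regressing Y on X alone overstates X's effect, while controlling for Z recovers it.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.RandomState(0)
n = 1000
z = rng.normal(size=n)                        # confounder: drives both x and y
x = 0.5 * z + rng.normal(size=n)              # "treatment" variable
y = 1.0 * x + 2.0 * z + rng.normal(size=n)    # true effect of x on y is 1.0
df = pd.DataFrame({'y': y, 'x': x, 'z': z})

naive = smf.ols('y ~ x', data=df).fit()             # omits z: biased coefficient on x
controlled = smf.ols('y ~ x + z', data=df).fit()    # controls for the confounder
print('naive:', naive.params['x'], ' controlled:', controlled.params['x'])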
Predictive Analytics vs Causal Inference • Predictive analytics (machine learning, algorithms) • The art of prediction • RMSE / error functions • Causal inference (Rubin causal model, structural) • Theory building • Testing theory with statistical tools and robust experimental design, or with techniques for observational data • Statistics / computer science (algorithms, data mining, machine learning) • Statistics / econometrics (causality – there are different schools of thought even within the causal inference camp; for a brief fun intro, see http://leedokyun.com/obs.pdf) • Paradigm-building in the Kuhnian sense & falsifying existing beliefs in the Popperian sense • Causal inference can do both. Predictive models cannot.
Resources for Causal Inference • Andrew Gelman: Bayesian statistician at Columbia U • http://andrewgelman.com/ • The great fight of 2009 between the Pearlians and the Rubinians! • "Boy, these academic disputes are fun! Such vitriol! Such personal animosity! It's better than reality TV. Did Rubin slap Pearl's mom, or perhaps vice versa?" • "With all due respect, I think you are wrong that Judea does not understand the Rubin approach." – Larry Wasserman • Judea Pearl, "Causality" • Books on observational studies by Paul Rosenbaum • Miguel Hernan and Jamie Robins, "Causal Inference" (now free online) • http://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
Arguments About the Big Data Movement in Industry • A great portion of start-ups, and many established firms, are big data companies these days • Companies are trying to collect everything about everyone; it becomes an unwieldy beast! • http://xkcd.com/882/
The Big Data Approach to Analytics is Different (Industry) • Traditional analytics: structured & repeatable; the structure is built to store the data. Start with a hypothesis, turn it into a question, and test it against selected (analyzed) information to get an answer. Analyze after the data has landed. • Big data analytics: iterative & exploratory; the data is the structure. The data leads the way: explore all the information, identify correlations, and turn them into actionable insight. Analyze in motion.
Arguments About the Big data Movement in Academia Read these interesting pieces featuring the dynamic duo of marketing (Prof Eric Bradlow and Prof Peter Fader) and Prof Eric Clemons of OPIM. http://www.sas.com/resources/asset/SAS_BigData_final.pdf http://knowledge.wharton.upenn.edu/article.cfm?articleid=2186 http://www.datanami.com/datanami/2012-05-03/wharton_professor_pokes_hole_in_big_data_balloon.html
In Academia, we should stay somewhere in the middle • Start from all the information: raw data exploration, reduction & structuring, and pattern & correlation discovery (data mining & machine learning) • Then form hypotheses based on extant theory and run causal data analysis to reach answers, actionable insights, and theory • "ETET" – Empirical, Theory, Empirical, Theory
Some notable examples • [Chart: examples arranged along two axes, causal vs. not causal and small vs. large data. On the causal side: lab experiments, the UCLA stats book, targeted learning, economists (Hans Rosling; Angrist and Krueger; Hal Varian at Google; Susan Athey at Microsoft), structural modeling in marketing, finance research, information systems, and management. On the not-causal, large-data side: association rules, RevolutionR, data mining, Google, Netflix, fraud detection, and machine learning. The not-causal, small-data corner is labeled "What are you doing here?"]
When dealing with unstructured/big data: Causal inference without data mining is myopic and data mining without theory-driven causal inference is blind
Quick Overview of Predictive Analytics (Machine Learning) and Applications
Machine Learning - Types • Supervised learning • Uses labeled training data to identify classes or attributes of new data; fits predictive models • USES: predictive models • Regression, neural networks, support vector machines, etc. • Unsupervised learning • Finds structure in unlabeled data • USES: exploratory analysis, organization, visualization • Clustering, feature extraction, self-organizing maps • Semi-supervised learning • Combines a small amount of labeled data with a large amount of unlabeled data
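A side-by-side sketch of the first two types with scikit-learn (assumed installed): a supervised classifier trained on labeled data, and an unsupervised clustering that finds structure without labels:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_classification
from sklearn.linear_model import LogisticRegression

# Supervised: labeled training data -> predict the class of new data
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
clf = LogisticRegression().fit(X[:400], y[:400])
print('held-out accuracy:', clf.score(X[400:], y[400:]))

# Unsupervised: no labels, just find structure (here, clusters) in the data
X_unlabeled, _ = make_blobs(n_samples=500, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_unlabeled)
print('cluster sizes:', [int((labels == k).sum()) for k in range(3)])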
Machine Learning Broad Applications • Face Detection • Spam Detection • Song Recognition • Signature/Zipcode Recognition • Micro Array • Astrophysics • Medical • Consumer segmentation/targeting • Recommendation Algorithms
Machine Learning Business (research) Applications • Unstructured data -> Structured data • Natural Language Processing • Spoken Language Processing • Computer Vision • Exploratory Analysis • Clustering • Anomaly detection • Visualization • Dimensionality reduction • Multi-Dimensional Scaling • Some people have started to incorporate machine learning techniques into causal inference • Machine learning in matching (PSM) • Targeted Learning, 2012 Springer Series • (http://www.targetedlearningbook.com/)
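One concrete example of the "machine learning in matching (PSM)" bullet above: estimate propensity scores with a logistic regression, match each treated unit to the control with the closest score, and compare outcomes (a toy sketch using scikit-learn and NumPy, both assumed installed):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n = 2000
X = rng.normal(size=(n, 3))                               # observed covariates
treat_prob = 1 / (1 + np.exp(-X[:, 0]))                   # treatment depends on X[:, 0]
treated = (rng.uniform(size=n) < treat_prob).astype(int)
y = 2.0 * treated + X[:, 0] + rng.normal(size=n)          # true treatment effect = 2.0

# Step 1: propensity score = estimated P(treated | X)
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Step 2: match each treated unit to the control with the nearest score
t_idx = np.where(treated == 1)[0]
c_idx = np.where(treated == 0)[0]
nearest = np.abs(ps[c_idx][None, :] - ps[t_idx][:, None]).argmin(axis=1)
matched_controls = c_idx[nearest]

# Step 3: average outcome difference over the matched pairs
print('estimated treatment effect:', (y[t_idx] - y[matched_controls]).mean())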
Intro to Practical Natural Language Processing • Agenda • Brief light-hearted intro to NLP (what is it and why should I care?) • Basic ideas in NLP • Usage in business research
Quick Overview • What is Natural (Spoken) Language Processing (NLP)? • Examples • How this technology may affect: • Industry • Academics
Natural Language Processing • Natural language processing is an interdisciplinary field drawing on computer science, statistics, and linguistics. It is concerned with making computers able to parse natural language (human language such as English), understand it (knowledge representation), store it (knowledge databases), and ultimately interact in it (convey information). • Methods: machine learning, Bayesian statistics, algorithms, higher-order logic, linguistics.
Subcategories of NLP • Information retrieval: Google; optimizing search over text databases • Information extraction: the crude basic form is web crawling + regex (a toy sketch follows below); you'll see a really sophisticated form later – Thomson Reuters • Machine translation • Sentiment analysis, and more
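A sketch of that crude "crawling + regex" form of information extraction: pull email addresses and dollar amounts out of raw text with Python's re module (the patterns are deliberately simplified for illustration):

import re

text = """Contact the lab at data.lab@example.edu or call 215-555-0100.
The dataset license costs $1,250 per year; support is an extra $300."""

# Simplified patterns; real-world extraction needs far more robust rules
email_pattern = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
money_pattern = re.compile(r'\$[\d,]+(?:\.\d{2})?')

print(email_pattern.findall(text))   # ['data.lab@example.edu']
print(money_pattern.findall(text))   # ['$1,250', '$300']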
Cool Applications • NSA – uses NLP to detect anomalous activity in internet traffic and phone calls for terrorist activity (and on us…) • Lie detection via spoken language processing • Automatic plagiarism detectors • ETS testing – since 1999, the "e-rater" has automatically scored essays on the GMAT, GRE, and TOEFL • Shazam – song discovery (an application of spoken language processing) • News aggregators organized by topic • Entertainment – Cleverbot (judged human in a Turing test 59.3% of the time vs 63.3% for a real human); it has really evolved from dumb predecessors such as ELIZA and SmarterChild
Business Applications • Marketing – sentiment analysis and demand analysis of products from reviews and blogs, e.g. movies, consumer products • Marketing – opinion mining, subjectivity analysis, emotion detection, opinion spam detection, etc. • Finance – quantitative and qualitative high-frequency trading (Thomson Reuters, Bloomberg) • Management – resume filtering and firm-employee matching • Legal studies – legal document search engines • E-commerce – help chatbots
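For the sentiment-analysis use case, a minimal sketch with NLTK's VADER analyzer (NLTK and a one-time download of its vader_lexicon are assumed; the review texts are made up):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')   # one-time download of the sentiment lexicon

sia = SentimentIntensityAnalyzer()
reviews = [
    "This camera is fantastic, the battery lasts all day!",
    "Terrible customer service, and the product broke after a week.",
]
for review in reviews:
    # The compound score runs from -1 (most negative) to +1 (most positive)
    print(sia.polarity_scores(review)['compound'], review)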
Mainstream Applications • Siri (dumb) – preprogrammed, no learning • IBM Watson / Wolfram Alpha (smart): • semantic representation of concepts • acquisition of knowledge • a logical inference machine • As of 2011, Watson had knowledge equivalent to a second-year medical student (which isn't saying much, but is still cool given the speed at which Watson learns)
Watson gets an attitude Well no $@!# Sherlock! You mea@#$%s can bite my shiny metal !@$ IBM Watson learned the Urban Dictionary in 2013… "Watson couldn't distinguish between polite language and profanity -- which the Urban Dictionary is full of. Watson picked up some bad habits from reading Wikipedia as well. In tests it even used the word "bullshit" in an answer to a researcher's query. Ultimately, Brown's 35-person team developed a filter to keep Watson from swearing and scraped the Urban Dictionary from its memory."
Some fun facts • 15,000 • Average number of words spoken by an average person per day (various sociology and linguistics studies); approximately 15 words per minute, assuming 8 hours of sleep • 100 million to 300 million: • Average number of words spoken by an average person in a lifetime • 100 TRILLION: • Approximate number of words on the internet in 2007, per Peter Norvig (who leads Google research; AI scientist)
Reasons why you should at least acknowledge NLP and keep it in mind for the rest of your life • It will definitely be a disruptive technology and a large part of everyday life, affecting most types of business (it has already disrupted finance, marketing, management, etc.) • Text data is exploding: the web, company performance reports, news, securities filings, etc. • Even in business research outside of Information Systems Management and Marketing, more and more researchers are utilizing NLP