70 likes | 85 Views
Learn how to collect data from webpages using web scraping techniques. This tutorial covers the basics of web scraping, including downloading webpages using the Requests library and parsing them using BeautifulSoup. The dataset used in this tutorial contains approximately 12,000 reviews for 180 laptops, with a total of about 712,000 review words. Each review varies in length, ranging from 2 words to about 600 words, with an average of about 60 words.
E N D
Build a Text Dataset from AMAZON Raymond ZHAO Wenlong(03/07/2018)
Collect data In the data age In StatisticalML/DL/NLP, volumes of data is a key. We could collect data from the wide world of web.
HTML HTML stands for Hyper Text Markup Language. HTML describes the structure of Web pages.
Web scraping Download the webpage and parse it.
The process Download:Requests is a HTTP library Parse:BeautifulSoup is to parses a web page See the developed script amazon_scraper.py
The dataset There are about 12k reviews for 180 laptops, and about 712k review words totally. Each review is from 2 words to about 600 words; The mean is about 60 words. See the AMAZON dataset amazon_reviews.json
Thanks Thanks Dr. Wong, David and Linkai