Semalt: How To Scrape Web Pages

23.05.2018
https://rankexperience.com/articles/article2256.html

Beautiful Soup is a Python library widely used to scrape web pages by building a parse tree from XML and HTML documents. Web scraping, a technique for extracting data from websites and pages, is widely used in data analysis and data management. Python is a common prerequisite in data science, and Python 3 offers scraping tools and modules you can apply to your data management project. Currently released as Beautiful Soup 4, the module is compatible with both Python 3 and Python 2.7. Beautiful Soup 4 can also build a parse tree from "tag soup", that is, markup with unclosed tags. In this tutorial, you'll learn how to scrape a page and write the scraped data to a CSV file.

Getting started

To get started, set up a local or server-based Python programming environment. You should also install the Beautiful Soup and Requests modules on your machine. Working knowledge of both modules is a prerequisite, and familiarity with HTML tags and structure is an added advantage.

Understanding your data

In this tutorial, real data from the National Gallery of Art is used to show how Beautiful Soup 4 works. The National Gallery of Art holds about 120,000 pieces by approximately 13,000 artists. The gallery is located in Washington, D.C., United States.
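As a quick check that Beautiful Soup is installed, and to illustrate the earlier point about markup with unclosed tags, the following minimal sketch parses a deliberately malformed snippet. The snippet itself is made up for illustration; only the artist name comes from this tutorial.

```python
from bs4 import BeautifulSoup

# Deliberately malformed markup: neither <p> nor <b> is ever closed.
# The snippet is a made-up example, not taken from a real page.
tag_soup = "<p>Artist: <b>Zabaglia, Niccola"

# html.parser is the parser that ships with Python's standard library.
soup = BeautifulSoup(tag_soup, "html.parser")

# Both tags are still reachable in the parse tree.
bold_text = soup.b.get_text()
paragraph_text = soup.p.get_text()

# prettify() renders the tree with the missing closing tags filled in.
print(soup.prettify())
```

If the import fails, the library is most likely not installed in the active environment.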

Web data extraction with Beautiful Soup is not that complicated. For example, if you focus on the letter Z, note down the first name on the list. In this case, the first name is Zabaglia, Niccola. For consistency, also note the number of pages and the name of the last artist listed.

How to import the Requests and Beautiful Soup libraries

To import the libraries, activate your Python 3 programming environment. Make sure you are in the directory that contains the environment, then run the following command:

    . my_env/bin/activate

Create a new file and import the Beautiful Soup and Requests libraries. Requests lets you use HTTP from your Python programs in a readable way. Beautiful Soup, on the other hand, does the work of scraping the pages quickly. Import Beautiful Soup from the bs4 package.

How to collect and parse a web page

Use Requests to fetch the URL of your first page and assign the response to the variable page. Then build a BeautifulSoup object from the response and parse it with Python's built-in parser (html.parser). In this tutorial, the aim is to collect the artists' names and links. You could also collect other details, for instance the artists' dates and nationalities.

To find where the names sit in the page, open the browser's developer tools on the first artist's name, in this case Zabaglia, Niccola. On Windows, right-click the name. On macOS, hold "CTRL" and click the name. Then choose "Inspect Element" from the menu that pops up. Once you have located the names, print them out to check what Beautiful Soup has parsed.

Removing the bottom links

To remove the bottom links on the web page, inspect the DOM by right-clicking the element. You'll find that the links sit inside an HTML table. Use Beautiful Soup's decompose() method to remove those tags from the parse tree.

How to pull content from a tag

You don't have to print the entire link tag; Beautiful Soup can pull just the content out of a tag. You can also capture the URLs associated with the artists by using Beautiful Soup 4.
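The steps above (fetching a page, parsing it, removing the bottom links, and pulling names and URLs out of the link tags) can be sketched roughly as follows. This is an illustration under stated assumptions, not the gallery's actual markup: the sample HTML, the class names BodyText and AlphaNav, the link paths, and the second artist entry are all hypothetical, and the helper names fetch_html and extract_artists are introduced here for convenience.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical stand-in for a listing page. The class names, link paths,
# and the second entry are assumptions made for this sketch.
SAMPLE_HTML = """
<html><body>
  <div class="BodyText">
    <a href="/artist/zabaglia">Zabaglia, Niccola</a>
    <a href="/artist/example">Example, Artist</a>
  </div>
  <table class="AlphaNav">
    <tr><td><a href="/artists/a">A</a></td><td><a href="/artists/b">B</a></td></tr>
  </table>
</body></html>
"""


def fetch_html(url):
    """Download a page with Requests and return its HTML text."""
    page = requests.get(url, timeout=10)
    page.raise_for_status()  # raise an exception on HTTP error codes
    return page.text


def extract_artists(html):
    """Return (name, href) pairs for the artist links, skipping the nav table."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove the bottom navigation table from the parse tree.
    nav = soup.find("table", class_="AlphaNav")
    if nav is not None:
        nav.decompose()

    artists = []
    for link in soup.find_all("a"):
        name = link.get_text(strip=True)  # text inside the tag, not the whole tag
        href = link.get("href")           # URL associated with the artist
        artists.append((name, href))
    return artists


# Parse the inline sample so the sketch runs without network access.
artists = extract_artists(SAMPLE_HTML)
for name, href in artists:
    print(name, href)
```

Against a live site, the same function would be called as extract_artists(fetch_html(url)), with url set to the first listing page and the selectors adjusted to whatever "Inspect Element" shows for that page.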
Capturing scraped data to a CSV file

A CSV file lets you store structured data in plain text, a format commonly used for spreadsheets. Knowledge of handling plain text files in Python is recommended.

Web data extraction is used to scrape pages and obtain information. Be considerate of the websites you are extracting information from. Some dynamic websites restrict web data extraction on their sites. Scraping a page with Beautiful Soup and Python 3 is that simple.
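The CSV step described above can be sketched with Python's standard csv module. The column headers, the file name artists.csv, the link path, and the helper names are assumptions made for this example; the rows would normally come from the scraping step.

```python
import csv
import os
import tempfile


def write_artists_csv(path, rows):
    """Write a header row followed by (name, link) rows to a CSV file."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Name", "Link"])
        writer.writerows(rows)


def read_csv(path):
    """Read the CSV file back as a list of rows, to check what was written."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


# Hypothetical scraped data; the link path is made up for illustration.
rows = [("Zabaglia, Niccola", "/artist/zabaglia")]

# Write to the system temp directory so the sketch leaves the project untouched.
csv_path = os.path.join(tempfile.gettempdir(), "artists.csv")
write_artists_csv(csv_path, rows)

saved = read_csv(csv_path)
print(saved)
```

Because the artist's name contains a comma, the csv module quotes that field when writing, so the name survives as a single column when the file is read back.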

