How can I do this using urllib or requests in Python? From urllib.request import urlopen from. About; Products. Web scraping with Python urllib or requests. Ask Question Asked 3 years, 2 months ago. Browse other questions tagged python web python-requests. Check out the tutorial on how to scrape dynamic web pages with Python. Learn how to extract data with Selenium, headless browsers, and the web scraping API.
- In order to be able to do web scraping with Python, you will need a basic understanding of HTML and CSS. This is so you understand the territory you are working in. You don’t need to be an expert but you do need to know how to navigate the elements on a web-page using an inspector such as chrome dev tools.
- With Python's requests (pip install requests) library we're getting a web page by using get on the URL. The response r contains many things, but using r.content will give us the HTML. Once we have the HTML we can then parse it for the data we're interested in analyzing.
- In this Python Programming Tutorial, we will be learning how to scrape websites using the Requests-HTML library. Requests-HTML is an excellent tool for parsi.
One of the awesome things about Python is how relatively simple it is to do pretty complex and impressive tasks. A great example of this is web scraping.
Staruml for mac os. This is an article about web scraping with Python. In it we will look at the basics of web scraping using popular libraries such as requests
and beautiful soup
.
Topics covered:
- What is web scraping?
- What are
requests
andbeautiful soup
? - Using CSS selectors to target data on a web-page
- Getting product data from a demo book site
- Storing scraped data in CSV and JSON formats
What is Web Scraping?
Some websites can contain a large amount of valuable data. Web scraping means extracting data from websites, usually in an automated fashion using a bot or web crawler. The kinds or data available are as wide ranging as the internet itself. Common tasks include
- scraping stock prices to inform investment decisions
- automatically downloading files hosted on websites
- scraping data about company contacts
- scraping data from a store locator to create a list of business locations
- scraping product data from sites like Amazon or eBay
- scraping sports stats for betting
- collecting data to generate leads
- collating data available from multiple sources
Legality of Web Scraping
There has been some confusion in the past about the legality of scraping data from public websites. This has been cleared up somewhat recently (I’m writing in July 2020) by a court case where the US Court of Appeals denied LinkedIn’s requests to prevent HiQ, an analytics company, from scraping its data.
The decision was a historic moment in the data privacy and data regulation era. It showed that any data that is publicly available and not copyrighted is potentially fair game for web crawlers.
However, proceed with caution. You should always honour the terms and conditions of a site that you wish to scrape data from as well as the contents of its robots.txt
file. You also need to ensure that any data you scrape is used in a legal way. For example you should consider copyright issues and data protection laws such as GDPR. Also, be aware that the high court decision could be reversed and other laws may apply. This article is not intended to prvide legal advice, so please do you own research on this topic. One place to start is Quora. There are some good and detailed questions and answers there such as at this link
One way you can avoid any potential legal snags while learning how to use Python to scrape websites for data is to use sites which either welcome or tolerate your activity. One great place to start is to scrape – a web scraping sandbox which we will use in this article.
An example of Web Scraping in Python
You will need to install two common scraping libraries to use the following code. This can be done using Mac os for powerpc g4.
pip install requests
and
pip install beautifulsoup4
in a command prompt. For details in how to install packages in Python, check out Installing Python Packages with Pip.
The requests
library handles connecting to and fetching data from your target web-page, while beautifulsoup
enables you to parse and extract the parts of that data you are interested in.
Let’s look at an example:
So how does the code work?
In order to be able to do web scraping with Python, you will need a basic understanding of HTML and CSS. This is so you understand the territory you are working in. You don’t need to be an expert but you do need to know how to navigate the elements on a web-page using an inspector such as chrome dev tools. If you don’t have this basic knowledge, you can go off and get it (w3schools is a great place to start), or if you are feeling brave, just try and follow along and pick up what you need as you go along.
To see what is happening in the code above, navigate to http://books.toscrape.com/. Place your cursor over a book price, right-click your mouse and select “inspect” (that’s the option on Chrome – it may be something slightly different like “inspect element” in other browsers. When you do this, a new area will appear showing you the HTML which created the page. You should take particular note of the “class” attributes of the elements you wish to target.
In our code we have
This uses the class attribute and returns a list of elements with the class product_pod
.
Then, for each of these elements we have:
The first line is fairly straightforward and just selects the text of the h3
element for the current product. The next line does lots of things, and could be split into separate lines. Basically, it finds the p
tag with class price_color
within the div
tag with class product_price
, extracts the text, strips out the pound sign and finally converts to a float. This last step is not strictly necessary as we will be storing our data in text format, but I’ve included it in case you need an actual numeric data type in your own projects.
Storing Scraped Data in CSV Format
csv
(comma-separated values) is a very common and useful file format for storing data. It is lightweight and does not require a database.
Add this code above the if __name__ '__main__':
line
and just before the line print('### RESULTS ###')
, add this:
store_as_csv(data, headings=['title', 'price'])
When you run the code now, a file will be created containing your book data in csv format. Pretty neat huh?
Storing Scraped Data in JSON Format
Another very common format for storing data is JSON
(JavaScript Object Notation), which is basically a collection of lists and dictionaries (called arrays and objects in JavaScript).
Add this extra code above if __name__ ..:
and store_as_json(data)
above the print('### Results ###')
line.
So there you have it – you now know how to scrape data from a web-page, and it didn’t take many lines of Python code to achieve!
Full Code Listing for Python Web Scraping Example
Here’s the full listing of our program for your convenience.
One final note. We have used requests
and beautifulsoup
for our scraping, and a lot of the existing code on the internet in articles and repositories uses those libraries. However, there is a newer library which performs the task of both of these put together, and has some additional functionality which you may find useful later on. This newer library is requests-HTML
and is well worth looking at once you have got a basic understanding of what you are trying to achieve with web scraping. Another library which is often used for more advanced projects spanning multiple pages is scrapy
, but that is a more complex beast altogether, for a later article.
Working through the contents of this article will give you a firm grounding in the basics of web scraping in Python. I hope you find it helpful
Happy computing.
Internet extends fast and modern websites pretty often use dynamic content load mechanisms to provide the best user experience. Still, on the other hand, it becomes harder to extract data from such web pages, as it requires the execution of internal Javascript in the page context while scraping. Let's review several conventional techniques that allow data extraction from dynamic websites using Python.
What is a dynamic website?#
A dynamic website is a type of website that can update or load content after the initial HTML load. So the browser receives basic HTML with JS and then loads content using received Javascript code. Such an approach allows increasing page load speed and prevents reloading the same layout each time you'd like to open a new page.
Usually, dynamic websites use AJAX to load content dynamically, or even the whole site is based on a Single-Page Application (SPA) technology.
In contrast to dynamic websites, we can observe static websites containing all the requested content on the page load.
A great example of a static website is example.com
:
The whole content of this website is loaded as a plain HTML while the initial page load.
To demonstrate the basic idea of a dynamic website, we can create a web page that contains dynamically rendered text. It will not include any request to get information, just a render of a different HTML after the page load:
All we have here is an HTML file with a single <div>
in the body that contains text - Web Scraping is hard
, but after the page load, that text is replaced with the text generated by the Javascript:
To prove this, let's open this page in the browser and observe a dynamically replaced text:
Alright, so the browser displays a text, and HTML tags wrap this text.
Can't we use BeautifulSoup or LXML to parse it? Let's find out.
Extract data from a dynamic web page#
BeautifulSoup is one of the most popular Python libraries across the Internet for HTML parsing. Almost 80% of web scraping Python tutorials use this library to extract required content from the HTML.
Let's use BeautifulSoup for extracting the text inside <div>
from our sample above.
This code snippet uses os
library to open our test HTML file (test.html
) from the local directory and creates an instance of the BeautifulSoup library stored in soup
variable. Using the soup
we find the tag with id test
and extracts text from it.
In the screenshot from the first article part, we've seen that the content of the test page is I ❤️ ScrapingAnt
, but the code snippet output is the following:
And the result is different from our expectation (except you've already found out what is going on there). Everything is correct from the BeautifulSoup perspective - it parsed the data from the provided HTML file, but we want to get the same result as the browser renders. The reason is in the dynamic Javascript that not been executed during HTML parsing.
We need the HTML to be run in a browser to see the correct values and then be able to capture those values programmatically.
Below you can find four different ways to execute dynamic website's Javascript and provide valid data for an HTML parser: Selenium, Pyppeteer, Playwright, and Web Scraping API.
Selenuim: web scraping with a webdriver#
Selenium is one of the most popular web browser automation tools for Python. It allows communication with different web browsers by using a special connector - a webdriver.
To use Selenium with Chrome/Chromium, we'll need to download webdriver from the repository and place it into the project folder. Don't forget to install Selenium itself by executing:
Selenium instantiating and scraping flow is the following:
- define and setup Chrome path variable
- define and setup Chrome webdriver path variable
- define browser launch arguments (to use headless mode, proxy, etc.)
- instantiate a webdriver with defined above options
- load a webpage via instantiated webdriver
In the code perspective, it looks the following:
And finally, we'll receive the required result:
Selenium usage for dynamic website scraping with Python is not complicated and allows you to choose a specific browser with its version but consists of several moving components that should be maintained. The code itself contains some boilerplate parts like the setup of the browser, webdriver, etc.
I like to use Selenium for my web scraping project, but you can find easier ways to extract data from dynamic web pages below.
Pyppeteer: Python headless Chrome#
Web Scraping With Requests Python Pdf
Pyppeteer is an unofficial Python port of Puppeteer JavaScript (headless) Chrome/Chromium browser automation library. It is capable of mainly doing the same as Puppeteer can, but using Python instead of NodeJS.
Puppeteer is a high-level API to control headless Chrome, so it allows you to automate actions you're doing manually with the browser: copy page's text, download images, save page as HTML, PDF, etc.
To install Pyppeteer you can execute the following command:
The usage of Pyppeteer for our needs is much simpler than Selenium:
I've tried to comment on every atomic part of the code for a better understanding. However, generally, we've just opened a browser page, loaded a local HTML file into it, and extracted the final rendered HTML for further BeautifulSoup processing.
As we can expect, the result is the following:
We did it again and not worried about finding, downloading, and connecting webdriver to a browser. Though, Pyppeteer looks abandoned and not properly maintained. This situation may change in the nearest future, but I'd suggest looking at the more powerful library.
Playwright: Chromium, Firefox and Webkit browser automation#
Playwright can be considered as an extended Puppeteer, as it allows using more browser types (Chromium, Firefox, and Webkit) to automate modern web app testing and scraping. You can use Playwright API in JavaScript & TypeScript, Python, C# and, Java. And it's excellent, as the original Playwright maintainers support Python.
The API is almost the same as for Pyppeteer, but have sync and async version both.
Installation is simple as always:
Let's rewrite the previous example using Playwright.
As a good tradition, we can observe our beloved output:
We've gone through several different data extraction methods with Python, but is there any more straightforward way to implement this job? How can we scale our solution and scrape data with several threads?
Meet the web scraping API!
Web Scraping API#
ScrapingAnt web scraping API provides an ability to scrape dynamic websites with only a single API call. It already handles headless Chrome and rotating proxies, so the response provided will already consist of Javascript rendered content. ScrapingAnt's proxy poll prevents blocking and provides a constant and high data extraction success rate.
Usage of web scraping API is the simplest option and requires only basic programming skills.
You do not need to maintain the browser, library, proxies, webdrivers, or every other aspect of web scraper and focus on the most exciting part of the work - data analysis.
As the web scraping API runs on the cloud servers, we have to serve our file somewhere to test it. I've created a repository with a single file: https://github.com/kami4ka/dynamic-website-example/blob/main/index.html
To check it out as HTML, we can use another great tool: HTMLPreview
The final test URL to scrape a dynamic web data has a following look: http://htmlpreview.github.io/?https://github.com/kami4ka/dynamic-website-example/blob/main/index.html
The scraping code itself is the simplest one across all four described libraries. We'll use ScrapingAntClient library to access the web scraping API.
Let's install in first:
And use the installed library:
To get you API token, please, visit Login page to authorize in ScrapingAnt User panel. It's free.
And the result is still the required one.
Web Scraping With Requests Python Programming
All the headless browser magic happens in the cloud, so you need to make an API call to get the result.
Check out the documentation for more info about ScrapingAnt API.
Summary#
Today we've checked four free tools that allow scraping dynamic websites with Python. All these libraries use a headless browser (or API with a headless browser) under the hood to correctly render the internal Javascript inside an HTML page. Below you can find links to find out more information about those tools and choose the handiest one:
Happy web scraping, and don't forget to use proxies to avoid blocking 🚀