In this article I will demonstrate how easy it is to perform basic text Web scraping using Python and just a few lines of code.
The example has been developed and tested using Python 3.5.2.
The first step is to check whether the following third-party libraries are already installed: Requests and Beautiful Soup 4. Start IDLE and type the following command:
import requests
After you press return, if you see no error messages then requests is installed. If you see an error message that shows requests has not been found, you should install it using pip from the command line as shown below.
pip install requests
Repeat the process to check whether the Beautiful Soup library is already installed. Fortunately, you don’t have much to type:
import bs4
Again, if Python complains that it can’t find the library, use pip from the command line to install it.
pip install beautifulsoup4
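The same check can also be scripted rather than typed into IDLE. Here is a minimal sketch that uses only the standard library; the helper name missing_libraries is my own invention and not part of either package. Note that Beautiful Soup 4 is installed as beautifulsoup4 but imported as bs4.

```python
import importlib.util


def missing_libraries(names):
    """Return the names from `names` that Python cannot find for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "requests" and "bs4" are the import names of the two libraries used here.
missing = missing_libraries(["requests", "bs4"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("Both libraries are installed.")
```

Any name reported as not installed can then be installed with pip as shown above.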
With the libraries installed, here is a program that scrapes this site. It returns the titles from the blog posts that are shown on this page.
To demonstrate how this is achieved with just a few lines of code, here is the program without comments:
import requests, bs4

def getTitlesFromMySite(url):
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('.entry-title')
    return elems

titles = getTitlesFromMySite('https://oraclefrontovik.com')

for title in titles:
    print(title.text)
Now the same code, but this time with each section commented:
# import requests (for downloading web pages) and Beautiful Soup (for parsing HTML)
import requests, bs4

# create a function that accepts a url as a parameter
def getTitlesFromMySite(url):
    # download the webpage and store it in the res variable
    res = requests.get(url)

    # check for problems - if there are, raise_for_status() raises an exception
    # and the program stops at this point
    res.raise_for_status()

    # running the downloaded webpage through Beautiful Soup returns a
    # Beautiful Soup object which represents the HTML as a nested data structure.
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # store in a list the elements that match this CSS selector.
    # I will explain how I obtained this selector below
    elems = soup.select('.entry-title')
    return elems

# call the function and store the results in titles
titles = getTitlesFromMySite('https://oraclefrontovik.com')

# loop through the list, printing out each title.
for title in titles:
    print(title.text)
Running the example produces the following output:
Learn C# in One Day and Learn It Well – Review
Contributing to an Open Source Project
A step by step guide to building a Raspberry Pi Hedgehog camera
Is there more than one reason to use PL/SQL WHERE CURRENT OF ?
Structured Basis Testing
Raspberry Pi connected to WiFi but no internet access
The auditing capabilities of Flashback Data Archive in Oracle 12c.
DBMS_UTILITY.FORMAT_ERROR_BACKTRACE and the perils of the RAISE statement
Using INSERT ALL with related tables
The best lesson I learnt from Steve McConnell
To summarise, the code imports two third-party libraries, requests and Beautiful Soup 4, that perform the lion’s share of the work. In the example I use the requests library to download a web page as HTML, then pass it to Beautiful Soup and use a CSS selector to return the information I want from it.
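In the example, a failed download stops the program because raise_for_status() raises an exception. If you would rather report the problem and carry on, one option is to catch the exception. The following is only a sketch: the function name fetch_html and the 10 second timeout are my own assumptions and are not part of the program above.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the download fails."""
    try:
        # timeout stops the call from waiting indefinitely on a slow server
        res = requests.get(url, timeout=timeout)
        # raises requests.exceptions.HTTPError for 4xx and 5xx responses
        res.raise_for_status()
    except requests.exceptions.RequestException as err:
        # RequestException is the base class for the errors requests raises
        print("Download failed:", err)
        return None
    return res.text
```

Calling fetch_html('https://oraclefrontovik.com') would return the HTML as a string that can be handed to Beautiful Soup, or None if anything went wrong during the download.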
Obtaining the CSS selector
The code example has the following line which extracts the part of the webpage, the blog post titles, that we are interested in:
elems = soup.select('.entry-title')
Using Firefox, I obtained the CSS selector ‘.entry-title’ as follows:
- Navigated to the page of interest, in this case oraclefrontovik.com
- Opened the Firefox developer tools (Ctrl + Shift + I)
- Highlighted the first title (which at the time of writing was Learn C# in One Day and Learn It Well – Review), right-clicked and selected Inspect Element
- In the Inspector, right-clicked the highlighted element, selected Copy and then chose CSS Selector from the sub menu
At the time of writing, I was unable to get the same CSS selector using the native developer tools in Chrome. If you know of a way, please let me know in the comments.
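Once a selector has been copied from the browser, it can be tried out on a small piece of HTML before running it against the live site. The sketch below assumes a made-up HTML fragment that only imitates the structure of a blog index page; SAMPLE_HTML and titles_from_html are hypothetical names of my own.

```python
import bs4

# Hypothetical fragment imitating a blog index page; not copied from the real site.
SAMPLE_HTML = """
<article><h2 class="entry-title"><a href="/post-1">First post</a></h2></article>
<article><h2 class="entry-title"><a href="/post-2">Second post</a></h2></article>
<div class="sidebar"><h2>Not a post title</h2></div>
"""


def titles_from_html(html, selector=".entry-title"):
    """Return the text of every element in `html` that matches `selector`."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    return [elem.get_text(strip=True) for elem in soup.select(selector)]


print(titles_from_html(SAMPLE_HTML))  # ['First post', 'Second post']
```

The heading in the sidebar is not returned because it does not carry the entry-title class, which is the behaviour the selector is meant to give.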
In this post I have walked through the steps to perform basic text Web scraping using Python 3.