Getting Started With beautifulsoup4 – Tutorial For Web Scrapers – Part 1

Introduction

Python is one of the most versatile languages out there: from simple automation scripts to complex artificial intelligence applications, it is everywhere. One of the areas where the language truly shines is anything data-related.

Web scraping is the process of extracting data from the internet using different technologies, and in this first article we will explore and learn the basics of one of those technologies: beautifulsoup4.

This post is for beginners who want an easy way to get started with web scraping and data extraction.

I’ve divided this mini-series into 3 parts, each one more advanced than the previous. I made this division so that the content and information are easier for beginners to digest.

Overview

This is part 1 of a 3-part mini-series on web scraping using beautifulsoup4. This first part is for total beginners who are interested in the basics of the package: setting it up, taking the first steps, understanding how it works in detail, and so on.

In this first part we will start by setting everything up, then explain the theory behind the package, and finally extract data from a local html file.

In the next parts, we will use the requests package to get web pages and extract real data from real websites.

For those interested in reading the documentation, here is the official beautifulsoup4 Documentation page.

beautifulsoup4 vs bs4

You will come across both packages being installed, so what’s the difference?

Both names install the same package, managed by the same team. Always go with beautifulsoup4 since it’s the official name.

According to the official documentation: “bs4 is a package that ensures that if you type pip install bs4 by mistake you will end up with beautifulsoup4.”

In simple terms, beautifulsoup4 is the official one and bs4 redirects to it.
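
One detail worth noting here (we will see it again when we write the code): whichever name you install under, the module you import in python is called bs4:

# installed as beautifulsoup4, but imported under the module name bs4
from bs4 import BeautifulSoup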

Setting Up beautifulsoup4

First, we need to install beautifulsoup4. After installing Python from the official website, open a command prompt on Windows or a terminal on Linux and Mac and type the following command:

pip install beautifulsoup4

This command will install beautifulsoup4. Sometimes this command doesn’t work, so you can try one of the following commands instead:

pip3 install beautifulsoup4
python -m pip install beautifulsoup4 
python3 -m pip install beautifulsoup4 

If none of these work, search for the error message on Google, because there are many possible reasons why this happens.
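
Once one of the commands succeeds, you can quickly confirm that the package is importable with a one-liner (a small sanity check; make sure you run it with the same python you installed into):

python -c "import bs4; print(bs4.__version__)"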

Why can’t we scrape data using beautifulsoup4 alone?

beautifulsoup4 allows us to search and parse html, but we need the html first to do that. That’s why, as web scrapers, we always need a second package when working with beautifulsoup4. That second package’s main purpose is to get the html page from the server to beautifulsoup4. The most used package for sending requests and getting responses is requests.

For more documentation on web scraping theory, check out these articles:

Web Scraping for non-programmers

Web Scraping for programmers

We install requests as follows:

pip install requests
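
Just to give a taste of what’s coming in the next parts, here is a minimal sketch of how the two packages work together (using https://example.com as a placeholder URL; we will cover this properly later):

# a minimal preview: requests gets the html, beautifulsoup4 parses it
import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')      # get the page from the server
soup = BeautifulSoup(response.text, 'html.parser')  # parse the html text
print(soup.title.text)                              # print the page's title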

How does beautifulsoup4 work?

When python reads an html file on our computer or requests an html page from a website, it reads it as plain text, which is not very practical for us. We need to parse that text back into html and extract data using html tags. That’s why we need a package like beautifulsoup4: it allows us to do both of those operations.

The process goes as follows:

  • Get html from a file or a website page using python. Python automatically reads it as text.
  • Use beautifulsoup4 to parse the text into html again.
  • Locate and extract the desired data using beautifulsoup4.
  • Save that data.

Every one of these steps is very important, and we will understand them better with each article in the mini-series.

To give an idea about each step:

Get the html: As we explained before, beautifulsoup4 doesn’t have the capability to send a request and get the html from a website like other python packages and frameworks do (e.g. Scrapy). In the coming articles we will use requests for that.

Parse the html: Once we have the html, we will parse it using the library. This is a simple one-line step.

Locate the data: Locating the data is done using special methods that allow us to filter the entire html tree.

Save the data: After the data is located and stored in a variable, we can save it in any format we want (JSON, CSV, etc.). This step is pure python.
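
To make these four steps concrete before we dive in, here is a minimal sketch of the whole pipeline, assuming a local file named index.html (like the one we will use below) and JSON as the output format:

# step 1: get the html (here, from a local file; python reads it as text)
from bs4 import BeautifulSoup
import json

with open('index.html', 'r') as f:
    # step 2: parse the text into an html tree
    soup = BeautifulSoup(f, 'html.parser')

# step 3: locate the desired data
heading = soup.find('h1').text

# step 4: save the data (a pure python step, here as JSON)
with open('data.json', 'w') as out:
    json.dump({'heading': heading}, out)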

Parsing data from the html file

In this section we will use an html file as an example to extract data from. I chose to start this first part with an html file to make it easy for beginners to understand how beautifulsoup4 works.

Note that on beautifulsoup4’s side, the same process applies whether we want to scrape data from an html file (locally) or from a website’s page. The only thing that changes is that, locally, we open the html file, whereas on a website we send a request to get the html of the web page.

Here’s the html file we will be practicing on (I named it index.html):

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="ie=edge">
    <title>Test for beautifulsoup4</title>
    <link rel="stylesheet" href="style.css">
  </head>
  <body>
   <h1>This is the heading</h1>
   <div class="first_div">
    <a href="link_one.com"> click me I'm link one </a>
    <a href="link_two.com"> click me I'm link two </a>
    <a href="link_three.com"> click me I'm link three </a>
   </div>
   <div>
    <span id="span_one"><a href="link_inside_span.com"> click me I'm link inside a span </a></span>
    <span id="span_two">
     <ul>
      <li>I'm li one</li>
      <li>I'm li two</li>
      <li>I'm li three</li>
     </ul>
    </span>
   </div>
  </body>
</html>

Now that we have our html, we can start by creating a python file (we will name it scraper.py).

First we will import beautifulsoup4 inside our python file (script), then load the html in, and finally start parsing the data.

We import the packages as follows:

# first we will import beautifulsoup4
from bs4 import BeautifulSoup 

Note: in python, any line that starts with # is a comment, meaning it won’t be executed. We use comments to explain the code and save notes for future modifications and updates.

In the above example, only the second line is executed.

Now we can load the html into our scraper.py file as follows:

# loading the html file content into the variable html_doc 
with open('index.html', 'r') as f:
    html_doc = BeautifulSoup(f, 'html.parser')

Note: index.html and scraper.py are in the same directory (same folder).

In case you are not familiar with what we did: we used python’s “with open” to open the index.html file in read mode ( ‘r’ ) and named the resulting file object “f“.

Then we used BeautifulSoup() to parse the content of the file ( ‘f‘ ) using the html parser ( ‘html.parser‘ ). That’s how we parse html files and html web pages: we pass them to the BeautifulSoup() class that we imported.

Then we saved the parsed html in a variable named html_doc. Using html_doc, we can extract the data we want.
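
If you want to make sure the file was loaded and parsed correctly, beautifulsoup4 can print the parsed tree back out, nicely indented:

# print the parsed html with indentation to verify it loaded correctly
print(html_doc.prettify())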

How do we extract the data now that we parsed the html?

We extract the data using the many functions that come with beautifulsoup4. The functions most used by web scrapers are find() and find_all(). Both functions take two arguments: an html tag name (h1, p, ul, img, etc.) and a CSS attribute (class, id, etc.) to locate the element inside the parsed html.

The difference between the two functions is that find() returns a single element (the first match), whereas find_all() returns every matching element as a list.

For example, let’s say we have the following <h2> tag with a class of “heading”:

<h2 class="heading">I'm the title</h2>

We would get that element as follows:

# locate and get the h2 tag with a class of heading
h2_element = html_doc.find('h2', {'class':'heading'})

Note: the h2_element is the entire tag. If we just want the text, we add .text as follows:

# locate and get the text from the h2 tag with a class of heading
h2_element_text = html_doc.find('h2', {'class':'heading'}).text

# let's print the resulats
print(h2_element_text)

I added the print() function to show us the result in the command prompt/terminal:

I'm the title

This is what got printed: the text inside the <h2> tag, meaning we scraped the data successfully.

Now let’s extract data from the index.html file that we have:

# extracting the text of the h1 element
h1_text = html_doc.find('h1').text

Note: we got the text of the h1 without using a CSS attribute (class, id, etc.). That’s because the h1 tag doesn’t have any CSS attribute!
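
Our index.html also gives us a chance to use the other CSS attributes. As a small sketch based on the file above, we can locate a tag by its id and read an attribute like href using square brackets:

# locate the span with the id span_one
span_one = html_doc.find('span', {'id': 'span_one'})

# get the a tag inside it and read its href attribute
link = span_one.find('a')
print(link['href'])  # prints: link_inside_span.com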

The question here is: if we have multiple elements with the same tag, but none of them has an attribute (or they all share the same one), and we don’t want all of them, what do we do?

There are many techniques we can use. We will see some of the advanced searching and filtering techniques in the next posts; in this beginner-friendly example we will just use find_all() and then iterate over the resulting list.

For the next example, we want to extract the second li tag in our index.html:

# we want to get all the li elements
li_list = html_doc.find_all('li')

# print li_list
print(li_list)

Output in the terminal/command prompt:

[<li>I'm li one</li>, <li>I'm li two</li>, <li>I'm li three</li>]

As you can see, we have a list of the three li elements, so if we want the second one we get it as follows:

# getting the second li
second_li = li_list[1]

# getting the text of the second li
second_li_text = li_list[1].text

# printing the second li
print(second_li)

# printing the text of the second li
print(second_li_text)

Output:

<li>I'm li two</li>
I'm li two
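
And since find_all() returns a list, we can also loop over it instead of indexing, which is handy when we don’t know in advance how many elements the page contains:

# looping over every li element returned by find_all()
for li in li_list:
    print(li.text)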

Conclusion

In this article we explored how to set up beautifulsoup4, explained briefly how the package works, and then extracted data from the html file we had. As I mentioned before, the same process applies when scraping data from a real, live website; what changes is how we get the web page.

We didn’t go into much detail in this article since this is just part 1 of a 3-part series. In the next part, we will go deeper and explore the different tactics for scraping live websites and the tools we use. We will also filter more complex html trees, since the bigger the website, the more complex the html tree and the more challenging the searching and filtering process. But also, the more interesting the challenge!

Thank you for reading my post!