How to Scrape Cryptocurrency Data Using Playwright and Save It to PostgreSQL Using Python


Introduction

In this article we walk through how to create a basic ETL pipeline that scrapes cryptocurrency data using Playwright and saves it to a PostgreSQL database using the Psycopg2 package. Of course, the entire code will be written in Python, in a simple way that's easy for beginners to understand.

The tutorial is detailed, step by step, and divided into sections.


Getting started – Setting up the ETL Pipeline

Install Packages

To start, we install the required Python packages, which are Playwright and Psycopg2. Open the terminal or command prompt and type:

pip install playwright psycopg2

Then hit enter.
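
Playwright also needs browser binaries (Chromium, Firefox, and WebKit) in addition to the Python package. After the pip install finishes, download them with:

playwright install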

Note: Python must be installed first, and added to PATH.

Website to scrape – Endpoint of our ETL Data Pipeline

The website that we will scrape cryptocurrency data from is https://coinmarketcap.com/.

Note that this tutorial is meant for educational purposes only.

Text Editor Used

Our text editor of choice is VSCode, available at https://code.visualstudio.com/.

Playwright – Web Scraping Tool

For web scraping we use Playwright.

Why use Playwright? Playwright is a fast, lightweight framework used both for testing web apps and for web scraping. When it comes to web scraping, it allows us to scrape data from dynamic websites, meaning websites that load content via AJAX or are built with a JavaScript framework like React.

Why not use Selenium instead of Playwright? Playwright is generally faster than Selenium and has built-in capabilities, such as auto-waiting for elements and network interception, that Selenium lacks out of the box.

Playwright's documentation is available at https://playwright.dev/.

Sync vs Async APIs and Import

Playwright provides two APIs: the sync API and the async API. For this example we use the sync API. These two APIs are simply abstractions over synchronous and asynchronous execution, meaning we don't need to implement these concepts ourselves.

When using the Synchronous API, the code executes sequentially: each line runs after the previous one finishes, as long as there's no error.

When using the Asynchronous API, operations can run concurrently: while one task is waiting (for a page to load, for example), other tasks can make progress instead of blocking.

Basically, the async API is best suited for complex cases, such as scraping many pages concurrently, while the sync API is suited for simple scenarios.

Beginners are advised to start learning with the sync API and switch to the async API along the way.

Here’s the code to import the sync API:

from playwright.sync_api import sync_playwright
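
For comparison, here is what the same skeleton looks like with the async API (a minimal sketch; we stick with the sync API for the rest of this tutorial):

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://coinmarketcap.com/')
        # scraping would happen here
        await browser.close()

asyncio.run(main())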

sync_playwright

sync_playwright is a context manager, meaning we can use the with keyword and let it handle startup and cleanup for us. We use the sync_playwright object to instantiate a browser.

with sync_playwright() as p:
   ...

The Main function

For good practice, all the code will be inside the main function, and we will run it inside the condition:

if __name__ == '__main__':

Basically, it just means that if we run the file directly, main will run; but if we import the file as a module, main won't run until we explicitly call it.

def main():
    ...

if __name__ == '__main__':
    main()

Browser in Playwright

Using the sync_playwright context manager, we can launch one of three browsers: Chromium (the open-source base of Chrome), Firefox, or WebKit (the engine behind Safari). In this example, Chromium is used.

Note that we must always close the browser at the end using the close method.

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)

    # code will be here

    browser.close()

We passed headless=False to the launch method in order to see the browser pop up. If we set headless=True, which is the default anyway, we won't see the browser.

Usually, in the development phase, we want to see the browser and what it is doing in order to spot any problems that may occur. In the production phase, we usually don't need to see it.

Page in Playwright

Now that we have our browser, we can get the page object. The page object is instantiated using the new_page method, which does exactly what its name says: it creates a new page (a new browser tab).

After getting the page, we invoke the goto method, which takes the website URL as a parameter and navigates to it. That URL is the endpoint of the web page we want to scrape.

page = browser.new_page()
page.goto('https://coinmarketcap.com/')

Scrolling down in Playwright

If you go back to the website, refresh, and start scrolling fast, you will see data being loaded as you scroll, since the website is dynamic (the same effect as on Facebook or Instagram). That's why, when we load the website with Playwright, we need to automatically scroll to the bottom of the page before scraping the data; otherwise we won't get all of it, since not everything is loaded at first.

To scroll down in Playwright, we use the mouse.wheel method. This method takes two values: the first for horizontal scrolling and the second for vertical scrolling. We set the first value to 0 and the second to 2000.

Usually, the scrolling is performed inside a loop. In our case we want to scroll five times, so we loop five times and call the wheel method on each iteration.

After each scroll we pause the script for 1 second (1000 milliseconds) using the wait_for_timeout method. The pause gives a slow connection or a slow website time to load the new content.

# scrolling down
for i in range(5):
    page.mouse.wheel(0, 2000)
    page.wait_for_timeout(1000)
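
If you don't know in advance how many scrolls are needed, a common alternative (not part of the original tutorial) is to keep scrolling until the page height stops growing:

# keep scrolling until the page stops getting taller
prev_height = 0
while True:
    height = page.evaluate('document.body.scrollHeight')
    if height == prev_height:
        break  # no new content was loaded by the last scroll
    prev_height = height
    page.mouse.wheel(0, 2000)
    page.wait_for_timeout(1000)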

Scraping Data using query_selector and query_selector_all

Now we can move on to the scraping part. To scrape HTML elements in Playwright, we use either query_selector, which returns the first matching element (or None if nothing matches), or query_selector_all, which returns a list of all matching elements.

These methods take as an argument either the XPath or the CSS selector of the HTML element. In this example we use XPath.

XPath

Without going into much detail: in web scraping, XPath is used to pinpoint HTML elements on a web page. That way we can get all the elements we need and then extract the text inside them using the inner_text method.

XPath follows the HTML tree. For example, if we want to get the p tag that's inside a div, the XPath will be:

  • '//div/p'

We start with '//' and then separate elements with '/'.

If we want to get a span inside a div that has a class named 'my-class', we use the following XPath (a usage sketch follows):

  • "//div[@class='my-class']/span"
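
As a quick illustration of how such an XPath is used with Playwright (the class name my-class is just an example, not taken from the target site):

span = page.query_selector("//div[@class='my-class']/span")
if span:  # query_selector returns None when nothing matches
    print(span.inner_text())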

In our example we have a table with the class "h7vnx2-2 czTsgW cmc-table" that contains a tbody, which in turn contains all the tr elements. Each tr is a row in the table.

The following screenshot illustrates the structure of the HTML:

[screenshot: HTML tree showing the table, tbody, and tr elements]

So the code will be:

trs_xpath = "//table[@class='h7vnx2-2 czTsgW cmc-table  ']/tbody/tr"
trs_list = page.query_selector_all(trs_xpath)

Now that we have all the tr elements, we can loop over them and use xPath again to extract the id (rank), the name of the coin, the symbol of the coin, the current price of the coin, the current market cap of the coin, and finally the 24h volume of the coin.

The following screenshots better illustrate where each element sits in the HTML tree:

[screenshots: HTML tree showing the td elements and the p tags inside them]

Here’s the code:

master_list = []

for tr in trs_list:
    coin_dict = {}

    tds = tr.query_selector_all('//td')

    coin_dict['id'] = tds[1].inner_text()
    coin_dict['Name'] = tds[2].query_selector("//p[@color='text']").inner_text()
    coin_dict['Symbol'] = tds[2].query_selector("//p[@color='text3']").inner_text()
    coin_dict['Price'] = float(tds[3].inner_text().replace('$','').replace(',',''))
    coin_dict['Market_cap_usd'] = int(tds[7].inner_text().replace('$','').replace(',',''))
    coin_dict['Volume_24h_usd'] = int(tds[8].query_selector('//p[@color="text"]').inner_text().replace('$','').replace(',',''))

    master_list.append(coin_dict)

# tuples (id, name, symbol, ...)
list_of_tuples = [tuple(dic.values()) for dic in master_list]

Each coin's data is saved in a dictionary named coin_dict. All dictionaries are saved in a list named master_list. At the end, master_list will contain 100 dictionaries (there are 100 coins listed on that page), and each dictionary contains the data for a specific coin.

Note that we use the replace method together with the int and float constructors to both clean and transform our data.

Note also that at the end we transformed the list of dictionaries into a list of tuples:

  • master_list > list_of_tuples

The reason for that has to do with the database: psycopg2's execute_values expects a sequence of tuples, one per row, so it will be easier to insert the data this way.
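
For instance, a single entry of list_of_tuples would look roughly like this (the values are illustrative, not real market data):

('1', 'Bitcoin', 'BTC', 43250.12, 845123456789, 28123456789)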

PostgreSQL – Storage of our Data Pipeline

The scraped data will be saved in a PostgreSQL database. We created a database named test, and then created a table in it called crypto.
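
If the test database doesn't exist yet, you can create it first, for example from psql or pgAdmin:

CREATE DATABASE test;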

Here’s the code for the crypto table.

CREATE TABLE IF NOT EXISTS CRYPTO
(
    ID               VARCHAR(20) PRIMARY KEY
  , NAME             VARCHAR(100) NOT NULL
  , SYMBOL           VARCHAR(20) NOT NULL
  , PRICE_USD        FLOAT
  , MARKET_CAP_USD   BIGINT
  , VOLUME_24H_USD   BIGINT
);

Of course, you can name them however you want. You just need to provide the exact names when connecting and inserting data using psycopg2.

psycopg2 – Driver

Psycopg2 is a database driver. Drivers are used in programming languages to connect to a database and execute all sorts of SQL commands. There are many drivers for each language, but when it comes to Python and PostgreSQL, psycopg2 is the most widely used. There's also a well-known ORM called SQLAlchemy (ORMs are more complex than drivers and are mostly used in web/mobile applications).

Here’s what we need to import:

import psycopg2
from psycopg2.extras import execute_values

Create a Connection in psycopg2

After importing the package, the first step is to create a connection using the connect method. This method takes the following as arguments:

  • the host
  • database name
  • the username
  • the password

You need to provide all of the above values in order to establish a successful connection with the database.

When the connection is established, we have control over the database (depending on what permissions the user we logged in with has).

In this example:

  • The host is localhost since we are running PostgreSQL locally.
  • The database name is test.
  • The username is mac.
  • There is no password, meaning we pass an empty string.

# connect to db
pgconn = psycopg2.connect(
            host = 'localhost',
            database = 'test',
            user = 'mac',
            password = ''
        )

Create the Cursor in psycopg2

Once the connection is established successfully, we can use it to create a cursor. The cursor is similar to an interactive SQL shell: in simple terms, we use it to send all kinds of SQL commands to the database. In our example, we use it to insert the scraped data into the database.

# create cursor 
pgcursor = pgconn.cursor()

Insert the data in psycopg2

In order to insert the data into our database using psycopg2, we have several options, each suited to a specific case. For our case, inserting multiple rows at once, we use the execute_values function that we imported previously.

The execute_values function takes as arguments the cursor, the SQL code, and the data.

execute_values(pgcursor,
    "INSERT INTO crypto (id, name, symbol, price_usd, market_cap_usd, volume_24h_usd) VALUES %s",
    list_of_tuples)
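
For comparison, a single row could be inserted with the cursor's execute method and a parameterized query (a sketch with made-up values):

pgcursor.execute(
    "INSERT INTO crypto (id, name, symbol, price_usd, market_cap_usd, volume_24h_usd) "
    "VALUES (%s, %s, %s, %s, %s, %s)",
    ('1', 'Bitcoin', 'BTC', 43250.12, 845123456789, 28123456789))  # made-up values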

Commit and Close in psycopg2

After inserting data using a database driver like psycopg2, we always have to commit the changes; they are not actually saved in the database until we do. For that we use the commit method.

After the database work is over, we always close the connection to release its resources. In psycopg2 we close the connection using the close method.

# commit 
pgconn.commit()
# close the connection
pgconn.close()
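
Alternatively, a psycopg2 connection can be used as a context manager: the with block commits the transaction on success and rolls it back on an exception. Note that it does not close the connection itself, so close is still needed (a sketch):

with pgconn:
    with pgconn.cursor() as pgcursor:
        execute_values(pgcursor,
            "INSERT INTO crypto (id, name, symbol, price_usd, market_cap_usd, volume_24h_usd) VALUES %s",
            list_of_tuples)
pgconn.close()  # the with block does not close the connection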

Entire Code and results

Entire Code

Here’s the entire code:

from playwright.sync_api import sync_playwright
import psycopg2
from psycopg2.extras import execute_values

def main():
    with sync_playwright() as p:
        
        ###################
        ### scrape data ###
        ###################

        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto('https://coinmarketcap.com/')

        # scrolling down
        for i in range(5):
            page.mouse.wheel(0, 2000)
            page.wait_for_timeout(1000)

        trs_xpath = "//table[@class='h7vnx2-2 czTsgW cmc-table  ']/tbody/tr"
        trs_list = page.query_selector_all(trs_xpath)

        master_list = []

        for tr in trs_list:
            coin_dict = {}

            tds = tr.query_selector_all('//td')

            coin_dict['id'] = tds[1].inner_text()
            coin_dict['Name'] = tds[2].query_selector("//p[@color='text']").inner_text()
            coin_dict['Symbol'] = tds[2].query_selector("//p[@color='text3']").inner_text()
            coin_dict['Price'] = float(tds[3].inner_text().replace('$','').replace(',',''))
            coin_dict['Market_cap_usd'] = int(tds[7].inner_text().replace('$','').replace(',',''))
            coin_dict['Volume_24h_usd'] = int(tds[8].query_selector('//p[@color="text"]').inner_text().replace('$','').replace(',',''))

            master_list.append(coin_dict)


        # tuples (id, name, symbol, ...)
        list_of_tuples = [tuple(dic.values()) for dic in master_list] 

        #################
        ### save data ###
        #################

        # connect to db 
        pgconn = psycopg2.connect(
            host = 'localhost',
            database = 'test',
            user = 'mac',
            password = ''
        )

        # create cursor 
        pgcursor = pgconn.cursor()

        execute_values(pgcursor, 
            "INSERT INTO crypto (id, name, symbol, price_usd, market_cap_usd, volume_24h_usd) VALUES %s", 
            list_of_tuples)

        # commit 
        pgconn.commit()
        # close the connection
        pgconn.close()
        # close the browser
        browser.close()

if __name__ == '__main__':
    main()

Results

Let’s check the final results by getting all the data in our table:

SELECT * FROM crypto;

Here are the results:

[screenshot: the contents of the crypto table shown in pgAdmin]

As we can see, all the data was successfully saved in our test PostgreSQL database, specifically in our crypto table.

Conclusion

This tutorial was a simple introduction to creating an ETL pipeline using Playwright, Psycopg2, and PostgreSQL. In production environments, the code would be more complex than this, since more functionality and concepts would be added.

Remember that we didn't run the browser headless since this is the development phase; in the production phase, it's usually the opposite.