Introduction
In this article, we walk through how to create a basic ETL pipeline that scrapes cryptocurrency data using Playwright and saves it to a PostgreSQL database using the Psycopg2 package. All of the code is written in Python, kept simple and easy for beginners to understand.
The tutorial is detailed, step by step, and split into sections.
A video version of this tutorial is available on my YouTube channel.
Getting started – Setting up the ETL Pipeline
Install Packages
To start, we install the required Python packages, which are Playwright and Psycopg2. Open the terminal or command prompt and type:
pip install playwright psycopg2
Then hit enter.
Note: Python must be installed first, and added to PATH.
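Note also that Playwright needs its browser binaries downloaded in a separate step after the package is installed. This is done with:
playwright install
This command downloads the Chromium, Firefox, and WebKit builds that Playwright drives.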
Website to scrape – Endpoint of our ETL Data Pipeline
The website that we will scrape cryptocurrency data from is: coinmarketcap.com

Note that this tutorial is meant for educational purposes only.
Text Editor Used
Our text editor of choice is VSCode

Playwright – Web Scraping Tool
For web scraping we use Playwright.
Why use Playwright? Playwright is a fast, lightweight framework used both for testing web apps and for web scraping. When it comes to web scraping, it lets us scrape data from dynamic websites, meaning websites that load content through AJAX or are built with a JavaScript framework like React.
Why not use Selenium instead of Playwright? Playwright is generally faster than Selenium and offers conveniences such as auto-waiting for elements and built-in network interception.

Sync vs Async APIs and Import
Playwright provides two APIs: the sync API and the async API. For this example we use the sync API. The two APIs expose the same functionality in synchronous and asynchronous styles, so we don't have to implement that plumbing ourselves.
When using the synchronous API, the code executes sequentially: each call blocks until it finishes, and execution continues as long as there's no error.
When using the asynchronous API, calls are awaited, so while one operation is waiting (for a page to load, for example) other tasks can make progress. Think of it as tasks running independently of each other rather than strictly one after another.
The async API is best suited for complex cases, such as scraping many pages concurrently, while the sync API is suited for simple scenarios.
Beginners are advised to start learning with the sync API and switch to the async API along the way.
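For contrast, here is a minimal sketch of what the same setup looks like with the async API. It is not used in the rest of this tutorial, and the function name run is just a placeholder:
import asyncio
from playwright.async_api import async_playwright

async def run():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://coinmarketcap.com/')
        print(await page.title())  # just to confirm the page loaded
        await browser.close()

asyncio.run(run())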
Here’s the code to import the sync API:
from playwright.sync_api import sync_playwright
sync_playwright
sync_playwright is a context manager, meaning we can use it with the with statement and let it handle the startup and cleanup for us. We use the sync_playwright object to instantiate a browser.
with sync_playwright() as p:
    ...
The Main function
For good practice, all the code will live inside the main function, and we will run it under the following condition:
if __name__ == '__main__'
Basically, it means that if we run the file directly, main will run, but if we import the file as a module, main won't run until we explicitly call it.
def main():
    ...

if __name__ == '__main__':
    main()
Browser in Playwright
Using the sync_playwright context manager, we can launch one of three browsers: Chromium (the open-source base of Chrome), Firefox, or WebKit (the engine behind Safari). In this example, Chromium is used.
Note that we must always close the browser at the end using the close method.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    # code will be here
    browser.close()
We passed headless=False to the launch method in order to see the browser pop up. If we set headless=True, which is the default anyway, we won't see the browser.
Usually, in the development phase, we want to see the browser and what it is doing in order to spot any problems that may occur. In the production phase, we usually don't need to see it.
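A common pattern, shown here as a small sketch (the HEADLESS environment variable is just a hypothetical name, not something Playwright defines), is to toggle headless mode from the environment so the same script works in both phases:
import os

# Hypothetical toggle: set HEADLESS=1 in production to hide the browser window
headless = os.getenv('HEADLESS', '0') == '1'
browser = p.chromium.launch(headless=headless)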
Page in Playwright
Now that we have our browser, we can get the page object. The page object is instantiated using the new_page method, and it does exactly what the name says: it creates a new page, similar to opening a new browser tab.
After getting the page, we invoke the goto method, which takes the website URL as a parameter and navigates the page to the URL we want to scrape.
page = browser.new_page()
page.goto('https://coinmarketcap.com/')
Scrolling down in Playwright
If you go back to the website, refresh, and start scrolling fast, you will see data being loaded as you go, since the website is dynamic; it's the same effect as Facebook or Instagram. That's why, when we load the website with Playwright, we need to automatically scroll to the bottom of the page before scraping the data; otherwise we won't get all of it, since not everything is loaded at first.
To scroll down in Playwright, we use the mouse.wheel method. This method takes two values: the first for horizontal scrolling and the second for vertical scrolling. We set the first value to 0 and the second to 2000.
Usually, the scroll is performed inside a loop. In our case we want to scroll 5 times, so we loop 5 times and call the wheel method each time.
After each scroll we pause the script for 1 second (1000 milliseconds) using the wait_for_timeout method, in case the internet connection or the website is slow.
# scrolling down
for i in range(5):
    page.mouse.wheel(0, 2000)
    page.wait_for_timeout(1000)
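A fixed count of 5 scrolls works for this page, but as a variation you could keep scrolling until the page height stops growing. The following is just a sketch of that idea, not part of the tutorial's code:
# Alternative sketch: scroll until the page height stops increasing
prev_height = 0
while True:
    curr_height = page.evaluate('document.body.scrollHeight')
    if curr_height == prev_height:
        break
    page.mouse.wheel(0, curr_height)
    page.wait_for_timeout(1000)
    prev_height = curr_height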
Scraping Data using query_selector and query_selector_all
Now we can move on to the scraping part. To scrape HTML elements in Playwright, we use either query_selector, which returns the first matching element (or None if nothing matches), or query_selector_all, which returns a list of all matching elements.
These methods take as an argument either the XPath or the CSS selector of the HTML element. In this example we use XPath.
XPath
Without going into too much detail: in web scraping, XPath is used to pinpoint HTML elements on a web page. That way we can get all the elements we need and then extract the text inside them using the inner_text method.
XPath follows the HTML tree. For example, if we want to get the p tag that's inside a div, the XPath will be
- '//div/p'
We start with '//' and then separate elements with '/'.
If we want to get a span inside a div that has a class named 'my-class', we use the following XPath:
- "//div[@class='my-class']/span"
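As a quick illustration (the class my-class is just the placeholder from above, not something on the real page), this is how such an XPath would be used with Playwright:
# Illustration only: grab the first element matching the XPath and read its text
el = page.query_selector("//div[@class='my-class']/span")
if el:
    print(el.inner_text())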
In our example we have a table with the class "h7vnx2-2 czTsgW cmc-table" that contains a tbody, which in turn contains all the tr elements. Each tr is a row in the table. (Auto-generated class names like this can change when the site is updated, so the XPath may need adjusting over time.)
The following screenshot illustrates the structure of the HTML:

So the code will be:
trs_xpath = "//table[@class='h7vnx2-2 czTsgW cmc-table ']/tbody/tr"
trs_list = page.query_selector_all(trs_xpath)
Now that we have all the tr elements, we can loop over them and use XPath again to extract the id (rank), the name of the coin, the symbol of the coin, the current price of the coin, the current market cap of the coin, and finally the 24h volume of the coin.
The following screenshots illustrate better where each element is on the HTML tree:


Here’s the code:
master_list = []
for tr in trs_list:
    coin_dict = {}
    tds = tr.query_selector_all('//td')
    coin_dict['id'] = tds[1].inner_text()
    coin_dict['Name'] = tds[2].query_selector("//p[@color='text']").inner_text()
    coin_dict['Symbol'] = tds[2].query_selector("//p[@color='text3']").inner_text()
    coin_dict['Price'] = float(tds[3].inner_text().replace('$','').replace(',',''))
    coin_dict['Market_cap_usd'] = int(tds[7].inner_text().replace('$','').replace(',',''))
    coin_dict['Volume_24h_usd'] = int(tds[8].query_selector('//p[@color="text"]').inner_text().replace('$','').replace(',',''))
    master_list.append(coin_dict)

# tuples (id, name, symbol, ...)
list_of_tuples = [tuple(dic.values()) for dic in master_list]
Each coin's data is saved in a dictionary named coin_dict. All the dictionaries are collected in a list named master_list. At the end, master_list will contain 100 dictionaries (there are 100 coins listed on that page), and each dictionary holds the data of a specific coin.
Note that we use replace together with the int and float functions to both clean and transform our data.
Note also that at the end we transformed the list of dictionaries into a list of tuples:
- master_list > list_of_tuples
The reason for that has to do with the database: it will be easier to insert the data this way.
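For illustration, here is what that conversion looks like for a single made-up coin (the values below are placeholders, not scraped data). It works because Python dictionaries preserve insertion order, so the tuple's columns stay in the order we filled them in:
# Hypothetical sample row to illustrate the dict -> tuple conversion
sample = {'id': '1', 'Name': 'Bitcoin', 'Symbol': 'BTC',
          'Price': 27000.0, 'Market_cap_usd': 520000000000,
          'Volume_24h_usd': 12000000000}
print(tuple(sample.values()))
# ('1', 'Bitcoin', 'BTC', 27000.0, 520000000000, 12000000000)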
PostgreSQL – Storage of our Data Pipeline
The scraped data will be saved in a PostgreSQL database. We created a database named test, and then created a table in the database called crypto.
Here’s the code for the crypto table.
CREATE TABLE IF NOT EXISTS CRYPTO
(
ID VARCHAR(20) PRIMARY KEY
, NAME VARCHAR(100) NOT NULL
, SYMBOL VARCHAR(20) NOT NULL
, PRICE_USD FLOAT
, MARKET_CAP_USD BIGINT
, VOLUME_24H_USD BIGINT
);
Of course, you can name them however you want. You just need to provide the exact names when connecting and inserting data using psycopg2.
psycopg2 – Driver
Psycopg2 is a database driver. Drivers are used in programming languages to connect to a database and run all sorts of SQL commands. There are many drivers for each language, but when it comes to Python and PostgreSQL, psycopg2 is the most widely used and best known. There's also a famous ORM called SQLAlchemy (ORMs are more complex than drivers and are mostly used in web/mobile applications).
Here’s what we need to import:
import psycopg2
from psycopg2.extras import execute_values
Create a Connection in psycopg2
After importing the package, the first thing to do is create a connection using the connect method. This method takes the following as arguments:
- the host
- the database name
- the username
- the password
You need to provide all of the above values in order to establish a successful connection with the database.
Once the connection is established, we have control over the database (depending on the permissions of the user we logged in with).
In this example:
- The host is localhost, since we are running PostgreSQL locally.
- The database name is test.
- The username is mac.
- There is no password, so we pass an empty string.
# connect to db
pgconn = psycopg2.connect(
    host = 'localhost',
    database = 'test',
    user = 'mac',
    password = ''
)
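In a real pipeline you would usually avoid hardcoding credentials in the script. One option, sketched below (the PG* variable names are used purely as examples), is to read them from environment variables:
import os

# Sketch: read connection settings from environment variables instead of hardcoding them
pgconn = psycopg2.connect(
    host = os.getenv('PGHOST', 'localhost'),
    database = os.getenv('PGDATABASE', 'test'),
    user = os.getenv('PGUSER', 'mac'),
    password = os.getenv('PGPASSWORD', '')
)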
Create the Cursor in psycopg2
Once the connection is established successfully, we can use it to create a cursor. The cursor plays a role similar to an interactive SQL shell: in simple terms, we use it to send all kinds of SQL commands to the database. In our example we use it to insert the scraped data into the database.
# create cursor
pgcursor = pgconn.cursor()
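As a side note, if you prefer not to create the table in psql beforehand, you could also run the CREATE TABLE statement from Python through the same cursor. This is a small sketch reusing the SQL shown earlier:
# Optional: create the crypto table from Python instead of psql
pgcursor.execute("""
    CREATE TABLE IF NOT EXISTS crypto
    (
        id VARCHAR(20) PRIMARY KEY
        , name VARCHAR(100) NOT NULL
        , symbol VARCHAR(20) NOT NULL
        , price_usd FLOAT
        , market_cap_usd BIGINT
        , volume_24h_usd BIGINT
    );
""")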
Insert the data in psycopg2
In order to insert the data into our database using psycopg2, we have several options, each suited to a specific case. For our case (inserting multiple rows at once) we use the execute_values function that we imported previously.
execute_values takes as arguments the cursor, the SQL statement, and the data.
execute_values(pgcursor,
    "INSERT INTO crypto (id, name, symbol, price_usd, market_cap_usd, volume_24h_usd) VALUES %s",
    list_of_tuples)
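One thing to keep in mind: since id is the primary key, running the pipeline a second time would fail with a duplicate-key error. A possible variation, not used in this tutorial, is to turn the insert into an upsert with ON CONFLICT:
# Sketch: upsert so that re-running the pipeline updates existing rows instead of failing
execute_values(pgcursor,
    """INSERT INTO crypto (id, name, symbol, price_usd, market_cap_usd, volume_24h_usd)
       VALUES %s
       ON CONFLICT (id) DO UPDATE SET
           price_usd = EXCLUDED.price_usd,
           market_cap_usd = EXCLUDED.market_cap_usd,
           volume_24h_usd = EXCLUDED.volume_24h_usd""",
    list_of_tuples)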
Commit and Close in psycopg2
After inserting data with a database driver like psycopg2, it's always required to commit the changes. For that we use the commit method; committing is what actually saves the changes in the database.
Once the database work is done, we always close the connection so it isn't left hanging open. In psycopg2 we close the connection using the close method.
# commit
pgconn.commit()
# close the connection
pgconn.close()
Entire Code and results
Entire Code
Here’s the entire code:
from playwright.sync_api import sync_playwright
import psycopg2
from psycopg2.extras import execute_values
def main():
    with sync_playwright() as p:
        ###################
        ### scrape data ###
        ###################
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto('https://coinmarketcap.com/')

        # scrolling down
        for i in range(5):
            page.mouse.wheel(0, 2000)
            page.wait_for_timeout(1000)

        trs_xpath = "//table[@class='h7vnx2-2 czTsgW cmc-table ']/tbody/tr"
        trs_list = page.query_selector_all(trs_xpath)

        master_list = []
        for tr in trs_list:
            coin_dict = {}
            tds = tr.query_selector_all('//td')
            coin_dict['id'] = tds[1].inner_text()
            coin_dict['Name'] = tds[2].query_selector("//p[@color='text']").inner_text()
            coin_dict['Symbol'] = tds[2].query_selector("//p[@color='text3']").inner_text()
            coin_dict['Price'] = float(tds[3].inner_text().replace('$','').replace(',',''))
            coin_dict['Market_cap_usd'] = int(tds[7].inner_text().replace('$','').replace(',',''))
            coin_dict['Volume_24h_usd'] = int(tds[8].query_selector('//p[@color="text"]').inner_text().replace('$','').replace(',',''))
            master_list.append(coin_dict)

        # tuples (id, name, symbol, ...)
        list_of_tuples = [tuple(dic.values()) for dic in master_list]

        #################
        ### save data ###
        #################
        # connect to db
        pgconn = psycopg2.connect(
            host = 'localhost',
            database = 'test',
            user = 'mac',
            password = ''
        )
        # create cursor
        pgcursor = pgconn.cursor()
        execute_values(pgcursor,
            "INSERT INTO crypto (id, name, symbol, price_usd, market_cap_usd, volume_24h_usd) VALUES %s",
            list_of_tuples)
        # commit
        pgconn.commit()
        # close the connection
        pgconn.close()
        # close the browser
        browser.close()

if __name__ == '__main__':
    main()
Results
Let’s check the final results by getting all the data in our table:
SELECT * FROM crypto;
Here are the results:

As we can see, all the data has been successfully saved in our test PostgreSQL database, specifically in the crypto table.
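You can also run the same check from Python with psycopg2 (a small sketch reusing the connection settings from above):
# Sketch: read back the inserted rows with psycopg2
pgconn = psycopg2.connect(host='localhost', database='test', user='mac', password='')
pgcursor = pgconn.cursor()
pgcursor.execute('SELECT * FROM crypto;')
for row in pgcursor.fetchall():
    print(row)
pgconn.close()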
Conclusion
This tutorial was a simple introduction to creating an ETL pipeline using Playwright, Psycopg2, and PostgreSQL. In production environments, the code would look more complex than this, since more functionality and more concepts would be added.
Remember that we ran the browser with headless=False because this is the development phase; in the production phase you would typically run it headless.