Introduction
One powerful skill for any programmer is knowing how to download files from the internet by URL, using an easy-to-use language like Python. Whether you want to scrape product images, extract PDF reports, or download files in any other format, Python can do it with the code in this article.
Downloading files from a website with any programming language can be intimidating, so in this article we will simplify the entire process.
After finishing the article, you will have a good understanding of how downloading files online works, as well as how to implement the code.
Overview
In this article we will explore how to download images (JPGs or PNGs), PDFs, CADs, ZIPs, STPs, or files in any other format using Python, requests, and beautifulsoup4.
We will proceed as follows:
- Setting up our environment
- Getting the URL of the specific file using requests and beautifulsoup4
- Exploring how to download the file from the URL using Python
You can check out the basics of beautifulsoup4 here.
At the end of the article you will find a cheat sheet of the code that you can download and save as a reference for future projects.
Setting up the environment
The first step is always to set up the environment. You should have Python, requests, and beautifulsoup4 installed.
To install Python, download it from the official Python website. Choose the latest version available for your operating system, download it, and install it. Note that you need to check the “Add Python to PATH” option during installation; otherwise your computer won’t find Python.
After that you can download and install any package. To download files, we just need requests and beautifulsoup4. We install them using pip, which comes with Python. Open the Terminal/Command Prompt and type:
pip install requests beautifulsoup4
And hit Enter. The packages should be installed.
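To confirm the installation worked, you can ask Python to import both packages. This is just a quick sanity check; on some systems the commands are named python3 and pip3 instead of python and pip:

```shell
# Verify that both packages import without errors
python3 -c "import requests, bs4; print('packages installed')"
```

If this prints "packages installed", you are ready to go.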
If you get an error, you most likely didn’t check the PATH option, or something went wrong during the installation. Google the error message for a solution.
Why use Python?
Python is the most commonly used language for web scraping and data extraction, meaning you will find a lot of support and documentation online that would help you out.
Another reason is that Python is an easy-to-learn language, which means that even if you don’t know it yet, you will pick it up quickly compared with harder-to-learn languages like Java and C#.
Finally, we won’t use any advanced Python concepts in this article, since we don’t need to. Everything is beginner-friendly.
What to keep in mind – The limitations
In order to scrape any type of file using Python, we first need to understand how we are going to scrape it.
1 - We first need to get the URL of the file. Files are usually stored statically on a server somewhere, meaning they don’t change location or name. The URL of the file will contain the file’s extension. For example, a PDF URL would look like “https://website.com/file.pdf”: there’s a .pdf in the link.
2 - We will use requests to send an HTTP request to the server using the URL. The server will send back an HTTP response that contains the file.
3 - When requesting the file, we keep the stream open by setting stream=True. This tells requests to download the response body in chunks instead of loading the whole file into memory at once, which matters for large binary files.
4 - We write (save) the file in binary mode (wb) in chunks using the iter_content() function.
5 - The only difference between saving one file type and another in our code is the extension we save it with, since all files are written in binary. Use pdf for PDFs, mp4 for MP4 videos, jpg for JPG images, and so on. Before assigning an extension, we need to know what extension the file has in the first place, because if we download and save it with the wrong extension, it won’t open later.
With all that in mind, note that:
- The desired file should not be streamed or opened in the browser through a special web application that hides its URL.
- The file’s URL must contain the extension of the file. If not, the file might be streamed, and scraping it could be hard or even impossible.
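Since the extension drives how we save the file, a small helper can pull it straight out of the URL using only the standard library. This is a minimal sketch; get_extension is a hypothetical helper name, not part of requests or beautifulsoup4:

```python
from urllib.parse import urlparse
import os

def get_extension(url):
    # Take only the path part of the URL (ignoring any query string),
    # then split off the extension, e.g. '.pdf' or '.zip'
    path = urlparse(url).path
    return os.path.splitext(path)[1].lower()

print(get_extension('https://website.com/file.pdf'))   # → .pdf
print(get_extension('https://website.com/image.JPG'))  # → .jpg
```

If the result is an empty string, the URL has no visible extension, which is a hint that the file may be streamed rather than statically hosted.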
The approach to download files
The approach for scraping files in Python is to first get the file’s URL from the web page using a scraping package like beautifulsoup4, Selenium, or Scrapy, then send another request to get the file itself and save it locally or remotely.
We chose Python for this tutorial since Python is commonly used for automation in general, from scraping web pages and files to automating operating-system workflows. And since it is so widely used, you will find a lot of support online when you get stuck.
HTTP protocol
We first need to talk about the HTTP protocol, which is the protocol that allows computers to communicate over the internet.
Usually, when using a browser like Google Chrome or Safari to access a website, the browser sends an HTTP request to the server that’s hosting that website. The server then sends back a response containing the HTML, CSS, JavaScript, images, and other static files the website needs to work. Finally, the browser displays everything as intended and we see the website.

HTTP Protocol Simplified
In our case, we use Python to send the request to the server instead of a browser, since we want to handle the response (save the files) that the server sends back.
For the server to know which web page or which file we want, we need to specify its URL. For example, if we set the URL to “https://aminboutarfi.com”, the server will respond with the landing page of my website, but if we set the URL to “https://aminboutarfi.com/how-does-web-scraping-work-explanation-for-programmers/”, the server responds with the article titled “how does web scraping work explanation for programmers”.
In our case, we don’t want the HTML or any other code, just the file itself. That’s why we specify the URL of the file directly, and for that, we first need to get that URL.

HTTP Protocol using python
When the server sends the file, Python will catch that response and save it to our local machine.

Processing the response and saving file locally using python
Python handles everything; we just need to write the correct code and execute it.
The First Step – Always check for the file’s URL on the website
The first step in downloading a file is to get the URL of that file on the web page. The way we get the URL is as follows:
- Inspect the web page, pointing at the file
- Copy the URL
- Paste it into our code
The copy-and-paste process should be done automatically in large projects, but if you just want to download one file, you can do it manually.
Here’s an example:

Go to any Amazon product page and right-click on the product image (picture), then click Inspect.
The source code of the web page will appear in the Elements tab of the Inspector, and the image’s HTML tag “img” will be highlighted (check the screenshot below).
The “img” tag contains the “src” attribute that holds the URL of the image. We use that URL to download the image. Each file on a web page has a URL, and we can get it manually like we just saw or automatically using code.

Note that the file might be streamed through a software or protected somehow. In that case you may not be able to get the URL easily, and you need another method to get it in order to scrape the file. But keep in mind that in most cases, the URL is easy to get.
We usually want to scrape many images or other files at once. That’s why we need to automate the process of getting the URLs, and we do that using Python, beautifulsoup4, and requests.
Here’s an example of getting the URL of the product picture above using beautifulsoup4. Note that the image URL lives in the “src” attribute, not “href”:
image_url = soup.find('img', {'id':'landingImage'})['src']
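To see how beautifulsoup4 pulls that “src” attribute out, here is a self-contained sketch that parses a miniature stand-in for a product page. The id landingImage is what Amazon currently uses for its main product image, but that can change at any time:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a real product page fetched with requests
html = '<html><body><img id="landingImage" src="https://example.com/product.jpg"></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# The image URL lives in the "src" attribute of the img tag
image_url = soup.find('img', {'id': 'landingImage'})['src']
print(image_url)  # → https://example.com/product.jpg
```

In a real project, the html string would come from response.text after requesting the product page.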
The Python Code to download the files
After getting the URL of the specified file, we can download it using requests as follows:
Download images using Python and Requests
When we have the URL, we can scrape the image with Python by implementing the following code:
### Download JPG/PNG
import requests

url = '<image url>'
image_name = 'file.png'  # or 'image.jpg' depending on the original format
response = requests.get(url, stream=True)
try:
    with open(image_name, 'wb') as f:
        for chunk in response.iter_content(chunk_size=128):
            f.write(chunk)
except:
    pass
When sending the request, we keep stream=True. As mentioned before, this tells requests to download the body in chunks instead of loading the whole file into memory at once. We don’t need it when scraping HTML.
After sending the request, we get a response object that holds the file in binary mode.
The next step is to save the binary response into a new file that we create using open(), and that file must have the same extension as the original (JPG or PNG in the case of an image). For example, the new file in the code above is named file.png. We gave it a PNG extension, therefore the image we want to scrape must have that same extension; otherwise we need to change it.
Since we are writing the file in binary format, we use “wb” (write binary) in the code. In Python, we read or write in binary any file whose format isn’t made up of readable characters.
We save the file in chunks using iter_content(), which takes a chunk_size parameter: the amount of data to write with each chunk.
I used try/except in the code to catch errors that might crash the script, like your internet connection dropping or the server blocking you. If the code doesn’t work, remove the try/except and run it again to read the error in the Terminal/Command Prompt.
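If you would rather see errors than silence them, one variation is to replace the bare try/except with raise_for_status(), which raises an exception on 4xx/5xx responses. This is a sketch, not the article’s canonical code; download_file is a hypothetical helper name and the timeout value is just an illustration:

```python
import requests

def download_file(url, file_name, chunk_size=128):
    # Stream the response and raise an exception on HTTP errors
    response = requests.get(url, stream=True, timeout=30)
    response.raise_for_status()  # surfaces 4xx/5xx instead of failing silently
    with open(file_name, 'wb') as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            f.write(chunk)
    return file_name
```

Wrapping the download in a function also makes it easy to reuse for every file format covered below.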
Download PDFs Using Python and Requests
To scrape PDFs with Python we use the same method as for images. The only changes are the PDF URL and the .pdf extension when naming the file.
Everything else is the same, from sending a request with stream=True, to writing the response in binary format by iterating the content by a chunk size.
Here’s the example code for scraping a PDF using Python:
### Download PDF file
import requests

url = '<PDF file url>'
file_name = 'document.pdf'
response = requests.get(url, stream=True)
try:
    with open(file_name, 'wb') as fd:
        for chunk in response.iter_content(chunk_size=128):
            fd.write(chunk)
except:
    pass
Download Videos Using Python and Requests
Here’s the example code for scraping an MP4 video using Python:
### Download MP4 file
import requests

url = '<video file url>'
file_name = 'video.mp4'
r = requests.get(url, stream=True)
try:
    with open(file_name, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=128):
            fd.write(chunk)
except:
    pass
Note that you can’t download YouTube videos using this method since they are streamed. You can use it to download videos from other websites, like TED.
Download 2D DWG file (AutoCAD files)
And here’s the example code for scraping a 2D DWG file using Python:
### Download DWG file
import requests

url = '<DWG file url>'
file_name = '2d_file.dwg'
r = requests.get(url, stream=True)
try:
    with open(file_name, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=128):
            fd.write(chunk)
except:
    pass
Download 3D STP files using Python and Requests
The following code example is for scraping a 3D STP file using Python:
### Download STP file
import requests

url = '<STP file url>'
file_name = '3d_file.stp'
r = requests.get(url, stream=True)
try:
    with open(file_name, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=128):
            fd.write(chunk)
except:
    pass
Download ZIP files using python
This example code is for scraping a ZIP file using Python:
### Download ZIP file
import requests

url = '<ZIP file url>'
file_name = 'file.zip'
r = requests.get(url, stream=True)
try:
    with open(file_name, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=128):
            fd.write(chunk)
except:
    pass
Download Any file format using python
From what we learned, to download a file of any format we just need to write the correct extension at the end of the file name. Try to find a way to add the extension dynamically in your code, depending on each file’s format.
For example, set up an if statement to check what format the file has, then add the correct extension when downloading it.
An approach that usually works is to test whether the extension is present in the URL.
Here’s an example that tests for the PDF and ZIP formats:
urls = ['https://website.com/file1.pdf', 'https://website.com/file.zip', 'https://website.com/file2.pdf']
# test the file format
for url in urls:
    # test if the format is pdf
    if '.pdf' in url:
        file_name = 'file.pdf'
    # test if the format is zip
    elif '.zip' in url:
        file_name = 'file.zip'
    # Then we can scrape them
# Then we can scrape them
Now that we have the correct format extension associated with the file name, we can download it.
Here’s a final script that generalizes scraping all sorts of file formats:
### Download any file format
import requests

url = '<file url>'
file_name = 'file.' + '<format extension>'  # example: pdf, dwg, jpg, etc.
r = requests.get(url, stream=True)
try:
    with open(file_name, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=128):
            fd.write(chunk)
except:
    pass
Using this structure, you can download any file just by replacing the URL with that of the file you want, using the correct extension, and giving it a unique name.
One tip to always keep in mind: you usually need to download many files at once, for example the images of multiple products on an ecommerce website. For that, give each file a unique name so that no file overwrites the previous one. You will usually do this dynamically; for instance, each image’s name = the product’s title + extension.
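The unique-naming idea can be sketched like this, reusing each URL’s own basename so names never collide. The URLs here are placeholders, and the numbered fallback name is just one possible convention:

```python
from urllib.parse import urlparse
import os

urls = [
    'https://website.com/products/red-shirt.jpg',
    'https://website.com/products/blue-shirt.jpg',
]
for i, url in enumerate(urls):
    # Each file keeps the unique name it already has in its URL;
    # fall back to a numbered name if the URL path has no basename
    file_name = os.path.basename(urlparse(url).path) or f'file_{i}.jpg'
    print(file_name)
```

Combined with the download loop above, this saves every image under a distinct name in one pass.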
CHEAT SHEET – Download files of any format using Python
Here’s the cheat sheet for scraping all file formats, like images and PDFs, using Python.

Save this cheat sheet on your local computer and use it as a reference when you need to check the code again.
Conclusion
Using the knowledge in this article and the shared Python scripts, you are able to download files of all types. Remember that some websites are trickier than others to get data from, including files. You may need more advanced techniques to extract files from some specific websites, but for the large majority, this technique will work fine.
One thing to keep in mind is that you shouldn’t copy and paste the URL of each file you want to download by hand; instead, automate getting the URLs of all needed files on one or many web pages. After that, you can loop through the URLs and download each file, giving each one a unique name so that it doesn’t overwrite a previous file with the same name.