You may have heard that web scraping is the process of extracting data from websites. But what does that really mean? And what technologies do web scrapers use to pull that data onto their local or hosted machines?
How does the internet work – simplified
In order to understand how the process of web scraping works, we first need to understand how the internet works.
The internet is a very large network of computers connected by wires or Wi-Fi. Some are personal computers and others are servers. Servers are usually high-end computers located somewhere in the world that serve many purposes, one of which is hosting websites. We access those websites in a browser (e.g. Google Chrome) using their domain names (e.g. google.com, facebook.com, and aminboutarfi.com).
The connection between your personal computer and a website is made using the HTTP protocol. That is why, for example, the Facebook address is written https://www.facebook.com: http is the protocol, the s means the connection is secure (encrypted), www stands for World Wide Web, and facebook.com is the domain name.
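To make the parts of that address concrete, here is a minimal sketch that splits a URL with Python's standard library. The helper name describe_url and the keys of the dictionary it returns are made up for this illustration, and stripping a leading "www." is a simplification rather than a general rule for finding a domain.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the parts described above (illustrative helper)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Naive assumption for illustration: a leading "www." is just a prefix.
    domain = host[4:] if host.startswith("www.") else host
    return {
        "protocol": parts.scheme,           # "http" or "https"
        "secure": parts.scheme == "https",  # the "s" in https
        "host": host,
        "domain": domain,
    }


info = describe_url("https://www.facebook.com")
print(info)
```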
A website is usually made of a back-end and a front-end. The back-end handles the logic of the website, and the front-end is the visual part, what we see in the browser.
Usually, the information we want to scrape is visible on the front-end.
HTML uses what we call tags to specify the structure of the elements on a web page. Those elements can be headlines, paragraphs, images, etc.
CSS styles those HTML tags, usually by selecting them through attributes such as class and id.
To scrape a specific element, we target its HTML tag and the attributes CSS uses to select it.
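As a rough illustration of targeting an element by its tag and class, the sketch below uses only Python's built-in html.parser module. The sample page, the class names "price" and "note", and the ClassTextCollector helper are all invented for this example. In practice, a package such as BeautifulSoup does this job with far less code.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text inside every <tag> whose class list contains css_class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self._depth = 0     # > 0 while we are inside a matching element
        self._buffer = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Already inside a match: track nested tags of the same name.
            if tag == self.tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == self.tag:
            self._depth -= 1
            if self._depth == 0:
                self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


# Hypothetical page used only for this example.
SAMPLE_HTML = """
<html><body>
  <h1 class="title">Example products</h1>
  <p class="price">19.99</p>
  <p class="note">Free shipping</p>
  <p class="price">5.00</p>
</body></html>
"""

collector = ClassTextCollector("p", "price")
collector.feed(SAMPLE_HTML)
prices = collector.results
print(prices)
```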
How do web scrapers extract data from a website – Theory of web scraping
Now that we have an overview of how the internet works, let's talk about web scraping.
Put simply, web scrapers use technologies that allow them to send one or many HTTP requests to a website, get back the front-end part, parse the data they need from it, and finally save it in the desired format.
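The request, parse, and save steps can be sketched with nothing but the standard library. This is one possible shape, not a production scraper: the function names, the "example-scraper/0.1" User-Agent string, and the choice to extract links and output JSON are all assumptions made for the example.

```python
import json
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch(url):
    """Step 1: send an HTTP GET request and return the page's HTML as text."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkParser(HTMLParser):
    """Step 2: collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def parse_links(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def to_json(links):
    """Step 3: serialise the result in the desired format (JSON here)."""
    return json.dumps({"links": links}, indent=2)


def scrape(url):
    """Run the three steps in order for a single page."""
    return to_json(parse_links(fetch(url)))
```

Calling scrape with a page address would run all three steps. Before pointing it at a real website, check that site's terms of use and robots.txt.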
Technologies used for web scraping
Many technologies are used for data extraction, from programming languages all the way to sophisticated software applications. Choosing one technology over another depends on the experience of the scraper, the scale of the project or of the website to be scraped, the budget of the client, and the deadline given by the client.
To name a few, the most popular is the programming language Python with its packages and frameworks, such as requests with BeautifulSoup, Scrapy, Selenium, and Splash. Each of these packages and frameworks has its own uses, advantages, and disadvantages. Another option is to use web scraping services like ParseHub and Octoparse.
Saving the raw data
Once the data is scraped, it can be saved in many formats, such as CSV, JSON, Excel, or TXT. Saving the data is done with a programming language (like Python) or with specialized software in the case of paid services.
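As a small sketch of the saving step in Python, the example below writes the same rows as CSV and as JSON using the standard library. The rows and the helper names write_csv and write_json are invented for illustration. In-memory buffers stand in for files so the example runs anywhere.

```python
import csv
import io
import json

# Hypothetical scraped rows, used only for this example.
rows = [
    {"name": "Widget", "price": "19.99"},
    {"name": "Gadget", "price": "5.00"},
]


def write_csv(rows, fileobj):
    """Write a list of dicts as CSV, using the first row's keys as the header."""
    writer = csv.DictWriter(fileobj, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


def write_json(rows, fileobj):
    """Write the same rows as a JSON array."""
    json.dump(rows, fileobj, indent=2)


csv_buffer = io.StringIO()
write_csv(rows, csv_buffer)
csv_text = csv_buffer.getvalue()

json_buffer = io.StringIO()
write_json(rows, json_buffer)
json_text = json_buffer.getvalue()

print(csv_text)
print(json_text)
```

To write real files instead, pass a file opened with open("products.csv", "w", newline="", encoding="utf-8") to write_csv, where products.csv is just a placeholder name.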
Web scraping is a skill that requires a basic understanding of how the internet and websites work, as well as a deep understanding of specific data extraction technologies and theories. Keeping up with what's new is important, since web scraping is a rapidly changing discipline.
Thank you for reading my post!