What difference does it make to a web scraper whether a website is dynamic or static? What difficulties does each one present? And, most importantly, which technologies should one use in each case?
NOTE: I’ll be talking about scraping both static and dynamic websites with Python-related technologies, not paid services.
In a nutshell, static sites are known for displaying the same content to every user. There is no database-driven, user-specific filtering.
To simplify it further: the data is written directly into the HTML, and in order to request another page (HTTP GET request) or send data (HTTP POST request), the entire page reloads so that the new data or page is rendered.
Examples of static websites: Wikipedia and simple portfolio websites.
On the other hand, dynamic websites can be entirely user-specific. A good example is social media, where you may see a post on Facebook or Twitter that another person will never see.
Basically, the way dynamic websites function under the hood is as follows:
- The front-end (the client side) requests the data, usually via JavaScript.
- The back-end (the server side) serves the data.
- The front-end renders the new data into the page.
All of that happens without reloading the entire page; only the section where the data changes is updated.
Examples of dynamic websites: social media websites and news websites.
With all that being said, we can understand why dynamic websites can be challenging to build: they require a lot of experience and skill in both client-side and server-side scripting (code executed by the server, written in Python, PHP, Node.js, etc., which handles the logic of the website).
Difference between Scraping Static websites vs Scraping dynamic websites
A rule of thumb is that static websites are generally easier to scrape, since the data is fixed in the HTML when the page loads. This means we can simply fetch the HTML using a package like requests and then parse it to extract the data we need.
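The fetch-then-parse pattern can be sketched as follows. To keep the sketch self-contained it uses only the standard library and a hardcoded HTML snippet; in a real project you would typically download the page with requests and parse it with BeautifulSoup instead.

```python
# Minimal sketch of the static-scraping pattern: the data is already
# in the HTML, so we only need to parse it out.
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collect the text of every <h2> element."""
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headlines.append(data.strip())

# On a static site, this HTML is exactly what the server sends us
# (here hardcoded; normally fetched with requests.get(url).text):
html = "<h1>Blog</h1><h2>First post</h2><h2>Second post</h2>"
parser = HeadlineParser()
parser.feed(html)
print(parser.headlines)  # -> ['First post', 'Second post']
```

Because nothing changes after the page loads, there is no need to run JavaScript or automate a browser for sites like this.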
Technologies used for scraping static and dynamic websites
For static websites we can use almost any Python scraping package: requests with BeautifulSoup, Scrapy, Selenium, or Splash. Choosing one over another depends on many factors, such as the scraper's experience, the scale of the project, and the client's time and budget.
The approach for extracting data from dynamic websites may differ from static ones slightly or entirely, depending on the website.
The go-to method is to use either Selenium or Splash, since both of these technologies automate the browser and therefore mimic human behavior. Selenium is easier to learn and use than Splash, and is therefore the more widely used of the two.
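A browser-automation approach with Selenium might look like the sketch below. It assumes the selenium package and a Chrome driver are installed; the URL and CSS selector in the usage comment are hypothetical placeholders.

```python
# Sketch: render a dynamic page in a headless browser so its
# JavaScript runs, then read the data out of the resulting DOM.

def scrape_dynamic(url, selector):
    """Open `url` in headless Chrome and return the text of every
    element matching the CSS `selector`."""
    # Imports are inside the function so the sketch loads even in
    # environments where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless=new")  # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)  # the page's JavaScript runs here
        elements = driver.find_elements(By.CSS_SELECTOR, selector)
        return [el.text for el in elements]
    finally:
        driver.quit()  # always release the browser process

# Usage (hypothetical site and selector):
# titles = scrape_dynamic("https://example.com/feed", "div.post-title")
```

The trade-off is speed: driving a real browser is much slower than plain HTTP requests, which is one reason to prefer a site's API when it is accessible.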
The best way to scrape a dynamic website is through its API, if one can be found. If the API is exposed, we can usually discover it via the browser's inspector tool: open the Network tab and inspect the requests sent there.
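Once an endpoint shows up in the Network tab, we can call it directly and receive structured JSON, skipping HTML parsing entirely. The sketch below uses a made-up payload of the kind such an endpoint might return; the endpoint URL in the comment is a placeholder.

```python
import json

# Hypothetical JSON body, like what the Network tab shows when the
# page loads more posts:
response_body = """
{
  "posts": [
    {"id": 1, "title": "Hello world", "likes": 12},
    {"id": 2, "title": "Scraping tips", "likes": 34}
  ]
}
"""

# With the real endpoint known, you would fetch it directly, e.g.:
#   import urllib.request
#   response_body = urllib.request.urlopen(
#       "https://example.com/api/posts").read()  # placeholder URL

# The data arrives already structured -- no HTML parsing needed:
data = json.loads(response_body)
titles = [post["title"] for post in data["posts"]]
print(titles)  # -> ['Hello world', 'Scraping tips']
```

This is why the API route, when available, beats browser automation: the responses are smaller, faster, and far more stable than a rendered page.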
Extracting data from dynamic websites is more challenging than extracting it from static ones; it requires more expertise and a better understanding of specific technologies like Selenium and Splash. However, it's important for any scraper to learn how to scrape static websites first, since they are easier and get your foot in the door. Then move on to extracting data from more complex, dynamic web applications.
Check out my other blogs to get better at this craft.
Thank you for reading my post!