Web Scraping
This topic matters because it introduces what web scraping is and how to implement it with Python.
What is Web Scraping?
Web scraping, also known as web harvesting or web data extraction, is a technique for automatically accessing and extracting large amounts of information from a website. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol or through a web browser. The term typically refers to automated processes implemented with a bot or web crawler. It is a form of copying in which specific data is gathered from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
Web Scrape with Python in 4 minutes
Important notes
- Read through the website’s Terms and Conditions to understand how we can legally use the data. Most sites prohibit using their data for commercial purposes.
- Make sure we are not downloading data at too rapid a rate, because this may overload the website, and we may be blocked from the site as well.
Inspecting the Website
We need to figure out where the links to the files we want to download are located within the multiple levels of HTML tags, and find the relevant pieces of code that contain our data. We use our browser’s inspect tool to locate the .txt files, which are usually inside <a> tags.
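For example, each data file on the page is linked with an anchor tag along these lines (an illustrative snippet, not copied verbatim from the live page):

<a href="data/nyct/turnstile/turnstile_180922.txt">Saturday, September 22, 2018</a>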
Python Code
Import libraries:
import requests                # fetch the page over HTTP
import urllib.request          # download files from URLs
import time                    # pause between requests
from bs4 import BeautifulSoup  # parse the HTML
Set url to the website’s address and request the page with the requests library:
url = 'https://web.mta.info/developers/turnstile.html'
response = requests.get(url)
Parse the HTML with BeautifulSoup:
soup = BeautifulSoup(response.text, "html.parser")
Use the method .findAll (spelled .find_all in newer versions of BeautifulSoup) to locate all of our <a> tags:
soup.findAll('a')
Extract the link that we want:
one_a_tag = soup.findAll('a')[38]
link = one_a_tag['href']
Pause code to avoid spamming the website with requests:
time.sleep(1)
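Putting these steps together, a minimal end-to-end sketch might look like the following. It assumes the hrefs on the turnstile page are relative paths (so we prepend the site’s base URL), and index 38 is specific to this page’s layout, so treat both as illustrative:

import time
import urllib.request
import requests
from bs4 import BeautifulSoup

url = 'https://web.mta.info/developers/turnstile.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Pick one <a> tag that points at a data file; index 38 may change
# if the page layout changes.
one_a_tag = soup.findAll('a')[38]
link = one_a_tag['href']

# Assumption: hrefs on this page are relative, so build an absolute URL.
download_url = 'https://web.mta.info/developers/' + link

# Save the file locally under its own filename.
urllib.request.urlretrieve(download_url, link.split('/')[-1])

time.sleep(1)  # be polite: pause before the next request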
How to scrape websites without getting blocked
Web scraping is a task that has to be performed responsibly so that it does not have a detrimental effect on the sites being scraped. Web crawlers can retrieve data much faster and in greater depth than humans, so careless scraping practices can hurt a site’s performance. While many websites have no anti-scraping mechanisms, some use measures that can get web scrapers blocked, because they do not want their data accessed in bulk.
Best practices to avoid getting blocked (a short sketch combining several of these follows the list):
- Respect robots.txt
- Make the crawling slower, do not slam the server, treat websites nicely
- Do not follow the same crawling pattern
- Make requests through Proxies and rotate them as needed
- Rotate User Agents and corresponding HTTP Request Headers between requests
- Use a headless browser like Puppeteer, Selenium or Playwright
- Beware of Honey Pot Traps
- Check if Website is Changing Layouts
- Avoid scraping data behind a login
- Use Captcha Solving Services
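As a rough illustration of several of these practices, here is a minimal sketch that checks robots.txt before fetching, rotates User-Agent headers between requests, and sleeps a randomized interval. The User-Agent strings and the robots.txt URL are assumptions for illustration, not values from any particular site:

import random
import time
import urllib.robotparser
import requests

# Illustrative User-Agent strings to rotate between (assumed values).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

def can_fetch(url):
    # Respect robots.txt: ask whether any crawler may fetch this URL.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url('https://web.mta.info/robots.txt')  # assumed location
    rp.read()
    return rp.can_fetch('*', url)

def polite_get(url):
    if not can_fetch(url):
        raise RuntimeError('Disallowed by robots.txt: ' + url)
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    response = requests.get(url, headers=headers)
    time.sleep(random.uniform(1, 3))  # randomized delay between requests
    return response

Randomizing the delay also helps avoid the fixed crawling pattern mentioned above.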
Things I want to know more about
- I would like to know more about web crawlers and how to web scrape efficiently without getting blocked.