How to scrape websites with Python!

Web scraping, web harvesting, or web data extraction is the process of extracting data from websites. There are different ways to scrape a website, such as using online services, using APIs, or writing your own code. In this article, we'll see how to implement web scraping with Python. We will use one of the websites I have built.

I will skip the installation of Python in this tutorial.

Using your preferred text editor, create a Python file and name it whatever you want. I'll name mine scraper.py. We'll import all the libraries that we'll need to build our scraper. A library is a collection of pre-written code, organized into modules, that a program can import and use instead of writing everything from scratch.

# import libraries
import requests
import csv
from bs4 import BeautifulSoup

Now let's get the URL of the website we want to scrape data from. In this case, we'll use https://windhoeknamibia.github.io/.

Using our URL, we can now fetch the page with the requests library.

# import libraries
import requests
import csv
from bs4 import BeautifulSoup

# url to scrape
url = "https://windhoeknamibia.github.io/"

# send an HTTP GET request
resp = requests.get(url)

# print the raw HTML of the page
print(resp.text)
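
If the request fails (say the site returns a 404), the rest of the script would end up parsing an error page instead of the real HTML. A quick safeguard is requests' built-in raise_for_status helper:

# stop early if the server returned an error status (4xx or 5xx)
resp.raise_for_status()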

Next, create a soup object to parse the HTML and get the title of the website.

# create a soup object from the response
soup = BeautifulSoup(resp.content, 'html.parser')

# get the element with id "title"
title = soup.find(id="title")
print(title)  # the full html tag
print(title.string)  # just the title text
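
Note that soup.find(id="title") looks for an element whose id attribute is "title", which this particular page happens to have. If you want the page's <title> tag instead, BeautifulSoup exposes it directly:

# the <title> tag from the page's <head>
print(soup.title.string)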

Now use the soup object to find all the places listed on the page!

# get all <h4> tags with the class "place-name"
places_obj = soup.find_all("h4", {"class":"place-name"})
print(places_obj)

Write all the place names to a CSV file.

# create a list of rows, starting with a header row
list_of_places = [["Place Names"]]

# loop through the places object and append each name as its own row
for place in places_obj:
  list_of_places.append([place.string])

print(list_of_places)

with open('places.csv', 'w', newline='') as csv_file:  # creates a new csv file
  writer = csv.writer(csv_file)  # create a csv writer object
  writer.writerows(list_of_places)  # write every row into the file

# the csv file will be displayed in your workspace
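
To double-check the result, you can read the file back with the same csv module:

# optional: read places.csv back and print each row
with open('places.csv', newline='') as csv_file:
  for row in csv.reader(csv_file):
    print(row)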

Next, let's use the soup object again to find the elements that hold the image src links.

# all images are inside a div with the class "whk-place"
img_obj = soup.find_all('div', {'class': 'whk-place'})

Now let's print out all the image links.

img_links = []

for link in img_obj:
  img_links.append(link.find('img').get('src'))  # grab the src attribute of the <img> inside each div

print(img_links)
# prints a list of image src links
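
One thing to watch out for: src values can be relative paths (for example "images/pic.jpg") rather than full URLs. The standard library's urljoin can resolve them against the page URL:

from urllib.parse import urljoin

# resolve any relative src paths against the page url
absolute_links = [urljoin(url, link) for link in img_links]
print(absolute_links)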

I hope this article helped you understand web scraping and how to use Python libraries to scrape websites. You can continue practicing with more examples on different websites.

Be careful not to scrape data from websites that do not give you permission to do so. To know whether a website allows web scraping, you can look at the website's “robots.txt” file. You can find this file by appending “/robots.txt” to the URL of the site you want to scrape.
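
Python's standard library can even parse that file for you. Here is a minimal sketch using urllib.robotparser to check whether our example site allows crawling:

from urllib import robotparser

# load and parse the site's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url("https://windhoeknamibia.github.io/robots.txt")
rp.read()

# can any user agent ("*") fetch the home page?
print(rp.can_fetch("*", "https://windhoeknamibia.github.io/"))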


To do!

Try to write all the image src links into a CSV file.
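
If you get stuck, here is one possible sketch, reusing the csv writer pattern from earlier (the file name images.csv is just my choice):

# build the rows: a header followed by one link per row
rows = [["Image Links"]] + [[link] for link in img_links]

with open('images.csv', 'w', newline='') as csv_file:
  csv.writer(csv_file).writerows(rows)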

Happy coding!