Extract data from any website in less than 5 minutes.

In my experience as a software engineer and data science enthusiast, I've been fascinated by how many insights you can get by analyzing the right data. I started downloading datasets and trying to process and visualize them. But sometimes I wanted to work with specific data on the internet that wasn't available for download, yet was still available as HTML code. That's how I discovered web scraping.

Web scraping is the collection of data from web sources in an automated fashion. It's a powerful tool with multiple use cases, like monitoring prices on a shopping site, finding new opportunities on job posting sites, or monitoring the news, among many others.

In this article, we are going to learn how to extract data from any website and put it in a spreadsheet.

Please note that I am not encouraging anyone to go out there and collect personal or private information from the internet. This is for educational purposes only: just skills you might need. That's it.

Prerequisites

  • Basic understanding of Python
  • Functional programming
  • The will to learn

Pick a target

In this article, and for the sake of educational purposes, we are going to use avito.ma as an example. Our goal is to build a script that takes a search query and returns all the products related to that query on avito.ma as a spreadsheet.

Importing libraries

In this first section we import the few libraries we are going to use later on.

  • requests: making HTTP requests.
  • bs4: parsing the HTML code into a tree, which helps with navigating it and getting data out of it.
  • pandas: manipulating tabular data; we use it to store the information extracted from the website and save everything to a spreadsheet.
  • fuckit: ignoring errors in the code. If our code checks for a product's price and it doesn't exist, it just ignores that step. Alternatively we could try-catch every step of the way, but that's too much work πŸ™‚.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import fuckit
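
If any of these are missing, they can all be installed from PyPI. I'm also including lxml, since get_html below asks BeautifulSoup for the 'lxml' parser, and openpyxl, which pandas uses for the Excel export at the end:

pip install requests beautifulsoup4 pandas fuckit lxml openpyxl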

Build URL

For every term we look for on avito.ma, the URL of the results page follows a consistent format, combining the search query we typed in the search bar and the page number of the results.

Example:

The second page of the results related to "voiture" has the following URL: https://www.avito.ma/fr/maroc/voiture?o=2

Here we create a function that takes a search query and a page number and returns the corresponding URL.

def build_url(query, page=1):
    return f"https://www.avito.ma/fr/maroc/{query}?o={page}"
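
A quick sanity check reproduces the URL from the example above:

print(build_url('voiture', 2))
# https://www.avito.ma/fr/maroc/voiture?o=2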

Get HTML

This function is responsible for sending a GET request to the given URL. It parses the returned HTML and returns the parse tree representing the structure of the site and containing the data we need.

def get_html(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    # drop tags we don't need, so the tree is lighter to navigate
    [s.extract() for s in soup(['iframe', 'script', 'style'])]
    return soup
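
Some sites reject requests that don't look like they come from a browser, and a failed request would otherwise go unnoticed. Here is a minimal hardened sketch of the same function; get_html_safe and the User-Agent string are my own illustrative choices, not part of the original script:

def get_html_safe(url):
    # hypothetical variant: identify ourselves and fail loudly on HTTP errors
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; scraping-tutorial)'}
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()  # raises an exception on 4xx/5xx responses
    soup = BeautifulSoup(r.content, 'lxml')
    for s in soup(['iframe', 'script', 'style']):
        s.extract()
    return soup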

Get Items on a page

It's common knowledge in web development that if a page contains a list of elements with a very similar structure (style), they probably share very similar, sometimes identical, HTML/CSS code; the data presented is the only thing that changes.

See the two pictures below. By inspecting the code on the page, you can easily see that every product on the page has class="item".

The function get_page_posts takes the HTML (its parse tree) of the page and returns a list of HTML snippets; every item in the list is the HTML of a single product on the page.

Thanks to BeautifulSoup, we have methods like find and find_all that help us query the parse tree easily.

# get posts on one page
@fuckit
def get_page_posts(sp):
    return sp.find(class_="listing").find_all(class_="item")

Parse one post

Following the same logic, we create a get_post function that takes the HTML code of a single element on the website and returns its attributes by navigating the code using BeautifulSoup's methods.

get_post returns a dictionary like {'title': 'voiture a', 'price': '100 dhs', 'location': 'Rabat'}.

We use @fuckit as a decorator for the function in order to ignore possible errors that might occur at runtime. If the code tries to apply get_text() to an item-price element that doesn't exist, it would break the code and raise an error. By adding @fuckit we just ignore that and move on.

# get one post's information
@fuckit
def get_post(sp):
    d = {}
    d['title'] = sp.find('h2').get_text().strip()
    d['price'] = sp.find(class_='item-price').get_text().strip()
    
    extra_info = sp.find(class_='item-info-extra')
    d['date'] = extra_info.find(class_='age-text').get_text().strip()
    d['location'] = extra_info.find(class_='re-text').get_text().strip()
    d['category'] = extra_info.find(class_='cg-text').get_text().strip()

    return d
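
If you'd rather avoid the fuckit dependency, the same effect can be approximated with a plain try/except. A minimal sketch, using a hypothetical safe_text helper that is not part of the original script:

def safe_text(node):
    # return stripped text, or None if the tag wasn't found
    try:
        return node.get_text().strip()
    except AttributeError:  # node is None when find() matched nothing
        return None

# inside get_post, instead of relying on @fuckit:
# d['price'] = safe_text(sp.find(class_='item-price'))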

Run the script

The following code puts all the functions we previously discussed to use in order to get the posts as a list of dictionaries.

url = build_url('voiture', 1)
index = get_html(url)
posts = get_page_posts(index)
parsed_posts = [ get_post(post) for post in posts ]
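
Since build_url accepts a page number, the same pipeline extends naturally to several result pages. A minimal sketch, assuming we want the first 3 pages (n_pages is an arbitrary choice):

parsed_posts = []
n_pages = 3  # arbitrary: how many result pages to scrape
for page in range(1, n_pages + 1):
    soup = get_html(build_url('voiture', page))
    posts = get_page_posts(soup) or []  # @fuckit returns None on failure
    parsed_posts.extend(get_post(post) for post in posts)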

Save to sheet

In this part we use pandas to create a DataFrame out of the list of posts and save it as an Excel sheet.

df = pd.DataFrame(parsed_posts)
df.to_excel('posts.xlsx')
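
Note that to_excel relies on an Excel engine such as openpyxl being installed; if you'd rather avoid the dependency, a plain CSV works just as well:

df.to_csv('posts.csv', index=False)  # no extra dependency needed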

Here are the first 5 rows of the results. πŸŽ‰πŸŽ‰πŸŽ‰πŸŽ‰

In this article we learned how to extract data from a website step by step. This is the simplest form of web scraping. In future articles we can get into more details on how to handle more complex (JS-generated) sites, how to speed up the process, and how to build a production-ready crawler.

I hope you learned a few things about web scraping in this article. Please do not hesitate to read more and share it with your friends. Thank you for your time.

The source code: https://github.com/mouhcineToumi/web-scraping