Abstracting Data Collection from a Website

For every data professional, the worst thing that can happen is having no data about the problem they are trying to solve. Sometimes the problem is so complex that you can’t afford to spend time collecting and processing data; you need it in a ready-to-use state ASAP.

This situation also makes it very hard for entry-level data scientists to build relevant and useful ML projects for their environments, so they end up using the same famous datasets as everyone else. As a result, it becomes challenging to find a unique problem to put on their resume.

I took this problem personally and dedicated some time to provide a solution.

https://media.giphy.com/media/bN4Gf6GEs9OtW/giphy.gif

In this article we will answer the following questions:

  • Why must data engineering be abstracted?
  • How can we abstract the data collection from a website (step by step)?

There are no specific requirements to follow along, as long as you are familiar with data science and software engineering.


Why must data engineering be abstracted?

Short answer

Nobody really wants to know what’s happening under the hood. We just want the results.

Long answer

When we talk about abstraction in software engineering, we generally mean hiding the complex logic of a software component and exposing just an interface instead. Abstractions can also take the form of devices (e.g. smartphones) whose buttons let us do complex tasks without knowing how they are done inside. For more on this topic, check: Design Patterns

The data engineer’s role is to make data available, reliably, to every person or piece of software that needs it. The daily tasks of a DE can involve building data pipelines for collecting, processing and storing data, as well as managing databases.

Having a data engineer in the team will help the data scientists focus on what they do best instead of worrying about collecting and processing data.

Use case: building an unofficial API for moteur.ma

Moteur.ma is a Moroccan website where people can search for used and new cars to buy. It’s a very active platform and we can derive a lot of value from its data. Here are some example use cases:

  • predicting the price of your car
  • recommendation system that suggests cars for you to buy
  • monitoring the market of used cars in Morocco

In the following sections, I will guide you through the steps of building an API for moteur.ma. The same logic can be followed to develop APIs for other websites.

PS: The API is the abstraction layer we are talking about.

This process is also called web scraping, but we are doing it in a sexy way now 😉.

Step 0: Understanding the problem

Before we jump to the code, let’s think first.

Basically, what we are trying to do is reverse engineer the website’s original API. Moteur.ma already has an API serving data to the user interface. As users, we only receive the HTML code and see it rendered by the browser.

The reverse engineering step can also be done by intercepting API calls in the network traffic and reusing them, especially when the website is not server-side rendered (e.g. Twitter, LinkedIn). That’s not the case here.

The following figures simplify the website’s architecture and workflow when we interact with it through the browser.

[Figures: the website’s architecture and workflow when accessed through the browser]

Now we are trying to take that final output (the HTML code) and extract the data from it to serve it again.

[Figure: extracting the data from the HTML output and serving it again]

Step 1: Extracting the HTML

The first step is getting the HTML that the website returns when we access a specific URL. We’ll use Requests for that (line 6).

An important detail is that we need the HTML in a format that is searchable and parsable. We’ll use BeautifulSoup for that (line 8).

The following code is responsible for turning a URL into parsable HTML.
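A minimal sketch of this step (not the original listing), assuming the `requests` and `beautifulsoup4` packages are installed. The function names `html_to_soup` and `get_soup` and the `timeout` parameter are illustrative choices, not names taken from the project.

```python
import requests
from bs4 import BeautifulSoup


def html_to_soup(html: str) -> BeautifulSoup:
    """Turn raw HTML text into a searchable BeautifulSoup tree."""
    # "html.parser" is Python's built-in parser, so no extra dependency is needed.
    return BeautifulSoup(html, "html.parser")


def get_soup(url: str, timeout: float = 10.0) -> BeautifulSoup:
    """Download the page at `url` and return it as a parsable tree."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return html_to_soup(response.text)
```

Splitting the download from the parsing keeps the parsing part easy to try on a saved HTML string without touching the network.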

Step 2: Parsing the HTML

Now that we have the HTML, we need to extract the data from it.

Thanks to BeautifulSoup, the HTML code of the website is represented in a tree data structure. We can now access any element by indicating which path to follow and which conditions to validate.

The figure below shows the code for parsing the HTML.

In line 3, for example, we get the “href” value of the element whose class attribute equals “slide” and which has an “href” attribute.

parse_html.png
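As a hedged illustration of this kind of lookup, the sketch below runs on a small made-up HTML snippet. Only the “slide” class and its “href” attribute come from the description above; the `listing`, `title` and `price` class names, the sample values, and the `parse_listing` function are assumptions for the example and are not taken from moteur.ma.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for one search result (hypothetical structure).
SAMPLE_HTML = """
<div class="listing">
  <a class="slide" href="/ad/123">photo</a>
  <h3 class="title">Dacia Logan</h3>
  <span class="price">85 000 Dhs</span>
</div>
"""


def parse_listing(soup) -> dict:
    """Extract a few fields from one listing, tolerating missing elements."""
    # Element with class "slide" that also carries an href attribute.
    link = soup.find(class_="slide", href=True)
    title = soup.find(class_="title")  # hypothetical class name
    price = soup.find(class_="price")  # hypothetical class name
    return {
        "link": link["href"] if link else None,
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    }


listing = parse_listing(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(listing)
```

Returning `None` for a missing element instead of raising keeps one odd listing from stopping a whole crawl.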

Step 3: Translating the search query to URL

We are now able to go from a simple URL to data, so we can get any data we want as long as we have the URL that renders it. We’ve made progress, but building URLs by hand is not very practical.

https://media.giphy.com/media/J27aVYwDuPpWHKZ06S/giphy.gif

The answer to this is our first abstraction block: a query builder.

The query builder is a component responsible for taking the user’s search query and filters and building a URL from them. It basically translates the human request into a server request.

The following code is a short version of the code that’s going to build the query for us.

carbon(2).png
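To make the idea concrete, here is a minimal sketch of a query builder using only the standard library. The base URL is a placeholder, and the filter names (`brand`, `model`, `max_price`, `page`) are hypothetical: the real query parameters of moteur.ma are not documented here and would have to be read from the site’s own search URLs.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode


@dataclass
class QueryBuilder:
    """Translate search filters into a URL (parameter names are assumptions)."""

    base_url: str
    brand: Optional[str] = None
    model: Optional[str] = None
    max_price: Optional[int] = None
    page: int = 1

    def build(self) -> str:
        params = {
            "brand": self.brand,
            "model": self.model,
            "max_price": self.max_price,
            "page": self.page,
        }
        # Drop filters the user did not set so the URL stays minimal.
        params = {key: value for key, value in params.items() if value is not None}
        return f"{self.base_url}?{urlencode(params)}"


# Placeholder host, not the real site.
query = QueryBuilder("https://example.com/search", brand="dacia", max_price=90000)
print(query.build())
# https://example.com/search?brand=dacia&max_price=90000&page=1
```

Using `urlencode` rather than string concatenation means spaces and special characters in a filter value are escaped correctly.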

Step 4: Hiding the logic behind classes and functions

Finally, we have all the necessary components for extracting data from the website. Our plan is to make all of our work reusable by our friends and coworkers who want to collect data for their data science projects.

We’ll package our code under classes and functions.

The following code shows how we abstract the extraction of parsable HTML from a URL under the class StaticCrawler. This class can be reused by any other crawler in the future, so we won’t see the code responsible for that part ever again unless we want to edit it.

static_crawler.png
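A rough sketch of what such a class could look like, assuming `requests` and `beautifulsoup4`. Only the class name StaticCrawler comes from the article; the method names `fetch` and `get_soup`, the `timeout` argument, and the use of a session are assumptions made for this example.

```python
import requests
from bs4 import BeautifulSoup


class StaticCrawler:
    """Fetch server-side-rendered pages and return them as parsable trees."""

    def __init__(self, timeout: float = 10.0):
        self.timeout = timeout
        # A session reuses the underlying connection across requests.
        self.session = requests.Session()

    def fetch(self, url: str) -> str:
        """Return the raw HTML of `url`, raising on HTTP errors."""
        response = self.session.get(url, timeout=self.timeout)
        response.raise_for_status()
        return response.text

    def get_soup(self, url: str) -> BeautifulSoup:
        """Return the page at `url` as a BeautifulSoup tree."""
        return BeautifulSoup(self.fetch(url), "html.parser")
```

Any crawler for a static website can inherit from this class and only add its own parsing logic.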

The following class inherits all of StaticCrawler’s behaviors and adds some of its own. It integrates a parse() method that’s responsible for parsing the HTML of a single search result.

It also has another method called run() that’s responsible for combining all the behaviors of the class in order to return data from one specific URL.

carbon(3).png
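The inheritance described above might be sketched as follows. The method names parse() and run() come from the article; the subclass name `MoteurCrawler`, the `listing` and `title` class names, and the compact stand-in for StaticCrawler are assumptions made to keep the example self-contained.

```python
import requests
from bs4 import BeautifulSoup


class StaticCrawler:
    """Compact stand-in for the base class: URL in, parsable HTML out."""

    def get_soup(self, url: str) -> BeautifulSoup:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return BeautifulSoup(response.text, "html.parser")


class MoteurCrawler(StaticCrawler):
    """Hypothetical subclass: adds site-specific parsing on top of fetching."""

    # Assumed class name of the element wrapping one search result.
    RESULT_CLASS = "listing"

    def parse(self, result) -> dict:
        """Parse the HTML of a single search result into a dictionary."""
        link = result.find(class_="slide", href=True)
        title = result.find(class_="title")  # hypothetical class name
        return {
            "link": link["href"] if link else None,
            "title": title.get_text(strip=True) if title else None,
        }

    def run(self, url: str) -> list:
        """Fetch one results page and parse every search result on it."""
        soup = self.get_soup(url)
        return [self.parse(r) for r in soup.find_all(class_=self.RESULT_CLASS)]
```

The subclass never touches HTTP details: if the fetching logic changes, only StaticCrawler needs to be edited.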

To see the real impact of this step, we must see how everything comes together to collect data from the website. Here is an example that only needs the QueryBuilder parameters as input in order to return the data you need.

PS: even this can be hidden under a function... but let’s stop here for the sake of finishing this article.

run.png
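The PS above mentions that even this wiring can be hidden under a function. One possible shape for such a function is sketched below; the name `collect` and the two protocol classes are hypothetical, and the stand-in objects at the bottom exist only to make the example runnable without network access.

```python
from typing import List, Protocol


class BuildsUrl(Protocol):
    """Anything that can turn search filters into a URL."""

    def build(self) -> str: ...


class Crawls(Protocol):
    """Anything that can turn a URL into a list of records."""

    def run(self, url: str) -> List[dict]: ...


def collect(query: BuildsUrl, crawler: Crawls) -> List[dict]:
    """Build the URL from the query, then let the crawler return the data."""
    return crawler.run(query.build())


# Stand-ins for demonstration only; a real query builder and crawler
# would be passed in their place.
class FakeQuery:
    def build(self) -> str:
        return "https://example.com/search?brand=dacia"


class FakeCrawler:
    def run(self, url: str) -> List[dict]:
        return [{"source": url}]


print(collect(FakeQuery(), FakeCrawler()))
```

The caller only deals with search filters and receives data back, which is the abstraction layer this article is about.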

That’s it. That’s how I always abstract my code for my teammates to use. I hope this article will add value to your career.

You can read the final version of the code on this repository and you can install it using the following command.

carbon(4).png

Thank you for your time.