At the time of writing, I have read more than 300 resumes for data science internships and sat in on around 40 interviews. I noticed some common mistakes that I would like to point out in this article, in order to help people entering the field straight from university.
Most of the issues discussed here relate to projects carried out in an academic context, since most of them fail to simulate real-world projects. I think this is the main cause of the mismatch between academia and the job market in data science.
Data is not realistic
Students tend to work with ready-to-use data provided online (e.g. Kaggle). In doing so, they deprive themselves of facing a very common challenge that takes 80% of the time, which is data engineering. Instead, they choose to focus on building models as soon as possible.
I am not recommending reinventing the wheel by always redoing the data engineering part, even when it is partially done and available. I just want to stress the importance of working with dirty data for educational purposes, because the struggle of transforming input data into usable data is real.
Always expect data to be noisy, incomplete, or inaccurate, and the features you will want to use for training your models are almost never there. Skills in data processing, feature engineering, and data augmentation are called for very often.
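As a minimal sketch of what that cleaning work looks like in pandas (the dataset and column names below are made up for illustration):

```python
import pandas as pd

# Hypothetical messy input: mixed types, stray whitespace, missing values.
raw = pd.DataFrame({
    "signup_date": ["2023-01-05", "2023-02-10", None, "2023-03-15"],
    "age": ["34", "unknown", "29", "41"],
    "city": [" Casablanca", "rabat", "Rabat ", None],
})

# Coerce types: invalid entries become NaT/NaN instead of crashing the pipeline.
raw["signup_date"] = pd.to_datetime(raw["signup_date"], errors="coerce")
raw["age"] = pd.to_numeric(raw["age"], errors="coerce")

# Normalize inconsistent categorical values.
raw["city"] = raw["city"].str.strip().str.title()

# Impute missing ages with the median, a simple baseline strategy.
raw["age"] = raw["age"].fillna(raw["age"].median())

# Feature engineering: derive a feature a model can actually use.
raw["signup_month"] = raw["signup_date"].dt.month
```

Real pipelines are of course messier than this, but even a toy exercise like the above covers type coercion, normalization, imputation, and feature derivation, none of which a pre-cleaned Kaggle dataset will teach you.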
Coding in Notebooks
Listen. Notebooks (e.g. Jupyter) are for prototyping, not for building full projects. When I first started, I did everything in notebooks, from data collection to data visualization and model building. It worked well for small projects. But as projects evolve and grow, it becomes nearly impossible to recover and refactor everything. I learned this the hard way.
Notebooks are powerful because they let you run each cell independently and visualize data easily. But at the same time, you need to keep track of which cell ran before which, you can rarely reuse code without repeating yourself (copy & paste), and that is without mentioning version control.
What I suggest is using an IDE (e.g. PyCharm) where you can develop modules and abstractions that you can later import into your notebooks. This way you will slowly build a set of modules you can use in every project, and write less code in the notebooks while focusing on prototyping.
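As a sketch of that workflow, reusable helpers live in a module developed in the IDE (the module and function names below are hypothetical), and the notebook only imports them:

```python
# preprocessing.py -- a hypothetical reusable module maintained in the IDE.
import pandas as pd

def load_clean(path: str) -> pd.DataFrame:
    """Load a CSV and apply the project's standard cleaning steps."""
    df = pd.read_csv(path)
    df = df.dropna(how="all")  # drop fully empty rows
    df.columns = [c.strip().lower() for c in df.columns]  # normalize headers
    return df

# In the notebook, prototyping then reduces to:
#   from preprocessing import load_clean
#   df = load_clean("data/raw.csv")
```

The point is not this particular function, but that the cleaning logic is version-controlled, testable, and shared across notebooks instead of copy-pasted between cells.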
Projects with no context or impact
In academia, most projects are done for the sake of the technical part alone. Real-world projects, on the other hand, are built with business value in mind first. Even if the model's accuracy is good and the technical architecture is brilliant, what matters most is the cost of the solution and the business problem it solves.
Data scientists are meant to sit at the layer between technical experts and decision makers. That is why it is highly advisable to highlight the business value of your projects when presenting them, unless you want to limit yourself to the execution part.
Underrated Data Analysis
Data analysis work is usually not technically challenging, thanks to the available BI tools (e.g. Power BI or Tableau). The real challenge resides in extracting insights from the produced visualizations. There are different kinds of analysis; let's mention a few of them:
Descriptive analysis: visualizing data and describing it (e.g. X is the most visited place in Morocco).
Diagnostic analysis: setting hypotheses and testing them (e.g. why is X the most visited place in Morocco?).
Predictive analysis: using the gathered insights to make predictions (e.g. if we target people with interest X, we can increase revenue by Y%).
Highlighting the ability to do analysis work that goes beyond the descriptive phase is a big plus in a DS's resume or interview.
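To illustrate the step from the descriptive to the diagnostic phase, here is a minimal sketch using synthetic data and a two-sample t-test from SciPy (the campaign scenario and the numbers are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical diagnostic question: do visitors who saw campaign A
# spend more than those who saw campaign B?
spend_a = rng.normal(loc=105, scale=20, size=200)
spend_b = rng.normal(loc=100, scale=20, size=200)

# Descriptive phase: compare the group means.
mean_a, mean_b = spend_a.mean(), spend_b.mean()

# Diagnostic phase: test the hypothesis that the difference is real
# rather than noise, using an independent two-sample t-test.
t_stat, p_value = stats.ttest_ind(spend_a, spend_b)
```

A bar chart of the two means is descriptive; the p-value telling you whether the gap is likely real is where the analysis starts producing decisions.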
Not working with databases
We all love CSV and JSON files, since it is very easy to go back and forth between dataframes and these file formats. But again, for educational purposes and for the sake of simulating real-world projects, it is necessary to interact with a minimal data infrastructure. A way of doing this with minimum effort is simply using databases. I recommend MongoDB for unstructured and raw data, and PostgreSQL for trusted and structured data.
In addition to making your workflow more efficient (e.g. concurrently storing or processing data in the same repository), using databases kind of forces you to unlock many skills:
- writing advanced and efficient SQL and NoSQL queries
- connecting your data easily to BI tools (e.g. Power BI) and creating dashboards
- focusing on data modeling
- automating backups and recovery scenarios
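As a minimal sketch of that kind of workflow with pandas and SQLAlchemy (using an in-memory SQLite database here so the snippet runs anywhere; in practice you would point the connection string at your PostgreSQL instance, and the table and column names are made up):

```python
import pandas as pd
from sqlalchemy import create_engine

# In practice: create_engine("postgresql://user:password@host:5432/dbname")
engine = create_engine("sqlite:///:memory:")

# Store a cleaned dataframe as a "trusted" table in the shared repository.
df = pd.DataFrame({"country": ["MA", "MA", "FR"], "visits": [120, 80, 95]})
df.to_sql("visits", engine, index=False, if_exists="replace")

# Later, any notebook or BI tool can query the same data with SQL.
top = pd.read_sql(
    "SELECT country, SUM(visits) AS total "
    "FROM visits GROUP BY country ORDER BY total DESC",
    engine,
)
```

Once the data lives in a database rather than in scattered CSV files, connecting a BI tool, writing aggregation queries, and scripting backups all become natural next steps.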
In this article, I listed what I think are the main points forming the gap between academia and the job market in data science, and gave a few recommendations to simulate real-world projects and therefore increase your chances of landing more opportunities.
I hope this article added value. Please do not hesitate to read more articles on the blog.