And How to Create One in One Line of Code
Motivation
It is important to structure your data science project according to a consistent standard so that your teammates can easily maintain and modify it.
But what standard should you follow? Wouldn't it be nice if you could create an ideal structure for a data science project using a template?
There are some great templates for data science projects out there, but they lack good practices such as testing, configuration management, and code formatting.
That is why I created a repository named data-science-template. This repository is the result of years of refining the best way to structure a data science project so that it is reproducible and maintainable.
In this article, you will learn how to use this template to incorporate best practices into your data science workflow.
Get Started
To download the template, start by installing Cookiecutter:

pip install cookiecutter

Create a project based on the template:

cookiecutter https://github.com/khuyentran1401/data-science-template

You will be prompted to answer some details about your project.
Now a project with the specified name is created in your current directory! The structure of the project looks like this:
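Here is a representative sketch of the layout, based on the directories and files described in this article; the exact contents may differ between template versions:

```
.
├── config/                   # configuration files managed by Hydra
│   └── main.yaml
├── data/                     # data at different stages, tracked by DVC
├── docs/                     # generated API documentation
├── model/                    # trained models, tracked by DVC
├── src/                      # Python source code
├── tests/                    # test files
├── .pre-commit-config.yaml   # pre-commit plugin configuration
├── dvc.yaml                  # DVC stage definitions
├── Makefile                  # commands to set up the environment
└── pyproject.toml            # main dependencies managed by Poetry
```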
The tools used in this template are:
- Poetry: Dependency management
- Hydra: Manage configuration files
- pre-commit plugins: Automate code review and formatting
- DVC: Data version control
- pdoc: Automatically create API documentation for your project
In the next few sections, we will learn the functionalities of these tools and files.
Install Dependencies
There are two common ways to install dependencies: pip and Poetry. The following sections show you how to use each approach.
Pip
It is a good practice to create a virtual environment for your project so that you can isolate its dependencies from the dependencies of other projects on your machine.
To create a virtual environment, type:
python3 -m venv venv

To activate the virtual environment, type:

source venv/bin/activate

Next, install the dependencies for this project from requirements.txt:

pip install -r requirements.txt

To add a new PyPI library, run:

pip install <library-name>

Poetry
This project uses Poetry instead of pip to manage dependencies since Poetry allows you to:
- Separate the main dependencies and the sub-dependencies into two separate files (instead of storing all dependencies in requirements.txt)
- Create readable dependencies files
- Remove all unused sub-dependencies when removing a library
- Avoid installing new packages that conflict with the existing packages
- Package your project in several lines of code
Find the instructions on how to install Poetry here. All main dependencies for this project are specified in pyproject.toml.
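A minimal sketch of what such a pyproject.toml might contain (the project name and dependencies here are illustrative, not the template's actual contents):

```toml
[tool.poetry]
name = "my-project"                  # illustrative project name
version = "0.1.0"
description = ""
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]           # main dependencies
python = "^3.8"
pandas = "^1.5"                      # example library

[tool.poetry.dev-dependencies]       # development dependencies kept separate
pytest = "^7.2"                      # example dev library

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
```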
To install all dependencies, run:

poetry install

To add a new PyPI library, run:

poetry add <library-name>

To remove a library, run:

poetry remove <library-name>

Makefile
Makefile allows you to create short and readable commands for a series of tasks. You can use Makefile to automate tasks such as setting up the environment:
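A minimal sketch of such a Makefile, assuming Poetry and pre-commit are used as described in this article (the actual targets in the template may differ):

```makefile
# Spawn a shell inside the project's virtual environment
activate:
	poetry shell

# Install dependencies and set up git hooks
setup:
	poetry install
	poetry run pre-commit install
```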
Now whenever others want to set up the environment for your projects, they just need to run:
make activate
make setup
And the series of commands defined in the Makefile will run for them.
Manage Code and Tests
All Python code is stored under the directory src.
All test files are under the directory tests. Each test file starts with the word test, followed by the name of the file that is tested.
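For example, a test for the code in src/process.py would live in tests/test_process.py. A minimal sketch, where process.py and its get_season function are hypothetical:

```python
# tests/test_process.py
# get_season is a hypothetical function in src/process.py
from src.process import get_season


def test_get_season():
    assert get_season(month=1) == "winter"
```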
Manage Configuration Files with Hydra
A configuration file stores all of the values in one place, which helps to separate the values from the code and avoid hard coding. In this template, all configuration files are stored under the directory config .
Hydra is a Python library that allows you to access parameters from a configuration file inside a Python script.
For example, if our main.yaml file looks like this:
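Here is a sketch of such a config file (the path value is illustrative):

```yaml
# config/main.yaml
processed:
  path: data/processed/processed.csv   # illustrative value
```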
Then we can access the value inside the configuration file by adding the decorator @hydra.main to a specific function. Inside this function, we can access the value under processed and path by using dot notation: config.processed.path.
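A minimal sketch of such a script (the function name and config_path value are assumptions; adjust them to your layout):

```python
# src/process.py
import hydra
from omegaconf import DictConfig


# config_path is relative to this file; config_name refers to main.yaml
@hydra.main(config_path="../config", config_name="main")
def process_data(config: DictConfig):
    # Access values from main.yaml with dot notation
    print(config.processed.path)


if __name__ == "__main__":
    process_data()
```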
Manage Data and Models with DVC
All data is stored in subdirectories under data. Each subdirectory stores data from a different stage.
All models are stored under the directory model.
Since Git is not ideal for versioning binary files, we use DVC (Data Version Control) to version our data and models.
We specify DVC stages in the dvc.yaml file. Each stage represents an individual data process, including its inputs (deps) and resulting outputs (outs).
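A sketch of what a stage in dvc.yaml might look like (the stage name, script, and paths are illustrative):

```yaml
stages:
  process_data:                  # illustrative stage name
    cmd: python src/process.py   # command that runs this stage
    deps:                        # inputs: re-run the stage when these change
      - data/raw
      - src/process.py
    outs:                        # outputs: tracked automatically by DVC
      - data/processed
```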
All directories and files under outs will be automatically tracked by DVC.
If you want to execute the commands defined in these stages, run dvc repro. DVC will skip the stages that didn't change.
Store Your Data Remotely
The main benefit of using DVC is that it allows you to upload data tracked by DVC to remote storage. You can store your data on DagsHub, Google Drive, Amazon S3, Azure Blob Storage, Google Cloud Storage, Aliyun OSS, SSH, HDFS, and HTTP.
To add remote storage, run:

dvc remote add -d remote <REMOTE-URL>

After adding data to your local project, you can push the data to remote storage:
dvc push

To add and push all changes to Git:

git add .
git commit -m 'commit-message'
git push origin <branch>
Check Issues in Your Code Before Committing
When committing your Python code to Git, you need to make sure your code:
- looks nice
- is organized
- conforms to the PEP 8 style guide
- includes docstrings
However, it can be overwhelming to check all of these criteria before committing your code. pre-commit is a framework that allows you to identify simple issues in your code before committing it.
You can add different plugins to your pre-commit pipeline. When you commit your files, they will be checked by these plugins. Unless all checks pass, no code will be committed.
In this template, we use 5 different plugins, specified in .pre-commit-config.yaml (a sketch of this file follows the list below). They are:
- black: formats Python code
- flake8: checks the style and quality of your Python code
- isort: automatically sorts imported libraries alphabetically and separates them into sections and types
- mypy: checks static types
- nbstripout: strips output from Jupyter notebooks
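A sketch of what such a .pre-commit-config.yaml might look like (the revs below are examples, not the template's actual pins; pin each rev to the version you want):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 22.12.0        # example pin
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/flake8
    rev: 6.0.0
    hooks:
      - id: flake8
  - repo: https://github.com/PyCQA/isort
    rev: 5.12.0
    hooks:
      - id: isort
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v0.991
    hooks:
      - id: mypy
  - repo: https://github.com/kynan/nbstripout
    rev: 0.6.1
    hooks:
      - id: nbstripout
```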
To add pre-commit to git hooks, type:
pre-commit install

Now, whenever you run git commit, your code will be automatically checked and reformatted before being committed.
Add API Documentation
As a data scientist, you will often collaborate with other team members, so it is important to create good documentation for your project.
To create API documentation based on docstrings of your Python files and objects, run:
make docs_view

Output:

Save the output to docs...
pdoc src --http localhost:8080
Starting pdoc server on localhost:8080
pdoc server ready at http://localhost:8080

Now you can view the documentation at http://localhost:8080.
To save all API documentation as markdown files, run:

make docs_save

Conclusion
Congratulations! You have just learned how to structure your data science project using a data science template. The template is meant to be flexible; adjust the project to fit your applications.
Feel free to play with the data-science-template here:
I like to write about basic data science concepts and play with different data science tools. You can connect with me on LinkedIn and Twitter.
Star this repo if you want to check out the code for all of the articles I have written. Follow me on Medium to stay informed with my latest data science articles like these: