Python data science project structure

And How to Create One in One Line of Code

Motivation

It is important to structure your data science project according to a consistent standard so that your teammates can easily maintain and modify it.

But what kind of standard should you follow? Wouldn't it be nice if you could create an ideal structure for a data science project from a template?

There are some great templates for data science projects out there, but they lack good practices such as testing, configuration management, and code formatting.

That is why I created a repository named data-science-template. This repository is the result of my years of refining the best way to structure a data science project so that it is reproducible and maintainable.

In this article, you will learn how to use this template to incorporate best practices into your data science workflow.

Get Started

To download the template, start by installing Cookiecutter:

pip install cookiecutter

Create a project based on the template:

cookiecutter https://github.com/khuyentran1401/data-science-template

You will then be prompted to enter some details about your project.

Now a project with the specified name is created in your current directory! The structure of the project looks like this:

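Here is a sketch of that layout, reconstructed from the tools covered in this article; see the repository for the exact files:

.
├── config/                  # configuration files, managed by Hydra
│   └── main.yaml
├── data/                    # data for each stage, tracked by DVC
├── docs/                    # generated API documentation
├── model/                   # trained models, tracked by DVC
├── src/                     # Python source code
├── tests/                   # unit tests
├── .pre-commit-config.yaml  # pre-commit plugin configuration
├── dvc.yaml                 # DVC stages
├── Makefile                 # commands to set up the environment
└── pyproject.toml           # main dependencies (Poetry)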

The tools used in this template are:

  • Poetry: Dependency management
  • Hydra: Manage configuration files
  • pre-commit plugins: Automate code review and formatting
  • DVC: Data version control
  • pdoc: Automatically create API documentation for your project

In the next few sections, we will walk through the functionality of these tools and files.

Install Dependencies

There are two common ways to install dependencies: pip and Poetry. The following sections show you how to use each approach.

Pip

It is a good practice to create a virtual environment for your project so that you can isolate its dependencies from the dependencies of other projects on your machine.

To create a virtual environment, type:

python3 -m venv venv

To activate the virtual environment (on Linux or macOS), type:

source venv/bin/activate

Next, install dependencies for this project from requirements.txt:

pip install -r requirements.txt

To add a new PyPI library, run:

pip install <library-name>

Poetry

This project uses Poetry instead of pip to manage dependencies since Poetry allows you to:

  • Separate the main dependencies and the sub-dependencies into two separate files (instead of storing all dependencies in requirements.txt)
  • Create readable dependency files
  • Remove all unused sub-dependencies when removing a library
  • Avoid installing new packages that conflict with existing packages
  • Package your project in a few lines of code

Find the instructions on how to install Poetry here. All main dependencies for this project are specified in pyproject.toml.
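The dependency section of that file might look roughly like this (a sketch with hypothetical version constraints, not the template's exact contents):

[tool.poetry.dependencies]
python = "^3.8"
hydra-core = "^1.1"  # Hydra for configuration management
dvc = "^2.8"         # data version control

To install all dependencies, run: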

poetry install

To add a new PyPI library, run:

poetry add <library-name>

To remove a library, run:

poetry remove <library-name>

Makefile

Makefile allows you to create short and readable commands for a series of tasks. You can use a Makefile to automate tasks such as setting up the environment.
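Here is a sketch of what such a Makefile might contain; the target names match the commands shown below, but the exact recipes in the template may differ:

# Note: Makefile recipe lines must be indented with tabs
install:
	@echo "Installing dependencies..."
	poetry install
	poetry run pre-commit install

activate:
	@echo "Activating virtual environment..."
	poetry shell

download_data:
	@echo "Pulling data from the DVC remote..."
	poetry run dvc pull

setup: install download_data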

Now whenever others want to set up the environment for your project, they just need to run:

make activate
make setup

And the series of commands defined in the Makefile will run.

Manage Code and Tests

All Python code is stored under the src directory.

All test files are under the tests directory. Each test file's name starts with the prefix test_ followed by the name of the file it tests.
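For example, a test for src/process.py would live in tests/test_process.py (the file and function names here are hypothetical):

# tests/test_process.py
from src.process import process_data  # hypothetical function under test

def test_process_data():
    # placeholder assertion; a real test would check actual behavior
    assert process_data is not None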

Manage Configuration Files with Hydra

A configuration file stores all of the values in one place, which helps to separate the values from the code and avoid hard coding. In this template, all configuration files are stored under the directory config .

Hydra is a Python library that allows you to access parameters from a configuration file inside a Python script.

For example, suppose our main.yaml file looks like this:

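(A hypothetical example; the template's actual values will differ.)

# config/main.yaml
raw:
  path: data/raw/sample.csv

processed:
  path: data/processed/processed.csv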

We can then access the values inside the configuration file by adding the decorator @hydra.main to a specific function. Inside this function, we can access the value under processed and path using dot notation: config.processed.path.
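A minimal sketch of how this looks in a script (the config path and function name are assumptions, not the template's exact code):

import hydra
from omegaconf import DictConfig

@hydra.main(config_path="../config", config_name="main")
def process_data(config: DictConfig):
    # Hydra loads config/main.yaml and passes it in as `config`
    print(config.processed.path)  # e.g., data/processed/processed.csv

if __name__ == "__main__":
    process_data()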

Manage Data and Models with DVC

All data is stored in subdirectories under data. Each subdirectory stores data from a different stage.

All models are stored under the model directory.

Since Git is not ideal for versioning binary files, we use DVC (Data Version Control) to version our data and models.

We specify DVC stages in the dvc.yaml file. Each stage represents an individual data process, including its inputs (deps) and resulting outputs (outs).
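Here is a sketch of what a stage in dvc.yaml might look like (the script and directory names are illustrative):

stages:
  process_data:
    cmd: python src/process.py
    deps:
      - data/raw
      - src/process.py
    outs:
      - data/processed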

All directories and files under outs will be automatically tracked by DVC.

To execute the commands defined in these stages, run:

dvc repro

DVC will skip any stage whose dependencies have not changed.

Store Your Data Remotely

The main benefit of using DVC is that it allows you to upload data tracked by DVC to remote storage. You can store your data on DagsHub, Google Drive, Amazon S3, Azure Blob Storage, Google Cloud Storage, Aliyun OSS, SSH, HDFS, or HTTP. To add a remote, run:

dvc remote add -d remote <REMOTE-URL>

After adding data to your local project, you can push the data to remote storage:

dvc push

Add and push all changes to Git:

git add .
git commit -m 'commit-message'
git push origin <branch>

Check Issues in Your Code Before Committing

When committing your Python code to Git, you need to make sure your code:

  • looks nice
  • is organized
  • conforms to the PEP 8 style guide
  • includes docstrings

However, it can be overwhelming to check all of these criteria before committing your code. pre-commit is a framework that allows you to identify simple issues in your code before committing it.

You can add different plugins to your pre-commit pipeline. When you commit a file, it will be checked by these plugins. Unless all checks pass, no code will be committed.

In this template, we use 5 different plugins, specified in .pre-commit-config.yaml (see the sketch after this list). They are:

  • black — formats Python code
  • flake8 — checks the style and quality of your Python code
  • isort — automatically sorts imported libraries alphabetically and separates them into sections and types.
  • mypy — checks static types
  • nbstripout — strips output from Jupyter notebooks
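Here is a sketch of what .pre-commit-config.yaml might look like (the rev versions are illustrative; pin whichever versions you use):

repos:
  - repo: https://github.com/psf/black
    rev: 22.3.0
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/flake8
    rev: 4.0.1
    hooks:
      - id: flake8
  - repo: https://github.com/PyCQA/isort
    rev: 5.10.1
    hooks:
      - id: isort
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v0.961
    hooks:
      - id: mypy
  - repo: https://github.com/kynan/nbstripout
    rev: 0.5.0
    hooks:
      - id: nbstripout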

To add pre-commit to git hooks, type:

pre-commit install

Now, whenever you run git commit, your code will be automatically checked and reformatted before being committed.

Add API Documentation

As a data scientist, you will often collaborate with other team members, so it is important to create good documentation for your project.

To create API documentation based on docstrings of your Python files and objects, run:

make docs_view

Output:

Save the output to docs...
pdoc src --http localhost:8080
Starting pdoc server on localhost:8080
pdoc server ready at http://localhost:8080

Now you can view the documentation on http://localhost:8080.

To save all API documentation as Markdown files, run:

make docs_save

Conclusion

Congratulations! You have just learned how to structure your data science project using a data science template. This template is meant to be flexible; feel free to adjust the project based on your application.

Feel free to play with the data-science-template here.

I like to write about basic data science concepts and play with different data science tools. You can connect with me on LinkedIn and Twitter.

Star this repo if you want to check out the code for all of the articles I have written. Follow me on Medium to stay informed with my latest data science articles.
