And How to Create One in One Line of Code

Motivation

It is important to structure your data science project based on a certain standard so that your teammates can easily maintain and modify your project.

Image by Author

But what kind of standard should you follow? Wouldn't it be nice if you could create an ideal structure for a data science project using a template? There are some great templates for data science projects out there, but they lack some good practices such as testing, configuring, or formatting your code.

That is why I created a repository named data-science-template. This repository is the result of my years of refining the best way to structure a data science project so that it is reproducible and maintainable.

In this article, you will learn how to use this template to incorporate best practices into your data science workflow.

Get Started

To download the template, start by installing Cookiecutter:

pip install cookiecutter

Create a project based on the template:

cookiecutter https://github.com/khuyentran1401/data-science-template

and you will be prompted to answer some details about your project:

GIF by Author

Now a project with the specified name is created in your current directory! The structure of the project looks like this:

Image by Author
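As a rough sketch, the generated layout looks something like the tree below. The names are inferred from the sections that follow; the exact contents may differ between versions of the template:

.
├── config/                  # Hydra configuration files
├── data/                    # data, tracked by DVC
├── docs/                    # generated API documentation
├── model/                   # trained models, tracked by DVC
├── src/                     # Python source code
├── tests/                   # unit tests
├── .pre-commit-config.yaml  # pre-commit plugins
├── dvc.yaml                 # DVC stages
├── Makefile                 # automation commands
├── pyproject.toml           # dependencies managed by Poetry
└── requirements.txt         # dependencies for pip users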
The tools used in this template are:

- Poetry: manage dependencies
- Makefile: automate a series of tasks with short commands
- Hydra: manage configuration files
- DVC: version data and models
- pre-commit: check issues in your code before committing

In the next few sections, we will learn the functionalities of these tools and files.

Install Dependencies

There are two common ways to install dependencies: pip and Poetry. The following sections show you how to use each approach.

Pip

It is a good practice to create a virtual environment for your project so that you can isolate its dependencies from the dependencies of other projects on your machine. To create a virtual environment, type:

python3 -m venv venv

To activate the virtual environment, type:

source venv/bin/activate

Next, install the dependencies for this project from requirements.txt:

pip install -r requirements.txt

To add a new PyPI library, run:

pip install <library-name>

Poetry

This project uses Poetry instead of pip to manage dependencies, since Poetry allows you to:
- Separate main dependencies and sub-dependencies into two different files (instead of storing all dependencies in requirements.txt)
- Create readable dependency files
- Remove all unused sub-dependencies when removing a library
- Avoid installing new packages that conflict with existing packages
- Package your project in several lines of code

Find the instructions on how to install Poetry here.

All main dependencies for this project are specified in pyproject.toml. To install all dependencies, run:

poetry install

To add a new PyPI library, run:

poetry add <library-name>

To remove a library, run:

poetry remove <library-name>

Makefile

A Makefile allows you to create short and readable commands for a series of tasks. You can use a Makefile to automate tasks such as setting up the environment; an illustrative sketch appears at the end of this section.

Now whenever others want to set up the environment for your projects, they just need to run:

make activate

And a series of commands will be run:

GIF by Author
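As an illustrative sketch of such a Makefile, the targets and recipes below are assumptions about what the template might run, not its exact contents:

# Illustrative Makefile; recipes are assumptions, not the template's exact contents
install:
	@echo "Installing dependencies..."
	poetry install
	poetry run pre-commit install

activate:
	@echo "Activating virtual environment..."
	poetry shell

setup: install activate

With this, make setup chains both targets, so a teammate can bootstrap the whole environment with a single command.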
Manage Code and Tests

All Python code is stored under the directory src. All test files are under the directory tests.

Manage Configuration Files with Hydra

A configuration file stores all of your values in one place, which helps to separate those values from the code and avoid hardcoding. In this template, all configuration files are stored under the directory config.

Hydra is a Python library that allows you to access parameters from a configuration file inside a Python script.

Video by Author

For example, if our main configuration file specifies a value, then we can access that value inside a Python script by adding the decorator @hydra.main to a function.
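As a minimal sketch of that pattern: assume a config file config/main.yaml containing, say,

data:
  raw: data/raw/sample.csv

(the file name and contents are assumptions for illustration, not the template's actual files). The script below reads that value:

import hydra
from omegaconf import DictConfig

@hydra.main(config_path="../config", config_name="main", version_base=None)
def process_data(config: DictConfig) -> None:
    # Hydra loads config/main.yaml and passes it in as a DictConfig,
    # so the path comes from the config file instead of being hardcoded
    print(f"Processing data from {config.data.raw}")

if __name__ == "__main__":
    process_data()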
Manage Data and Models with DVC

All data is stored under the subdirectories of data. All models are stored under the directory model. Since Git is not ideal for versioning binary files, we use DVC (Data Version Control) to version our data and models.

We specify DVC stages in the dvc.yaml file; a sketch of that file appears at the end of this section. All files and directories listed under the outs of these stages will be tracked by DVC. If you want to execute the commands defined in the stages, run:

dvc repro

Store Your Data Remotely

The main benefit of using DVC is that it allows you to upload data tracked by DVC to remote storage. You can store your data on DagsHub, Google Drive, Amazon S3, Azure Blob Storage, Google Cloud Storage, Aliyun OSS, SSH, HDFS, or HTTP. To add remote storage, run:

dvc remote add -d remote <REMOTE-URL>

After adding data to your local project, you can push the data to remote storage:

dvc push

Add and push all changes to Git:

git add .
git commit -m "commit message"
git push origin <branch>
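To make the stage idea concrete, here is a hedged sketch of what the stages in dvc.yaml can look like. The stage names, scripts, and paths are illustrative assumptions, not the template's exact pipeline:

stages:
  process_data:
    cmd: python src/process.py        # script name assumed
    deps:
      - data/raw                      # raw input data
      - src/process.py
    outs:
      - data/processed                # outputs listed here are tracked by DVC
  train_model:
    cmd: python src/train_model.py    # script name assumed
    deps:
      - data/processed
      - src/train_model.py
    outs:
      - model                         # trained models tracked by DVC

Running dvc repro executes these stages in dependency order and skips any stage whose inputs have not changed.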
Check Issues in Your Code Before Committing

When committing your Python code to Git, you need to make sure your code:

- passes unit tests
- is organized
- conforms to best practices and style guides
- is documented

However, it can be overwhelming to check all of these criteria before committing your code. pre-commit is a framework that allows you to identify simple issues in your code before committing it.

You can add different plugins to your pre-commit pipeline. When you commit, your files will be checked by these plugins, and unless all checks pass, no code will be committed.

Image by Author (icons obtained from flaticon)

In this template, we use 5 different plugins, which are specified in .pre-commit-config.yaml. To add pre-commit to git hooks, type:

pre-commit install

Now, whenever you run git commit, your code will be automatically checked against these plugins.
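As an illustrative sketch of what a .pre-commit-config.yaml can look like, the snippet below uses two widely used hooks (black and flake8) as examples rather than the template's actual five plugins, and the rev values are placeholders:

repos:
  - repo: https://github.com/psf/black      # code formatter (example hook)
    rev: 22.12.0                             # pin a release you have tested
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/flake8   # style checker (example hook)
    rev: 6.0.0
    hooks:
      - id: flake8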
Add API Documentation

As a data scientist, much of the time you will collaborate with other team members. Thus, it is important to create good documentation for your project. To create API documentation based on the docstrings of your Python files and objects, run:

make docs_view

Output:

Save the output to docs...

Now you can view the documentation on http://localhost:8080.

GIF by Author

To save all API documentation as Markdown files, run:

make docs_save
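Since the documentation is generated from docstrings, functions need well-formed docstrings to show up nicely. A tiny illustrative example (this function is invented for demonstration and is not part of the template):

def get_mean(values: list[float]) -> float:
    """Compute the arithmetic mean of a list of numbers.

    Args:
        values: The numbers to average.

    Returns:
        The mean of the input values.
    """
    return sum(values) / len(values)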
Conclusion

Congratulations! You have just learned how to structure your data science project using a data science template. This template is meant to be flexible, so feel free to adjust the project based on your applications.

Feel free to play with the data-science-template here.

I like to write about basic data science concepts and play with different data science tools. You can connect with me on LinkedIn and Twitter. Star this repo if you want to check out the code for all of the articles I have written. Follow me on Medium to stay informed with my latest data science articles.