Figuring out how to structure, manage, and maintain your data science project involves a series of non-trivial tasks that, if done correctly, can make your life much easier as your project unfolds and matures. Aiming for reproducibility adds a further layer of complexity and difficulty, but can contribute to the longevity and credibility of a project.
Where can you find a comprehensive resource on making data science projects reproducible and maintainable in practice?
Python project file structure (source: Build a Reproducible and Maintainable Data Science Project by Khuyen Tran)
Khuyen Tran has put together a fantastic resource in the free online book titled, rather appropriately, Build a Reproducible and Maintainable Data Science Project.
This book introduces Python tools for developing efficient workflows for reproducible and maintainable data science projects. We introduce the best practices and tools that allow data scientists to adapt to the ever-increasing demand for complexity, while guaranteeing the reliability of their systems.
This all sounds good, you think, but what does it really mean? To get an overview of what the book offers, I suggest you take a quick look at Section 2.1, How to Structure a Data Science Project for Readability and Transparency. You’ll quickly understand how the book is structured, what it covers, how it covers it, and Tran’s appreciation for (and adherence to) standards and best practices. You will find an easy-to-read, well-structured and informative resource waiting for you.
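To make the idea of a standard project structure concrete, here is a minimal sketch that scaffolds a project skeleton using only the standard library. The directory names in `LAYOUT` and the `scaffold` helper are assumptions chosen for illustration; they are not taken from the book, whose template may be organized differently.

```python
from pathlib import Path
import tempfile

# Hypothetical layout for illustration only; the book's template may differ.
LAYOUT = [
    "config",
    "data/raw",
    "data/processed",
    "models",
    "notebooks",
    "src",
    "tests",
]


def scaffold(root: Path, layout=LAYOUT) -> list[Path]:
    """Create each directory in `layout` under `root` and return the paths."""
    created = []
    for relative in layout:
        path = root / relative
        # parents=True builds nested folders such as data/raw in one call;
        # exist_ok=True makes the function safe to run more than once.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    # Build the skeleton in a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold(Path(tmp) / "my_project"):
            print(path.relative_to(tmp))
```

Keeping raw data, processed data, source code and tests in predictable places is what lets a newcomer (or your future self) find their way around without a guided tour.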
Tran’s book and the accompanying data science Cookiecutter template rely on the following Python tools to achieve their goals:
- Cookiecutter
- pre-commit plugins
- and more
I’m a big fan of Poetry, so from the start I’m happy that Tran chose to use it in the project (NB: a pip version of the project exists for those who prefer it). Poetry is a masterful Python dependency management tool with many features beyond what pip offers. You can read more about Poetry here.
Attention is also given to testing, configuration file management, project installation, data and template management, and code compliance and organization. In short, whatever you should be doing to ensure your code is reproducible and maintainable, Tran covers in this book. Not only are the concepts and practices covered, but the accompanying GitHub repository contains a project template that helps you carry out this set of tasks.
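As one illustration of what configuration file management buys you, the sketch below keeps experiment parameters in a single typed object loaded from JSON instead of scattering them through the code. This is a standard-library stand-in, not the tooling the book uses; the `TrainConfig` class, its fields and the `load_config` helper are hypothetical names invented for this example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Hypothetical parameters, chosen only for illustration.
    test_size: float = 0.2
    random_seed: int = 42


def load_config(text: str) -> TrainConfig:
    """Parse JSON text into a TrainConfig.

    Keys missing from the JSON fall back to the defaults above; unknown keys
    raise TypeError, so a typo in the config file fails loudly.
    """
    return TrainConfig(**json.loads(text))


if __name__ == "__main__":
    config = load_config('{"test_size": 0.3}')
    print(config)
```

Because the dataclass is frozen, the parameters cannot be changed by accident halfway through a run, which makes it easier to say exactly which settings produced a given result.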
Pre-commit tasks (source: Build a Reproducible and Maintainable Data Science Project by Khuyen Tran)
Make sure your code conforms to the PEP 8 style guide? Covered.
Version your datasets and store them online? Check.
Delete notebook outputs before commits? Yes.
Document your code as you go? You bet!
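To show what a task like deleting notebook outputs before commits involves, here is a minimal standard-library sketch. It assumes the version 4 notebook JSON layout, in which code cells carry `outputs` and `execution_count` fields; the function names `strip_outputs` and `strip_file` are hypothetical. In a real project you would more likely let a dedicated tool (nbstripout is one example) do this from a pre-commit hook rather than maintain your own script.

```python
import json
from pathlib import Path


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs removed."""
    # A JSON round trip gives a deep copy, so the caller's dict is untouched.
    cleaned = json.loads(json.dumps(notebook))
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def strip_file(path: Path) -> None:
    """Rewrite the notebook at `path` in place without its outputs."""
    notebook = json.loads(path.read_text(encoding="utf-8"))
    cleaned = strip_outputs(notebook)
    path.write_text(json.dumps(cleaned, indent=1) + "\n", encoding="utf-8")
```

Stripping outputs keeps diffs small and readable, and stops stale results or large embedded images from ending up in version control.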
Tran’s free online book is a useful resource for beginners and seasoned practitioners alike. Using the methods therein will certainly improve the reliability of your code, the maintainability of your implementations, and the reproducibility of your projects, while allowing for increased complexity.
Don’t let the aspects of your project over which you have full control, namely structure and implementation, be your downfall; follow the plan detailed by Tran in this book to ensure that you build a reproducible and maintainable data science project.
Matthew Mayo (@mattmayo13) is a data scientist and the editor of KDnuggets, the leading online resource on data science and machine learning. His interests are in natural language processing, algorithm design and optimization, unsupervised learning, neural networks and automated approaches to machine learning. Matthew has a master’s degree in computer science and a graduate degree in data mining. He can be reached at editor1 at kdnuggets[dot]com.