Dockerizing Data Science Projects: A Complete Guide
Introduction to Docker
In the realm of software development, Docker plays a pivotal role. It is a robust tool designed to simplify the process of creating, deploying, and executing applications through the use of containers. These containers enable developers to bundle an application alongside all its necessary components, including libraries and dependencies, allowing for a unified package that can be easily shared.
Benefits of Utilizing Docker
One common issue developers face is inconsistent code execution across machines: code that runs perfectly on one system fails on another because of discrepancies in library versions or dependencies. Docker addresses this challenge by packaging the application together with its environment, so it runs the same way on any machine with Docker installed, unaffected by each host's unique configuration.
Assumptions and Objectives
For our purposes, let’s assume we have a local directory named "Analysis," which contains two scripts: Analysis_1.py and Analysis_2.py. Our goal is to develop a container capable of executing both scripts and producing the desired outputs. Additionally, our CSV files are located in a "data" folder at the same directory level as "Analysis."
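For reference, the assumed layout looks like this (the top-level folder name is arbitrary):
project/
├── Analysis/
│   ├── Analysis_1.py
│   └── Analysis_2.py
└── data/
    └── (your CSV files)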
1. Installing Docker
Follow the official instructions in the Docker documentation (docs.docker.com) to install Docker on your machine.
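Once the installation completes, you can verify it from a terminal; hello-world is a tiny test image published by Docker that simply prints a welcome message:
docker --version
docker run hello-world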
2. Setting Up requirements.txt and entrypoint.sh
We need to specify the libraries required for our container in a file named requirements.txt. Here’s what to include:
pandas
seaborn
scipy
xlrd
matplotlib
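Note that these unpinned entries install whatever versions are current at build time. If you need reproducible builds, pin explicit versions instead; the numbers below are only an illustration, not a tested combination:
pandas==0.25.3
seaborn==0.9.0
scipy==1.3.3
xlrd==1.2.0
matplotlib==3.1.2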
To run multiple scripts, we also require a bash script, which will be saved as entrypoint.sh:
#!/bin/bash
python Analysis_1.py
python Analysis_2.py
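If you want the container to fail loudly when either script errors, a slightly more defensive sketch of the same script adds set -e, which stops the shell at the first failing command:
#!/bin/bash
# Exit immediately if any command returns a non-zero status
set -e
python Analysis_1.py
python Analysis_2.py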
3. Configuring the Dockerfile
Create a new directory on your local machine and navigate to it. Inside, create a file named Dockerfile, then insert the following content:
# Start with Python version 3.7.2
FROM python:3.7.2
LABEL maintainer="Hari Devanathan"
# Set up environment variable for unbuffered output
ENV PYTHONUNBUFFERED=1
ADD Analysis/Analysis_1.py Analysis_1.py
ADD Analysis/Analysis_2.py Analysis_2.py
# Update installed packages and install build dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    libatlas-base-dev gfortran
# Copy the requirements file
COPY requirements.txt requirements.txt
# Copy data files
COPY ./data ./data
# Install the required Python packages
RUN pip install -r requirements.txt
# Copy the entrypoint script, make it executable, and set it as the entrypoint
COPY entrypoint.sh entrypoint.sh
RUN chmod +x entrypoint.sh
ENTRYPOINT [ "./entrypoint.sh" ]
# Open port 8888 for access
EXPOSE 8888
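One optional refinement: a .dockerignore file placed next to the Dockerfile keeps unneeded files out of the build context, which speeds up builds. A minimal sketch, with example entries rather than requirements:
.git
__pycache__/
*.pyc
.DS_Store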
4. Building the Docker Image
Ensure your directory contains the following files:
Analysis
data
Dockerfile
requirements.txt
entrypoint.sh
To build a Docker image tagged dataanalysis, execute the following command:
docker build --tag=dataanalysis .
To verify the image creation, run:
docker image ls
You should see an output similar to:
REPOSITORY     TAG      IMAGE ID
dataanalysis   latest   326387cea398
Finally, run your container with:
docker run dataanalysis
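Note that anything the scripts write stays inside the container's filesystem and vanishes with the container. A common pattern is to mount a local folder as a volume so outputs persist on your machine; this sketch assumes the scripts write their results to /output, which would require modifying them accordingly:
docker run -v "$(pwd)/output:/output" dataanalysis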
Conclusion
In summary, we have successfully built a straightforward Docker container to run two Python analysis scripts. This setup can easily be shared with colleagues, ensuring they can execute your analysis without issues. The template can be further modified to incorporate additional data analysis scripts or projects.
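As one concrete way to do that sharing, the image can be exported to an archive and loaded on a colleague's machine (pushing to a registry such as Docker Hub is the other common route):
# On your machine: export the image to a compressed archive
docker save dataanalysis | gzip > dataanalysis.tar.gz
# On the colleague's machine: load it back
docker load < dataanalysis.tar.gz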
Thank you for reading! If you wish to explore more of my work, check out my Table of Contents.
Chapter 1: Getting Started with Docker
This video demonstrates how to Dockerize a data science project, providing a visual guide to the steps involved.
Chapter 2: Advanced Docker Techniques
In this video, learn advanced Docker techniques tailored specifically for data scientists, enhancing your project deployment strategies.