
Dockerizing Data Science Projects: A Complete Guide


Introduction to Docker

In the realm of software development, Docker plays a pivotal role. It is a robust tool designed to simplify the process of creating, deploying, and executing applications through the use of containers. These containers enable developers to bundle an application alongside all its necessary components, including libraries and dependencies, allowing for a unified package that can be easily shared.

Benefits of Utilizing Docker

One common issue developers face is inconsistent code execution across machines: code that runs perfectly on one system fails on another because of differences in library versions or dependencies. Docker addresses this by packaging the application together with its dependencies, so it runs the same way on any machine with Docker installed, unaffected by that machine's own configuration.

Assumptions and Objectives

For our purposes, let’s assume we have a local directory named "Analysis," which contains two scripts: Analysis_1.py and Analysis_2.py. Our goal is to develop a container capable of executing both scripts and producing the desired outputs. Additionally, our CSV files are located in a "data" folder at the same directory level as "Analysis."
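For reference, the assumed project layout looks roughly like this (the Dockerfile, requirements.txt, and entrypoint.sh files are created in the steps that follow):

Analysis/
    Analysis_1.py
    Analysis_2.py
data/
    (your CSV files)
Dockerfile
requirements.txt
entrypoint.sh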

1. Installing Docker

Follow the official Docker installation instructions for your operating system to install Docker on your machine.
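Once Docker is installed, you can confirm it is working by checking the version and running Docker's test image:

docker --version

docker run hello-world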

2. Setting Up requirements.txt and entrypoint.sh

We need to specify the libraries required for our container in a file named requirements.txt. Here’s what to include:

pandas

seaborn

scipy

xlrd

matplotlib
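For better reproducibility, consider pinning each library to the exact version your scripts were developed against. The version numbers below are purely illustrative; use whatever matches your local environment:

pandas==0.25.3

seaborn==0.9.0

scipy==1.3.3

xlrd==1.2.0

matplotlib==3.1.2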

To run multiple scripts, we also require a bash script, which will be saved as entrypoint.sh:

#!/bin/bash

set -e  # stop if either script fails

python Analysis_1.py

python Analysis_2.py
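Because entrypoint.sh is executed directly, it also needs to be marked executable. The Dockerfile in the next step does this inside the image, but you can optionally do the same on your host:

chmod +x entrypoint.sh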

3. Configuring the Dockerfile

In the project directory (the one containing Analysis, data, requirements.txt, and entrypoint.sh), create a file named Dockerfile and insert the following content:

# Start with Python version 3.7.2
FROM python:3.7.2

# MAINTAINER is deprecated; record the author with a label instead
LABEL maintainer="Hari Devanathan"

# Set up environment variable for unbuffered output
ENV PYTHONUNBUFFERED=1

# Work out of a dedicated directory inside the image
WORKDIR /app

# Copy the analysis scripts
COPY Analysis/Analysis_1.py Analysis_1.py
COPY Analysis/Analysis_2.py Analysis_2.py

# Install system packages needed by the scientific Python libraries
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        libatlas-base-dev gfortran && \
    rm -rf /var/lib/apt/lists/*

# Copy the requirements file
COPY requirements.txt requirements.txt

# Copy data files
COPY ./data ./data

# Install the required Python packages
RUN pip install -r requirements.txt

# Copy the entrypoint script and make it executable
COPY entrypoint.sh entrypoint.sh
RUN chmod +x entrypoint.sh

# Execute the entrypoint script when the container starts
ENTRYPOINT ["./entrypoint.sh"]

# Open port 8888 for access

EXPOSE 8888
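A note on instruction ordering: Docker caches each instruction as a layer, so placing the requirements copy and pip install before the script and data copies means dependencies are not reinstalled every time a script or CSV changes. A minimal sketch of that alternative ordering, using the same file names as above:

# Install dependencies first so they are cached across rebuilds
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

# Copy code and data afterwards; editing them reuses the cached layers above
COPY Analysis/Analysis_1.py Analysis_1.py
COPY Analysis/Analysis_2.py Analysis_2.py
COPY ./data ./data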

4. Building the Docker Image

Ensure your directory contains the following files:

Analysis

data

Dockerfile

requirements.txt

entrypoint.sh

To build a Docker image tagged dataanalysis, execute the following command from inside that directory:

docker build --tag=dataanalysis .

To verify the image creation, run:

docker image ls

You should see an output similar to:

REPOSITORY TAG IMAGE ID

dataanalysis latest 326387cea398

Finally, run your container with:

docker run dataanalysis

(The -p flag of docker run takes a port mapping such as -p 8888:8888; it is only needed if the container serves something on that port, which these batch scripts do not.)
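If Analysis_1.py and Analysis_2.py write their results to disk, those files disappear with the container unless you mount a host folder. The command below is a sketch that assumes the scripts write into an output folder under the /app working directory used in the Dockerfile; substitute the actual path your scripts use:

docker run -v "$(pwd)/output:/app/output" dataanalysis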

Conclusion

In summary, we have built a straightforward Docker image that runs two Python analysis scripts. This setup can easily be shared with colleagues, ensuring they can execute your analysis without environment issues. The template can be further modified to incorporate additional data analysis scripts or projects.
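For instance, adding a third script (Analysis_3.py is a hypothetical name here) would only require one more line in the Dockerfile:

COPY Analysis/Analysis_3.py Analysis_3.py

and one more line in entrypoint.sh:

python Analysis_3.py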

Thank you for reading! If you wish to explore more of my work, check out my Table of Contents.

If you're not a Medium paid member but are interested in subscribing to Towards Data Science for access to tutorials and articles like this, click here to enroll in a membership. This referral helps support my work on Medium.

Chapter 1: Getting Started with Docker

This video demonstrates how to Dockerize a data science project, providing a visual guide to the steps involved.

Chapter 2: Advanced Docker Techniques

In this video, learn advanced Docker techniques tailored specifically for data scientists, enhancing your project deployment strategies.
