Dev on Docker, Deploy on Singularity: The MLOps Workflow You've Been Missing

You’ve perfected your machine learning model in a Docker container on your local machine. But how do you run it on a secure, shared high-performance computing (HPC) cluster? The answer is Singularity. Let’s walk through the process.

Introduction

You’ve spent weeks crafting the perfect environment in a Dockerfile. Your code runs flawlessly on your laptop. You git push and log into your university’s or company’s HPC cluster, ready to scale up… and then you hit a wall. There’s no Docker daemon. You can’t run as root. What now?

The problem: HPC clusters don’t run a Docker daemon, and they won’t give you root. Docker’s engine is built around a root-level daemon, which is exactly what administrators of shared, multi-tenant systems cannot allow. Your carefully built image has nowhere to run.

The Solution: The Singularity Bridge

This is precisely the problem that Singularity (now officially Apptainer) was designed to solve. It has become the de facto standard for containers in scientific and HPC settings.
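
Before diving in, confirm that Singularity or Apptainer is actually available on your cluster. On many systems it is exposed as an environment module; the module name varies from site to site, so treat the commands below as a sketch:

# On the cluster login node; module names differ by site
module avail singularity
module load singularity        # or: module load apptainer
singularity --version          # some clusters expose the command as 'apptainer' instead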

It’s crucial to frame Singularity not as a replacement for Docker, but as a bridge. This approach gives you the best of both worlds:

  1. Docker for development: fast local iteration, a huge ecosystem of base images, and tooling you already know.
  2. Singularity for deployment: rootless execution, a single portable .sif file, and first-class support on shared HPC systems.

The Foundation: Your Dockerfile

Let’s create a simple PyTorch environment to run an image classification script.

Setup

To run this with GPU support, make sure nvidia-docker (the NVIDIA Container Toolkit) is installed on your machine. Our Dockerfile will build on an NVIDIA base image that comes with CUDA and cuDNN preinstalled (check which CUDA/cuDNN version you need for your GPU and driver).

# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Install nvidia-container-runtime

sudo apt install nvidia-container-runtime

Edit/create /etc/docker/daemon.json with:

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}

Restart the Docker daemon:

sudo systemctl restart docker
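
As a quick sanity check that Docker can now see your GPU, run nvidia-smi inside a CUDA container (the image tag below is only an example; pick one that matches your driver):

docker run --rm --runtime=nvidia nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi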

Dockerfile

The following Dockerfile assumes you have completed the setup above and that a requirements.txt listing the Python packages you need sits in the same folder.
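
If you don’t have a requirements.txt yet, here is a hypothetical one created from the shell; PyTorch itself already ships with the NVIDIA base image, so only list the extras your script needs:

# Hypothetical requirements.txt contents; adjust to your project
cat > requirements.txt <<'EOF'
scikit-learn
pandas
tqdm
EOF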

# Use a pre-built PyTorch image from NVIDIA so you don't have to install
# Python, CUDA, cuDNN, or PyTorch yourself.
FROM nvcr.io/nvidia/pytorch:23.10-py3

# Set the working directory inside the container
WORKDIR /

# Copy just the requirements file first to leverage Docker's layer caching
COPY requirements.txt .

# Install your Python packages
RUN pip install --no-cache-dir -r requirements.txt 

After writing the above Dockerfile, build it by running the following command from the folder containing the Dockerfile.

docker build --rm -t image_name .
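
You can confirm that the build succeeded and check the image size with:

docker image ls image_name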

Before the final singularity build step, which is resource-intensive, test the image with docker run.

#!/bin/bash
# Mount the parent project directory into the container and drop into an interactive shell
CUR_DIR=$(pwd)
PROJ_DIR=$(dirname "$CUR_DIR")
CMD="docker run -it --runtime=nvidia --volume=$PROJ_DIR:/openood image_name:latest"
echo "$CMD"
eval "$CMD"
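
Once the container shell opens, a one-liner confirms that PyTorch can see the GPU:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"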

Building the Singularity Container

Think of the Singularity container as a single image file that freezes a copy of your environment: you can run it anywhere without worrying about pip package versions or a specific CUDA version, which makes your code reproducible on any machine with minimal effort.
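
Note that building the .sif locally means you need Singularity or Apptainer installed alongside Docker. One option on Ubuntu is the Apptainer PPA; treat this as a sketch and check the Apptainer installation docs for your distribution:

sudo add-apt-repository -y ppa:apptainer/ppa
sudo apt-get update
sudo apt-get install -y apptainer
apptainer --version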

Create the .sif file

singularity build image_name.sif docker-daemon://image_name:latest

You can also build your Singularity image directly from an image hosted on a registry such as Docker Hub:

singularity build image_name.sif docker://your-username/my-ml-app:latest
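
For the registry route to work, your image has to be pushed there first. With a Docker Hub account (your-username and my-ml-app are placeholders), that looks like:

docker login
docker tag image_name:latest your-username/my-ml-app:latest
docker push your-username/my-ml-app:latest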

Running the Script

After building the Singularity/Apptainer .sif, here is how to run our Python script, main.py.

singularity_path=image_name.sif
dataset_dir=/path/to/dataset           # host directory you want visible inside the container
env_vars=CUDA_LAUNCH_BLOCKING=1
singularity exec --nv --bind $dataset_dir --home tmp_dir --env $env_vars $singularity_path python main.py

Here’s a breakdown of that singularity exec command:

  - exec: run an arbitrary command (here, python main.py) inside the container.
  - --nv: expose the host's NVIDIA GPUs and driver libraries inside the container.
  - --bind $dataset_dir: mount a host directory into the container so your data is visible at the same path.
  - --home tmp_dir: use tmp_dir as the container's home directory, handy when your real home is quota-limited or slow.
  - --env $env_vars: pass environment variables (here CUDA_LAUNCH_BLOCKING=1) into the container.
  - $singularity_path: the .sif image everything runs inside.
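
The last mile is copying the .sif to the cluster (for example with scp image_name.sif your-user@hpc-login:/path/to/project/, where the host and path are placeholders) and submitting a batch job that runs the same command. Below is a hypothetical SLURM script; the partition name, GPU request, time limit, and paths all need to be adapted to your cluster:

#!/bin/bash
#SBATCH --job-name=ml-train
#SBATCH --partition=gpu            # queue/partition names vary by cluster
#SBATCH --gres=gpu:1               # request one GPU
#SBATCH --time=04:00:00

cd /path/to/project
singularity exec --nv --bind /path/to/dataset --env CUDA_LAUNCH_BLOCKING=1 image_name.sif python main.py

Submit it with sbatch run_job.sh and check on it with squeue. If your cluster uses a different scheduler, only the job header changes; the singularity exec line stays the same.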

Conclusion: A Better Way to Work

By now, you should see the power of this workflow. It’s a simple yet profound shift in how we bridge local development with large-scale computation. Let’s quickly recap the entire process:

  1. Develop & Iterate: Use the flexibility of Docker and a Dockerfile on your local machine to craft the perfect environment.
  2. Build & Package: Convert your final Docker image into a single, secure, and portable Singularity .sif file with a one-line command.
  3. Deploy & Scale: Copy that single file to your HPC cluster and run your jobs with confidence, accessing GPUs (--nv) and file systems (--bind) seamlessly.

You no longer have to choose between the convenience of a Docker-based environment and the raw power of an HPC cluster. This workflow gives you both. It’s about more than just containers; it’s about making your research more reproducible, portable, and efficient.

Now it’s your turn. Take one of your existing ML projects with a Dockerfile and try building a Singularity image. Run a job on your cluster. See for yourself how smooth the process can be.