Dev on Docker, Deploy on Singularity: The MLOps Workflow You've Been Missing
You’ve perfected your machine learning model in a Docker container on your local machine. But how do you run it on a secure, shared high-performance computing (HPC) cluster? The answer is Singularity. Let’s walk through the process.
Introduction
You’ve spent weeks crafting the perfect environment in a Dockerfile. Your code runs flawlessly on your laptop. You git push and log into your university’s or company’s HPC cluster, ready to scale up… and then you hit a wall. There’s no Docker daemon. You can’t run as root. What now?
The problem:
- Security: Docker requires root privileges, which is a major security risk on a shared, multi-user system.
- Environment: HPC systems have their own resource managers (like Slurm or PBS) and specialized hardware (GPUs, InfiniBand) that Docker doesn’t integrate with natively.
The Solution: The Singularity Bridge
This is precisely the problem that Singularity (now officially Apptainer) was designed to solve. It has become the industry standard for containers in scientific and High-Performance Computing (HPC) settings.
It’s crucial to frame Singularity not as a replacement for Docker, but as a bridge. This approach allows you to get the best of both worlds:
- Docker’s easy and familiar development ecosystem.
- Singularity’s secure, rootless, and high-performance execution on shared systems.
Now let’s start the tutorial by writing a simple Dockerfile for a PyTorch-based project.
The Foundation: Your Dockerfile
Let’s create a simple PyTorch environment to run an image classification script.
Setup
Before building, make sure the NVIDIA Container Toolkit (nvidia-docker) is installed so Docker can access your GPUs. Our Dockerfile will use an NVIDIA base image that comes with CUDA and cuDNN preinstalled (check NVIDIA's release notes for the version that matches your driver).
# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
Install nvidia-container-runtime
sudo apt install nvidia-container-runtime
Edit (or create) /etc/docker/daemon.json with the following contents:
{
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
},
"default-runtime": "nvidia"
}
Restart the Docker daemon:
sudo systemctl restart docker
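To sanity-check the setup, you can try running nvidia-smi inside a CUDA base container (the image tag below is only an example; pick one that matches your driver):
docker run --rm --runtime=nvidia nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
If your GPUs are listed in the output, Docker can see them and you are ready to build.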
Dockerfile
The following Dockerfile assumes you have completed the setup above and that you have a requirements.txt listing the Python packages your project needs.
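For illustration, a minimal requirements.txt might look like this (the package names are placeholders; list whatever your project actually needs, and skip torch since the base image already ships with it):
numpy
pandas
scikit-learn
tqdm
With that in place, the Dockerfile itself stays short.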
# Use a pre-built PyTorch image from NVIDIA. This is the best practice.
# It already includes Python, CUDA, and cuDNN.
FROM nvcr.io/nvidia/pytorch:23.10-py3
# Set the working directory inside the container
WORKDIR /
# Copy just the requirements file first to leverage Docker's layer caching
COPY requirements.txt .
# Install your Python packages
RUN pip install --no-cache-dir -r requirements.txt
After writing the Dockerfile, build the image by running the following command from the folder containing the Dockerfile.
docker build --rm -t image_name .
Before the final singularity build step, which is an intensive task, we need to test our Docker image with docker run.
#!/bin/bash
# Current directory and its parent (the project root)
CUR_DIR=$(pwd)
PROJ_DIR=$(dirname "$CUR_DIR")
# Mount the project into the container and start an interactive session
CMD="docker run -it --runtime=nvidia --volume=$PROJ_DIR:/openood image_name:latest"
echo $CMD
eval $CMD
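Once inside the container, a quick way to confirm that PyTorch can see the GPU is:
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
If this prints True, the image works as expected and is ready to be converted.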
Building the Singularity Container
Think of a Singularity container as a single image file that packages up a copy of your environment and can be used anywhere to run your code, without worrying about pip package versions or a specific CUDA version. This setup makes your code reproducible on any machine with minimal effort.
Create the .sif file
singularity build image_name.sif docker-daemon://image_name:latest
You can also build your Singularity image directly from an image hosted on a registry such as Docker Hub:
singularity build image_name.sif docker://your-username/my-ml-app:latest
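Before copying the image to the cluster, it’s worth a quick local sanity check (assuming you have a GPU and Singularity/Apptainer installed on your machine):
singularity exec --nv image_name.sif python -c "import torch; print(torch.__version__, torch.cuda.is_available())"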
Running the Script
After building the Singularity/Apptainer .sif, how do we run our Python script, main.py?
singularity_path=image_name.sif
dataset_dir=/path/to/datasets   # host directory containing your data
tmp_dir=/path/to/tmp_home       # clean scratch directory to use as home
env_vars=CUDA_LAUNCH_BLOCKING=1
singularity exec --nv --bind $dataset_dir --home $tmp_dir --env $env_vars $singularity_path python main.py
Here’s a breakdown of that singularity exec command:
- --nv : The magic flag that enables NVIDIA GPU support inside the container.
- --bind $dataset_dir : Mounts the host’s dataset directory into the container. Singularity automatically makes it available at the same path.
- --home $tmp_dir : Sets a temporary, clean home directory for the job to run in.
- --env $env_vars : Sets any necessary environment variables inside the container.
- $singularity_path : The path to your .sif image file.
- python main.py : The command you want to execute inside the container’s environment.
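On an HPC cluster you would typically wrap this command in a batch script for the scheduler. As a rough sketch for a Slurm-managed cluster (job name, resource requests, and paths are placeholders to adapt to your site):
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=04:00:00

# Some clusters expose Singularity/Apptainer through a module system
module load singularity

singularity exec --nv --bind /path/to/datasets image_name.sif python main.py
Submit it with sbatch and monitor it with squeue; the container runs exactly the environment you built locally.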
Conclusion: A Better Way to Work
By now, you should see the power of this workflow. It’s a simple yet profound shift in how we bridge local development with large-scale computation. Let’s quickly recap the entire process:
- Develop & Iterate: Use the flexibility of Docker and a Dockerfile on your local machine to craft the perfect environment.
- Build & Package: Convert your final Docker image into a single, secure, and portable Singularity .sif file with a one-line command.
- Deploy & Scale: Copy that single file to your HPC cluster and run your jobs with confidence, accessing GPUs (--nv) and file systems (--bind) seamlessly.
You no longer have to choose between the convenience of a Docker-based environment and the raw power of an HPC cluster. This workflow gives you both. It’s about more than just containers; it’s about making your research more reproducible, portable, and efficient.
Now it’s your turn. Take one of your existing ML projects with a Dockerfile and try building a Singularity image. Run a job on your cluster. See for yourself how smooth the process can be.