Using Pixi and uv on a supercomputer
A step-by-step guide on how to run Python code in an environment managed by Pixi or uv on a supercomputer
January 04, 2025 - Radovan Bast
Motivation
Python projects often depend on a variety of libraries and tools. The two standard ways to manage these dependencies are Conda environments (many tools exist to manage those) and pip (virtual environments).
One difficulty on a supercomputer with a network file system is that installing dependencies can be slow and can create tens of thousands of files, which are then slow to access. We need to be very careful about where we write these files and where we keep cache files.
In this blog post, I will show how to manage dependencies using Pixi (modern take on managing Conda environments) and uv (modern take on managing virtual environments). These tools are modern, open source, cross-platform, super fast, and designed for ergonomic use and reproducibility. I believe that it is not too early to start using them.
Reproducibility is what we want and therefore we will define our environments using files (environment.yml for Pixi and requirements.txt for uv).
Installing a Conda environment with Pixi
My starting point is three files:
.
├── environment.yml
├── example.py
└── run.sh
0 directories, 3 files
My example environment.yml file happens to contain the following:
name: course
channels:
- conda-forge
dependencies:
- python <= 3.12
- jupyterlab
- altair-all
- vega_datasets
- pandas
- numpy
- scalene
- ruff
- icecream
We don't really need to see the content of example.py. It is the file that I wish to run in the environment defined by environment.yml.
The run.sh file is the interesting part:
#!/usr/bin/env bash
#SBATCH --account=nn9997k
#SBATCH --job-name='test'
#SBATCH --time=0-00:10:00
#SBATCH --mem-per-cpu=2G
#SBATCH --ntasks=1
# how to adapt this for your project:
# - change the account (above)
# - check and adjust PIXI_CACHE_DIR (below)
# - place this script in your project folder, not in your home
# use safe bash settings
set -euf -o pipefail
PIXI_VERSION="v0.39.4"
# we have to re-define the cache directory otherwise it writes files to
# ~/.cache and fills home quota
# it would make sense to use one cache for all projects or your entire research
# group but don't put it in your home
# just for the sake of demonstration it is now placed in ${USERWORK} but
# project space would be better
export PIXI_CACHE_DIR=${USERWORK}/cache
# if pixi does not exist, download it
if [ ! -e pixi ]; then
    wget -qO- https://github.com/prefix-dev/pixi/releases/download/${PIXI_VERSION}/pixi-x86_64-unknown-linux-musl.tar.gz | tar -xz
fi
# create pixi.toml from environment.yml
if [ ! -e pixi.toml ]; then
    ./pixi init --import environment.yml
fi
# finally here is your example code running inside the environment
./pixi run python example.py
I send it to the queue with sbatch run.sh. The first time I run this script, it will download pixi, create a pixi.toml file from environment.yml, and install dependencies.
The second time I run this script, it will not need to re-install dependencies and the run will take a minute less. In other words, it is not a problem to have the installation as part of the job, but you can also split this into two scripts: one for installation and one for running the code.
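The split into an installation step and a run step could look roughly like the sketch below. The function names install_env and run_code and the PIXI variable are hypothetical and only serve the illustration; the pixi commands are the ones from run.sh, plus pixi install, which installs the environment without running anything.

```shell
#!/usr/bin/env bash
# Sketch only: install_env and run_code are hypothetical helper names.
# PIXI points at the pixi binary that run.sh downloads into the project folder.
PIXI="${PIXI:-./pixi}"

# Installation step: create pixi.toml once, then install the environment.
install_env() {
    if [ ! -e pixi.toml ]; then
        "${PIXI}" init --import environment.yml
    fi
    "${PIXI}" install
}

# Run step: execute the example inside the already installed environment.
run_code() {
    "${PIXI}" run python example.py
}
```

An installation script would end with a call to install_env and the job script with a call to run_code. Both would still need the PIXI_CACHE_DIR export and the download step from run.sh above.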
This is how my folder looks after the run:
.
├── environment.yml
├── example.py
├── .gitattributes
├── .gitignore
├── .pixi
├── pixi
├── pixi.lock
├── pixi.toml
├── run.sh
└── slurm-13566209.out
1 directory, 9 files
The .pixi folder contains the environment. The environment is defined by pixi.toml and the lock file is pixi.lock.
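To get a feeling for how many files such an environment places on the file system, which was the concern raised in the motivation, one could count them. This is a minimal sketch; the helper name count_files is my own and it assumes that no file name contains a newline.

```shell
# count_files DIR: print the number of regular files below DIR
# (hypothetical helper; assumes file names without newlines)
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}
```

Calling count_files .pixi in the project folder prints the number of files in the Pixi environment. The same helper works for the venv folder in the uv example.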
Creating a virtual environment using uv
As in the previous example, I start with three files:
.
├── example.py
├── requirements.txt
└── run.sh
0 directories, 3 files
This time I will install dependencies defined in the file requirements.txt which happens to contain the following:
numpy
matplotlib
pandas
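This requirements.txt lists package names without versions, so an installation next month may resolve to different versions than an installation today. As a hedged sketch, the following helper (the name unpinned_requirements is hypothetical) prints the lines that carry no exact == pin. It assumes a simple file with one requirement per line and would also flag option lines such as -r or --index-url.

```shell
# unpinned_requirements FILE: print requirement lines without an exact "==" pin
# (hypothetical helper; blank lines and comment lines are ignored)
unpinned_requirements() {
    grep -v -E '^[[:space:]]*(#|$)' "$1" | grep -v '==' || true
}
```

For the file above, unpinned_requirements requirements.txt prints all three package names. One way to obtain a fully pinned file is uv pip compile, which writes exact versions for an input file of loose requirements.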
The interesting part is again the run.sh file:
#!/usr/bin/env bash
#SBATCH --account=nn9997k
#SBATCH --job-name='test'
#SBATCH --time=0-00:10:00
#SBATCH --mem-per-cpu=2G
#SBATCH --ntasks=1
# how to adapt this for your project:
# - change the account (above)
# - check and adjust UV_CACHE_DIR (below)
# - place this script in your project folder, not in your home
# use safe bash settings
set -euf -o pipefail
UV_VERSION="0.5.14"
# we have to re-define the cache directory otherwise it writes files to
# ~/.cache and fills home quota
# it would make sense to use one cache for all projects or your entire research
# group but don't put it in your home
# just for the sake of demonstration it is now placed in ${USERWORK} but
# project space would be better
export UV_CACHE_DIR=${USERWORK}/cache
# if uv does not exist, download it
if [ ! -e uv ]; then
    wget -q https://github.com/astral-sh/uv/releases/download/${UV_VERSION}/uv-x86_64-unknown-linux-musl.tar.gz -O - | tar xz --strip-components=1 -C . uv-x86_64-unknown-linux-musl/uv
fi
# create a fresh virtual environment unless it exists
if [ ! -d venv ]; then
    ./uv venv venv
fi
# activate the environment
source venv/bin/activate
# synchronize environment with requirements.txt
# in other words: install dependencies into the venv
./uv pip sync requirements.txt
# run the example code
./uv run python example.py
I send it to the queue with sbatch run.sh. The first time I run this script, it will download uv, create the virtual environment venv, and install dependencies defined in requirements.txt.
The second time I run this script, it will not need to re-install dependencies and the run will take a minute less. In other words, it is not a problem to have the installation as part of the job, but you can also split this into two scripts: one for installation and one for running the code.
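Both job scripts redefine the cache directory so that nothing lands in ~/.cache. A small guard can catch a cache path that accidentally points into home. This is only a sketch: the function name cache_dir_outside_home is hypothetical, it assumes that HOME is set, and it compares path prefixes as plain strings without resolving symbolic links.

```shell
# cache_dir_outside_home DIR: succeed only if DIR is non-empty and not below $HOME
# (hypothetical helper; plain string comparison, symlinks are not resolved)
cache_dir_outside_home() {
    case "${1:-}" in
        "")
            echo "cache directory is not set" >&2
            return 1
            ;;
        "${HOME}"|"${HOME}"/*)
            echo "cache directory $1 is inside home" >&2
            return 1
            ;;
        *)
            return 0
            ;;
    esac
}
```

In run.sh, a line such as cache_dir_outside_home "${UV_CACHE_DIR}" || exit 1 right after the export would stop the job before anything is downloaded.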
This is how my folder looks after the run:
.
├── example.py
├── requirements.txt
├── run.sh
├── slurm-13566280.out
├── uv
└── venv
1 directory, 5 files
The venv folder contains the environment. It is good that the environment is close to the calculation since both belong together to make the calculation reproducible.