How we deployed UMetaFlow to a cluster
A step-by-step guide on how we installed UmetaFlow on the Saga cluster
July 31, 2024 - Radovan BastFor a project with the Department of Medical Biology at UiT we wanted to deploy UMetaFlow on the Saga cluster.
First we experimented with containerizing the installation using Apptainer/SingularityCE. However, we found that there was "too little" on the container side and "too much" outside of the container and decided to install UMetaFlow and its dependencies more directly, without a container, and by using a Conda environment.
We then created an install script to set up the Conda environment. However, this took hours on the network file system because of the thousands of tiny files. Sometimes it timed out, and sometimes it crashed with surprising errors.
Therefore, we experimented with an approach where we set up the Conda environment "on the fly" on the local disk of a compute node.
Below is a step-by-step guide on how we ended up using it.
- Cloning the repository
- Installing SIRIUS
- Conda environment definition file
- Create a file that holds the credentials for SIRIUS
- Slurm script to run the workflow
Cloning the repository
We first clone the snakemake_UmetaFlow repository.
I did this in the project directory on the Saga cluster (/cluster/projects/nn____k/
):
$ git clone https://github.com/biosustain/snakemake_UmetaFlow.git
$ cd snakemake_UmetaFlow
All other commands are run from within this newly cloned repository.
Installing SIRIUS
To install SIRIUS, I used the following script (called install-sirius.sh
):
#!/usr/bin/env bash
# settings to catch errors in bash scripts
set -euf -o pipefail
wget https://github.com/sirius-ms/sirius/releases/download/v6.0.1/sirius-6.0.1-linux64.zip
unzip sirius-6.0.1-linux64.zip -d resources
rm -f sirius-6.0.1-linux64.zip
This script automatically places the SIRIUS binary and libraries in the
resources
directory.
Conda environment definition file
Save the following file as environment.yml
:
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- python=3.11
- networkx
- numpy
- openms
- pandas
- pip
- pyteomics
- python-dateutil
- rdkit
- snakemake
- pip:
- ipython
- openms
- mono-require
- pyopenms
- ms2query
Create a file that holds the credentials for SIRIUS
Save the following file as credentials.sh
(and replace with actual values):
export SIRIUS_EMAIL="firstname.lastname@example.org"
export SIRIUS_PASSWORD="**********"
Slurm script to run the workflow
Save the following file as run-umetaflow.sh
(please adjust --account=nn____k
):
#!/usr/bin/env bash
#SBATCH --account=nn____k
#SBATCH --job-name=test
#SBATCH --time=0-01:00:00
#SBATCH --mem-per-cpu=6G
#SBATCH --ntasks=4
#SBATCH --gres=localscratch:20G
# settings to catch errors in bash scripts
set -euf -o pipefail
# function to install and activate the environment "on the fly"
# requires setting --gres=localscratch (above)
# advantages: no impact on the network file system or home quota, often 100x faster install
# disadvantages: a bit wasteful in terms of network and additional 1-5 minutes for each job
install_and_activate_environment() {
local environment_file=$1
# install latest micromamba into local scratch
cd ${LOCALSCRATCH}
curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba
export MAMBA_ROOT_PREFIX=${LOCALSCRATCH}/mamba-prefix
eval "$(${LOCALSCRATCH}/bin/micromamba shell hook -s posix)"
# install and activate the environment
export TMPDIR=${LOCALSCRATCH}
time micromamba -y env create --prefix ${LOCALSCRATCH}/environment/ --file ${environment_file}
micromamba activate ${LOCALSCRATCH}/environment/
cd ${SLURM_SUBMIT_DIR}
}
# install and activate
install_and_activate_environment "${SLURM_SUBMIT_DIR}/environment.yml"
echo "successful conda installation!"
# this sets email and password for sirius
source credentials.sh
snakemake --cores ${SLURM_NTASKS}
The job script creates a Conda environment and activates it "on the fly". This takes only perhaps 3 or 5 minutes of the job run time. The disadvantage of this approach is that the environment disappears after the job finishes and is re-created at each run.
Note that in contrast to the
snakemake_UmetaFlow
documentation we use snakemake --cores ${SLURM_NTASKS}
and not snakemake --cores ${SLURM_NTASKS} --use-conda
. This change is important since in this
case we don't want to instruct Snakemake to install and manage the Conda
environments for us. Instead, we have already created the Conda environment
and we want to use it.
To learn more about Snakemake, please visit https://snakemake.github.io/.