Advanced configuration exercises
Before you start
- An input example file is provided in inputs/input_data.tsv. This file contains paths to FASTQ files for two samples, each split into multiple parts.
- A simple Nextflow pipeline is provided in the workflow directory. This pipeline reads the input TSV file, processes each FASTQ file part, and generates output files.
- An alternative version of this Nextflow pipeline is provided in the workflow_gpu directory. This will be used in some exercises to demonstrate how to configure GPU resources.
Before starting with the exercises, request an interactive session on the HPC with 4 threads and 12 GB of RAM, and load the Nextflow and Singularity modules:
srun --pty -c 4 -p cpu-interactive --mem 12G -J nextflow_training /bin/bash
module load nextflow/25.04.3 singularityce/3.10.3
It can also be convenient to store the full paths of the example pipelines in environment variables, so that you don't have to type them every time. You can do this with the following commands:
export PIPELINE_PATH=/my/work/folder/practicals/day3/1-advanced_config/workflow
export PIPELINE_GPU_PATH=/my/work/folder/practicals/day3/1-advanced_config/workflow_gpu
How we will customise the configuration
For each exercise:
- create a new folder in your working space.
- in this folder, customise the configuration by either:
  - creating a new nextflow.config file. Remember that if a file named nextflow.config is present in the directory where you run Nextflow, it will be merged with any other configuration file present in the pipeline directory.
  - creating a new config file with a dedicated name (e.g. my_config.config) and passing it to Nextflow using the -c option.
Inspect the example workflow and configuration
First, familiarise yourself with the example pipeline in the workflow directory. Notice how it is made up of 3 modules: bwa, samtools-merge and samtools-sort, which are defined in separate files: bwa.nf, samtools-merge.nf and samtools-sort.nf. In each file, notice that we also defined some configuration settings specifying the resource requirements for each process.
Before making any changes, run the following command to inspect the current configuration of the Nextflow pipeline. At the moment, the only configuration comes from the nextflow.config file located in the workflow directory.
nextflow config $PIPELINE_PATH
You should see output similar to the following:
params {
input_file = ''
reference_genome = ''
outdir = 'results'
publish_mode = 'copy'
}
Inspecting configuration changes
During the exercises you are encouraged to use the nextflow config command to inspect the resulting configuration before actually trying to run the pipeline.
If you decided to modify the configuration using a custom config file name (e.g. my_config.config instead of nextflow.config), remember to pass it to Nextflow using the -c option before the config command.
So, assuming you are in the folder you created for the exercise, where you created/modified your configuration file, you can run:
# In case you created nextflow.config in the current directory
nextflow config $PIPELINE_PATH
# In case you created a custom config file named my_custom_file.config
nextflow -c my_custom_file.config config $PIPELINE_PATH
Exercise 1 - Activate singularity
Code to solve the exercise in: practicals/day3/1-advanced_config/1-singularity
The files nextflow.config and singularity.config in this folder provide a possible solution to this exercise. The content is the same in both files, as they represent two different ways to achieve the same goal (using nextflow.config or a custom config file named singularity.config).
Setup
Make a new directory for this exercise and navigate into it:
mkdir -p my_work_dir/day3/1-advanced_config/1-singularity
cd my_work_dir/day3/1-advanced_config/1-singularity
Goal
Prepare a config file to enable Singularity support for containerised execution.
- In our setup we have pre-fetched container images available in a central location. We want to read images from there, but not store our new images in that location. For this, set the Singularity library directory to /project/nextflow_zero2hero/containers.
- To be sure we are not clogging our working directory with output files, change the cache directory to a subfolder in your scratch space, say /scratch/$USER/nextflow_cache.
- Additionally, add a Singularity run option to bind the /localscratch directory inside the container (required by our HPC setup).
Execution
For this you will need to set the relevant properties in the singularity scope.
Basically, you have to create a new config file (e.g. nextflow.config) with a singularity block and configure it properly:
singularity {
...
}
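As a sketch, a minimal version of this block covering the three goals above might look like the following. Treat it as one possible solution, not the only one; the env('USER') call follows the style used later in this page, and plain $USER in a double-quoted string also works.

```groovy
singularity {
    // enable containerised execution with Singularity
    enabled    = true
    // read pre-fetched images from the central library location...
    libraryDir = '/project/nextflow_zero2hero/containers'
    // ...but store newly pulled images in your scratch space
    cacheDir   = "/scratch/${env('USER')}/nextflow_cache"
    // bind /localscratch inside the container (required by our HPC setup)
    runOptions = '--bind /localscratch'
}
```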
Once you have created your config file, run the nextflow config command to inspect the final configuration and verify that the Singularity settings are correctly applied. If you created a custom config file, remember to use the -c option:
# In case you created nextflow.config in the current directory
nextflow config practicals/day3/1-advanced_config/workflow
# In case you created a custom config file named my_custom_file.config
nextflow -c my_custom_file.config config practicals/day3/1-advanced_config/workflow
Verify that the pipeline is now able to run and containers are downloaded to the specified cache directory. You can run the pipeline with the following command:
# In case you created nextflow.config in the current directory
nextflow run practicals/day3/1-advanced_config/workflow
# In case you created a custom config file named my_custom_file.config
nextflow run practicals/day3/1-advanced_config/workflow -c my_custom_file.config
Exercise 2 - Configure an executor
Code to solve the exercise in: practicals/day3/1-advanced_config/2-executor
Setup
Make a new directory for this exercise and navigate into it:
mkdir -p my_work_dir/day3/1-advanced_config/2-executor
cd my_work_dir/day3/1-advanced_config/2-executor
NB. From now on, we will build on top of the previous exercise, so make sure you have completed Exercise 1 and copied your config file from there into this new directory.
For example, if you created a config file named nextflow.config in the previous exercise, copy it into this new directory:
cp my_work_dir/day3/1-advanced_config/1-singularity/nextflow.config .
Goal
Modify your config file to set your processes to run using the SLURM executor and the cpuq queue, with a maximum of 3 concurrent jobs.
Execution
For this you will need to set relevant properties in the executor scope and process scope.
Basically, you have to add new configuration blocks to your configuration file from the previous exercise (e.g. nextflow.config) and add the relevant settings:
- in which scope will you set the executor and the queue?
- in which scope will you set the maximum number of concurrent jobs to be submitted?
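As a hedged sketch of where these settings live (one possible answer to the two questions above), the two blocks could look like:

```groovy
// executor scope: controls how Nextflow manages job submission
executor {
    queueSize = 3          // at most 3 jobs submitted at any one time
}

// process scope: where and how each task is executed
process {
    executor = 'slurm'     // submit tasks as SLURM jobs
    queue    = 'cpuq'      // target SLURM queue/partition
}
```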
When you run the pipeline with the modified configuration, you will notice from the log that Nextflow is now using SLURM to submit jobs to the specified queue. The log will report executor > slurm (x).
If you check your HPC queue using squeue -u $USER, you will see that Nextflow is submitting jobs for you and that only 3 jobs are queued at any one time, as per the configuration.
You can run the pipeline with the following command:
# In case you created nextflow.config in the current directory
nextflow run $PIPELINE_PATH
# In case you created a custom config file named my_custom_file.config
nextflow run $PIPELINE_PATH -c my_custom_file.config
Exercise 3 - Basic process configuration
Code to solve the exercise in: practicals/day3/1-advanced_config/3-process_config1
Setup
- Before starting this exercise, please unload the singularity module using
module unload singularityce/3.10.3
Make sure that which singularity returns no output.
- Make a new directory for this exercise and navigate into it:
mkdir -p my_work_dir/day3/1-advanced_config/3-process_config1
cd my_work_dir/day3/1-advanced_config/3-process_config1
For example, if you created a config file named nextflow.config in the previous exercise, copy it into this new directory:
cp my_work_dir/day3/1-advanced_config/2-executor/nextflow.config .
Goal
Modify your config file to set resource usage for all processes in the main configuration. We also want to ensure the singularityce/3.10.3 module is loaded before running any process.
Execution
For this you will need to set the relevant properties in the process scope. Specifically, you will need to set the process resource parameters and configure the execution environment.
Basically, you have to add new configuration settings in the process scope you already defined in your configuration file from the previous exercise (e.g. nextflow.config):
First, inspect the module files bwa.nf, samtools-merge.nf and samtools-sort.nf in the workflow directory to see the resource requirements for each process. If we want to define a single resource configuration that will work for all of them we have to take the maximum request and configure this in the process scope of our config file.
- cpus = 4
- memory = 8 GB
- time = 2 h
Now you have to remove the resource requirements (cpus, memory and time) from each process definition in the module files, since they are now defined globally in the configuration file. Make sure you are making changes to your local copy of the pipeline files, not the original ones in the /project/nextflow_zero2hero/practicals directory.
If you run the pipeline with the modified configuration now, you will get an error! What's going on?
You can see from the log that the execution failed because the singularity command is not available. This happens because singularity is not available by default on the HPC nodes, and we need to load the singularity module before running any process.
How can you configure this in the configuration file?
Hint: beforeScript can likely help you here.
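Putting the hint together with the resource values listed above, a possible (not the only) process scope is:

```groovy
process {
    // maximum of the per-module requests, applied to every process
    cpus   = 4
    memory = '8 GB'
    time   = '2 h'
    // executed before each task's command, so singularity is available
    beforeScript = 'module load singularityce/3.10.3'
}
```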
Once you have added this additional setting to your configuration file, you can run the pipeline again. This time it should work fine. You can inspect one of the .command.run files in the work directory to verify that the singularity module is loaded before executing the actual command.
If you look at the tail of the file:
tail -n 29 work/<some_hash>/.command.run
You will notice the module load command has been added:
# beforeScript directive
module load singularityce/3.10.3
You can run the pipeline with the following command:
# In case you created nextflow.config in the current directory
nextflow run $PIPELINE_PATH
# In case you created a custom config file named my_custom_file.config
nextflow run $PIPELINE_PATH -c my_custom_file.config
Exercise 4 - More on process configuration
Code to solve the exercise in: practicals/day3/1-advanced_config/4-process_config2
Setup
Make a new directory for this exercise and navigate into it:
mkdir -p my_work_dir/day3/1-advanced_config/4-process_config2
cd my_work_dir/day3/1-advanced_config/4-process_config2
For example, if you created a config file named nextflow.config in the previous exercise, copy it into this new directory:
cp my_work_dir/day3/1-advanced_config/3-process_config1/nextflow.config .
Goal
Modify the configuration in the parabrick-haplotypecaller.nf module so that it can run on GPU nodes of our HPC, using dedicated resources of 8 cpus and 32 GB of RAM.
Priority concept: The configuration defined within the process definition overrides the global configuration defined in the config file.
Execution
For this you will need to set the relevant properties directly in the process definition. Specifically, you will need to set the relevant process resource parameters and container parameters.
To be able to use GPUs in our HPC, we need to modify the file parabrick-haplotypecaller.nf in workflow_gpu to:
- set the queue to gpuq
- request one GPU from the scheduler by passing --gres=gpu:1 to the sbatch command
- enable GPU support in Singularity by adding the --nv option to the singularity exec command
To solve this, remember you can fine-tune the configuration of your computational environment at the process level by using directives like clusterOptions and containerOptions.
Additionally, we need to configure the process to use more resources:
- 8 cpus
- 32 GB of RAM.
Make sure you are making changes to your local copy of the pipeline files, not the original ones in the /project/nextflow_zero2hero/practicals directory.
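As an illustration, the directive section of the process in parabrick-haplotypecaller.nf could be shaped like this (assuming the process is named HAPLOTYPECALLER_GPU, as in a later exercise; the input/output/script sections are left as they are in the module):

```groovy
process HAPLOTYPECALLER_GPU {
    cpus   8
    memory '32 GB'
    queue  'gpuq'                          // GPU partition of our HPC
    clusterOptions   '--gres=gpu:1'        // ask SLURM for one GPU
    containerOptions '--nv'                // expose the GPU inside the container

    // input:, output: and script: sections unchanged ...
}
```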
Once you have modified and saved the module code, you can run the GPU pipeline. You can inspect one of the .command.run files in the work directory to verify that the --gres=gpu:1 option has been added to the sbatch command and that we are now using the gpuq queue.
Note how the configuration defined in the process definition overrides the global configuration defined in the config file.
For example, if you look at the head of the file:
head work/<some_hash>/.command.run
You will notice the sbatch command has been modified:
#SBATCH --gres=gpu:1
#SBATCH -p gpuq
You can run the pipeline with the following command:
# In case you created nextflow.config in the current directory
nextflow run $PIPELINE_GPU_PATH
# In case you created a custom config file named my_custom_file.config
nextflow run $PIPELINE_GPU_PATH -c my_custom_file.config
Exercise 5 - Dynamic resource allocation
Code to solve the exercise in: practicals/day3/1-advanced_config/5-dynamic_resources
Setup
Make a new directory for this exercise and navigate into it:
mkdir -p my_work_dir/day3/1-advanced_config/5-dynamic_resources
cd my_work_dir/day3/1-advanced_config/5-dynamic_resources
For example, if you created a config file named nextflow.config in the previous exercise, copy it into this new directory.
cp my_work_dir/day3/1-advanced_config/4-process_config2/nextflow.config .
Goal
Instead of setting the process resources statically, modify your configuration file to request cpus, memory and time dynamically, based on the number of retry attempts. Use the following starting values:
- cpus = 2
- memory = 4 GB
- time = 1 minute
NB. These values are intentionally low to trigger a failure and demonstrate the retry mechanism.
We want a process that fails due to insufficient memory or time, or that is killed by the scheduler, to be retried with increased resources, for a maximum of 3 attempts.
Execution
For this you will need to modify the process scope in your config file and replace the static definitions of cpus, memory and time with closures that return a dynamic value based on the number of retry attempts. Specifically, you will need to configure dynamic resource allocation based on the task attempt and ensure that your error strategy allows a process to be retried.
Suggestions:
- a value can be set dynamically using a closure like { ... } that returns the desired value.
- the task.attempt property can be used to get the current attempt number for a process (starting from 1).
- the errorStrategy directive can also be set dynamically using a closure that returns a valid mode (like 'retry' or 'terminate').
- the task.exitStatus property contains the exit code of the last executed command for a process.
- a list of error codes can be defined as [ code1, code2, ... ] to check if the last exit status is included in the list.
- the maxRetries process directive controls the maximum number of retry attempts for a process.
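Combining these suggestions, a possible dynamic process scope looks like the following. The exact exit-code list is an assumption: 137 and 143 are typical OOM/kill codes, and 140 is the timeout status shown in the log of this exercise.

```groovy
process {
    // grow resources linearly with each attempt: 2/4/6 cpus, 4/8/12 GB, 1/2/3 min
    cpus   = { 2 * task.attempt }
    memory = { 4.GB * task.attempt }
    time   = { 1.min * task.attempt }

    // retry on OOM/timeout/kill exit codes, otherwise terminate
    errorStrategy = { task.exitStatus in [137, 140, 143] ? 'retry' : 'terminate' }
    maxRetries    = 3
}
```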
Once you have modified and saved your configuration, you can run the standard pipeline with this new configuration.
Note how the process BWA_MEM fails, but it is automatically retried with increased resources. The log notifies you about this:
[c3/cf932f] NOTE: Process `BWA_MEM (sample_1)` terminated with an error exit status (140) -- Execution is retried (1)
And at the end you can see that some tasks were retried:
[11/050994] BWA_MEM (sample_2) [100%] 4 of 4, retries: 4 ✔
You can run the pipeline with the following command:
# In case you created nextflow.config in the current directory
nextflow run $PIPELINE_PATH
# In case you created a custom config file named my_custom_file.config
nextflow run $PIPELINE_PATH -c my_custom_file.config
Exercise 6 - Fine grained process settings
Code to solve the exercise in: practicals/day3/1-advanced_config/6-fine_grained_process_config
Specifically, updated versions of the modules are in: practicals/day3/1-advanced_config/6-fine_grained_process_config/modified_modules
Setup
Make a new directory for this exercise and navigate into it:
mkdir -p my_work_dir/day3/1-advanced_config/6-fine_grained_process_config
cd my_work_dir/day3/1-advanced_config/6-fine_grained_process_config
For example, if you created a config file named nextflow.config in the previous exercise, copy it into this new directory.
cp my_work_dir/day3/1-advanced_config/5-dynamic_resources/nextflow.config .
Goal
Instead of having the same resource settings for all processes, assign specific resources to each process by combining label-based and name-based process configuration.
We want to define three labels:
- process_low for processes that require low resources (2 cpus, 4 GB memory, 5 minutes time)
- process_high for processes that require high resources (4 cpus, 8 GB memory, 30 minutes time)
- process_high_memory for processes that have special memory requirements (32 GB memory)
We want to keep this new configuration logic separate from the main configuration file.
Priority concept: The configuration defined using process labels and names overrides the global configuration defined in the config file and also the configuration defined directly in the process definition. Moreover, the name-based configuration has higher priority than the label-based configuration.
Priority concept: When multiple labels are assigned to a process, the corresponding settings are collapsed in the order the labels appear. Hence, the last label takes precedence in case of conflicting settings.
Execution
For this you will need to create a new config file with a custom name (e.g. process_labels.config) and create a process scope in it where you will define the label-based and name-based configuration settings.
Part1
NB. For this part we will use the standard workflow in practicals/day3/1-advanced_config/workflow.
Create a new configuration file (let's say process_labels.config) that contains a process scope where you will define the three labels with the relevant resource settings.
- process_low for processes that require low resources (2 cpus, 4 GB memory, 5 minutes time)
- process_high for processes that require high resources (4 cpus, 8 GB memory, 30 minutes time)
- process_high_memory for processes that have special memory requirements (32 GB memory)
Suggestion: a label-based configuration can be defined using the withLabel selector inside the process scope.
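Following the suggestion, process_labels.config could be sketched as (values from the list above; one possible layout):

```groovy
process {
    withLabel: 'process_low' {
        cpus   = 2
        memory = '4 GB'
        time   = '5 min'
    }
    withLabel: 'process_high' {
        cpus   = 4
        memory = '8 GB'
        time   = '30 min'
    }
    withLabel: 'process_high_memory' {
        memory = '32 GB'       // only memory is set by this label
    }
}
```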
Now modify your main configuration file (e.g. nextflow.config) to include the content of this new file by using the includeConfig directive in the root scope:
includeConfig 'process_labels.config'
Finally, modify the modules and assign the appropriate labels to each process in the module files (bwa.nf, samtools-merge.nf and samtools-sort.nf) based on their resource requirements. In the end we want to have:
- the BWA_MEM process should require 4 cpus, 32 GB memory and 30 minutes time
- the SAMTOOLS_MERGE and SAMTOOLS_SORT processes should require 2 cpus, 4 GB memory and 5 minutes time
Suggestion: multiple labels can be assigned to a process by repeating the label directive in the process definition; the resulting settings are collapsed in the order the labels appear.
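For example, assuming the labels defined earlier in this exercise, the BWA_MEM module header could combine two labels: process_high gives 4 cpus / 8 GB / 30 minutes, then process_high_memory raises memory to 32 GB.

```groovy
process BWA_MEM {
    label 'process_high'
    label 'process_high_memory'   // last label wins on the conflicting memory setting

    // input:, output: and script: sections unchanged ...
}
```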
Once you have modified and saved the module code, you can run the standard pipeline with this new configuration. You can inspect the .command.run files in the work directory to verify that the resource settings are correctly applied for each process based on the assigned labels.
You can run the pipeline with the following command:
# In case you created nextflow.config in the current directory
nextflow run $PIPELINE_PATH
# In case you created a custom config file named my_custom_file.config
nextflow run $PIPELINE_PATH -c my_custom_file.config
Part2
NB. For this part we will use the GPU workflow in practicals/day3/1-advanced_config/workflow_gpu.
Now we have a new process named HAPLOTYPECALLER_GPU in the parabrick-haplotypecaller.nf module that requires special resource settings to enable GPU support. As we saw previously, we need to:
- set the queue to gpuq
- request one GPU from the scheduler by passing --gres=gpu:1 to the sbatch command
- enable GPU support in Singularity by adding the --nv option to the singularity exec command
In addition, we want this process to use 8 cpus and 32 GB of RAM, with 30 minutes time.
First, consider which labels we should combine to get close to the desired resources, specifically 32 GB of RAM and 30 minutes time. You likely want to combine the process_high and process_high_memory labels.
Modify and save the module file parabrick-haplotypecaller.nf in workflow_gpu to assign the appropriate labels to the HAPLOTYPECALLER_GPU process.
Now, how can we edit our additional configuration file (process_labels.config) to further customise settings specifically for the HAPLOTYPECALLER_GPU process, adding the GPU-specific configuration and setting 8 cpus?
Basically we want a way to assign the following properties specifically to this process directly from our configuration file:
cpus = 8
queue = 'gpuq'
clusterOptions = '--gres=gpu:1'
containerOptions = '--nv'
Suggestion: you can define name-based configuration using the withName selector inside the process scope. Keep in mind that name-based configuration has higher priority than label-based configuration when there are conflicting settings.
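A possible withName block in process_labels.config, using the properties listed above:

```groovy
process {
    withName: 'HAPLOTYPECALLER_GPU' {
        cpus             = 8
        queue            = 'gpuq'
        clusterOptions   = '--gres=gpu:1'   // one GPU from the SLURM scheduler
        containerOptions = '--nv'           // GPU support inside the container
    }
}
```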
Once you have modified and saved the module code and your configuration file, you can run the GPU pipeline with this new configuration. You can inspect the .command.run files in the work directory to verify that the resource settings are correctly applied for each process based on the assigned labels.
You can run the pipeline with the following command:
# In case you created nextflow.config in the current directory
nextflow run $PIPELINE_GPU_PATH
# In case you created a custom config file named my_custom_file.config
nextflow run $PIPELINE_GPU_PATH -c my_custom_file.config
Exercise 7 - Profiles
Code to solve the exercise in: practicals/day3/1-advanced_config/7-profiles
Setup
For this exercise please request an interactive session with 8 cpus and 32 GB of memory:
srun --pty -c 8 -p cpu-interactive --mem 32G -J nextflow_training /bin/bash
module load nextflow/25.04.3
Make a new directory for this exercise and navigate into it:
mkdir -p my_work_dir/day3/1-advanced_config/7-profiles
cd my_work_dir/day3/1-advanced_config/7-profiles
Copy all configuration files from the previous exercise into this new directory. We will likely have a nextflow.config file and a process_labels.config file:
cp my_work_dir/day3/1-advanced_config/6-fine_grained_process_config/nextflow.config .
cp my_work_dir/day3/1-advanced_config/6-fine_grained_process_config/process_labels.config .
NB. For this part we will use the gpu workflow in practicals/day3/1-advanced_config/workflow_gpu, with all the modifications you made so far.
Goal
Create two profiles named local and slurm to quickly switch between local execution and HPC execution. Each profile will contain all the settings for executor and process scope needed for each environment.
We also want to create a singularity profile to quickly enable Singularity support when needed. This profile will contain all the settings for the singularity scope needed to enable Singularity support.
Priority concept: The configuration defined using profiles overrides the global configuration defined in the config file for the same scope when there are conflicting settings.
Execution
For this you will need to restructure your main configuration file to create a profiles scope defining multiple named profiles that separate the configuration settings we added so far, allowing you to quickly switch between computational environments (local or HPC) and to activate Singularity support.
Modify your main configuration file (e.g. nextflow.config) to create a profiles scope where you will define the following profile blocks:
- a local profile for local execution, containing all the relevant settings for the executor and process scopes needed for local execution
- a slurm profile for HPC execution, containing all the relevant settings for the executor and process scopes needed for HPC execution
- a singularity profile containing all the relevant settings for the singularity scope needed to enable Singularity support
profiles {
local {
...
}
slurm {
...
}
singularity {
...
}
}
Suggestions:
- the local profile should:
  - set a maximum limit on the executor of 4 cpus and 16 GB of RAM
  - set the process executor to local
  - set the process errorStrategy to finish
  - set static resource allocation for all processes to 2 cpus, 8 GB memory and 5 minutes time
  - ensure the singularityce/3.10.3 module is loaded before running code in all processes
  - include the special configuration for the HAPLOTYPECALLER_GPU process to enable GPU support, as done in the previous exercise
- the singularity profile should:
  - enable Singularity support
  - set the Singularity library directory to /project/nextflow_zero2hero/containers
  - set the Singularity cache directory to /scratch/${env('USER')}/singularity_cache
- the slurm profile should allow us to submit jobs to the SLURM scheduler:
  - set a maximum executor queue size of 3
  - set the process executor to slurm and the queue to cpuq
  - set the resource limits of the system to 32 cpus, 550 GB memory and 30 days time
  - ensure the singularityce/3.10.3 module is loaded before running code in all processes
  - set dynamic resource allocation for all processes based on the number of retry attempts, as done in the previous exercise
  - set the process errorStrategy to retry tasks that fail due to OOM or preemption, for a maximum of 3 retries
  - include the special configuration for the HAPLOTYPECALLER_GPU process to enable GPU support, as done in the previous exercise
  - add a Singularity run option to bind the /localscratch directory inside the container
  - add a Singularity run option to activate the --cleanenv flag
Suggestion: remember that inside each profile block you can define all the settings and scopes you usually use in the configuration, using the same syntax.
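As a partial illustration (not the full solution), the slurm profile could nest the scopes from the earlier exercises like this:

```groovy
profiles {
    slurm {
        executor {
            queueSize = 3
        }
        process {
            executor     = 'slurm'
            queue        = 'cpuq'
            beforeScript = 'module load singularityce/3.10.3'
            // dynamic resources, errorStrategy, withLabel/withName blocks, ... as before
        }
    }
}
```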
Once you have modified and saved the configuration file, you can run the pipeline with this new configuration. You can now use the profile switches to decide whether the processes will be executed on your local system or submitted to the HPC scheduler.
You can run the pipeline with the following command:
# In case you created nextflow.config in the current directory
# For local execution
nextflow run $PIPELINE_GPU_PATH -profile local
# For HPC execution with SLURM
nextflow run $PIPELINE_GPU_PATH -profile slurm,singularity
Inspect the pipeline log and the .command.run files in the work directory to verify that the resource settings are correctly applied for each process based on the selected profile and the processes are executed either locally or on the HPC scheduler.
Exercise 8 - Institutional profiles
Code to solve the exercise in: practicals/day3/1-advanced_config/8-institutional_profiles
Setup
For this exercise please request an interactive session with 8 cpus and 32 GB of memory:
srun --pty -c 8 -p cpu-interactive --mem 32G -J nextflow_training /bin/bash
module load nextflow/25.04.3
Make a new directory for this exercise and navigate into it:
mkdir -p my_work_dir/day3/1-advanced_config/8-institutional_profiles
cd my_work_dir/day3/1-advanced_config/8-institutional_profiles
Copy all configuration files from the previous exercise into this new directory. We will likely have a nextflow.config file and a process_labels.config file:
cp my_work_dir/day3/1-advanced_config/7-profiles/nextflow.config .
cp my_work_dir/day3/1-advanced_config/7-profiles/process_labels.config .
NB. For this part we will use the standard workflow in practicals/day3/1-advanced_config/workflow, with all the modifications you made so far.
Goal
Incorporate the code necessary to access institutional profiles from nf-core and use this for the Human Technopole HPC.
Execution
For this you will need to add some specific blocks to your configuration file to enable access to the nf-core institutional profiles which provide pre-defined configuration for various HPC systems around the world, including the Human Technopole HPC.
Modify your main configuration file (e.g. nextflow.config) to include the necessary settings to access nf-core institutional profiles.
params {
// nf-core profiles config options
custom_config_version = 'master'
custom_config_base = "https://raw.githubusercontent.com/nf-core/configs/${params.custom_config_version}"
hostnames = [:]
config_profile_description = null
config_profile_contact = null
config_profile_url = null
config_profile_name = null
}
// Load nf-core custom profiles from different Institutions
includeConfig (
params.custom_config_base
? "${params.custom_config_base}/nfcore_custom.config"
: '/dev/null'
)
Once you have modified and saved the configuration file, you can run the standard pipeline with this new configuration. You can now use the profile switch to activate the Human Technopole HPC institutional profile named humantechnopole, which will automatically set all the relevant configuration for this HPC system's scheduler.
You can run the pipeline with the following command:
# In case you created nextflow.config in the current directory
# For HPC execution with SLURM using the institutional profile
nextflow run $PIPELINE_PATH -profile humantechnopole,singularity