Day 2 - Section 5 - Nextflow Configuration Basics
In this practical session we will explore how Nextflow configuration files control how a pipeline is executed.
You will learn how to switch between Conda, Singularity, and HPC execution simply by editing nextflow.config, without changing the pipeline code.
Learning objectives
By the end of this session you should be able to:
- Write and understand a Nextflow configuration file.
- Use configuration files to define:
  - Parameters
  - Environment variables
  - Process directives (cpus, memory, container, conda, executor)
- Enable and disable execution backends:
  - Conda
  - Singularity
  - HPC schedulers (SLURM)
- Understand how configuration affects pipeline portability.
Exercise 1 – Run the pipeline without Conda
Inside this folder you will find a simple pipeline that:
- Creates reference genome indexes.
- Runs a basic alignment workflow.
Run the pipeline:
nextflow run main.nf \
--input_file ./assets/test_input.tsv \
--reference_genome /processing_data/reference_datasets/iGenomes/2025.1/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa \
--outdir test
Output
You should see an error similar to:
Command exit status:
127
Question
What does exit status 127 mean?
Explanation
Exit code 127 means:
The command was not found.
This happens because the required tools (e.g. gatk, samtools) are not available in the environment.
This motivates the need for Conda environments.
Exercise 1b – Enable Conda environments
We will now configure Nextflow to use Conda environments for the processes.
Step 1 – Modify the process definitions
Add the following line to both processes (GATK4_CREATESEQUENCEDICTIONARY and SAMTOOLS_FAIDX):
conda "${moduleDir}/environment.yml"
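For orientation, here is where the directive sits inside a process definition. This is only a sketch: the input, output, and script blocks below are illustrative placeholders, not the actual contents of the tutorial module.

```groovy
// Sketch of a module with a per-module Conda environment.
// Input/output/script are illustrative, not the real module body.
process SAMTOOLS_FAIDX {
    conda "${moduleDir}/environment.yml"   // environment resolved relative to the module folder

    input:
    path fasta

    output:
    path "${fasta}.fai"

    script:
    """
    samtools faidx ${fasta}
    """
}
```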
Step 2 – Enable Conda in nextflow.config
Edit nextflow.config and add:
conda.enabled = true
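Other Conda options live in the same configuration scope. A sketch (the cache path is an example, not required):

```groovy
conda.enabled  = true
// Optional: build environments once in a shared location and reuse them
// across runs (path is an assumption; adjust to your system)
conda.cacheDir = '/path/to/conda_cache'
```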
Step 3 – Run again
nextflow run main.nf \
--input_file ./assets/test_input.tsv \
--reference_genome /processing_data/reference_datasets/iGenomes/2025.1/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa \
--outdir test_grch38
Question
What changed compared to the previous run?
Explanation
Now Nextflow:
- Creates a Conda environment from environment.yml.
- Runs each process inside that environment.
- Finds the required tools correctly.
The pipeline should now proceed further.
Exercise 2 – Change default parameters using nextflow.config
Now we will run the same pipeline using a different reference genome.
List the available references:
ls /processing_data/reference_datasets/iGenomes/2025.1/Homo_sapiens/NCBI/
Choose a different reference, for example:
/processing_data/reference_datasets/iGenomes/2025.1/Homo_sapiens/NCBI/build37.1/Sequence/BWAIndex/genome.fa
Run the pipeline again:
nextflow run main.nf \
--input_file ./assets/test_input.tsv \
--reference_genome /processing_data/reference_datasets/iGenomes/2025.1/Homo_sapiens/NCBI/build37.1/Sequence/BWAIndex/genome.fa \
--outdir test_build37.1 \
-resume
What does -resume do?
It tells Nextflow to reuse results from previous executions when possible, avoiding recomputation.
Exercise 3 – Use Singularity containers
Containers are often built from Docker images and converted to Singularity images for HPC usage.
We will now switch from Conda to Singularity.
Step 1 – Modify process containers
SAMTOOLS_FAIDX
container 'quay.io/biocontainers/samtools:1.22--h96c455f_0'
GATK4_CREATESEQUENCEDICTIONARY
container 'community.wave.seqera.io/library/gatk4_gcnvkernel:edb12e4f0bf02cd3'
Step 2 – Update nextflow.config
singularity.enabled = true
conda.enabled = false
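On shared systems the Singularity scope is often configured a little more fully. A sketch (the cache path is an assumption):

```groovy
singularity.enabled    = true
conda.enabled          = false
// Automatically mount host paths referenced by tasks inside the container
singularity.autoMounts = true
// Optional: shared cache so images are downloaded and converted only once
singularity.cacheDir   = '/path/to/singularity_cache'
```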
Step 3 – Run again
nextflow run main.nf \
--input_file ./assets/test_input.tsv \
--reference_genome /processing_data/reference_datasets/iGenomes/2025.1/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa \
--outdir test_GRCh38 \
-resume
What is happening now?
Each process is executed inside a Singularity container instead of a Conda environment.
Exercise 4 – Run on an HPC cluster
What is an HPC?
An HPC (High Performance Computing) cluster is a group of machines managed by a scheduler (e.g. SLURM). Jobs are submitted to queues and executed with specific resource requests (CPUs, memory, time).
Nextflow interacts with the scheduler through the executor.
Step 1 – Configure SLURM in nextflow.config
process.executor = "slurm"
process.queue = "cpuq"
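The same process scope also accepts per-process overrides via selectors, so scheduler settings and resources can stay together in one place. A sketch using a process name from this tutorial:

```groovy
process {
    executor = 'slurm'
    queue    = 'cpuq'

    // Override settings for a single process by name
    withName: 'GATK4_CREATESEQUENCEDICTIONARY' {
        cpus   = 4
        memory = '8 GB'
    }
}
```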
Step 2 – Run on the cluster
nextflow run main.nf \
--input_file ./assets/test_input.tsv \
--reference_genome /processing_data/reference_datasets/iGenomes/2025.1/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa \
--outdir test_GRCh38 \
-resume
Monitor your jobs
squeue -u name.surname
Output
You will likely see:
Command exit status:
247
Question
What resources are allocated to each process?
Explanation
If no resources are defined, Nextflow falls back to defaults:
- 1 CPU
- no explicit memory request, so the scheduler's own default applies (often ~1–2 GB)
This is not enough for GATK, so the job is killed by the scheduler (OOM kill).
Solution – Set process resources
Modify the process definition:
process GATK4_CREATESEQUENCEDICTIONARY {
conda "${moduleDir}/environment.yml"
publishDir "${params.outdir}/genome_index", mode: params.publish_mode
container 'community.wave.seqera.io/library/gatk4_gcnvkernel:edb12e4f0bf02cd3'
cpus 4
memory '8 GB'
...
}
What changed?
Now Nextflow will request:
- 4 CPUs
- 8 GB RAM
from SLURM, and the job will run successfully.
Exercise 5 – Use params.yaml to centralize parameters
Why use params.yaml?
When pipelines grow, command lines become long and error-prone:
nextflow run main.nf \
--input_file ... \
--reference_genome ... \
--outdir ...
To improve reproducibility and readability, we can store all parameters in a YAML file.
Step 1 – Create params.yaml
Example:
input_file: ./assets/test_input.tsv
reference_genome: /processing_data/reference_datasets/iGenomes/2025.1/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa
outdir: test_GRCh38
publish_mode: copy
Step 2 – Run using the params file
nextflow run main.nf -params-file params.yaml
What happens internally?
Nextflow loads all values from params.yaml and assigns them to:
params.input_file
params.reference_genome
params.outdir
params.publish_mode
Advantages
- Cleaner command line
- Easier reproducibility
- Easy to share configurations between users
- Allows multiple parameter sets for different experiments
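Note the precedence: values in params.yaml override defaults declared in nextflow.config, and command-line --flags override both. A sketch of config-level defaults (values are examples):

```groovy
// Defaults in nextflow.config; -params-file values override these,
// and command-line flags like --outdir override both
params {
    outdir       = 'results'
    publish_mode = 'copy'
}
```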
Exercise 6 – Run the full workflow (FASTP, BWA, SAMTOOLS)
Now we enable the full workflow by uncommenting the related processes in main.nf.
We will use containers for each tool.
Process containers
FASTP
container 'quay.io/biocontainers/fastp:1.0.1--heae3180_0'
ALIGNMENT: BWA_MEM
container 'community.wave.seqera.io/library/bwa_htslib_samtools:83b50ff84ead50d0'
ALIGNMENT: SAMTOOLS_MERGE
container 'quay.io/biocontainers/samtools:1.22--h96c455f_0'
Run the full workflow
nextflow run main.nf -params-file params.yaml -resume
Expected issue
Some processes will fail with:
Command exit status:
247
Why?
Exit status 247 indicates that the scheduler killed the process, usually due to:
- Insufficient memory
- Insufficient CPUs
- Time limit exceeded
In our case, it is mainly memory.
Fix: increase resources
Example:
process FASTP {
...
cpus 2
memory '4 GB'
...
}
process BWA_MEM {
...
cpus 4
memory '8 GB'
...
}
process SAMTOOLS_MERGE {
...
cpus 2
memory '4 GB'
...
}
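Instead of editing each process, the same resources can be set centrally in nextflow.config with withName selectors. A sketch using the process names above:

```groovy
process {
    withName: 'FASTP' {
        cpus   = 2
        memory = '4 GB'
    }
    withName: 'BWA_MEM' {
        cpus   = 4
        memory = '8 GB'
    }
    withName: 'SAMTOOLS_MERGE' {
        cpus   = 2
        memory = '4 GB'
    }
}
```

This keeps resource tuning out of the pipeline code entirely, which is exactly the portability point of this section.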
Key lesson
Containers provide software, but resources are controlled only by Nextflow.
Containers do NOT manage memory or CPU limits.
Exercise 7 – How resources are handled inside modules
We now analyze the BWA_MEM module.
Inside the process script you will see something like:
bwa mem -t ${task.cpus} ref.fa reads.fq > output.sam
What is task.cpus?
task.cpus is automatically set by Nextflow from the cpus directive in the process definition.
So:
cpus 8
becomes:
-t 8
inside the command.
What about task.memory?
task.memory contains the memory assigned to the job, for example:
memory '16 GB'
Inside the script you can use:
echo "Memory assigned: ${task.memory}"
Some tools accept an explicit memory option; others benefit only indirectly, for example through Java heap size or buffer allocation.
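As an example of passing both values to a tool, samtools sort accepts a thread count and a per-thread memory limit. This is a sketch, not one of the tutorial modules; the per-thread split is an illustration:

```groovy
process SAMTOOLS_SORT_EXAMPLE {
    cpus 4
    memory '8 GB'

    input:
    path bam

    output:
    path 'sorted.bam'

    script:
    // Divide total memory across sort threads; toGiga() converts
    // task.memory to an integer number of gigabytes
    def mem_per_thread = (task.memory.toGiga() / task.cpus) as int
    """
    samtools sort -@ ${task.cpus} -m ${mem_per_thread}G -o sorted.bam ${bam}
    """
}
```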
Why this is important
This means:
- The module adapts automatically to different resource settings.
- The same module can run on laptop, cluster, or cloud.
- Resource tuning is separated from pipeline logic.
Best practice
Modules should always use:
- task.cpus
- task.memory
- task.time
instead of hard-coding values.
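A common extension of this practice is to make resources dynamic: with a closure, the request can grow on each retry via task.attempt. A sketch:

```groovy
process EXAMPLE {
    // Retry failed tasks up to two extra times, increasing memory each
    // attempt: 4 GB on the first try, 8 GB on the second, 12 GB on the third
    errorStrategy 'retry'
    maxRetries 2
    memory { 4.GB * task.attempt }
    cpus 2
    ...
}
```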
Extra Exercise – Configure Singularity runtime properly
On HPC systems, Singularity requires careful configuration.
Add to nextflow.config:
singularity.cacheDir = "/path/.singularity"
process.scratch = '$TMPDIR'
process.beforeScript = 'module load singularity/3.8.5'
process.containerOptions = '-B /localscratch'
Explanation
singularity.cacheDir
Defines where Singularity stores images. This avoids filling up home directories.
process.scratch
Tells Nextflow to run each process in a temporary directory for better I/O performance.
process.beforeScript
Loads the Singularity module before executing each process.
process.containerOptions
Binds local scratch storage into the container for fast temporary I/O.
Why this matters
Without this configuration:
- Containers may fail to pull images.
- Disk quotas may be exceeded.
- Performance may be very poor.
Final takeaway
After these exercises you should understand:
| Topic | You learned |
|---|---|
| Parameters | CLI vs params.yaml |
| Environments | Conda vs Singularity |
| Containers | Tool isolation |
| Resources | cpus, memory, task variables |
| HPC | Scheduler interaction |
| Modules | Portable design |
| Caching | Resume and reuse |
| Scratch | Performance optimization |
With Nextflow configuration files you can:
- Switch environments (Conda, Singularity, HPC) without touching pipeline code.
- Control resources centrally.
- Make pipelines portable, reproducible, and scalable.