Day 2 - Section 5 - Nextflow Configuration Basics
In this practical session we will explore how Nextflow configuration files control how a pipeline is executed.
You will learn how to switch between Conda, Singularity, and HPC execution simply by editing nextflow.config, without changing the pipeline code.
Learning objectives
By the end of this session you should be able to:
- Write and understand a Nextflow configuration file.
- Use configuration files to define:
  - Parameters
  - Environment variables
  - Process directives (cpus, memory, container, conda, executor)
- Enable and disable execution backends:
  - Conda
  - Singularity
  - HPC schedulers (SLURM)
- Understand how configuration affects pipeline portability.
Exercise 1 – Run the pipeline without Conda
Inside this folder you will find a simple pipeline that:
- Creates reference genome indexes.
- Runs a basic alignment workflow.
Run the pipeline:
nextflow run main.nf \
--input_file ./assets/test_input.tsv \
--reference_genome /processing_data/reference_datasets/iGenomes/2025.1/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa \
--outdir test
Output
You should see an error similar to:
Command exit status:
127
Question
What does exit status 127 mean?
Explanation
Exit code 127 means:
The command was not found.
This happens because the required tools (e.g. gatk, samtools) are not available in the environment.
This motivates the need for Conda environments.
Exercise 1b – Enable Conda environments
We will now configure Nextflow to use Conda environments for the processes.
Step 1 – Modify the process definitions
Add the following line to both processes (GATK4_CREATESEQUENCEDICTIONARY and SAMTOOLS_FAIDX):
conda "${moduleDir}/environment.yml"
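For orientation, here is where the directive sits inside a process definition. This is only a sketch: the input, output, and script blocks below are illustrative placeholders, not the actual contents of the tutorial module.

```groovy
// Sketch of a module with a per-module Conda environment.
// Input/output/script are illustrative, not the real module body.
process SAMTOOLS_FAIDX {
    conda "${moduleDir}/environment.yml"   // environment resolved relative to the module folder

    input:
    path fasta

    output:
    path "${fasta}.fai"

    script:
    """
    samtools faidx ${fasta}
    """
}
```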
Step 2 – Enable Conda in nextflow.config
Edit nextflow.config and add:
conda.enabled = true
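Other Conda options live in the same configuration scope. A sketch (the cache path is an example, not required):

```groovy
conda.enabled  = true
// Optional: build environments once in a shared location and reuse them
// across runs (path is an assumption; adjust to your system)
conda.cacheDir = '/path/to/conda_cache'
```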
Step 3 – Run again
nextflow run main.nf \
--input_file ./assets/test_input.tsv \
--reference_genome /processing_data/reference_datasets/iGenomes/2025.1/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa \
--outdir test_grch38
Question
What changed compared to the previous run?
Explanation
Now Nextflow:
- Creates a Conda environment from environment.yml.
- Runs each process inside that environment.
- Finds the required tools correctly.
The pipeline should now proceed further.
Exercise 2 – Change default parameters using nextflow.config
Now we will run the same pipeline using a different reference genome.
List the available references:
ls /processing_data/reference_datasets/iGenomes/2025.1/Homo_sapiens/NCBI/
Choose a different reference, for example:
/processing_data/reference_datasets/iGenomes/2025.1/Homo_sapiens/NCBI/build37.1/Sequence/BWAIndex/genome.fa
Run the pipeline again:
nextflow run main.nf \
--input_file ./assets/test_input.tsv \
--reference_genome /processing_data/reference_datasets/iGenomes/2025.1/Homo_sapiens/NCBI/build37.1/Sequence/BWAIndex/genome.fa \
--outdir test_build37.1 \
-resume
What does -resume do?
It tells Nextflow to reuse results from previous executions when possible, avoiding recomputation.
Exercise 3 – Use Singularity containers
Containers are often built from Docker images and converted to Singularity images for HPC usage.
We will now switch from Conda to Singularity.
Step 1 – Modify process containers
SAMTOOLS_FAIDX
container 'quay.io/biocontainers/samtools:1.22--h96c455f_0'
GATK4_CREATESEQUENCEDICTIONARY
container 'community.wave.seqera.io/library/gatk4_gcnvkernel:edb12e4f0bf02cd3'
Step 2 – Update nextflow.config
singularity.enabled = true
conda.enabled = false
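On shared systems the Singularity scope is often configured a little more fully. A sketch (the cache path is an assumption):

```groovy
singularity.enabled    = true
conda.enabled          = false
// Automatically mount host paths referenced by tasks inside the container
singularity.autoMounts = true
// Optional: shared cache so images are downloaded and converted only once
singularity.cacheDir   = '/path/to/singularity_cache'
```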
Step 3 – Run again
nextflow run main.nf \
--input_file ./assets/test_input.tsv \
--reference_genome /processing_data/reference_datasets/iGenomes/2025.1/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa \
--outdir test_GRCh38 \
-resume
What is happening now?
Each process is executed inside a Singularity container instead of a Conda environment.
Exercise 4 – Run on an HPC cluster
What is an HPC?
An HPC (High Performance Computing) cluster is a group of machines managed by a scheduler (e.g. SLURM). Jobs are submitted to queues and executed with specific resource requests (CPUs, memory, time).
Nextflow interacts with the scheduler through the executor.
Step 1 – Configure SLURM in nextflow.config
process.executor = "slurm"
process.queue = "cpuq"
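The same process scope also accepts per-process overrides via selectors, so scheduler settings and resources can stay together in one place. A sketch using a process name from this tutorial:

```groovy
process {
    executor = 'slurm'
    queue    = 'cpuq'

    // Override settings for a single process by name
    withName: 'GATK4_CREATESEQUENCEDICTIONARY' {
        cpus   = 4
        memory = '8 GB'
    }
}
```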
Step 2 – Run on the cluster
nextflow run main.nf \
--input_file ./assets/test_input.tsv \
--reference_genome /processing_data/reference_datasets/iGenomes/2025.1/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa \
--outdir test_GRCh38 \
-resume
Monitor your jobs
squeue -u name.surname
Output
You will likely see:
Command exit status:
247
Question
What resources are allocated to each process?
Explanation
If no resources are defined, Nextflow falls back to defaults:
- 1 CPU
- no explicit memory request, so the scheduler's own default applies (often ~1–2 GB)
This is not enough for GATK, so the job is killed by the scheduler (OOM kill).
Solution – Set process resources
Modify the process definition:
process GATK4_CREATESEQUENCEDICTIONARY {
conda "${moduleDir}/environment.yml"
publishDir "${params.outdir}/genome_index", mode: params.publish_mode
container 'community.wave.seqera.io/library/gatk4_gcnvkernel:edb12e4f0bf02cd3'
cpus 4
memory '8 GB'
...
}
What changed?
Now Nextflow will request:
- 4 CPUs
- 8 GB RAM
from SLURM, and the job will run successfully.
Exercise 5 – Use params.yaml to centralize parameters
Why use params.yaml?
When pipelines grow, command lines become long and error-prone:
nextflow run main.nf \
--input_file ... \
--reference_genome ... \
--outdir ...
To improve reproducibility and readability, we can store all parameters in a YAML file.
Step 1 – Create params.yaml
Example:
input_file: ./assets/test_input.tsv
reference_genome: /processing_data/reference_datasets/iGenomes/2025.1/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa
outdir: test_GRCh38
publish_mode: copy
Step 2 – Run using the params file
nextflow run main.nf -params-file params.yaml
What happens internally?
Nextflow loads all values from params.yaml and assigns them to:
params.input_file
params.reference_genome
params.outdir
params.publish_mode
Advantages
- Cleaner command line
- Easier reproducibility
- Easy to share configurations between users
- Allows multiple parameter sets for different experiments
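Note the precedence: values in params.yaml override defaults declared in nextflow.config, and command-line --flags override both. A sketch of config-level defaults (values are examples):

```groovy
// Defaults in nextflow.config; -params-file values override these,
// and command-line flags like --outdir override both
params {
    outdir       = 'results'
    publish_mode = 'copy'
}
```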
Exercise 6 – Run the full workflow (FASTP, BWA, SAMTOOLS)
Now we enable the full workflow by uncommenting the related processes in main.nf.
We will use containers for each tool.
Process containers
FASTP
container 'quay.io/biocontainers/fastp:1.0.1--heae3180_0'
ALIGNMENT: BWA_MEM
container 'community.wave.seqera.io/library/bwa_htslib_samtools:83b50ff84ead50d0'
ALIGNMENT: SAMTOOLS_MERGE
container 'quay.io/biocontainers/samtools:1.22--h96c455f_0'
Run the full workflow
nextflow run main.nf -params-file params.yaml -resume
Expected issue
Some processes will fail with:
Command exit status:
247
Why?
Exit status 247 indicates that the scheduler killed the process, usually due to:
- Insufficient memory
- Insufficient CPUs
- Time limit exceeded
In our case, it is mainly memory.
Fix: increase resources
Example:
process FASTP {
...
cpus 2
memory '4 GB'
...
}
process BWA_MEM {
...
cpus 4
memory '8 GB'
...
}
process SAMTOOLS_MERGE {
...
cpus 2
memory '4 GB'
...
}
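Instead of editing each process, the same resources can be set centrally in nextflow.config with withName selectors. A sketch using the process names above:

```groovy
process {
    withName: 'FASTP' {
        cpus   = 2
        memory = '4 GB'
    }
    withName: 'BWA_MEM' {
        cpus   = 4
        memory = '8 GB'
    }
    withName: 'SAMTOOLS_MERGE' {
        cpus   = 2
        memory = '4 GB'
    }
}
```

This keeps resource tuning out of the pipeline code entirely, which is exactly the portability point of this section.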
Key lesson
Containers provide software, but resources are controlled only by Nextflow.
Containers do NOT manage memory or CPU limits.
Exercise 7 – How resources are handled inside modules
We now analyze the BWA_MEM module.
Inside the process script you will see something like:
bwa mem -t ${task.cpus} ref.fa reads.fq > output.sam
What is task.cpus?
task.cpus is automatically set by Nextflow from the cpus directive in the process definition.
So:
cpus 8
becomes:
-t 8
inside the command.
What about task.memory?
task.memory contains the memory assigned to the job, for example:
memory '16 GB'
Inside the script you can use:
echo "Memory assigned: ${task.memory}"
Some tools accept an explicit memory option; others benefit only indirectly, for example through Java heap size or buffer allocation.
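As an example of passing both values to a tool, samtools sort accepts a thread count and a per-thread memory limit. This is a sketch, not one of the tutorial modules; the per-thread split is an illustration:

```groovy
process SAMTOOLS_SORT_EXAMPLE {
    cpus 4
    memory '8 GB'

    input:
    path bam

    output:
    path 'sorted.bam'

    script:
    // Divide total memory across sort threads; toGiga() converts
    // task.memory to an integer number of gigabytes
    def mem_per_thread = (task.memory.toGiga() / task.cpus) as int
    """
    samtools sort -@ ${task.cpus} -m ${mem_per_thread}G -o sorted.bam ${bam}
    """
}
```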
Why this is important
This means:
- The module adapts automatically to different resource settings.
- The same module can run on laptop, cluster, or cloud.
- Resource tuning is separated from pipeline logic.
Best practice
Modules should always use:
- task.cpus
- task.memory
- task.time
instead of hard-coding values.
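A common extension of this practice is to make resources dynamic: with a closure, the request can grow on each retry via task.attempt. A sketch:

```groovy
process EXAMPLE {
    // Retry failed tasks up to two extra times, increasing memory each
    // attempt: 4 GB on the first try, 8 GB on the second, 12 GB on the third
    errorStrategy 'retry'
    maxRetries 2
    memory { 4.GB * task.attempt }
    cpus 2
    ...
}
```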
Extra Exercise – Configure Singularity runtime properly
On HPC systems, Singularity requires careful configuration.
Add to nextflow.config:
singularity.cacheDir = "/path/.singularity"
process.scratch = '$TMPDIR'
process.beforeScript = 'module load singularity/3.8.5'
process.containerOptions = '-B /localscratch'
Explanation
singularity.cacheDir
Defines where Singularity stores images. This avoids filling up home directories.
process.scratch
Tells Nextflow to run each process in a temporary directory for better I/O performance.
process.beforeScript
Loads the Singularity module before executing each process.
process.containerOptions
Binds local scratch storage into the container for fast temporary I/O.
Why this matters
Without this configuration:
- Containers may fail to pull images.
- Disk quotas may be exceeded.
- Performance may be very poor.
Final takeaway
After these exercises you should understand:
| Topic | You learned |
|---|---|
| Parameters | CLI vs params.yaml |
| Environments | Conda vs Singularity |
| Containers | Tool isolation |
| Resources | cpus, memory, task variables |
| HPC | Scheduler interaction |
| Modules | Portable design |
| Caching | Resume and reuse |
| Scratch | Performance optimization |
With Nextflow configuration files you can:
- Switch environments (Conda, Singularity, HPC) without touching pipeline code.
- Control resources centrally.
- Make pipelines portable, reproducible, and scalable.