2. Developing an nf-core Pipeline from Scratch
This section covers the practical steps to develop a new nf-core pipeline using the standardized nf-core template and tools.
Exercise 1: Creating a Pipeline with nf-core Template
Objective
Learn how to initialize a new nf-core pipeline using the template generator and understand the project structure.
1.1: Launch GitHub Codespaces
GitHub Codespaces provides a cloud-based development environment that eliminates the need for local setup. This is particularly convenient for Nextflow and nf-core development.
Starting your codespace:
Navigate to https://codespaces.new/htgenomeanalysisunit/nextflow-zero2hero to launch a new codespace. GitHub will automatically create a cloud-based virtual machine with a pre-configured development environment.
Accessing your codespace:
You have two options for accessing the codespace:
- Browser-based VSCode — Codespaces opens directly in your browser with a full VSCode interface. This requires no additional setup and works from any device with a web browser.
- Local VSCode with Codespaces extension — Install the "GitHub Codespaces" extension in your local VSCode. This allows you to connect to the remote codespace while working in your familiar local environment, with better performance on slower connections.
Setting up your working directory:
Once the codespace is running, open the terminal and navigate to the course practicals folder where we will complete the exercises:
cd /workspaces/nextflow-zero2hero/practicals/day4/1-nf-core-introduction/
The repository includes all necessary tools and dependencies pre-installed, including Nextflow, Docker, and nf-core tools. nf-core tools is a Python package developed by the nf-core community that provides utilities for Nextflow pipeline development and execution.
1.2: Install VSCode Extensions
Install recommended extensions for VSCode. Go to the extensions marketplace and look for nf-core-extensionpack. This includes:
- Apptainer/Singularity — Provides syntax highlighting for Apptainer/Singularity definition files
- Docker — Makes it easy to create, manage, and debug containerized applications
- EditorConfig — Support for EditorConfig project files for code standardization
- gitignore — Language support for .gitignore files
- Markdown Extended — Provides nice markdown previews, including admonitions
- Nextflow — Nextflow language support
- Prettier — Code formatter using Prettier
- Rainbow CSV — Highlight columns in CSV files in different colors
- Ruff — An extremely fast Python linter and code formatter, written in Rust
- Todo Tree — Show TODO, FIXME, etc. comment tags in a tree view
- YAML — YAML Language Support by Red Hat, with built-in Kubernetes syntax support
1.3: Create a New Pipeline
Use the nf-core template generator:
nf-core pipelines create
This will open an interactive prompt that you can use to customize the new pipeline:
Interactive prompts:
- Pipeline type: "Custom"
- GitHub organization: your GitHub ID
- Workflow name: "pseudoalign"
- Short Description: a sentence on the pipeline purpose
- Author: Your name
- Template features: Toggle all features
- First version of the pipeline: choose a version tag (use semantic versioning)
- Path: "."
- Create GitHub repository: "Finish without creating a repo"
Navigate to the created pipeline:
cd *-pseudoalign
1.4: Explore the Template Structure
List the main directories:
tree -L 1 .
Key directories and files:
├── CHANGELOG.md
├── CITATIONS.md
├── LICENSE
├── README.md
├── assets
├── conf
├── docs
├── main.nf
├── modules
├── modules.json
├── nextflow.config
├── nextflow_schema.json
├── nf-test.config
├── ro-crate-metadata.json
├── subworkflows
├── tests
├── tower.yml
└── workflows
The template includes many tools, files, and directories, which can feel overwhelming at first, especially for those who are new to Nextflow. I recommend approaching it step by step: take time to become familiar with the overall structure and study existing nf-core pipelines that match your interests (preferably simple and actively maintained ones).
Exercise 2: Setting Up the Development Environment
Objective
Configure VSCode with recommended extensions and set up pre-commit hooks for code quality.
2.1: Examine Key Files Using the Nextflow Extension
The Nextflow extension provides capabilities that help navigate a structured project like the nf-core template. One of the main features is the ability to follow links and import statements within the code and view popups that show the definitions of interfaces for processes and subworkflows.
Main workflow entry point:
Open file main.nf and follow the code from main.nf to workflows/pseudoalign.nf and from workflows/pseudoalign.nf to modules and subworkflows.
Configuration entry point:
Open file nextflow.config and follow the main configuration through the various files included in the conf/ folder.
2.2: Git Configuration
The nf-core template comes initialized with git revision tracking:
git status
You can see from the log that the initial commit is the template:
git log
There are already three different branches:
git branch
We will get into their meaning and usage later.
2.3: Set Up Pre-commit Hooks
Pre-commit hooks automatically validate code before commits. The nf-core template includes a pre-commit configuration.
View the pre-commit configuration:
cat .pre-commit-config.yaml
Expected hooks:
- prettier — an "opinionated code formatter"
- Trailing whitespace removal
- End-of-file fixer
Next, install the hooks from the terminal. First, install the pre-commit Python package:
pip install pre-commit
Then install the pre-commit hooks:
pre-commit install
Verify the installation:
pre-commit run --all-files
This will run all pre-commit checks on the entire repository.
Exercise 3: Understanding Modules and Subworkflows
Objective
Learn about nf-core modules and subworkflows, and how to integrate them into your pipeline.
Background
Modules:
- Self-contained code that defines a single Nextflow process
- Reusable across pipelines
- Maintained in the nf-core/modules repository
- Include: process definition, software container, documentation, and testing
Subworkflows:
- Multi-step workflows combining multiple modules
- Reusable logical components
- Maintained in the nf-core/modules repository
- Include: subworkflow definition, module dependencies, documentation, and testing
For simplicity, we will focus on modules. However, the concepts and commands involved are quite similar.
3.1: Explore Module Structure
Modules are stored in the modules/nf-core/ folder. Navigate to this folder with VSCode. The template by default contains the FastQC and MultiQC modules.
Open the folder for FastQC.
Typical module structure:
modules/nf-core/fastqc/
├── main.nf # Process definition
├── meta.yml # Module metadata and documentation
├── environment.yml # Conda environment that is built on the fly
| # when running the pipeline with conda support
└── tests/ # nf-test configuration to test the module
Open the module's main.nf file. You will see that even for a simple task like FastQC, the process code can be quite complex. We will not go through building modules according to nf-core guidelines, which is an involved process. However, keep in mind that although the guidelines seem complex, they exist to ensure the highest level of reusability, such as the ability to completely customize the tool's command line and capture any possible output. This is of course challenging and requires some overhead compared to writing processes for a specific workflow.
3.2: Installing Required Modules
For our Salmon-based RNA-seq pipeline, we need:
- FASTQC: Quality control
- SALMON: Pseudo-alignment and quantification
- MULTIQC: Results aggregation
Double-check which modules are already installed using the nf-core command-line tool:
nf-core modules list local
Browse available modules from the nf-core repository and search for Salmon:
nf-core modules list remote | grep -i salmon
You will see:
│ salmon/index │
│ salmon/quant │
Install the Salmon quant module:
nf-core modules install salmon/quant
3.3: Including the Module in the Nextflow Code
Now we will modify the pseudoalign.nf file that contains the main workflow:
--- a/workflows/pseudoalign.nf
+++ b/workflows/pseudoalign.nf
@@ -9,6 +9,7 @@ include { paramsSummaryMap } from 'plugin/nf-schema'
include { paramsSummaryMultiqc } from '../subworkflows/nf-core/utils_nfcore_pipeline'
include { softwareVersionsToYAML } from '../subworkflows/nf-core/utils_nfcore_pipeline'
include { methodsDescriptionText } from '../subworkflows/local/utils_nfcore_pseudoalign_pipeline'
+include { SALMON_QUANT } from '../modules/nf-core/salmon/quant/main'
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -20,6 +21,10 @@ workflow pseudoalign {
take:
ch_samplesheet // channel: samplesheet read in from --input
+ ch_salmon_index
+ ch_fasta
+ ch_gtf
main:
ch_versions = channel.empty()
@@ -33,6 +38,20 @@ workflow pseudoalign {
ch_multiqc_files = ch_multiqc_files.mix(FASTQC.out.zip.collect{it[1]})
ch_versions = ch_versions.mix(FASTQC.out.versions.first())
+ //
+ // MODULE: Run Salmon Quant
+ //
+ SALMON_QUANT (
+ ch_samplesheet,
+ ch_salmon_index,
+ ch_gtf,
+ ch_fasta,
+ "",
+ ""
+ )
+ ch_multiqc_files = ch_multiqc_files.mix(SALMON_QUANT.out.results.collect{it[1]})
+
//
// Collate and save software versions
//
3.4: Module Configuration
When including an additional process in the pipeline, it is often necessary to customize its behavior by specifying additional arguments, defining which files to save to the final output and where to save them (via the publishDir directive), and potentially other process-specific configurations. In an nf-core pipeline, it is customary to save this type of configuration in the conf/modules.config file.
For the SALMON_QUANT process, add this block to the conf/modules.config file:
process {
withName: 'SALMON_QUANT' {
ext.args = '--validateMappings'
publishDir = [
path: { "${params.outdir}/salmon" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}
}
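The saveAs closure deserves a note: returning null for a filename tells Nextflow not to publish that file at all, which is how versions.yml is kept out of the results folder. In plain Python terms (an illustration of the closure's logic, not pipeline code):

```python
def save_as(filename):
    # Returning None (null in Groovy) suppresses publishing of that file;
    # any other return value is used as the destination name.
    return None if filename == "versions.yml" else filename

print(save_as("quant.sf"))      # quant.sf
print(save_as("versions.yml"))  # None
```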
Exercise 4: Customizing Pipeline Input — Parameters and Schema
Objective
Learn how to define pipeline parameters and update the JSON schema for parameter validation.
4.1: Adding Reference Files as Input Parameters
For running Salmon, we need to provide a FASTA file of the transcriptome, the corresponding GTF file, and the Salmon index for this transcriptome. These files are not directly provided by iGenomes, so we will add these files as required input parameters for the pipeline.
New parameters need to be initialized in the nextflow.config configuration file:
diff --git a/nextflow.config b/nextflow.config
index be8624f..a1f38dd 100644
--- a/nextflow.config
+++ b/nextflow.config
@@ -17,6 +17,9 @@ params {
genome = null
igenomes_base = 's3://ngi-igenomes/igenomes/'
igenomes_ignore = false
+ transcript_fasta = null
+ gtf = null
+ salmon_index = null
// MultiQC options
multiqc_config = null
Now the parameters can be used through the params object. For readability, we want to explicitly show the use of these files in the interface of the main "pseudoalign" workflow. We will create value channels in the main and then pass them explicitly to pseudoalign (in accordance with the interface we have already defined):
diff --git a/main.nf b/main.nf
index 4c29fc8..6787b78 100644
--- a/main.nf
+++ b/main.nf
@@ -45,11 +45,18 @@ workflow NFDATAOMICS_PSEUDOALIGN {
main:
+ ch_transcript_fasta = channel.value(file(params.transcript_fasta, checkIfExists: true))
+ ch_gtf = channel.value(file(params.gtf, checkIfExists: true))
+ ch_salmon_index = channel.value(file(params.salmon_index, checkIfExists: true))
+
//
// WORKFLOW: Run pipeline
//
pseudoalign (
- samplesheet
+ samplesheet,
+ ch_salmon_index,
+ ch_transcript_fasta,
+ ch_gtf
)
emit:
multiqc_report = pseudoalign.out.multiqc_report // channel: /path/to/multiqc_report.html
4.2: Update the JSON Schema
In an nf-core pipeline, all input parameters are initialized and validated at the beginning of execution using a JSON schema contained in the nextflow_schema.json file. JSON Schema is a vocabulary for describing and validating JSON data: it defines the structure, data types, and constraints of a document, so any object can be checked for conformance to the expected format.
Direct manipulation of a JSON schema is not easy. Therefore, nf-core provides an interactive web-based platform for updating and modifying the schema:
nf-core pipelines schema build
The tool outputs a URL that points to a web-based interface where the schema can be edited. Note that this tool can be somewhat buggy and relies on communication with an external service. The nf-core core development team is currently working on a new tool.
An existing schema can be validated with:
nf-core pipelines schema validate . nextflow_schema.json
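To build intuition for what the schema enforces, here is a tiny stdlib-only Python sketch of the kinds of checks a parameter schema expresses (required keys, types, patterns). This is a hypothetical simplification, not the real nf-schema plugin; the parameter names and the input pattern are illustrative assumptions:

```python
import re

# Hypothetical, simplified re-implementation of the checks that the
# nf-schema plugin performs against nextflow_schema.json.
SCHEMA = {
    "required": ["input", "outdir"],
    "properties": {
        "input": {"type": "string", "pattern": r"^\S+\.csv$"},
        "outdir": {"type": "string"},
        "salmon_index": {"type": "string"},
    },
}

def validate(params):
    errors = []
    for key in SCHEMA["required"]:
        if key not in params:
            errors.append(f"missing required parameter: --{key}")
    for key, rules in SCHEMA["properties"].items():
        if key not in params:
            continue
        value = params[key]
        if rules.get("type") == "string" and not isinstance(value, str):
            errors.append(f"--{key} must be a string")
        pattern = rules.get("pattern")
        if pattern and isinstance(value, str) and not re.match(pattern, value):
            errors.append(f"--{key} does not match pattern {pattern}")
    return errors

print(validate({"input": "samplesheet.csv", "outdir": "./results"}))  # []
print(validate({"outdir": "./results"}))  # ['missing required parameter: --input']
```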
Exercise 5: Customizing Pipeline Input — Samplesheet Structure
Objective
Learn how to design and validate sample input CSV files for the pipeline.
5.1: Design Samplesheet Format
In addition to input parameters, the format of the samplesheet is also validated using a JSON schema. In the template, the schema is already present at assets/schema_input.json and defines the columns: sample, fastq_1, and fastq_2.
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://raw.githubusercontent.com/matbonfanti/pseudoalign/master/assets/schema_input.json",
"title": "matbonfanti/pseudoalign pipeline - params.input schema",
"description": "Schema for the file provided with params.input",
"type": "array",
"items": {
"type": "object",
"properties": {
"sample": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Sample name must be provided and cannot contain spaces",
"meta": ["id"]
},
"fastq_1": {
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^([\\S\\s]*\\/)?[^\\s\\/]+\\.f(ast)?q\\.gz$",
"errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
},
"fastq_2": {
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^([\\S\\s]*\\/)?[^\\s\\/]+\\.f(ast)?q\\.gz$",
"errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
}
},
"required": ["sample", "fastq_1"]
}
}
For our pipeline, we want to modify the schema to add a strandedness column that specifies the strandedness of the RNA library for the corresponding FASTQ file:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://raw.githubusercontent.com/matbonfanti/pseudoalign/master/assets/schema_input.json",
"title": "matbonfanti/pseudoalign pipeline - params.input schema",
"description": "Schema for the file provided with params.input",
"type": "array",
"items": {
"type": "object",
"properties": {
"sample": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Sample name must be provided and cannot contain spaces",
"meta": ["id"]
},
"fastq_1": {
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^([\\S\\s]*\\/)?[^\\s\\/]+\\.f(ast)?q\\.gz$",
"errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
},
"fastq_2": {
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^([\\S\\s]*\\/)?[^\\s\\/]+\\.f(ast)?q\\.gz$",
"errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
},
"strandedness": {
"type": "string",
"enum": ["unstranded", "forward", "reverse"],
"errorMessage": "Library strandedness must be provided and cannot contain spaces",
"meta": ["strandedness"]
}
},
"required": ["sample", "fastq_1", "strandedness"]
}
}
Note the "meta" attribute, which allows storing a variable directly in the meta object.
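To see how these rules behave on concrete values, here is a plain-Python sketch using the fastq_1 pattern and the strandedness enum from the schema above (an illustration, not the pipeline's actual validator):

```python
import re

# fastq_1 pattern and strandedness enum taken from assets/schema_input.json
FASTQ_PATTERN = r"^([\S\s]*/)?[^\s/]+\.f(ast)?q\.gz$"
STRANDEDNESS = {"unstranded", "forward", "reverse"}

def check_row(row):
    ok_fastq = re.match(FASTQ_PATTERN, row["fastq_1"]) is not None
    ok_strand = row["strandedness"] in STRANDEDNESS
    return ok_fastq and ok_strand

print(check_row({"fastq_1": "data/reads_R1.fastq.gz", "strandedness": "reverse"}))  # True
print(check_row({"fastq_1": "reads_R1.fastq", "strandedness": "reverse"}))          # False: missing .gz
print(check_row({"fastq_1": "reads_R1.fq.gz", "strandedness": "both"}))             # False: bad enum value
```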
5.2: Check the Nextflow Code for Samplesheet Validation and Channel Initialization
The samplesheet validation code is in the file subworkflows/local/utils_nfcore_pseudoalign_pipeline/main.nf. This file, which is part of the template, contains subworkflows and Groovy functions used for pipeline initialization and completion, as well as functions for printing pipeline documentation. It is meant to be customized (as opposed to utility functions in the subworkflows/nf-core folder that should remain unchanged).
The channel initialization for the samplesheet is at lines 87–105:
//
// Create channel from input file provided through params.input
//
channel
.fromList(samplesheetToList(params.input, "${projectDir}/assets/schema_input.json"))
.map {
meta, fastq_1, fastq_2 ->
if (!fastq_2) {
return [ meta.id, meta + [ single_end:true ], [ fastq_1 ] ]
} else {
return [ meta.id, meta + [ single_end:false ], [ fastq_1, fastq_2 ] ]
}
}
.groupTuple()
.map { samplesheet ->
validateInputSamplesheet(samplesheet)
}
.map {
meta, fastqs ->
return [ meta, fastqs.flatten() ]
}
.set { ch_samplesheet }
Since strandedness is processed by the samplesheet parser and stored at this level as an attribute of the meta object, there is no need to change anything in the channel operations.
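The first .map closure above is easier to follow outside of Nextflow. This plain-Python sketch (an illustration, not code from the pipeline) shows how a row lacking fastq_2 is flagged as single-end:

```python
# Mimics the first .map of the channel above: meta is the map built from the
# samplesheet columns tagged with "meta" in the schema (id, strandedness, ...).
def to_channel_entry(meta, fastq_1, fastq_2):
    if not fastq_2:
        return [meta["id"], {**meta, "single_end": True}, [fastq_1]]
    return [meta["id"], {**meta, "single_end": False}, [fastq_1, fastq_2]]

paired = to_channel_entry({"id": "s1", "strandedness": "reverse"},
                          "s1_R1.fastq.gz", "s1_R2.fastq.gz")
single = to_channel_entry({"id": "s3", "strandedness": "forward"},
                          "s3_R1.fastq.gz", None)
print(paired[1]["single_end"], len(paired[2]))  # False 2
print(single[1]["single_end"], len(single[2]))  # True 1
```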
5.3: Create Example Samplesheet
Create a template samplesheet in the assets/ folder for documenting the samplesheet format:
sample,fastq_1,fastq_2,strandedness
sample1,reads_1_R1.fastq.gz,reads_1_R2.fastq.gz,reverse
sample2,reads_2_R1.fastq.gz,reads_2_R2.fastq.gz,reverse
sample3,reads_3_R1.fastq.gz,,forward
Exercise 6: Create a Custom Module
Objective
Add a module to the pipeline without using an nf-core module installation.
6.1: Create a New Module to Untar Salmon Index
Currently, the channel initialized with params.salmon_index is passed directly to Salmon, which expects a folder containing the index. It would be convenient to also accept a tar.gz archive. For example, when Nextflow stages remote input files from a URL, the URL must point to a single file; an entire folder cannot be staged this way.
Instead of using the nf-core module (nf-core/untar), we will create a local module from scratch. For local modules, it is not necessary to structure files in the complex way required for nf-core official modules.
Create a folder modules/local/untar_salmon_index with a main.nf file containing the process code:
process UNTAR_SALMON_INDEX {
tag "${archive}"
label 'process_single'
conda "conda-forge::sed=4.7 bioconda::grep=3.4 conda-forge::tar=1.34"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/ubuntu:20.04' :
'nf-core/ubuntu:20.04' }"
input:
path archive
output:
path "${prefix}", emit: untar
path "versions.yml", emit: versions
when:
task.ext.when == null || task.ext.when
script:
def args = task.ext.args ?: ''
prefix = archive.baseName.toString().replaceFirst(/\.tar$/, "")
"""
mkdir ${prefix}
tar \\
-C ${prefix} --strip-components 1 \\
-xavf ${args} \\
${archive}
cat <<-END_VERSIONS > versions.yml
"${task.process}":
untar: \$(echo \$(tar --version 2>&1) | sed 's/^.*(GNU tar) //; s/ Copyright.*\$//')
END_VERSIONS
"""
stub:
prefix = archive.baseName.toString().replaceFirst(/\.tar$/, "")
"""
mkdir ${prefix}
cat <<-END_VERSIONS > versions.yml
"${task.process}":
untar: \$(echo \$(tar --version 2>&1) | sed 's/^.*(GNU tar) //; s/ Copyright.*\$//')
END_VERSIONS
"""
}
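One detail worth unpacking is how prefix is derived: in Groovy, archive.baseName strips the final extension, and replaceFirst(/\.tar$/, "") then removes a remaining ".tar". The Python sketch below reproduces that logic (an illustration, not the module code):

```python
import re

def untar_prefix(archive_name):
    base = archive_name.rsplit(".", 1)[0]   # baseName: "salmon.tar.gz" -> "salmon.tar"
    return re.sub(r"\.tar$", "", base)      # drop trailing ".tar":      -> "salmon"

print(untar_prefix("salmon.tar.gz"))  # salmon
print(untar_prefix("index.tar"))      # index
```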
6.2: Configure and Integrate the Module in the Workflow
In nf-core pipelines, it is customary to never include publishDir directives in module code. This is because such directives are typically pipeline-specific, and the nf-core template is designed to maximize module reusability.
First, add the pipeline-specific configuration in modules.config. Add this snippet in the process block to prevent the output of the untar operation from being written to the output folder:
withName: 'UNTAR_SALMON_INDEX' {
publishDir = [
enabled: false
]
}
Now integrate the new process into the pipeline workflow with the following modifications:
@@ -8,11 +8,10 @@
include { paramsSummaryMap } from 'plugin/nf-schema'
include { paramsSummaryMultiqc } from '../subworkflows/nf-core/utils_nfcore_pipeline'
include { softwareVersionsToYAML } from '../subworkflows/nf-core/utils_nfcore_pipeline'
include { methodsDescriptionText } from '../subworkflows/local/utils_nfcore_pseudoalign_pipeline'
include { SALMON_QUANT } from '../modules/nf-core/salmon/quant/main'
+include { UNTAR_SALMON_INDEX } from '../modules/local/untar_salmon_index/main'
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RUN MAIN WORKFLOW
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -38,26 +37,15 @@
)
ch_multiqc_files = ch_multiqc_files.mix(FASTQC.out.zip.collect{it[1]})
ch_versions = ch_versions.mix(FASTQC.out.versions.first())
+ //
+ // MODULE: Untar Salmon Index when needed
+ //
+ if ( params.salmon_index.endsWith('.tar.gz') ) {
+ UNTAR_SALMON_INDEX ( ch_salmon_index )
+ ch_salmon_index_folder = UNTAR_SALMON_INDEX.out.untar
+ ch_versions = ch_versions.mix(UNTAR_SALMON_INDEX.out.versions)
+ } else {
+ ch_salmon_index_folder = ch_salmon_index
+ }
+
//
// MODULE: Run Salmon Quant
//
SALMON_QUANT (
ch_samplesheet,
- ch_salmon_index,
+ ch_salmon_index_folder,
ch_gtf,
ch_fasta,
"",
""
)
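Note that this if/else works only because params.salmon_index is a plain string known before execution starts, so the branch is resolved once at workflow-construction time rather than per channel item. The decision itself is simple (a plain-Python illustration with hypothetical paths):

```python
def needs_untar(salmon_index):
    # Only ".tar.gz" archives go through UNTAR_SALMON_INDEX; anything else
    # is assumed to already be an index folder.
    return salmon_index.endswith(".tar.gz")

print(needs_untar("https://example.com/refs/salmon.tar.gz"))  # True
print(needs_untar("/references/salmon_index"))                # False
```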
Exercise 7: Define a Test Run for the Pipeline
Objective
Set up and execute a test run of the pipeline to verify functionality.
7.1: Define and Launch a Test Run of the Pipeline
For running a test, we can reuse the input from the previous section of the training ("Configuring and Launching nf-core Pipelines").
Create a folder named /workspaces/nextflow-zero2hero/practicals/day4/1-nf-core-introduction/rnaseq_test_02:
mkdir -p ../rnaseq_test_02
cd ../rnaseq_test_02
Then create samplesheet.csv, params.yaml, and custom.config. First, samplesheet.csv:
sample,fastq_1,fastq_2,strandedness
SRR6357070_2,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz,reverse
SRR6357071_2,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz,reverse
Next, params.yaml:
# Input/Output
input: './samplesheet.csv'
outdir: './results'
# Genome references
gtf: 'https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/genes.gtf.gz'
transcript_fasta: 'https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/transcriptome.fasta'
salmon_index: 'https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/salmon.tar.gz'
Finally, custom.config:
// Custom pipeline configuration
process {
resourceLimits = [
cpus: 1,
memory: '4.GB',
time: '1.h'
]
}
Run the pipeline with the command:
nextflow run ../*-pseudoalign -profile docker -params-file params.yaml -c custom.config -w ./work_test
Exercise 8: Code Linting and Testing with nf-test
Objective
Learn how to validate pipeline code quality and write automated tests.
8.1: Run nf-core Lint
Go back to the pipeline folder:
cd ../*-pseudoalign
Check pipeline compliance with nf-core standards:
nf-core pipelines lint
The tool runs a number of named tests that span most of the nf-core guidelines, including:
- Inconsistencies with the pipeline template
- Incorrect module code and format
- Schema validation issues
- Incorrect MultiQC configuration
- And more
8.2: Create Test Profile
The nf-core template includes a profile where the pipeline developer can design a short test run to verify pipeline functionality and ensure that changes do not break the pipeline.
We will now use the quick test that we just ran and include it in the test profile.
First, copy the test samplesheet into the assets folder:
cp ../rnaseq_test_02/samplesheet.csv assets/test_samplesheet.csv
Next, customize the conf/test.config file, which contains the profile definition with the parameters and configuration needed to run the test:
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nextflow config file for running minimal tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Defines input files and everything required to run a fast and simple pipeline test.
Use as follows:
nextflow run .../pseudoalign -profile test,<docker/singularity> --outdir <OUTDIR>
----------------------------------------------------------------------------------------
*/
process {
resourceLimits = [
cpus: 1,
memory: '4.GB',
time: '1.h'
]
}
params {
config_profile_name = 'Test profile'
config_profile_description = 'Minimal test dataset to check pipeline function'
// Input data
input = "${projectDir}/assets/test_samplesheet.csv"
gtf = 'https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/genes.gtf.gz'
transcript_fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/transcriptome.fasta'
salmon_index = 'https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/salmon.tar.gz'
}
You can now run the test using the additional profile:
cd ../rnaseq_test_02
nextflow run ../*-pseudoalign -profile docker,test --outdir output -w ./work_test -resume
8.3: Run nf-test
The Nextflow ecosystem includes a testing framework called nf-test that allows advanced checks on pipeline runs, ensuring that test runs remain consistent at many levels (run execution success, number and paths of output files, checksums of output files). Setting up nf-test can be complicated, but the pipeline template already ships with a default nf-test corresponding to the test profile.
However, there is a challenge: Salmon does not produce deterministic results, so a test that requires consistent checksums for its output is destined to fail.
To address this, configure nf-test to ignore the checksums of files dependent on Salmon quantification. Add these lines to tests/.nftignore:
multiqc/multiqc_data/multiqc_salmon.txt
multiqc/multiqc_data/salmon_plot.txt
salmon/*/aux_info/fld.gz
salmon/*/aux_info/meta_info.json
salmon/*/libParams/flenDist.txt
salmon/*/logs/salmon_quant.log
salmon/*/quant.genes.sf
salmon/*/quant.sf
salmon/salmon.*
salmon/*meta_info.json
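Assuming .nftignore entries behave like shell-style globs matched against paths relative to the output directory, their effect can be sketched in Python (an illustration; the file paths are hypothetical):

```python
from fnmatch import fnmatch

IGNORE = [
    "salmon/*/quant.sf",
    "salmon/*/aux_info/meta_info.json",
]

def is_ignored(path):
    # Note: fnmatch's "*" also crosses "/" boundaries, which is convenient
    # here since sample directories sit one level below salmon/.
    return any(fnmatch(path, pattern) for pattern in IGNORE)

print(is_ignored("salmon/sample1/quant.sf"))                # True  -> checksum skipped
print(is_ignored("salmon/sample1/lib_format_counts.json"))  # False -> checksum compared
```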
Run the test with this command to create a snapshot of the output:
cd ../*-pseudoalign
nf-test test tests/default.nf.test --profile +docker --verbose --update-snapshot
The arguments are:
- --profile +docker — Use the Docker profile in addition to the default profile
- --verbose — Print detailed output
- --update-snapshot — Create or update snapshot files for comparison in future test runs
Exercise 9: Version Control and Pushing to GitHub
Objective
Prepare the pipeline for publication and continuous integration.
9.1: Commit Changes
The pipeline template comes initialized as a git repository. You can now commit all the changes made so far in an initial commit.
Check which files should be added (and which files should not):
git status
Create a new commit:
git commit
If the pre-commit hooks detect any formatting issues, the affected files will be fixed automatically. Stage the corrected files and repeat the commit.
9.2: Set Up GitHub Repository
Create a new repository on GitHub:
- Go to https://github.com/new
- Name it: pseudoalign
- Add a description
- Choose public or private visibility
- Create the repository
Connect the local repository to GitHub:
git remote add origin https://github.com/<YOUR-GITHUB-ID>/pseudoalign.git
git branch -M master
git push -u origin master
git push -u origin dev
git push -u origin TEMPLATE
9.3: Close GitHub codespace
When you've completed your work, close the codespace to conserve your free account budget. GitHub provides a limited number of free codespace hours monthly.
Stop the codespace:
- Go to https://github.com/codespaces
- Find your codespace in the list
- Click the three dots (...) menu next to it
- Select Stop codespace
Delete the codespace (optional):
If you won't need it again, delete it to free up storage:
- On https://github.com/codespaces, click the three dots menu
- Select Delete
Restart later:
You can resume a stopped codespace at any time from the same page. Your work will be preserved.
Summary Checklist
By completing these exercises, you should be able to:
- [ ] Create a new nf-core pipeline using the template generator
- [ ] Understand the template directory structure
- [ ] Configure VSCode with recommended extensions
- [ ] Set up and use pre-commit hooks
- [ ] Install and manage nf-core modules
- [ ] Add pipeline parameters and validate them with JSON schema
- [ ] Create and validate input samplesheet
- [ ] Run nf-core lint and nf-test
- [ ] Set up version control and push to GitHub
Resources
- nf-core website: https://nf-co.re/
- nf-core tools documentation: https://nf-co.re/tools
- Nextflow documentation: https://www.nextflow.io/docs/latest/
- nf-test documentation: https://www.nf-test.com/