Nextflow Course Workflow
This document outlines the steps to create a Nextflow workflow with one main workflow and three modules:
- FASTP: For cleaning FASTQ data (easy module).
- BWA_MEM: For aligning reads to a reference genome (challenging module).
- SAMTOOLS_MERGE: For merging reads after alignment (challenging module).
Part 1 Workflow
1. Creating the Main Workflow
- Create a main workflow that includes one module called FASTP.
- Define a channel to store FASTQ information from the file specified in `params.input_file`. This channel should include the `part` column from the TSV, which acts as a row counter / FASTQ set ID. The final channel, when inspected with `.view()`, should look like this:

```groovy
[sample_1, 1, /project/nextflow_zero2hero/data/NA12878/fastq/sample1/chunks/reads_R1.part_001.fastq, /project/nextflow_zero2hero/data/NA12878/fastq/sample1/chunks/reads_R2.part_001.fastq]
[sample_1, 2, /project/nextflow_zero2hero/data/NA12878/fastq/sample1/chunks/reads_R1.part_002.fastq, /project/nextflow_zero2hero/data/NA12878/fastq/sample1/chunks/reads_R2.part_002.fastq]
[sample_1, 3, /project/nextflow_zero2hero/data/NA12878/fastq/sample1/chunks/reads_R1.part_003.fastq, /project/nextflow_zero2hero/data/NA12878/fastq/sample1/chunks/reads_R2.part_003.fastq]
[sample_2, 1, /project/nextflow_zero2hero/data/NA12878/fastq/sample2/chunks/reads_R1.part_003.fastq, /project/nextflow_zero2hero/data/NA12878/fastq/sample2/chunks/reads_R2.part_003.fastq]
```

This channel will serve as the sole input for the FASTP module.
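The channel described above could be built roughly as follows. This is a minimal sketch, not the course solution: it assumes the TSV has a header row, and the column names `sample_id`, `fastq_R1`, and `fastq_R2` are hypothetical (only `part` is named in this document). The channel name `fastq_ch` is also an assumption.

```groovy
workflow {
    // Read the TSV row by row; header: true turns each row into a map keyed by column name.
    fastq_ch = Channel
        .fromPath(params.input_file, checkIfExists: true)
        .splitCsv(header: true, sep: '\t')
        .map { row ->
            // Column names other than 'part' are assumed for illustration.
            tuple(
                row.sample_id,
                row.part.toInteger(),   // row counter / FASTQ set ID
                file(row.fastq_R1),
                file(row.fastq_R2)
            )
        }

    // Print each tuple to check the channel structure.
    fastq_ch.view()
}
```

Converting the two FASTQ columns with `file()` lets Nextflow stage them as files when the tuple is passed to a process.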
2. Creating the FASTP Module
- Create a module called FASTP that:
- Takes as input a tuple with the structure shown above.
- Uses the following script to process the input and generate the output:

```bash
fastp \
    -i ${fastq_R1} -o ${sample_id}_${fastq_set_id}_R1_qced.fastq.gz \
    -I ${fastq_R2} -O ${sample_id}_${fastq_set_id}_R2_qced.fastq.gz \
    --json ${sample_id}_${fastq_set_id}_fastp.json \
    --html ${sample_id}_${fastq_set_id}_fastp.html \
    --thread 4
```

- Outputs two tuples, with file names matching the script above:
  - First tuple (the cleaned reads): `tuple val(sample_id), val(fastq_set_id), path("${sample_id}_${fastq_set_id}_R1_qced.fastq.gz"), path("${sample_id}_${fastq_set_id}_R2_qced.fastq.gz")`
  - Second tuple (the fastp JSON and HTML reports, not used for now): `tuple val(sample_id), val(fastq_set_id), path("*.json"), path("*.html")`
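A FASTP module along these lines might look like the sketch below. The `tag`, `cpus`, and `publishDir` directives are assumptions added for illustration; the publish path follows the `results/reads_qc/<sample>/fastp` layout used in this course.

```groovy
process FASTP {
    tag "${sample_id}_${fastq_set_id}"
    cpus 4   // matches the --thread 4 used in the script
    publishDir "results/reads_qc/${sample_id}/fastp", mode: 'copy'

    input:
    tuple val(sample_id), val(fastq_set_id), path(fastq_R1), path(fastq_R2)

    output:
    // First tuple: the cleaned paired-end reads.
    tuple val(sample_id), val(fastq_set_id), path("${sample_id}_${fastq_set_id}_R1_qced.fastq.gz"), path("${sample_id}_${fastq_set_id}_R2_qced.fastq.gz")
    // Second tuple: the fastp reports (not used downstream for now).
    tuple val(sample_id), val(fastq_set_id), path("${sample_id}_${fastq_set_id}_fastp.json"), path("${sample_id}_${fastq_set_id}_fastp.html")

    script:
    """
    fastp \
        -i ${fastq_R1} -o ${sample_id}_${fastq_set_id}_R1_qced.fastq.gz \
        -I ${fastq_R2} -O ${sample_id}_${fastq_set_id}_R2_qced.fastq.gz \
        --json ${sample_id}_${fastq_set_id}_fastp.json \
        --html ${sample_id}_${fastq_set_id}_fastp.html \
        --thread 4
    """
}
```

In the main workflow the module is then called with the FASTQ channel as its only argument, for example `FASTP(fastq_ch)` if the channel is named `fastq_ch`.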
3. Results: FASTQ
The results should be organized into two folders with the following structure:
- results/reads_qc/sample_1/fastp/sample_1_1_fastp ....
- results/reads_qc/sample_2/fastp/sample_2_1_fastp ....
Part 2 Workflow
1. Align and Merge FASTQ Files
- Expand the main workflow:
  - Set the object `reference_genome` from `params.reference_genome`.
  - Create a channel called `bwa_index_ch` that contains tuples with the five BWA index files for `reference_genome`: `genome.fa.amb`, `genome.fa.ann`, `genome.fa.bwt`, `genome.fa.pac`, and `genome.fa.sa`. All of these index files are in the same directory as `params.reference_genome`.
- Update the FASTP module:
  - Add an `emit` to output the first tuple of the FASTP module as `qced_reads`.
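One way to sketch these two changes is shown below. It assumes each index file is named after the reference FASTA plus an extension (for example `genome.fa` and `genome.fa.amb`), and it uses a value channel so the same index tuple can be reused by every alignment task; both choices are assumptions, not requirements stated by the course.

```groovy
// In the FASTP module, the first output line gains a name via emit:
//   tuple val(sample_id), val(fastq_set_id), path("${sample_id}_${fastq_set_id}_R1_qced.fastq.gz"), path("${sample_id}_${fastq_set_id}_R2_qced.fastq.gz"), emit: qced_reads

workflow {
    // Reference FASTA as a file object.
    reference_genome = file(params.reference_genome, checkIfExists: true)

    // The five BWA index files, collected into a single tuple.
    bwa_index_ch = Channel.value(
        ['amb', 'ann', 'bwt', 'pac', 'sa'].collect { ext ->
            file("${params.reference_genome}.${ext}", checkIfExists: true)
        }
    )

    // Print the tuple to confirm all five files were found.
    bwa_index_ch.view()
}
```

After the `emit` is added, the cleaned reads are available in the main workflow as `FASTP.out.qced_reads`.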
2. Creating the BWA_MEM Module
- Create a module called BWA_MEM that:
- Takes as input the `reference_genome` file, the `qced_reads` channel, and the `bwa_index_ch` channel.
- Outputs two tuples:
  - `val(sample_id), path("${sample_id}-${fastq_set_id}.bwa.bam")`
  - `val(sample_id), path("${sample_id}-${fastq_set_id}.bwa.log")`
- Uses the following script to process the input:

```bash
bwa mem -t 4 \
    -R "@RG\tID:${sample_id}\tSM:${sample_id}\tPL:Illumina" \
    ${reference_genome} \
    ${fastq_R1} ${fastq_R2} \
    2> ${sample_id}-${fastq_set_id}.bwa.log \
    | samtools view --threads 4 -Sb - > ${sample_id}-${fastq_set_id}.bwa.bam
```

The BWA output files should be published to the following folders:
- results/alignments/sample_1/bwa/
- results/alignments/sample_2/bwa/
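A BWA_MEM module matching this description might be sketched as follows. The input name `bwa_index`, the emit names `bam` and `log`, and the `tag`, `cpus`, and `publishDir` directives are assumptions for illustration. Note that inside a Nextflow `"""` script block, Groovy would turn `\t` into a literal tab, so the read-group string is written with `\\t` to pass the two-character sequence `\t` through to `bwa`.

```groovy
process BWA_MEM {
    tag "${sample_id}-${fastq_set_id}"
    cpus 4   // matches -t 4 and --threads 4 in the script
    publishDir "results/alignments/${sample_id}/bwa", mode: 'copy'

    input:
    path reference_genome
    tuple val(sample_id), val(fastq_set_id), path(fastq_R1), path(fastq_R2)
    path bwa_index   // the five index files; staged next to the FASTA in the task directory

    output:
    tuple val(sample_id), path("${sample_id}-${fastq_set_id}.bwa.bam"), emit: bam
    tuple val(sample_id), path("${sample_id}-${fastq_set_id}.bwa.log"), emit: log

    script:
    """
    bwa mem -t 4 \
        -R "@RG\\tID:${sample_id}\\tSM:${sample_id}\\tPL:Illumina" \
        ${reference_genome} \
        ${fastq_R1} ${fastq_R2} \
        2> ${sample_id}-${fastq_set_id}.bwa.log \
        | samtools view --threads 4 -Sb - > ${sample_id}-${fastq_set_id}.bwa.bam
    """
}
```

With the inputs declared in this order, the call in the main workflow would be `BWA_MEM(reference_genome, FASTP.out.qced_reads, bwa_index_ch)`.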
3. Creating the SAMTOOLS_MERGE Module
- Create a module called SAMTOOLS_MERGE that:
- Takes as input the output from the BWA_MEM module:

```groovy
val(sample_id), path("${sample_id}-${fastq_set_id}.bwa.bam")
```

- Outputs a tuple:

```groovy
val(sample_id), path("${sample_id}.merged_raw.bam")
```

- Uses the following script to merge BAM files:

```bash
samtools merge -n -@ ${task.cpus} -o ${sample_id}.merged_raw.bam ${bam_files}
```

The merged BAM files should be published to the following folders:
- results/alignments/sample_1/merged_bam/
- results/alignments/sample_2/merged_bam/
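Because BWA_MEM emits one BAM per FASTQ set, the BAM files need to be grouped by sample before merging. The sketch below assumes the `groupTuple` operator is used for that step and that the BAM tuple is the first output of BWA_MEM (`BWA_MEM.out[0]`); the `tag` and `publishDir` directives are likewise assumptions.

```groovy
process SAMTOOLS_MERGE {
    tag "${sample_id}"
    publishDir "results/alignments/${sample_id}/merged_bam", mode: 'copy'

    input:
    // bam_files holds every BAM that belongs to one sample.
    tuple val(sample_id), path(bam_files)

    output:
    tuple val(sample_id), path("${sample_id}.merged_raw.bam")

    script:
    """
    samtools merge -n -@ ${task.cpus} -o ${sample_id}.merged_raw.bam ${bam_files}
    """
}

// In the main workflow, after BWA_MEM has been called:
//   merge_input_ch = BWA_MEM.out[0].groupTuple()   // [sample_id, [bam_1, bam_2, ...]]
//   SAMTOOLS_MERGE(merge_input_ch)
```

Without a `size` argument, `groupTuple` emits its groups only once the upstream channel has finished, so merging starts after all alignments are done.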
4. Results: Alignments
The results should be organized into the following structure:
- results/alignments/sample_1/merged_bam/sample_1.merged_raw.bam
- results/alignments/sample_2/merged_bam/sample_2.merged_raw.bam