Pipeline Overview
This Nextflow pipeline provides comprehensive genomic analysis for Neisseria gonorrhoeae
outbreak investigation and surveillance. It processes raw sequencing reads through quality control,
assembly, variant calling, phylogenetic analysis, and antimicrobial resistance (AMR) profiling to
generate actionable clinical insights.
Key Capabilities
- Multi-stage Quality Control: FastP-based read QC, coverage analysis, and comprehensive filtering
- Intelligent Caching: Reuses downsampled reads and SPAdes and Snippy outputs; samples that were downsampled are excluded from cache reuse and re-analyzed
- Phylogenetic Analysis: RAxML-NG trees with recombination detection via Gubbins
- Outbreak Detection: SNP-based clustering to identify transmission chains
- Comprehensive AMR Profiling: Detection of chromosomal mutations and horizontally transferred (HGT) resistance genes
- Clinical Interpretation: Automated treatment recommendations and priority alerts
- MLST/cgMLST Typing: Strain characterization with clustering analysis
- Flexible Workflows: Modular design with optional analysis components
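The SNP-based clustering listed above can be illustrated with a minimal Python sketch. It is not the pipeline's own implementation: it assumes a core alignment held as a dictionary of equal-length sequences, uses single-linkage clustering, and takes a hypothetical `threshold` argument, since the actual SNP cutoff and clustering method are not stated here.

```python
from itertools import combinations
from typing import Dict, List, Set

VALID_BASES = set("ACGT")


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count aligned positions where both bases are unambiguous and differ."""
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in VALID_BASES and b in VALID_BASES and a != b
    )


def cluster_by_snp_distance(alignment: Dict[str, str], threshold: int) -> List[Set[str]]:
    """Single-linkage clustering: link samples that are within `threshold` SNPs.

    `threshold` is a hypothetical parameter for this sketch.
    """
    parent = {name: name for name in alignment}

    def find(x: str) -> str:
        # Walk up to the cluster root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(alignment, 2):
        if snp_distance(alignment[a], alignment[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters: Dict[str, Set[str]] = {}
    for name in alignment:
        clusters.setdefault(find(name), set()).add(name)
    # Sort clusters by their member names for a stable result.
    return sorted(clusters.values(), key=lambda c: sorted(c))
```

With single linkage, two samples can share a cluster without being within the threshold of each other, provided a chain of closer samples connects them. That behaviour suits transmission chains but tends to merge clusters as the threshold grows.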
Main Workflows
- Downsample Reads: Downsamples reads to a target coverage to reduce run time
- Reads QC: FastP-based quality filtering and trimming
- Assembly: SPAdes assembly with statistics generation
- Assembly QC: Coverage and assembly quality checks
- MASH Species Check: Proceeds with only those reads identified as Neisseria gonorrhoeae
- Variant Calling: Snippy with caching, core alignment, and Gubbins
- Phylogeny: RAxML-NG phylogenetic tree construction
- Outbreak Detection: SNP distance-based cluster identification
- Recombination: Functional annotation of recombinant regions
- MLST: Multi-locus sequence typing and clustering
- AMR Profiler: Chromosomal and HGT resistance detection
- AMR Typing: NG-MAST/NG-STAR strain typing
- Clinical: Treatment recommendations and priority classification
- Downsampling: Optional read depth normalization
- Final QC: Post-assembly comprehensive quality filtering
- Reports: Comprehensive manifest generation
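The arithmetic behind downsampling reads to a desired coverage can be sketched in Python as follows. This is an illustration only: the function name and arguments are hypothetical, and the genome size of roughly 2.2 Mb for Neisseria gonorrhoeae is an approximation used for the example, not a value taken from the pipeline configuration.

```python
def downsample_fraction(total_bases: int, genome_size: int, target_coverage: float) -> float:
    """Return the fraction of reads to keep so that depth is about `target_coverage`.

    Coverage is estimated as total sequenced bases divided by genome size.
    Samples already at or below the target are left untouched (fraction 1.0).
    """
    if total_bases <= 0 or genome_size <= 0 or target_coverage <= 0:
        raise ValueError("total_bases, genome_size and target_coverage must be positive")
    current_coverage = total_bases / genome_size
    if current_coverage <= target_coverage:
        return 1.0
    return target_coverage / current_coverage


# Assumed, approximate genome size for the example only.
APPROX_GENOME_SIZE = 2_200_000

# 660 Mb of sequence over a 2.2 Mb genome is about 300x; keeping one third gives about 100x.
fraction = downsample_fraction(660_000_000, APPROX_GENOME_SIZE, target_coverage=100)
```

The resulting fraction is the kind of value a read subsampling tool would take as input; which tool the pipeline uses for this step is not specified here.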
Pipeline Architecture
The pipeline is organized into 14 main workflows and 2 subworkflows
that orchestrate over 50 specialized processes. Each workflow is designed to be modular
and can be enabled/disabled via command-line parameters, allowing flexible execution based on analysis needs.
Quality Control Strategy
The pipeline implements a three-stage QC approach:
- QC1 (Reads QC): FastP-based filtering of raw reads for quality and adapter removal
- QC2 (Assembly QC): Coverage, species verification, and assembly metrics filtering
- Final QC: Post-analysis validation ensuring data quality throughout
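The staged gating described above can be sketched in Python as an ordered list of checks, where a sample stops at the first stage it fails. Everything specific in this sketch is an assumption made for illustration: the metric names, the threshold values, and the assignment of each metric to a stage are hypothetical and are not the pipeline's actual criteria.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleMetrics:
    """Hypothetical per-sample metrics used only for this sketch."""
    sample_id: str
    q30_rate: float          # fraction of bases at Q30 or above after read trimming
    mean_coverage: float     # mean read depth across the genome
    is_target_species: bool  # outcome of the species check
    n_contigs: int           # number of contigs in the assembly


# Hypothetical thresholds; the real cutoffs are not given in this overview.
MIN_Q30_RATE = 0.80
MIN_COVERAGE = 30.0
MAX_CONTIGS = 500

# Stages run in order; each pairs a stage name with a pass/fail check.
STAGES = [
    ("QC1", lambda m: m.q30_rate >= MIN_Q30_RATE),
    ("QC2", lambda m: m.mean_coverage >= MIN_COVERAGE and m.is_target_species),
    ("Final QC", lambda m: m.n_contigs <= MAX_CONTIGS),
]


def first_failed_stage(metrics: SampleMetrics) -> Optional[str]:
    """Return the name of the first stage the sample fails, or None if it passes all."""
    for name, check in STAGES:
        if not check(metrics):
            return name
    return None
```

Recording the first failed stage, rather than a bare pass or fail, makes it easier to see whether a sample was lost to read quality, to coverage or species mismatch, or to the final filter.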
Caching and Performance
The pipeline implements intelligent caching for computationally expensive operations, particularly
Snippy variant calling. It supports five separate cache directories:
- Downsampled Cache: Downsampled reads, reused to avoid repeating downsampling
- Assembly Cache: SPAdes assemblies, reused to avoid repeating assembly
- Core Genome Cache: Snippy outputs for phylogenetic analysis
- AMR Chromosomal Cache: Variant calls for chromosomal resistance mutations
- AMR HGT Cache: Coverage analysis for horizontally transferred resistance genes
Cache filtering ensures that downsampled samples are re-analyzed rather than served from the cache,
while unchanged samples reuse their existing results.
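The cache filtering rule can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the pipeline's code: it assumes a cache directory containing one subdirectory per sample ID, and the function and argument names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List, Set, Tuple


def split_cached_and_pending(
    sample_ids: Iterable[str],
    downsampled_ids: Set[str],
    cache_dir: str,
) -> Tuple[List[str], List[str]]:
    """Split samples into those that can reuse cached output and those to re-run.

    Assumed layout: `cache_dir/<sample_id>/` exists when a cached result is available.
    Downsampled samples are always re-run, even if a cached result exists.
    """
    cache_root = Path(cache_dir)
    reuse: List[str] = []
    rerun: List[str] = []
    for sample_id in sample_ids:
        has_cache = (cache_root / sample_id).is_dir()
        if has_cache and sample_id not in downsampled_ids:
            reuse.append(sample_id)
        else:
            rerun.append(sample_id)
    return reuse, rerun
```

The same check could be applied independently to each of the five cache directories, so a sample might reuse one cached result while another is recomputed.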
Technology Stack
Built with:
- Nextflow DSL2: Workflow orchestration with modern syntax
- Docker Containers: Reproducible bioinformatics tool environments
- Python Scripts: Custom data processing and clinical interpretation
- Shell Scripts: Utility functions and data manipulation