Pipeline Overview
This Nextflow pipeline provides comprehensive genomic analysis for Neisseria gonorrhoeae
outbreak investigation and surveillance. It processes raw sequencing reads through quality control,
assembly, variant calling, phylogenetic analysis, and antimicrobial resistance (AMR) profiling to
generate actionable clinical insights.
Key Capabilities
- Tiered MASH Screening: 4-tier contamination detection (pre-screen) + high-resolution typing (post-screen)
- Multi-stage Quality Control: FastP-based read QC, MASH screening, coverage analysis, and comprehensive filtering
- Intelligent Caching: Reuses downsampled reads, SPAdes assemblies, and Snippy outputs; cached results are skipped for downsampled samples so that they are re-analyzed
- Phylogenetic Analysis: RAxML-NG trees with recombination detection via Gubbins
- Outbreak Detection: SNP-based clustering to identify transmission chains
- Comprehensive AMR Profiling: Chromosomal mutations and HGT/plasmid gene detection
- Clinical Interpretation: Automated treatment recommendations and priority alerts
- MLST/cgMLST Typing: Strain characterization with clustering analysis
- Flexible Workflows: Modular design with optional analysis components
Main Workflows
Downsample Reads
Downsample reads to a target coverage to reduce run time
Reads QC
FastP-based quality filtering and trimming
Assembly
SPAdes assembly with statistics generation
Assembly QC
Coverage and assembly quality checks
MASH Pre-screen (Reads)
Tiered contamination detection on raw reads using 4-tier screening: Neisseria genomes, plasmids, respiratory pathogens, and common contaminants
MASH Post-screen (Assembly)
High-resolution species typing and plasmid detection on assembled contigs for confirmation and detailed characterization
Variant Calling
Snippy with caching, core alignment, and Gubbins
Phylogeny
RAxML-NG phylogenetic tree construction
Outbreak Detection
SNP distance-based cluster identification
Recombination
Functional annotation of recombinant regions
MLST
Multi-locus sequence typing and clustering
AMR Profiler
Chromosomal and HGT resistance detection
AMR Typing
NG-MAST/NG-STAR strain typing
Clinical
Treatment recommendations and priority classification
Downsampling
Optional read depth normalization
Final QC
Post-assembly comprehensive quality filtering
Reports
Comprehensive manifest generation
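The Outbreak Detection workflow listed above groups isolates by pairwise SNP distance. The following is a minimal sketch of one common way to do this, single-linkage clustering with a union-find structure. It is not taken from the pipeline's code: the function name `cluster_by_snp_distance`, the input layout, and the threshold of 10 SNPs in the usage lines are illustrative assumptions.

```python
def cluster_by_snp_distance(samples, distances, threshold):
    """Group samples into single-linkage clusters.

    samples:   list of sample IDs
    distances: dict mapping (sample_a, sample_b) -> pairwise SNP distance
    threshold: pairs at or below this distance are linked (hypothetical value)
    """
    parent = {s: s for s in samples}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), dist in distances.items():
        if dist <= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for s in samples:
        clusters.setdefault(find(s), []).append(s)

    # Largest clusters first; members sorted for stable output.
    return sorted((sorted(m) for m in clusters.values()),
                  key=lambda c: (-len(c), c))


# Illustrative data only: A-B and B-C are close, D is distant.
example_samples = ["A", "B", "C", "D"]
example_distances = {
    ("A", "B"): 3, ("B", "C"): 8, ("A", "C"): 11,
    ("A", "D"): 120, ("B", "D"): 118, ("C", "D"): 125,
}
example_clusters = cluster_by_snp_distance(example_samples, example_distances, 10)
```

With single linkage, A and C fall in the same cluster even though their direct distance exceeds the threshold, because both are linked through B. This chaining behaviour suits transmission chains but can merge clusters that a stricter method would keep apart.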
Pipeline Architecture
The pipeline is organized into 16 main workflows (including MASH pre-screen and post-screen)
and 2 subworkflows that orchestrate over 55 specialized processes.
Each workflow is designed to be modular and can be enabled/disabled via command-line parameters, allowing
flexible execution based on analysis needs.
Quality Control Strategy
The pipeline implements a three-stage QC approach:
- QC1 (Reads QC): FastP-based filtering of raw reads for quality and adapter removal
- QC2 (Assembly QC): Coverage, species verification, and assembly metrics filtering
- Final QC: Post-analysis validation ensuring data quality throughout
MASH Screening: Tiered Contamination Detection
The pipeline runs MASH screening in two stages for quality control and contamination detection:
a pre-screen on raw reads and a post-screen on assembled contigs.
Stage 1: MASH Pre-screen (Raw Reads)
Analyzes raw sequencing reads using a 4-tier screening approach:
- Tier 1 - Neisseria Genomes: Species identification (N. gonorrhoeae vs N. meningitidis) using complete reference genomes. Expected: >98% identity to target species. Detects mixed cultures when multiple species are present with similar coverage.
- Tier 2 - Neisseria Plasmids: Detection of resistance and conjugative plasmids using known Neisseria plasmid sequences. Expected: 0-3 plasmids typical. Flags unusual plasmid combinations indicating horizontal gene transfer or contamination.
- Tier 3 - Respiratory Pathogens: Co-infection screening against S. pneumoniae, H. influenzae, M. catarrhalis, and S. pyogenes genomes. Expected: None in pure culture. Any detection indicates polymicrobial infection or mixed culture requiring re-isolation.
- Tier 4 - Common Contaminants: Laboratory and environmental contamination detection using databases of E. coli (lab strains), Staphylococcus (skin flora), Pseudomonas (water/reagent), and Bacillus (environmental) species. Expected: None. Any detection indicates sample or reagent contamination.
Detection Thresholds:
- Identity: >99% = strong match (target species), 95-99% = related species, <95% = weak match (contaminant if coverage sufficient)
- Coverage: >10% = significant presence, 1-10% = minor contamination, <1% = noise (ignored)
- K-mer Analysis: Uses MinHash sketching for rapid distance estimation between sample and reference databases
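To make the MinHash step concrete, here is a simplified sketch of bottom-s sketching and the Mash distance formula D = -(1/k) ln(2j / (1 + j)), where j is the estimated Jaccard index. It is an illustration under stated assumptions, not the MASH implementation: it hashes forward-strand k-mers with BLAKE2b from Python's standard library, whereas MASH itself uses canonical k-mers and MurmurHash3. The defaults k=21 and sketch size 1000 mirror MASH's defaults; the pipeline's actual settings are not stated here.

```python
import hashlib
import math


def kmer_hashes(seq, k):
    """Return the set of 64-bit hashes of all k-mers in seq (forward strand only)."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def bottom_sketch(seq, k=21, size=1000):
    """Keep the `size` smallest k-mer hashes as the sketch."""
    return sorted(kmer_hashes(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size=1000):
    """Estimate the Jaccard index from two bottom-s sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(jaccard, k=21):
    """Convert a Jaccard estimate to a Mash distance in [0, 1]."""
    if jaccard <= 0:
        return 1.0
    if jaccard >= 1:
        return 0.0
    return min(1.0, -math.log(2 * jaccard / (1 + jaccard)) / k)
```

Identical sequences give a Jaccard estimate of 1 and a distance of 0; sequences sharing no k-mers give a distance of 1.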
Output: QC status (PASS/WARN/FAIL) with detailed contamination reports and species identification
Filtering: Optional automatic exclusion of failed samples from downstream analysis
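The identity and coverage bands listed under Detection Thresholds can be expressed as a small classifier. This is a sketch, not the pipeline's code: the function names are hypothetical, and the handling of exact boundary values (99%, 95%, 10%, 1%) is an assumption, since the ranges above do not say which side of each boundary is inclusive. The mapping from these labels to PASS/WARN/FAIL is not specified above and is deliberately left out.

```python
def classify_identity(identity_pct):
    """Label a hit by percent identity using the bands described above."""
    if identity_pct > 99.0:
        return "strong"    # target species
    if identity_pct >= 95.0:
        return "related"   # related species
    return "weak"          # possible contaminant if coverage is sufficient


def classify_coverage(coverage_pct):
    """Label a hit by percent coverage using the bands described above."""
    if coverage_pct > 10.0:
        return "significant"
    if coverage_pct >= 1.0:
        return "minor"
    return "noise"


def summarize_hit(identity_pct, coverage_pct):
    """Combine both labels; hits in the noise band are marked as ignored."""
    coverage_class = classify_coverage(coverage_pct)
    return {
        "identity_class": classify_identity(identity_pct),
        "coverage_class": coverage_class,
        "ignored": coverage_class == "noise",
    }


# Illustrative values only.
example_hit = summarize_hit(99.6, 42.0)
```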
Stage 2: MASH Post-screen (Assembly)
High-resolution characterization of assembled genomes:
- Species Confirmation: Validates species identification with >99% identity threshold
- Subspecies Classification: Distinguishes N. gonorrhoeae subspecies and capsule groups
- Plasmid Detection: Identifies resistance and conjugative plasmids
- Commensal Detection: Flags non-pathogenic Neisseria species
- ANI Calculation: Average Nucleotide Identity for strain relationships
Output: Consolidated HTML report with species typing, plasmid content, and quality metrics
Clinical Significance
- Sample Integrity: Detects mixed cultures requiring re-isolation
- Cross-contamination: Identifies laboratory contamination events
- Co-infections: Flags potential polymicrobial infections
- Species Misidentification: Prevents incorrect clinical interpretation
- Resistance Plasmids: Early detection of mobile resistance elements
Caching and Performance
The pipeline implements intelligent caching for computationally expensive operations, particularly
Snippy variant calling. It supports five separate cache directories:
- Downsampled Cache: Downsampled reads, reused for faster throughput
- Assembly Cache: SPAdes assemblies, reused for faster throughput
- Core Genome Cache: Snippy outputs for phylogenetic analysis
- AMR Chromosomal Cache: Variant calls for chromosomal resistance mutations
- AMR HGT Cache: Coverage analysis for horizontally transferred resistance genes
Smart cache filtering ensures downsampled samples are re-analyzed while leveraging existing results
for unchanged samples.
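The cache filtering rule described above can be sketched as a partition of samples into those whose cached results are reused and those that must be re-run. This is only an illustration: the function name `partition_by_cache` and the assumed layout (one subdirectory per sample ID inside a cache directory) are hypothetical and may not match the pipeline's actual cache structure.

```python
from pathlib import Path


def partition_by_cache(sample_ids, cache_dir, downsampled_ids):
    """Split samples into (reuse, rerun) lists.

    A sample is reused only if a cached result directory exists for it
    AND it was not downsampled in this run. Downsampled samples are always
    re-analyzed because their input reads have changed.
    Assumes a hypothetical layout: <cache_dir>/<sample_id>/
    """
    cache_dir = Path(cache_dir)
    downsampled = set(downsampled_ids)
    reuse, rerun = [], []
    for sample_id in sample_ids:
        has_cache = (cache_dir / sample_id).is_dir()
        if has_cache and sample_id not in downsampled:
            reuse.append(sample_id)
        else:
            rerun.append(sample_id)
    return reuse, rerun
```

For example, with cached results for `s1` and `s2`, and `s2` downsampled in the current run, `partition_by_cache(["s1", "s2", "s3"], cache_dir, {"s2"})` reuses `s1` and re-runs `s2` (downsampled) and `s3` (no cache entry).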
Technology Stack
Built with:
- Nextflow DSL2: Workflow orchestration with modern syntax
- Docker Containers: Reproducible bioinformatics tool environments
- Python Scripts: Custom data processing and clinical interpretation
- Shell Scripts: Utility functions and data manipulation