Simon Hackl edited this page Jul 22, 2025 · 9 revisions

Welcome to the Nextstrain-TrepoGen wiki!

With approximately 7 million cases in 2020 and stagnant progress in public health campaigns, treponematoses constitute a global public health concern. The causative agent, Treponema pallidum (Tp), is known as the stealth pathogen due to its poorly antigenic and low-inflammatory surface. Although extensive whole-genome sequencing (WGS) data on Tp is available, the lack of harmonized data analysis and clustering/classification schemes impedes study comparability and progress in elucidating global epidemiology and vaccine research.

We aim to democratize access to T. pallidum genomic diversity data by developing and maintaining unified, community-facing Nextstrain datasets for epidemiological surveillance, comparative evolutionary analysis and facilitated data sharing, with a special emphasis on tracking outer membrane protein (OMP) diversity relevant for vaccine design.

This wiki provides additional information on technical aspects and our data, as well as usage examples.

Data

Pre-processing

All FASTQ files containing T. pallidum (Tp) reads that had been deposited in the NCBI Sequence Read Archive by December 31, 2024, were downloaded and processed with a custom pipeline. For reference strains, ART was used to generate synthetic reads from GenBank consensus sequences (default settings for the Illumina NovaSeq 6000 error profile). Raw FASTQ files were preprocessed with Kraken2 to remove human reads, Trimmomatic for quality and adapter trimming, and BBDuk for low-stringency filtering against the SS14 reference genome (NC_021508). Tp rRNA and tRNA reads were filtered at high stringency, requiring 99% identity to conserved Tp references. The filtered Tp reads were mapped against the SS14 reference genome using Bowtie2 with default parameters and deduplicated with Picard, and variants were called using GATK HaplotypeCaller assuming a ploidy of one and a maximum indel size of 130. Joint genotyping was then performed with GATK, followed by filtering with bcftools to require a minimum depth of three reads and an allele frequency above 80%.
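The final filtering criteria (minimum depth of three reads, allele frequency above 80%) can be illustrated with a minimal Python sketch. This is not the bcftools expression used in the pipeline, which is not given here. The `Call` record and the definition of depth as reference reads plus alternate reads are assumptions made for illustration only.

```python
from dataclasses import dataclass

MIN_DEPTH = 3            # minimum number of reads covering the site
MIN_ALT_FRACTION = 0.8   # alternate allele fraction must be above this value


@dataclass
class Call:
    """Hypothetical per-sample variant call with read counts."""
    sample: str
    pos: int
    ref_reads: int
    alt_reads: int


def passes_filter(call: Call) -> bool:
    """Return True if the call meets both the depth and the frequency criterion."""
    depth = call.ref_reads + call.alt_reads  # assumption: depth = ref + alt reads
    if depth < MIN_DEPTH:
        return False
    return call.alt_reads / depth > MIN_ALT_FRACTION


calls = [
    Call("s1", 100, 0, 12),  # depth 12, all reads support the variant
    Call("s1", 200, 1, 1),   # depth 2, below the minimum depth
    Call("s2", 100, 3, 7),   # depth 10, only 70% of reads support the variant
]
kept = [c for c in calls if passes_filter(c)]
print([(c.sample, c.pos) for c in kept])
```

Only the first call passes both criteria in this example. Note that a fraction of exactly 80% is rejected, because the criterion is "above 80%".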

Datasources

For the Nextstrain-TrepoGen project, we designate the combination of variants (VCF files), the reference genome, and the reference genome annotation as a Datasource. Variants are stored as subsets, i.e. files that contain, for example, only single nucleotide variants (SNVs), both SNVs and insertions/deletions (InDels), or only a subset of genomic regions because some positions are masked. Each data source and its variant subsets can be used in multiple Nextstrain builds and, accordingly, in multiple Nextstrain datasets.
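As a rough illustration of this definition, a data source can be modelled as a reference genome, its annotation, and a set of named variant subsets. The class names, field names and file paths below are hypothetical and do not reflect how the project actually organises its files.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class VariantSubset:
    """One VCF file holding a subset of the variants (hypothetical layout)."""
    name: str       # e.g. "snv" or "snv-indel"
    vcf_path: str


@dataclass
class Datasource:
    """Variants plus the reference genome and its annotation."""
    name: str
    reference_fasta: str
    reference_annotation: str
    subsets: dict = field(default_factory=dict)

    def add_subset(self, subset: VariantSubset) -> None:
        self.subsets[subset.name] = subset


# Placeholder paths, for illustration only.
source = Datasource("TPASS-2588", "reference.fasta", "reference.gff3")
source.add_subset(VariantSubset("snv", "variants.snv.vcf"))
source.add_subset(VariantSubset("snv-indel", "variants.snv-indel.vcf"))

# Several builds can refer to the same data source and pick different subsets.
builds = {
    "build-a": (source.name, "snv"),
    "build-b": (source.name, "snv-indel"),
}
print(sorted(source.subsets))
```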

Our data sources are not included directly in this repository, but will likely be hosted on Zenodo or a comparable platform soon. Currently, we maintain the following data sources:

TPASS-2588

TPASS-2588: a high-resolution dataset comprising 2,588 Treponema pallidum (ssp. pallidum, pertenue and endemicum) samples genotyped against the SS14 reference genome (NC_021508.1).

  • For Nextstrain builds utilising this dataset, low-quality samples with a mean coverage below 3× or with ambiguous (no-call) positions in more than 25% of the genome are masked to reduce the bias introduced by low-quality data.
  • Accompanying metadata provides epidemiological context and quality metrics for each sample, including the sampling date, country and region, the designated subspecies, and for ssp. pallidum, the Nichols or SS14 lineage. It also includes the mean coverage and ambiguity (N or no-call positions) as a percentage.
  • The data source comprises two subsets of variants: one containing only single nucleotide variants (SNVs), and the other containing both SNVs and insertions/deletions (InDels).
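The sample-level quality criteria described above (mean coverage below 3× or more than 25% ambiguous positions) can be sketched as follows. The metadata keys `mean_coverage` and `ambiguity_percent` are assumed names and may differ from the column names in the actual metadata.

```python
MIN_MEAN_COVERAGE = 3.0        # samples below this mean coverage are masked
MAX_AMBIGUITY_PERCENT = 25.0   # samples above this share of no-calls are masked


def is_low_quality(mean_coverage: float, ambiguity_percent: float) -> bool:
    """Return True if a sample fails either quality criterion."""
    return (
        mean_coverage < MIN_MEAN_COVERAGE
        or ambiguity_percent > MAX_AMBIGUITY_PERCENT
    )


# Made-up example records; the keys are hypothetical.
metadata = [
    {"sample": "A", "mean_coverage": 45.2, "ambiguity_percent": 1.3},
    {"sample": "B", "mean_coverage": 2.1, "ambiguity_percent": 12.0},
    {"sample": "C", "mean_coverage": 8.0, "ambiguity_percent": 31.5},
]
masked = [
    row["sample"]
    for row in metadata
    if is_low_quality(row["mean_coverage"], row["ambiguity_percent"])
]
print(masked)
```

In this example, sample B fails the coverage criterion and sample C fails the ambiguity criterion, so both are masked.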

Workflows

Nextstrain-TrepoGen implements two distinct workflows that can be used to generate different types of dataset, each with specific analytical objectives.

Genome

The Genome Workflow directly processes whole-genome variant calls in VCF format. It is intended to produce accurate phylogenies that are useful for tracing subspecies and population structures, and for deriving epidemiological insights, global classifications and geographic modelling. It provides rules for masking positions for phylogenetic tree building. The workflow implements additional rules for applying drug resistance mutation annotation and conducting clade assignment.

For our datasets, we mask known recombinant loci in Treponema pallidum, which reduces the effect of homoplasy on the tree topology. Furthermore, we annotate macrolide resistance conferred by single nucleotide variants (SNVs) in the 23S ribosomal RNA (rRNA) of Treponema pallidum, and we apply a prototypic clade assignment scheme based on hierarchical clustering.
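Masking positions before tree building amounts to dropping every variant that falls inside a masked interval. The sketch below shows one way to do this in plain Python. It is not the workflow's actual rule, and the coordinates are invented placeholders, not real recombinant loci. Intervals are assumed to be 1-based and inclusive.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Sort intervals and merge those that overlap or are directly adjacent."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def is_masked(pos, merged, starts):
    """Binary search for the last interval starting at or before pos."""
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos <= merged[i][1]


# Placeholder mask regions (1-based, inclusive); not real loci.
mask = [(100, 200), (150, 250), (1000, 1100)]
merged = merge_intervals(mask)
starts = [start for start, _ in merged]

variant_positions = [50, 100, 250, 251, 1050]
kept = [p for p in variant_positions if not is_masked(p, merged, starts)]
print(merged, kept)
```

Merging the intervals first keeps the lookup a simple binary search, which matters little for a handful of loci but scales to long mask lists.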

Gene

The Gene Workflow focuses the analysis on a single gene of interest. This enables functional analyses, for example, the discovery of putative vaccine targets. The workflow is not applied directly to the variant calls. Instead, a preprocessing step is applied first, which involves the following:

  • Generating sequence alignments of the target gene from the variant calls using MUSIAL. Currently, all samples whose sequence is identical to the reference sequence are excluded.
  • Adapting the reference annotation to the target gene to allow the extraction of gene sub-features (i.e. biological regions of interest), if they are included in the annotation.
  • Annotating additional metadata for each sample based on the sequence composition of selected gene sub-features.

For our datasets, we annotated protein topologies for selected outer membrane proteins (OMPs) of interest, using either manually curated data or predictions from the DeepTMHMM tool. Our focus is on sequence composition typing of extracellular loop (ECL) regions.
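The idea of typing samples by the sequence composition of a gene sub-feature can be sketched as follows. This is a simplified illustration and does not use MUSIAL or reproduce the project's typing scheme. The toy sequences, the 0-based half-open coordinates, and the labels `ref` and `type-N` are all assumptions.

```python
def type_subfeature(sequences, reference, start, end):
    """Assign a type label to each sample based on one sub-feature.

    sequences: mapping of sample name to aligned gene sequence
    start, end: 0-based, half-open coordinates of the sub-feature (assumed)
    """
    labels = {reference[start:end]: "ref"}
    types = {}
    for sample in sorted(sequences):
        seq = sequences[sample]
        if seq == reference:
            continue  # samples identical to the reference are excluded
        allele = seq[start:end]
        if allele not in labels:
            labels[allele] = f"type-{len(labels)}"
        types[sample] = labels[allele]
    return types


# Toy protein sequences; positions 4..7 stand in for an extracellular loop.
reference = "MKTLLAGGSV"
sequences = {
    "a": "MKTLLAGGSV",  # identical to the reference, excluded
    "b": "MKTLLSGGSV",  # differs inside the loop
    "c": "MKTLLSGGSA",  # same loop sequence as b, differs elsewhere
    "d": "MKTLLAGGSA",  # loop identical to the reference
}
print(type_subfeature(sequences, reference, 4, 8))
```

Samples b and c receive the same label because their loop sequences match, even though their full gene sequences differ. Sample d carries the reference loop and is labelled accordingly.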
