-
Notifications
You must be signed in to change notification settings - Fork 0
Tutorial 4: Legacy Integration
Workflow X3 integrates Sanger sequencing or GenBank legacy alignments into an existing sequence-capture dataset. This is useful when published data exists for the same loci and taxa (or closely related ones) and you want to increase taxon sampling without re-sequencing.
For each legacy alignment file, a representative sequence is BLASTed against the target marker file to identify the matching capture locus. The legacy sequences are then added to that locus alignment using MAFFT, with optional column-by-column merging for samples that appear in both datasets. Results are written to a data-analysis/legacy-integration/ directory.
When to use this workflow:
- You have published Sanger/GenBank sequences for loci overlapping your capture panel
- You want to include outgroups or additional taxa from the literature
- Your capture dataset has poor coverage for some taxa that have legacy data available
- You have a concatenated NEXUS matrix (e.g. from PAUP* or FigTree) that you want to merge into your capture dataset
Before running Workflow X3 you need:
-
Sequence-capture alignments — the
untrimmed_all-markersdirectory produced by Workflow 4 or 5. Untrimmed alignments are recommended because trimming may have removed flanking regions that overlap with the legacy loci. -
Legacy alignment files — one alignment file per locus, in phylip or fasta format, placed in a single directory. Alternatively, a single concatenated NEXUS file with a
BEGIN SETS/charsetblock (see Step 0 below). -
Target marker file — the same FASTA file of reference target sequences used for your capture experiment. This is used to BLAST-match each legacy locus to the correct capture alignment.
-
Mitochondrial capture alignments (optional) — if your legacy data includes mitochondrial loci (e.g. 12S, 16S, ND1), provide a directory of mitochondrial capture alignments. MitoTrawlR is recommended for generating these from your sequence-capture raw reads.
-
Gene metadata file (optional, for gene concatenation) — a two-column tab-delimited file with columns
MarkerandGene, used to group exons from the same gene.
/Project_Name
├── /workflows
│ ├── workflow-X3_configuration-file.R
│ └── workflow-X3_legacy-integration.R
├── /data-analysis
│ ├── /alignments
│ │ └── /untrimmed_all-markers ← input capture alignments
│ └── /legacy-integration ← all output goes here
│ ├── /untrimmed_legacy-only ← integrated loci only
│ ├── /untrimmed_legacy-all ← full dataset (capture + legacy)
│ ├── /untrimmed_legacy-genes ← concatenated genes (optional)
│ ├── /untrimmed_legacy-unlinked ← unlinked dataset (optional)
│ ├── /trimmed_legacy-unlinked ← trimmed final dataset (optional)
│ └── untrimmed_legacy-integration_summary.txt
├── /data-analysis/alignments/legacy-alignments ← converted NEXUS output (optional)
├── target_markers.fa
└── mito-alignments/ ← mitochondrial capture alignments (optional)
If your legacy data is stored as a single concatenated NEXUS matrix (e.g. exported from PAUP*, MrBayes, or FigTree) with a BEGIN SETS / charset block, the convertNexusPartitions function will split it into one phylip file per locus automatically.
Set the following in the configuration file:
convert.nexus = TRUE
nexus.file = "/path/to/your/matrix.nex"
nexus.output.directory = "data-analysis/alignments/legacy-alignments"The converted files are then automatically used as the legacy alignment source for Step 1. Each charset becomes a separate phylip file named after the charset (e.g. RAG1.phy, 16S.phy).
Notes:
- Multi-range charsets (e.g.
charset COI = 1-100 200-300) are supported; the ranges are concatenated in order. - Missing-data characters (
?) are automatically converted toNto ensure compatibility with downstream tools. - Use
max.missing.percentto drop samples with excessive missing data from individual loci before writing. - If the output directory already exists and
overwrite = FALSE, the conversion step is skipped (files are assumed to already be present).
For each legacy alignment file, the most complete sequence (fewest gaps and missing bases) is selected as a BLAST query and searched against the nuclear target marker file. The top BLAST hit identifies which capture alignment the legacy sequences belong to. Standard blastn is used for maximum sensitivity across divergent taxa.
If no nuclear hit is found and include.mitochondrial = TRUE, a second BLAST search is run against a database built from consensus sequences of the mitochondrial capture alignments. This handles loci such as 12S, 16S, and ND1 that are not represented in nuclear probe sets.
Because legacy datasets often use different voucher IDs or formatting conventions from sequence-capture datasets, three matching strategies are available for deciding which sequences to merge:
| Mode | Merges identical names? | Merges same species? | Handles - vs _? |
Pre-selects one legacy seq per species? | Unmatched legacy sequences |
|---|---|---|---|---|---|
"exact" |
✅ | ❌ | ❌ | ❌ | Added as separate rows |
"fuzzy" |
✅ | ❌ | ✅ | ❌ | Added as separate rows |
"species" |
✅ | ✅ | ❌ | ✅ | Excluded (one per species only) |
"exact" — sequence names must be character-for-character identical for two sequences to be merged. All legacy sequences are added to the alignment by MAFFT; those whose name exactly matches a capture sequence are subsequently merged column-by-column (non-gap characters preferred). This is the safest option when both datasets use the same sample naming scheme.
"fuzzy" — before comparing names, hyphens, dots, and spaces are converted to underscores and all characters are lowercased. This handles common formatting inconsistencies such as MZUTI-2436 (capture) vs MZUTI_2436 (legacy). Sequences whose normalised names match are merged; the merged sequence retains the original capture alignment name. All legacy sequences whose normalised names do not match any capture sequence are added as separate rows with their original names preserved. This is the recommended option when datasets share the same voucher IDs but with minor formatting differences.
"species" — strips the trailing specimen/voucher ID (the last underscore-delimited field) before matching, so that e.g. Centrolene_bacatum_MZUTI-2436 (capture) and Centrolene_bacatum_KU12345 (legacy) are treated as the same taxon and merged into a single sequence named Centrolene_bacatum. When the legacy alignment contains multiple sequences for the same species, one representative is pre-selected before alignment: the sequence whose full name (including voucher) already appears in the capture alignment is preferred; otherwise the sequence with the most informative (non-gap, non-missing) bases is chosen. This is the recommended option when datasets cover the same species but with entirely different voucher IDs.
Which mode should I use?
- Same voucher IDs in both datasets →
"exact"- Same voucher IDs but minor formatting differences (hyphen vs underscore) →
"fuzzy"- Same species, different voucher IDs entirely →
"species"
When combine.same.sample = TRUE (default), sequences matched by name.match are merged column-by-column into a single sequence: where both sequences have a base, the capture sequence's base is preferred; where one sequence has a gap or N and the other has a base, the base is used. This produces a single complete sequence per sample that draws on data from both sources.
Set combine.same.sample = FALSE to skip merging entirely and keep both sequences as separate rows (each retaining its original name).
By default (include.uncaptured.legacy = FALSE), legacy loci that return no BLAST match to the capture target file are skipped. Set include.uncaptured.legacy = TRUE to save these as stand-alone alignments in the output directory. This is useful if your legacy dataset contains markers not in your capture panel that you still want in the final dataset.
Two output directories are always available:
-
untrimmed_legacy-only— contains only the loci where legacy data was integrated (one file per matched locus). Each file contains the capture sequences for that locus plus the legacy sequences added via MAFFT, with same-sample merging applied if enabled. -
untrimmed_legacy-all— (created wheninclude.all.together = TRUE) — the full dataset: all capture alignments, with legacy-integrated versions replacing the originals where available. This is the directory to use for downstream analyses. Whenname.match = "species", all sequence names in this directory are collapsed to species level for consistency.
Mitochondrial loci (e.g. 12S, 16S, ND1) are typically absent from nuclear probe sets and will return no BLAST hit against the nuclear target file. To include them, set:
include.mitochondrial = TRUE
mito.alignment.directory = "/path/to/mito-alignments"
mito.alignment.format = "phylip"PhyloProcessR will build a second BLAST database from the most complete sequence of each mitochondrial capture alignment before the main loop begins. When a legacy locus returns no nuclear BLAST hit, it is re-queried against this mitochondrial database. If a match is found, the corresponding mitochondrial capture alignment is used for integration instead of a nuclear capture alignment.
MitoTrawlR is recommended for generating the mitochondrial capture alignments from your sequence-capture raw reads.
If your capture panel targets multiple exons from the same gene, you can concatenate them before downstream analysis. Set concatenate.genes = TRUE and provide the gene metadata file.
Two sub-options control which alignments are concatenated:
-
concatenate.legacy.genes = TRUE— concatenates exons from the full integrated dataset (capture + legacy combined). Legacy loci that match a gene's exons are included in the concatenated gene alignment. -
concatenate.legacy.genes = FALSE— concatenates exons only from the original capture alignments; legacy loci remain as separate alignments and are picked up bygatherUnlinkedas stand-alone loci.
When gather.unlinked = TRUE, the gatherUnlinked function assembles the final analysis-ready dataset: multi-exon genes are represented by their concatenated alignment, and single-exon loci (including any stand-alone legacy loci) are included individually. The output goes to untrimmed_legacy-unlinked.
Set trim.alignments = TRUE to run superTrimmer on the final dataset. The trimmer removes poorly covered alignment edges, columns with excessive gaps, and samples below a minimum coverage threshold.
If gene concatenation was performed, trimming runs on untrimmed_legacy-unlinked and writes to trimmed_legacy-unlinked. If no concatenation was performed, trimming runs directly on the integrated alignments and writes to trimmed_legacy.
A trimming summary CSV is written alongside the output directory (e.g. trimmed_legacy-unlinked_trimming_summary.csv).
After integration completes, a tab-delimited summary file is written to data-analysis/legacy-integration/untrimmed_legacy-integration_summary.txt. Each row is one taxon in the output dataset, with the following columns:
| Column | Description |
|---|---|
Taxon |
Sample or species name |
Total_Loci |
Total alignments the taxon appears in |
Capture_Loci |
Loci from the capture dataset only (no legacy data added) |
Legacy_Nuclear_Loci |
Nuclear loci where legacy data was integrated |
Legacy_Mito_Loci |
Mitochondrial loci where legacy data was integrated |
Legacy_Total |
Legacy_Nuclear_Loci + Legacy_Mito_Loci
|
Total_BP |
Total informative base pairs across all loci (gaps/Ns excluded) |
Pct_Loci |
Percentage of total available loci the taxon is present in |
Rows are sorted by Total_Loci descending so the best-represented taxa appear first. This summary is useful for identifying taxa with poor legacy coverage or capture taxa that have no legacy counterpart.
# Main working directory
working.directory = "/PATH/TO/PROJECT/DIRECTORY"
# Target marker file (nuclear probe set)
target.file = "/PATH/TO/marker-seqs.fa"
# Gene/exon metadata for concatenation (two columns: Marker, Gene)
feature.gene.names = "data-analysis/gene_metadata.txt"
# Input capture alignment directory
alignment.directory = "data-analysis/alignments/untrimmed_all-markers"
alignment.format = "phylip"
#---------------------------------------------------------------------------
# NEXUS conversion (Step 0, optional)
#---------------------------------------------------------------------------
# TRUE = split a concatenated NEXUS file into per-locus phylip files
convert.nexus = FALSE
nexus.file = NULL
nexus.output.directory = "data-analysis/alignments/legacy-alignments"
# Max % missing/gap characters per sample per locus (100 = keep all)
max.missing.percent = 100
#---------------------------------------------------------------------------
# Legacy alignment source
# (auto-set to nexus.output.directory when convert.nexus = TRUE)
#---------------------------------------------------------------------------
legacy.directory = "/PATH/TO/legacy-alignments"
legacy.format = "phylip"
#---------------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------------
threads = 8
memory = 40
overwrite = FALSE
quiet = TRUE
#---------------------------------------------------------------------------
# Integration settings
#---------------------------------------------------------------------------
# TRUE = merge sequences from the same sample into one combined sequence
combine.same.sample = TRUE
# Name matching strategy:
# "exact" = names must be identical
# "fuzzy" = normalise hyphens/underscores/case then match
# "species" = strip trailing voucher ID, merge to species name
name.match = "exact"
# TRUE = include legacy loci not found in the capture dataset as stand-alone alignments
include.uncaptured.legacy = FALSE
# TRUE = write a second output directory with all capture alignments,
# updated with legacy-integrated versions where available
include.all.together = TRUE
#---------------------------------------------------------------------------
# Mitochondrial loci (optional)
#---------------------------------------------------------------------------
# TRUE = also integrate mito loci absent from the nuclear target file
include.mitochondrial = FALSE
mito.alignment.directory = "/PATH/TO/mito-alignments"
mito.alignment.format = "phylip"
#---------------------------------------------------------------------------
# Gene concatenation (optional)
#---------------------------------------------------------------------------
concatenate.genes = TRUE
minimum.exons = 2
# TRUE = include legacy loci in gene concatenation
# FALSE = concatenate capture exons only; legacy loci stay as separate alignments
concatenate.legacy.genes = TRUE
gather.unlinked = TRUE
#---------------------------------------------------------------------------
# Trimming (optional)
#---------------------------------------------------------------------------
trim.alignments = TRUE
min.taxa.alignment = 4
min.alignment.length = 100
run.TrimAl = TRUE
trim.column = TRUE
min.column.gap.percent = 50
convert.ambiguous.sites = FALSE
trim.external = TRUE
min.external.percent = 50
trim.coverage = TRUE
min.coverage.percent = 35
min.coverage.bp = 60
#---------------------------------------------------------------------------
# Program paths (conda bin directory is sufficient when using conda)
#---------------------------------------------------------------------------
conda.env = "PATH/TO/miniconda3/envs/PhyloProcessR/bin"
blast.path = conda.env
mafft.path = conda.env
trimAl.path = conda.env