Tutorial 4: Legacy Integration

Tutorial 4: Legacy Data Integration (Workflow X3)

Overview

Workflow X3 integrates Sanger sequencing or GenBank legacy alignments into an existing sequence-capture dataset. This is useful when published data exists for the same loci and taxa (or closely related ones) and you want to increase taxon sampling without re-sequencing.

For each legacy alignment file, a representative sequence is BLASTed against the target marker file to identify the matching capture locus. The legacy sequences are then added to that locus alignment using MAFFT, with optional column-by-column merging for samples that appear in both datasets. Results are written to a data-analysis/legacy-integration/ directory.

When to use this workflow:

You have published Sanger/GenBank sequences for loci overlapping your capture panel
You want to include outgroups or additional taxa from the literature
Your capture dataset has poor coverage for some taxa that have legacy data available
You have a concatenated NEXUS matrix (e.g. from PAUP* or FigTree) that you want to merge into your capture dataset

Prerequisites

Before running Workflow X3 you need:

Sequence-capture alignments — the untrimmed_all-markers directory produced by Workflow 4 or 5. Untrimmed alignments are recommended because trimming may have removed flanking regions that overlap with the legacy loci.
Legacy alignment files — one alignment file per locus, in phylip or fasta format, placed in a single directory. Alternatively, a single concatenated NEXUS file with a BEGIN SETS / charset block (see Step 0 below).
Target marker file — the same FASTA file of reference target sequences used for your capture experiment. This is used to BLAST-match each legacy locus to the correct capture alignment.
Mitochondrial capture alignments (optional) — if your legacy data includes mitochondrial loci (e.g. 12S, 16S, ND1), provide a directory of mitochondrial capture alignments. MitoTrawlR is recommended for generating these from your sequence-capture raw reads.
Gene metadata file (optional, for gene concatenation) — a two-column tab-delimited file with columns Marker and Gene, used to group exons from the same gene.

Directory structure

/Project_Name
├── /workflows
│   ├── workflow-X3_configuration-file.R
│   └── workflow-X3_legacy-integration.R
├── /data-analysis
│   ├── /alignments
│   │   └── /untrimmed_all-markers        ← input capture alignments
│   └── /legacy-integration               ← all output goes here
│       ├── /untrimmed_legacy-only        ← integrated loci only
│       ├── /untrimmed_legacy-all         ← full dataset (capture + legacy)
│       ├── /untrimmed_legacy-genes       ← concatenated genes (optional)
│       ├── /untrimmed_legacy-unlinked    ← unlinked dataset (optional)
│       ├── /trimmed_legacy-unlinked      ← trimmed final dataset (optional)
│       └── untrimmed_legacy-integration_summary.txt
├── /data-analysis/alignments/legacy-alignments   ← converted NEXUS output (optional)
├── target_markers.fa
└── mito-alignments/                      ← mitochondrial capture alignments (optional)

Step 0 (optional): Converting a concatenated NEXUS file

If your legacy data is stored as a single concatenated NEXUS matrix (e.g. exported from PAUP*, MrBayes, or FigTree) with a BEGIN SETS / charset block, the convertNexusPartitions function will split it into one phylip file per locus automatically.

Set the following in the configuration file:

convert.nexus          = TRUE
nexus.file             = "/path/to/your/matrix.nex"
nexus.output.directory = "data-analysis/alignments/legacy-alignments"

The converted files are then automatically used as the legacy alignment source for Step 1. Each charset becomes a separate phylip file named after the charset (e.g. RAG1.phy, 16S.phy).

Notes:

Multi-range charsets (e.g. charset COI = 1-100 200-300) are supported; the ranges are concatenated in order.
Missing-data characters (?) are automatically converted to N to ensure compatibility with downstream tools.
Use max.missing.percent to drop samples with excessive missing data from individual loci before writing.
If the output directory already exists and overwrite = FALSE, the conversion step is skipped (files are assumed to already be present).

Step 1: Integrating legacy alignments

How loci are matched

For each legacy alignment file, the most complete sequence (fewest gaps and missing bases) is selected as a BLAST query and searched against the nuclear target marker file. The top BLAST hit identifies which capture alignment the legacy sequences belong to. Standard blastn is used for maximum sensitivity across divergent taxa.

If no nuclear hit is found and include.mitochondrial = TRUE, a second BLAST search is run against a database built from consensus sequences of the mitochondrial capture alignments. This handles loci such as 12S, 16S, and ND1 that are not represented in nuclear probe sets.

Controlling name matching: `name.match`

Because legacy datasets often use different voucher IDs or formatting conventions from sequence-capture datasets, three matching strategies are available for deciding which sequences to merge:

Mode	Merges identical names?	Merges same species?	Handles `-` vs `_`?	Pre-selects one legacy seq per species?	Unmatched legacy sequences
`"exact"`	✅	❌	❌	❌	Added as separate rows
`"fuzzy"`	✅	❌	✅	❌	Added as separate rows
`"species"`	✅	✅	❌	✅	Excluded (one per species only)

"exact" — sequence names must be character-for-character identical for two sequences to be merged. All legacy sequences are added to the alignment by MAFFT; those whose name exactly matches a capture sequence are subsequently merged column-by-column (non-gap characters preferred). This is the safest option when both datasets use the same sample naming scheme.

"fuzzy" — before comparing names, hyphens, dots, and spaces are converted to underscores and all characters are lowercased. This handles common formatting inconsistencies such as MZUTI-2436 (capture) vs MZUTI_2436 (legacy). Sequences whose normalised names match are merged; the merged sequence retains the original capture alignment name. All legacy sequences whose normalised names do not match any capture sequence are added as separate rows with their original names preserved. This is the recommended option when datasets share the same voucher IDs but with minor formatting differences.

"species" — strips the trailing specimen/voucher ID (the last underscore-delimited field) before matching, so that e.g. Centrolene_bacatum_MZUTI-2436 (capture) and Centrolene_bacatum_KU12345 (legacy) are treated as the same taxon and merged into a single sequence named Centrolene_bacatum. When the legacy alignment contains multiple sequences for the same species, one representative is pre-selected before alignment: the sequence whose full name (including voucher) already appears in the capture alignment is preferred; otherwise the sequence with the most informative (non-gap, non-missing) bases is chosen. This is the recommended option when datasets cover the same species but with entirely different voucher IDs.

Which mode should I use?

Same voucher IDs in both datasets → "exact"

Same voucher IDs but minor formatting differences (hyphen vs underscore) → "fuzzy"

Same species, different voucher IDs entirely → "species"

Merging same-sample sequences (`combine.same.sample`)

When combine.same.sample = TRUE (default), sequences matched by name.match are merged column-by-column into a single sequence: where both sequences have a base, the capture sequence's base is preferred; where one sequence has a gap or N and the other has a base, the base is used. This produces a single complete sequence per sample that draws on data from both sources.

Set combine.same.sample = FALSE to skip merging entirely and keep both sequences as separate rows (each retaining its original name).

Including unmatched legacy loci

By default (include.uncaptured.legacy = FALSE), legacy loci that return no BLAST match to the capture target file are skipped. Set include.uncaptured.legacy = TRUE to save these as stand-alone alignments in the output directory. This is useful if your legacy dataset contains markers not in your capture panel that you still want in the final dataset.

Output directories

Two output directories are always available:

untrimmed_legacy-only — contains only the loci where legacy data was integrated (one file per matched locus). Each file contains the capture sequences for that locus plus the legacy sequences added via MAFFT, with same-sample merging applied if enabled.
untrimmed_legacy-all — (created when include.all.together = TRUE) — the full dataset: all capture alignments, with legacy-integrated versions replacing the originals where available. This is the directory to use for downstream analyses. When name.match = "species", all sequence names in this directory are collapsed to species level for consistency.

Mitochondrial loci

Mitochondrial loci (e.g. 12S, 16S, ND1) are typically absent from nuclear probe sets and will return no BLAST hit against the nuclear target file. To include them, set:

include.mitochondrial      = TRUE
mito.alignment.directory   = "/path/to/mito-alignments"
mito.alignment.format      = "phylip"

PhyloProcessR will build a second BLAST database from the most complete sequence of each mitochondrial capture alignment before the main loop begins. When a legacy locus returns no nuclear BLAST hit, it is re-queried against this mitochondrial database. If a match is found, the corresponding mitochondrial capture alignment is used for integration instead of a nuclear capture alignment.

MitoTrawlR is recommended for generating the mitochondrial capture alignments from your sequence-capture raw reads.

Step 2 (optional): Gene concatenation and unlinked dataset

If your capture panel targets multiple exons from the same gene, you can concatenate them before downstream analysis. Set concatenate.genes = TRUE and provide the gene metadata file.

Two sub-options control which alignments are concatenated:

concatenate.legacy.genes = TRUE — concatenates exons from the full integrated dataset (capture + legacy combined). Legacy loci that match a gene's exons are included in the concatenated gene alignment.
concatenate.legacy.genes = FALSE — concatenates exons only from the original capture alignments; legacy loci remain as separate alignments and are picked up by gatherUnlinked as stand-alone loci.

When gather.unlinked = TRUE, the gatherUnlinked function assembles the final analysis-ready dataset: multi-exon genes are represented by their concatenated alignment, and single-exon loci (including any stand-alone legacy loci) are included individually. The output goes to untrimmed_legacy-unlinked.

Step 3 (optional): Trimming

Set trim.alignments = TRUE to run superTrimmer on the final dataset. The trimmer removes poorly covered alignment edges, columns with excessive gaps, and samples below a minimum coverage threshold.

If gene concatenation was performed, trimming runs on untrimmed_legacy-unlinked and writes to trimmed_legacy-unlinked. If no concatenation was performed, trimming runs directly on the integrated alignments and writes to trimmed_legacy.

A trimming summary CSV is written alongside the output directory (e.g. trimmed_legacy-unlinked_trimming_summary.csv).

Integration summary log

After integration completes, a tab-delimited summary file is written to data-analysis/legacy-integration/untrimmed_legacy-integration_summary.txt. Each row is one taxon in the output dataset, with the following columns:

Column	Description
`Taxon`	Sample or species name
`Total_Loci`	Total alignments the taxon appears in
`Capture_Loci`	Loci from the capture dataset only (no legacy data added)
`Legacy_Nuclear_Loci`	Nuclear loci where legacy data was integrated
`Legacy_Mito_Loci`	Mitochondrial loci where legacy data was integrated
`Legacy_Total`	`Legacy_Nuclear_Loci` + `Legacy_Mito_Loci`
`Total_BP`	Total informative base pairs across all loci (gaps/Ns excluded)
`Pct_Loci`	Percentage of total available loci the taxon is present in

Rows are sorted by Total_Loci descending so the best-represented taxa appear first. This summary is useful for identifying taxa with poor legacy coverage or capture taxa that have no legacy counterpart.

Full configuration reference

# Main working directory
working.directory = "/PATH/TO/PROJECT/DIRECTORY"

# Target marker file (nuclear probe set)
target.file = "/PATH/TO/marker-seqs.fa"

# Gene/exon metadata for concatenation (two columns: Marker, Gene)
feature.gene.names = "data-analysis/gene_metadata.txt"

# Input capture alignment directory
alignment.directory = "data-analysis/alignments/untrimmed_all-markers"
alignment.format    = "phylip"

#---------------------------------------------------------------------------
# NEXUS conversion (Step 0, optional)
#---------------------------------------------------------------------------
# TRUE = split a concatenated NEXUS file into per-locus phylip files
convert.nexus          = FALSE
nexus.file             = NULL
nexus.output.directory = "data-analysis/alignments/legacy-alignments"
# Max % missing/gap characters per sample per locus (100 = keep all)
max.missing.percent    = 100

#---------------------------------------------------------------------------
# Legacy alignment source
# (auto-set to nexus.output.directory when convert.nexus = TRUE)
#---------------------------------------------------------------------------
legacy.directory = "/PATH/TO/legacy-alignments"
legacy.format    = "phylip"

#---------------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------------
threads   = 8
memory    = 40
overwrite = FALSE
quiet     = TRUE

#---------------------------------------------------------------------------
# Integration settings
#---------------------------------------------------------------------------
# TRUE = merge sequences from the same sample into one combined sequence
combine.same.sample = TRUE

# Name matching strategy:
#   "exact"   = names must be identical
#   "fuzzy"   = normalise hyphens/underscores/case then match
#   "species" = strip trailing voucher ID, merge to species name
name.match = "exact"

# TRUE = include legacy loci not found in the capture dataset as stand-alone alignments
include.uncaptured.legacy = FALSE

# TRUE = write a second output directory with all capture alignments,
#        updated with legacy-integrated versions where available
include.all.together = TRUE

#---------------------------------------------------------------------------
# Mitochondrial loci (optional)
#---------------------------------------------------------------------------
# TRUE = also integrate mito loci absent from the nuclear target file
include.mitochondrial    = FALSE
mito.alignment.directory = "/PATH/TO/mito-alignments"
mito.alignment.format    = "phylip"

#---------------------------------------------------------------------------
# Gene concatenation (optional)
#---------------------------------------------------------------------------
concatenate.genes        = TRUE
minimum.exons            = 2
# TRUE  = include legacy loci in gene concatenation
# FALSE = concatenate capture exons only; legacy loci stay as separate alignments
concatenate.legacy.genes = TRUE
gather.unlinked          = TRUE

#---------------------------------------------------------------------------
# Trimming (optional)
#---------------------------------------------------------------------------
trim.alignments        = TRUE
min.taxa.alignment     = 4
min.alignment.length   = 100
run.TrimAl             = TRUE
trim.column            = TRUE
min.column.gap.percent = 50
convert.ambiguous.sites = FALSE
trim.external          = TRUE
min.external.percent   = 50
trim.coverage          = TRUE
min.coverage.percent   = 35
min.coverage.bp        = 60

#---------------------------------------------------------------------------
# Program paths (conda bin directory is sufficient when using conda)
#---------------------------------------------------------------------------
conda.env  = "PATH/TO/miniconda3/envs/PhyloProcessR/bin"
blast.path = conda.env
mafft.path = conda.env
trimAl.path = conda.env

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tutorial 4: Legacy Integration

Tutorial 4: Legacy Data Integration (Workflow X3)

Overview

Prerequisites

Directory structure

Step 0 (optional): Converting a concatenated NEXUS file

Step 1: Integrating legacy alignments

How loci are matched

Controlling name matching: `name.match`

Merging same-sample sequences (`combine.same.sample`)

Including unmatched legacy loci

Output directories

Mitochondrial loci

Step 2 (optional): Gene concatenation and unlinked dataset

Step 3 (optional): Trimming

Integration summary log

Full configuration reference

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally

Tutorial 4: Legacy Integration

Tutorial 4: Legacy Data Integration (Workflow X3)

Overview

Prerequisites

Directory structure

Step 0 (optional): Converting a concatenated NEXUS file

Step 1: Integrating legacy alignments

How loci are matched

Controlling name matching: name.match

Merging same-sample sequences (combine.same.sample)

Including unmatched legacy loci

Output directories

Mitochondrial loci

Step 2 (optional): Gene concatenation and unlinked dataset

Step 3 (optional): Trimming

Integration summary log

Full configuration reference

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally

Controlling name matching: `name.match`

Merging same-sample sequences (`combine.same.sample`)