AMPidentifier: A Cross-Platform Ensemble Toolkit for Antimicrobial Peptide Prediction

As antimicrobial resistance threatens to undo a century of medical progress, AMPIdentifier provides researchers with a rigorous computational framework to accelerate the discovery of next-generation antibiotic candidates.

Background

Antimicrobial peptides (AMPs) are short polypeptides (typically 5-50 residues) produced by most living organisms as part of innate immune defense. They disrupt bacterial membranes through electrostatic and hydrophobic interactions, and their activity is governed by a combination of net charge, amphipathicity, and the spatial distribution of specific residue types along the sequence. The identification of AMPs from sequence data is therefore a classification problem where the features must capture both global physicochemical properties and positional residue patterns.

The previous version of this pipeline (branch beta) trained four classifiers on ten global scalar descriptors computed by modlamp.GlobalDescriptor.calculate_all(). After hyperparameter tuning with RandomizedSearchCV (50 iterations, StratifiedKFold with 5 folds, scoring: AUC-ROC), all four models converged to AUC-ROC 0.951-0.954 and Matthews Correlation Coefficient (MCC) 0.777-0.780. The plateau across architecturally distinct models indicated that the bottleneck was the feature representation, not model capacity.

AMPidentifier addresses this through two changes: the descriptor set is expanded from 10 global scalars to 22 features that include the hydrophobic moment, grouped amino acid composition fractions, and positional features derived from free energy of transfer (FET) and solvent accessibility (SA) groups; and the model suite is expanded to five classifiers plus a soft-voting ensemble. On an independent benchmark set (n=4,736), the voting ensemble reaches AUC-ROC 0.950 and MCC 0.742, compared to AUC-ROC 0.951-0.954 and MCC 0.777-0.780 from the beta pipeline on its internal test set, which used a different dataset split.

Installation

Requirements: Python 3.10 or later.

git clone https://github.com/madsondeluna/AMPidentifier.git
cd AMPidentifier
pip install -r requirements.txt

Core dependencies: scikit-learn, xgboost, lightgbm, modlamp, pandas, numpy, matplotlib, joblib, biopython.

Installation via pip (alternative)

To install directly from PyPI without cloning the repository:

pip install ampidentifier

This registers the ampidentifier command globally:

ampidentifier -i sequences.fasta -o results/
ampidentifier -i sequences.fasta -o results/ -m lgbm
ampidentifier -i sequences.fasta -o results/ -m voting --threshold 0.40

For step-by-step setup in Google Colab or Jupyter, see COLAB_GUIDE.md and the interactive notebook AMPidentifier_Colab_Guide.ipynb.

Usage

AMPidentifier is available as a web application, a pip package, and a cloneable repository.

Web application

The easiest way to run predictions is through the web interface at ampidentifier.com. No installation is required. Paste or upload FASTA sequences, select a model, and download the results as CSV.

Usage map and privacy

The web interface shows a usage map of where the tool has been run. Each prediction resolves the request IP to an approximate location through ipapi.co. The application stores only the aggregated city and country plus a per-location count, never the raw IP address. The single purpose is measuring the reach of the tool on the map. These are coarse estimates without geographic precision and are inaccurate for VPN or institutional-network traffic. Private, loopback, and local IPs are ignored, so they never generate a point.

Command-line interface

Prediction workflow. FASTA sequences are parsed, 22 physicochemical descriptors are computed per sequence, and five independent classifiers (RF, SVM, GB, XGB, LGBM) each produce a P(AMP) score. The soft-voting ensemble averages these scores; sequences with P(AMP) above the optimized threshold are classified as AMP.

AMPidentifier is invoked from the project root via python3 main.py.

Command-line flags

Flag	Short	Type	Default	Description
`--input`	`-i`	`str`	required	Path to input FASTA file
`--output_dir`	`-o`	`str`	required	Output directory (created if absent)
`--model`	`-m`	`str`	`voting`	Model to use (see below)
`--threshold`		`float`	model default	Decision threshold (0.0 to 1.0)

Model options (`-m`)

Value	Description	Default threshold
`rf`	Random Forest	0.56
`svm`	Support Vector Machine, RBF kernel	0.47
`gb`	Gradient Boosting	0.55
`xgb`	XGBoost	0.48
`lgbm`	LightGBM	0.71
`voting`	Soft-voting ensemble of all five models (recommended)	0.56

Examples

# Predict with the voting ensemble (default, recommended)
python3 main.py -i sequences.fasta -o results/

# Predict with a specific model
python3 main.py -i sequences.fasta -o results/ -m xgb

# Predict with a custom threshold
python3 main.py -i sequences.fasta -o results/ -m voting --threshold 0.40

Output files

File	Description
`physicochemical_features.csv`	22-feature matrix for all input sequences
`predictions_{model}.csv`	Per-sequence AMP probability and binary label

Training and evaluation scripts

Command	Description
`python3 -m model_training.train`	Baseline training with default hyperparameters
`python3 -m model_training.tune`	Hyperparameter tuning for all five models
`python3 -m model_training.tune rf xgb`	Tune specific models only
`python3 -m model_training.tune voting`	Build and save the voting ensemble from tuned models
`python3 -m model_training.benchmark`	Evaluate all tuned models on the independent benchmark set
`python3 -m model_training.eda`	Generate exploratory data analysis figures
`python3 -m model_training.plot_tuning`	Generate post-tuning diagnostic figures

Dataset

Sources

AMPidentifier uses the same positive and negative sequence sets as the beta pipeline (v1.0). No new data collection was performed.

Pre-processing

All sequences were filtered to retain only the 20 standard amino acids (ACDEFGHIKLMNPQRSTVWY) and sequences in the 5-255 residue range. Exact-sequence deduplication was applied.

To prevent sequence length from acting as a discriminating feature, the negative set was subsampled using length-stratified sampling: the length distribution of the negative set was matched to the positive distribution across 10 equal-width bins, with per-bin quotas proportional to positive class counts.

Final composition

Class	File	Sequences
AMP (positive)	`model_training/data/positive_sequences.fasta`	6,623
Non-AMP (negative)	`model_training/data/negative_sequences.fasta`	6,623
Total		13,246

The dataset was split 80/20 into training (10,596) and test (2,650) sets using random_state=42 with stratification on the binary label (1,325 per class in the test set).

Feature engineering

The feature set follows the Macrel design (Santos-Júnior et al. 2020), with additions from the grouped amino acid composition scheme of Jhong et al. (2019) and positional descriptors from the CTD framework (Dubchak et al. 1995). All 22 features are implemented in amp_identifier/feature_extraction.py and grouped into four families.

Table 1. Complete feature set (22 features).

Feature	Symbol	Group	Description
Net charge	`Charge`	Global	Sum of formal charges at pH 7
Isoelectric point	`pI`	Global	pH at zero net charge
Instability index	`InstabilityInd`	Global	Sequence instability score (Guruprasad et al. 1990)
Aliphatic index	`AliphaticInd`	Global	Relative volume of aliphatic side chains
Boman index	`BomanInd`	Global	Protein-binding propensity (Boman 2003)
Hydrophobic ratio	`HydrophRatio`	Global	Fraction of hydrophobic residues (ACFILMVW)
Hydrophobic moment	`HydrophobicMoment`	Global	Amphipathic helical moment, Eisenberg scale, δ=100°
Acidic fraction	`f_acidic`	Grouped AAC	Fraction of DE residues
Basic fraction	`f_basic`	Grouped AAC	Fraction of KRH residues
Polar fraction	`f_polar`	Grouped AAC	Fraction of STNQ residues
Non-polar fraction	`f_nonpolar`	Grouped AAC	Fraction of AVLIMFYWP residues
Aliphatic fraction	`f_aliphatic`	Grouped AAC	Fraction of AVLIM residues
Aromatic fraction	`f_aromatic`	Grouped AAC	Fraction of FYW residues
Charged fraction	`f_charged`	Grouped AAC	Fraction of DEKRH residues
Small fraction	`f_small`	Grouped AAC	Fraction of AGSDT residues
Tiny fraction	`f_tiny`	Grouped AAC	Fraction of AGS residues
FET low D1	`FET_low_D1`	FET positional	Relative position of first low-FET residue (ILVWAMGT)
FET mid D1	`FET_mid_D1`	FET positional	Relative position of first mid-FET residue (FYSQCN)
FET high D1	`FET_high_D1`	FET positional	Relative position of first high-FET residue (PHKEDR)
SA buried D1	`SA_buried_D1`	SA positional	Relative position of first buried residue (ALFCGIVW)
SA exposed D1	`SA_exposed_D1`	SA positional	Relative position of first exposed residue (RKQEND)
SA intermediate D1	`SA_inter_D1`	SA positional	Relative position of first intermediate residue (MSPTHY)

Global descriptors

Six scalar descriptors were computed using modlamp.GlobalDescriptor (Müller et al. 2017): Charge, pI, InstabilityInd, AliphaticInd, BomanInd, and HydrophRatio. The hydrophobic moment was computed using modlamp.PeptideDescriptor with the Eisenberg hydrophobicity scale (Eisenberg et al. 1982) and a helical projection angle of $\delta = 100^\circ$:

$$\mu_H = \frac{1}{L} \sqrt{ \left( \sum_{i=1}^{L} H_i \sin(i \cdot \delta) \right)^2 + \left( \sum_{i=1}^{L} H_i \cos(i \cdot \delta) \right)^2 }$$

where $H_i$ is the Eisenberg hydrophobicity of residue $i$, $\delta = 100^\circ$ is the helical rotation angle per residue, and $L$ is sequence length. $\mu_H$ captures amphipathic helical character: sequences with a hydrophobic face and a hydrophilic face yield high $\mu_H$ even when mean hydrophobicity is moderate.

Grouped amino acid composition

Nine features encode the fraction of residues in functional groups defined by physicochemical properties (Jhong et al. 2019; Nagarajan et al. 2019). For group $G$:

$$f_G = \frac{\sum_{i \in G} n_i}{L}$$

where $n_i$ is the count of residue type $i$ and $L$ is sequence length. Groups were defined as: acidic (DE), basic (KRH), polar (STNQ), non-polar (AVLIMFYWP), aliphatic (AVLIM), aromatic (FYW), charged (DEKRH), small (AGSDT), and tiny (AGS).

FET positional features

Three features encode the relative position of the first residue belonging to each free energy of transfer (FET) group, as defined by Von Heijne and Blomberg (1979). FET groups partition residues by their thermodynamic cost of insertion into a lipid bilayer: low-FET residues (ILVWAMGT) are membrane-preferring; high-FET residues (PHKEDR) are membrane-avoiding.

$$\text{FET}_{g,D1} = \frac{\text{position of first residue} \in g}{L}$$

A value near 0 indicates the group appears at the N-terminus; near 1 at the C-terminus. Zero is returned when no residue of the group is present. These features follow the CTD Distribution D1 convention (Dubchak et al. 1995; Chou and Shen 2007).

SA positional features

Three features encode the relative position of the first residue in each solvent accessibility (SA) group (Bhadra et al. 2018): buried (ALFCGIVW), exposed (RKQEND), and intermediate (MSPTHY). The encoding is identical to FET positional features:

$$\text{SA}_{g,D1} = \frac{\text{position of first residue} \in g}{L}$$

Exploratory data analysis

All figures were generated by model_training/eda.py. The following observations motivate the feature choices.

Figure 1. Sequence length distribution of AMPs and non-AMPs.

Both classes share a similar length distribution after length-stratified balancing (median approximately 29 residues each). This confirms that sequence length cannot be used as a proxy for AMP identity.

Figure 2. Mean amino acid composition of AMPs versus non-AMPs.

AMPs are enriched in K, R, C, and G relative to non-AMPs. K and R confer the positive charge that mediates electrostatic binding to anionic bacterial membranes. C is characteristic of disulfide-stabilized defensins. Non-AMPs show higher proportions of E, M, and D.

Figure 3. Global physicochemical descriptors and hydrophobic moment.

Charge and pI show the clearest class separation: AMPs concentrate at high positive charge and high pI, whereas non-AMPs distribute near neutrality. HydrophobicMoment separates the classes, with AMPs showing higher values consistent with amphipathic helical architecture.

Figure 4. Grouped amino acid composition.

f_basic and f_charged are enriched in AMPs, reflecting the cationic character of most natural AMPs. f_acidic is lower in AMPs. f_aliphatic and f_aromatic are similar between classes, indicating that hydrophobicity alone does not discriminate.

Figure 5. FET and solvent accessibility positional features.

Differences between AMPs and non-AMPs in FET and SA D1 features reflect structural constraints on the N-terminal region, where many AMPs initiate membrane contact.

Model development

The development pipeline has three stages: baseline training with default hyperparameters, hyperparameter tuning with randomized search, and construction of a soft-voting ensemble from the five tuned models. All stages use the same 80/20 train-test split with random_state=42.

Baseline training

Five classifiers are trained in model_training/train.py with default hyperparameters to establish a performance floor before tuning:

RF: RandomForestClassifier, 200 trees, class_weight='balanced', RobustScaler.
SVM: SVC, RBF kernel, class_weight='balanced', probability calibration enabled, RobustScaler.
GB: GradientBoostingClassifier, 100 estimators, RobustScaler.
XGB: XGBClassifier, 100 estimators, scale_pos_weight=1, RobustScaler.
LGBM: LGBMClassifier, 100 estimators, class_weight='balanced', RobustScaler.

All five models use RobustScaler (median and interquartile range normalization) at baseline. InstabilityInd and AliphaticInd contain extreme outliers that inflate standard deviation-based scaling; RobustScaler is more resistant to these.

Table 2. Baseline results on the internal test set (n=2,650).

Model	AUC-ROC	MCC	F1	Precision	Recall	Specificity	Threshold
RF	0.9719	0.8370	0.9187	0.9160	0.9215	0.9155	0.48
SVM	0.9615	0.8135	0.9082	0.8911	0.9260	0.8868	0.40
GB	0.9625	0.8114	0.9049	0.9125	0.8974	0.9140	0.50
XGB	0.9719	0.8412	0.9191	0.9338	0.9049	0.9358	0.64
LGBM	0.9720	0.8338	0.9138	0.9416	0.8875	0.9449	0.62

Hyperparameter tuning

Tuning is performed with model_training/tune.py. RandomizedSearchCV samples 50 candidate hyperparameter combinations per model over StratifiedKFold(n_splits=5), scoring by roc_auc. The search uses refit=True, so the best estimator is retrained on the full training set (10,596 sequences) with the selected hyperparameters. No separate retraining step is required.

Scaling strategy during tuning: RobustScaler for RF, GB, XGB, and LGBM; StandardScaler for SVM. The SVM dual solver diverges on this dataset size with RobustScaler for large values of $C$ because the kernel matrix becomes ill-conditioned; StandardScaler produces stable convergence.

Table 3. Hyperparameter search spaces used in RandomizedSearchCV.

Model	Parameter	Distribution
RF	`n_estimators`	Uniform integer [100, 600)
RF	`max_depth`	Categorical: None, 10, 20, 30, 40
RF	`min_samples_split`	Uniform integer [2, 15)
RF	`min_samples_leaf`	Uniform integer [1, 8)
RF	`max_features`	Categorical: sqrt, log2, 0.3, 0.5
SVM	`C`	Log-uniform [10⁻², 10²]
SVM	`gamma`	Categorical: scale, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1.0
GB	`n_estimators`	Uniform integer [100, 500)
GB	`learning_rate`	Log-uniform [10⁻³, 5×10⁻¹]
GB	`max_depth`	Uniform integer [2, 8)
GB	`subsample`	Uniform [0.5, 1.0]
GB	`min_samples_split`	Uniform integer [2, 15)
GB	`min_samples_leaf`	Uniform integer [1, 8)
XGB	`n_estimators`	Uniform integer [100, 500)
XGB	`learning_rate`	Log-uniform [10⁻³, 5×10⁻¹]
XGB	`max_depth`	Uniform integer [2, 8)
XGB	`subsample`	Uniform [0.5, 1.0]
XGB	`colsample_bytree`	Uniform [0.5, 1.0]
XGB	`reg_alpha`	Log-uniform [10⁻⁴, 10¹]
XGB	`reg_lambda`	Log-uniform [10⁻¹, 10¹]
XGB	`min_child_weight`	Uniform integer [1, 10)
LGBM	`n_estimators`	Uniform integer [100, 600)
LGBM	`learning_rate`	Log-uniform [10⁻³, 5×10⁻¹]
LGBM	`max_depth`	Uniform integer [3, 10)
LGBM	`num_leaves`	Uniform integer [20, 150)
LGBM	`subsample`	Uniform [0.5, 1.0]
LGBM	`colsample_bytree`	Uniform [0.5, 1.0]
LGBM	`reg_alpha`	Log-uniform [10⁻⁴, 10¹]
LGBM	`reg_lambda`	Log-uniform [10⁻¹, 10¹]
LGBM	`min_child_samples`	Uniform integer [5, 50)

Table 4. Best hyperparameters selected by RandomizedSearchCV.

Model	Parameter	Best value
RF	`n_estimators`	229
RF	`max_features`	0.3
RF	`min_samples_split`	3
RF	`min_samples_leaf`	1
RF	`max_depth`	None (unpruned)
SVM	`C`	2.796
SVM	`gamma`	0.1
GB	`n_estimators`	293
GB	`learning_rate`	0.0618
GB	`max_depth`	6
GB	`subsample`	0.846
GB	`min_samples_split`	3
GB	`min_samples_leaf`	2
XGB	`n_estimators`	448
XGB	`learning_rate`	0.0594
XGB	`max_depth`	6
XGB	`subsample`	0.678
XGB	`colsample_bytree`	0.758
XGB	`reg_alpha`	0.0125
XGB	`reg_lambda`	0.313
XGB	`min_child_weight`	6
LGBM	`n_estimators`	383
LGBM	`learning_rate`	0.323
LGBM	`max_depth`	7
LGBM	`num_leaves`	47
LGBM	`subsample`	0.942
LGBM	`colsample_bytree`	0.581
LGBM	`reg_alpha`	0.00124
LGBM	`reg_lambda`	0.681
LGBM	`min_child_samples`	10

Figure 6. Cross-validation AUC-ROC score distributions across 50 RandomizedSearchCV iterations per model. LGBM and XGB reach the highest median CV AUC, followed by GB and RF. SVM shows a narrower distribution, reflecting fewer degrees of freedom in the search space.

Figure 7. Top 10 hyperparameter configurations by CV AUC-ROC per model. Each point represents one candidate evaluated during the randomized search; the selected configuration is marked with a star.

Figure 8. Hyperparameter performance surface: RF. CV AUC-ROC as a function of n_estimators and max_features. Performance is stable across tree counts above 200; max_features=0.3 consistently outperforms sqrt and log2 on this feature set.

Figure 9. Hyperparameter performance surface: GB. CV AUC-ROC as a function of learning_rate and n_estimators. Lower learning rates paired with more estimators produce the highest CV AUC, consistent with gradient boosting convergence theory.

Figure 10. Hyperparameter performance surface: SVM. CV AUC-ROC as a function of C and gamma. Performance degrades sharply at high gamma values, indicating overfitting to individual training points.

Figure 11. Hyperparameter performance surface: XGB. CV AUC-ROC as a function of learning_rate and n_estimators. The selected configuration (star) sits near the peak of the surface, with n_estimators=448 and learning_rate=0.059.

Figure 12. Hyperparameter performance surface: LGBM. CV AUC-ROC as a function of learning_rate and num_leaves. High learning_rate paired with moderate num_leaves (47) yields the best generalization for this dataset.

Table 5. Results after hyperparameter tuning on the internal test set (n=2,650).

Model	Best CV AUC	AUC-ROC	MCC	F1	Precision	Recall	Specificity	Threshold
RF	0.9695	0.9719	0.8386	0.9170	0.9384	0.8966	0.9411	0.56
SVM	0.9671	0.9692	0.8385	0.9194	0.9180	0.9208	0.9177	0.47
GB	0.9723	0.9741	0.8394	0.9187	0.9290	0.9087	0.9306	0.55
XGB	0.9741	0.9744	0.8430	0.9217	0.9196	0.9238	0.9192	0.48
LGBM	0.9740	0.9752	0.8548	0.9260	0.9415	0.9109	0.9434	0.71

Tuning improved AUC-ROC and MCC for all models relative to baseline. The largest gains occurred in SVM (MCC +0.025, AUC-ROC +0.008), GB (MCC +0.028, AUC-ROC +0.012), and LGBM (MCC +0.021). RF and XGB changed marginally because their baseline hyperparameters were already near optimal for this feature set and dataset size.

Figure 13. ROC curves for all tuned models on the internal test set (n=2,650). LGBM and XGB reach the highest AUC-ROC (0.975 and 0.974). The voting ensemble (dashed) achieves AUC-ROC 0.977, above all individual models.

Figure 14. Confusion matrices for all tuned models on the internal test set. All models correctly classify over 91% of sequences. LGBM achieves the fewest false negatives; RF achieves the fewest false positives.

Figure 15. Per-model comparison of all evaluation metrics on the internal test set. LGBM leads in MCC (0.855), precision (0.942), and specificity (0.943). The voting ensemble leads in F1 (0.928) and AUC-ROC (0.977).

Figure 16. Precision-recall curves for all tuned models. All models maintain precision above 0.87 at recall 0.90, indicating low false positive rates at clinically relevant sensitivity levels.

Figure 17. Detection error tradeoff (DET) curves for all tuned models. LGBM achieves the lowest combined false positive and false negative rates on the internal test set.

Figure 18. Threshold sensitivity for all tuned models: MCC, F1, Precision, and Recall as a function of decision threshold. The vertical dashed line marks the MCC-optimized threshold; the star marks the MCC at that threshold.

Figure 19. Calibration curves (reliability diagrams) for all tuned models. RF and GB are well-calibrated; SVM shows slight overconfidence at high predicted probabilities. LGBM and XGB are slightly underconfident in the 0.6-0.8 range.

Figure 20. Mean impurity-based feature importances for tree-based models (RF, GB, XGB, LGBM). Charge, BomanInd, HydrophobicMoment, and pI are consistently ranked among the top features across all four models.

Soft-voting ensemble

After tuning all five base models, a soft-voting ensemble is constructed with python3 -m model_training.tune voting. The ensemble loads the five tuned models from model_training/tuned_model/ and averages their predicted AMP probabilities:

$$\hat{p}_{\text{voting}} = \frac{1}{K} \sum_{k=1}^{K} \hat{p}_k$$

where $K = 5$ and $\hat{p}_k$ is the AMP probability output of base model $k$. Each base model applies its own fitted scaler internally before prediction (RobustScaler for RF, GB, XGB, LGBM; StandardScaler for SVM), so the ensemble accepts raw unscaled feature matrices at prediction time. This is implemented in model_training/voting.py through the VotingEnsemble class, which stores each model alongside its fitted scaler.

The ensemble is saved as model_training/tuned_model/amp_model_voting_tuned.pkl.

Table 6. Voting ensemble results on the internal test set (n=2,650).

Model	AUC-ROC	MCC	F1	Precision	Recall	Specificity	Threshold
VOTING	0.9771	0.8585	0.9280	0.9424	0.9140	0.9442	0.56

The voting ensemble achieves AUC-ROC 0.977 and MCC 0.859, exceeding all individual base models. The margin over the best single model (LGBM, MCC 0.855) reflects partial independence of prediction errors across the five architectures.

Threshold optimization

The default decision threshold of 0.50 is not used. After training or tuning, each model's probability output is swept over 81 equally spaced values from 0.10 to 0.90 (step 0.01) on the test set, and the threshold that maximizes MCC is selected. MCC is sensitive to all four entries of the confusion matrix and provides a balanced criterion for binary classification on class-balanced datasets:

$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

For any deployment context, the threshold should be selected based on the acceptable false positive-false negative trade-off. The --threshold flag in the CLI allows overriding the MCC-optimized default.

External benchmark

The directory benchmarking/external_benchmark/ contains the results of a systematic comparison between AMPidentifier and 11 published AMP prediction tools: AMPScanner v2, AMPlify, amPEPpy, ampir, CAMPR3 (SVM, RF, ANN, DA), DBAASP, and ClassAMP (SVM, RF). All tools were evaluated on the same independent dataset (Bhangu et al., 2025; n=4,736: 2,368 AMP, 2,368 non-AMP). Predictions and per-tool metrics are stored in benchmarking/external_benchmark/tools/. The corresponding AMPidentifier results are in benchmarking/external_benchmark/ampidentifier_results/. The benchmark sequences are in benchmarking/external_benchmark/sequences/benchmark.fasta.

Nine additional tools were identified as candidates but could not be evaluated due to server unavailability, DNS failures, or absence of a public code release. The full list with DOIs and exclusion reasons is in benchmarking/external_benchmark/excluded_tools.md. This reflects a persistent problem in the field: published tools frequently become inaccessible after the associated paper is published, with no maintenance, deprecation notice, or code release, which limits reproducibility and independent comparison.

The figure below summarizes the comparative performance of all evaluated tools on the benchmark dataset.

AMPidentifier achieves competitive performance relative to the tools that remain accessible and actively maintained. Some tools attain higher individual metrics, showing that the field has capable methods. The comparison also shows that AMPidentifier produces results within the range of established tools without requiring a remote server or a proprietary runtime environment.

A detailed comparative analysis, including discriminative performance metrics, classification profiles, and computational scalability, will be reported in the application note submitted to the Journal of Chemical Information and Modeling (JCIM).

Repository structure

AMPidentifier/
├── amp_identifier/                  # prediction package
│   ├── __init__.py
│   ├── core.py                      # main pipeline orchestrator
│   ├── data_io.py                   # FASTA parsing
│   ├── feature_extraction.py        # 22-feature computation
│   └── reporting.py
├── benchmarking/                    # benchmark data and results
│   ├── benchmark.fasta
│   ├── benchmark_results.csv
│   ├── fig_bench_confusion.png
│   ├── fig_bench_metrics.png
│   ├── fig_bench_roc.png
│   └── external_benchmark/          # comparative analysis against published tools
│       ├── excluded_tools.md        # tools excluded and reasons
│       ├── sequences/               # benchmark.fasta (Bhangu et al., 2025; n=4,736)
│       ├── ampidentifier_results/   # AMPidentifier predictions and metrics
│       └── tools/                   # per-tool predictions and metrics
│           ├── ampscanner_v2/
│           ├── amplify/
│           ├── ampeppy/
│           ├── ampir/
│           ├── campr3-svm/
│           ├── campr3-rf/
│           ├── campr3-ann/
│           ├── campr3-da/
│           ├── dbaasp/
│           ├── classamp-svm/
│           └── classamp-rf/
├── imgs/                            # logos and workflow diagram
├── model_training/
│   ├── data/                        # training sequences and feature list
│   │   ├── positive_sequences.fasta
│   │   ├── negative_sequences.fasta
│   │   └── selected_features.txt
│   ├── eda/                         # EDA figures (fig01–fig05)
│   ├── tuned_model/                 # trained models, scalers, thresholds, figures
│   │   ├── amp_model_{rf,svm,gb,xgb,lgbm,voting}_tuned.pkl
│   │   ├── scaler_{robust,std}.pkl
│   │   ├── threshold_{rf,svm,gb,xgb,lgbm,voting}.txt
│   │   ├── cv_results_{rf,svm,gb,xgb,lgbm}.csv
│   │   ├── result_{rf,svm,gb,xgb,lgbm,voting}.csv
│   │   ├── tuning_report.{csv,txt}
│   │   ├── figures/                 # post-tuning diagnostic figures (fig01–fig15)
│   │   └── outputs/                 # cached .npz arrays for plot_tuning
│   ├── benchmark.py                 # independent benchmark evaluation
│   ├── collect_outputs.py           # generates .npz caches for plot_tuning
│   ├── eda.py                       # exploratory data analysis figures
│   ├── plot_tuning.py               # post-tuning diagnostic figures
│   ├── train.py                     # baseline training
│   ├── tune.py                      # hyperparameter tuning + voting ensemble
│   └── voting.py                    # VotingEnsemble class
├── main.py                          # CLI entry point
├── requirements.txt
└── README.md

Intellectual property

This tool is officially registered with the INPI - Instituto Nacional da Propriedade Industrial (Brazilian National Institute of Industrial Property).

Registration number: BR 51 2025 005859-4
Registration date: November 18, 2025
Title: AMPidentifier: A modular python toolkit for predicting antimicrobial peptides using ensemble machine learning
Registered authors: Madson A. de Luna Aragão, Rafael L. da Silva, João Pacífico, Carlos A. dos Santos-Silva, Ana M. Benko-Iseppon

How to cite

If this tool or its outputs support your research, please cite the repository:

de Luna-Aragão, M. A., da Silva, R. L., Pacifico Bezerra Neto, J., dos Santos-Silva, C. A., da Silva Santos, D. E., & Benko-Iseppon, A. M. (2026). AMPidentifier: A Cross-Platform Ensemble Toolkit for Antimicrobial Peptide Prediction. https://github.com/madsondeluna/AMPidentifier

A machine-readable citation is available in CITATION.cff.

Name		Name	Last commit message	Last commit date
Latest commit History 286 Commits
amp_identifier		amp_identifier
benchmarking		benchmarking
diagram		diagram
imgs		imgs
model_training		model_training
paper		paper
webapp		webapp
.gitignore		.gitignore
AMPidentifier_Colab_Guide.ipynb		AMPidentifier_Colab_Guide.ipynb
CITATION.cff		CITATION.cff
COLAB_GUIDE.md		COLAB_GUIDE.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Procfile		Procfile
README.md		README.md
build.sh		build.sh
main.py		main.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
runtime.txt		runtime.txt
wsgi.py		wsgi.py

Folders and files

Latest commit

History

Repository files navigation

AMPidentifier: A Cross-Platform Ensemble Toolkit for Antimicrobial Peptide Prediction

Contents

Background

Installation

Installation via pip (alternative)

Usage

Web application

Usage map and privacy

Command-line interface

Command-line flags

Model options (-m)

Examples

Output files

Training and evaluation scripts

Dataset

Sources

Pre-processing

Final composition

Feature engineering

Global descriptors

Grouped amino acid composition

FET positional features

SA positional features

Exploratory data analysis

Model development

Baseline training

Hyperparameter tuning

Soft-voting ensemble

Threshold optimization

External benchmark

Repository structure

Intellectual property

How to cite

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Model options (`-m`)

Packages