This repository contains a WebShell detection framework based on CodeBERT with a bi-directional LSTM + Attention layer.
It provides end-to-end training, evaluation, and prediction using WebShell datasets.
- Data Preprocessing: Cleans base64-encoded data and extracts suspicious features (eval, system, SQLi, XSS, etc.).
- Model Architecture:
  - Pretrained CodeBERT embeddings (frozen).
  - Bi-directional 2-layer LSTM with dropout.
  - Attention mechanism for feature weighting.
  - Fully connected classifier with ReLU activation.
- Training Pipeline: Supports stratified train/val/test split, class balancing with `WeightedRandomSampler`, gradient clipping, and checkpoint saving.
- Evaluation Metrics: Accuracy, F1, Recall, Precision, Confusion Matrix.
- Deployment-ready: Provides a `WebShellDetector` class for single-sample prediction.
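
The preprocessing step listed above (decode base64 input, then flag suspicious constructs) can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from `train_codebert.py`: the names `decode_base64`, `extract_features`, and `SUSPICIOUS_PATTERNS`, and the regular expressions themselves, are assumptions made for the example.

```python
import base64
import binascii
import re

# Hypothetical patterns for illustration; the script's real feature list
# is not shown in this README and may differ.
SUSPICIOUS_PATTERNS = {
    "eval": re.compile(r"\beval\s*\(", re.IGNORECASE),
    "system": re.compile(r"\bsystem\s*\(", re.IGNORECASE),
    "sqli": re.compile(r"\bunion\s+select\b", re.IGNORECASE),
    "xss": re.compile(r"<script\b", re.IGNORECASE),
}


def decode_base64(sample: str) -> str:
    """Decode a base64 string to text; return "" if it is not valid base64."""
    try:
        raw = base64.b64decode(sample, validate=True)
    except (binascii.Error, ValueError):
        return ""
    # Replace undecodable bytes rather than failing on binary payloads.
    return raw.decode("utf-8", errors="replace")


def extract_features(text: str) -> dict:
    """Return a 0/1 flag per pattern indicating whether it occurs in text."""
    return {
        name: int(bool(pattern.search(text)))
        for name, pattern in SUSPICIOUS_PATTERNS.items()
    }


# The first demo sample from this README decodes to a one-line PHP shell.
decoded = decode_base64("PD9waHAgZXZhbCgkX1BPU1RbJ2NtZCddKTs=")
print(decoded)
print(extract_features(decoded))
```

Flags like these could be joined with the decoded text before tokenization or kept as separate features; which approach the script takes is not stated here.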
codebert-training/
├── train_codebert.py # Main training script (this file)
├── requirements.txt # Python dependencies
├── .gitignore # Git ignore rules
└── README.md
git clone https://github.com/<your-username>/codebert-training.git
cd codebert-training
pip install -r requirements.txt

Default paths in the script:
df = preprocessor.load_and_clean_data([
'webshell.csv',
'webshell_data.csv'
])

python train_codebert.py
- Trains the model for `epochs=3` (adjustable in `Config`).
- Saves the best model to `best_codebert_model.pt`.
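
The class balancing mentioned in the feature list depends on a per-sample weight for each training example. One common way to derive those weights is inverse class frequency, sketched below in plain Python. The helper name `inverse_frequency_weights` is hypothetical, and the script may compute its weights differently.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Give each sample a weight of 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Toy label list (assumed): three normal samples (0) and one malicious (1).
labels = [0, 0, 0, 1]
weights = inverse_frequency_weights(labels)

# Summed per class, the weights are equal, so a sampler that draws in
# proportion to them picks each class about equally often.
totals = {
    cls: sum(w for w, y in zip(weights, labels) if y == cls)
    for cls in set(labels)
}
print(weights)
print(totals)
```

A list like `weights` is the kind of input accepted by `torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)`, which is then passed to a `DataLoader` through its `sampler` argument.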
python train_codebert.py

At the end of training, the script evaluates on the test set:

Test Metrics - Acc: 0.9821 | F1: 0.9756 | Recall: 0.9712 | Precision: 0.9801

The script includes a demo for sample predictions:
test_samples = [
"PD9waHAgZXZhbCgkX1BPU1RbJ2NtZCddKTs=", # PHP shell
"U0hFTEwgY21kLmV4ZQ==", # Suspicious
"aW5kZXguaHRtbA==" # Normal
]
detector = WebShellDetector(config)
detector.load_model("best_codebert_model.pt")
for sample in test_samples:
    print(detector.predict(sample))

Output:
Sample 1: Malicious (Confidence: 98.7%)
Sample 2: Malicious (Confidence: 94.1%)
Sample 3: Normal (Confidence: 99.2%)

- torch
- transformers
- scikit-learn
- pandas
- tqdm
A GPU is recommended. Select the CUDA device with:
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

Install all dependencies:
pip install -r requirements.txt

- Model weights (`*.pt`) are not tracked in the GitHub repository.
- Store them in Git LFS or external storage (Google Drive, Baidu Netdisk).
- Update this README with download links if sharing.
This project is licensed under the MIT License - see the LICENSE file for details.