[Security] SSRF + path traversal chain in bio-research ncbi_utils.py and sra_geo_fetch.py #166
Description
The bio-research plugin's Python scripts have two defense-in-depth concerns in how they fetch and download FASTQ data from external APIs, plus one informational note on URL encoding.
Severity: Low-Medium (not immediately exploitable, but worth hardening)
Issue 1: HTTP Protocol Downgrade on FASTQ Downloads (Medium)
File: bio-research/skills/nextflow-development/scripts/utils/ncbi_utils.py (line 343)
The ENA API is queried over HTTPS (line 314), but the actual FASTQ file downloads are forced to unencrypted HTTP:
# Line 343 — FTP paths from ENA converted to HTTP (not HTTPS)
urls = [f"http://{url}" for url in ftp_urls.split(';') if url]
A real ENA response returns values like ftp.sra.ebi.ac.uk/vol1/fastq/SRR635/000/SRR6357070/SRR6357070_1.fastq.gz, which becomes http://ftp.sra.ebi.ac.uk/....
Impact: FASTQ downloads (often multi-GB) happen over unencrypted HTTP. A network-level attacker could modify file contents in transit. While genomic data isn't secret, integrity matters for research reproducibility.
Fix: Change http:// to https:// on line 343. ENA supports HTTPS downloads.
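The change can be sketched as a small helper. This is a minimal sketch, assuming the fastq_ftp field is a semicolon-separated string of scheme-less paths as in the example above. The name build_fastq_urls is hypothetical and does not exist in ncbi_utils.py.

```python
from typing import List


def build_fastq_urls(ftp_urls: str) -> List[str]:
    """Turn ENA's semicolon-separated, scheme-less paths into HTTPS URLs.

    Hypothetical helper: mirrors the list comprehension on line 343,
    but prefixes https:// instead of http://.
    """
    urls = []
    for path in ftp_urls.split(";"):
        path = path.strip()
        if path:  # skip empty segments, e.g. from a trailing ';'
            urls.append(f"https://{path}")
    return urls
```

For the path quoted above, this would yield https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR635/000/SRR6357070/SRR6357070_1.fastq.gz.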
Issue 2: No Domain Validation on Download URLs (Low)
File: bio-research/skills/nextflow-development/scripts/utils/ncbi_utils.py (lines 338-344)
The fastq_ftp field from the ENA API response is used to construct download URLs without validating that they point to known ENA/NCBI domains:
# Lines 338-344
ftp_urls = fields[ftp_idx]
if ftp_urls:
    urls = [f"http://{url}" for url in ftp_urls.split(';') if url]
    fastq_urls[srr] = urls
These URLs are then passed to download_file() which streams the response body to disk via requests.get(url, stream=True).
Impact: If the ENA API were ever compromised or its response tampered with, the code would fetch from arbitrary URLs and write content to disk. This is a defense-in-depth concern — the ENA query itself is over HTTPS (line 314), so MITM is not trivial.
Fix: Validate that download URLs match expected ENA domains (e.g., *.ebi.ac.uk, ftp.sra.ebi.ac.uk) before fetching.
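One way the allowlist check could look is sketched below. The function name is_allowed_download_url and the constant ALLOWED_HOST_SUFFIXES are hypothetical, and the suffix list is an assumption based on the domains named above; the real set of hosts should be confirmed against what the plugin actually downloads from. The sketch also rejects non-HTTPS URLs, which assumes the Issue 1 fix has been applied.

```python
from urllib.parse import urlparse

# Assumed allowlist: any host under ebi.ac.uk. Adjust to the hosts actually needed.
ALLOWED_HOST_SUFFIXES = (".ebi.ac.uk",)


def is_allowed_download_url(url: str) -> bool:
    """Return True only for HTTPS URLs whose host is on the allowlist."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    # hostname excludes userinfo and port, so "https://ftp.sra.ebi.ac.uk@evil.example/"
    # is judged by its real host, evil.example.
    host = (parsed.hostname or "").lower()
    return any(
        host == suffix.lstrip(".") or host.endswith(suffix)
        for suffix in ALLOWED_HOST_SUFFIXES
    )
```

Matching on the parsed hostname rather than on a substring of the URL avoids accepting URLs that merely contain an allowed domain in the path or userinfo. A caller would filter the URL list through this check before handing anything to download_file().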
Issue 3: Missing URL Encoding on API Parameters (Informational)
File: bio-research/skills/nextflow-development/scripts/utils/ncbi_utils.py (lines 99, 156, 212, 314)
User-supplied geo_id is interpolated into API URLs without urllib.parse.quote():
search_url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term={geo_id}[Accession]&retmode=json"
Since this is a CLI tool where the user provides their own arguments, this is not exploitable in practice, but URL encoding is good hygiene.
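A minimal sketch of the encoded form, using urllib.parse.quote() on the search term. The function name build_geo_search_url is hypothetical; the endpoint and query parameters are copied from the line quoted above.

```python
from urllib.parse import quote

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_search_url(geo_id: str) -> str:
    """Build the esearch URL with the user-supplied accession percent-encoded."""
    # safe="" also encodes '/', '&' and '=', so the value cannot add or
    # override query parameters.
    term = quote(f"{geo_id}[Accession]", safe="")
    return f"{ESEARCH_URL}?db=gds&term={term}&retmode=json"
```

With this, an input such as GSE1&db=other stays inside the term parameter instead of introducing a second db parameter.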
What's NOT a vulnerability (correcting our original report)
- Output path (--output): Our original report claimed this was "arbitrary file write." It's not: this is a CLI tool where the user supplies their own arguments. Normal CLI behavior, not a security issue.
- Compound attack scenario: Our original report chained HTTPS MITM + CLI argument control. This was unrealistic: each link requires conditions that make the chain implausible.
Suggested Fixes
- Line 343: Change f"http://{url}" to f"https://{url}" (simplest, highest impact)
- Lines 338-344: Add domain allowlist check before downloading
- Lines 99, 156, 212, 314: Use urllib.parse.quote() for geo_id/accession in URLs
Secure Patterns Already in Use (Credit)
- ✅ ENA API query is over HTTPS (line 314)
- ✅ yaml.safe_load() used correctly
- ✅ subprocess.run() uses list format, not shell=True
- ✅ No hardcoded secrets
- ✅ NCBI rate limiting properly enforced
🤖 Generated with Claude Code