[Security] SSRF + path traversal chain in bio-research ncbi_utils.py and sra_geo_fetch.py #166

@Aravindargutus

Description

The bio-research plugin's Python scripts have three hardening concerns in how they query external APIs and download FASTQ data: two defense-in-depth issues in the download path and one informational note on URL construction.

Severity: Low-Medium (not immediately exploitable, but worth hardening)

Issue 1: HTTP Protocol Downgrade on FASTQ Downloads (Medium)

File: bio-research/skills/nextflow-development/scripts/utils/ncbi_utils.py (line 343)

The ENA API is queried over HTTPS (line 314), but the actual FASTQ file downloads are forced to unencrypted HTTP:

# Line 343 — FTP paths from ENA converted to HTTP (not HTTPS)
urls = [f"http://{url}" for url in ftp_urls.split(';') if url]

A real ENA response returns values like ftp.sra.ebi.ac.uk/vol1/fastq/SRR635/000/SRR6357070/SRR6357070_1.fastq.gz, which becomes http://ftp.sra.ebi.ac.uk/....

Impact: FASTQ downloads (often multi-GB) happen over unencrypted HTTP. A network-level attacker could modify file contents in transit. While genomic data isn't secret, integrity matters for research reproducibility.

Fix: Change http:// to https:// on line 343. ENA supports HTTPS downloads.
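
A minimal sketch of the one-line change, assuming the fastq_ftp value is a semicolon-separated string of scheme-less paths as in the example above. The helper name ena_paths_to_https is hypothetical and does not exist in ncbi_utils.py; it only isolates the list comprehension so the behavior can be checked on its own.

```python
def ena_paths_to_https(ftp_urls: str) -> list[str]:
    """Turn ENA's semicolon-separated fastq_ftp value into HTTPS URLs.

    Hypothetical helper: mirrors the comprehension on line 343 with the
    scheme switched from http:// to https://.
    """
    return [
        f"https://{path.strip()}"
        for path in ftp_urls.split(";")
        if path.strip()  # skip empty segments, e.g. from a trailing ';'
    ]


if __name__ == "__main__":
    sample = "ftp.sra.ebi.ac.uk/vol1/fastq/SRR635/000/SRR6357070/SRR6357070_1.fastq.gz"
    for url in ena_paths_to_https(sample):
        print(url)
```

Stripping whitespace is an addition beyond the original line; it is harmless if the API never returns padded values.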

Issue 2: No Domain Validation on Download URLs (Low)

File: bio-research/skills/nextflow-development/scripts/utils/ncbi_utils.py (lines 338-344)

The fastq_ftp field from the ENA API response is used to construct download URLs without validating that they point to known ENA/NCBI domains:

# Lines 338-344
ftp_urls = fields[ftp_idx]
if ftp_urls:
    urls = [f"http://{url}" for url in ftp_urls.split(';') if url]
    fastq_urls[srr] = urls

These URLs are then passed to download_file() which streams the response body to disk via requests.get(url, stream=True).

Impact: If the ENA API were ever compromised or its response tampered with, the code would fetch from arbitrary URLs and write content to disk. This is a defense-in-depth concern — the ENA query itself is over HTTPS (line 314), so MITM is not trivial.

Fix: Validate that download URLs match expected ENA domains (e.g., *.ebi.ac.uk, ftp.sra.ebi.ac.uk) before fetching.
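
One way the allowlist check could look, as a sketch rather than a drop-in patch. The names ALLOWED_HOST_SUFFIXES and is_allowed_download_url are hypothetical, and the suffix list is an assumption based on the domains named above; the maintainers would need to decide which hosts are actually expected.

```python
from urllib.parse import urlsplit

# Assumed allowlist; adjust to the hosts the tool is expected to contact.
ALLOWED_HOST_SUFFIXES = (".ebi.ac.uk",)


def is_allowed_download_url(url: str) -> bool:
    """Return True only for HTTPS URLs whose host is on the allowlist."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs (e.g. a broken IPv6 literal) are rejected outright.
        return False
    if parts.scheme != "https":
        return False
    # Compare the parsed hostname, not the raw string, so that paths or
    # userinfo containing an allowed domain do not pass the check.
    host = (parts.hostname or "").lower().rstrip(".")
    return any(
        host == suffix.lstrip(".") or host.endswith(suffix)
        for suffix in ALLOWED_HOST_SUFFIXES
    )


if __name__ == "__main__":
    print(is_allowed_download_url("https://ftp.sra.ebi.ac.uk/vol1/fastq/x.fastq.gz"))
    print(is_allowed_download_url("https://ftp.sra.ebi.ac.uk@evil.example/x.fastq.gz"))
```

This only checks the initial URL. requests follows redirects by default, so a stricter version would also need to validate the final host or disable redirects.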

Issue 3: Missing URL Encoding on API Parameters (Informational)

File: bio-research/skills/nextflow-development/scripts/utils/ncbi_utils.py (lines 99, 156, 212, 314)

User-supplied geo_id is interpolated into API URLs without urllib.parse.quote():

search_url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term={geo_id}[Accession]&retmode=json"

Since this is a CLI tool where the user provides their own arguments, this is not exploitable in practice — but URL encoding is good hygiene.
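
For illustration, a sketch of building the esearch URL with encoded parameters. It uses urllib.parse.urlencode rather than calling quote() on geo_id alone, which applies the same percent-encoding to every parameter. The function name build_geo_search_url and the accession GSE12345 are hypothetical.

```python
from urllib.parse import urlencode

ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_search_url(geo_id: str) -> str:
    """Build the esearch URL with every query parameter percent-encoded."""
    params = {
        "db": "gds",
        "term": f"{geo_id}[Accession]",
        "retmode": "json",
    }
    # urlencode escapes '&', '=', '[' and ']' so the value cannot
    # introduce or override query parameters.
    return f"{ESEARCH_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_geo_search_url("GSE12345"))  # hypothetical accession
```

If the scripts already call requests.get(), passing the same dictionary through its params argument achieves the same encoding without building the string by hand.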

What's NOT a vulnerability (correcting our original report)

  • Output path (--output): Our original report claimed this was "arbitrary file write." It's not — this is a CLI tool where the user supplies their own arguments. Normal CLI behavior, not a security issue.
  • Compound attack scenario: Our original report chained HTTPS MITM + CLI argument control. This was unrealistic — each link requires conditions that make the chain implausible.

Suggested Fixes

  1. Line 343: Change f"http://{url}" to f"https://{url}" (simplest, highest impact)
  2. Lines 338-344: Add domain allowlist check before downloading
  3. Lines 99, 156, 212, 314: Use urllib.parse.quote() for geo_id/accession in URLs

Secure Patterns Already in Use (Credit)

  • ✅ ENA API query is over HTTPS (line 314)
  • ✅ yaml.safe_load() used correctly
  • ✅ subprocess.run() uses list format, not shell=True
  • ✅ No hardcoded secrets
  • ✅ NCBI rate limiting properly enforced

🤖 Generated with Claude Code
