
Improve spam scoring accuracy with real-world email corpus testing #5

@Ouranos27

Context

The spam analyzer models 45+ signals based on SpamAssassin rules, CAN-SPAM requirements, and GDPR patterns, but it hasn't been validated against a large corpus of real emails, so the scoring weights are uncalibrated.

What needs to happen

1. **Build a test corpus** — collect ~100 emails across categories:
   - Legitimate transactional (receipts, shipping notices, password resets)
   - Legitimate marketing (newsletters, promotions)
   - Known spam/phishing examples (available from public datasets)
2. **Run the spam analyzer** against each email and compare its score to the expected classification.
3. **Tune weights** — adjust signal weights so that:
   - Legitimate emails score 80+
   - Spam emails score below 40
   - Edge cases (aggressive marketing) land in the 40–70 range
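The target bands above can be sketched as a small scoring check. Names like `bandFor` and `evaluate` are hypothetical helpers for this issue, not the analyzer's actual API — the real analyzer in `src/analyzers/spam/` may expose something different:

```typescript
// Expected classification bands from this issue's acceptance criteria.
// Higher score = more likely legitimate.
type Band = "legitimate" | "edge" | "spam";

function bandFor(score: number): Band {
  if (score >= 80) return "legitimate"; // legit transactional/marketing target
  if (score < 40) return "spam";        // known spam/phishing target
  return "edge";                        // aggressive marketing: 40-70 range
}

// A corpus entry pairs an email fixture with the band we expect it to land in.
interface CorpusEntry {
  file: string;
  expected: Band;
}

// Run a scoring function over the corpus and collect misclassified entries,
// which are the candidates for weight tuning.
function evaluate(
  entries: CorpusEntry[],
  score: (file: string) => number,
): { total: number; misses: CorpusEntry[] } {
  const misses = entries.filter((e) => bandFor(score(e.file)) !== e.expected);
  return { total: entries.length, misses };
}
```

The idea is that after each weight adjustment, `evaluate` is re-run over the full corpus and the `misses` list shrinks toward zero without regressing previously correct categories.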

How to contribute

This is a great contribution for someone interested in email deliverability. You don't need to write much code — the work is mostly curating test data and running the existing analyzer.

```shell
bun install
bun test -- --grep "spam"   # run the existing spam tests
```

The spam analyzer lives in `src/analyzers/spam/`.
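One possible on-disk layout for the corpus (an assumption for discussion — the repo does not currently prescribe one):

```
test/corpus/
  ham-transactional/   # receipts, shipping notices, password resets
  ham-marketing/       # newsletters, promotions
  spam/                # known spam/phishing from public datasets
```

Grouping by category makes it easy to derive each fixture's expected band from its directory name.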
