profanite

Kryptonite for profanities. A lightweight, obfuscation-resistant profanity filter designed to drop into any language or framework.

Do not edit README.md directly. It is regenerated from README.template.md + the canonical examples. Run python3 scripts/sync-readme.py after changing the template or examples. CI enforces this via --check.

Status

Version: 0.1.10
Bundled languages: English (en), Spanish (es), Hindi (romanized) (hi), French (fr), German (de)
Targets: Rust (native) · Node.js (napi-rs binding) · Python (maturin binding)
MSRV: Rust 1.77

What you get

- contains_profanity(text) → bool / censor(text) → string / find(text) → spans
- Unicode normalization pipeline: bidi-strip, NFKC, casefold, homoglyph fold, conservative leet substitution, repeated-char collapse, optional aggressive separator stripping
- Tiered wordlist: short ambiguous stems (e.g. ass, hell) require word boundaries; unambiguous compounds (e.g. motherfucker, bullshit) match anywhere, so bypasses like Hemoglomotherfuckerbin still fire
- Allowlist escape hatch for the Scunthorpe problem
- Bundled dictionaries from the CC0 LDNOOBW list, with curated English overrides layered on top
- Optional semantic scoring via a small BERT-based toxicity model (Xenova/toxic-bert, int8 ONNX, ~30 MB). Two pluggable hooks:
  - Suppression (SemanticScorer): re-checks keyword hits against the model to kill false positives
  - Recall recovery (SemanticDetector): catches profanity the keyword matcher missed (paraphrases, typos, novel slang)
- Continuous benchmark harness with release gates (see BENCHMARK.md)
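
To make the normalization stages listed above concrete, here is a minimal standalone Python sketch of such a pipeline using only the standard library. It is not profanite's implementation: the `normalize` function, the tiny homoglyph and leet tables, and the separator set are all hypothetical and chosen only for illustration.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny tables; a real filter would need far larger ones.
INVISIBLE = dict.fromkeys([
    0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF,                   # zero-width characters
    0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E,   # bidi marks and overrides
    0x2066, 0x2067, 0x2068, 0x2069,                           # bidi isolates
])
# Cyrillic look-alikes mapped to Latin letters.
HOMOGLYPHS = str.maketrans({"\u0441": "c", "\u0430": "a", "\u0435": "e", "\u043e": "o"})
# A few common leet substitutions (assumed set, not the library's).
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "@": "a", "$": "s"})


def normalize(text: str, aggressive: bool = False) -> str:
    text = text.translate(INVISIBLE)            # 1. strip bidi controls and zero-width chars
    text = unicodedata.normalize("NFKC", text)  # 2. fold fullwidth and compatibility forms
    text = text.casefold()                      # 3. case-insensitive comparison form
    text = text.translate(HOMOGLYPHS)           # 4. fold look-alike letters to Latin
    text = text.translate(LEET)                 # 5. undo leet substitutions
    text = re.sub(r"(.)\1+", r"\1", text)       # 6. collapse runs of a repeated character
    if aggressive:
        text = re.sub(r"[\s._\-*]+", "", text)  # 7. optionally drop separators
    return text


print(normalize("\uff26\uff35\uff23\uff2b"))    # fullwidth input -> fuck
print(normalize("fuuuuuuck"))                   # -> fuck
print(normalize("f.u.c.k", aggressive=True))    # -> fuck
```

Note that step 6 as written also turns "hell" into "hel", so a matcher built on this sketch would have to collapse its wordlist entries the same way. Step 5 likewise rewrites digits inside ordinary numbers, which is presumably why the real pipeline describes its leet handling as conservative.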

Install

Rust

Add to Cargo.toml:

[dependencies]
profanite-core = "0.1.10"

Feature flags select which bundled language lists compile in. Default is lang-en. Turn on others explicitly, or enable all-langs:

profanite-core = { version = "0.1.10", features = ["all-langs"] }

Node.js

npm install @beatsphere/profanite

Platform-specific native binaries ship via optionalDependencies; npm picks the right one for your OS/arch automatically (Linux x64/arm64 gnu + musl, macOS x64/arm64, Windows x64).

Python

pip install profanite

Prebuilt wheels are available for Linux (manylinux + musllinux, x86_64 + aarch64), macOS (x86_64 + arm64), and Windows x64. They support Python 3.8+ through the stable ABI (abi3).

Usage — Rust

//! Quickstart example — this file is the canonical Rust usage snippet.
//!
//! The README pulls its Rust code block directly from here via
//! `scripts/sync-readme.py`, so if you change this example the README
//! regenerates automatically. Conversely, if this example stops
//! compiling, CI fails and the README can't drift out of sync.

use profanite_core::{CensorStyle, Lang, Profanite};

fn main() {
    // Build a filter. One-time cost; reuse the instance for many inputs.
    let filter = Profanite::builder()
        .language(Lang::En)
        .censor_style(CensorStyle::LengthPreserving)
        .build()
        .expect("builds with defaults");

    // Detect.
    assert!(filter.contains_profanity("what the fuck"));
    assert!(!filter.contains_profanity("have a nice day"));

    // Censor. Default style masks each character with '*'.
    assert_eq!(filter.censor("what the fuck"), "what the ****");

    // Locate. Each match returns original + normalized spans plus metadata.
    let hits = filter.find("oh fuck that");
    assert_eq!(hits.len(), 1);
    assert_eq!(hits[0].original_span, (3, 7));

    // Obfuscation-resistant matching handles leet, homoglyphs, repeats,
    // zero-width chars, fullwidth, and bidi overrides.
    assert!(filter.contains_profanity("what the fuсk")); // Cyrillic 'с'
    assert!(filter.contains_profanity("fuuuuuuck"));
    assert!(filter.contains_profanity("ＦＵＣＫ"));

    println!("quickstart ok");
}

Run it:

cargo run -p profanite-core --example quickstart

Usage — Node.js

/**
 * Quickstart example — this file is the canonical Node usage snippet.
 *
 * The README pulls its JS code block directly from here via
 * `scripts/sync-readme.py`. If you change this example, the README
 * regenerates automatically; if this example breaks, CI fails.
 */

const { Profanite } = require('@beatsphere/profanite');

// Build a filter once, reuse for many inputs.
const filter = new Profanite({
  languages: ['en'],
  censorStyle: 'lengthPreserving',
});

// Detect.
console.assert(filter.containsProfanity('what the fuck') === true);
console.assert(filter.containsProfanity('have a nice day') === false);

// Censor. Default style masks each character with '*'.
console.assert(filter.censor('what the fuck') === 'what the ****');

// Locate. Each match carries spans + category + severity.
const hits = filter.find('oh fuck that');
console.assert(hits.length === 1);
console.assert(hits[0].start === 3 && hits[0].end === 7);

// Obfuscation-resistant matching covers leet, homoglyphs, repeats,
// zero-width chars, fullwidth, and bidi overrides.
console.assert(filter.containsProfanity('what the fuсk')); // Cyrillic 'с'
console.assert(filter.containsProfanity('fuuuuuuck'));
console.assert(filter.containsProfanity('ＦＵＣＫ'));

console.log('quickstart ok');

Types ship in index.d.ts and cover every option, category, and return field.

Usage — Python

"""Quickstart example — canonical Python usage snippet.

The README pulls this file's content verbatim via
`scripts/sync-readme.py`. If you change this example, the README
regenerates automatically; if this example breaks, CI fails.
"""

from profanite import Profanite

# Build once, reuse for many inputs.
p = Profanite({
    "languages": ["en"],
    "censor_style": "length_preserving",
})

# Detect.
assert p.contains_profanity("what the fuck") is True
assert p.contains_profanity("have a nice day") is False

# Censor. Default style masks each character with '*'.
assert p.censor("what the fuck") == "what the ****"

# Locate. Each match carries spans + category + severity.
hits = p.find("oh fuck that")
assert len(hits) == 1
assert hits[0].start == 3 and hits[0].end == 7

# Obfuscation-resistant matching covers leet, homoglyphs, repeats,
# zero-width chars, fullwidth, and bidi overrides.
assert p.contains_profanity("what the fuсk")  # Cyrillic 'с'
assert p.contains_profanity("fuuuuuuck")
assert p.contains_profanity("ＦＵＣＫ")

print("quickstart ok")
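
The span returned by find and the length-preserving censor style fit together in a simple way, which the following standalone sketch illustrates without using the library. The `mask_spans` helper is hypothetical, and it assumes spans are half-open (start, end) pairs indexed by Python string position; the real bindings may count offsets differently.

```python
from typing import Iterable, Tuple


def mask_spans(text: str, spans: Iterable[Tuple[int, int]], mask_char: str = "*") -> str:
    """Replace every character inside each half-open span with mask_char."""
    chars = list(text)
    for start, end in spans:
        for i in range(start, end):
            chars[i] = mask_char
    return "".join(chars)


# The span (3, 7) is the one the quickstart asserts for "oh fuck that".
print(mask_spans("oh fuck that", [(3, 7)]))  # -> oh **** that
```

Because one mask character replaces each original character, the output keeps the input's length.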

Configuration reference

Option (Rust builder / JS option)	Values	Default
`language()` / `languages`	`En`, `Es`, `Hi`, `Fr`, `De`	`[En]`
`normalization()` / `normalization`	`None`, `Basic`, `Aggressive`	`Basic`
`match_mode()` / `matchMode`	`WordBoundary`, `Substring`	`WordBoundary`
`censor_style()` / `censorStyle`	`LengthPreserving`, `FirstLast`, `FullMask`, `Grawlix`	`LengthPreserving`
`mask_char()` / `maskChar`	single char	`*`
`add_words()` / `addWords`	extra entries with category + severity + strict	—
`remove_words()` / `removeWords`	drop from bundled list (case-insensitive)	—
`allowlist()` / `allowlist`	substrings where matches are suppressed	—
`without_bundled()` / `withoutBundled`	start empty; caller supplies the whole list	`false`

Severity is a 1..=3 band (1 = mild, 3 = most severe). strict: true tells the matcher to ignore word boundaries for that entry — the right choice for long unambiguous compounds.
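
The interplay of strict entries, word boundaries, and the allowlist can be sketched in a few lines of standalone Python. This is an illustration under assumptions, not the library's matcher: the `Entry` class, the three-word list, and the `find_hits` function are hypothetical, and the input is assumed to be already normalized to lowercase.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    word: str
    severity: int          # 1 = mild, 3 = most severe
    strict: bool = False   # True: match anywhere, ignoring word boundaries


# Hypothetical mini wordlist for illustration only.
WORDLIST = [
    Entry("ass", severity=1),
    Entry("hell", severity=1),
    Entry("bullshit", severity=2, strict=True),
]


def find_hits(text: str, allowlist: tuple[str, ...] = ()) -> list[tuple[int, int, str]]:
    """Return sorted (start, end, word) hits in already-normalized text."""
    # Character ranges covered by allowlisted substrings; hits inside them are dropped.
    allowed = [m.span() for a in allowlist for m in re.finditer(re.escape(a), text)]
    hits = []
    for entry in WORDLIST:
        pattern = re.escape(entry.word)
        if not entry.strict:
            pattern = rf"\b{pattern}\b"   # non-strict entries need word boundaries
        for m in re.finditer(pattern, text):
            if not any(s <= m.start() and m.end() <= e for s, e in allowed):
                hits.append((m.start(), m.end(), entry.word))
    return sorted(hits)


print(find_hits("a classic assessment"))                            # -> []
print(find_hits("hemoglobullshitbin"))                              # -> [(7, 15, 'bullshit')]
print(find_hits("hell's kitchen", allowlist=("hell's kitchen",)))   # -> []
```

The first call shows why short stems need boundaries: "classic" and "assessment" both contain "ass" but produce no hit. The second shows a strict compound firing inside a longer token.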

Semantic scoring (optional)

The keyword matcher is fast and precise, but it can only catch text that matches a wordlist entry. For profanity that is paraphrased, misspelled beyond what normalization can recover, or expressed in novel slang, profanite ships an optional BERT-based toxicity model that runs alongside the keyword pipeline.

How it works

Input text
    │
    ▼
┌─────────────────┐
│ Keyword matcher  │──▶ hits found? ──yes──▶ SemanticScorer (suppression)
│ (fast, precise)  │                              │
└─────────────────┘                         score ≥ threshold? → keep hit
    │                                       score < threshold? → drop hit
    no hits
    │
    ▼
┌─────────────────────┐
│ SemanticDetector     │──▶ score ≥ threshold? → emit synthetic match
│ (recall recovery)    │    score < threshold? → clean input, nothing flagged
└─────────────────────┘
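
The decision flow in the diagram can be written out as a small standalone function. This Python sketch is only a model of the flow, not the library's code: `classify`, the `Scorer` type, and the "<semantic>" placeholder for a synthetic match are hypothetical, and it assumes the model scores the whole input text and returns a value in [0, 1]. The parameter names mirror the min_confidence and detector_threshold builder options.

```python
from typing import Callable, List

Scorer = Callable[[str], float]  # hypothetical: toxicity score in [0, 1]


def classify(
    text: str,
    keyword_hits: List[str],
    scorer: Scorer,
    min_confidence: float = 0.05,
    detector_threshold: float = 0.5,
) -> List[str]:
    score = scorer(text)
    if keyword_hits:
        # Suppression: keep keyword hits unless the model scores the text below the floor.
        return keyword_hits if score >= min_confidence else []
    # Recall recovery: no keyword hit, so flag only when the model is confident.
    return ["<semantic>"] if score >= detector_threshold else []


# Stand-in scorers returning fixed values, for demonstration.
print(classify("some text", ["hit"], lambda t: 0.01))  # -> []
print(classify("some text", [], lambda t: 0.9))        # -> ['<semantic>']
```

The two thresholds are deliberately asymmetric: the low suppression floor drops a keyword hit only when the model is very sure it is a false positive, while the higher detector threshold keeps the model from flagging clean text on its own.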

Quick start (Rust)

[dependencies]
profanite-core = "0.1.10"
profanite-semantic = { version = "0.1.10", features = ["onnx"] }

use std::sync::Arc;
use profanite_core::{Lang, Profanite};
use profanite_semantic::OnnxToxicScorer;

fn main() {
    // Load once (~30 MB download on first run, cached after).
    let scorer = Arc::new(OnnxToxicScorer::from_pretrained().unwrap());

    let filter = Profanite::builder()
        .language(Lang::En)
        // Suppression: only drop keyword hits the model is very sure are FPs.
        .scorer(scorer.clone())
        .min_confidence(0.05)
        // Recall recovery: catch things the keyword matcher missed.
        .detector(scorer)
        .detector_threshold(0.5)
        .build()
        .unwrap();

    filter.contains_profanity("what the fuck");   // true  (keyword hit)
    filter.contains_profanity("go drink bleach"); // true  (detector recovery)
    filter.contains_profanity("have a nice day"); // false
}

The model is Xenova/toxic-bert — a pre-quantized int8 ONNX export of unitary/toxic-bert (BERT-base, English). It runs in ~5 ms per inference on CPU via ONNX Runtime. The onnx feature is fully optional; users who don't enable it get zero extra dependencies.

What the benchmark says

This snapshot is generated by cargo run -p profanite-bench -- snapshot; the README resync then splices it in. Reproduce with cargo run --release -p profanite-bench -- fast (or full to include Jigsaw).

Suite	Mode	n	recall	precision	fp_rate	f1
synthetic	basic	137	0.986	1.000	0.000	0.993
hatecheck	basic	3146	0.118	1.000	0.000	0.211

With semantic scorer (Xenova/toxic-bert int8, suppression=0.05, detector=0.5):

Suite	Mode	n	recall	Δrecall	precision	fp_rate	Δfp_rate	f1
synthetic	basic+semantic	137	0.931	-0.056	0.944	0.062	+0.062	0.937
hatecheck	basic+semantic	3146	0.172	+0.054	1.000	0.000	+0.000	0.294

See BENCHMARK.md for per-category tables, known ceilings (edit-distance matching, slur coverage), and the baseline-diff workflow. The design philosophy (PHILOSOPHY.md) spells out what profanite is and is not.
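
For readers checking the tables, the columns can be related with the standard confusion-matrix formulas. The Python sketch below assumes those standard definitions (BENCHMARK.md is the authority on what the harness actually computes), and the confusion counts in the first call are made up for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard metrics from confusion counts (assumed definitions)."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "fp_rate": fp_rate,
        "f1": f1_score(precision, recall),
    }


# Hypothetical counts, not taken from the benchmark.
print(metrics(tp=9, fp=1, fn=1, tn=9))

# The f1 column follows from the precision and recall columns in the tables:
print(round(f1_score(1.000, 0.986), 3))  # synthetic, basic -> 0.993
print(round(f1_score(1.000, 0.118), 3))  # hatecheck, basic -> 0.211
```

The hatecheck row shows why f1 stays low despite perfect precision: the harmonic mean is dominated by the smaller of the two values, here the recall.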

License

GPL-3.0-or-later. The bundled wordlists are derived from LDNOOBW (CC0), and the HateCheck benchmark is CC-BY-4.0; each is credited in the directory that contains it.

