Kryptonite for profanities. A lightweight, obfuscation-resistant profanity filter designed to drop into any language or framework.
Do not edit `README.md` directly. It is regenerated from `README.template.md` + the canonical examples. Run `python3 scripts/sync-readme.py` after changing the template or examples. CI enforces this via `--check`.
- Version: 0.1.10
- Bundled languages: English (`en`), Spanish (`es`), Hindi (romanized) (`hi`), French (`fr`), German (`de`)
- Targets: Rust (native) · Node.js (napi-rs binding) · Python (maturin binding)
- MSRV: Rust 1.77

- `contains_profanity(text) → bool` / `censor(text) → string` / `find(text) → spans`
- Unicode normalization pipeline: bidi-strip, NFKC, casefold, homoglyph fold, conservative leet substitution, repeated-char collapse, optional aggressive separator stripping
- Tiered wordlist: short ambiguous stems (e.g. `ass`, `hell`) require word boundaries; unambiguous compounds (e.g. `motherfucker`, `bullshit`) match anywhere so bypasses like `Hemoglomotherfuckerbin` still fire
- Allowlist escape hatch for the Scunthorpe problem
- Bundled dictionaries from the CC0 LDNOOBW list, with curated English overrides layered on top
- Optional semantic scoring via a small BERT-based toxicity model (Xenova/toxic-bert, int8 ONNX, ~30 MB). Two pluggable hooks:
  - Suppression (`SemanticScorer`): re-checks keyword hits against the model to kill false positives
  - Recall recovery (`SemanticDetector`): catches profanity the keyword matcher missed (paraphrases, typos, novel slang)
- Continuous benchmark harness with release gates (see `BENCHMARK.md`)

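To make the normalization pipeline in the feature list concrete, here is a minimal Python sketch of the same stage order (bidi/zero-width strip, NFKC, casefold, homoglyph fold, leet substitution, repeated-char collapse). It is not profanite's implementation: the `INVISIBLES`, `HOMOGLYPHS`, and `LEET` tables and the "only next to a letter" leet rule are assumptions chosen for illustration, and the optional separator-stripping stage is left out.

```python
import re
import unicodedata

# Zero-width and bidi control characters to drop (illustrative subset).
INVISIBLES = set(
    "\u200b\u200c\u200d\u2060\ufeff"   # zero-width characters
    "\u200e\u200f"                     # LRM / RLM marks
    "\u202a\u202b\u202c\u202d\u202e"   # bidi embeddings and overrides
    "\u2066\u2067\u2068\u2069"         # bidi isolates
)

# Hypothetical fold tables, far smaller than a real filter would need.
HOMOGLYPHS = {"\u0430": "a", "\u0441": "c", "\u0435": "e",
              "\u043e": "o", "\u0440": "p", "\u0445": "x"}  # Cyrillic -> Latin
LEET = {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}


def fold_leet(text: str) -> str:
    """Assumed 'conservative' rule: substitute only next to a letter."""
    out = []
    for i, ch in enumerate(text):
        near_letter = (i > 0 and text[i - 1].isalpha()) or (
            i + 1 < len(text) and text[i + 1].isalpha())
        out.append(LEET[ch] if ch in LEET and near_letter else ch)
    return "".join(out)


def normalize(text: str) -> str:
    text = "".join(ch for ch in text if ch not in INVISIBLES)  # bidi-strip
    text = unicodedata.normalize("NFKC", text)                 # fullwidth etc.
    text = text.casefold()
    text = "".join(HOMOGLYPHS.get(ch, ch) for ch in text)
    text = fold_leet(text)
    # Collapse runs of the same letter to one; wordlist keys would need
    # the same collapse applied so that entries like "hell" still line up.
    return re.sub(r"([^\W\d_])\1+", r"\1", text)


if __name__ == "__main__":
    for sample in ["FUCK", "fuuuuuuck", "fu\u0441k", "sh1t"]:
        print(repr(sample), "->", normalize(sample))
```

Matching then happens on the normalized string, which is why `find` has to report both original and normalized spans.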
Add to Cargo.toml:
[dependencies]
profanite-core = "0.1.10"

Feature flags select which bundled language lists compile in. The default is `lang-en`. Turn on others explicitly, or enable `all-langs`:

profanite-core = { version = "0.1.10", features = ["all-langs"] }

npm install @beatsphere/profanite

Platform-specific native binaries ship via `optionalDependencies`; npm picks the right one for your OS/arch automatically (Linux x64/arm64 gnu + musl, macOS x64/arm64, Windows x64).

pip install profanite

Prebuilt wheels are available for Linux (manylinux + musllinux, x86_64 + aarch64), macOS (x86_64 + arm64), and Windows x64. Python 3.8+ is supported via the stable abi3 ABI.

//! Quickstart example — this file is the canonical Rust usage snippet.
//!
//! The README pulls its Rust code block directly from here via
//! `scripts/sync-readme.py`, so if you change this example the README
//! regenerates automatically. Conversely, if this example stops
//! compiling, CI fails and the README can't drift out of sync.
use profanite_core::{CensorStyle, Lang, Profanite};

fn main() {
    // Build a filter. One-time cost; reuse the instance for many inputs.
    let filter = Profanite::builder()
        .language(Lang::En)
        .censor_style(CensorStyle::LengthPreserving)
        .build()
        .expect("builds with defaults");

    // Detect.
    assert!(filter.contains_profanity("what the fuck"));
    assert!(!filter.contains_profanity("have a nice day"));

    // Censor. Default style masks each character with '*'.
    assert_eq!(filter.censor("what the fuck"), "what the ****");

    // Locate. Each match returns original + normalized spans plus metadata.
    let hits = filter.find("oh fuck that");
    assert_eq!(hits.len(), 1);
    assert_eq!(hits[0].original_span, (3, 7));

    // Obfuscation-resistant matching handles leet, homoglyphs, repeats,
    // zero-width chars, fullwidth, and bidi overrides.
    assert!(filter.contains_profanity("what the fuсk")); // Cyrillic 'с'
    assert!(filter.contains_profanity("fuuuuuuck"));
    assert!(filter.contains_profanity("FUCK"));

    println!("quickstart ok");
}

Run it:

cargo run -p profanite-core --example quickstart

/**
* Quickstart example — this file is the canonical Node usage snippet.
*
* The README pulls its JS code block directly from here via
* `scripts/sync-readme.py`. If you change this example, the README
* regenerates automatically; if this example breaks, CI fails.
*/
const { Profanite } = require('@beatsphere/profanite');
// Build a filter once, reuse for many inputs.
const filter = new Profanite({
languages: ['en'],
censorStyle: 'lengthPreserving',
});
// Detect.
console.assert(filter.containsProfanity('what the fuck') === true);
console.assert(filter.containsProfanity('have a nice day') === false);
// Censor. Default style masks each character with '*'.
console.assert(filter.censor('what the fuck') === 'what the ****');
// Locate. Each match carries spans + category + severity.
const hits = filter.find('oh fuck that');
console.assert(hits.length === 1);
console.assert(hits[0].start === 3 && hits[0].end === 7);
// Obfuscation-resistant matching covers leet, homoglyphs, repeats,
// zero-width chars, fullwidth, and bidi overrides.
console.assert(filter.containsProfanity('what the fuсk')); // Cyrillic 'с'
console.assert(filter.containsProfanity('fuuuuuuck'));
console.assert(filter.containsProfanity('FUCK'));
console.log('quickstart ok');

Types ship in `index.d.ts` and cover every option, category, and return field.

"""Quickstart example — canonical Python usage snippet.
The README pulls this file's content verbatim via
`scripts/sync-readme.py`. If you change this example, the README
regenerates automatically; if this example breaks, CI fails.
"""
from profanite import Profanite
# Build once, reuse for many inputs.
p = Profanite({
"languages": ["en"],
"censor_style": "length_preserving",
})
# Detect.
assert p.contains_profanity("what the fuck") is True
assert p.contains_profanity("have a nice day") is False
# Censor. Default style masks each character with '*'.
assert p.censor("what the fuck") == "what the ****"
# Locate. Each match carries spans + category + severity.
hits = p.find("oh fuck that")
assert len(hits) == 1
assert hits[0].start == 3 and hits[0].end == 7
# Obfuscation-resistant matching covers leet, homoglyphs, repeats,
# zero-width chars, fullwidth, and bidi overrides.
assert p.contains_profanity("what the fuсk") # Cyrillic 'с'
assert p.contains_profanity("fuuuuuuck")
assert p.contains_profanity("FUCK")
print("quickstart ok")

| Option (Rust builder / JS option) | Values | Default |
|---|---|---|
| `language()` / `languages` | `En`, `Es`, `Hi`, `Fr`, `De` | `[En]` |
| `normalization()` / `normalization` | `None`, `Basic`, `Aggressive` | `Basic` |
| `match_mode()` / `matchMode` | `WordBoundary`, `Substring` | `WordBoundary` |
| `censor_style()` / `censorStyle` | `LengthPreserving`, `FirstLast`, `FullMask`, `Grawlix` | `LengthPreserving` |
| `mask_char()` / `maskChar` | single char | `*` |
| `add_words()` / `addWords` | extra entries with category + severity + strict | (empty) |
| `remove_words()` / `removeWords` | drop from bundled list (case-insensitive) | (empty) |
| `allowlist()` / `allowlist` | substrings where matches are suppressed | (empty) |
| `without_bundled()` / `withoutBundled` | start empty; caller supplies the whole list | `false` |

Severity is a `1..=3` band (1 = mild, 3 = most severe). `strict: true` tells the matcher to ignore word boundaries for that entry, which is the right choice for long unambiguous compounds.
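
The interaction between `strict`, word boundaries, and the allowlist can be sketched in a few lines of Python. This is an illustration of the idea only, not the library's matcher: the `Entry` dataclass, the `find` helper, and the sample entries are hypothetical, and spans are reported against the lowercased text rather than the original input.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Entry:
    word: str
    severity: int   # 1 = mild, 3 = most severe
    strict: bool    # True: match anywhere, ignoring word boundaries


def find(text: str, entries: Iterable[Entry],
         allowlist: Iterable[str] = ()) -> List[Tuple[int, int, str, int]]:
    lowered = text.lower()
    # Spans covered by allowlisted substrings; hits inside them are dropped.
    safe = [m.span() for item in allowlist
            for m in re.finditer(re.escape(item.lower()), lowered)]
    hits = []
    for entry in entries:
        core = re.escape(entry.word)
        pattern = core if entry.strict else rf"\b{core}\b"
        for m in re.finditer(pattern, lowered):
            start, end = m.span()
            if any(a <= start and end <= b for a, b in safe):
                continue
            hits.append((start, end, entry.word, entry.severity))
    return sorted(hits)


if __name__ == "__main__":
    tiered = [Entry("ass", 1, strict=False), Entry("bullshit", 2, strict=True)]
    print(find("a classy pass", tiered))     # short stem needs boundaries
    print(find("total xbullshitx", tiered))  # strict compound fires anywhere

    # Marking a short stem strict over-matches; the allowlist rescues it.
    loose = [Entry("ass", 1, strict=True)]
    print(find("classy", loose))
    print(find("classy", loose, allowlist=["class"]))
```

The last two calls show why short stems default to word boundaries: once `ass` is strict it fires inside `classy`, and only an allowlist entry suppresses the hit.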
The keyword matcher is fast and precise, but it can only catch text that matches a wordlist entry. For cases where profanity is paraphrased, misspelled beyond normalization, or uses novel slang, profanite ships an optional BERT-based toxicity model that runs alongside the keyword pipeline.
Input text
    │
    ▼
┌─────────────────┐
│ Keyword matcher │──▶ hits found? ──yes──▶ SemanticScorer (suppression)
│ (fast, precise) │                               │
└─────────────────┘                     score ≥ threshold? → keep hit
    │                                   score < threshold? → drop hit
 no hits
    │
    ▼
┌─────────────────────┐
│ SemanticDetector    │──▶ score ≥ threshold? → emit synthetic match
│ (recall recovery)   │    score < threshold? → clean input, nothing flagged
└─────────────────────┘
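
The control flow in the diagram can be written out as a short Python sketch. Everything here is a stand-in: `keyword_find` is a one-word toy matcher, `stub_model` returns made-up scores instead of running toxic-bert, and the sketch assumes the model scores the whole input rather than individual hits, which the diagram does not specify.

```python
import re
from typing import Callable, List, Optional, Tuple

Match = Tuple[int, int, str]      # (start, end, source)
Scorer = Callable[[str], float]   # toxicity score in [0, 1]


def keyword_find(text: str) -> List[Match]:
    """Toy keyword matcher with a single word-boundary entry."""
    return [(m.start(), m.end(), "keyword")
            for m in re.finditer(r"\bhell\b", text.lower())]


def classify(text: str,
             scorer: Optional[Scorer] = None, min_confidence: float = 0.05,
             detector: Optional[Scorer] = None,
             detector_threshold: float = 0.5) -> List[Match]:
    hits = keyword_find(text)
    if hits:
        # Suppression: drop keyword hits only when the model scores the
        # text below min_confidence, i.e. it is confident they are benign.
        if scorer is not None and scorer(text) < min_confidence:
            return []
        return hits
    # Recall recovery: consulted only when the keyword matcher found nothing.
    if detector is not None and detector(text) >= detector_threshold:
        return [(0, len(text), "semantic")]   # synthetic whole-input match
    return []


# Made-up scores standing in for a real model; not benchmark data.
STUB_SCORES = {
    "what the hell": 0.90,
    "hell of a good game": 0.02,
    "go drink bleach": 0.80,
    "have a nice day": 0.01,
}


def stub_model(text: str) -> float:
    return STUB_SCORES.get(text, 0.0)


if __name__ == "__main__":
    for sample in STUB_SCORES:
        print(sample, "->", classify(sample, scorer=stub_model, detector=stub_model))
```

The two thresholds pull in opposite directions: a low suppression threshold keeps almost every keyword hit, while the higher detector threshold limits how often the model flags text on its own.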
[dependencies]
profanite-core = "0.1.10"
profanite-semantic = { version = "0.1.10", features = ["onnx"] }

use std::sync::Arc;
use profanite_core::Profanite;
use profanite_semantic::OnnxToxicScorer;

fn main() {
    // Load once (~30 MB download on first run, cached after).
    let scorer = Arc::new(OnnxToxicScorer::from_pretrained().unwrap());

    let filter = Profanite::builder()
        .language(profanite_core::Lang::En)
        // Suppression: only drop keyword hits the model is very sure are FPs.
        .scorer(scorer.clone())
        .min_confidence(0.05)
        // Recall recovery: catch things the keyword matcher missed.
        .detector(scorer)
        .detector_threshold(0.5)
        .build()
        .unwrap();

    filter.contains_profanity("what the fuck");   // true (keyword hit)
    filter.contains_profanity("go drink bleach"); // true (detector recovery)
    filter.contains_profanity("have a nice day"); // false
}

The model is Xenova/toxic-bert, a pre-quantized int8 ONNX export of unitary/toxic-bert (BERT-base, English). It runs in ~5 ms per inference on CPU via ONNX Runtime. The `onnx` feature is fully optional; users who don't enable it get zero extra dependencies.

This snapshot is generated by `cargo run -p profanite-bench -- snapshot`; the README resync then splices it in. Reproduce with `cargo run --release -p profanite-bench -- fast` (or `full` to include Jigsaw).

| Suite | Mode | n | recall | precision | fp_rate | f1 |
|---|---|---|---|---|---|---|
| synthetic | basic | 137 | 0.986 | 1.000 | 0.000 | 0.993 |
| hatecheck | basic | 3146 | 0.118 | 1.000 | 0.000 | 0.211 |
With semantic scorer (Xenova/toxic-bert int8, suppression=0.05, detector=0.5):
| Suite | Mode | n | recall | Δrecall | precision | fp_rate | Δfp_rate | f1 |
|---|---|---|---|---|---|---|---|---|
| synthetic | basic+semantic | 137 | 0.931 | -0.056 | 0.944 | 0.062 | +0.062 | 0.937 |
| hatecheck | basic+semantic | 3146 | 0.172 | +0.054 | 1.000 | 0.000 | +0.000 | 0.294 |
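
As a reading aid for the tables above, the `f1` column is the harmonic mean of precision and recall. The small Python check below recomputes it from the rounded figures shown in the tables; the `f1` helper and `ROWS` layout are illustrative and not part of the benchmark harness, which presumably works from unrounded counts.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (suite, mode) -> (recall, precision, published f1), copied from the tables.
ROWS = {
    ("synthetic", "basic"): (0.986, 1.000, 0.993),
    ("hatecheck", "basic"): (0.118, 1.000, 0.211),
    ("synthetic", "basic+semantic"): (0.931, 0.944, 0.937),
    ("hatecheck", "basic+semantic"): (0.172, 1.000, 0.294),
}

if __name__ == "__main__":
    for (suite, mode), (recall, precision, published) in ROWS.items():
        computed = f1(precision, recall)
        # Inputs are already rounded, so allow a small tolerance.
        assert abs(computed - published) < 0.001, (suite, mode)
        print(f"{suite:10s} {mode:15s} f1={computed:.3f}")
```

The hatecheck rows show why precision alone is misleading here: precision is perfect, but recall below 0.2 holds F1 under 0.3.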
See BENCHMARK.md for per-category tables, known ceilings (edit-distance matching, slur coverage), and the baseline-diff workflow. The design philosophy spells out what profanite is and is not.
GPL-3.0-or-later. The bundled wordlists are derived from LDNOOBW (CC0) and the HateCheck benchmark is CC-BY-4.0; both are credited in the tree they sit in.