Add com.writerslogic.text-fingerprint.1 (fingerprint, text) #55
Signed-off-by: David Condrey <david@writerslogic.com>
mrappard
left a comment
I would approve this if the reference page (https://writersproof.com/cpoe/text-fingerprint) had more information.
domguinard
left a comment
Thanks a lot for the entry. We reviewed this during the WM TF meeting and would ask you to add some information to the reference page, as it is currently blank. It does not need a great deal of detail; a marketing page would do, for instance.
Thanks for the review. The reference page is now published at https://docs.writerslogic.com/soft-binding/text-fingerprint with the algorithm details (normalization, the 256-bit SimHash over character 4-grams, the windowed scoped blocks, the match threshold, and limitations), and the entry's informationalUrl points there. Happy to expand it further if useful.
domguinard
left a comment
Thanks a lot for the documentation page. Please approve the ID change, as ID 40 is already in the queue.
With that approval we should be good to go.
domguinard
left a comment
I was able to accept the change myself.
Merged, thanks a lot for your entry @dcondrey.
Add com.writerslogic.text-fingerprint.1 (fingerprint, text)
This PR adds a new entry (identifier 40) for a fingerprint-type soft
binding for text content.
Algorithm
com.writerslogic.text-fingerprint.1 computes a durable 256-bit fingerprint
from the words of a document (rather than embedding hidden markers into it):
1. Normalize: remove formatting characters (U+200B, U+200C, U+200D, U+FEFF,
U+2060, variation selectors U+FE00–U+FE0F); lowercase; collapse whitespace
runs to a single space; strip punctuation; trim.
2. Shingle: take character 4-grams of the normalized string (sliding window
of 4 chars, step 1; a single n-gram of the whole string when it is shorter
than 4 chars). Character grams keep a single-word edit local, so the
fingerprint stays stable on short text where word shingles would move too far.
3. SimHash: hash each gram to 256 bits and, for each bit position, sum +1/−1
across grams; the final bit is 1 when the column sum is > 0. Output is
32 bytes, hex-encoded, and changes little under light edits.
4. Window: the text is also split into overlapping windows of 512 characters
with 50% overlap (step 256), and each window is fingerprinted separately. The
soft binding records the whole-document fingerprint as block 0 (empty scope)
and one block per window (scope: {start, length}), so an extracted excerpt or
truncated copy can be matched against a window block even when its
whole-document fingerprint has drifted.
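The steps above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the per-gram hash (SHA-256 here), the definition of punctuation (Unicode categories starting with "P"), the bit ordering, windowing over the normalized string, and the names normalize, char_ngrams, simhash256, fingerprint, and window_blocks are all assumptions made for the example.

```python
import hashlib
import re
import unicodedata

# Formatting code points removed during normalization (from the list above).
FORMAT_CHARS = {0x200B, 0x200C, 0x200D, 0xFEFF, 0x2060} | set(range(0xFE00, 0xFE10))


def normalize(text: str) -> str:
    """Apply the listed normalization steps in the listed order."""
    text = "".join(ch for ch in text if ord(ch) not in FORMAT_CHARS)
    text = text.lower()
    text = re.sub(r"\s+", " ", text)
    # Assumption: "punctuation" means Unicode general categories P*.
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return text.strip()


def char_ngrams(s: str, n: int = 4) -> list[str]:
    """Sliding window of n chars, step 1; one gram if the string is shorter than n."""
    if len(s) < n:
        return [s]
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def simhash256(grams: list[str]) -> str:
    """256-bit SimHash: per-bit +1/-1 column sums, bit set when the sum is > 0."""
    sums = [0] * 256
    for gram in grams:
        # Assumption: SHA-256 as the per-gram hash; the entry does not name one here.
        h = int.from_bytes(hashlib.sha256(gram.encode("utf-8")).digest(), "big")
        for bit in range(256):
            sums[bit] += 1 if (h >> bit) & 1 else -1
    value = 0
    for bit in range(256):
        if sums[bit] > 0:
            value |= 1 << bit
    return value.to_bytes(32, "big").hex()


def fingerprint(text: str) -> str:
    return simhash256(char_ngrams(normalize(text)))


def window_blocks(text: str, size: int = 512, step: int = 256) -> list[dict]:
    """Block 0 covers the whole document (empty scope); then one block per window."""
    norm = normalize(text)  # assumption: offsets refer to the normalized string
    blocks = [{"scope": {}, "value": simhash256(char_ngrams(norm))}]
    for start in range(0, len(norm), step):
        chunk = norm[start:start + size]
        blocks.append({
            "scope": {"start": start, "length": len(chunk)},
            "value": simhash256(char_ngrams(chunk)),
        })
        if start + size >= len(norm):
            break  # the last window already reaches the end of the text
    return blocks
```

With these definitions, fingerprint("Hello, World!") and fingerprint("hello world") produce the same value, because both inputs normalize to the same string before hashing.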
Why a computed fingerprint (vs. an embedded ZWC watermark)
The fingerprint is derived from the visible words and recorded in the
agent-signed C2PA manifest, so it is non-destructive (nothing is added to
the document), normalization-proof (reformatting, case, whitespace, and
punctuation changes are normalized away before hashing), ZWC-immune (normalization
removes zero-width characters and variation selectors, so injecting them does
not change the fingerprint), and forge / transfer-resistant (the value is
bound by the manifest signature and cannot be lifted onto another document
without re-signing).
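The ZWC-immune property can be checked in isolation. The sketch below covers only the removal of formatting characters, using the code points listed under Algorithm; the helper name strip_format_chars is hypothetical and the rest of the normalization is omitted.

```python
# Code points removed by normalization: zero-width characters, BOM,
# word joiner, and variation selectors U+FE00 to U+FE0F.
FORMAT_CODE_POINTS = [0x200B, 0x200C, 0x200D, 0xFEFF, 0x2060, *range(0xFE00, 0xFE10)]

# str.maketrans with a dict mapping code points to None deletes those characters.
_DELETE_TABLE = str.maketrans({cp: None for cp in FORMAT_CODE_POINTS})


def strip_format_chars(text: str) -> str:
    """Remove the invisible formatting characters before any hashing happens."""
    return text.translate(_DELETE_TABLE)


clean = "proof of effort"
# Same visible text with a zero-width space, a variation selector,
# and a zero-width joiner injected.
tampered = "pro\u200bof of\ufe0f ef\u200dfort"

# Both inputs reach the hashing stage as the same string.
assert strip_format_chars(tampered) == clean
assert strip_format_chars(clean) == clean
```

Because the injected characters are gone before any gram is formed, they cannot influence the fingerprint.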
Honest limit: robust to edits and formatting (including a single-word edit
on a one-sentence snippet, thanks to character 4-grams), not to paraphrase.
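Robustness to edits shows up as a small bit difference between two fingerprints. The sketch below compares two hex-encoded 256-bit values by Hamming distance, which is the usual comparison for SimHash; it is an assumption that this entry matches the same way. The function names are hypothetical, and the threshold is left to the caller because its value is not stated in this PR.

```python
def hamming_distance(a_hex: str, b_hex: str) -> int:
    """Count the differing bits between two hex-encoded fingerprints."""
    a = bytes.fromhex(a_hex)
    b = bytes.fromhex(b_hex)
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return bin(diff).count("1")


def is_match(a_hex: str, b_hex: str, max_distance: int) -> bool:
    """True when the fingerprints differ in at most max_distance bits.

    max_distance is supplied by the caller; this sketch does not assume
    the threshold used by the actual entry.
    """
    return hamming_distance(a_hex, b_hex) <= max_distance
```

A paraphrase changes most character 4-grams, so the distance between the original and the paraphrased text would be expected to be large, in line with the stated limit.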
This algorithm is part of the WritersLogic CPoE proof-of-effort authorship
attestation system, alongside the existing
com.writerslogic.zwc-watermark.1 entry (identifier 29).
Checklist
- JSON validated (python -m json.tool).
- identifier (40) assigned; no collisions.
- informationalUrl provided.