[codex] Add configurable literal bracket patterns (#44)
Pull request overview
Adds an opt-in configuration to preserve bracketed literal source text (e.g., scientific notation like [2Fe-2S]) during supporting-text validation, while keeping the existing default behavior of stripping all [...] segments for backward compatibility.
Changes:
- Introduces `literal_bracket_patterns` on `ReferenceValidationConfig` and wires it into `SupportingTextValidator._split_query()`.
- Updates `_split_query()` to selectively preserve bracketed content when configured patterns match, and normalizes whitespace in returned parts.
- Adds/updates tests plus README + conceptual docs to cover and document the new configuration.
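
The selective-preservation behavior described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the function name `split_query`, the use of `...` as the part separator, and the use of `re.search` (rather than `match` or `fullmatch`) against the bracket contents are all assumptions.

```python
import re
from typing import Iterable, List

# Matches one non-nested [...] segment and captures the inner content.
BRACKET_RE = re.compile(r"\[([^\]]*)\]")


def split_query(text: str, literal_bracket_patterns: Iterable[str] = ()) -> List[str]:
    """Hypothetical stand-in for _split_query(); the details are assumptions."""
    compiled = [re.compile(p) for p in literal_bracket_patterns]

    def keep_or_strip(match: re.Match) -> str:
        content = match.group(1)
        # Keep the whole "[...]" segment if any configured pattern matches its content.
        if any(p.search(content) for p in compiled):
            return match.group(0)
        # Otherwise treat it as an editorial insertion and drop it.
        return " "

    parts = []
    for raw in text.split("..."):  # assumed separator between quoted parts
        cleaned = BRACKET_RE.sub(keep_or_strip, raw)
        cleaned = " ".join(cleaned.split())  # collapse runs of whitespace
        if cleaned:
            parts.append(cleaned)
    return parts


# With no patterns configured, every bracketed segment is stripped (the default).
print(split_query("protein [important] functions"))
# With a digit pattern configured, [2Fe-2S] survives while [sic] is still stripped.
print(split_query("the [2Fe-2S] cluster [sic] binds", [r"\d"]))
```

An empty pattern list reproduces the strip-all default, which is what keeps the change backward compatible.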
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/linkml_reference_validator/models.py` | Adds `literal_bracket_patterns` config field with description + example. |
| `src/linkml_reference_validator/validation/supporting_text_validator.py` | Compiles configured patterns and conditionally preserves matching bracketed content in `_split_query()`. |
| `tests/test_models.py` | Verifies config default/custom values for the new field. |
| `tests/test_validation_config.py` | Verifies YAML config loading for `literal_bracket_patterns`. |
| `tests/test_supporting_text_validator.py` | Adds behavioral tests for default stripping vs configured preservation in split + matching. |
| `README.md` | Documents the new YAML option and intended usage. |
| `docs/concepts/editorial-conventions.md` | Adds conceptual documentation for literal brackets and configuration. |
    ['protein functions', 'in cells']
    >>> validator._split_query("protein [important] functions")
    ['protein functions']
    >>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\\d"])
In the _split_query() docstring example, literal_bracket_patterns=[r"\\d"] compiles a pattern that matches a literal \d sequence, not digits. This example likely intends r"\d" so the shown output is reproducible.
Suggested change:
    - >>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\\d"])
    + >>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\d"])
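
The difference between the two raw-string spellings can be checked directly with the standard `re` module. This is a minimal sketch that is independent of the validator; the sample strings are only illustrative.

```python
import re

# r"\d" is two characters (backslash, d): the regex digit class.
digit = re.compile(r"\d")
# r"\\d" is three characters (backslash, backslash, d): an escaped backslash
# followed by a literal "d", so it matches the text "\d" rather than a digit.
literal = re.compile(r"\\d")

print(bool(digit.search("2Fe-2S")))    # digits are present in the content
print(bool(literal.search("2Fe-2S")))  # there is no backslash in the content
print(bool(literal.search(r"a\d")))    # matches only a literal backslash + "d"
```

One caveat: inside a non-raw docstring, Python collapses `\\d` to `\d` before doctest runs the example, so which spelling is right depends on whether the docstring itself is a raw string. That is not visible in the snippet quoted here.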
    ```

    Patterns are matched against the content inside `[...]`. If any pattern matches,
    that bracketed text is kept verbatim during validation.
The README says preserved bracketed text is kept "verbatim during validation", but the validator normalizes punctuation/whitespace before substring matching (and _split_query() also collapses whitespace). Consider rewording to “preserved (not stripped) before normalization/validation” to avoid implying exact byte-for-byte matching.
Suggested change:
    - that bracketed text is kept verbatim during validation.
    + that bracketed text is preserved (not stripped) before normalization and validation.
Summary
- Add `literal_bracket_patterns` to `ReferenceValidationConfig`
- Preserve bracketed content in `_split_query()` when configured patterns match bracket contents

Why
`SupportingTextValidator._split_query()` stripped every `[...]` segment unconditionally. That works for editorial insertions, but it breaks literal scientific bracket notation like `[2Fe-2S]`, `[30S]`, `[Ca2+]`, and similar corpus-specific patterns.

This change keeps the existing default behavior for backward compatibility and lets projects opt into preserving literal brackets with explicit regex patterns.
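
As a rough illustration of what explicit patterns for the notations above might look like, the sketch below tests a few candidate regexes against bracket contents. The patterns and the helper `is_literal` are hypothetical and not taken from this PR. They are anchored with `^...$` so they behave the same whether the validator applies them with `search`, `match`, or `fullmatch`.

```python
import re

# Hypothetical patterns, applied to the text inside the brackets (not the brackets).
CANDIDATE_PATTERNS = [
    r"^\d+Fe-\d+S$",           # iron-sulfur clusters such as 2Fe-2S, 4Fe-4S
    r"^\d+S$",                 # sedimentation coefficients such as 30S, 50S
    r"^[A-Z][a-z]?\d*[+-]$",   # simple ions such as Ca2+, Na+, Cl-
]


def is_literal(content: str, patterns=CANDIDATE_PATTERNS) -> bool:
    """Return True if any candidate pattern matches the bracket content."""
    return any(re.search(p, content) for p in patterns)


for content in ["2Fe-2S", "30S", "Ca2+", "important", "sic"]:
    print(f"[{content}] ->", "keep" if is_literal(content) else "strip")

# These could then be passed as
# ReferenceValidationConfig(literal_bracket_patterns=CANDIDATE_PATTERNS),
# following the constructor call shown in the docstring example.
```

Narrow, anchored patterns like these limit the chance that an editorial insertion containing a digit is preserved by accident, which a broad pattern such as `\d` would allow.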
Impact
Projects that do nothing keep the current strip-all behavior. Projects with literal bracketed source text can now configure preservation rules in YAML or Python config.
Validation
- `uv run pytest tests/test_supporting_text_validator.py tests/test_validation_config.py tests/test_models.py`
- `just test`