[codex] Add configurable literal bracket patterns #44

Open: cmungall wants to merge 1 commit into main from feat/issue-43

cmungall (Member) commented on Apr 3, 2026

Summary

  • add literal_bracket_patterns to ReferenceValidationConfig
  • preserve bracketed source text in _split_query() when configured patterns match bracket contents
  • add validator/config/model coverage and document the new configuration option

Why

SupportingTextValidator._split_query() stripped every [...] segment unconditionally. That works for editorial insertions, but it breaks literal scientific bracket notation like [2Fe-2S], [30S], [Ca2+], and similar corpus-specific patterns.

This change keeps the existing default behavior for backward compatibility and lets projects opt into preserving literal brackets with explicit regex patterns.
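
As a rough illustration of the behavior described above, here is a minimal, self-contained sketch. It is not the project's `_split_query()` implementation: the function name `split_query`, the use of `re.search` against the bracket contents, and the splitting on `...` are assumptions made for this example.

```python
import re

# Matches one non-nested [...] segment and captures its contents.
BRACKET_RE = re.compile(r"\[([^\[\]]*)\]")


def split_query(text, literal_bracket_patterns=()):
    """Hypothetical stand-in for the validator's query splitting."""
    compiled = [re.compile(p) for p in literal_bracket_patterns]

    def keep_or_strip(match):
        inner = match.group(1)
        # Keep the whole "[...]" segment if any configured pattern matches.
        if any(p.search(inner) for p in compiled):
            return match.group(0)
        # Otherwise treat it as an editorial insertion and drop it.
        return " "

    stripped = BRACKET_RE.sub(keep_or_strip, text)
    # Assumed convention: "..." separates independently matched parts.
    parts = stripped.split("...")
    # Collapse runs of whitespace and discard empty parts.
    normalized = [" ".join(part.split()) for part in parts]
    return [part for part in normalized if part]


default_result = split_query("the [2Fe-2S] cluster")
preserved_result = split_query("the [2Fe-2S] cluster [important]", [r"\d"])
```

With no patterns the sketch strips every bracketed segment, mirroring the default. With a pattern such as `r"\d"` it keeps `[2Fe-2S]` while still dropping the editorial `[important]`.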

Impact

Projects that do nothing keep the current strip-all behavior. Projects with literal bracketed source text can now configure preservation rules in YAML or Python config.
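
One possible way to choose patterns for the notations mentioned above is sketched here. The regular expressions and the helper `is_literal` are hypothetical. This PR does not state whether patterns are applied with search or full-match semantics, so the example anchors each pattern explicitly with `^` and `$`.

```python
import re

# Hypothetical patterns, matched against the text inside the brackets.
LITERAL_BRACKET_PATTERNS = [
    r"^\d+Fe-\d+S$",           # iron-sulfur clusters such as 2Fe-2S
    r"^\d+S$",                 # sedimentation coefficients such as 30S
    r"^[A-Z][a-z]?\d*[+-]$",   # simple ions such as Ca2+
]


def is_literal(inner):
    """Return True if the bracket contents match any configured pattern."""
    return any(re.search(p, inner) for p in LITERAL_BRACKET_PATTERNS)


for candidate in ["2Fe-2S", "30S", "Ca2+", "important", "sic"]:
    print(candidate, is_literal(candidate))
```

Narrow, anchored patterns like these keep editorial insertions such as `[sic]` on the strip path, whereas a broad pattern such as `\d` would also preserve numeric citation markers.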

Validation

  • uv run pytest tests/test_supporting_text_validator.py tests/test_validation_config.py tests/test_models.py
  • just test

cmungall marked this pull request as ready for review on April 4, 2026, 00:44
Copilot AI review requested due to automatic review settings on April 4, 2026, 00:44

Copilot AI (Contributor) left a comment

Pull request overview

Adds an opt-in configuration to preserve bracketed literal source text (e.g., scientific notation like [2Fe-2S]) during supporting-text validation, while keeping the existing default behavior of stripping all [...] segments for backward compatibility.

Changes:

  • Introduces literal_bracket_patterns on ReferenceValidationConfig and wires it into SupportingTextValidator._split_query().
  • Updates _split_query() to selectively preserve bracketed content when configured patterns match, and normalizes whitespace in returned parts.
  • Adds/updates tests plus README + conceptual docs to cover and document the new configuration.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/linkml_reference_validator/models.py | Adds literal_bracket_patterns config field with description + example. |
| src/linkml_reference_validator/validation/supporting_text_validator.py | Compiles configured patterns and conditionally preserves matching bracketed content in _split_query(). |
| tests/test_models.py | Verifies config default/custom values for the new field. |
| tests/test_validation_config.py | Verifies YAML config loading for literal_bracket_patterns. |
| tests/test_supporting_text_validator.py | Adds behavioral tests for default stripping vs configured preservation in split + matching. |
| README.md | Documents the new YAML option and intended usage. |
| docs/concepts/editorial-conventions.md | Adds conceptual documentation for literal brackets and configuration. |

['protein functions', 'in cells']
>>> validator._split_query("protein [important] functions")
['protein functions']
>>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\\d"])

Copilot AI commented on Apr 4, 2026

In the _split_query() docstring example, literal_bracket_patterns=[r"\\d"] compiles a pattern that matches a literal \d sequence, not digits. This example likely intends r"\d" so the shown output is reproducible.

Suggested change
- >>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\\d"])
+ >>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\d"])
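
The difference between the two raw strings can be checked with the standard `re` module alone. This is a small standalone sketch, independent of the validator; the variable names are illustrative.

```python
import re

escaped = re.compile(r"\\d")  # regex: a literal backslash followed by "d"
digit = re.compile(r"\d")     # regex: any single digit

# The escaped form finds nothing in ordinary bracket contents.
escaped_hits_digits = bool(escaped.search("2Fe-2S"))
# The intended form finds the leading "2".
digit_hits_digits = bool(digit.search("2Fe-2S"))
# The escaped form only matches text containing a backslash and "d".
escaped_hits_backslash_d = bool(escaped.search(r"\d"))

print(escaped_hits_digits, digit_hits_digits, escaped_hits_backslash_d)
```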


Patterns are matched against the content inside `[...]`. If any pattern matches,
that bracketed text is kept verbatim during validation.

Copilot AI commented on Apr 4, 2026

The README says preserved bracketed text is kept "verbatim during validation", but the validator normalizes punctuation/whitespace before substring matching (and _split_query() also collapses whitespace). Consider rewording to “preserved (not stripped) before normalization/validation” to avoid implying exact byte-for-byte matching.

Suggested change
- that bracketed text is kept verbatim during validation.
+ that bracketed text is preserved (not stripped) before normalization and validation.
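
To see why "verbatim" overstates the guarantee, consider a sketch in which preserved bracket text still passes through normalization before substring matching. The `normalize` function below is an assumption for illustration (punctuation replaced by spaces, whitespace collapsed); the validator's actual normalization rules are not shown in this PR.

```python
import re


def normalize(text):
    """Hypothetical normalization: drop punctuation, collapse whitespace."""
    without_punct = re.sub(r"[^\w\s]", " ", text)
    return " ".join(without_punct.split())


query = "binds a [2Fe-2S] cluster"
source = "The protein binds a [2Fe-2S]\n cluster."

# Byte-for-byte matching fails because of the line break in the source.
raw_match = query in source
# After normalization the preserved bracket contents still take part in
# the comparison, but the brackets and hyphen themselves are gone.
normalized_query = normalize(query)
normalized_match = normalized_query in normalize(source)

print(raw_match, normalized_query, normalized_match)
```

Under these assumptions the bracket contents influence the match, yet the comparison is made on normalized text rather than on the original characters.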
