Remove regex dependency from ruff_python_formatter
#21827
base: main
Pull request overview
This PR successfully removes the regex crate dependency from ruff_python_formatter by replacing two regex patterns with hand-written state machine parsers for docstring code example detection. The implementation provides explicit, self-documenting parsing logic while reducing binary size and eliminating lazy initialization costs.
Key Changes:
- Removed regex dependency from Cargo.toml
- Replaced reStructuredText directive regex with is_rst_directive_start() function
- Replaced Markdown fence regex with parse_markdown_fence_start() function and strip_python_lang_prefix() state machine
- Added comprehensive unit tests for the Python language identifier state machine
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| crates/ruff_python_formatter/Cargo.toml | Removed regex crate from dependencies |
| crates/ruff_python_formatter/src/string/docstring.rs | Replaced regex patterns with hand-written parsers (is_rst_directive_start, parse_markdown_fence_start, strip_python_lang_prefix, strip_prefix_ignore_ascii_case) and added unit tests for the state machine |
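To make the table entry concrete, here is a minimal sketch of how a regex-free check for the reStructuredText directive could look. It is not the code from this PR: the function names are borrowed from the table, but the signatures and bodies are assumptions, written only to mirror the pattern ^\s*\.\. \s*(?i:code-block|sourcecode)::\s*(?i:python|py|python3|py3)$ quoted in the PR description.

```rust
/// Hypothetical helper: strips `prefix` from the start of `s`,
/// comparing ASCII letters case-insensitively.
fn strip_prefix_ignore_ascii_case<'a>(s: &'a str, prefix: &str) -> Option<&'a str> {
    // `get` returns None if `s` is too short or the cut is not a char boundary.
    let head = s.get(..prefix.len())?;
    head.eq_ignore_ascii_case(prefix)
        .then(|| &s[prefix.len()..])
}

/// Hypothetical sketch: returns true if `line` looks like
/// `.. code-block:: python` (or `sourcecode`, or py/py3/python3).
fn is_rst_directive_start(line: &str) -> bool {
    // Leading whitespace, then the literal ".. " marker.
    let rest = line.trim_start();
    let Some(rest) = rest.strip_prefix(".. ") else {
        return false;
    };
    // Optional extra whitespace, then one of the two directive names.
    let rest = rest.trim_start();
    let Some(rest) = strip_prefix_ignore_ascii_case(rest, "code-block")
        .or_else(|| strip_prefix_ignore_ascii_case(rest, "sourcecode"))
    else {
        return false;
    };
    // The "::" separator, optional whitespace, then the language name,
    // which must run to the end of the line.
    let Some(rest) = rest.strip_prefix("::") else {
        return false;
    };
    let lang = rest.trim_start();
    ["python", "py", "python3", "py3"]
        .iter()
        .any(|l| lang.eq_ignore_ascii_case(l))
}
```

Each step corresponds to one piece of the pattern, which is what makes a hand-written version practical for such a small, fixed grammar.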
|
Do you have any numbers on how much this reduces the wasm size?
|
With this PR's changes:
diff --git a/Cargo.toml b/Cargo.toml
index 985a2c3..438ecd1 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -17,9 +17,9 @@ resolver = "2"
 [workspace.dependencies]
 ruff_fmt_config = { path = "crates/ruff_fmt_config" }
-ruff_formatter = { git = "https://github.com/astral-sh/ruff.git", tag = "0.14.8" }
-ruff_python_ast = { git = "https://github.com/astral-sh/ruff.git", tag = "0.14.8" }
-ruff_python_formatter = { git = "https://github.com/astral-sh/ruff.git", tag = "0.14.8" }
+ruff_formatter = { path = "../ruff/crates/ruff_formatter" }
+ruff_python_ast = { path = "../ruff/crates/ruff_python_ast" }
+ruff_python_formatter = { path = "../ruff/crates/ruff_python_formatter" }
 anyhow = "1.0"
Binary size reduced from 1.8 MB to 989 KB.
❯ ls -lh ./ruff_fmt_bg.wasm
-rw-r--r--@ 1 akari staff 1.8M Dec 7 19:39 ./ruff_fmt_bg.wasm
❯ ls -lh ./ruff_fmt_bg.wasm
-rw-r--r--@ 1 akari staff 989K Dec 7 19:34 ./ruff_fmt_bg.wasm
|
Wow, that's a pretty significant reduction. But I assume it requires removing the aho-corasick dependency from the ast?
|
No, aho-corasick removal is not needed - it's already eliminated automatically. Why
Note: Only verified for the |
Summary
This PR removes the regex crate dependency from ruff_python_formatter by replacing two regex patterns with hand-written state machine parsers.
Background
While working on compiling ruff_python_formatter to WebAssembly, I noticed that the regex crate was being pulled in solely for two pattern matches in docstring code example detection:
- ^\s*\.\. \s*(?i:code-block|sourcecode)::\s*(?i:python|py|python3|py3)$
- (?<ticks>`{3,})(?:\s*(?i:python|py|python3|py3)[^`]*)? (and similar for tildes)
These patterns match structured, predictable syntax with a small, fixed set of keywords (code-block, sourcecode) and language identifiers (py, py3, python, python3). This is a task well-suited for hand-written parsers and doesn't require the full power of a regex engine.
Changes
- Removed regex from [dependencies] in Cargo.toml
- Replaced LazyLock<Regex> patterns with hand-written parsing functions:
  - is_rst_directive_start() - parses reStructuredText code-block directives
  - parse_markdown_fence_start() - parses Markdown fenced code blocks
  - strip_python_lang_prefix() - a compact state machine that matches Python language identifiers
Benefits
1. Reduced Binary Size
The regex crate and its dependencies (regex-automata, regex-syntax) add significant weight to the compiled binary. This is especially impactful for WebAssembly builds where binary size directly affects load times.
2. No Lazy Initialization Cost
With LazyLock<Regex>, the regex is compiled on first use. The hand-written parsers require no initialization.
3. Improved Maintainability
The parsing logic is now explicit and self-documenting. The state machine structure is clearly visible in the code and documentation:
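The snippet referred to here did not survive on this page. As a stand-in, the following is a minimal sketch of what a state machine for the four identifiers might look like. It is an illustration only, not the PR's implementation: the State enum, the return type, and the rule that the identifier must not be followed directly by another ASCII letter or digit are all assumptions.

```rust
/// States of a small matcher for `py`, `py3`, `python`, `python3`.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum State {
    Start,
    P,
    Py,     // accepting: "py"
    Pyt,
    Pyth,
    Pytho,
    Python, // accepting: "python"
    Three,  // accepting: "py3" or "python3"
}

/// Hypothetical sketch: strips a leading Python language identifier
/// (ASCII case-insensitive) and returns the remainder, or `None`
/// if the input does not start with one.
fn strip_python_lang_prefix(s: &str) -> Option<&str> {
    let is_accepting =
        |state: State| matches!(state, State::Py | State::Python | State::Three);

    let mut state = State::Start;
    for (i, b) in s.bytes().enumerate() {
        state = match (state, b.to_ascii_lowercase()) {
            (State::Start, b'p') => State::P,
            (State::P, b'y') => State::Py,
            (State::Py, b't') => State::Pyt,
            (State::Pyt, b'h') => State::Pyth,
            (State::Pyth, b'o') => State::Pytho,
            (State::Pytho, b'n') => State::Python,
            (State::Py | State::Python, b'3') => State::Three,
            // Any other byte ends the identifier. Assumption: reject when
            // the identifier runs straight into another letter or digit.
            // All bytes before `i` were ASCII, so `i` is a char boundary.
            _ => {
                return (is_accepting(state) && !b.is_ascii_alphanumeric())
                    .then(|| &s[i..]);
            }
        };
    }
    // Input ended: succeed only if we stopped in an accepting state.
    is_accepting(state).then(|| &s[s.len()..])
}
```

Under these assumptions, "python3 extra" yields the remainder " extra", while a partial identifier such as "pyth" or a glued suffix such as "pyx" yields None.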
4. Fewer Dependencies
Reducing the dependency footprint simplifies auditing, reduces potential supply chain attack surface, and speeds up builds.
Test Plan
Unit tests for strip_python_lang_prefix() covering:
- py, py3, python, python3
- PY, Python, PYTHON3
- python3 extra
- p, pyt, pyth, pytho
- pyx, pythonx, python3x
- rust, javascript, empty string