@magic-akari
Contributor

Summary

This PR removes the regex crate dependency from ruff_python_formatter by replacing two regex patterns with hand-written state machine parsers.

Background

While working on compiling ruff_python_formatter to WebAssembly, I noticed that the regex crate was being pulled in solely for two pattern matches in docstring code example detection:

  1. reStructuredText directive: ^\s*\.\. \s*(?i:code-block|sourcecode)::\s*(?i:python|py|python3|py3)$
  2. Markdown fenced code block: (?<ticks>`{3,})(?:\s*(?i:python|py|python3|py3)[^`]*)? (and similar for tildes)

These patterns match structured, predictable syntax with a small, fixed set of keywords (code-block, sourcecode) and language identifiers (py, py3, python, python3). This is a task well-suited for hand-written parsers and doesn't require the full power of a regex engine.

Changes

  • Removed regex from [dependencies] in Cargo.toml
  • Replaced LazyLock<Regex> patterns with hand-written parsing functions:
    • is_rst_directive_start() - parses reStructuredText code-block directives
    • parse_markdown_fence_start() - parses Markdown fenced code blocks
    • strip_python_lang_prefix() - a compact state machine that matches Python language identifiers
  • Added unit tests for the new state machine
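As a rough illustration of how the reStructuredText pattern quoted under Background can be matched without a regex engine, here is a minimal sketch. It is not the code from this PR: the function names carry a `_sketch` suffix to mark them as hypothetical, and the step-by-step mapping to the regex is an assumption based only on the pattern shown above.

```rust
/// Hypothetical helper: strips `prefix` from the start of `s`, ignoring ASCII case.
fn strip_prefix_ignore_ascii_case_sketch<'a>(s: &'a str, prefix: &str) -> Option<&'a str> {
    // `get` returns None if `s` is too short or the cut is not on a char boundary.
    let head = s.get(..prefix.len())?;
    head.eq_ignore_ascii_case(prefix).then(|| &s[prefix.len()..])
}

/// Hypothetical sketch of a matcher for
/// `^\s*\.\. \s*(?i:code-block|sourcecode)::\s*(?i:python|py|python3|py3)$`.
fn is_rst_directive_start_sketch(line: &str) -> bool {
    // `^\s*\.\. `: optional leading whitespace, then the literal ".. " marker.
    let Some(rest) = line.trim_start().strip_prefix(".. ") else {
        return false;
    };
    // `\s*(?i:code-block|sourcecode)`: directive name, ASCII case-insensitive.
    let rest = rest.trim_start();
    let Some(rest) = strip_prefix_ignore_ascii_case_sketch(rest, "code-block")
        .or_else(|| strip_prefix_ignore_ascii_case_sketch(rest, "sourcecode"))
    else {
        return false;
    };
    // `::` must follow the directive name immediately.
    let Some(rest) = rest.strip_prefix("::") else {
        return false;
    };
    // `\s*(?i:python|py|python3|py3)$`: the language must be the last thing on the line.
    let lang = rest.trim_start();
    ["python", "py", "python3", "py3"]
        .iter()
        .any(|l| lang.eq_ignore_ascii_case(l))
}
```

Because the pattern ends with `$`, the sketch compares the whole remainder against the four identifiers instead of matching a prefix, so a line such as `.. code-block:: python extra` is rejected.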

Benefits

1. Reduced Binary Size

The regex crate and its dependencies (regex-automata, regex-syntax) add significant weight to the compiled binary. This is especially impactful for WebAssembly builds where binary size directly affects load times.

2. No Lazy Initialization Cost

With LazyLock<Regex>, the regex is compiled on first use. The hand-written parsers require no initialization.

3. Improved Maintainability

The parsing logic is now explicit and self-documenting. The state machine structure is clearly visible in the code and documentation:

Start -> 'p' -> 'y' -> (accept "py")
                    -> '3' -> (accept "py3")
                    -> 't' -> 'h' -> 'o' -> 'n' -> (accept "python")
                                                -> '3' -> (accept "python3")
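The diagram above can be sketched in Rust roughly as follows. This is an illustrative version, not the implementation from this PR: the name `strip_python_lang_prefix_sketch` is hypothetical, and the word-boundary rule (the identifier must not be followed by an ASCII alphanumeric character or an underscore) is an assumption inferred from the test cases listed in the Test Plan.

```rust
/// Hypothetical sketch: strips a leading Python language identifier
/// (`py`, `py3`, `python`, `python3`, ASCII case-insensitive) and returns
/// the remainder. Returns `None` if no identifier matches or if the match
/// is not followed by a word boundary (assumed rule, see lead-in).
fn strip_python_lang_prefix_sketch(s: &str) -> Option<&str> {
    let bytes = s.as_bytes();
    // True if the byte at index `i` equals `c`, ignoring ASCII case.
    let at = |i: usize, c: u8| bytes.get(i).is_some_and(|b| b.eq_ignore_ascii_case(&c));

    // Start -> 'p' -> 'y'
    if !(at(0, b'p') && at(1, b'y')) {
        return None;
    }
    let mut end = 2; // accept "py"

    // 'y' -> 't' -> 'h' -> 'o' -> 'n'
    if at(2, b't') && at(3, b'h') && at(4, b'o') && at(5, b'n') {
        end = 6; // accept "python"
    }

    // Optional trailing '3' after either accepting state.
    if at(end, b'3') {
        end += 1; // accept "py3" or "python3"
    }

    // Assumed word boundary: reject if an identifier character follows.
    match bytes.get(end) {
        Some(b) if b.is_ascii_alphanumeric() || *b == b'_' => None,
        // Every consumed byte is ASCII, so `end` is a valid char boundary.
        _ => Some(&s[end..]),
    }
}
```

A partial keyword such as `pyth` falls back to the `py` accepting state and is then rejected by the boundary check, which is how the sketch handles the "invalid prefixes" cases without extra states.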

4. Fewer Dependencies

Reducing the dependency footprint simplifies auditing, reduces potential supply chain attack surface, and speeds up builds.

Test Plan

  • All existing tests pass
  • Fixture tests for docstring code examples pass unchanged
  • Added unit tests for strip_python_lang_prefix() covering:
    • Valid matches: py, py3, python, python3
    • Case insensitivity: PY, Python, PYTHON3
    • Trailing content: python3 extra
    • Invalid prefixes: p, pyt, pyth, pytho
    • No word boundary: pyx, pythonx, python3x
    • Completely different inputs: rust, javascript, empty string

@astral-sh-bot
astral-sh-bot bot commented Dec 6, 2025

ruff-ecosystem results

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@magic-akari magic-akari marked this pull request as ready for review December 6, 2025 22:09
Copilot AI review requested due to automatic review settings December 6, 2025 22:09

Copilot AI left a comment

Pull request overview

This PR successfully removes the regex crate dependency from ruff_python_formatter by replacing two regex patterns with hand-written state machine parsers for docstring code example detection. The implementation provides explicit, self-documenting parsing logic while reducing binary size and eliminating lazy initialization costs.

Key Changes:

  • Removed regex dependency from Cargo.toml
  • Replaced reStructuredText directive regex with is_rst_directive_start() function
  • Replaced Markdown fence regex with parse_markdown_fence_start() function and strip_python_lang_prefix() state machine
  • Added comprehensive unit tests for the Python language identifier state machine

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
crates/ruff_python_formatter/Cargo.toml | Removed regex crate from dependencies
crates/ruff_python_formatter/src/string/docstring.rs | Replaced regex patterns with hand-written parsers (is_rst_directive_start, parse_markdown_fence_start, strip_python_lang_prefix, strip_prefix_ignore_ascii_case) and added unit tests for the state machine

@MichaReiser
Member

Do you have any numbers on how much this reduces the wasm size?

@magic-akari
Contributor Author

With this PR's changes, using a local checkout of ruff:

diff --git a/Cargo.toml b/Cargo.toml
index 985a2c3..438ecd1 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -17,9 +17,9 @@ resolver = "2"
 
 	[workspace.dependencies]
 	ruff_fmt_config       = { path = "crates/ruff_fmt_config" }
-	ruff_formatter        = { git = "https://github.com/astral-sh/ruff.git", tag = "0.14.8" }
-	ruff_python_ast       = { git = "https://github.com/astral-sh/ruff.git", tag = "0.14.8" }
-	ruff_python_formatter = { git = "https://github.com/astral-sh/ruff.git", tag = "0.14.8" }
+	ruff_formatter        = { path = "../ruff/crates/ruff_formatter" }
+	ruff_python_ast       = { path = "../ruff/crates/ruff_python_ast" }
+	ruff_python_formatter = { path = "../ruff/crates/ruff_python_formatter" }
 
 
 	anyhow             = "1.0"

Binary size reduced from 1.8 MB to 989 KB.

❯ ls -lh ./ruff_fmt_bg.wasm
-rw-r--r--@ 1 akari  staff   1.8M Dec  7 19:39 ./ruff_fmt_bg.wasm
❯ ls -lh ./ruff_fmt_bg.wasm                                                    
-rw-r--r--@ 1 akari  staff   989K Dec  7 19:34 ./ruff_fmt_bg.wasm

@MichaReiser
Member

Wow, that's a pretty significant reduction. But I assume it requires removing the aho-corasick dependency from the ast?

@magic-akari
Contributor Author

magic-akari commented Dec 7, 2025

No, aho-corasick removal is not needed - it's already eliminated automatically.

Why

  1. Dependency path: aho-corasick ← globset ← ruff_cache
  2. ruff_cache is only used for CacheKey trait and derive macros in the formatter
  3. The Formatter doesn't use impl CacheKey for Regex/Pattern/Glob - these implementations are only needed by linter/CLI
  4. LTO + DCE at link time automatically removes all unreferenced code from aho-corasick, regex-automata, globset, etc.

Note: Only verified for the format_module_source use case. Ref: https://github.com/wasm-fmt/ruff_fmt/blob/9ecf34b5efacc0f2e696d626a8a221460a83bfae/crates/ruff_fmt/src/lib.rs#L5
