fix: rework hyphenation #3188

carlobeltrame · 2025-07-14T16:28:00Z

Fixes #3018, was broken since #2600
Fixes #1642
Fixes #2456
Fixes #2564
Fixes #2739
Enables better solutions for the following issues:
Fixes #1238
Fixes #1380
Fixes #1416
Fixes #1662

I added each change in a separate commit, with accompanying tests to make sure the intention of the code change will not as easily be broken again in the future. Especially the textkit layout integration test took a while to write, but should be of great importance to detect unwanted changes in the future.

This is technically a breaking change for people who have previously used custom word wrapping functions, but one could argue this feature has been quite broken for a long time now, as described in #3018. I can also update the documentation website and possibly some of the examples once this is merged.

The hyphenation algorithm may change the string (e.g. by removing some characters, namely soft hyphens). Therefore, calculating the glyphs must come after hyphenation, so that the glyphs match the final string. Fixes diegomura#3018. This was probably broken in diegomura#2600.

The Hyphenation algorithm should be able to leave soft hyphens in, to indicate that a hyphen should be placed there if the word breaks there.

The line breaking algorithm needs to distinguish syllables which end with a soft hyphen from syllables that do not, and only mark a syllable for adding a hyphen in the former case.
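
As a rough illustration of the two paragraphs above, the distinction could be sketched like this. `describeBreaks` and the shape of the objects it returns are hypothetical names for this example, not the actual textkit node types.

```js
const SOFT_HYPHEN = '\u00ad';

// Hypothetical helper, not the textkit implementation: describe the break
// opportunity that follows each syllable of a single word.
const describeBreaks = (syllables) =>
  syllables.map((syllable, index) => {
    const isLast = index === syllables.length - 1;
    const endsWithSoftHyphen = syllable.endsWith(SOFT_HYPHEN);
    return {
      // The soft hyphen itself is never printed.
      text: endsWithSoftHyphen ? syllable.slice(0, -1) : syllable,
      // A break is possible between any two syllables of the word.
      canBreakAfter: !isLast,
      // Only a syllable ending with a soft hyphen asks for an inserted hyphen.
      insertHyphenOnBreak: !isLast && endsWithSoftHyphen,
    };
  });
```

With syllables such as `['hy\u00ad', 'phen-', 'ated']`, only a break after the first part would add a hyphen; a break after `phen-` would not add a second one.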

For the line breaking algorithm, soft hyphens should be considered to have a width of zero, since they are never printed directly (they can only lead to an inserted hyphen if at the end of a line). The font package was already doing this correctly, but the pdfkit package considered the soft hyphen to be the same as a normal hyphen with an advanceWidth of 333 in Helvetica. Without this change, in some edge cases the pdfkit would break apart lines already broken apart by the line breaking algorithm in textkit. Added tests for both packages to make sure they remain compatible in the future.
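
A minimal sketch of the zero-width rule, assuming a caller-supplied `advanceWidthOf` function that stands in for a real font lookup. Both `measureText` and `advanceWidthOf` are hypothetical names and not part of the font or pdfkit packages.

```js
const SOFT_HYPHEN = '\u00ad';

// Hypothetical width measurement: sums per-character advance widths, but
// treats every soft hyphen as having a width of zero.
const measureText = (text, advanceWidthOf) =>
  Array.from(text).reduce(
    (width, char) => width + (char === SOFT_HYPHEN ? 0 : advanceWidthOf(char)),
    0,
  );
```

If both packages measure a string this way, a line that fits according to the line breaker should also fit when it is rendered.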

In the best fit line breaking algorithm, the width of the hyphen must be taken into account, in case one is to be inserted at the end of the line. This is the most readable change I was able to find to achieve the goal. Maybe the bestFit algorithm could be optimized in the future, along with writing extensive tests for corner cases.
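
The idea can be sketched with a simplified greedy fit. This is not the bestFit implementation in textkit; `countFittingSyllables`, `widthOf` and the parameter names are assumptions made for this example.

```js
const SOFT_HYPHEN = '\u00ad';

// Hypothetical greedy sketch: returns how many leading syllables fit on a
// line of `lineWidth`. When breaking after a syllable that ends with a soft
// hyphen, the width of the hyphen to be inserted is counted as well.
const countFittingSyllables = (syllables, widthOf, lineWidth, hyphenWidth) => {
  let used = 0;
  let fitting = 0;
  for (let i = 0; i < syllables.length; i += 1) {
    const endsWithSoftHyphen = syllables[i].endsWith(SOFT_HYPHEN);
    const visible = endsWithSoftHyphen ? syllables[i].slice(0, -1) : syllables[i];
    used += widthOf(visible);
    // The text alone no longer fits, so later syllables cannot fit either.
    if (used > lineWidth) break;
    const isLast = i === syllables.length - 1;
    const extra = !isLast && endsWithSoftHyphen ? hyphenWidth : 0;
    // Breaking here is only allowed if the inserted hyphen fits too.
    if (used + extra <= lineWidth) fitting = i + 1;
  }
  return fitting;
};
```

Ignoring the hyphen width would let a line be chosen that overflows by exactly the width of the inserted hyphen.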

Therefore, we remove all soft hyphens from the attributed string after linebreaking is completed, and recalculate the glyphs afterwards. This way, pdfkit never sees the soft hyphens, and does not mistake them for normal hyphens.
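
A minimal sketch of that ordering, assuming lines are plain strings and that `shapeGlyphs` is a caller-supplied stand-in for the real glyph calculation. `finalizeLines` and `shapeGlyphs` are hypothetical names for this example.

```js
const SOFT_HYPHEN = '\u00ad';

// Hypothetical post-processing step: runs after line breaking is complete.
// Soft hyphens are stripped first, and only then are the glyphs computed,
// so that the glyphs always match the final string.
const finalizeLines = (lines, shapeGlyphs) =>
  lines.map((line) => {
    const string = line.split(SOFT_HYPHEN).join('');
    return { string, glyphs: shapeGlyphs(string) };
  });
```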

Tests the functionality of custom word splitting functions

changeset-bot · 2025-07-14T16:28:04Z

🦋 Changeset detected

Latest commit: e343b3d

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 13 packages

Name	Type
@react-pdf/pdfkit	Minor
@react-pdf/textkit	Minor
@react-pdf/font	Patch
@react-pdf/renderer	Patch
@react-pdf/layout	Patch
@react-pdf/render	Patch
@react-pdf/types	Patch
next-14	Patch
next-15	Patch
@react-pdf/vite-example	Patch
@react-pdf/e2e-node-cjs	Patch
@react-pdf/e2e-node-esm	Patch
@react-pdf/stylesheet	Patch


wojtekmaj · 2025-07-15T08:07:00Z

Insane work. Looks good to me!

diegomura · 2025-09-23T21:03:02Z

.changeset/ninety-dogs-grow.md

+
+This allows you to break correctly on normal hyphens or other special characters in your text. For example, to use the default english-language syllable breaking built into react-pdf, but also break after hyphens naturally occurring in your text (such as is often present in hyperlinks), you could use the following hyphenation callback:
+```js
+import { wordHyphenation } from '@react-pdf/textkit';


This means @react-pdf/textkit will now be something users should import from? I'd rather keep this "internal", with all public APIs being exported from @react-pdf/renderer

Makes sense, I'll prepare a fix

The import is not necessary anymore with e343b3d

diegomura · 2025-09-23T21:04:27Z

.changeset/ninety-dogs-grow.md

+import { Font } from '@react-pdf/renderer';
+
+const originalHyphenationCallback = wordHyphenation()
+Font.registerHyphenationCallback((word) => {


Maybe we should consider passing the default hyphenation fn as an argument of the callback? Fewer things to care about as a user

Done in e343b3d
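
A sketch of what such a callback could look like. The thread only states that the default hyphenation function is passed to the callback; that it arrives as the second argument is an assumption here, and `hyphenateAlsoAtDashes` and `builtinHyphenation` are hypothetical names.

```js
// Assumption: the builtin hyphenation function is passed as a second
// argument. The callback reuses it and additionally allows a break after
// every normal hyphen that occurs in the text, e.g. inside URLs.
const hyphenateAlsoAtDashes = (word, builtinHyphenation) =>
  builtinHyphenation(word).flatMap((syllable) => syllable.split(/(?<=-)/));
```

Under that assumption it could be registered with `Font.registerHyphenationCallback(hyphenateAlsoAtDashes)`, without importing anything from @react-pdf/textkit.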

diegomura · 2025-09-23T21:07:17Z

@carlobeltrame great work man! Sorry for the late response here. Can you guide me on how (or if) this solves hyphenation in different languages? I always thought we should export some hyphenation "language specifications" (if that even exists) that people can pass to Document to easily hyphenate with non-English rules. Not something to do now, but I wonder if it's even possible. I've been a bit away from this issue for some time 😅


carlobeltrame · 2025-09-24T12:34:49Z

> @carlobeltrame great work man! Sorry for the late response here. Can you guide me on how (or if) this solves hyphenation in different languages? I always thought we should export some hyphenation "language specifications" (if that even exists) that people can pass to Document to easily hyphenate with non-English rules. Not something to do now, but I wonder if it's even possible. I've been a bit away from this issue for some time 😅

@diegomura see also diegomura/react-pdf-site#150 for a documentation PR which attempts to do just that.

After this PR here, a full example for German and for splitting at hyphens in the source text could look like this:

```js
import { Font } from "@react-pdf/renderer";
import { hyphenateSync as hyphenateDe } from "hyphen/de";

const SOFT_HYPHEN = '\u00ad';
const hyphenationCallback = (word) => {
  // technically, null may be passed as argument
  if (word === null) return [];
  // take care to leave the soft hyphens in the split parts, so a hyphen is
  // rendered in the pdf if the string is split at that position
  const syllables = hyphenateDe(word).split(new RegExp(`(?<=${SOFT_HYPHEN})`));
  // also allow line splitting at dashes in the text, e.g. inside URLs
  return syllables.flatMap((syllable) => syllable.split(/(?<=-)/));
};
Font.registerHyphenationCallback(hyphenationCallback);
```

I don't think react-pdf should be responsible for providing the hyphenation rules for all languages. I think users should be responsible for selecting and installing only the necessary language-specific and domain-specific hyphenation dictionaries which they use. The other languages of the hyphen package or of other packages can then be removed using tree-shaking.

If react-pdf ever starts to include "presets" for hyphenation in common western languages, the customizable hyphenationCallback will still be necessary. Imagine if a company named "foobar" never wants to have its company name hyphenated in their pdfs, they could implement this using the hyphenationCallback.
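
The company-name case could be sketched as a small wrapper around any other hyphenation function. `PROTECTED_WORDS` and `withProtectedWords` are hypothetical names for this example, and the exact matching rule (lowercase comparison of the whole word) is an assumption.

```js
// Hypothetical list of words that must never be split.
const PROTECTED_WORDS = new Set(['foobar']);

// Wraps a hyphenation function so that protected words are returned as a
// single part, which leaves no break opportunity inside them.
const withProtectedWords = (hyphenate) => (word) => {
  if (word === null) return [];
  if (PROTECTED_WORDS.has(word.toLowerCase())) return [word];
  return hyphenate(word);
};
```

The wrapped function can then be passed to `Font.registerHyphenationCallback` like any other callback.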

Also, please note that I only have experience with Western languages such as English, German, French, Italian, etc., which share pretty similar hyphenation strategies. I have no idea how hyphenation works in e.g. Middle Eastern or Asian languages, and cannot advise on how to implement support for that. I think if there is a need for hyphenation in such a language, we need contributors from these regions to describe the required rules in an issue first.

ybd-project · 2025-10-26T09:30:28Z

Hello. I'm a Japanese speaker.
First of all, thank you for the wonderful pull request.

Next, regarding hyphenation rules in Japanese, I personally think they are unnecessary. It's sufficient for each person to break lines at appropriate positions.
Also, while hyphenation in English-speaking countries (including languages spoken in European countries) places a “-” at the end of a line, this is not necessary in Japanese. The only thing needed is line breaks at appropriate positions.

Currently, using React-PDF inserts hyphens in Japanese text, which is extremely disruptive. (Even following the steps to disable it doesn't remove them.)
Personally, I would like to see this pull request merged as soon as possible.

I hope this is helpful. Thank you.

(Please note that this translation was generated by a translation tool and may contain unnatural expressions.)

ybd-project · 2025-10-26T09:44:54Z

〔Addendum〕Is the fact that entering meaningless characters (e.g., “aaaa”) doesn't cause a line break related to hyphenation?

carlobeltrame · 2025-10-26T12:07:48Z

> 〔Addendum〕Is the fact that entering meaningless characters (e.g., “aaaa”) doesn't cause a line break related to hyphenation?

Yes. The current version of react-pdf adds line breaks:

- on whitespace (e.g. aaaa aaaa can be split)
- when the hyphen library recognizes an English word (e.g. the very long but valid English word pneumonoultramicroscopicsilicovolcanoconiosis can be split)
- when you define your own custom hyphenation callback

Since aaaaaaaaaaa has no spaces and is not an English word, react-pdf keeps it together on one line by default.

ybd-project · 2025-10-26T13:12:47Z

```jsx
<Text style={{ width: '100%' }}>No.3では生命の連続性第二章「遺伝の規則性と遺伝子」から出題されます。それと、これも大変重要な役割を持ちます。</Text>
```

When I run the code you implemented (part of the pull request), it behaves as shown in the image above.

This is not the desired behavior, and I find it aesthetically unpleasing.
Ideally, I want it to behave like this. (I added code to achieve the look shown in this image.)

```ts
const JA_REGEX = /^[\p{scx=Hiragana}\p{scx=Katakana}\p{scx=Han}]+$/u;

const getNodes = (
  attributedString: AttributedString,
  { align }: Attributes,
  options: LayoutOptions,
): Node[] => {
  let start = 0;

  const hyphenWidth = 5;
  const softHyphen = '\u00ad';

  // Here!
  attributedString.syllables = attributedString.syllables
    .map((s) => (JA_REGEX.test(s) ? s.split('') : s))
    .flat();

  // ...
```

(/textkit/src/engines/linebreaker/index.ts : getNodes function)

Note

This issue occurs because Japanese syllables are not being recognized correctly. In the image above, the text is not separated by punctuation such as “。” (equivalent to the English period).
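
As an alternative to patching `getNodes`, a similar effect might be achievable from user code with a custom hyphenation callback. The following is only a sketch under assumptions: `hyphenateJapaneseAware` and `fallbackHyphenation` are hypothetical names, and it relies on the behavior described in this PR that parts which do not end with a soft hyphen can be broken without inserting a hyphen. It does not handle Japanese rules about which characters may start or end a line.

```js
// Matches either one Japanese character, or a run of non-Japanese characters.
const JA_OR_OTHER =
  /[\p{scx=Hiragana}\p{scx=Katakana}\p{scx=Han}]|[^\p{scx=Hiragana}\p{scx=Katakana}\p{scx=Han}]+/gu;
const JA_CHAR = /^[\p{scx=Hiragana}\p{scx=Katakana}\p{scx=Han}]$/u;

// Hypothetical callback: every Japanese character becomes its own part (with
// no soft hyphen, so no hyphen is requested), while all other runs are handed
// to the fallback hyphenation function.
const hyphenateJapaneseAware = (word, fallbackHyphenation) => {
  if (word === null) return [];
  const runs = word.match(JA_OR_OTHER) ?? [];
  return runs.flatMap((run) => (JA_CHAR.test(run) ? [run] : fallbackHyphenation(run)));
};
```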

carlobeltrame added 8 commits July 13, 2025 17:25

Add snapshot test for the textkit layout engine

a7e6218

Change default hyphenation algorithm to more flexible paradigm

5a90f30

The Hyphenation algorithm should be able to leave soft hyphens in, to indicate that a hyphen should be placed there if the word breaks there.

Variable width penalty nodes

aa13b48

The line breaking algorithm needs to distinguish syllables which end with a soft hyphen from syllables that do not, and only mark a syllable for adding a hyphen in the former case.

Add another test for the textkit layout engine

a9cb9da

Tests the functionality of custom word splitting functions

Add changeset

0109089

carlobeltrame changed the title ~~Fix hyphenation~~ Fix and rework hyphenation Jul 14, 2025

carlobeltrame mentioned this pull request Jul 14, 2025

Hyphenation on soft hyphens is broken #3018

Open

diegomura reviewed Sep 23, 2025

diegomura changed the title ~~Fix and rework hyphenation~~ fix: rework hyphenation Sep 23, 2025

Pass builtin hyphenation callback to custom callback

e343b3d

This allows library users to avoid importing the callback themselves, which probably most of the implementations will want to do.
