Laurenz's Blog GitHub

How To Put 30 Languages Into 1.1MB

This blog post is about hypher, a fast hyphenation library for Rust.

I'm currently working on a pure-rust LaTeX alternative called Typst. To obtain justification results on par with LaTeX, Typst needs support for hyphenation. A quick search on docs.rs showed that there's only really one hyphenation library, fittingly called hyphenation. All other crates I've found were small variations of this crate. The hyphenation crate has a lot of functionality and supports many languages. However, it also has sizable binary overhead when you embed the hyphenation patterns (2.8MB). While you can load patterns at runtime, distributing the pattern files separately is so much more complicated than just embedding them.

A specific pain point I had with hyphenation was that I needed to hold on to the loaded, heap-allocated language dictionaries. In my case, the text was pre-segmented into "words" (string segments between two Unicode line break opportunities) and the language could be different for each word. Thus, I would've either had to reload the patterns for each word (slow) or set up some caching solution. Which is certainly possible, but I had some problems getting it to work because the hyphenating iterator kept borrows into the caching hash map.

So, at this point I decided to build a new crate with the following goals: No allocations, no loading at runtime, less binary overhead and no dependencies (why not). It looks like this:

use hypher::{hyphenate, Lang};

let syllables = hyphenate("extensive", Lang::English);
assert_eq!(syllables.join("-"), "ex-ten-sive");

(And that's almost the whole API surface.)

Hyphenating words

So, how do we actually hyphenate stuff? Turns out that there aren't really lists of hyphenated words that are available for free. So even with a new, let's say ML-based algorithm, we lack the data to make it work. (Such an approach would definitely be interesting, although I'm guessing the models would be quite large.) After a bit of research, it seemed that using TeX patterns is still the way to go. TeX patterns are, in principle, generated from word lists with the patgen tool, but many were tweaked by native speakers over the decades. The algorithms for dealing with the patterns go all the way back to Liang's 1983 thesis Word Hy-phen-a-tion by Com-put-er.

The general idea of the patterns is the following: There are hyphenating and inhibiting patterns. A hyphenating pattern says something like "if you see this sequence of letters, you can hyphenate here". An inhibiting pattern is the opposite: "If you see this sequence, don't hyphenate here!" There are multiple levels of patterns: The first level of patterns is hyphenating and defines broad rules like "you can hyphenate between two successive 'c's." The second level of patterns is inhibiting and handles exceptions from the broad rules. And, you guessed it, the third level is again hyphenating and handles the exceptions from the exceptions.

The pattern files are encoded in a simple text format: Letters are just letters and a number between two letters designates a point of hyphenation or inhibition. An odd number specifies a point of hyphenation and an even number one of inhibition. This goes up to a maximum level of 9. Some pattern include dots to indicate that the pattern should only match at the start or end of the word.

Now, to find out how to hyphenate a word, we first need a zero-initialized array of levels with length one less than that of the word (one entry for each point between two letters). Then, we need to find all patterns that match a substring of our word and update the level array with their levels. Updating always means taking the maximum of the existing entry and the number in the pattern, so that in the end, we get the result of the strongest pattern. Finally, the possible hyphenation points lie at the odd levels in the array. The example below illustrates this:

Visualization of how to hyphenate the word 'hyphenate'

Tries and state machines

So far so good. We know the general idea, but an important question remains: How do we find all matching patterns? While we could store the patterns in a hashmap and iterate over all substrings, this would kind of defeat the point of this blog post. We want performance.

Luckily, Liang's thesis also contains efficient algorithms to work with the patterns. The general idea is to create a trie, essentially a tree-shaped finite state machine, to encode the patterns. Each path from the root of such a trie to an accepting state encodes one pattern. The figure below shows an example for the seven patterns from the example above (accepting states have double borders). You can see the pattern n2at being reflected by the topmost path through the trie. We can easily build such a trie by iterating over the patterns, trying to walk each pattern in the trie and adding states and transitions as necessary.

State machine for the seven previously seen patterns

What is still missing from this illustration though is the levels! How does that work? Since there is a one-to-one relationship between patterns and accepting states, we can simply associate the levels for a pattern with the accepting state.

In the example above, I have numbered the accepting states with Roman numerals so that we can write down the levels for each one. A pattern with n letters can have n+1 levels: Before the first letter, between each pair of letters and after the last letter. If there isn't a number between two letters, it's the same as if there was a zero in between. This way, we get the following result:

State Pattern Levels
I 1na [1, 0, 0]
II n2at [0, 2, 0, 0]
III he2n [0, 0, 2, 0]
IV hena4 [0, 0, 0, 0, 4]
V hen5at [0, 0, 0, 5, 0, 0]
VI hy3ph [0, 0, 3, 0, 0]
VII 4te. [4, 0, 0, 0]

Now, given a trie with levels, how do we hyphenate a word? We simply start a trie walk at each letter of the word and update the level array with the levels of each accepting state we meet. This way, we once again find all patterns that match any substring in the word, but much more efficiently!


You can think about tries like this: They allow us to efficiently encode shared prefixes of the patterns. But we can even go one step further and also profit from shared suffixes. This turns the trie into a finite state machine. To do that, we have to find ends of walks which are the same. In the example above, this would almost work for the two a-t walks ending in II and V. However, it unfortunately doesn't in this case because the levels associated with II and V are different. For more details on tries, finite state machines and suffix compression, read this very interesting blog post.


Encoding state machines compactly

All that is left to do is to compactly encode our state machine into bytes that we can embed into the binary. For this, I took some inspiration from regex-automata, which makes heavy use of all kinds of automatons.

In our case, each state consists of transitions and optionally levels for accepting states. For each transition, we have a letter and a target state. Well actually, now is maybe a good time to bring up that we don't actually deal with letters. Rather, we build our state machine over UTF-8 bytes. This works just as well, but is much easier to encode compactly. And when hyphenating, we then of course only start trie walks at UTF-8 codepoint boundaries.

Back to the states: To encode transitions, we lay out two parallel arrays. The first contains each byte for which there is a transition and the second contains the address delta to the state we should transition into for this byte. Each state has an address: its byte offset in the whole encoded machine. Transition addresses are always encoded relative to the origin state as the delta is often much smaller than the absolute address. To get maximum profit out of this, we further use a variable length address coding. The address array is either an [i8], [i16] or [i24] depending on the largest delta. Overall, a state's bitstream encoding looks like this:

Binary state encoding

Now, the levels. If a state is accepting, it contains an additional offset and a length for the levels. The (offset, length) pair locates a slice of items in an additional array shared by all states. Each item in the level slice corresponds to one number in the state's pattern. A level item consists of two parts: the distance of the level from the start of the word or previous level, and the level number. We again use the trick of making the distances relative to make them smaller. It turns out that there is no relative distance larger than 24 and no level larger than 9 in the patterns. This means we can cramp both into a single byte! We can't directly shift and bitor these two values into 8 bits (distance would need 5 bits and level 4 bits). However, there are still only 25 * 10 = 250 combinations, which is less than 256. So we can fit it into one byte like this:

fn pack(dist: u8, level: u8) -> u8 {
assert!(dist < 25, "too high distance");
assert!(level < 10, "too high level");
dist * 10 + level
}

fn unpack(packed: u8) -> (u8, u8) {
let dist = packed / 10;
let level = packed % 10;
(dist, level)
}

If the encoded level slice for two states is the same, it is only stored once in the shared array, saving even more precious space.

Finishing up

At runtime, we now don't need to prepare or load anything. We can just lazily decode the embedded automaton as we're executing it. And to eliminate the last allocation, we can even stack allocate the level array if the word isn't too long (<= 39 bytes in hypher).

Regarding API, I opted for a free-standing method hyphenate(&str, Lang) -> Syllables<'_> as I feel that it is much more discoverable than a method on Lang. Syllables is a hand-written iterator that segments the string based on the level array. I also always enjoy when a crate makes my job as simple as a possible. Therefore, I added a join method to Syllables so that you quickly add in some (soft) hyphens.

The tries are constructed and encoded with a build script. As that script really took its time in debug builds, I added this to my Cargo.toml to somewhat optimize the process.

[profile.dev.build-override]
opt-level = 1

Regarding binary size: 1.1MB isn't that much, but there are also many applications where you only want to hyphenate English. For this, I added two features full and english with full being enabled by default. Dropping all and adding english brings the overhead down to 27KB. While I don't think its great to favor English like that (I'm not a native English speaker), I also felt that adding one feature per language didn't carry its weight.

[features]
default = ["full"]
full = ["english"]
english = []

(Update: There's now one feature per language.)

Benchmarks

Now, let's very briefly compare hypher with hyphenation.

Task hypher hyphenation
Hyphenating extensive (english) 356ns 698ns
Hyphenating διαμερίσματα (greek) 503ns 1121ns
Loading the english patterns 0us 151us
Loading the greek patterns 0us 0.826us

For these two test cases, hypher is about 2x as fast as hyphenation. Moreover, the loading overhead of hyphenation is quite large in comparison to hyphenating a single word, at least for English. All benchmarks were executed on ARM, Apple M1.

The direct overhead of embedding is ~1.1MB for hypher and ~2.8 MB for hyphenation. However, this comparison is unfair to hyphenation as I dropped some languages from hypher. Over the decades, quite a lot of TeX pattern files have amassed. For many of these, I couldn't even find any evidence that hyphenation is used for these languages, so I removed them. Furthermore, I wanted hypher to be permissively licensed. Therefore, it unfortunately does not support languages for which the only available patterns have GPL-like licenses. There are a few of those, but not too many. In a fairer comparison where only the common languages are considered, hypher's encoding is still ~12% more compact than hyphenation's.

That's it, thank you for reading! Also, take a look at Typst if you're interested.

Discussion on r/rust.