
Rewriting NHK Easier in Rust

The Christmas holidays are the perfect time of the year to shut oneself in, talk to no one, and pour days into staring at a computer screen.

So, naturally, I rewrote my longest-actually-running project in Rust 🦀.

The Challenge

Many Words

NHK Easier serves news written in simple Japanese for language learners. It has a couple of features to make the learning process easier. The most important and most complex one shows the corresponding dictionary entry when hovering over a word.

The EDICT dictionary with proper names is 11 MB compressed with bzip2. We do not want to send that for a single Web page. Easy! We will only send the dictionary entries for the words that appear on the page.

The thing is, there are no spaces between words in Japanese. Let’s take the following sentence:

ねこがたべる (nekogataberu)

A Japanese speaker understands the words as:

  • ねこ (neko): cat
  • が (ga): follows the subject of the sentence
  • たべる (taberu): eat

So, they would read the sentence as “The cat eats”. However, even if it only uses words out of a dictionary, a naive algorithm could split the sentence as:

  • ね (ne): root
  • こがた (kogata): small-size
  • べる (beru): bell

Of course, this interpretation does not make sense. But, to rule it out, the algorithm would need to understand Japanese syntax. And now we are doing NLP.

Instead, the safe approach is just to display all possibilities to the user. If the user hovers ね, we should suggest:

  • ねこ (neko): cat
  • ね (ne): root

If they hover こ, we should suggest:

  • こがた (kogata): small-size
  • こ (ko): child
  • こ (ko): counter for articles

In other words, the basic task is to iterate over all substrings, look up the corresponding dictionary entries, and send those to the user.
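In Python, the basic loop could look like this (a minimal sketch, assuming the dictionary has been loaded as a plain dict from words to entries, which is a simplification of the real data structures):

def entries_for(sentence, dictionary):
    """Collect the dictionary entries for every substring of the sentence."""
    entries = {}
    for start in range(len(sentence)):
        for end in range(start + 1, len(sentence) + 1):
            word = sentence[start:end]
            if word in dictionary:
                entries[word] = dictionary[word]
    return entries

For ねこがたべる, this picks up ねこ, ね, こがた, and every other substring that has an entry.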

But then, it gets worse.

Grammar

Despite what you may have been told, Japanese does have grammar and inflection. And it’s agglutinative. This means that “he has been forced to make him eat” could become the single word たべさせられていた (tabesaserareteita). To figure out that we should display the entry for たべる (taberu, “eat”) when the user hovers this word, we need to apply “deinflection rules”.

Basically, a deinflection rule tells you to replace a suffix of the word with a shorter one. But there are several rules, and you do not know which one is correct, so you just try them all. And then, you need to recursively deinflect the new words in turn. At some point, one of these words might match a dictionary entry. But you still need to check them all.
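Here is a minimal sketch of that process, assuming the rules are given as (suffix, replacement) pairs and that, as described above, each replacement is strictly shorter than its suffix (which guarantees termination):

def deinflect(word, rules):
    """Return the word plus all candidate base forms obtained by
    recursively applying deinflection rules to its suffix."""
    candidates = {word}
    for suffix, replacement in rules:
        if word.endswith(suffix):
            shorter = word[: len(word) - len(suffix)] + replacement
            # each new candidate must be deinflected in turn
            candidates |= deinflect(shorter, rules)
    return candidates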

For a short article that ChatGPT 4 translates to 162 English words, we need to do 5,631 lookups. Of these, only 427 match at least one dictionary entry. In total, we output 679 dictionary entries for common words.

Early Solution

My first attempt was pretty straightforward: load everything in a Python dictionary. However, this used way too much memory. So I added an index; after a binary search, it would give me the offset of the entry in the EDICT file. This avoided creating multiple Python objects for each entry, and I relied on Linux file caching to avoid disk seeks.
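Conceptually, the lookup worked something like this (a sketch; the exact index layout, a sorted list of headwords with parallel byte offsets, is an assumption for illustration):

import bisect

def lookup(word, keys, offsets, edict):
    # keys: sorted list of headwords; offsets[i]: byte offset of the
    # entry for keys[i] in the open EDICT file
    i = bisect.bisect_left(keys, word)
    if i < len(keys) and keys[i] == word:
        edict.seek(offsets[i])
        return edict.readline()
    return None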

Running a retrospective benchmark tells me it took about 1.1 s to find the relevant entries for the story mentioned above. It was so slow that I precomputed sub-dictionaries for all pages. At least, I was paying this cost once per page, and not once per view. But it meant higher complexity (generating the dictionaries and serving them), lower flexibility (updating all the sub-dictionaries or creating a new kind of page would be cumbersome), and extra storage (1.5 GB for the sub-dictionaries of 8,997 pages).

In the benchmark below, you can see that the latency without contention is about 80 ms, and the maximal throughput is less than 13 requests/second. Running parallel requests actually lowers it!

$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    79.05ms   12.37ms 137.93ms   84.80%
…
Requests/sec:     12.65
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec:      7.99

Note: -H 'Connection: close' is required because of wrk issue #488. Without it, wrk measures latency incorrectly. This actually caused all my initial benchmarks to be off by about 40 ms. I only learned about this because of a discrepancy when redoing the benchmarks for this article, where doing more work would take less time!

Note: I have run the benchmarks of this article on the index page of 2024-01-12. It contains 4 stories, with 4 sub-dictionaries for common words, and 4 sub-dictionaries for proper names.

And that was the situation from 2018 to 2023.

Optimizing Python

Of course, the solution was Rust!

But, before I indulged in the pagan ritual of RIIR, I took a bit of time to properly optimize the Python version. This would give me a better point of comparison for the speedup achieved by switching languages.

Using profiling allowed me to identify various avenues of improvement:

  • move the data back into a dictionary; it would use too much memory to run on my tiny VPS, but only speed mattered at that point
  • deduplicate the candidates coming out of deinflection
  • stop deinflecting everything twice, a bug I traced back to poor function naming

These optimizations combined reduced the time needed to generate the entries from 1,200 ms per story to 24 ms (-98%).

Going further, I used a trie to quickly look up the matching deinflection rules to apply (-50%), as sketched below. I also kept only the relevant entries from the main dictionary (-15%). I was basically down to about 10 ms per story. With 4 stories on the main page, I was at 310 ms to first byte, compared to 230 ms in the version with precomputed sub-dictionaries.
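For reference, the idea behind the trie lookup, sketched in Python (the representation of rules as (suffix, replacement) pairs is again an assumption):

def build_suffix_trie(rules):
    """Index the rules by the reversed characters of their suffix."""
    trie = {}
    for suffix, replacement in rules:
        node = trie
        for c in reversed(suffix):
            node = node.setdefault(c, {})
        node.setdefault("rules", []).append((suffix, replacement))
    return trie

def matching_rules(word, trie):
    """A single walk from the end of the word yields every applicable rule."""
    node = trie
    for c in reversed(word):
        if c not in node:
            return
        node = node[c]
        yield from node.get("rules", [])

Instead of testing every rule against every candidate word, we only pay for the rules that can actually apply.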

Optimizing Django

Thanks to benchmarking, I realized my page loads were atrocious either way. So I optimized them as well, to better see the effect of generating sub-dictionaries. The line_profiler package was very useful for identifying hotspots in Django views:

  • I ditched Django’s ORM query-builder for some complex queries, without changing the SQL sent to the database; just building that query out of dynamic Python objects at every page load took a very significant amount of time
  • I avoided issuing additional SQL queries to find a specific story when I could just use plain Python over the list of results (using SQL queries there was an oversight, encouraged by how Python-like Django’s ORM is)
  • I used indexes¹ (this actually yielded a smaller performance improvement than I expected)
  • I avoided converting fields in an SQL query

Running the benchmark again, we see that these changes brought the load time down to 9 ms, with most of that time spent generating HTML from Jinja2 templates, and some in SQL queries and Django’s ORM query builder. Even better, serving queries on 2 physical cores with hyperthreading increases the throughput to 124 requests/second!

$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.94ms    2.07ms  34.97ms   90.65%
…
Requests/sec:    106.21
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec:    124.42

At that point, things are starting to look pretty good. However, integrating on-the-fly generation of sub-dictionaries in Python would get us back into meh territory:

$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    94.42ms    7.79ms 141.33ms   90.48%
…
Requests/sec:     10.49
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec:     10.49

Notice in particular how the multithreaded benchmark does not improve throughput. This is because all threads need to access the full dictionary objects. And, with Python’s GIL, that means all threads must take their turn.

So, let’s Rust!

A bit of Rust

First, I migrated the Python module that generated the sub-dictionaries. I ported the code from Python to Rust, keeping mostly the same logic, and compiled it to a native Python module, making the transition transparent to the rest of the project.

Simply writing in Rust already yielded significant benefits, with many low-level optimizations enabled by the type system, and with lifetimes reducing the number of allocations and deallocations.

The one thing I kept in mind was to avoid cloning objects unnecessarily. For instance, the module does not copy the entries from the main dictionary, but instead uses references for as long as possible, until returning to Python. In fact, the whole file is kept as a single string, and all entries and keys (individual words) point into that string.

Doing this reduced the time to generate a sub-dictionary to about 1 ms. Also, using Rust references everywhere instead of Python strs reduced the memory footprint to 50 MB, making it viable in production.

$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    22.17ms    2.51ms  32.16ms   71.95%
…
Requests/sec:     44.16
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec:     48.35

This is better. But notice that we are still not doing better with multiple threads. This is because the Rust module still holds the GIL while it works; we need to release it. Once we do, we see some improvement:

$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    23.26ms   14.29ms 193.46ms   98.24%
…
Requests/sec:     44.57
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec:     76.25

However, calling this from Python still took about 13 ms out of the 23 ms needed to load the index page with 4 stories. With two sub-dictionaries per story, at about 1 ms each, you would expect about 8 ms. But I did not investigate this overhead further, since I was planning to rewrite it all in Rust anyway.

All of the Rust

Having already used Axum to write a Web API, and Askama for templating, porting the Web service to Rust was unchallenging.

I did fight a bit with sqlx, however.

The two most popular ways to interact with a database in Rust are Diesel, which provides an ORM, and sqlx, which takes raw SQL. To me, the most important thing is being able to use static type checking to catch basic errors in the queries. Diesel achieves this with its query-builder; sqlx does that by verifying the SQL queries against a real database at compile time using PREPARE statements.

// with Diesel
diesel::update(dsl::users.find(user.id))
    .set(dsl::name.eq(name))
    .execute(con)
    .unwrap();

// with sqlx
sqlx::query("UPDATE users SET name = $1 WHERE id = $2")
    .bind(name)
    .bind(user.id)
    .execute(pool)
    .await
    .unwrap();

Using a query-builder adds a leaky layer of abstraction you need to fight with. This is especially frustrating when you already know the SQL query you want to write, but need to figure out how to make Diesel generate it for you. I already had experience with Diesel, so I opted to try sqlx, to see whether it would deliver on its promises, or whether querying a database at compile-time would add too much friction.

In fact, sqlx works mostly as described. (Strictly speaking, it is the query! family of macros that checks queries at compile time; the plain query function shown above only checks types at runtime.) It only becomes painful when you want to avoid making an allocation for each field of each returned row! If you try to do that, you actually lose type safety, since the queries are not checked anymore. This is a known open issue.

$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:3000/
…
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.25ms    1.22ms  22.02ms   88.10%
…
Requests/sec:    107.24
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:3000/
…
Requests/sec:    264.78

In any case, the Rust rewrite got rid of the overhead, reducing the total time to generate a page to 9 ms, virtually all of it being spent on generating sub-dictionaries!

Result of profiling the Web service using samply. I only performed requests to the index page, which is handled by the archive route. We can see that 95% of its time is spent generating sub-dictionaries.

Conclusion

Compiled languages with static typing have the reputation of being faster at the cost of being lower-level, involving more house-keeping, and having more gotchas.

However, in my experience, Rust actually feels high-level, almost Python-like! There are some things that can surprise a beginner, like the borrow checker, but they are usually easily solved with a .clone(). After all, using Python is pretty close to using Rc and .clone() everywhere, implicitly!

In addition, the migration can often be done piecemeal.

Without too much effort, Rust gives very significant performance improvements. The main lever to go further is minimizing allocations and data copies, which is much easier in Rust than in C or C++, thanks to references and lifetimes that track everything for you.

In fact, looking at the reversed call stack of samply’s measurements, it looks like copying is still where I spend most of my time.

There is definitely still room for improvement in my refactor, but it is more than enough to convince me that this was worth it!


  1. You should very much use indexes. ↩︎

Dynamic Programming is not Black Magic

This year’s Advent of Code has been brutal (compare the stats of 2023 with those of 2022, especially day 1 part 1 vs. day 1 part 2). It included a problem best solved with dynamic programming as early as day 12, which discouraged some people I know. This specific problem was particularly gnarly for Advent of Code, with multiple special cases to take into account, making it basically intractable if you are not already familiar with dynamic programming.

However, dynamic programming itself is mostly natural when you understand what it does. And many common algorithms are actually just the application of dynamic programming to specific problems, including omnipresent path-finding algorithms such as Dijkstra’s algorithm.

This motivated me to write a gentler introduction and a detailed explanation of solving Day 12.

Let me start with a rant. You can skip to the following section if you want to get to the meat of the article.

The Rant

Software engineering is terrible at naming things.

Now, let’s take a look at “dynamic programming”. What can we learn from the name? “Programming” must refer to a style of programming, such as “functional programming”, or maybe “test-driven development”. And “dynamic” could mean any number of things.

Guess what: it means none of them, and it has nothing to do with being “dynamic”. It is an idea that you can use to design an algorithm, so there is a link to “programming”; I will grant it that.

Edit: There is a reason why it is named this way. When you look at the historical meaning of “programming”, the expression made sense. See niconiconi’s comment.

So, what is it?

Basic Caching

Let’s say we want to solve a problem by splitting it into smaller, similar problems. Basically, a recursive function. Often, we end up having to solve the same smaller problems many times.

The typical example is Fibonacci, where you want to evaluate f(n), which is defined as f(n - 1) + f(n - 2). If we implement it naively, we will end up evaluating f(1) many times:

Call tree for evaluating f(6), the 6th Fibonacci number. To evaluate f(6), we need to evaluate both f(5) and f(4). To evaluate f(5), we will need f(4) and f(3). Already, we see that we are going to need f(4) in two places. If we go further, we see that we will need f(1) eight times, which happens to be f(6).

In fact, in this case, since only f(1) adds anything to the overall result, the number of times we will need f(1) is equal to f(n). And f(n) grows very fast as n grows.

Of course, we can avoid doing this. We can just cache the results (or memoize f, in terrible academic vernacular).

In the example, once we have evaluated f(4) once, there is no need to evaluate it again, saving 3 evaluations of f(1). By doing the same for f(3) and f(2), we get down to 2 evaluations of f(1). In total, f(…) is evaluated 7 times (f(0), f(1), f(2), f(3), f(4), f(5), f(6)), which is just n + 1.
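In Python, this caching can be as simple as decorating the function (a minimal sketch; functools.cache does the bookkeeping for us):

from functools import cache

@cache
def f(n):
    # each distinct value of n is now computed only once
    if n < 2:
        return n
    return f(n - 1) + f(n - 2)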

This is theoretically (asymptotically) optimal. But we can look at this in a different way.

Optimized Caching

With memoization, we keep the recursion: “to solve f(6), I need f(5), which will itself need f(4) […] and f(3) […], and f(4), which will itself need f(3) […] and f(2) […]”. Basically, we figure out what we need just when we need it.

Instead, we can make the simple observation that we will need f(0) and f(1) for all other evaluations of f(…). Once we have them, we can evaluate f(2), which we will in turn need for all further evaluations of f(…).

You can think of it as plucking the leaves (the nodes without descendants) from the call tree we saw before, and repeating until no nodes are left. In other words, performing a topological sort.

With the example, if we have some array F where we can store our partial results:

  • F[0] = f(0) = 0
  • F[1] = f(1) = 1
  • F[2] = f(2) = f(1) + f(0) = F[1] + F[0] = 1 + 0 = 1
  • F[3] = f(3) = f(2) + f(1) = F[2] + F[1] = 1 + 1 = 2
  • F[4] = f(4) = f(3) + f(2) = F[3] + F[2] = 2 + 1 = 3
  • F[5] = f(5) = f(4) + f(3) = F[4] + F[3] = 3 + 2 = 5
  • F[6] = f(6) = f(5) + f(4) = F[5] + F[4] = 5 + 3 = 8

With this approach, we do not have any recursive call anymore. And that is dynamic programming.

It also forces us to think clearly about what information we will be storing. In fact, in the case of Fibonacci, we can notice that we only need the last two values. In other words:

  • F[0] = f(0) = 0
  • F[1] = f(1) = 1
  • F[2] = f(2) = previous + previous_previous = 1 + 0 = 1
  • F[3] = f(3) = previous + previous_previous = 1 + 1 = 2
  • F[4] = f(4) = previous + previous_previous = 2 + 1 = 3
  • F[5] = f(5) = previous + previous_previous = 3 + 2 = 5
  • F[6] = f(6) = previous + previous_previous = 5 + 3 = 8

So, we can discard other values and just keep two of them. Doing this in Python, we get:

def fibo(n):
    if n == 0:
        return 0
    previous_previous = 0
    previous = 1
    for _ in range(n - 1):
        current = previous_previous + previous
        (previous, previous_previous) = (current, previous)
    return previous

I like that this gives us a natural and systematic progression from the mathematical definition of the Fibonacci function, to the iterative implementation (not the optimal one, though).

Now, Fibonacci is more of a toy example. Let’s have a look at

Edit Distance

The edit distance between two strings is the smallest number of edits needed to transform one string into the other one.

There are actually several versions, depending on what you count as an “edit”. For instance, if you only allow replacing one character with another, you get the Hamming distance; evaluating the Hamming distance between two strings is algorithmically very simple, as shown below.
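For instance (a quick illustrative sketch; the Hamming distance is only defined for strings of equal length):

def hamming(A: str, B: str) -> int:
    assert len(A) == len(B)
    # count the positions where the two strings differ
    return sum(x != y for x, y in zip(A, B))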

Things become more interesting if you allow insertion and deletion of characters as well. This is the Levenshtein distance. Considering the title of the present article, this is of course something that can be solved efficiently using ✨ dynamic programming ✨.

To do that, we’ll need to find how we can derive a full solution from solutions to smaller problems. Let’s say we have two strings: A and B. We’ll write d(X, Y) for the edit distance between strings X and Y, and x for the length of string X. We need to formulate d(A, B) from any combination of d(X, Y) where X is a substring of A and Y a substring of B¹.

We’re going to look at a single character. We’ll use the last one. The first one would work just as well, but using a middle one would not be as convenient. So, let’s look at A[a - 1] and B[b - 1] (using zero-indexing). We have four cases:

  • A[a - 1] == B[b - 1], then we can ignore that character and look at the rest, so d(A, B) = d(A[0..a - 1], B[0..b - 1])
  • A[a - 1] != B[b - 1], then we could apply any of the three rules. Since we want the smallest number of edits, we’ll need to select the smallest value given by applying each rule:
    • substitute the last character of A by that of B, in which case d(A, B) = d(A[0..a - 1], B[0..b - 1]) + 1
    • delete the last character of A, in which case d(A, B) = d(A[0..a - 1], B) + 1
    • insert the last character of B, in which case d(A, B) = d(A, B[0..b - 1]) + 1
  • A is actually empty (a = 0), then we need to insert all characters from B², so d(A, B) = b
  • B is actually empty (b = 0), then we need to delete all characters from A, so d(A, B) = a

By translating this directly to Python, we get:

def levenstein(A: str, B: str) -> int:
    a = len(A)
    b = len(B)
    if a == 0:
        return b
    elif b == 0:
        return a
    elif A[a - 1] == B[b - 1]:
        return levenstein(A[:a - 1], B[:b - 1])
    else:
        return min([
            levenstein(A[:a - 1], B[:b - 1]) + 1,
            levenstein(A[:a - 1], B) + 1,
            levenstein(A, B[:b - 1]) + 1,
        ])


assert levenstein("", "puppy") == 5
assert levenstein("kitten", "sitting") == 3
assert levenstein("uninformed", "uniformed") == 1
# way too slow!
# assert levenstein("pneumonoultramicroscopicsilicovolcanoconiosis", "sisoinoconaclovociliscipocsorcimartluonomuenp") == 36

As hinted by the last test, this version becomes very slow when comparing long strings with many differences. In Fibonacci, we were doubling the number of instances at each level of the call tree; here, we are tripling it!

In Python, we can easily apply memoization:

from functools import cache

@cache
def levenstein(A: str, B: str) -> int:
    a = len(A)
    b = len(B)
    if a == 0:
        return b
    elif b == 0:
        return a
    elif A[a - 1] == B[b - 1]:
        return levenstein(A[:a - 1], B[:b - 1])
    else:
        return min([
            levenstein(A[:a - 1], B[:b - 1]) + 1,
            levenstein(A[:a - 1], B) + 1,
            levenstein(A, B[:b - 1]) + 1,
        ])


assert levenstein("", "puppy") == 5
assert levenstein("kitten", "sitting") == 3
assert levenstein("uninformed", "uniformed") == 1
# instantaneous!
assert levenstein("pneumonoultramicroscopicsilicovolcanoconiosis", "sisoinoconaclovociliscipocsorcimartluonomuenp") == 36

Now, there is something that makes the code nicer and more performant, but is not technically necessary. The trick is that we do not actually need to create new strings in our recursive function. We can just pass around the lengths of the substrings, and always refer to the original strings A and B. Then, our code becomes:

from functools import cache

def levenstein(A: str, B: str) -> int:
    @cache
    def aux(a: int, b: int) -> int:
        if a == 0:
            return b
        elif b == 0:
            return a
        elif A[a - 1] == B[b - 1]:
            return aux(a - 1, b - 1)
        else:
            return min([
                aux(a - 1, b - 1) + 1,
                aux(a - 1, b) + 1,
                aux(a, b - 1) + 1,
            ])
    return aux(len(A), len(B))


assert levenstein("", "puppy") == 5
assert levenstein("kitten", "sitting") == 3
assert levenstein("uninformed", "uniformed") == 1
# instantaneous!
assert levenstein("pneumonoultramicroscopicsilicovolcanoconiosis", "sisoinoconaclovociliscipocsorcimartluonomuenp") == 36

The next step is to build the cache ourselves:

def levenstein(A: str, B: str) -> int:
    # cache[a][b] = levenstein(A[:a], B[:b])
    # note the + 1 so that we can actually do cache[len(A)][len(B)]
    # the list comprehension ensures we create independent rows, not references to the same one
    cache = [[None] * (len(B) + 1) for _ in range(len(A) + 1)]
    def aux(a: int, b: int) -> int:
        if cache[a][b] is None:
            if a == 0:
                cache[a][b] = b
            elif b == 0:
                cache[a][b] = a
            elif A[a - 1] == B[b - 1]:
                cache[a][b] = aux(a - 1, b - 1)
            else:
                cache[a][b] = min([
                    aux(a - 1, b - 1) + 1,
                    aux(a - 1, b) + 1,
                    aux(a, b - 1) + 1,
                ])
        return cache[a][b]
    return aux(len(A), len(B))


assert levenstein("", "puppy") == 5
assert levenstein("kitten", "sitting") == 3
assert levenstein("uninformed", "uniformed") == 1
# instantaneous!
assert levenstein("pneumonoultramicroscopicsilicovolcanoconiosis", "sisoinoconaclovociliscipocsorcimartluonomuenp") == 36

The last thing we need to do is to replace the recursion with iteration. The important thing is to make sure we fill the cache in the right order³:

def levenstein(A: str, B: str) -> int:
    # cache[a][b] = levenstein(A[:a], B[:b])
    # note the + 1 so that we can actually do cache[len(A)][len(B)]
    # the list comprehension ensures we create independent rows, not references to the same one
    cache = [[None] * (len(B) + 1) for _ in range(len(A) + 1)]
    for a in range(0, len(A) + 1):
        for b in range(0, len(B) + 1):
            if a == 0:
                cache[a][b] = b
            elif b == 0:
                cache[a][b] = a
            elif A[a - 1] == B[b - 1]:
                # since we are at row a, we have already filled in row a - 1
                cache[a][b] = cache[a - 1][b - 1]
            else:
                cache[a][b] = min([
                    # since we are at row a, we have already filled in row a - 1
                    cache[a - 1][b - 1] + 1,
                    # since we are at row a, we have already filled in row a - 1
                    cache[a - 1][b] + 1,
                    # since we are at column b, we have already filled column b - 1
                    cache[a][b - 1] + 1,
                ])
    return cache[len(A)][len(B)]


assert levenstein("", "puppy") == 5
assert levenstein("kitten", "sitting") == 3
assert levenstein("uninformed", "uniformed") == 1
# instantaneous!
assert levenstein("pneumonoultramicroscopicsilicovolcanoconiosis", "sisoinoconaclovociliscipocsorcimartluonomuenp") == 36

Now, if you really want to grok dynamic programming, I invite you to try it yourself on the following problems, preferably in this order:

  1. longest common subsequence (not to be confused with longest common substring, but you can do that one too with dynamic programming)
  2. line wrap
  3. subset sum
  4. partition
  5. knapsack

Once you are comfortable with dynamic programming, Day 12 should become much less daunting!

Advent of Code, Day 12

In the Advent of Code of December 12th, 2023, you have to solve 1D nonograms. Rather than rephrasing the problem, I will let you read the official description.

.??..??...?##. 1,1,3

This can be solved by brute force. The proper technique for that is backtracking, another terrible name. But the asymptotic complexity is exponential (for n question marks, we have to evaluate 2^n potential solutions). Let’s see how it goes with this example:

  • .??..??...?##. 1,1,3 the first question mark could be either a . or a #; in the second case, we “consume” the first group of size 1, and the second question mark has to be a .
    1. ..?..??...?##. 1,1,3 the next question mark could be either a . or a #; in the second case, we “consume” the first group of size 1, and the next character has to be a ., which is the case
      1. .....??...?##. 1,1,3 the backtracking algorithm will continue to explore the 8 cases, but none of them is a valid solution
      2. ..#..??...?##. (1),1,3
        • and so on…
    2. .#...??...?##. (1),1,3
      • and so on…

There are 32 candidates, which would make 63 list items. I’ll spare you that. Instead, I want to draw your attention to items 1.2 and 2:

  • 1.2. ..#..??...?##. (1),1,3
  • 2. .#...??...?##. (1),1,3

They are extremely similar. In fact, if we discard the part that has already been accounted for, they are more like:

  • 1.2. .??...?##. 1,3
  • 2. ..??...?##. 1,3

There is an extra . on the second one, but we can clearly see that it is actually the same problem, and has the same solutions.

In other words, just like with Fibonacci, the total number of cases is huge, but many of them will just be repeats of other ones. So we are going to apply memoization. And then, dynamic programming.

When we implement the backtracking algorithm outlined above, we get something like this (not my code):

def count_arrangements(conditions, rules):
    if not rules:
        return 0 if "#" in conditions else 1
    if not conditions:
        return 1 if not rules else 0

    result = 0

    if conditions[0] in ".?":
        result += count_arrangements(conditions[1:], rules)
    if conditions[0] in "#?":
        if (
            rules[0] <= len(conditions)
            and "." not in conditions[: rules[0]]
            and (rules[0] == len(conditions) or conditions[rules[0]] != "#")
        ):
            result += count_arrangements(conditions[rules[0] + 1 :], rules[1:])

    return result

Note that the program above handles ? by treating it as both . and #. The first case is easy, but the second case needs to check that the group matches the next rule; and for that, it needs to check that a separator, or the end of the string, follows the group.

Since it’s Python, to memoize, we just need to add @cache.

To make it dynamic programming, we use the same trick as in the edit distance example: we pass the offset in the string and the offset in the rules as parameters of the recursion. This becomes:

from functools import cache

def count_arrangements(conditions, rules):
    @cache
    def aux(i, j):
        if not rules[j:]:
            return 0 if "#" in conditions[i:] else 1
        if not conditions[i:]:
            return 1 if not rules[j:] else 0

        result = 0

        if conditions[i] in ".?":
            result += aux(i + 1, j)
        if conditions[i] in "#?":
            if (
                rules[j] <= len(conditions[i:])
                and "." not in conditions[i:i + rules[j]]
                and (rules[j] == len(conditions[i:]) or conditions[i + rules[j]] != "#")
            ):
                result += aux(i + rules[j] + 1, j + 1)

        return result
    return aux(0, 0)

Then, we implement our own cache and fill it in the right order:

def count_arrangements(conditions, rules):
    cache = [[0] * (len(rules) + 1) for _ in range(len(conditions) + 1)]
    # note that we are in the indices in reverse order here
    for i in reversed(range(0, len(conditions) + 1)):
        for j in reversed(range(0, len(rules) + 1)):
            if not rules[j:]:
                result = 0 if "#" in conditions[i:] else 1
            elif not conditions[i:]:
                result = 1 if not rules[j:] else 0
            else:
                result = 0
                if conditions[i] in ".?":
                    # since we are at row i, we already filled in row i + 1
                    result += cache[i + 1][j]
                if conditions[i] in "#?":
                    if (
                        rules[j] <= len(conditions[i:])
                        and "." not in conditions[i:i + rules[j]]
                    ):
                        if rules[j] == len(conditions[i:]):
                            # since we are at row i, we already filled in row i + rules[j] > i
                            result += cache[i + rules[j]][j + 1]
                        elif conditions[i + rules[j]] != "#":
                            # since we are at row i, we already filled in row i + rules[j] + 1 > i
                            result += cache[i + rules[j] + 1][j + 1]
            cache[i][j] = result
    return cache[0][0]

And, voilà! You can also have a look at a Rust implementation (my code, this time).

Note: In this case, it looks like the dynamic programming version is slower than the memoized one. But that’s probably due to it being written in unoptimized Python.

Note: Independently from using a faster language and micro-optimizations, the dynamic programming version lets us see that we only need the previous column. Thus, we could replace the 2D array with two 1D arrays: one for the previous column, and one for the column being filled. See the sketch below.
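Here is what that could look like (a sketch with the same logic as the code above, just keeping two columns; I have not benchmarked it):

def count_arrangements(conditions, rules):
    n = len(conditions)
    prev = [0] * (n + 1)  # column j + 1
    for j in reversed(range(len(rules) + 1)):
        curr = [0] * (n + 1)  # column j, being filled
        for i in reversed(range(n + 1)):
            if not rules[j:]:
                result = 0 if "#" in conditions[i:] else 1
            elif i == n:
                # no characters left, but at least one rule remains
                result = 0
            else:
                result = 0
                if conditions[i] in ".?":
                    result += curr[i + 1]
                if conditions[i] in "#?":
                    if rules[j] <= n - i and "." not in conditions[i:i + rules[j]]:
                        if rules[j] == n - i:
                            result += prev[i + rules[j]]
                        elif conditions[i + rules[j]] != "#":
                            result += prev[i + rules[j] + 1]
            curr[i] = result
        prev = curr
    return prev[0]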

Conclusion

I’ll concede that dynamic programming is not trivial. But it is far from unreachable for most programmers. Understanding how to split a problem into smaller problems will let you use memoization in various contexts, which is already a huge improvement over a naive implementation.

However, mastering dynamic programming will let you understand a whole class of algorithms, better understand trade-offs, and make other optimizations possible. So, if you have not already done so, I strongly encourage you to practice on these problems:

  1. longest common subsequence (not to be confused with longest common substring, but you can do that one too with dynamic programming)
  2. line wrap
  3. subset sum
  4. partition
  5. knapsack

And don’t forget to benchmark and profile your code!


  1. Excluding, of course, d(A, B) itself ↩︎
  2. B could be empty as well, in which case we need to insert 0 characters ↩︎
  3. Note that we could permute the inner and outer loops, as shown below. In this case, it works just as well:
     for b in range(0, len(B) + 1):
         for a in range(0, len(A) + 1):
     ↩︎