The Christmas holidays are the perfect time of the year to shut oneself in, talk to no one, and pour days into staring at a computer screen.
So, naturally, I rewrote my longest-actually-running project in Rust 🦀.
The Challenge
Many Words
NHK Easier serves news written in simple Japanese for language learners. It has a couple of features to make the learning process easier. The most important, and most complex, shows the corresponding dictionary entry when hovering over a word.
The EDICT dictionary with proper names is 11 MB compressed with bzip2. We do not want to send that for a single Web page. Easy! We will only send the dictionary entries for the words that appear on the page.
The thing is, there are no spaces between words in Japanese. Let’s take the following sentence:
ねこがたべる
A Japanese speaker understands the words as:
- ねこ cat
- が follows the subject of the sentence
- たべる eat
So, they would read the sentence as “The cat eats”. However, even using only words found in a dictionary, a naive algorithm could split the sentence as:
- ね root
- こがた small-size
- べる bell
Of course, this interpretation does not make sense. But ruling it out would require the algorithm to understand Japanese syntax. And now we are doing NLP.
Instead, the safe approach is just to display all possibilities to the user. If the user hovers ね, we should suggest:
- ねこ cat
- ね root
If they hover こ, we should suggest:
- こがた small-size
- こ child
- こ counter for articles
In other words, the basic task is to iterate over all substrings, look up the corresponding dictionary entries, and send them to the user.
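A minimal sketch of that lookup, with a hypothetical four-entry dictionary standing in for EDICT:

```rust
use std::collections::HashMap;

// Find every substring of `sentence` that matches a dictionary entry.
// The dictionary here is a made-up miniature; the real one is EDICT.
fn lookup_substrings(sentence: &str) -> Vec<(String, String)> {
    let dict: HashMap<&str, &str> = HashMap::from([
        ("ねこ", "cat"),
        ("ね", "root"),
        ("こがた", "small-size"),
        ("べる", "bell"),
    ]);
    // Substrings must start and end on character boundaries, so we work
    // from the byte offset of each character.
    let mut starts: Vec<usize> = sentence.char_indices().map(|(i, _)| i).collect();
    starts.push(sentence.len());
    let mut hits = Vec::new();
    for (n, &start) in starts.iter().enumerate() {
        for &end in &starts[n + 1..] {
            if let Some(gloss) = dict.get(&sentence[start..end]) {
                hits.push((sentence[start..end].to_string(), gloss.to_string()));
            }
        }
    }
    hits
}

fn main() {
    for (word, gloss) in lookup_substrings("ねこがたべる") {
        println!("{word}: {gloss}");
    }
}
```

Note the quadratic number of candidate substrings: this is where the lookup counts below come from.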
But then, it gets worse.
Grammar
Despite what you may have been told, Japanese does have grammar and inflection. And it’s agglutinative. This means that “he had been made to eat” can become the single word “食べさせられていた”. To figure out that we should display the entry for “食べる” (eat) when the user hovers over this word, we need to apply “deinflection rules”.
Basically, a deinflection rule tells you to replace a suffix of the word with a shorter one. But several rules may apply, and you do not know which one is correct, so you just try them all. Then, you need to recursively deinflect these new words in turn. At some point, one of these words might match a dictionary entry. But you still need to check them all.
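The recursive search can be sketched like this, with a hypothetical, heavily simplified rule set (the real table is much larger and also tracks which rule applies to which verb class):

```rust
use std::collections::HashSet;

// Hypothetical deinflection rules (suffix → replacement).
const RULES: &[(&str, &str)] = &[
    ("ていた", "ている"),
    ("ている", "る"),
    ("られる", "る"),
    ("させる", "る"),
];

// Recursively apply every rule whose suffix matches, collecting every
// candidate base form (including the word itself).
fn deinflect(word: &str, candidates: &mut HashSet<String>) {
    if !candidates.insert(word.to_string()) {
        return; // already explored: avoids redundant work
    }
    for &(suffix, replacement) in RULES {
        if let Some(stem) = word.strip_suffix(suffix) {
            deinflect(&format!("{stem}{replacement}"), candidates);
        }
    }
}

fn main() {
    let mut candidates = HashSet::new();
    deinflect("食べさせられていた", &mut candidates);
    // Every candidate is then looked up in the dictionary;
    // only "食べる" (eat) actually matches an entry.
    println!("{} candidates", candidates.len());
}
```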
For a short article that ChatGPT 4 translates to 162 English words, we need to do 5,631 lookups. Of these, only 427 match at least one dictionary entry. In total, we output 679 dictionary entries for common nouns.
Early Solution
My first attempt was pretty straightforward. However, it used way too much memory. So I added an index: after a binary search, it gives me the offset of the entry in the EDICT file. This avoided creating multiple Python objects for each entry, and I relied on Linux file caching to avoid disk seeks.
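Such an offset index might look like the following sketch, assuming one entry per line with the headword first (the real EDICT format is richer): only (headword, byte offset) pairs live in memory, and the entry text stays in the file.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, Seek, SeekFrom};

// Scan the file once, recording the byte offset of each line.
fn build_index(path: &str) -> std::io::Result<Vec<(String, u64)>> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut index = Vec::new();
    let mut offset = 0u64;
    let mut line = String::new();
    loop {
        line.clear();
        let n = reader.read_line(&mut line)?;
        if n == 0 {
            break;
        }
        if let Some(word) = line.split_whitespace().next() {
            index.push((word.to_string(), offset));
        }
        offset += n as u64;
    }
    index.sort(); // so lookups can binary-search on the headword
    Ok(index)
}

// Binary-search the index, then seek into the file to read one entry.
fn lookup(path: &str, index: &[(String, u64)], word: &str) -> std::io::Result<Option<String>> {
    let Ok(i) = index.binary_search_by(|(w, _)| w.as_str().cmp(word)) else {
        return Ok(None);
    };
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(index[i].1))?;
    let mut line = String::new();
    BufReader::new(file).read_line(&mut line)?;
    Ok(Some(line.trim_end().to_string()))
}

fn main() -> std::io::Result<()> {
    // Hypothetical miniature file; the real one is the 11 MB EDICT.
    std::fs::write("mini_edict.txt", "ねこ cat\nこ child\nべる bell\n")?;
    let index = build_index("mini_edict.txt")?;
    println!("{:?}", lookup("mini_edict.txt", &index, "ねこ")?);
    Ok(())
}
```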
Running a retrospective benchmark tells me it took about 1.1 s to find the relevant entries for the previous story. It was so slow that I precomputed sub-dictionaries for all pages. At least, I was paying this cost once per page, and not once per view. But it meant higher complexity (generating the dictionaries and serving them), lower flexibility (updating all the sub-dictionaries or creating a new kind of page would be cumbersome), and extra storage (1.5 GB for the sub-dictionaries of 8,997 pages).
In the benchmark below, you can see that the latency without contention is about 80 ms, and the maximum throughput is less than 13 requests/second. Running parallel requests actually lowers it!
$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
Thread Stats Avg Stdev Max +/- Stdev
Latency 79.05ms 12.37ms 137.93ms 84.80%
…
Requests/sec: 106.21
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec: 7.99
Note: -H 'Connection: close' is required because of #488. Without it, wrk measures latency incorrectly. This actually caused all my initial benchmarks to be off by about 40 ms. I only learned about this because of a discrepancy when redoing the benchmarks for this article, where doing more work would take less time!
Note: I ran the benchmarks of this article on the index page as of 2024-01-12. It contains 4 stories, with 4 sub-dictionaries for common nouns, and 4 sub-dictionaries for proper names.
And that was the situation from 2018 to 2023.
Optimizing Python
Of course, the solution was Rust!
But, before I indulged in the pagan ritual of RIIR, I took a bit of time to properly optimize the Python version. This would give me a better point of comparison on the speedup achieved by switching languages.
Using profiling allowed me to identify various avenues of improvement:
- move the data back into an in-memory dictionary; it would use too much memory to run on my tiny VPS, but only speed mattered at that point
- deduplicate the candidates coming out of deinflection
- stop deinflecting everything twice, a bug I had missed because of poor function naming
These optimizations combined reduced the time needed to generate the entries from 1,200 ms per story to 24 ms (-98 %).
Going further, I used a trie to quickly look up the matching deinflection rules to apply (-50 %). I also kept only the relevant entries from the main dictionary (-15 %). I was basically down to about 10 ms per story. With 4 stories on the main page, I was at 310 ms to first byte, compared to 230 ms in the version with precomputed sub-dictionaries.
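A suffix trie of this kind can be sketched as follows: each rule suffix is stored reversed, so a single backwards walk over the word finds every applicable rule in one pass, instead of testing each rule separately. The two rules shown are a hypothetical, simplified set.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct TrieNode {
    children: HashMap<char, TrieNode>,
    replacement: Option<&'static str>, // set where a full suffix ends
}

// Insert each rule suffix into the trie, reversed.
fn build_trie(rules: &[(&'static str, &'static str)]) -> TrieNode {
    let mut root = TrieNode::default();
    for &(suffix, replacement) in rules {
        let mut node = &mut root;
        for c in suffix.chars().rev() {
            node = node.children.entry(c).or_default();
        }
        node.replacement = Some(replacement);
    }
    root
}

// Walk the word backwards, collecting (matched suffix length in chars,
// replacement) for every rule that applies.
fn matching_rules(trie: &TrieNode, word: &str) -> Vec<(usize, &'static str)> {
    let mut matches = Vec::new();
    let mut node = trie;
    for (depth, c) in word.chars().rev().enumerate() {
        match node.children.get(&c) {
            Some(child) => {
                if let Some(rep) = child.replacement {
                    matches.push((depth + 1, rep));
                }
                node = child;
            }
            None => break,
        }
    }
    matches
}

fn main() {
    let trie = build_trie(&[("ていた", "ている"), ("た", "る")]);
    // 食べていた ends with both た and ていた: one walk finds both rules.
    for (len, rep) in matching_rules(&trie, "食べていた") {
        println!("replace the last {len} characters with {rep}");
    }
}
```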
Optimizing Django
Thanks to benchmarking, I realized my page loads were atrocious either way. So I optimized them as well, to better isolate the cost of generating sub-dictionaries. line_profiler was very useful to identify hotspots in Django views:
- I ditched Django’s ORM query-builder for some complex queries, without changing the SQL sent to the database; just building those queries out of dynamic Python objects at every page load took a very significant amount of time
- I avoided issuing additional SQL queries to find a specific story when plain Python over the list of already-fetched results sufficed (using SQL for this was an oversight encouraged by how Python-like Django’s ORM feels)
- I used database indexes¹ (this actually yielded a smaller performance improvement than I expected)
- I avoided converting fields inside SQL queries
Running the benchmark again, we see that these changes brought the load time down to 9 ms, with most of that time spent generating HTML from Jinja2 templates, and some in SQL queries and Django’s ORM query builder. Even better, serving queries on 2 physical cores with hyperthreading increases the throughput to 124 requests/second!
$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
Thread Stats Avg Stdev Max +/- Stdev
Latency 8.94ms 2.07ms 34.97ms 90.65%
…
Requests/sec: 106.21
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec: 124.42
At that point, things are starting to look pretty good. However, integrating on-the-fly generation of sub-dictionaries in Python would get us back into meh territory:
$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
Thread Stats Avg Stdev Max +/- Stdev
Latency 94.42ms 7.79ms 141.33ms 90.48%
…
Requests/sec: 10.49
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec: 10.49
Notice in particular how the multithreaded benchmark does not improve throughput. This is because all threads need to access the full dictionary objects. And, with Python’s GIL, that means all threads must take turns.
So, let’s Rust!
A bit of Rust
First, I migrated the Python module that handled the generation of sub-dictionaries. I ported the code to Rust, keeping mostly the same logic, and compiled it as a native Python module, making the transition transparent to the rest of the project.
Simply writing in Rust already yielded significant benefits, with many low-level optimizations enabled by the type system, and with lifetimes reducing the number of allocations and deallocations.
The one thing I kept in mind was to avoid cloning objects unnecessarily. For instance, the module does not copy the entries from the main dictionary, but instead uses references as long as possible, until returning to Python. In fact, the whole file is kept as a single string, and all entries and keys (individual words) point into that string.
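The layout can be sketched like this (the names are mine, not the project’s, and the line format is simplified): the whole file stays in one String, and the index maps &str headwords to &str entry lines, all borrowing from that buffer.

```rust
use std::collections::HashMap;

// Everything in here borrows from the single String holding the file.
struct Index<'a> {
    entries: HashMap<&'a str, &'a str>, // headword → full entry line
}

fn build_index(text: &str) -> Index<'_> {
    let mut entries = HashMap::new();
    for line in text.lines() {
        if let Some(word) = line.split_whitespace().next() {
            // No allocation here: `word` and `line` are slices into `text`.
            entries.insert(word, line);
        }
    }
    Index { entries }
}

fn main() {
    // In the real module, this String holds the entire EDICT file.
    let text = String::from("ねこ cat\nこ child\n");
    let index = build_index(&text);
    println!("{}", index.entries["ねこ"]);
}
```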
Doing this reduced the time to generate the sub-dictionary to about 1 ms. Also, using Rust references everywhere instead of Python str objects reduced the memory footprint to 50 MB, making it viable in production.
$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
Thread Stats Avg Stdev Max +/- Stdev
Latency 22.17ms 2.51ms 32.16ms 71.95%
…
Requests/sec: 44.16
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec: 48.35
This is better. But notice that we are still not doing better with multiple threads. This is because the native module needs to release the GIL while it works. Once it does, we do see some improvement:
$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
Thread Stats Avg Stdev Max +/- Stdev
Latency 23.26ms 14.29ms 193.46ms 98.24%
…
Requests/sec: 44.57
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec: 76.25
However, calling this from Python still took about 13 ms out of the 23 ms needed to load the index page with 4 stories. With two sub-dictionaries per story at about 1 ms each, you would expect about 8 ms. But I did not investigate this overhead further, since I was planning to rewrite it all in Rust anyway.
All of the Rust
Having already used Axum to write a Web API, and Askama for templating, porting the Web service to Rust was unchallenging.
I did fight a bit with sqlx, however.
The two most popular ways to interact with a database in Rust are Diesel, which provides an ORM, and sqlx, which takes raw SQL. To me, the most important thing is being able to use static type checking to catch basic errors in the queries. Diesel achieves this with its query-builder; sqlx does that by verifying the SQL queries against a real database at compile time using PREPARE statements.
// with Diesel
diesel::update(dsl::users.find(user.id))
.set(dsl::name.eq(name))
.execute(con)
.unwrap();
// with sqlx
sqlx::query("UPDATE users SET name = $1 WHERE id = $2")
.bind(name)
.bind(user.id)
.execute(pool)
.await
.unwrap()
Using a query-builder adds a leaky layer of abstraction you need to fight with. This is especially frustrating when you already know the SQL query you want to write, but need to figure out how to make Diesel generate it for you. I already had experience with Diesel, so I opted to try sqlx, to see whether it would deliver on its promises, or whether querying a database at compile-time would add too much friction.
In fact, sqlx works mostly as described. It only becomes painful when you want to avoid making allocations for each field of each returned row! If you try to do it, you will actually lose type safety, since the queries won’t be checked anymore. This is an open issue.
$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:3000/
…
Thread Stats Avg Stdev Max +/- Stdev
Latency 9.25ms 1.22ms 22.02ms 88.10%
…
Requests/sec: 107.24
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:3000/
…
Requests/sec: 264.78
In any case, the Rust rewrite got rid of the overhead, reducing the total time to generate a page to 9 ms, virtually all of it spent generating sub-dictionaries!
Conclusion
Compiled languages with static typing have the reputation of being faster at the cost of being lower-level, involving more house-keeping, and having more gotchas.
However, in my experience, Rust actually feels high-level, almost Python-like! There are some things that can surprise a beginner, like the borrow checker, but this is usually easily solved with a .clone(). After all, using Python is pretty close to using Rc and .clone() everywhere, implicitly!
In addition, the migration can often be done piecemeal.
Without too much effort, Rust gives very significant performance improvements. The first lever to go further is minimizing allocations and data copies, which is much easier in Rust than in C or C++, thanks to references and lifetimes tracking everything for you.
There is definitely still room for improvement in my refactor, but it is more than enough to convince me that this was worth it!
1. You should very much use indexes. ↩︎