
Rewriting NHK Easier in Rust

The Christmas holidays are the perfect time of the year to shut oneself in, talk to no one, and pour days into staring at a computer screen.

So, naturally, I rewrote my longest-actually-running project in Rust 🦀.

The Challenge

Many Words

NHK Easier serves news written in simple Japanese for language learners. It has a couple of features to make the learning process easier. The most important and most complex one shows the corresponding dictionary entry when hovering over a word.

The EDICT dictionary with proper names is 11 MB compressed with bzip2. We do not want to send that for a single Web page. Easy! We will only send the dictionary entries for the words that appear on the page.

The thing is, there are no spaces between words in Japanese. Let’s take the following sentence:

ねこがたべる (nekogataberu)

A Japanese speaker understands the words as:

  • ねこ (neko): cat
  • が (ga): follows the subject of the sentence
  • たべる (taberu): eat

So, they would read the sentence as “The cat eats”. However, even if it only uses words out of a dictionary, a naive algorithm could split the sentence as:

  • ね (ne): root
  • こがた (kogata): small-size
  • べる (beru): bell

Of course, this interpretation does not make sense. But, to rule it out, the algorithm would need to understand Japanese syntax. And now we are doing NLP.

Instead, the safe approach is just to display all possibilities to the user. If the user hovers ね, we should suggest:

  • ねこ (neko): cat
  • ね (ne): root

If they hover こ, we should suggest:

  • こがた (kogata): small-size
  • こ (ko): child
  • こ (ko): counter for articles

In other words, the basic task is to iterate over all substrings, look up the corresponding dictionary entries, and send those to the user.
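In Python, the basic loop could look like this (a minimal sketch, assuming the dictionary has been loaded as a plain dict from words to entries, which is a simplification of the real data structures):

def entries_for(sentence, dictionary):
    """Collect the dictionary entries for every substring of the sentence."""
    entries = {}
    for start in range(len(sentence)):
        for end in range(start + 1, len(sentence) + 1):
            word = sentence[start:end]
            if word in dictionary:
                entries[word] = dictionary[word]
    return entries

For ねこがたべる, this picks up ねこ, ね, こがた, and every other substring that has an entry.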

But then, it gets worse.

Grammar

Despite what you may have been told, Japanese does have grammar and inflection. And it’s agglutinative. This means that “he has been forced to make him eat” could become the single word たべさせられていた (tabesaserareteita). To figure out that we should display the entry for たべる (taberu, “eat”) when the user hovers this word, we need to apply “deinflection rules”.

Basically, a deinflection rule tells you to replace a suffix of the word with a shorter one. But there are several rules, and you do not know which one is correct, so you just try them all. And then, you need to recursively deinflect the new words in turn. At some point, one of these words might match a dictionary entry. But you still need to check them all.
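Here is a minimal sketch of that process, assuming the rules are given as (suffix, replacement) pairs and that, as described above, each replacement is strictly shorter than its suffix (which guarantees termination):

def deinflect(word, rules):
    """Return the word plus all candidate base forms obtained by
    recursively applying deinflection rules to its suffix."""
    candidates = {word}
    for suffix, replacement in rules:
        if word.endswith(suffix):
            shorter = word[: len(word) - len(suffix)] + replacement
            # each new candidate must be deinflected in turn
            candidates |= deinflect(shorter, rules)
    return candidates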

For a short article that ChatGPT 4 translates to 162 English words, we need to do 5,631 lookups. Of these, only 427 match at least one dictionary entry. In total, we output 679 dictionary entries for common words.

Early Solution

My first attempt was pretty straightforward: load everything in a Python dictionary. However, this used way too much memory. So I added an index; after a binary search, it would give me the offset of the entry in the EDICT file. This avoided creating multiple Python objects for each entry, and I relied on Linux file caching to avoid disk seeks.
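Conceptually, the lookup worked something like this (a sketch; the exact index layout, a sorted list of headwords with parallel byte offsets, is an assumption for illustration):

import bisect

def lookup(word, keys, offsets, edict):
    # keys: sorted list of headwords; offsets[i]: byte offset of the
    # entry for keys[i] in the open EDICT file
    i = bisect.bisect_left(keys, word)
    if i < len(keys) and keys[i] == word:
        edict.seek(offsets[i])
        return edict.readline()
    return None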

Running a retrospective benchmark tells me it took about 1.1 s to find the relevant entries for the story mentioned above. It was so slow that I precomputed sub-dictionaries for all pages. At least, I was paying this cost once per page, and not once per view. But it meant higher complexity (generating the dictionaries and serving them), lower flexibility (updating all the sub-dictionaries or creating a new kind of page would be cumbersome), and extra storage (1.5 GB for the sub-dictionaries of 8,997 pages).

In the benchmark below, you can see that the latency without contention is about 80 ms, and the maximal throughput is less than 13 requests/second. Running parallel requests actually lowers it!

$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    79.05ms   12.37ms 137.93ms   84.80%
…
Requests/sec:     12.65
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec:      7.99

Note: -H 'Connection: close' is required because of wrk issue #488. Without it, wrk measures latency incorrectly. This actually caused all my initial benchmarks to be off by about 40 ms. I only learned about this because of a discrepancy when redoing the benchmarks for this article, where doing more work would take less time!

Note: I have run the benchmarks of this article on the index page of 2024-01-12. It contains 4 stories, with 4 sub-dictionaries for common words, and 4 sub-dictionaries for proper names.

And that was the situation from 2018 to 2023.

Optimizing Python

Of course, the solution was Rust!

But, before I indulged in the pagan ritual of RIIR, I took a bit of time to properly optimize the Python version. This would give me a better point of comparison for the speedup achieved by switching languages.

Using profiling allowed me to identify various avenues of improvement:

  • move the data back into a dictionary; it would use too much memory to run on my tiny VPS, but only speed mattered at that point
  • deduplicate the candidates coming out of deinflection
  • stop deinflecting everything twice, a bug I traced back to poor function naming

These optimizations combined reduced the time needed to generate the entries from 1,200 ms per story to 24 ms (-98%).

Going further, I used a trie to quickly look up the matching deinflection rules to apply (-50%), as sketched below. I also kept only the relevant entries from the main dictionary (-15%). I was basically down to about 10 ms per story. With 4 stories on the main page, I was at 310 ms to first byte, compared to 230 ms in the version with precomputed sub-dictionaries.
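For reference, the idea behind the trie lookup, sketched in Python (the representation of rules as (suffix, replacement) pairs is again an assumption):

def build_suffix_trie(rules):
    """Index the rules by the reversed characters of their suffix."""
    trie = {}
    for suffix, replacement in rules:
        node = trie
        for c in reversed(suffix):
            node = node.setdefault(c, {})
        node.setdefault("rules", []).append((suffix, replacement))
    return trie

def matching_rules(word, trie):
    """A single walk from the end of the word yields every applicable rule."""
    node = trie
    for c in reversed(word):
        if c not in node:
            return
        node = node[c]
        yield from node.get("rules", [])

Instead of testing every rule against every candidate word, we only pay for the rules that can actually apply.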

Optimizing Django

Thanks to benchmarking, I realized my page loads were atrocious either way. So I optimized them as well, to better see the effect of generating sub-dictionaries. The line_profiler package was very useful for identifying hotspots in Django views:

  • I ditched Django’s ORM query-builder for some complex queries, without changing the SQL sent to the database; just building that query out of dynamic Python objects at every page load took a very significant amount of time
  • I avoided issuing additional SQL queries to find a specific story when I could just use plain Python over the list of results (using SQL queries there was an oversight, encouraged by how Python-like Django’s ORM is)
  • I used indexes¹ (this actually yielded a smaller performance improvement than I expected)
  • I avoided converting fields in an SQL query

Running the benchmark again, we see that these changes brought the load time down to 9 ms, with most of that time spent generating HTML from Jinja2 templates, and some in SQL queries and Django’s ORM query builder. Even better, serving queries on 2 physical cores with hyperthreading increases the throughput to 124 requests/second!

$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.94ms    2.07ms  34.97ms   90.65%
…
Requests/sec:    106.21
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec:    124.42

At that point, things are starting to look pretty good. However, integrating on-the-fly generation of sub-dictionaries in Python would get us back into meh territory:

$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    94.42ms    7.79ms 141.33ms   90.48%
…
Requests/sec:     10.49
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec:     10.49

Notice in particular how the multithreaded benchmark does not improve throughput. This is because all threads need to access the full dictionary objects. And, with Python’s GIL, that means all threads must take their turn.

So, let’s Rust!

A bit of Rust

First, I migrated the Python module that generated the sub-dictionaries. I ported the code from Python to Rust, keeping mostly the same logic, and compiled it to a native Python module, making the transition transparent to the rest of the project.

Simply writing in Rust already yielded significant benefits, with many low-level optimizations enabled by the type system, and with lifetimes reducing the number of allocations and deallocations.

The one thing I kept in mind was to avoid cloning objects unnecessarily. For instance, the module does not copy the entries from the main dictionary, but instead uses references for as long as possible, until returning to Python. In fact, the whole file is kept as a single string, and all entries and keys (individual words) point into that string.

Doing this reduced the time to generate a sub-dictionary to about 1 ms. Also, using Rust references everywhere instead of Python strs reduced the memory footprint to 50 MB, making it viable in production.

$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    22.17ms    2.51ms  32.16ms   71.95%
…
Requests/sec:     44.16
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec:     48.35

This is better. But notice that we are still not doing better with multiple threads. This is because the Rust module still holds the GIL while it works; we need to release it. Once we do, we see some improvement:

$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    23.26ms   14.29ms 193.46ms   98.24%
…
Requests/sec:     44.57
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec:     76.25

However, calling this from Python still took about 13 ms out of the 23 ms needed to load the index page with 4 stories. With two sub-dictionaries per story, at about 1 ms each, you would expect about 8 ms. But I did not investigate this overhead further, since I was planning to rewrite it all in Rust anyway.

All of the Rust

Having already used Axum to write a Web API, and Askama for templating, porting the Web service to Rust was unchallenging.

I did fight a bit with sqlx, however.

The two most popular ways to interact with a database in Rust are Diesel, which provides an ORM, and sqlx, which takes raw SQL. To me, the most important thing is being able to use static type checking to catch basic errors in the queries. Diesel achieves this with its query-builder; sqlx does that by verifying the SQL queries against a real database at compile time using PREPARE statements.

// with Diesel
diesel::update(dsl::users.find(user.id))
    .set(dsl::name.eq(name))
    .execute(con)
    .unwrap();

// with sqlx
sqlx::query("UPDATE users SET name = $1 WHERE id = $2")
    .bind(name)
    .bind(user.id)
    .execute(pool)
    .await
    .unwrap();

Using a query-builder adds a leaky layer of abstraction you need to fight with. This is especially frustrating when you already know the SQL query you want to write, but need to figure out how to make Diesel generate it for you. I already had experience with Diesel, so I opted to try sqlx, to see whether it would deliver on its promises, or whether querying a database at compile-time would add too much friction.

In fact, sqlx works mostly as described. (Strictly speaking, it is the query! family of macros that checks queries at compile time; the plain query function shown above only checks types at runtime.) It only becomes painful when you want to avoid making an allocation for each field of each returned row! If you try to do that, you actually lose type safety, since the queries are not checked anymore. This is a known open issue.

$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:3000/
…
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.25ms    1.22ms  22.02ms   88.10%
…
Requests/sec:    107.24
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:3000/
…
Requests/sec:    264.78

In any case, the Rust rewrite got rid of the overhead, reducing the total time to generate a page to 9 ms, virtually all of it being spent on generating sub-dictionaries!

Result of profiling the Web service using samply. I only performed requests to the index page, which is handled by the archive route. We can see that 95% of its time is spent generating sub-dictionaries.

Conclusion

Compiled languages with static typing have the reputation of being faster at the cost of being lower-level, involving more house-keeping, and having more gotchas.

However, in my experience, Rust actually feels high-level, almost Python-like! There are some things that can surprise a beginner, like the borrow checker, but they are usually easily solved with a .clone(). After all, using Python is pretty close to using Rc and .clone() everywhere, implicitly!

In addition, the migration can often be done piecemeal.

Without too much effort, Rust gives very significant performance improvements. The main lever to go further is minimizing allocations and data copies, which is much easier in Rust than in C or C++, thanks to references and lifetimes that track everything for you.

In fact, looking at the reversed call stack of samply’s measurements, it looks like copying is still where I spend most of my time.

There is definitely still room for improvement in my refactor, but it is more than enough to convince me that this was worth it!


  1. You should very much use indexes. ↩︎

Dynamic Programming is not Black Magic

This year’s Advent of Code has been brutal (compare the stats of 2023 with those of 2022, especially day 1 part 1 vs. day 1 part 2). It included a problem best solved with dynamic programming as early as day 12, which discouraged some people I know. This specific problem was particularly gnarly for Advent of Code, with multiple special cases to take into account, making it basically intractable if you are not already familiar with dynamic programming.

However, dynamic programming itself is mostly natural when you understand what it does. And many common algorithms are actually just the application of dynamic programming to specific problems, including omnipresent path-finding algorithms such as Dijkstra’s algorithm.

This motivated me to write a gentler introduction and a detailed explanation of solving Day 12.

Let me start with a rant. You can skip to the following section if you want to get to the meat of the article.

The Rant

Software engineering is terrible at naming things.

Now, let’s take a look at “dynamic programming”. What can we learn from the name? “Programming” must refer to a style of programming, such as “functional programming”, or maybe “test-driven development”. And “dynamic” could mean any number of things.

Guess what: it means none of them, and it has nothing to do with being “dynamic”. It is an idea that you can use to design an algorithm, so there is a link to “programming”; I will grant it that.

Edit: There is a reason why it is named this way. When you look at the historical meaning of “programming”, the expression made sense. See niconiconi’s comment.

So, what is it?

Basic Caching

Let’s say we want to solve a problem by splitting it into smaller, similar problems. Basically, a recursive function. Often, we end up having to solve the same smaller problems many times.

The typical example is Fibonacci, where you want to evaluate f(n), which is defined as f(n - 1) + f(n - 2). If we implement it naively, we will end up evaluating f(1) many times:

Call tree for evaluating f(6), the 6th Fibonacci number. To evaluate f(6), we need to evaluate both f(5) and f(4). To evaluate f(5), we will need f(4) and f(3). Already, we see that we are going to need f(4) in two places. If we go further, we see that we will need f(1) eight times, which happens to be f(6).

In fact, in this case, since only f(1) adds anything to the overall result, the number of times we will need f(1) is equal to f(n). And f(n) grows very fast as n grows.

Of course, we can avoid doing this. We can just cache the results (or memoize f, in terrible academic vernacular).

In the example, once we have evaluated f(4) once, there is no need to evaluate it again, saving 3 evaluations of f(1). By doing the same for f(3) and f(2), we get down to 2 evaluations of f(1). In total, f(…) is evaluated 7 times (f(0), f(1), f(2), f(3), f(4), f(5), f(6)), which is just n + 1.
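In Python, this caching can be as simple as decorating the function (a minimal sketch; functools.cache does the bookkeeping for us):

from functools import cache

@cache
def f(n):
    # each distinct value of n is now computed only once
    if n < 2:
        return n
    return f(n - 1) + f(n - 2)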

This is theoretically (asymptotically) optimal. But we can look at this in a different way.

Optimized Caching

With memoization, we keep the recursion: “to solve f(6), I need f(5), which will itself need f(4) […] and f(3) […], and f(4), which will itself need f(3) […] and f(2) […]”. Basically, we figure out what we need just when we need it.

Instead, we can make the simple observation that we will need f(0) and f(1) for all other evaluations of f(…). Once we have them, we can evaluate f(2), which we will in turn need for all further evaluations of f(…).

You can think of it as plucking the leaves (the nodes without descendants) from the call tree we saw before, and repeating until no nodes are left. In other words, performing a topological sort.

With the example, if we have some array F where we can store our partial results:

  • F[0] = f(0) = 0
  • F[1] = f(1) = 1
  • F[2] = f(2) = f(1) + f(0) = F[1] + F[0] = 1 + 0 = 1
  • F[3] = f(3) = f(2) + f(1) = F[2] + F[1] = 1 + 1 = 2
  • F[4] = f(4) = f(3) + f(2) = F[3] + F[2] = 2 + 1 = 3
  • F[5] = f(5) = f(4) + f(3) = F[4] + F[3] = 3 + 2 = 5
  • F[6] = f(6) = f(5) + f(4) = F[5] + F[4] = 5 + 3 = 8

With this approach, we do not have any recursive call anymore. And that is dynamic programming.

It also forces us to think clearly about what information we will be storing. In fact, in the case of Fibonacci, we can notice that we only need the last two values. In other words:

  • F[0] = f(0) = 0
  • F[1] = f(1) = 1
  • F[2] = f(2) = previous + previous_previous = 1 + 0 = 1
  • F[3] = f(3) = previous + previous_previous = 1 + 1 = 2
  • F[4] = f(4) = previous + previous_previous = 2 + 1 = 3
  • F[5] = f(5) = previous + previous_previous = 3 + 2 = 5
  • F[6] = f(6) = previous + previous_previous = 5 + 3 = 8

So, we can discard other values and just keep two of them. Doing this in Python, we get:

def fibo(n):
    if n == 0:
        return 0
    previous_previous = 0
    previous = 1
    for _ in range(n - 1):
        current = previous_previous + previous
        (previous, previous_previous) = (current, previous)
    return previous

I like that this gives us a natural and systematic progression from the mathematical definition of the Fibonacci function, to the iterative implementation (not the optimal one, though).

Now, Fibonacci is more of a toy example. Let’s have a look at

Edit Distance

The edit distance between two strings is the smallest number of edits needed to transform one string into the other one.

There are actually several versions, depending on what you count as an “edit”. For instance, if you only allow replacing one character with another, you get the Hamming distance; evaluating the Hamming distance between two strings is algorithmically very simple, as shown below.
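For instance (a quick illustrative sketch; the Hamming distance is only defined for strings of equal length):

def hamming(A: str, B: str) -> int:
    assert len(A) == len(B)
    # count the positions where the two strings differ
    return sum(x != y for x, y in zip(A, B))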

Things become more interesting if you allow insertion and deletion of characters as well. This is the Levenshtein distance. Considering the title of the present article, this is of course something that can be solved efficiently using ✨ dynamic programming ✨.

To do that, we’ll need to find how we can derive a full solution from solutions to smaller problems. Let’s say we have two strings: A and B. We’ll write d(X, Y) for the edit distance between strings X and Y, and x for the length of string X. We need to formulate d(A, B) from any combination of d(X, Y) where X is a substring of A and Y a substring of B¹.

We’re going to look at a single character. We’ll use the last one. The first one would work just as well, but using a middle one would not be as convenient. So, let’s look at A[a - 1] and B[b - 1] (using zero-indexing). We have four cases:

  • A[a - 1] == B[b - 1], then we can ignore that character and look at the rest, so d(A, B) = d(A[0..a - 1], B[0..b - 1])
  • A[a - 1] != B[b - 1], then we could apply any of the three rules. Since we want the smallest number of edits, we’ll need to select the smallest value given by applying each rule:
    • substitute the last character of A by that of B, in which case d(A, B) = d(A[0..a - 1], B[0..b - 1]) + 1
    • delete the last character of A, in which case d(A, B) = d(A[0..a - 1], B) + 1
    • insert the last character of B, in which case d(A, B) = d(A, B[0..b - 1]) + 1
  • A is actually empty (a = 0), then we need to insert all characters from B², so d(A, B) = b
  • B is actually empty (b = 0), then we need to delete all characters from A, so d(A, B) = a

By translating this directly to Python, we get:

def levenstein(A: str, B: str) -> int:
    a = len(A)
    b = len(B)
    if a == 0:
        return b
    elif b == 0:
        return a
    elif A[a - 1] == B[b - 1]:
        return levenstein(A[:a - 1], B[:b - 1])
    else:
        return min([
            levenstein(A[:a - 1], B[:b - 1]) + 1,
            levenstein(A[:a - 1], B) + 1,
            levenstein(A, B[:b - 1]) + 1,
        ])


assert levenstein("", "puppy") == 5
assert levenstein("kitten", "sitting") == 3
assert levenstein("uninformed", "uniformed") == 1
# way too slow!
# assert levenstein("pneumonoultramicroscopicsilicovolcanoconiosis", "sisoinoconaclovociliscipocsorcimartluonomuenp") == 36

As hinted by the last test, this version becomes very slow when comparing long strings with many differences. In Fibonacci, we were doubling the number of instances at each level of the call tree; here, we are tripling it!

In Python, we can easily apply memoization:

from functools import cache

@cache
def levenstein(A: str, B: str) -> int:
    a = len(A)
    b = len(B)
    if a == 0:
        return b
    elif b == 0:
        return a
    elif A[a - 1] == B[b - 1]:
        return levenstein(A[:a - 1], B[:b - 1])
    else:
        return min([
            levenstein(A[:a - 1], B[:b - 1]) + 1,
            levenstein(A[:a - 1], B) + 1,
            levenstein(A, B[:b - 1]) + 1,
        ])


assert levenstein("", "puppy") == 5
assert levenstein("kitten", "sitting") == 3
assert levenstein("uninformed", "uniformed") == 1
# instantaneous!
assert levenstein("pneumonoultramicroscopicsilicovolcanoconiosis", "sisoinoconaclovociliscipocsorcimartluonomuenp") == 36

Now, there is something that makes the code nicer and more performant, but is not technically necessary. The trick is that we do not actually need to create new strings in our recursive function. We can just pass around the lengths of the substrings, and always refer to the original strings A and B. Then, our code becomes:

from functools import cache

def levenstein(A: str, B: str) -> int:
    @cache
    def aux(a: int, b: int) -> int:
        if a == 0:
            return b
        elif b == 0:
            return a
        elif A[a - 1] == B[b - 1]:
            return aux(a - 1, b - 1)
        else:
            return min([
                aux(a - 1, b - 1) + 1,
                aux(a - 1, b) + 1,
                aux(a, b - 1) + 1,
            ])
    return aux(len(A), len(B))


assert levenstein("", "puppy") == 5
assert levenstein("kitten", "sitting") == 3
assert levenstein("uninformed", "uniformed") == 1
# instantaneous!
assert levenstein("pneumonoultramicroscopicsilicovolcanoconiosis", "sisoinoconaclovociliscipocsorcimartluonomuenp") == 36

The next step is to build the cache ourselves:

def levenstein(A: str, B: str) -> int:
    # cache[a][b] = levenstein(A[:a], B[:b])
    # note the + 1 so that we can actually do cache[len(A)][len(B)]
    # the list comprehension ensures we create independent rows, not references to the same one
    cache = [[None] * (len(B) + 1) for _ in range(len(A) + 1)]
    def aux(a: int, b: int) -> int:
        if cache[a][b] is None:
            if a == 0:
                cache[a][b] = b
            elif b == 0:
                cache[a][b] = a
            elif A[a - 1] == B[b - 1]:
                cache[a][b] = aux(a - 1, b - 1)
            else:
                cache[a][b] = min([
                    aux(a - 1, b - 1) + 1,
                    aux(a - 1, b) + 1,
                    aux(a, b - 1) + 1,
                ])
        return cache[a][b]
    return aux(len(A), len(B))


assert levenstein("", "puppy") == 5
assert levenstein("kitten", "sitting") == 3
assert levenstein("uninformed", "uniformed") == 1
# instantaneous!
assert levenstein("pneumonoultramicroscopicsilicovolcanoconiosis", "sisoinoconaclovociliscipocsorcimartluonomuenp") == 36

The last thing we need to do is to replace the recursion with iteration. The important thing is to make sure we fill the cache in the right order³:

def levenstein(A: str, B: str) -> int:
    # cache[a][b] = levenstein(A[:a], B[:b])
    # note the + 1 so that we can actually do cache[len(A)][len(B)]
    # the list comprehension ensures we create independent rows, not references to the same one
    cache = [[None] * (len(B) + 1) for _ in range(len(A) + 1)]
    for a in range(0, len(A) + 1):
        for b in range(0, len(B) + 1):
            if a == 0:
                cache[a][b] = b
            elif b == 0:
                cache[a][b] = a
            elif A[a - 1] == B[b - 1]:
                # since we are at row a, we have already filled in row a - 1
                cache[a][b] = cache[a - 1][b - 1]
            else:
                cache[a][b] = min([
                    # since we are at row a, we have already filled in row a - 1
                    cache[a - 1][b - 1] + 1,
                    # since we are at row a, we have already filled in row a - 1
                    cache[a - 1][b] + 1,
                    # since we are at column b, we have already filled column b - 1
                    cache[a][b - 1] + 1,
                ])
    return cache[len(A)][len(B)]


assert levenstein("", "puppy") == 5
assert levenstein("kitten", "sitting") == 3
assert levenstein("uninformed", "uniformed") == 1
# instantaneous!
assert levenstein("pneumonoultramicroscopicsilicovolcanoconiosis", "sisoinoconaclovociliscipocsorcimartluonomuenp") == 36

Now, if you really want to grok dynamic programming, I invite you to try it yourself on the following problems, preferably in this order:

  1. longest common subsequence (not to be confused with longest common substring, but you can do that one too with dynamic programming)
  2. line wrap
  3. subset sum
  4. partition
  5. knapsack

Once you are comfortable with dynamic programming, Day 12 should become much less daunting!

Advent of Code, Day 12

In the Advent of Code of December 12th, 2023, you have to solve 1D nonograms. Rather than rephrasing the problem, I will let you read the official description.

.??..??...?##. 1,1,3

This can be solved by brute force. The proper technique for that is backtracking, another terrible name. But the asymptotic complexity is exponential (for n question marks, we have to evaluate 2^n potential solutions). Let’s see how it goes with this example:

  • .??..??...?##. 1,1,3 the first question mark could be either a . or a #; in the second case, we “consume” the first group of size 1, and the second question mark has to be a .
    1. ..?..??...?##. 1,1,3 the next question mark could be either a . or a #; in the second case, we “consume” the first group of size 1, and the next character has to be a ., which is the case
      1. .....??...?##. 1,1,3 the backtracking algorithm will continue to explore the 8 cases, but none of them is a valid solution
      2. ..#..??...?##. (1),1,3
        • and so on…
    2. .#...??...?##. (1),1,3
      • and so on…

There are 32 candidates, which would make 63 list items. I’ll spare you that. Instead, I want to draw your attention to items 1.2 and 2:

  • 1.2. ..#..??...?##. (1),1,3
  • 2. .#...??...?##. (1),1,3

They are extremely similar. In fact, if we discard the part that has already been accounted for, they are more like:

  • 1.2. .??...?##. 1,3
  • 2. ..??...?##. 1,3

There is an extra . on the second one, but we can clearly see that it is actually the same problem, and has the same solutions.

In other words, just like with Fibonacci, the total number of cases is huge, but many of them will just be repeats of other ones. So we are going to apply memoization. And then, dynamic programming.

When we implement the backtracking algorithm outlined above, we get something like this (not my code):

def count_arrangements(conditions, rules):
    if not rules:
        return 0 if "#" in conditions else 1
    if not conditions:
        return 1 if not rules else 0

    result = 0

    if conditions[0] in ".?":
        result += count_arrangements(conditions[1:], rules)
    if conditions[0] in "#?":
        if (
            rules[0] <= len(conditions)
            and "." not in conditions[: rules[0]]
            and (rules[0] == len(conditions) or conditions[rules[0]] != "#")
        ):
            result += count_arrangements(conditions[rules[0] + 1 :], rules[1:])

    return result

Note that the program above handles ? by treating it as both . and #. The first case is easy, but the second case needs to check that the group matches the next rule; and for that, it needs to check that a separator, or the end of the string, follows the group.

Since it’s Python, to memoize, we just need to add @cache.

To make it dynamic programming, we use the same trick as in the edit distance example: we pass the offset in the string and the offset in the rules as parameters of the recursion. This becomes:

from functools import cache

def count_arrangements(conditions, rules):
    @cache
    def aux(i, j):
        if not rules[j:]:
            return 0 if "#" in conditions[i:] else 1
        if not conditions[i:]:
            return 1 if not rules[j:] else 0

        result = 0

        if conditions[i] in ".?":
            result += aux(i + 1, j)
        if conditions[i] in "#?":
            if (
                rules[j] <= len(conditions[i:])
                and "." not in conditions[i:i + rules[j]]
                and (rules[j] == len(conditions[i:]) or conditions[i + rules[j]] != "#")
            ):
                result += aux(i + rules[j] + 1, j + 1)

        return result
    return aux(0, 0)

Then, we implement our own cache and fill it in the right order:

def count_arrangements(conditions, rules):
    cache = [[0] * (len(rules) + 1) for _ in range(len(conditions) + 1)]
    # note that we are in the indices in reverse order here
    for i in reversed(range(0, len(conditions) + 1)):
        for j in reversed(range(0, len(rules) + 1)):
            if not rules[j:]:
                result = 0 if "#" in conditions[i:] else 1
            elif not conditions[i:]:
                result = 1 if not rules[j:] else 0
            else:
                result = 0
                if conditions[i] in ".?":
                    # since we are at row i, we already filled in row i + 1
                    result += cache[i + 1][j]
                if conditions[i] in "#?":
                    if (
                        rules[j] <= len(conditions[i:])
                        and "." not in conditions[i:i + rules[j]]
                    ):
                        if rules[j] == len(conditions[i:]):
                            # since we are at row i, we already filled in row i + rules[j] > i
                            result += cache[i + rules[j]][j + 1]
                        elif conditions[i + rules[j]] != "#":
                            # since we are at row i, we already filled in row i + rules[j] + 1 > i
                            result += cache[i + rules[j] + 1][j + 1]
            cache[i][j] = result
    return cache[0][0]

And, voilà! You can also have a look at a Rust implementation (my code, this time).

Note: In this case, it looks like the dynamic programming version is slower than the memoized one. But that’s probably due to it being written in unoptimized Python.

Note: Independently from using a faster language and micro-optimizations, the dynamic programming version lets us see that we only need the previous column. Thus, we could replace the 2D array with two 1D arrays: one for the previous column, and one for the column being filled. See the sketch below.
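Here is what that could look like (a sketch with the same logic as the code above, just keeping two columns; I have not benchmarked it):

def count_arrangements(conditions, rules):
    n = len(conditions)
    prev = [0] * (n + 1)  # column j + 1
    for j in reversed(range(len(rules) + 1)):
        curr = [0] * (n + 1)  # column j, being filled
        for i in reversed(range(n + 1)):
            if not rules[j:]:
                result = 0 if "#" in conditions[i:] else 1
            elif i == n:
                # no characters left, but at least one rule remains
                result = 0
            else:
                result = 0
                if conditions[i] in ".?":
                    result += curr[i + 1]
                if conditions[i] in "#?":
                    if rules[j] <= n - i and "." not in conditions[i:i + rules[j]]:
                        if rules[j] == n - i:
                            result += prev[i + rules[j]]
                        elif conditions[i + rules[j]] != "#":
                            result += prev[i + rules[j] + 1]
            curr[i] = result
        prev = curr
    return prev[0]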

Conclusion

I’ll concede that dynamic programming is not trivial. But it is far from unreachable for most programmers. Understanding how to split a problem into smaller problems will let you use memoization in various contexts, which is already a huge improvement over a naive implementation.

However, mastering dynamic programming will let you understand a whole class of algorithms, better understand trade-offs, and make other optimizations possible. So, if you have not already done so, I strongly encourage you to practice on these problems:

  1. longest common subsequence (not to be confused with longest common substring, but you can do that one too with dynamic programming)
  2. line wrap
  3. subset sum
  4. partition
  5. knapsack

And don’t forget to benchmark and profile your code!


  1. Excluding, of course, d(A, B) itself ↩︎
  2. B could be empty as well, in which case we need to insert 0 characters ↩︎
  3. Note that we could permute the inner and outer loops, as shown below. In this case, it works just as well:
     for b in range(0, len(B) + 1):
         for a in range(0, len(A) + 1):
     ↩︎