Categories
Programming

ViHN: Vim for Hacker News

tl;dr: I made ViHN to read Hacker News without having to move my hands away from the keyboard. It’s freely available on Firefox and Chrome.

This is about yet another Hacker News extension. But other people with my obsession for killing the mouse might enjoy it.

Most of the content of this article is just an ancient1 talking about the good ol’ times. But you can jump directly to the description of the extension.

btw, I use Vim

I know that I am not the only one who had a keyboard-zealot phase.

When GNOME Shell happened, I was forced to go looking for a new window manager. I ended up installing Ratpoison. Like the more well-known i3, Ratpoison is a tiling window manager.

The peak of UI. Wait, that’s not GNOME Panel.

That means that, instead of floating windows you usually move around with the mouse, you split the screen into “tiles” however you like, and assign windows to them. By default, the screen is made of a single tile, so windows are full-screen. Then, you might split it in two halves, to show two windows side-by-side (for instance, a text editor and a Wikipedia page). It might seem pointless since you can also do this with a regular (“stacking”) window manager, but the important difference is the workflow.

With a regular window manager, positioning windows can feel somewhat limited. As a beginner, you might use the mouse to arrange some windows so that they display all the information you need in a convenient way. But that takes several seconds. And people like me do not have seconds to waste — we could be using them to read memes on Reddit instead.

Why would you use maximized windows, when you can remind yourself you are using Arch, btw

So we learn the keyboard shortcuts to arrange the current window in the left/right half of the screen. Unfortunately, there are times when you would like to show three windows at the same time. And then, you have to go back to using the mouse. Can you imagine the horror?

With a tiling window manager, we can use keyboard shortcuts to display however many windows we want, laid out however we want. Want to display 81 different windows in a 9×9 grid to make pixel art, er, maximize your productivity? Of course, just a few keystrokes.

The real benefit, however, is that no one else will be able to use your computer. You will have switched to another tiling window manager by the time they figure out the previous one.

Note: “Ratpoison” is a bit of a misnomer, though, since you can still use your mouse to open links in Firefox instead of Tab-Tab-Tab-Tab-Tab-Tab-Tab-Tab-Tab-Tab-Tab-Tab-Return like a normal person2. Because of Valve, you even have to click on terrorists, instead of just spotting the “g” in the middle of random punctuation. Thankfully, you can live in the TTY, read documents with fbi and watch videos with libcaca.

A goblin raid in Dwarf Fortress. They killed a dog!

Since then, I have strayed from the rightful path, and switched to xfce43. I even tried Visual Studio Code for a while (with Vim key bindings, of course), but I always come back to Vim.

Hacker News

When I was younger, and more innocent, I used to kill time by looking for script kiddie hideouts with advanced search keywords such as “hacker forum”. I would enjoy watching various people trying to impress each other with technical nonsense.

Yes, this is exactly what using a computer was like back then, with 3D interfaces in which you could fly, and everything4

At some point, I stumbled upon Hacker News, which is not actually a hideout for script kiddies5. It’s a place to promote great business ideas, discuss major security flaws underpinning infrastructure of geopolitical importance, and explore advanced technical topics. And not even half the content is about the latest technological fad6.

Hacker News, an actual forum for hackers

Unfortunately, it uses that quirky system where the discussion is organized in trees, and people vote for the most interesting comments. In older forums, you simply had to read the 3000-message-long thread to find the few insightful remarks. And with 20 messages per page, that would be barely 150 pages to read!

These fancy vote-based forums allow you to skip entire side-discussions of tangentially related concepts. It’s such a shame that you won’t have to learn about the proper way to repair drywall in an American house when you were just curious about the newest JavaScript framework.

In any case, the point is that reading a comment tree involves more than just scrolling down a linear thread. Instead, you might want to go back up to the parent, skip to the next sibling discussion, hide a sub-tree, or upvote or reply to a specific comment.

And as discussed above, I am not going to do all that hard work with a mouse pointer.

Procrastinating More Efficiently

At first, I just wrote myself some user script to use J, K, and other keys to navigate Hacker News comments, after having done something similar for SlateStarCodex. That was mostly good enough, but I was not as systematic with code versioning at the time, and the format did not particularly encourage it. So I eventually lost that script.

This led me to Refined Hacker News, which is actually pretty good. Unfortunately, some things are broken, and it can take a while to load. And since it is unmaintained, I did not have any hope of being able to fix the most annoying issue (the extension sometimes not working at all).

Navigating comments with Refined Hacker News

So, I finally made my own browser extension for Firefox and Chrome. It’s called “ViHN”. “Vi” as a shorthand for Vim7, and “HN” for “Hacker News”. Get it? I know, it’s very subtle.

It is primarily focused on keyboard-based navigation, introduces no visible delay, and minimizes surprises. Of course, it has many Vim-like key bindings to navigate comments (not just J and K). You can look at the README for the list, but I have also included it below to save you a mouse click if you are not using Vimium.

Also, I stole the idea from bakkot’s SlateStarComments to quickly find new comments. I never quite liked the idea of trying to remember what comments the user might have seen. This is easily confounded by reloading the page by mistake, not having the time to read everything at that particular time, not using the same browser or computer, and so on. Instead, ViHN shows a list of comments in reverse chronological order. You can go through that list (of course using key bindings), and stop when you get to those you already read.

Searching recent comments in late SlateStarCodex

Among other features, you can quick-reply to, edit, and delete comments, and preview comments when replying or editing; it also highlights the OP.

A less visible feature is that ViHN manages requests (votes, favorite, flag, persistent collapse) and spreads them out in time to avoid the drastic rate limiting of Hacker News. If a request results in a 503 error (non-standard “too many requests”), ViHN will retry it later. This materializes in the form of “…” instead of “favorite”/“flag” in links, and a spinning arrow for votes8.

But this would not be worthy of Hacker News if it were not used as an occasion to bike-shed over the choice of JavaScript framework. The extension is written in Vanilla JS. This enables maximal performance with minimum overhead and zero build time.

Full-featured Vanilla JS bundle

Key bindings

This section is reproduced from the README, and also available in the extension by pressing ? to toggle help.

Navigate Comments/Stories

| Key | Effect |
|-----|--------|
| j | Next comment/story |
| k | Previous comment/story |
| J | Next sibling comment |
| K | Previous sibling comment |
| g | Go to first story/comment |
| G | Go to last story, last root comment or “More” link |
| H | Focus on story/comment at the top of the screen (high) |
| M | Focus on story/comment in the middle of the screen |
| L | Focus on story/comment at the bottom of the screen (low) |
| n | Switch to Newest Items |
| h | Parent comment/story (see Follow Links) |
| p | Parent comment/story (see Follow Links) |

Note: You can also select an item by clicking in its bounding box.

Follow links

| Key | Effect |
|-----|--------|
| o | Open story link/comment |
| O | Open story link/comment in background |
| c | Open comment thread |
| C | Open comment thread in background |
| b | Open both story link and comment thread |
| B | Open both story link and comment thread in background |
| h | Follow “context” link (go to comment thread, but focus on current comment) |
| p | Follow “parent” link (go to parent’s page, and focus on parent comment/story) |
| 1 | Open 1st link in comment (hold Shift to open in background) |
| 9 | Open 9th link in comment (hold Shift to open in background) |
| 0 | Open 10th link in comment (hold Shift to open in background) |

Note: When on the “XXX more comments” link, you can hit any of [lLcC] to go to the next page of comments.

Note: The digits of the numeric keypad work as well to open links in comments. However, this can only open links in foreground.

Note: When using AZERTY, the key bindings to open links inside comments still work like in QWERTY. Hit the 1 key without shift (like typing &) to open the 1st link in foreground. Hit the 1 key with shift (like typing 1) to open the 1st link in background. Same for the other link numbers.

Actions

| Key | Effect |
|-----|--------|
| m | Collapse/uncollapse comment tree |
| u | Upvote story/comment, or cancel vote |
| d | Downvote story/comment, or cancel vote |
| f | Favorite/un-favorite story/comment of the current page |
| F | Flag/unflag story/comment of the current page |
| r | Comment on story, or reply to comment (with preview) |
| e | Edit comment (with preview) |
| D | Delete comment |
| Ctrl+Return | Submit current form |

Navigate Newest Items

In the Newest Items list, the following key bindings are available:

| Key | Effect |
|-----|--------|
| l | Show selected comment/story |
| j | Next comment/story |
| k | Previous comment/story |
| J | Jump 10 down |
| K | Jump 10 up |
| g | Go to top story/comment |
| G | Go to last story/comment |
| n | Switch back from Newest Items |
nSwitch back from Newest Items
  1. I mean, I was born in the previous millennium! ↩︎
  2. Of course, you could just use Vimium, but where is the fun in that? ↩︎
  3. While writing this article, I learned that Linus Torvalds also switched to Xfce4 at the time. So, maybe, I am not such a bad person. ↩︎
  4. No, it wasn’t. The still is from the marvelous Hackers movie. But we did have Packard Bell Navigator, which was just as good (no, it wasn’t). ↩︎
  5. Or maybe it is, and I have become one without even realizing. ↩︎
  6. I like making fun of Hacker News, but it is genuinely a pretty good place to keep oneself informed when you are into computers. If you want some more fun, read every Hacker News thread ever. ↩︎
  7. I only ever use Vim in practice, never vi, never ed, so I like to pretend it’s all the same. ↩︎
  8. There is no visible feedback for collapsing since that happens immediately, and the request can be handled asynchronously. ↩︎
Categories
Programming

Rewriting NHK Easier in Rust

The Christmas holidays are the perfect time of the year to shut oneself in, talk to no one, and pour days into staring at a computer screen.

So, naturally, I rewrote my longest-actually-running project in Rust 🦀.

The Challenge

Many Words

NHK Easier serves news written in simple Japanese for language learners. It has a couple of features to make the learning process easier. The most important, and most complex, one shows the corresponding dictionary entry when hovering over a word.

The EDICT dictionary with proper names is 11 MB compressed with bzip2. We do not want to send that for a single Web page. Easy! We will only send the dictionary entries for the words that appear on the page.

The thing is, there are no spaces between words in Japanese. Let’s take the following sentence:

ねこがたべる (nekogataberu)

A Japanese speaker understands the words as:

  • ねこ (neko): cat
  • が (ga): follows the subject of the sentence
  • たべる (taberu): to eat

So, they would read the sentence as “The cat eats”. However, even if it only uses words out of a dictionary, a naive algorithm could split the sentence as:

  • ね (ne): root
  • こがた (kogata): small-size
  • べる (beru): bell

Of course, this interpretation does not make sense. But, to rule it out, the algorithm would at least need to understand Japanese syntax. And now we are doing NLP.

Instead, the safe approach is just to display all possibilities to the user. If the user hovers ね, we should suggest:

  • ねこ (neko): cat
  • ね (ne): root

If they hover こ, we should suggest:

  • こがた (kogata): small-size
  • こ (ko): child
  • こ (ko): counter for articles

In other words, the basic task is to iterate over all substrings, look up the corresponding dictionary entries, and send them to the user.
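
In Python, a minimal sketch of that lookup could look like the following (edict here is assumed to be a plain dict from words to entries; the real data structures in NHK Easier are different):

def entries_for(text, edict, max_word_length=20):
    """Collect the dictionary entries for every substring of the page."""
    entries = {}
    for start in range(len(text)):
        # dictionary words are short, so we can bound the inner loop
        for end in range(start + 1, min(start + max_word_length, len(text)) + 1):
            word = text[start:end]
            if word in edict:
                entries[word] = edict[word]
    return entries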

But then, it gets worse.

Grammar

Despite what you may have been told, Japanese does have grammar and inflection. And it’s agglutinative. This means that “he has been forced to make him eat” could become the single word “tabesaserareteita”. To figure out the fact that we should display the entry for “taberu” (eat) when the user hovers this word, we need to apply “deinflection rules”.

Basically, a deinflection rule tells you to replace a suffix of the word by a shorter one. But there are several, and you do not know which one is correct, so you just try them all. And then, you need to recursively deinflect these new words in turn. At some point, one of these words might match a dictionary entry. But you still need to check them all.
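
As a rough Python sketch (the (suffix, replacement) rule format is made up for illustration; real deinflection rules also carry grammatical constraints):

def deinflect(word, rules):
    """Return the word plus every candidate obtained by recursively applying rules."""
    candidates = {word}
    for suffix, replacement in rules:
        # per the description above, each rule replaces a suffix with a shorter
        # one, which also guarantees that the recursion terminates
        if word.endswith(suffix):
            shorter = word[: -len(suffix)] + replacement
            candidates |= deinflect(shorter, rules)
    return candidates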

For a short article that ChatGPT 4 translates to 162 English words, we need to do 5,631 lookups. Of these, only 427 match at least one dictionary entry. In total, we output 679 dictionary entries for common names.

Early Solution

My first attempt was pretty straightforward. However, it used way too much memory. So I added an index; after a binary search, it would give me the offset of the entry in the EDICT file. This avoided creating multiple Python objects for each entry, and I relied on Linux file caching to avoid disk seeks.
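
Roughly, the idea in Python (the index layout is hypothetical: a sorted list of words and their byte offsets into the EDICT file):

import bisect

def lookup(word, keys, offsets, edict_file):
    """keys: sorted list of dictionary words; offsets[i]: byte offset of keys[i] in the EDICT file."""
    i = bisect.bisect_left(keys, word)
    if i == len(keys) or keys[i] != word:
        return None
    # seek into the raw file instead of keeping a Python object per entry;
    # Linux's page cache makes repeated reads cheap
    edict_file.seek(offsets[i])
    return edict_file.readline().rstrip("\n")  # one EDICT entry per line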

Running a retrospective benchmark tells me it took about 1.1 s to find the relevant entries for the story mentioned above. It was so slow that I precomputed sub-dictionaries for all pages. At least, I was paying this cost once per page, and not once per view. But it meant higher complexity (generating the dictionaries and serving them), lower flexibility (updating all the sub-dictionaries or creating a new kind of page would be cumbersome), and extra storage (1.5 GB for the sub-dictionaries of 8,997 pages).

In the benchmark below, you can see that the latency without contention is about 80 ms, and the maximal throughput is less than 13 requests/second. Running parallel requests actually lowers it!

$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    79.05ms   12.37ms 137.93ms   84.80%
…
Requests/sec:    106.21
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec:      7.99

Note: -H 'Connection: close' is required because of #488. Without it, wrk measures latency incorrectly. This actually caused all my initial benchmarks to be off by about 40 ms. I only learned about this because of a discrepancy when redoing the benchmarks for this article, where doing more work would take less time!

Note: I have run the benchmark of this article on the index page of 2024-01-12. It contains 4 stories, with 4 sub-dictionaries for common names, and 4 sub-dictionaries for proper names.

And that was the situation from 2018 to 2023.

Optimizing Python

Of course, the solution was Rust!

But, before I indulged in the pagan ritual of RIIR, I took a bit of time to properly optimize the Python version. This would give me a better point of comparison on the speedup achieved by switching languages.

Using profiling allowed me to identify various avenues of improvement:

  • move the data back into a dictionary; it would use too much memory to run on my tiny VPS, but only speed mattered at that point
  • deduplicate candidates from the deinflection
  • stop deinflecting everything twice, which I noticed I was doing due to poor function naming

These optimizations combined reduced the time needed to generate the entries from 1,200 ms per story to 24 ms (-98 %).

Going further, I used a trie to quickly look up the matching deinflection rules to apply (-50%). I also only kept relevant entries from the main dictionary (-15%). I was basically down to about 10 ms per story. With 4 stories on the main page, I was at 310 ms to first byte, compared to 230 ms in the version with precomputed sub-dictionaries.
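
For illustration, here is a minimal suffix trie in Python (again with the made-up (suffix, replacement) rule format): since deinflection rules match on word endings, indexing them by reversed suffix lets a single walk from the end of the word collect every applicable rule.

def build_suffix_trie(rules):
    """Index (suffix, replacement) rules by their suffix, read backwards."""
    trie = {}
    for suffix, replacement in rules:
        node = trie
        for char in reversed(suffix):
            node = node.setdefault(char, {})
        node.setdefault("rules", []).append((suffix, replacement))
    return trie

def matching_rules(trie, word):
    """Yield only the rules whose suffix actually ends the word."""
    node = trie
    for char in reversed(word):
        if char not in node:
            return
        node = node[char]
        yield from node.get("rules", [])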

Optimizing Django

Thanks to benchmarking, I realized my page loads were atrocious either way. So I optimized them as well, to better see the effect of generating sub-dictionaries. The line_profiler was very useful to identify hotspots in Django views:

  • I ditched Django’s ORM query-builder for some complex queries, without changing the SQL queries sent to the database; just building those queries using dynamic Python objects at every page load was taking a very significant amount of time
  • I avoided using additional SQL queries to search for a specific story, when I could just use plain Python over the list of results (using SQL queries there was an oversight stemming from how Python-like Django’s ORM feels)
  • I added database indexes1 (this actually yielded a smaller performance improvement than I expected)
  • I avoided converting fields in an SQL query

Running the benchmark again, we see that these brought down the load time to 9 ms, with most of that time spent generating HTML from Jinja2 templates, and some in SQL queries and Django’s ORM query builder. Even better, serving queries on 2 physical cores with hyperthreading allows us to increase the throughput to 124 requests/second!

$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.94ms    2.07ms  34.97ms   90.65%
…
Requests/sec:    106.21
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec:    124.42

At that point, things are starting to look pretty good. However, integrating on-the-fly generation of sub-dictionaries in Python would get us back into meh territory:

$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    94.42ms    7.79ms 141.33ms   90.48%
…
Requests/sec:     10.49
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec:     10.49

Notice in particular how the multithreaded benchmark does not improve throughput. This is because all threads need to access the full dictionary objects. And, with Python’s GIL, that means all threads must take their turn.

So, let’s Rust!

A bit of Rust

First, I ported the Python module that handled the generation of sub-dictionaries. I migrated the code from Python to Rust, keeping mostly the same logic and compiling it to a native Python module, making the transition transparent to the rest of the project.

Simply writing in Rust already yielded significant benefits, with many low-level optimizations enabled by the type system, and with lifetimes reducing the number of allocations and deallocations.

The one thing I kept in mind was to avoid cloning objects unnecessarily. For instance, the module does not copy the entries from the main dictionary, but instead uses references for as long as possible, until returning to Python. In fact, the whole file is kept as a single string, and all entries and keys (individual words) point into that string.

Doing this reduced the time to generate the sub-dictionary to about 1 ms. Also, using Rust references everywhere instead of Python strs reduced the memory footprint to 50 MB, making it viable in production.

$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    22.17ms    2.51ms  32.16ms   71.95%
…
Requests/sec:     44.16
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec:     48.35

This is better. But notice that we are still not doing better with multiple threads. This is because we have not released the GIL. Once we do, we see some improvement:

$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:8000/
…
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    23.26ms   14.29ms 193.46ms   98.24%
…
Requests/sec:     44.57
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:8000/
…
Requests/sec:     76.25

However, calling this from Python still took about 13 ms out of the 23 ms to load the index page with 4 stories. With two sub-dictionaries per story, you would expect about 8 ms. But I did not investigate this overhead further, since I was planning to rewrite it all in Rust anyway.

All of the Rust

Having already used Axum to write a Web API, and Askama for templating, porting the Web service to Rust was unchallenging.

I did fight a bit with sqlx, however.

The two most popular ways to interact with a database in Rust are Diesel, which provides an ORM, and sqlx, which takes raw SQL. To me, the most important thing is being able to use static type checking to catch basic errors in the queries. Diesel achieves this with its query-builder; sqlx does that by verifying the SQL queries against a real database at compile time using PREPARE statements.

// with Diesel
diesel::update(dsl::users.find(user.id))
    .set(dsl::name.eq(name))
    .execute(con)
    .unwrap();

// with sqlx
sqlx::query("UPDATE users SET name = $1 WHERE id = $2")
    .bind(name)
    .bind(user.id)
    .execute(pool)
    .await
    .unwrap()

Using a query-builder adds a leaky layer of abstraction you need to fight with. This is especially frustrating when you already know the SQL query you want to write, but need to figure out how to make Diesel generate it for you. I already had experience with Diesel, so I opted to try sqlx, to see whether it would deliver on its promises, or whether querying a database at compile-time would add too much friction.

In fact, sqlx works mostly as described. It only becomes painful when you want to avoid making allocations for each field of each returned row! If you try to do it, you will actually lose type safety, since the queries won’t be checked anymore. This is an open issue.

$ wrk -c1 -t1 -H 'Connection: close' http://127.0.0.1:3000/
…
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.25ms    1.22ms  22.02ms   88.10%
…
Requests/sec:    107.24
…
$ wrk -c4 -t4 -H 'Connection: close' http://127.0.0.1:3000/
…
Requests/sec:    264.78

In any case, the Rust rewrite got rid of the overhead, reducing the total time to generate a page to 9 ms, virtually all of it being spent on generating sub-dictionaries!

Result of profiling the Web service using samply. I only performed requests to the index page, which is handled by the archive route. We can see that 95% of its time is spent generating sub-dictionaries.

Conclusion

Compiled languages with static typing have the reputation of being faster at the cost of being lower-level, involving more house-keeping, and having more gotchas.

However, in my experience, Rust actually feels high-level, almost Python-like! There are some things that can surprise a beginner, like the borrow checker, but this is usually easily solved with a .clone(). After all, using Python is pretty close to using Rc and .clone() everywhere, implicitly!

In addition, the migration can often be done piecemeal.

Without too much effort, Rust gives very significant performance improvements. The main lever to go further is minimizing allocations and data copies, which is much easier in Rust than in C or C++, thanks to references and lifetimes tracking everything for you.

In fact, looking at the reversed call stack of samply’s measurements, it looks like copying is still where most of the time is spent:

There is definitely still room for improvement in my refactor, but it is more than enough to convince me that this was worth it!


  1. You should very much use indexes. ↩︎
Categories
Competitive Programming, Computer Science, Programming

Dynamic Programming is not Black Magic

This year’s Advent of Code has been brutal (compare the stats of 2023 with those of 2022, especially day 1 part 1 vs. day 1 part 2). It included a problem to solve with dynamic programming as soon as day 12, which discouraged some people I know. This specific problem was particularly gnarly for Advent of Code, with multiple special cases to take into account, making it basically intractable if you are not already familiar with dynamic programming.

However, dynamic programming itself is mostly natural when you understand what it does. And many common algorithms are actually just the application of dynamic programming to specific problems, including omnipresent path-finding algorithms such as Dijkstra’s algorithm.

This motivated me to write a gentler introduction and a detailed explanation of solving Day 12.

Let me start with a rant. You can skip to the following section if you want to get to the meat of the article.

The Rant

Software engineering is terrible at naming things.

Now, let’s take a look at “dynamic programming”. What can we learn from the name? “Programming” must refer to a style of programming, such as “functional programming”, or maybe “test-driven development”. Then, “dynamic” could mean… something that changes at runtime, maybe?

Guess what. It means nothing of the sort, and it has nothing to do with being “dynamic”. It is an idea that you can use to design an algorithm, so there is a link to “programming”; I will grant it that.

Edit: There is a reason why it is named this way. When you look at the historical meaning of “programming”, the expression made sense. See niconiconi’s comment.

So, what is it?

Basic Caching

Let’s say we want to solve a problem by splitting it into smaller, similar problems. Basically, a recursive function. Often, we end up having to solve the same smaller problems many times.

The typical example is Fibonacci, where you want to evaluate f(n), which is defined as f(n - 1) + f(n - 2). If we implement it naively, we will end up evaluating f(1) many times:

Call tree for evaluating f(6), the 6th Fibonacci number. To evaluate f(6), we need to evaluate both f(5) and f(4). To evaluate f(5), we will need f(4) and f(3). Already, we see that we are going to need f(4) in two places. If we go further, we see that we will need f(1) eight times, which happens to be f(6).

In fact, in this case, since only f(1) adds anything to the overall result, the number of times we will need f(1) is equal to f(n). And f(n) grows very fast as n grows.

Of course, we can avoid doing this. We can just cache the results (or memoize f, in terrible academic vernacular).
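
In Python, for instance, memoizing f is a one-line change with functools.cache:

from functools import cache

@cache
def f(n):
    # base cases: f(0) = 0 and f(1) = 1
    if n < 2:
        return n
    # each value is now computed once, then served from the cache
    return f(n - 1) + f(n - 2)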

In the example, once we have evaluated f(4) once, there is no need to evaluate it again, saving 3 evaluations of f(1). By doing the same for f(3) and f(2), we get down to 2 evaluations of f(1). In total, f(…) is evaluated 7 times (f(0), f(1), f(2), f(3), f(4), f(5), f(6)), which is just n + 1.

This is theoretically (asymptotically) optimal. But we can look at this in a different way.

Optimized Caching

With memoization, we keep the recursion: “to solve f(6), I need f(5), which will itself need f(4) […] and f(3) […], and f(4), which will itself need f(3) […] and f(2) […]”. Basically, we figure out what we need just when we need it.

Instead, we can make the simple observation that we will need f(0) and f(1) for all other evaluations of f(…). Once we have them, we can evaluate f(2), which we will in turn need for all the remaining evaluations of f(…).

You can think of it as plucking the leaves (the nodes without descendants) from the call tree we saw before, and repeating until there are no nodes left. In other words, performing a topological sort.

With the example, if we have some array F where we can store our partial results:

  • F[0] = f(0) = 0
  • F[1] = f(1) = 1
  • F[2] = f(2) = f(1) + f(0) = F[1] + F[0] = 1 + 0 = 1
  • F[3] = f(3) = f(2) + f(1) = F[2] + F[1] = 1 + 1 = 2
  • F[4] = f(4) = f(3) + f(2) = F[3] + F[2] = 2 + 1 = 3
  • F[5] = f(5) = f(4) + f(3) = F[4] + F[3] = 3 + 2 = 5
  • F[6] = f(6) = f(5) + f(4) = F[5] + F[4] = 5 + 3 = 8

With this approach, we do not have any recursive call anymore. And that is dynamic programming.

It also forces us to think clearly about what information we will be storing. In fact, in the case of Fibonacci, we can notice that we only need the last two values. In other words:

  • F[0] = f(0) = 0
  • F[1] = f(1) = 1
  • F[2] = f(2) = previous + previous_previous = 1 + 0 = 1
  • F[3] = f(3) = previous + previous_previous = 1 + 1 = 2
  • F[4] = f(4) = previous + previous_previous = 2 + 1 = 3
  • F[5] = f(5) = previous + previous_previous = 3 + 2 = 5
  • F[6] = f(6) = previous + previous_previous = 5 + 3 = 8

So, we can discard other values and just keep two of them. Doing this in Python, we get:

def fibo(n):
    if n == 0:
        return 0
    previous_previous = 0
    previous = 1
    for _ in range(n - 1):
        current = previous_previous + previous
        (previous, previous_previous) = (current, previous)
    return previous

I like that this gives us a natural and systematic progression from the mathematical definition of the Fibonacci function, to the iterative implementation (not the optimal one, though).

Now, Fibonacci is more of a toy example. Let’s have a look at

Edit Distance

The edit distance between two strings is the smallest number of edits needed to transform one string into the other one.

There are actually several versions, depending on what you count as an “edit”. For instance, if you only allow replacing a character by another, you get Hamming distance; evaluating the Hamming distance between two strings is algorithmically very simple.

Things become more interesting if you allow insertion and deletion of characters as well. This is the Levenshtein distance. Considering the title of the present article, this is of course something that can be solved efficiently using ✨ dynamic programming ✨.

To do that, we’ll need to find how we can derive a full solution from solutions to smaller problems. Let’s say we have two strings: A and B. We’ll note d(X, Y) the edit distance between strings X and Y, and we’ll note x the length of string X. We need to formulate d(A, B) from any combination of d(X, Y) where X is a substring of A and Y a substring of B1.

We’re going to look at a single character. We’ll use the last one. The first one would work just as well but using a middle one would not be as convenient. So, let’s look at A[a – 1] and B[b – 1] (using zero-indexing). We have four cases:

  • A[a - 1] == B[b - 1], then we can ignore that character and look at the rest, so d(A, B) = d(A[0..a - 1], B[0..b - 1])
  • A[a - 1] != B[b - 1], then we could apply any of the three rules. Since we want the smallest number of edits, we’ll need to select the smallest value given by applying each rule:
    • substitute the last character of A by that of B, in which case d(A, B) = d(A[0..a - 1], B[0..b - 1]) + 1
    • delete the last character of A, in which case d(A, B) = d(A[0..a - 1], B) + 1
    • insert the last character of B, in which case d(A, B) = d(A, B[0..b - 1]) + 1
  • A is actually empty (a = 0), then we need to insert all characters from B2, so d(A, B) = b
  • B is actually empty (b = 0), then we need to delete all characters from A, so d(A, B) = a

By translating this directly to Python, we get:

def levenstein(A: str, B: str) -> int:
    a = len(A)
    b = len(B)
    if a == 0:
        return b
    elif b == 0:
        return a
    elif A[a - 1] == B[b - 1]:
        return levenstein(A[:a - 1], B[:b - 1])
    else:
        return min([
            levenstein(A[:a - 1], B[:b - 1]) + 1,
            levenstein(A[:a - 1], B) + 1,
            levenstein(A, B[:b - 1]) + 1,
        ])


assert levenstein("", "puppy") == 5
assert levenstein("kitten", "sitting") == 3
assert levenstein("uninformed", "uniformed") == 1
# way too slow!
# assert levenstein("pneumonoultramicroscopicsilicovolcanoconiosis", "sisoinoconaclovociliscipocsorcimartluonomuenp") == 36

As hinted by the last test, this version becomes very slow when comparing long strings with lots of differences. With Fibonacci, we were doubling the number of instances for each level in the call tree; here, we are tripling it!

In Python, we can easily apply memoization:

from functools import cache

@cache
def levenstein(A: str, B: str) -> int:
    a = len(A)
    b = len(B)
    if a == 0:
        return b
    elif b == 0:
        return a
    elif A[a - 1] == B[b - 1]:
        return levenstein(A[:a - 1], B[:b - 1])
    else:
        return min([
            levenstein(A[:a - 1], B[:b - 1]) + 1,
            levenstein(A[:a - 1], B) + 1,
            levenstein(A, B[:b - 1]) + 1,
        ])


assert levenstein("", "puppy") == 5
assert levenstein("kitten", "sitting") == 3
assert levenstein("uninformed", "uniformed") == 1
# instantaneous!
assert levenstein("pneumonoultramicroscopicsilicovolcanoconiosis", "sisoinoconaclovociliscipocsorcimartluonomuenp") == 36

Now, there is something that makes the code nicer and more performant, but it is not technically necessary. The trick is that we do not actually need to create new strings in our recursive function. We can just pass around the lengths of the substrings, and always refer to the original strings A and B. Then, our code becomes:

from functools import cache

def levenstein(A: str, B: str) -> int:
    @cache
    def aux(a: int, b: int) -> int:
        if a == 0:
            return b
        elif b == 0:
            return a
        elif A[a - 1] == B[b - 1]:
            return aux(a - 1, b - 1)
        else:
            return min([
                aux(a - 1, b - 1) + 1,
                aux(a - 1, b) + 1,
                aux(a, b - 1) + 1,
            ])
    return aux(len(A), len(B))


assert levenstein("", "puppy") == 5
assert levenstein("kitten", "sitting") == 3
assert levenstein("uninformed", "uniformed") == 1
# instantaneous!
assert levenstein("pneumonoultramicroscopicsilicovolcanoconiosis", "sisoinoconaclovociliscipocsorcimartluonomuenp") == 36

The next step is to build the cache ourselves:

def levenstein(A: str, B: str) -> int:
    # cache[a][b] = levenstein(A[:a], B[:b])
    # note the + 1 so that we can actually do cache[len(A)][len(B)]
    # the list comprehension ensures we create independent rows, not references to the same one
    cache = [[None] * (len(B) + 1) for _ in range(len(A) + 1)]
    def aux(a: int, b: int) -> int:
        if cache[a][b] is None:
            if a == 0:
                cache[a][b] = b
            elif b == 0:
                cache[a][b] = a
            elif A[a - 1] == B[b - 1]:
                cache[a][b] = aux(a - 1, b - 1)
            else:
                cache[a][b] = min([
                    aux(a - 1, b - 1) + 1,
                    aux(a - 1, b) + 1,
                    aux(a, b - 1) + 1,
                ])
        return cache[a][b]
    return aux(len(A), len(B))


assert levenstein("", "puppy") == 5
assert levenstein("kitten", "sitting") == 3
assert levenstein("uninformed", "uniformed") == 1
# instantaneous!
assert levenstein("pneumonoultramicroscopicsilicovolcanoconiosis", "sisoinoconaclovociliscipocsorcimartluonomuenp") == 36

The last thing we need to do is to replace the recursion with iteration. The important thing is to make sure we do that in the right order3:

def levenstein(A: str, B: str) -> int:
    # cache[a][b] = levenstein(A[:a], B[:b])
    # note the + 1 so that we can actually do cache[len(A)][len(B)]
    # the list comprehension ensures we create independent rows, not references to the same one
    cache = [[None] * (len(B) + 1) for _ in range(len(A) + 1)]
    for a in range(0, len(A) + 1):
        for b in range(0, len(B) + 1):
            if a == 0:
                cache[a][b] = b
            elif b == 0:
                cache[a][b] = a
            elif A[a - 1] == B[b - 1]:
                # since we are at row a, we have already filled in row a - 1
                cache[a][b] = cache[a - 1][b - 1]
            else:
                cache[a][b] = min([
                    # since we are at row a, we have already filled in row a - 1
                    cache[a - 1][b - 1] + 1,
                    # since we are at row a, we have already filled in row a - 1
                    cache[a - 1][b] + 1,
                    # since we are at column b, we have already filled column b - 1
                    cache[a][b - 1] + 1,
                ])
    return cache[len(A)][len(B)]


assert levenstein("", "puppy") == 5
assert levenstein("kitten", "sitting") == 3
assert levenstein("uninformed", "uniformed") == 1
# instantaneous!
assert levenstein("pneumonoultramicroscopicsilicovolcanoconiosis", "sisoinoconaclovociliscipocsorcimartluonomuenp") == 36

Now, if you really want to grok dynamic programming, I invite you to try it yourself on the following problems, preferably in this order:

  1. longest common subsequence (not to be confused with longest common substring, but you can do that one too with dynamic programming)
  2. line wrap
  3. subset sum
  4. partition
  5. knapsack

Once you are comfortable with dynamic programming, Day 12 should become much less daunting!

Advent of Code, Day 12

In the Advent of Code of December 12th, 2023, you have to solve 1D nonograms. Rather than rephrasing the problem, I will let you read the official description.

.??..??...?##. 1,1,3

This can be solved by brute force. The proper technique for that is backtracking, another terrible name. But the asymptotic complexity is exponential (for n question marks, we have to evaluate 2^n potential solutions). Let’s see how it goes with this example:

  • .??..??...?##. 1,1,3 the first question mark could be either a . or a #; in the second case, we “consume” the first group of size 1, and the second question mark has to be a .
    1. ..?..??...?##. 1,1,3 the next question mark could be either a . or a #; in the second case, we “consume” the first group of size 1, and the next character has to be a ., which is the case
      1. .....??...?##. 1,1,3 the backtracking algorithm will continue to explore the 8 cases, but none of them is a valid solution
      2. ..#..??...?##. (1),1,3
        • and so on…
    2. .#...??...?##. (1),1,3
      • and so on…

There are 32 candidates, which would make 63 list items. I’ll spare you that. Instead, I want to draw your attention to items 1.2 and 2:

  • 1.2. ..#..??...?##. (1),1,3
  • 2. .#...??...?##. (1),1,3

They are extremely similar. In fact, if we discard the part that has already been accounted for, they are more like:

  • 1.2. .??...?##. 1,3
  • 2. ..??...?##. 1,3

There is an extra . on the second one, but we can clearly see that it is actually the same problem, and has the same solutions.

In other words, just like with Fibonacci, the total number of cases is huge, but many of them will just be repeats of other ones. So we are going to apply memoization. And then, dynamic programming.

When we implement the “backtracking” algorithm outlined above, we get something like this (not my code):

def count_arrangements(conditions, rules):
    if not rules:
        return 0 if "#" in conditions else 1
    if not conditions:
        return 1 if not rules else 0

    result = 0

    if conditions[0] in ".?":
        result += count_arrangements(conditions[1:], rules)
    if conditions[0] in "#?":
        if (
            rules[0] <= len(conditions)
            and "." not in conditions[: rules[0]]
            and (rules[0] == len(conditions) or conditions[rules[0]] != "#")
        ):
            result += count_arrangements(conditions[rules[0] + 1 :], rules[1:])

    return result

Note that the program above handles ? by treating it as both . and #. The first case is easy, but the second case needs to check that it matches the next rule; and, for that, it needs to check that there is a separator afterwards, or the end of the string.

Since it’s Python, to memoize, we just need to add @cache.
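
Concretely, the only catch is that the arguments must be hashable for @cache, so rules has to be passed as a tuple rather than a list:

from functools import cache

@cache
def count_arrangements(conditions, rules):
    # same body as above; conditions is a str and rules a tuple, both hashable
    if not rules:
        return 0 if "#" in conditions else 1
    if not conditions:
        return 1 if not rules else 0

    result = 0

    if conditions[0] in ".?":
        result += count_arrangements(conditions[1:], rules)
    if conditions[0] in "#?":
        if (
            rules[0] <= len(conditions)
            and "." not in conditions[: rules[0]]
            and (rules[0] == len(conditions) or conditions[rules[0]] != "#")
        ):
            result += count_arrangements(conditions[rules[0] + 1 :], rules[1:])

    return result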

To make it dynamic programming, we use the same trick as in the example of the edit distance: we pass the offset in the string, and the offset in the rules, as parameters in the recursion. This becomes:

from functools import cache


def count_arrangements(conditions, rules):
    @cache
    def aux(i, j):
        if not rules[j:]:
            return 0 if "#" in conditions[i:] else 1
        if not conditions[i:]:
            return 1 if not rules[j:] else 0

        result = 0

        if conditions[i] in ".?":
            result += aux(i + 1, j)
        if conditions[i] in "#?":
            if (
                rules[j] <= len(conditions[i:])
                and "." not in conditions[i:i + rules[j]]
                and (rules[j] == len(conditions[i:]) or conditions[i + rules[j]] != "#")
            ):
                result += aux(i + rules[j] + 1, j + 1)

        return result
    return aux(0, 0)

Then, we implement our own cache and fill it in the right order:

def count_arrangements(conditions, rules):
    cache = [[0] * (len(rules) + 1) for _ in range(len(conditions) + 1)]
    # note that we iterate over the indices in reverse order here
    for i in reversed(range(0, len(conditions) + 1)):
        for j in reversed(range(0, len(rules) + 1)):
            if not rules[j:]:
                result = 0 if "#" in conditions[i:] else 1
            elif not conditions[i:]:
                result = 1 if not rules[j:] else 0
            else:
                result = 0
                if conditions[i] in ".?":
                    # since we are at row i, we already filled in row i + 1
                    result += cache[i + 1][j]
                if conditions[i] in "#?":
                    if (
                        rules[j] <= len(conditions[i:])
                        and "." not in conditions[i:i + rules[j]]
                    ):
                        if rules[j] == len(conditions[i:]):
                            # since we are at row i, we already filled in row i + rules[j] > i
                            result += cache[i + rules[j]][j + 1]
                        elif conditions[i + rules[j]] != "#":
                            # since we are at row i, we already filled in row i + rules[j] + 1 > i
                            result += cache[i + rules[j] + 1][j + 1]
            cache[i][j] = result
    return cache[0][0]

And, voilà! You can also have a look at a Rust implementation (my code, this time).

Note: In this case, it looks like the dynamic programming version is slower than the memoized one. But that’s probably due to it being written in unoptimized Python.

Note: Independently from using a faster language and micro-optimizations, the dynamic programming version allows us to see that we only need the previous column. Thus, we could replace the 2D array by two 1D arrays (one for the previous column, and one for the column being filled).
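
A sketch of that variant, with the same logic as above (prev_col plays the role of column j + 1 of the 2D cache, curr_col the column j being filled):

def count_arrangements(conditions, rules):
    n = len(conditions)
    prev_col = [0] * (n + 1)  # column j + 1 of the 2D cache
    curr_col = [0] * (n + 1)  # column j, currently being filled
    for j in reversed(range(0, len(rules) + 1)):
        for i in reversed(range(0, n + 1)):
            if not rules[j:]:
                result = 0 if "#" in conditions[i:] else 1
            elif not conditions[i:]:
                # rules[j:] is not empty here, so no valid arrangement remains
                result = 0
            else:
                result = 0
                if conditions[i] in ".?":
                    # row i + 1 of the current column is already filled
                    result += curr_col[i + 1]
                if conditions[i] in "#?":
                    if rules[j] <= n - i and "." not in conditions[i:i + rules[j]]:
                        if rules[j] == n - i:
                            result += prev_col[i + rules[j]]
                        elif conditions[i + rules[j]] != "#":
                            result += prev_col[i + rules[j] + 1]
            curr_col[i] = result
        # the column we just filled becomes the "previous" column for j - 1
        prev_col, curr_col = curr_col, prev_col
    return prev_col[0]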

Conclusion

I’ll concede that dynamic programming is not trivial. But it is far from being unreachable for most programmers. Being able to understand how to split a problem into smaller problems will enable you to use memoization in various contexts, which is already a huge improvement over a naive implementation.

However, mastering dynamic programming will let us understand a whole class of algorithms, better understand trade-offs, and make other optimizations possible. So, if you have not already done them, I strongly encourage you to practice on these problems:

  1. longest common subsequence (not to be confused with longest common substring, but you can do that one too with dynamic programming)
  2. line wrap
  3. subset sum
  4. partition
  5. knapsack

And don’t forget to benchmark and profile your code!


  1. Excluding, of course, d(A, B) itself ↩︎
  2. B could be empty as well, in which case we need to insert 0 characters ↩︎
  3. Note that we could permute the inner and outer loops as shown below. In this case, it works just as well:
    for b in range(0, len(B) + 1):
        for a in range(0, len(A) + 1):
    ↩︎
Categories
Programming

The Secret to a Green CI: Efficient Pre-commit Hooks with checkout-index

tl;dr: Run your linters from a pre-commit hook, on a copy of the staged files made with git checkout-index.

Linting in the CI

Most people are familiar with the concept of “CI” as in “CI/CD”.

For the ones who are not, it usually refers to a set of automatic tests that are run when a developer sends their code for review. These tests typically include “linters”, which will check that the code follows certain stylistic rules, and detect common mistakes.

It can be pretty annoying to wait for the CI to complete, only to discover that it failed because I forgot a “;” in TypeScript, added an extra “;” in Python, did not format my Rust code properly, etc.

Of course, I can always run the proper linting command before sending my code to make sure everything is okay. And what were the options to pass to cppcheck again? Oh right, I am supposed to use clang-tidy in this project. What’s the proper syntax to call mypy? Oh, and I do not want to run eslint on all the source files, it takes too long, which files did I change since the last time?

I know, I’ll just write a script. I will have to find out which files were changed since the last time it was called, to avoid verifying everything every time. I could use the git history for that, but what was the last commit when the script was run? I would have to store that somewhere…

Things would be simpler if I just ran it on every commit. But then I am definitely going to forget. If only I could run that script on every commit, automatically. Oh, and I should ignore changes that I have not committed, since I am not going to send these, and they could create false positives and false negatives. I could use git stash, but I should be extra cautious about not messing things up.

The pre-commit hook

Fortunately, this is a common predicament, and a solution for this exists: pre-commit hooks.

In git, a hook is a script (or executable in general) that is run automatically at certain times. For instance, when running git push, or when making a commit.

When creating a repository, git automatically populates the hooks with some examples. Take a look at .git/hooks/ in your repository.

The hook when making a commit is .git/hooks/pre-commit. So if you make a script and put it there, it will be run when you do git commit. Try it by writing the following code in .git/hooks/pre-commit. Do not forget to make it executable with chmod +x .git/hooks/pre-commit.

#!/usr/bin/env bash
echo "Hello from the pre-commit hook!"
exit 1

If you git add some files and run git commit, you should see:

$ git commit
Hello from the pre-commit hook!

git did not open up your editor to let you write a commit message. In fact, no commit was actually made. The reason is that we made the script exit with the error status “1”. For success, that would be “0”. The status of the script is the same as that of the last executed command. Here, the last executed command is exit 1, which unsurprisingly exits with status “1”.

So, for your TypeScript project, you can just use something like the script below, and no more errors!

#!/usr/bin/env bash
npx eslint src/
npx tsc

Well, not quite.

Making it Work Right

First, we are writing a shell script. And, although they can be convenient, they come with their lot of surprises. In the previous pre-commit script, a linting error found by eslint won’t stop the commit. We can see why in this example:

$ cat false-true
#!/usr/bin/env bash
false
true
$ ./false-true
$ echo $?
0

The false command exits with “1”. But it does not stop the execution of the script! So true is then run, which exits with “0”. This behavior is usually unwanted. Thankfully, we can disable it by adding set -e:

$ cat false-true
#!/usr/bin/env bash
set -e
false
true
$ ./false-true
$ echo $?
1

That’s better!

We can also set the -E option, to ensure that the ERR trap is inherited in functions. What that means is that, if you do some clean-up in your bash script (usually done with traps), and happen to write a function where a failure occurs, the clean-up is done properly. Setting the -u option will give you a proper error when you make a typo in a variable name, instead of just behaving like the variable contains the empty string.

Note: We can also set the “-o pipefail” option, but it comes with its own surprises. Basically, without “-o pipefail”, if you have “command1 | command2”, the script will only fail if command2 fails. With “-o pipefail”, the script will also fail if command1 fails. This can be what you want. But sometimes, command1 failing just means that there is nothing to check. In short, use this option on a case-by-case basis.

Alright, we are now linting the files in the current directory before making a commit. That’s good. But that’s not quite what we want. What if we only commit some of the changes?

To run the linting command on the exact state corresponding to the commit we are about to do, we could just stash the changes, but that’s dangerous. What if the script fails? What if the script succeeds? What if there is nothing to stash? What if we make a mistake somewhere and randomly corrupt the working directory?

Okay, there is a simple rule to avoid all that: don’t touch the working directory. Don’t stash, don’t move files, don’t do anything in the current directory.

Instead, just copy the state corresponding to the commit in a temporary directory far away from the working directory.

But my repository has so many files! It’s going to take so long to copy the full thing!

We’re not cloning the repository. Just doing a checkout-index. And git is fast.

The proper way to create a working directory is with mktemp, which will avoid any collision. Also, we are not going to leave random files lying around, so we ensure that the temporary directory is removed once the script finishes (succeeds, fails, crashes, is killed, etc.). In Bash, the proper way to do this is with traps. So we do that immediately. Finally, we copy the state of the current commit to that temporary directory with checkout-index:

TEMPDIR=$(mktemp -d)
trap "rm -rf $TEMPDIR" EXIT SIGHUP SIGINT SIGQUIT SIGTERM
git checkout-index --prefix=$TEMPDIR/ -af

Once this is done, we can just cd to that temporary directory and run our linting commands. And we’re done.

Well, almost. You did not expect it to be that easy, did you?

Making it Work Fast

See, modern build tools such as npm and cargo do this clever thing where they put dependencies in a local subdirectory (node_modules/ for npm, target/ for cargo).

It’s definitely an improvement over Python’s virtual environments. However, it does mean that, when we run our linting command from the temporary directory, we start from scratch1. Every time we try to make a commit.

Thankfully, we can just tell npm and cargo to use the usual subdirectory2. But where is this subdirectory again? We could use pwd, but maybe the user is in another subdirectory of the project. Maybe we could try visiting the parent’s until we find node_modules/ or target/?

There is a much simpler way. The thing we are looking for is usually easy to locate from the repository’s root. Often, it is just directly at the repository’s root. So, if the repository is at ~/src/my-project, we should look in ~/src/my-project/node_modules/ and ~/src/my-project/target/. Conveniently, we can just ask git for the repository’s root:

GIT_ROOT=$(git rev-parse --show-toplevel)

Then, we’ll just set the relevant environment variables:

export NODE_PATH="${GIT_ROOT}/node_modules"
export CARGO_TARGET_DIR="${GIT_ROOT}/target"

Now things are starting to look pretty good!

But we can do better.

If you only change 1 file out of 10,000, you might want to just check that one file and skip the rest. We can do just that.

First, let’s get the list of the files we should check:

git diff --cached --name-only --diff-filter=AM

Here, the “AM” value passed to diff-filter is two flags, “A” for “added files” and “M” for “modified files”. We can filter further with grep to keep only the files ending in “.py” for example.

Then, we feed this list as arguments to our linting executable. This is done easily with xargs, which will take each line from its input and give it as an argument to the command. For instance:

(echo a; echo b; echo c) | xargs ls
# same as
ls a b c

Since the list might be empty, remember to use --no-run-if-empty. Without it, you will run the linting command without argument, which might fail, or worse3, lint all the files. So we could have:

git diff --cached --name-only --diff-filter=AM |
    grep '\.[hc]$' |
    { cd $TEMPDIR; xargs --no-run-if-empty ./lint-fast c; }  # lint from temp directory

Note: The paths listed by git diff are relative to the root of the repository, so you’ll want to run the linting command from that root, even if all the files are in a subdirectory. This could also mean that you may have to explicitly point the linting command to the configuration file. For instance, if your TypeScript files are in a front/ directory along with the Biome configuration file, you will need to do:

git diff --cached --name-only --diff-filter=AM | grep -E '\.tsx?$' |
    (cd $TEMPDIR; xargs --no-run-if-empty npx @biomejs/biome check --config-path="front/")

The Script

Finally, with all the pieces coming together, we have a pre-commit hook which is both reliable and fast. Don’t forget to add the pre-commit script itself to your repository, and enable it with ln -s ../../pre-commit .git/hooks/ (you cannot directly track files in .git/).

You might not need TypeScript and Rust and Python and C and C++. But you can just pick the relevant parts.

#!/usr/bin/env bash
# Usage: copy this file to .git/hooks/

# Exit at first error
set -Eeu

# To handle partially committed files, we must copy the staged changes to a
# separate location
# See also https://stackoverflow.com/a/36793330
TEMPDIR=$(mktemp -d)
trap "rm -rf $TEMPDIR" EXIT SIGHUP SIGINT SIGQUIT SIGTERM
git checkout-index --prefix=$TEMPDIR/ -af

# keep using the same node_modules/ and target/ directories, not a new one in
# the temporary directory; this avoids re-parsing everything from scratch every
# time we run the script
GIT_ROOT=$(git rev-parse --show-toplevel)

# TypeScript
export NODE_PATH="${GIT_ROOT}/node_modules"
git diff --cached --name-only --diff-filter=AM | grep -E '\.tsx?$' |
    (cd $TEMPDIR; xargs --no-run-if-empty npx @biomejs/biome check)

# Rust
export CARGO_TARGET_DIR="${GIT_ROOT}/target"
(cd $TEMPDIR; cargo fmt --check)
(cd $TEMPDIR; cargo clippy --all -- --deny warnings)

# Python
(cd $TEMPDIR/back; ruff check .)

# C
git diff --cached --name-only --diff-filter=AM |  # list modified files
    grep '\.[hc]$' |  # filter C files
    { cd $TEMPDIR; xargs --no-run-if-empty ./lint-fast c; }  # lint from temp directory

# C++
git diff --cached --name-only --diff-filter=AM |  # list modified files
    grep '\.[hc]pp$' |  # filter C++ files
    { cd $TEMPDIR; xargs --no-run-if-empty ./lint-fast cxx; }  # lint from temp directory

For C and C++, I use the following lint-fast script:

#!/usr/bin/env bash
# Exit at first error
set -Eeuo pipefail

MODE="${1}"
shift

# trailing space check
if grep -Hn '\s$' "$@"; then
    echo "Error: trailing space"
    exit 1
fi

# check for NULL instead of nullptr in C++ files
if test "${MODE}" = "cxx"; then
    grep -Hn --color NULL "$@" && exit 1
fi


GCCFLAGS=(
    # do not actually compile
    -fsyntax-only
    # fail script when issue is found
    -Werror
    # usual paranoid options
    -Wall
    -Wconversion
    -Wextra
    -Wpedantic
    -Wshadow
    -Wvla
)

# C++ specific options
if test "${MODE}" = "cxx"; then
    GCCFLAGS+=(
        -std=c++17
        -Wold-style-cast
    )
fi

gcc "${GCCFLAGS[@]}" "$@"

#note: one flag might come from pre-commit hook
CPPCHECK_FLAGS=(
    --suppress=noExplicitConstructor
    # no point when we only check a subset of the files
    --suppress=unusedFunction
    # disable lookup for system includes
    --suppress=missingIncludeSystem
    # sometimes makes stupid suggestions
    --suppress=useStlAlgorithm
    # otherwise, it complains when not enough files are included
    --suppress=unusedStructMember
    # otherwise, it will complain it has not found a suppressed warning
    --suppress=unmatchedSuppression
    # enable everything else
    --enable=all
    # fail script when issue is found
    --error-exitcode=1
    # only print something when there is an issue
    --quiet
    # ignore inlined JS code
    -DEM_ASM=
)

cppcheck "${CPPCHECK_FLAGS[@]}" "$@"

# C++ specific options
CLANG_FLAGS=(
    --checks=-clang-analyzer-security.FloatLoopCounter,-clang-analyzer-valist.Uninitialized,-clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling,-clang-analyzer-optin.performance.Padding
    --warnings-as-errors=*
)
if test "${MODE}" = "cxx"; then
    CLANG_FLAGS+=(
        -extra-arg=-std=c++17
    )
fi
clang-tidy "${CLANG_FLAGS[@]}" "$@" --
  1. Of course, if we’re linting with a call to npx, it will probably just fail since the command is not installed in the local project. ↩︎
  2. Yes, I know, I just said we shouldn’t touch the working directory. But we’re not, npm and cargo are! … Okay, let’s just say it’s a special case and we are being very careful here. ↩︎
  3. This is worse because this could stay under the radar for a long time and waste time. An error can usually be diagnosed quickly and fixed once and for all. ↩︎
Categories
Programming

Tiny Docker Containers with Rust

The Magic

Thanks to Python, I have gotten used to Docker images that take minutes (or tens of minutes) to build and hundreds of megabytes to store and upload.

FROM python:3.14

# install everything
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
    libsome-random-library0 libanother-library0 \
    libmaybe-its-for-pandas-i-dunno0 libit-goes-on0 liband-on0 \
    liband-on-and-on0 \
    && rm -rf /var/lib/apt/lists/*

# install more of everything
COPY requirements.txt .
RUN pip3 install -r requirements.txt

ENTRYPOINT ["gunicorn", "--daemon", "--workers=4", "--bind=0.0.0.0:8080", "app:create_app()"]

But there is another way. The Rust way. Where you build your image in a second, and it only takes 5 MB1.

FROM scratch
COPY target/release/my-binary .
ENTRYPOINT ["./my-binary"]

The Trick

This is possible thanks to the points below.

  1. Rust is compiled to binary
  2. Rust is statically2 compiled to binary
  3. You can easily use musl3 with Rust

The first point means that there is no need for a runtime to interpret the script or the bytecode.

The second point means that the binary contains the code for all the libraries4. And, more specifically, only the required code. So no need to install them externally, you remove some overhead, and the total size is reduced.

With these two, you get down to just a few runtime dependencies. Namely, glibc and the ELF loader5.

$ ldd target/release/my-binary
linux-vdso.so.1 (0x00007fffbf7f9000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f7a759b2000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f7a758d3000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f7a756f2000)
/lib64/ld-linux-x86-64.so.2 (0x00007f7a762a6000)

But you can go further! We do not have to use glibc. It turns out you can just link musl statically. And with Rust, using musl is only a matter of installing musl-tools (and, if you use rustup, running rustup target add x86_64-unknown-linux-musl) and adding --target=x86_64-unknown-linux-musl to your cargo build command.

And, since we just got rid of our last runtime dependency, we do not even need the ELF loader! So now, we get:

$ ldd target/x86_64-unknown-linux-musl/release/my-binary
	statically linked

All the user code is now in a single file and will talk to the kernel directly using the syscall instruction.

Conclusion

This is a nice trick, and can help you if you really care about small Docker images, but it does have a drawback: targeting musl typically gives you lower performance. So keep this in mind!

  1. Sure, I am simplifying things here. In practice, you would also build the binary in a Dockerfile, and then use a multi-stage build. But that does not change the point. ↩︎
  2. Static linking means including external libraries within the binary, as in opposition to dynamic libraries, where they are not included. Binaries that were dynamically linked will need to be provided with the external libraries at runtime (.so files on Linux, .dll files on Windows). ↩︎
  3. musl is an implementation of libc, the standard library for C. Virtually all binaries depend on the libc. glibc is the most common implementation of libc on Linux. ↩︎
  4. Of course, it won’t work if a library is using FFI to talk to OpenSSL for instance ↩︎
  5. The ELF loader is another program that helps your program to start up. In particular, it tells your program where to find the dynamically linked libraries. ↩︎
Categories
Programming

Where to Start with Rust

As a Rust evangelist, I have been asked a few times for resources to start learning Rust. I have compiled my thoughts in this article. The resources are listed in rough order: take a look at the first ones before moving on to the later ones. But do not take this as gospel, and do peek at the rest of the list before finishing even the official tutorial.

The Rust Programming Language

This is the official tutorial for Rust. It is also known as “the Rust book”. I do not recommend you read it entirely, but you should read at least the first 12 chapters. Of course, the rest of the chapters are worth a read, but the more advanced topics are better covered by dedicated resources.

Rust for functional programmers

If you are used to programming in Haskell or OCaml, this tutorial might help you get up to speed quicker.

Tour of Rust

For a more hands-on approach, you can use this tutorial, which drops you in a Rust playground with minimal exercises that each train you on a new concept.

Rust by Example

If the Tour of Rust is not enough, you can also try this one to complement your knowledge.

Learning Rust With Entirely Too Many Linked Lists

This tutorial is about getting over one of the main pain points of Rust: cyclic references. It goes over various solutions, how they work, and their pros and cons. A must-read if you want to become more comfortable with Rust.

Learn Rust the Dangerous Way

You can see this tutorial as an extension of the previous one. When you are used to developing in C, switching to idiomatic Rust can seem like a daunting task. This website will take you from a C program (the n-body problem from the Benchmarks Game) to a fully idiomatic and optimized Rust one.

Rustlings

At this point, you should be getting comfortable with the syntax and main idiosyncrasies of Rust. This collection of exercises will let you hone your skills.

Zero To Production In Rust (book)

To go further than writing small snippets of Rust, you will want to start taking into account higher-level considerations. This book narrates every step in setting up a simple mailing-list server in Rust.

Macros in Rust

Writing macros is a fairly advanced topic in Rust. This tutorial will get you more familiar with writing declarative macros (or “macros by example”).

The Rustonomicon

Most of the time, you should not need to touch unsafe. But, when you do, you enter a parallel world where every step you take could lead to the dreaded undefined behavior. Using unsafe safely requires thinking about many things. This tutorial will help you get started.

Rust for Rustaceans (book)

Like all non-trivial software projects, Rust has its quirks and unintuitive mechanisms. This book is a very dense explication of some of this crust. You might also enjoy the author’s YouTube channel.

This Week in Rust

This is the one-stop-shop to keep learning more about Rust, its ecosystem and its community. You do not need to follow all the links, or even read it from start to finish, but skimming the latest edition every week will let you stay on top of things.

Rust Koans

I conclude this list of heavy technical content with some light reading, likely inspired by the Rootless Root.

Categories
Programming

Client-Side Password Hashing

tl;dr: When you think about it, hashing passwords only client-side (in the browser, in JavaScript) is not as terrible as you might think. And if you do most of the work client-side, you get something that is actually pretty secure.

Storing Passwords

So, I have been working on modernizing a Web application and fixing some of its issues. As someone who has spent way too much time implementing cryptographic primitives and cracking password hashes for fun, one of the main problems was that passwords were only hashed client-side.

For context, we learned quite a long time ago that storing raw passwords in a database was a bad idea. If the database leaked, all passwords were immediately compromised. Since people tend to reuse the same passwords for various services, compromising a non-sensitive service would allow an attacker to access very sensitive accounts of many individuals.

So we encrypt them. Except we don’t. If the database leaks, the encryption key will probably leak as well. Instead, we cryptographically hash the password. That is, we transform it into something completely different, irreversibly. So, “correct horse battery staple” would become “c4bbcb1fbec99d65bf59d85c8cb62ee2db963f0fe106f483d9afa73bd4e39a8a”.

How your password is saved when you register, and how it is checked when you log in

Because the same password always gets hashed the same way, we can authenticate people by comparing the hashes instead of the passwords themselves. No need to store the password in the database! And if the database leaks, an attacker cannot find “correct horse battery staple” from “c4bbcb1fbec99d65bf59d85c8cb62ee2db963f0fe106f483d9afa73bd4e39a8a”.

Well, except they can.

Slowing down attackers

Although cryptographic hashes are designed to be hard to reverse directly, an attacker who got access to the hashes can just take a guess, hash it, and check whether it matches the hash of a user. If it does, they have found a password.

Since cryptographic hashes are also designed to be very fast, an attacker can try many guesses in a short amount of time.

The average password entropy is about 40 bits. The median is likely lower. A modern CPU can compute more than 1 million SHA-256 hashes per second. So trying every password with 40 bits of entropy would take less than 2 weeks.

The number of characters is not everything. What matters is how the password is generated. Don’t pick it yourself: since humans are bad at randomness, the entropy will be even lower!

Thankfully, we can make things more difficult for the attacker. What if, instead of storing SHA-256(password), we stored SHA-256(SHA-256(password))? It would make things twice as hard for the attacker, and would not change much when we validate a user’s password for authentication. And we can go further! Instead of computing SHA-256 just twice, we can make it ten, a hundred, a thousand iterations! We can choose a number such that hashing a password takes 10 ms on the server. In that case, trying every 40-bit password would take the attacker 350 years of CPU time! This is basically PBKDF2.

The more iterations, the longer it will take to crack a password. Of course, verifying the password when authenticating will also take longer, but it should only be a few milliseconds.
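To make the idea concrete, here is a minimal Python sketch of this iterated-hashing approach. The iteration count is an arbitrary assumption (tune it so that hashing takes about 10 ms on your server), and proper per-user salting is left out, as the footnote on salting explains; real code should rely on a vetted library.

import hashlib
import hmac

ITERATIONS = 600_000  # assumption: tune so that hashing takes ~10 ms on your server
SALT = b"see the footnote on salting"  # a real implementation uses a random salt per user

def hash_password(password: str) -> bytes:
    # PBKDF2 chains HMAC-SHA-256 many times to slow brute force down
    return hashlib.pbkdf2_hmac("sha256", password.encode(), SALT, ITERATIONS)

def check_password(password: str, stored_hash: bytes) -> bool:
    # hash the attempt the same way and compare in constant time
    return hmac.compare_digest(hash_password(password), stored_hash)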

Except that the attacker can do something that we cannot: they can try many guesses in parallel. Today’s CPUs can have dozens of cores. GPUs scale even more. And, with Bitcoin/blockchain/web3/NFTs, people have started building dedicated hardware to compute SHA-256 en masse. For a few thousand dollars, you can get hardware that cracks SHA-256 at tens of billions of guesses per second. Even with our clever scheme, the attacker is back to 2 weeks to try every password1.

That’s cheating!

Yeah, it is! But we can do something about that. GPUs and dedicated hardware are designed for very specific tasks. We can take advantage of that to make it very hard to crack passwords with them. The trick is to replace SHA-256 with a hash function that requires a lot of memory to compute. It turns out people have designed a very good one: Argon2id2.

That’s basically the state of the art in storing passwords3.

Client-Side Only Hashing

The client (your web browser) just sends the password over the network. Then, the server has to hash it before storing or checking it.

Now, that’s quite a bit of work for our server. We need to do a significant amount of computation every time a user tries to connect. What if we had the user’s client do it instead?

Instead of sending the password, and having the server hash it, let’s imagine we use some JavaScript to hash the user’s password with Argon2id, and then send it over the network. Now, the server only needs to compare the hash to the one in the database!

Now, the client does most of the work.

There is one issue though: from the point of view of our Web application, the actual credential is not the password anymore, but the hash. So if an attacker gets the hash, they can authenticate as the user. In other words, a database leak means the credentials leak!

It’s not as bad as in the initial scenario, however. These credentials are only valid for our service. And it is still very hard to find the original passwords from the hashes.

Mostly Client-Side Hashing

We can improve this with a simple observation: since it is very hard to get the password back from the hash, we can treat the hash like a high-entropy password.

In this case, brute-force attacks become irrelevant, and we do not need to iterate or use memory-hard functions. Instead, we could just use a fast hash. It looks like this:

  1. The browser computes an intermediate hash with a slow memory-hard function
  2. The browser sends the intermediate hash to the server
  3. The server hashes the intermediate hash with a fast function, getting the final hash
  4. The server compares the final hash with the one in the database

From the point of view of the server, the effective credential is the intermediate hash. However, if the database leaks, the attacker only gets the final hash. It would be very hard to find the intermediate hash directly from the final hash, since the input space is very large. And it would be very hard to find the user’s password from the final hash, since a memory-hard function is involved.
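Here is a minimal Python sketch of these four steps. It is only an illustration under a few assumptions: in a real application, step 1 would run in the browser in JavaScript, Argon2id would be used where scrypt stands in (scrypt simply ships with the Python standard library), and the parameters and salt handling are placeholders.

import hashlib
import hmac

# Step 1: the browser computes the intermediate hash with a memory-hard function
def intermediate_hash(password: str, salt: bytes) -> bytes:
    # in practice, the salt would be derived from the user name or fetched from the server
    return hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)

# Step 3: the server hashes the intermediate hash with a fast function
def final_hash(intermediate: bytes) -> bytes:
    return hashlib.sha256(intermediate).digest()

# Step 4: the server compares the final hash with the one in the database
def verify(intermediate: bytes, stored_final_hash: bytes) -> bool:
    return hmac.compare_digest(final_hash(intermediate), stored_final_hash)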

The nice thing is that all the heavy work is offloaded to the browser, while still keeping the password secure!

The one drawback that I can think of is that this is not enforced by the back-end. If the front-end (especially a third party one) does not do proper memory-hard hashing, passwords might become exposed.

A positive side effect is that the back-end never sees raw passwords. It can be a good mitigation for inadvertently leaking information in the logs. Leaking the intermediate hash still allows an attacker to authenticate as a user, but they are not able to get their raw password and compromise the user’s accounts on other websites.

Conclusion

A priori, client-side hashing is a terrible idea from the point of view of security. However, looking more closely at it, it looks like it can actually make sense. But we should still use hashing on the server to avoid any risk of leaking credentials.

  1. Note that, without the iterated hashing, it would take just 2 minutes with dedicated hardware. So that’s still better than not doing it. ↩︎
  2. scrypt came first, and is pretty good as well ↩︎
  3. There is also salting, but that’s orthogonal to this discussion ↩︎
Categories
Programming

Strongly Typed Web Apps

I Am Old

I like software to be correct. I know. Controversial, right? But as I have gained experience tinkering with various technologies, and building hobby project after hobby project, with some professional projects in-between, I have come to realize that I do not want to just pray and hope. I want to know that my programs work. And I do not want to spend 99 % of my time manually testing them.

When I first encountered Python, coming from a QBasic, C and C++ background, it was bliss not having to declare types manually. Instead of wasting time writing boilerplate, I could actually make my programs do what I wanted them to do.

But, as my projects grew more ambitious, I had to become stricter and stricter to make sure I found bugs as early as possible. So I wrote tests, set up linters, wrote tests, added back typing through type annotations, and wrote tests.

Type annotations get a bad rap, but they are actually useful, unlike C types.

In C, types are mostly about memory layout, and not so much about correctness. When it comes to semantics, a char is an int, a short is an int, and an int could be a long, or a float. Also, an int could be a short1. And of course, you can add strings to arrays, pointers to integers, and floating point values to enumerations. Because, in the end, it’s all the same to C2.

With type annotations, you start to get the point that functional aficionados have been trying to make since forever. “Yeah, that value could be None, so you need to check for that before you use it. By the way, you’re evaluating the addition between a hash table and a list; don’t.” Thanks, compiler3.

And when you start thinking like that, you might want to say: “Okay, this variable holds euros. I really do not want to add it to kilowatts.” That’s when you know it’s too late. But now you have implemented a full-fledged system for dimensional analysis in your type system. And it is glorious. You have forgotten what it was meant for, though.
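For a taste of that in Python, here is a minimal sketch of the euros-versus-kilowatts idea using typing.NewType. The names are made up for illustration, and mypy is the assumed type checker.

from typing import NewType

Euros = NewType("Euros", float)
Kilowatts = NewType("Kilowatts", float)

def add_prices(a: Euros, b: Euros) -> Euros:
    return Euros(a + b)

price = Euros(10.0)
power = Kilowatts(3.0)

add_prices(price, price)  # fine
add_prices(price, power)  # mypy: incompatible type "Kilowatts"; expected "Euros"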

What does this have to do with Web Apps? Seriously, not much. But there is a point!

Typed APIs

Web Apps have two fully separated software components talking to each other: the frontend, and the backend. And it’s very easy to introduce bugs at the interface between the two. Especially when they use different languages4.

Of course, the solution is XML. As everyone knows, you simply write an XSD or DTD specification for your format. Then, you can easily transform your data for consumption thanks to XSLT, and even handle presentation with XSL. You might even have something working within ten years!

Okay, okay, everyone is doing JSON these days. Thankfully, we do have a way to describe what a JSON document contains: Swagger OpenAPI.

Fine. So you write this OpenAPI thing. Then what?

Well, that’s where the current picture is still lacking. We now have three components and two interfaces:

  • backend code
  • OpenAPI specification
  • frontend code

How do we get out of manually checking everything and make sure it works?

First, we can generate some frontend code from the OpenAPI specification, which will handle all communications with the backend. For this, I use openapi-typescript-codegen. Okay, that one mostly works fine.

Now, the fun part is to make the backend code and the OpenAPI code work together.

The convention is to generate the OpenAPI specification from the backend code, and not backend code from the OpenAPI specification. In Python, you might be familiar with Flask-RESTful Flask-RESTplus Flask-RESTX. In Rust, you could be using Utoipa.

In principle, this is great: you add some annotations to your backend code, and you can be sure that your frontend code sees the same types!
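As an illustration, here is roughly what such annotations look like with Flask-RESTX. This is a minimal sketch with made-up route and model names, not code from any project discussed here.

from flask import Flask
from flask_restx import Api, Resource, fields

app = Flask(__name__)
api = Api(app)  # also serves the generated OpenAPI (Swagger) specification

# this model is what ends up in the OpenAPI specification
user_model = api.model("User", {
    "id": fields.Integer(required=True),
    "name": fields.String(required=True),
})

@api.route("/users/<int:user_id>")
class UserEndpoint(Resource):
    @api.marshal_with(user_model)  # documents (and filters) the response
    def get(self, user_id):
        # nothing checks statically that this dict really matches user_model
        return {"id": user_id, "name": "Alice"}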

However, this is still a very immature space. You can very easily document the wrong type in the annotations, and you will not notice the problem until you try to actually use the generated OpenAPI specification in the front-end.

Conclusion

Things have definitely improved since the days I was writing PNG parsers in QBasic. But we still have a long way to go before we have picked all the low-hanging fruit for software correctness. I want the compiler to shout at me more often!

  1. Yes, I am aware of type ranks. The point is that none of this helps you make your program more correct. ↩︎
  2. Sure, you cannot add a struct to a bit-field. But guess how you pass a struct around? ↩︎
  3. Or mypy in this case. ↩︎
  4. Sure, you might enjoy using JavaScript in your backend, since you already have to use it for your frontend. Not. ↩︎
Categories
Programming

Overkill Debugging

Hack that Code!

I have a not-so-secret technique that I use whenever I encounter a bug that evades my grasp. Or when I am lazy. Okay, mostly when I am lazy.

The idea is no more than applying the concept of binary search to source code:

  1. Remove half the code
  2. Compile and run
  3. If the bug is still present, repeat with the remaining code
  4. Otherwise, repeat with the code that was removed in 1

Of course, this naive approach is rarely practical. For one thing, removing half the code indiscriminately tends to make things not compile. Also, sometimes, a bug only manifests because of the way the first half and the second one interact.

So, in practice, it looks more like:

  1. Remove some big chunk of code
  2. If that fails to compile/run, try another chunk, or reduce the size
  3. Compile and run
  4. If the bug is still present, repeat with the remaining code
  5. Otherwise, put back the chunk of code, and try removing another one (can be a sub-chunk of the initial chunk)

Note: sometimes, you will still need to do a mini-refactor to keep making progress

This works surprisingly well. And it is very satisfying to remorselessly hack through code that was written painstakingly over hours, months or even years. And then to get down to a small subset of code that clearly exemplifies the bug.

Automate the Fun Parts

Software engineers might be the group of people who work the hardest to make themselves redundant.

Don't forget the time you spend finding the chart to look up what you save. And the time spent reading this reminder about the time spent. And the time trying to figure out if either of those actually make sense. Remember, every second counts toward your life total, including these right now.
I am guilty of spending way too much time automating tasks that only took a trivial amount of time

So, of course, someone already automated the process I described above. It is called C-Reduce. It is meant to find bugs in the compiler rather than in the compiled program, but the principle is the same. If your program is in C and you can write a small script that tests whether the bug is present, then it will automatically slash through your code.

To give you an idea, consider the code below.

#include <stdlib.h>
#include <stdio.h>

void f(void) {
    int x = 42;
    int y = 83;
    int z = 128;
    printf("%d\n", x  + y);
}

void g(void) {
    int x = 1;
    int y = 2;
    int z = 3;
    printf("%d\n", x  + y);
}

int h(void) {
    return 42 * 43;
}

int main(void) {
    h();
    f();
}

#include <stdarg.h>
#include <time.h>

As well as the following Bash script.

#!/usr/bin/env bash
set -Eeuo pipefail
gcc -o main main.c
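# the reduced program must still print 125 for C-Reduce to keep a change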
./main | grep -q ^125$

Then creduce test.sh main.c reduces the C source file to:

main() { printf("%d\n", 42 + 83); }

This was really a toy example, but this program is very useful when you have large source files. For instance, C++ ones *shivers*.

Audio Millisecond Mistiming

I have been working on a tool to practice Morse code using jscwlib, the library behind LCWO. I want to play an infinite flow of words in Morse code. The algorithm is the fruit of decades of research:

  1. Play a word
  2. When it has finished playing, play another word

While dogfooding, I realized that the speed sometimes increased unexpectedly. This was subtle enough that I did not notice the transition from normal speed to the higher speed.

Over my next practice sessions, I kept an eye ear on this phenomenon, and eventually realized that:

  • The space between the words did get shorter
  • The duration of the characters remained the same

From these two observations, I was able to devise a more reliable way to detect the presence of the bug. First, I made all the words just the letter “e”, which is a single “dit” in Morse. Second, I measured the time between the starts of two consecutive words with new Date(). Finally, I logged the durations to the console.

This is not a very precise method, but it enabled me to see the duration of the spaces shrinking millisecond by millisecond over a few seconds. The effect became noticeable after a few hundred words were played.

With this, I could test the presence of the bug in mere seconds, instead of minutes of practice. So I went slashing.

After a short while, I found the culprit, wrote a fix, and opened a PR!

Conclusion

In the end, it took me much more time noticing the bug, understanding what it looked sounded like, and finding a way to check for it effectively, than actually identifying its origin in the source code.

Categories
Programming

Float Woes in C

The C programming language is a gift that keeps giving. Today, we are going to see how a seemingly banal and common operation can hide unfathomable depths of unmentionable horrors.

Woes with Integer Coercion

What is the problem of the code below?

// returns the closest integer to f
int float2int(float f) {
    return (int) f;
}

It’s written in C.

A Rust enthusiast

That’s… a good point.

But let’s assume we actually want to write C, for some forsaken reason. And we do not care whether we convert up, down or sideways.

There is still a fundamental issue with this code. What happens when we call float2int with INFINITY?

#include <math.h>
#include <stdio.h>

int float2int(float f) {
    return (int) f;
}

int main(void) {
    printf("%d\n", float2int(INFINITY));
    printf("%d\n", float2int(-INFINITY));
    printf("%d\n", float2int(NAN));
}

When we compile this code with gcc and run the result, we get:

$ gcc -O3 test.c && ./a.out
2147483647
-2147483648
0

Makes sense! The largest integer is 2147483647; the smallest is –2147483648, and what would you choose for NAN but 0?

Someone still young and innocent

Now, let’s just remove that -O3 option:

$ gcc test.c && ./a.out
-2147483648
-2147483648
-2147483648

What?

The innocent who got their world shattered

Wait, it gets better:

$ clang -O3 test.c && ./a.out
1464089272
1488257696
1488257696
$ ./a.out
1259459480
-1806736736
-1806736736
$ ./a.out
-2098630344
1664811680
1664811680

But… why?

All hope is gone

Because it can.

Yup, that’s undefined behavior. Converting a non-finite value (i.e. an infinite or a NaN) to an integer is undefined behavior.

But that’s not all! What should the largest finite floating-point value (3.4028234664e+38) convert to? It’s much bigger than INT_MAX, the largest value an int can represent (it does not matter whether you are on a 32, 64 or 128 bit architecture, really).

Maybe we could just say that every floating-point number larger than INT_MAX should be converted to INT_MAX? But alas, that makes for unwelcome surprises down the line, when you realize that the rounded value is not always within 1 of the original value.

Thankfully, the C standard always has the solution whenever there is a hint of ambiguity:

When a finite value of real floating type is converted to an integer type other than _Bool, the
fractional part is discarded (i.e., the value is truncated toward zero). If the value of the
integral part cannot be represented by the integer type, the behavior is undefined.

Part 6.3.1.4 paragraph 1 of the C standard

In other words, we’re just going to say that all the values that are not quite that convenient trigger undefined behavior. The values that can “be represented by” int are exactly those in the interval [INT_MIN, INT_MAX]. Attempting to convert any float value from outside this range to the int type is undefined behavior.

We’ll just have to check for that!

Someone has not learned their lesson

Woes with Range Checking

We are just getting to the actually tricky part. Let’s have a look at a seemingly robust implementation:

#include <limits.h>
#include <math.h>

int float2int(float f) {
    if (isnan(f)) {
        return 0;
    }
    if (f < INT_MIN) {  // also filters out -INFINITY
        return INT_MIN;
    }
    if (f > INT_MAX) {  // also filters out +INFINITY
        return INT_MAX;
    }
    return (int) f;
}

For the sake of simplicity, we are just providing arbitrary values for the values that we cannot convert. Other implementations might choose to return a status indicating whether the conversion is possible or not.

But, of course, there is a bug lurking in there. And it is extremely easy to miss.

And, this time, it is not caused by an undefined behavior, but by another surprising “feature” of C: implicit type conversions.

To understand what I mean by that, we need to look at what the conditions in the code above really mean. The first one is fine, so let’s look at the other two: f < INT_MIN and f > INT_MAX. In both, the left operand is a float, and the right operand is an int. Processors rarely have such mixed-type operations built in, so a conversion must be taking place. The way it happens is described in the “usual arithmetic conversions”:

[…]

if the corresponding real type of either operand is float, the other operand is
converted, without change of type domain, to a type whose corresponding real type is float.

[…]

Part 6.3.1.8 paragraph 1 of the C standard

Thankfully, we only need the part I have quoted here. In short, both operands get converted to float. Let’s look at our conditions again.

if (f < (float) INT_MIN) {  // also filters -INFINITY
    return INT_MIN;
}
if (f > (float) INT_MAX) {  // also filters +INFINITY
    return INT_MAX;
}

Let’s look at the first condition. What is the value of (float) INT_MIN? Let’s assume a 32-bit two’s-complement int type. Then INT_MIN is -2³¹. A float can represent this value exactly. You can check it out with an online converter. So (float) INT_MIN is -2³¹ as well. No problem here: this conversion does not change the value, and this code does exactly what we want.

With the same assumption, INT_MAX is 2³¹ – 1. And there comes the catch: a float cannot represent 2³¹ – 1 exactly. If you put “2147483647” in the online converter, you will see that it is rounded to a float whose value is actually “2147483648”. It makes sense: a float trades off being able to represent all integers in [-2³¹, 2³¹ – 1] in order to cover a much wider range and many more magnitudes. In other words, the actual value of (float) INT_MAX is INT_MAX + 1. The condition is actually doing:

if (f > INT_MAX + 1) {  // also filters +INFINITY
    return INT_MAX;
}

Note: if this were C, this code would contain an undefined behavior because of the signed integer overflow that happens when evaluating INT_MAX + 1. This is just pseudocode for illustration purposes.

Since the upper bound is offset by one, the code fails to filter out the half-open interval (INT_MAX, INT_MAX + 1]. If the input parameter f were to take its value from this range, converting it to int would be undefined behavior. Now, is there any float value in this interval? Yes! INT_MAX + 1 of course (it’s the only one).

Conclusion

The “robust” implementation does not handle 2³¹ properly. This is not just a theoretical oddity. I encountered this issue in critical code in production while working at TrustInSoft. Such bugs will easily stay under the radar and avoid even the most intensive test campaigns. Only formal methods help in detecting and avoiding such errors.

The worst thing is that, since this is undefined behavior, you might encounter it many times and never know, corrupting data in unknown ways. Things that could have been a runtime error become the entry door for silent bugs and security issues.

C likes to keep you on your toes.