
The Secret to a Green CI: Efficient Pre-commit Hooks with checkout-index

tl;dr: do not stash; use git checkout-index to copy the staged files into a temporary directory, and run your linters there. The full script is at the end of this article.

Linting in the CI

Most people are familiar with the concept of “CI” as in “CI/CD”.

For those who are not, it usually refers to a set of automatic tests that are run when a developer sends their code for review. These tests typically include “linters”, which check that the code follows certain stylistic rules and detect common mistakes.

It can be pretty annoying to wait for the CI to complete, only to discover that it failed because I forgot a “;” in TypeScript, added an extra “;” in Python, did not format my Rust code properly, etc.

Of course, I can always run the proper linting command before sending my code to make sure everything is okay. And what were the options to pass to cppcheck again? Oh right, I am supposed to use clang-tidy in this project. What’s the proper syntax to call mypy? Oh, and I do not want to run eslint on all the source files, it takes too long, which files did I change since the last time?

I know, I’ll just write a script. I will have to find out which files were changed since the last time it was called, to avoid verifying everything every time. I could use the git history for that, but what was the last commit when the script was run? I would have to store that somewhere…

Things would be simpler if I just ran it on every commit. But then I am definitely going to forget. If only I could run that script automatically for every commit. Oh, and I should ignore changes that I have not committed, since I am not going to send these, and they could create false positives and false negatives. I could use git stash, but I should be extra cautious about not messing things up.

The pre-commit hook

Fortunately, this is a common predicament, and a solution for this exists: pre-commit hooks.

In git, a hook is a script (or executable in general) that is run automatically at certain times. For instance, when running git push, or when making a commit.

When creating a repository, git automatically populates the hooks with some examples. Take a look at .git/hooks/ in your repository.

The hook that runs when making a commit is .git/hooks/pre-commit. So if you write a script and put it there, it will be run when you do git commit. Try it by writing the following code in .git/hooks/pre-commit. Do not forget to make it executable with chmod +x .git/hooks/pre-commit.

#!/usr/bin/env bash
echo "Hello from the pre-commit hook!"
exit 1

If you git add some files and run git commit, you should see:

$ git commit
Hello from the pre-commit hook!

git did not open up your editor to let you write a commit message. In fact, no commit was actually made. The reason is that we made the script exit with the error status “1”. For success, that would be “0”. The status of the script is the same as that of the last executed command. Here, the last executed command is exit 1, which unsurprisingly exits with status “1”.

So, for your TypeScript project, you can just use something like the script below, and no more errors!

#!/usr/bin/env bash
npx eslint src/
npx tsc

Well, not quite.

Making it Work Right

First, we are writing a shell script. And, although they can be convenient, shell scripts come with their share of surprises. In the previous pre-commit script, a linting error found by eslint won’t stop the commit. We can see why in this example:

$ cat false-true
#!/usr/bin/env bash
false
true
$ ./false-true
$ echo $?
0

The false command exits with “1”. But it does not stop the execution of the script! So true is then run, which exits with “0”. This behavior is usually unwanted. Thankfully, we can disable it by adding set -e:

$ cat false-true
#!/usr/bin/env bash
set -e
false
true
$ ./false-true
$ echo $?
1

That’s better!

We can also set the -E option, to ensure that the ERR trap is inherited in functions. What that means is that, if you do some clean-up in your bash script (usually done with traps), and happen to write a function where a failure occurs, the clean-up is still done properly. Setting the -u option will give you a proper error when you make a typo in a variable name, instead of just behaving as if the variable contained the empty string.

Note: We can also set the “-o pipefail” option, but it comes with its own surprises. Basically, without “-o pipefail”, if you have “command1 | command2”, the script will only fail if command2 fails. With “-o pipefail”, the script will also fail if command1 fails. This can be what you want. But sometimes, command1 failing just means that there is nothing to check. In short, use this option on a case-by-case basis.

Alright, we are now linting the files in the current directory before making a commit. That’s good. But that’s not quite what we want. What if we only commit some of the changes?

To run the linting command on the exact state corresponding to the commit we are about to do, we could just stash the changes, but that’s dangerous. What if the script fails? What if the script succeeds? What if there is nothing to stash? What if we make a mistake somewhere and randomly corrupt the working directory?

Okay, there is a simple rule to avoid all that: don’t touch the working directory. Don’t stash, don’t move files, don’t do anything in the current directory.

Instead, just copy the state corresponding to the commit in a temporary directory far away from the working directory.

But my repository has so many files! It’s going to take so long to copy the full thing!

We’re not cloning the repository. Just doing a checkout-index. And git is fast.

The proper way to create a temporary directory is with mktemp, which will avoid any collision. Also, we are not going to leave random files lying around, so we ensure that the temporary directory is removed once the script finishes (succeeds, fails, crashes, is killed, etc.). In Bash, the proper way to do this is with traps. So we do that immediately. Finally, we copy the state of the current commit to that temporary directory with checkout-index:

TEMPDIR=$(mktemp -d)
trap "rm -rf $TEMPDIR" EXIT SIGHUP SIGINT SIGQUIT SIGTERM
git checkout-index --prefix=$TEMPDIR/ -af

Once this is done, we can just cd to that temporary directory and run our linting commands. And we’re done.

Well, almost. You did not expect it to be that easy, did you?

Making it Work Fast

See, modern build tools such as npm and cargo do this clever thing where they put dependencies in a local subdirectory (node_modules/ for npm, target/ for cargo).

It’s definitely an improvement over Python’s virtual environments. However, it does mean that, when we run our linting command from the temporary directory, we start from scratch1. Every time we try to make a commit.

Thankfully, we can just tell npm and cargo to use the usual subdirectory2. But where is this subdirectory again? We could use pwd, but maybe the user is in another subdirectory of the project. Maybe we could try visiting the parent directories until we find node_modules/ or target/?

There is a much simpler way. The thing we are looking for is usually easy to locate from the repository’s root. Often, it is just directly at the repository’s root. So, if the repository is at ~/src/my-project, we should look in ~/src/my-project/node_modules/ and ~/src/my-project/target/. Conveniently, we can just ask git for the repository’s root:

GIT_ROOT=$(git rev-parse --show-toplevel)

Then, we’ll just set the relevant environment variables:

export NODE_PATH="${GIT_ROOT}/node_modules"
export CARGO_TARGET_DIR="${GIT_ROOT}/target"

Now things are starting to look pretty good!

But we can do better.

If you only change 1 file out of 10,000, you might want to just check that one file and skip the rest. We can do just that.

First, let’s get the list of the files we should check:

git diff --cached --name-only --diff-filter=AM

Here, the “AM” value passed to diff-filter is two flags, “A” for “added files” and “M” for “modified files”. We can filter further with grep to keep only the files ending in “.py” for example.

Then, we feed this list as arguments to our linting executable. This is done easily with xargs, which will take each line from the input and pass it as an argument to the command. For instance:

(echo a; echo b; echo c) | xargs ls
# same as
ls a b c

Since the list might be empty, remember to use --no-run-if-empty. Without it, you will run the linting command without argument, which might fail, or worse3, lint all the files. So we could have:

git diff --cached --name-only --diff-filter=AM |
    grep '\.[hc]$' |
    { cd $TEMPDIR; xargs --no-run-if-empty ./lint-fast c; }  # lint from temp directory

Note: The paths listed by git diff are relative to the root of the repository, so you’ll want to run the linting command from that root, even if all the files are in a subdirectory. This could also mean that you may have to explicitly point the linting command to the configuration file. For instance, if your TypeScript files are in a front/ directory along with the Biome configuration file, you will need to do:

git diff --cached --name-only --diff-filter=AM | grep -E '\.tsx?$' |
    (cd $TEMPDIR; xargs --no-run-if-empty npx @biomejs/biome check --config-path="front/")

The Script

Finally, with all the pieces coming together, we have a pre-commit hook which is both reliable and fast. Don’t forget to add the pre-commit script itself to your repository, and enable it with ln -s ../../pre-commit .git/hooks/ (you cannot directly track files in .git/).

You might not need TypeScript and Rust and Python and C and C++. But you can just pick the relevant parts.

#!/usr/bin/env bash
# Usage: copy this file to .git/hooks/

# Exit at first error
set -Eeu

# To handle partially committed files, we must copy the staged changes to a
# separate location
# See also https://stackoverflow.com/a/36793330
TEMPDIR=$(mktemp -d)
trap "rm -rf $TEMPDIR" EXIT SIGHUP SIGINT SIGQUIT SIGTERM
git checkout-index --prefix=$TEMPDIR/ -af

# keep using the same node_modules/ and target/ directories, not a new one in
# the temporary directory; this avoids re-parsing everything from scratch every
# time we run the script
GIT_ROOT=$(git rev-parse --show-toplevel)

# TypeScript
export NODE_PATH="${GIT_ROOT}/node_modules"
git diff --cached --name-only --diff-filter=AM | grep -E '\.tsx?$' |
    (cd $TEMPDIR; xargs --no-run-if-empty npx @biomejs/biome check)

# Rust
export CARGO_TARGET_DIR="${GIT_ROOT}/target"
(cd $TEMPDIR; cargo fmt --check)
(cd $TEMPDIR; cargo clippy --all -- --deny warnings)

# Python
(cd $TEMPDIR/back; ruff check .)

# C
git diff --cached --name-only --diff-filter=AM |  # list modified files
    grep '\.[hc]$' |  # filter C files
    { cd $TEMPDIR; xargs --no-run-if-empty ./lint-fast c; }  # lint from temp directory

# C++
git diff --cached --name-only --diff-filter=AM |  # list modified files
    grep '\.[hc]pp$' |  # filter C++ files
    { cd $TEMPDIR; xargs --no-run-if-empty ./lint-fast cxx; }  # lint from temp directory

For C and C++, I use the following lint-fast script:

#!/usr/bin/env bash
# Exit at first error
set -Eeuo pipefail

MODE="${1}"
shift

# trailing space check
if grep -Hn '\s$' "$@"; then
    echo "Error: trailing space"
    exit 1
fi

# check for NULL instead of nullptr in C++ files
if test "${MODE}" = "cxx"; then
    grep -Hn --color NULL "$@" && exit 1
fi


GCCFLAGS=(
    # do not actually compile
    -fsyntax-only
    # fail script when issue is found
    -Werror
    # usual paranoid options
    -Wall
    -Wconversion
    -Wextra
    -Wpedantic
    -Wshadow
    -Wvla
)

# C++ specific options
if test "${MODE}" = "cxx"; then
    GCCFLAGS+=(
        -std=c++17
        -Wold-style-cast
    )
fi

gcc "${GCCFLAGS[@]}" "$@"

#note: one flag might come from pre-commit hook
CPPCHECK_FLAGS=(
    --suppress=noExplicitConstructor
    # no point when we only check a subset of the files
    --suppress=unusedFunction
    # disable lookup for system includes
    --suppress=missingIncludeSystem
    # sometimes makes questionable suggestions
    --suppress=useStlAlgorithm
    # otherwise, it complains when not enough files are included
    --suppress=unusedStructMember
    # otherwise, it will complain it has not found a suppressed warning
    --suppress=unmatchedSuppression
    # enable everything else
    --enable=all
    # fail script when issue is found
    --error-exitcode=1
    # only print something when there is an issue
    --quiet
    # ignore inlined JS code
    -DEM_ASM=
)

cppcheck "${CPPCHECK_FLAGS[@]}" "$@"

# clang-tidy options
CLANG_FLAGS=(
    --checks=-clang-analyzer-security.FloatLoopCounter,-clang-analyzer-valist.Uninitialized,-clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling,-clang-analyzer-optin.performance.Padding
    --warnings-as-errors=*
)
if test "${MODE}" = "cxx"; then
    CLANG_FLAGS+=(
        -extra-arg=-std=c++17
    )
fi
clang-tidy "${CLANG_FLAGS[@]}" "$@" --
  1. Of course, if we’re linting with a call to npx, it will probably just fail since the command is not installed in the local project. ↩︎
  2. Yes, I know, I just said we shouldn’t touch the working directory. But we’re not, npm and cargo are! … Okay, let’s just say it’s a special case and we are being very careful here. ↩︎
  3. This is worse because this could stay under the radar for a long time and waste time. An error can usually be diagnosed quickly and fixed once and for all. ↩︎

Tiny Docker Containers with Rust

The Magic

Thanks to Python, I have gotten used to Docker images that take minutes (or dozens of minutes) to build and hundreds of megabytes to store and upload.

FROM python:3.14

# install everything
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
    libsome-random-library0 libanother-library0 \
    libmaybe-its-for-pandas-i-dunno0 libit-goes-on0 liband-on0 \
    liband-on-and-on0 \
    && rm -rf /var/lib/apt/lists/*

# install more of everything
COPY requirements.txt .
RUN pip3 install -r requirements.txt

ENTRYPOINT ["gunicorn", "--daemon", "--workers=4", "--bind=0.0.0.0:8080", "app:create_app()"]

But there is another way. The Rust way. Where you build your image in a second, and it only takes 5 MB1.

FROM scratch
COPY target/release/my-binary .
ENTRYPOINT ["./my-binary"]

The Trick

This is possible thanks to the points below.

  1. Rust is compiled to binary
  2. Rust is statically2 compiled to binary
  3. You can easily use musl3 with Rust

The first point means that there is no need for a runtime to interpret the script or the bytecode.

The second point means that the binary contains the code for all the libraries4. And, more specifically, only the required code. So no need to install them externally, you remove some overhead, and the total size is reduced.

With these two, you get down to just a few runtime dependencies. Namely, glibc and the ELF loader5.

$ ldd target/release/my-binary
linux-vdso.so.1 (0x00007fffbf7f9000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f7a759b2000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f7a758d3000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f7a756f2000)
/lib64/ld-linux-x86-64.so.2 (0x00007f7a762a6000)

But you can go further! We do not have to use glibc. It turns out you can just statically compile musl. And with Rust, using musl is only a matter of installing musl-tools and adding --target=x86_64-unknown-linux-musl to your cargo build command.

And, since we just got rid of our last runtime dependency, we do not even need the ELF loader! So now, we get:

$ ldd target/x86_64-unknown-linux-musl/release/my-binary
	statically linked

All the user code is now in a single file and will talk to the kernel directly through system calls (the syscall instruction on x86-64).

Conclusion

This is a nice trick, and can help you if you really care about small Docker images, but it does have a drawback: targeting musl typically gives you lower performance. So keep this in mind!

  1. Sure, I am simplifying things here. In practice, you would also build the binary in a Dockerfile, and then use staging. But that does not change the point. ↩︎
  2. Static linking means including external libraries within the binary, as in opposition to dynamic libraries, where they are not included. Binaries that were dynamically linked will need to be provided with the external libraries at runtime (.so files on Linux, .dll files on Windows). ↩︎
  3. musl is an implementation of libc, the standard library for C. Virtually all binaries depend on the libc. glibc is the most common implementation of libc on Linux. ↩︎
  4. Of course, it won’t work if a library is using FFI to talk to OpenSSL for instance ↩︎
  5. The ELF loader is another program that helps your program to start up. In particular, it tells your program where to find the dynamically linked libraries. ↩︎

Where to Start with Rust

As a Rust evangelist, I have been asked a few times for resources to start learning Rust. I have compiled my thoughts in this article. The resources are listed in a rough order: you should take a look at the first ones before moving on to the later ones. Do not take it as gospel, and do peek at the rest of the list before completing even the official tutorial.

The Rust Programming Language

This is the official tutorial for Rust. It is also known as “the Rust book”. I do not recommend you read it entirely, but you should read at least the first 12 chapters. Of course, the rest of the chapters are worth a read, but the more advanced topics are better covered by dedicated resources.

Rust for functional programmers

If you are used to programming in Haskell or OCaml, this tutorial might help get you up to speed quicker.

Tour of Rust

For a more hands-on approach, you can use this tutorial, which drops you in a Rust playground with minimal exercises that each train you on a new concept.

Rust by Example

If the Tour of Rust is not enough, you can also try this one to complement your knowledge.

Learning Rust With Entirely Too Many Linked Lists

This tutorial is about getting over one of the main pain points of Rust: cyclic references. It goes over various solutions, how they work, and their pros and cons. A must-read if you want to become more comfortable with Rust.

Learn Rust the Dangerous Way

You can see this tutorial as an extension of the previous one. When you are used to developing in C, switching to idiomatic Rust can seem like a daunting task. This website will take you from a C program (the n-body problem from the Benchmarks Game) to a fully idiomatic and optimized Rust one.

Rustlings

At this point, you should be getting comfortable with the syntax and main idiosyncrasies of Rust. This collection of exercises will let you hone your skills.

Zero To Production In Rust (book)

To go further than writing small snippets of Rust, you will want to start taking into account higher-level considerations. This book narrates every step in setting up a simple mailing-list server in Rust.

Macros in Rust

Writing macros is a fairly advanced topic in Rust. This tutorial will get you more familiar with writing declarative macros (or “macros by example”).

The Rustonomicon

Most of the time, you should not need to touch unsafe. But, when you do, you enter a parallel world where every step you take could lead you to the dreaded undefined behaviors. Using unsafe safely requires thinking about many things. This tutorial will help you get started with this.

Rust for Rustaceans (book)

Like all non-trivial software projects, Rust has its quirks and unintuitive mechanisms. This book is a very dense explanation of some of this crust. You might also enjoy the author’s YouTube channel.

This Week in Rust

This is the one-stop-shop to keep learning more about Rust, its ecosystem and its community. You do not need to follow all the links, or even read it from start to finish, but skimming the latest edition every week will let you stay on top of things.

Rust Koans

I conclude this list of heavy technical content with some light reading, likely inspired by the Rootless Root.


Client-Side Password Hashing

tl;dr: When you think about it, hashing passwords only client-side (in the browser, in JavaScript) is not as terrible as you might think. And if you do most of the work client-side, you get something that is actually pretty secure.

Storing Passwords

So, I have been working on modernizing a Web application and fixing some of its issues. As someone who has spent way too much time implementing cryptographic primitives and cracking password hashes for fun, one of the main problems was that passwords were only hashed client-side.

For context, we learned quite a long time ago that storing raw passwords in a database was a bad idea. If the database leaked, all passwords were immediately compromised. Since people tend to reuse the same passwords for various services, it meant that compromising a non-sensitive service would allow an attacker to access very sensitive accounts of many individuals.

So we encrypt them. Except we don’t. If the database leaks, the encryption key will probably leak as well. Instead, we cryptographically hash the password. That is, we transform it into something completely different, irreversibly. So, “correct horse battery staple” would become “c4bbcb1fbec99d65bf59d85c8cb62ee2db963f0fe106f483d9afa73bd4e39a8a”.

How your password is saved when you register, and how it is checked when you log in

Because the same password always gets hashed the same way, we can authenticate people by comparing the hashes instead of the passwords themselves. No need to store the password in the database! And if the database leaks, an attacker cannot find “correct horse battery staple” from “c4bbcb1fbec99d65bf59d85c8cb62ee2db963f0fe106f483d9afa73bd4e39a8a”.

Well, except they can.

Slowing down attackers

Although cryptographic hashes are designed to be hard to reverse directly, an attacker who got access to the hashes can just take a guess, hash it, and check whether it matches the hash of a user. If it does, they have found a password.

Since cryptographic hashes are also designed to be very fast, an attacker can try many guesses in a short amount of time.

The average password entropy is 40 bits. The median is likely to be lower. A modern CPU can compute more than 1 million SHA-256 hashes per second. So trying every password with 40 bits of entropy would take less than 2 weeks (2⁴⁰ guesses at one million guesses per second is about 1.1 million seconds, or roughly 13 days).

The number of characters is not everything. What matters is how the password is generated. Don’t pick a password yourself: since humans are bad at randomness, the entropy will be even lower!

Thankfully, we can make things more difficult for the attacker. What if, instead of storing SHA-256(password), we stored SHA-256(SHA-256(password))? It would make things twice as hard for the attacker, and would not change much when we validate a user’s password for authentication. We can go further! Instead of computing SHA-256 just twice, we can make it take ten, a hundred, a thousand iterations! We can choose a number such that hashing the password takes 10 ms on the server. In that case, it would take the attacker 350 years of CPU time! This is basically PBKDF2.

The more iterations, the longer it will take to crack a password. Of course, verifying the password when authenticating will also take longer, but it should only be a few milliseconds.
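
To make the idea concrete, here is a rough sketch of this naive iterated construction in C, using OpenSSL’s SHA256 function. It only illustrates the key-stretching idea: it is not real PBKDF2 (which also mixes in a salt and uses HMAC), and the iteration count is an arbitrary placeholder.

#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>  /* link with -lcrypto */

/* Arbitrary placeholder: tune it so that one hash takes ~10 ms on your server */
#define ITERATIONS 100000

/* digest = SHA-256(SHA-256(...SHA-256(password)...)), ITERATIONS times in total */
void slow_hash(const char *password, unsigned char digest[SHA256_DIGEST_LENGTH]) {
    SHA256((const unsigned char *) password, strlen(password), digest);
    for (int i = 1; i < ITERATIONS; i++) {
        /* SHA256 consumes its input before writing its output, so hashing in place is fine */
        SHA256(digest, SHA256_DIGEST_LENGTH, digest);
    }
}

int main(void) {
    unsigned char digest[SHA256_DIGEST_LENGTH];
    slow_hash("correct horse battery staple", digest);
    for (int i = 0; i < SHA256_DIGEST_LENGTH; i++) {
        printf("%02x", digest[i]);
    }
    printf("\n");
}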

Except that the attacker can do something that we cannot: it can try many guesses in parallel. Today’s CPUs can have dozens of cores. GPUs scale even more. And, with Bitcoin/blockchain/web3/NFTs, people have started building dedicated hardware to hash SHA-256 en masse. For a few thousand dollars, you can get hardware to crack SHA-256 at tens of billions of guesses per second. Even with our clever scheme, the attacker is back to 2 weeks to try every password1.

That’s cheating!

Yeah, it is! But we can do something about that. GPUs and dedicated hardware are designed for very specific tasks. We can take advantage of that to make it very hard to crack passwords with them. The trick is to replace SHA-256 with a hash function that requires a lot of memory to compute. It turns out, people have designed a very good one: Argon2id2.

That’s basically the state of the art in storing passwords3.

Client-Side Only Hashing

The client (your web browser) just sends the password over the network. Then, the server has to hash it before storing or checking it.

Now, that’s quite a bit of work for our server. We need to do a significant amount of computation every time a user tries to connect. What if we had the user’s client do it instead?

Instead of sending the password, and having the server hash it, let’s imagine we use some JavaScript to hash the user’s password with Argon2id, and then send it over the network. Now, the server only needs to compare the hash to the one in the database!

Now, the client does most of the work.

There is one issue though: from the point of view of our Web application, the actual credential is not the password anymore, but the hash. So if an attacker gets the hash, they can authenticate as the user. In other words, a database leak means the credentials leak!

It’s not as bad as in the initial scenario, however. These credentials are only valid for our service. And it is still very hard to find the original passwords from the hashes.

Mostly Client-Side Hashing

We can improve this with a simple observation: since it is very hard to get the password back from the hash, we can treat the hash like a high-entropy password.

In this case, brute-force attacks become irrelevant, and we do not need to iterate or use memory-hard functions. Instead, we could just use a fast hash. It looks like this:

  1. The browser computes an intermediate hash with a slow memory-hard function
  2. The browser sends the intermediate hash to the server
  3. The server hashes the intermediate hash with a fast function, getting the final hash
  4. The server compares the final hash with the one in the database

From the point of view of the server, the effective credential is the intermediate hash. However, if the database leaks, the attacker only gets the final hash. It would be very hard to find the intermediate hash directly from the final hash, since the input space is very large. And it would be very hard to find the user’s password from the final hash, since a memory-hard function is involved.

The nice thing is that all the heavy work is offloaded to the browser, while still keeping the password secure!

The one drawback that I can think of is that this is not enforced by the back-end. If the front-end (especially a third party one) does not do proper memory-hard hashing, passwords might become exposed.

A positive side effect is that the back-end never sees raw passwords. It can be a good mitigation for inadvertently leaking information in the logs. Leaking the intermediate hash still allows an attacker to authenticate as a user, but they are not able to get their raw password and compromise the user’s accounts on other websites.

Conclusion

A priori, client-side hashing is a terrible idea from the point of view of security. However, looking more closely at it, it looks like it can actually make sense. But we should still use hashing on the server to avoid any risk of leaking credentials.

  1. Note that, without the iterated hashing, it would take just 2 minutes with dedicated hardware. So that’s still better than not doing it. ↩︎
  2. scrypt came first, and is pretty good as well ↩︎
  3. There is also salting, but that’s orthogonal to this discussion ↩︎

Strongly Typed Web Apps

I Am Old

I like software to be correct. I know. Controversial, right? But as I have gained experience tinkering with various technologies, and building hobby project after hobby project, with some professional projects in-between, I have come to realize that I do not want to just pray and hope. I want to know that my programs work. And I do not want to spend 99 % of my time manually testing them.

When I first encountered Python, coming from a QBasic, C and C++ background, it was bliss not having to declare types manually. Instead of wasting time writing boilerplate, I could actually make my programs do what I wanted them to do.

But, as my projects grew more ambitious, I had to become stricter and stricter to make sure I found bugs as early as possible. So I wrote tests, set up linters, wrote tests, added back typing through type annotations, and wrote tests.

Type annotations get a bad rap, but they are actually useful, unlike C types.

In C, types are mostly about memory layout, and not so much about correctness. When it comes to semantics, a char is an int, a short is an int, and an int could be a long, or a float. Also, an int could be a short1. And of course, you can add strings to arrays, pointers to integers, and floating point values to enumerations. Because, in the end, it’s all the same to C2.

With type annotations, you start to get the point that functional aficionados have been trying to make since forever. “Yeah, that value could be None, so you need to check for that before you use it. By the way, you’re evaluating the addition between a hash table and a list; don’t.” Thanks, compiler3.

And when you start thinking like that, you might want to say: “Okay, this variable holds euros. I really do not want to add it to kilowatts.”. That’s when you know it’s too late. But now you have implemented a full-fledged system for dimensional analysis in your type system. And it is glorious. You have forgotten what it was meant for, though.

What does this have to do with Web Apps? Seriously, not much. But there is a point!

Typed APIs

Web Apps have two fully separated software components talking to each other: the frontend, and the backend. And it’s very easy to introduce bugs at the interface between the two. Especially when they use different languages4.

Of course, the solution is XML. As everyone knows, you simply write an XSD or DTD specification for your format. Then, you can easily transform your data for consumption thanks to XSLT, and even handle presentation with XSL. You might even have something working within ten years!

Okay, okay, everyone is doing JSON these days. Thankfully, we do have a way to describe what a JSON document contains: Swagger, now known as OpenAPI.

Fine. So you write this OpenAPI thing. Then what?

Well, that’s where the current picture is still lacking. We now have three components and two interfaces:

  • backend code
  • OpenAPI specification
  • frontend code

How do we get out of manually checking everything and make sure it works?

First, we can generate some frontend code from the OpenAPI specification, which will handle all communications with the backend. For this, I use openapi-typescript-codegen. Okay, that one mostly works fine.

Now, the fun part is to make the backend code and the OpenAPI code work together.

The convention is to generate the OpenAPI specification from the backend code, and not backend code from the OpenAPI specification. In Python, you might be familiar with Flask-RESTful, then Flask-RESTplus, now Flask-RESTX. In Rust, you could be using Utoipa.

In principle, this is great: you add some annotations to your backend code, and you can be sure that your frontend code sees the same types!

However, this is still a very immature space. You can very easily document the wrong type in the annotations, and you won’t get a problem until you try to actually use the generated OpenAPI specification in the front-end.

Conclusion

Things have definitely improved since the days I was writing PNG parsers in QBasic. But we still have a long way to go until we really pick all the low-hanging fruit for software correctness. I want the compiler to shout at me more often!

  1. Yes, I am aware of type ranks. The point is that none of this helps you make your program more correct. ↩︎
  2. Sure, you cannot add a struct to a bit-field. But guess how you pass a struct around? ↩︎
  3. Or mypy in this case. ↩︎
  4. Sure, you might enjoy using JavaScript in your backend, since you already have to use it for your frontend. Not. ↩︎

Overkill Debugging

Hack that Code!

I have a not-so-secret technique that I use whenever I encounter a bug that evades my grasp. Or when I am lazy. Okay, mostly when I am lazy.

The idea is no more than applying the concept of binary search to source code:

  1. Remove half the code
  2. Compile and run
  3. If the bug is still present, repeat with the remaining code
  4. Otherwise, repeat with the code that was removed in 1

Of course, this naive approach is rarely practical. For one, removing half the code indiscriminately tends to make things not compile. Also, sometimes, a bug manifests because of the way the first half and the second one interact.

So, in practice, it looks more like:

  1. Remove some big chunk of code
  2. If that fails to compile/run, try another chunk, or reduce the size
  3. Compile and run
  4. If the bug is still present, repeat with the remaining code
  5. Otherwise, put back the chunk of code, and try removing another one (can be a sub-chunk of the initial chunk)

Note: sometimes, you will still need to do a mini-refactor to keep making progress.

This works surprisingly well. And it is very satisfying to remorselessly hack through code that was written painstakingly over hours, months or even years. And then, getting down to a small subset of code that clearly exemplifies the bug.

Automate the Fun Parts

Software engineers might be the group of people who work the hardest to make themselves redundant.

(xkcd’s “Is It Worth the Time?” chart, whose hover text warns: “Don't forget the time you spend finding the chart to look up what you save. And the time spent reading this reminder about the time spent. And the time trying to figure out if either of those actually make sense. Remember, every second counts toward your life total, including these right now.”)
I am guilty of spending way too much time automating tasks that only took a trivial amount of time.

So, of course, someone already automated the process I described above. It is called C-Reduce. It is meant to find bugs in the compiler rather than in the compiled program, but the principle is the same. If your program is written in C and you can write a small script that tests whether the bug is present, then it will automatically slash through your code.

To give you an idea, consider the code below.

#include <stdlib.h>
#include <stdio.h>

void f(void) {
    int x = 42;
    int y = 83;
    int z = 128;
    printf("%d\n", x  + y);
}

void g(void) {
    int x = 1;
    int y = 2;
    int z = 3;
    printf("%d\n", x  + y);
}

int h(void) {
    return 42 * 43;
}

int main(void) {
    h();
    f();
}

#include <stdarg.h>
#include <time.h>

As well as the following Bash script.

#!/usr/bin/env bash
set -Eeuo pipefail
gcc -o main main.c
./main | grep -q ^125$

Then creduce test.sh main.c reduces the C source file to:

main() { printf("%d\n", 42 + 83); }

This was really a toy example, but this program is very useful when you have large source files. For instance, C++ ones *shivers*.

Audio Millisecond Mistiming

I have been working on a tool to practice Morse code using jscwlib, the library behind LCWO. I want to play an infinite flow of words in Morse code. The algorithm is the fruit of decades of research:

  1. Play a word
  2. When it has finished playing, play another word

While dogfooding, I realized that the speed sometimes increased unexpectedly. This was subtle enough that I did not notice the transition from normal speed to the higher speed.

Over my next practice sessions, I kept an eye (or rather an ear) on this phenomenon, and eventually realized that:

  • The space between the words did get shorter
  • The duration of the characters remained the same

From these two observations, I was able to devise a more reliable way to detect the presence of the bug. First, I made all the words just the letter “e”, which is a single “dit” in Morse. Second, I measured the time between the start of two words with new Date(). Finally, I showed the durations on the console.

This is not a very precise method, but it enabled me to see the duration of spaces reducing millisecond by millisecond over a few seconds. The effect became noticeable after a few hundred words were played.

With this, I could test the presence of the bug in mere seconds, instead of minutes of practice. So I went slashing.

After a short while, I found the culprit, wrote a fix, and opened a PR!

Conclusion

In the end, it took me much more time noticing the bug, understanding what it looked (or rather sounded) like, and finding a way to check for it effectively, than actually identifying its origin in the source code.


Float Woes in C

The C programming language is a gift that keeps giving. Today, we are going to see how a seemingly banal and common operation can hide unfathomable depths of unmentionable horrors.

Woes with Integer Coercion

What is the problem of the code below?

// returns the closest integer to f
int float2int(float f) {
    return (int) f;
}

It’s written in C.

A Rust enthusiast

That’s… a good point.

But let’s assume we actually want to write C, for some forsaken reason. And we do not care whether we convert up, down or sideways.

There is still a fundamental issue with this code. What happens when we call float2int with INFINITY?

#include <math.h>
#include <stdio.h>

int float2int(float f) {
    return (int) f;
}

int main(void) {
    printf("%d\n", float2int(INFINITY));
    printf("%d\n", float2int(-INFINITY));
    printf("%d\n", float2int(NAN));
}

When we compile this code with gcc and run the result, we get:

$ gcc -O3 test.c && ./a.out
2147483647
-2147483648
0

Makes sense! The largest integer is 2147483647; the smallest is –2147483648, and what would you choose for NAN but 0?

Someone still young and innocent

Now, let’s just remove that -O3 option:

$ gcc test.c && ./a.out
-2147483648
-2147483648
-2147483648

What?

The innocent who got their world shattered

Wait, it gets better:

$ clang -O3 test.c && ./a.out
1464089272
1488257696
1488257696
$ ./a.out
1259459480
-1806736736
-1806736736
$ ./a.out
-2098630344
1664811680
1664811680

But… why?

All hope is gone

Because it can.

Yup, that’s undefined behavior. Converting a non-finite value (i.e. an infinite or a NaN) to an integer is undefined behavior.

But that’s not all! What should the largest finite floating-point value (3.4028234664e+38) convert to? It’s much bigger than INT_MAX, the largest value that an int can represent (it does not matter whether you are on a 32, 64 or 128 bit architecture, really).

Maybe we could just say that all the floating-point numbers larger than INT_MAX should saturate to INT_MAX when converted? But alas, it makes for unwelcome surprises down the line, when you realize that the rounded value is not always within 1 of the original value.

Thankfully, the C standard always has the solution whenever there is a hint of ambiguity:

When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.

Part 6.3.1.4 paragraph 1 of the C standard

In other words, we’re just going to say all the values that are not quite that convenient trigger undefined behavior. The values that “can be represented by the integer type” int are exactly those in the interval [INT_MIN, INT_MAX]. Attempting to convert any float value from outside this range to the int type is undefined behavior.

We’ll just have to check for that!

Someone has not learned their lesson

Woes with Range Checking

We are just getting to the actually tricky part. Let’s have a look at a seemingly robust implementation:

#include <limits.h>
#include <math.h>

int float2int(float f) {
    if (isnan(f)) {
        return 0;
    }
    if (f < INT_MIN) {  // also filters out -INFINITY
        return INT_MIN;
    }
    if (f > INT_MAX) {  // also filters out +INFINITY
        return INT_MAX;
    }
    return (int) f;
}

For the sake of simplicity, we are just providing arbitrary values for the values that we cannot convert. Other implementations might choose to return a status indicating whether the conversion is possible or not.

But, of course, there is a bug lurking in there. And it is extremely easy to miss.

And, this time, it is not caused by an undefined behavior, but by another surprising “feature” of C: implicit type conversions.

To understand what I mean by that, we need to look at what the conditions in the code above really mean. The first one is fine, so let’s look at the other two: f < INT_MIN and f > INT_MAX. In both, the left operand is a float, and the right operand is an int. Processors rarely have such mixed built-in operations. So a conversion must be taking place. The way it happens is described in the “usual arithmetic conversions”:

[…]

if the corresponding real type of either operand is float, the other operand is
converted, without change of type domain, to a type whose corresponding real type is float.

[…]

Part 6.3.1.8 paragraph 1 of the C standard

Thankfully, we only need the part I have quoted here. In short, both operands get converted to float. Let’s look at our conditions again.

if (f < (float) INT_MIN) {  // also filters -INFINITY
    return INT_MIN;
}
if (f > (float) INT_MAX) {  // also filters +INFINITY
    return INT_MAX;
}

Let’s look at the first condition. What is the value of (float) INT_MIN? Let’s assume a 32-bit two’s-complement int type. Then INT_MIN is -2³¹. A float can represent this value exactly. You can check it out with an online converter. So (float) INT_MIN is -2³¹ as well. No problem here, this conversion does not change the value, and this code does exactly what we want.

With the same assumption, INT_MAX is 2³¹ – 1. And there comes the catch: a float cannot represent 2³¹ – 1 exactly. If you put “2147483647” in the online converter, you will see that it is rounded to a float whose value is actually “2147483648”. It makes sense: a float trades off being able to represent all integers in [-2³¹, 2³¹ – 1] in order to cover a much wider range and many more magnitudes. In other words, the actual value of (float) INT_MAX is INT_MAX + 1. The condition is actually doing:

if (f > INT_MAX + 1) {  // also filters +INFINITY
    return INT_MAX;
}

Note: if this were C, this code would contain an undefined behavior because of the signed integer overflow that happens when evaluating INT_MAX + 1. This is just pseudocode for illustration purposes.
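
You can check the rounding behind this yourself with a couple of printf calls (a small demo, assuming a 32-bit int):

#include <limits.h>
#include <stdio.h>

int main(void) {
    /* the float arguments are promoted to double by printf, which does not change their values */
    printf("%.1f\n", (float) INT_MIN);  /* prints -2147483648.0, i.e. INT_MIN exactly */
    printf("%.1f\n", (float) INT_MAX);  /* prints 2147483648.0, i.e. INT_MAX + 1 */
}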

Since the upper bound is offset by one, the code fails to filter out the half-open interval (INT_MAX, INT_MAX + 1]. If the input parameter f were to take its value from this range, converting it to int would be undefined behavior. Now, is there any float value in this interval? Yes! INT_MAX + 1 of course (it’s the only one).
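
One possible fix, sketched below under the same 32-bit assumption, is to compare with >= instead of >: since (float) INT_MAX is exactly 2³¹, any float greater than or equal to it is out of range, while any float strictly below it fits in an int.

#include <limits.h>
#include <math.h>

int float2int(float f) {
    if (isnan(f)) {
        return 0;
    }
    if (f < INT_MIN) {  /* (float) INT_MIN is exactly -2^31; also filters out -INFINITY */
        return INT_MIN;
    }
    if (f >= INT_MAX) {  /* (float) INT_MAX is exactly 2^31, one past INT_MAX; also filters out +INFINITY */
        return INT_MAX;
    }
    return (int) f;
}

The comparisons still happen in float, but both boundary values are now exactly representable, so no value can slip between the checks and the cast.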

Conclusion

The “robust” implementation does not handle 2³¹ properly. This is not just a theoretical oddity. I encountered this issue in critical code in production while working at TrustInSoft. Such bugs will easily stay under the radar and avoid even the most intensive test campaigns. Only formal methods help in detecting and avoiding such errors.

The worst thing is that, since this is undefined behavior, you might encounter it many times, and never know, corrupting data in unknown ways. Things that could have been a runtime error become the entry door for silent bugs and security issues.

C likes to keep you on your toes.


Flexible Array Members: Typical C Shenanigans

Speed. I am Speed.

The one thing that C cares about is speed. Seriously, you might not get the right results, but at least, you will get them fast.

Let’s look at two common ways to represent a vector in C.

struct vector {
    unsigned long size;
    int data[42];
};

The first one is to just use the built-in array type. This approach is simple, but implies choosing the size of the vector when compiling. The program cannot decide the size depending on user-input or some other runtime parameter.

Note: Variable-Length Arrays (VLA) do allow declaring arrays whose length is decided at run-time. However, this is limited to local variables and is an optional feature since C11.

struct vector {
    unsigned long size;
    int* data;
};

The alternative is to not use an array at all. Instead, we use a pointer to a memory buffer that will contain the data. This memory location can be created with malloc, for instance.

Pointers can be used like arrays almost transparently, which keeps most of the convenience (the “almost” means that most C programmers are paranoid about using sizeof on such expressions). The example below shows how accessing an element at some index uses the same syntax in both cases.
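
For illustration, here is a minimal sketch of such an accessor for the two representations (the names are adjusted so both versions can coexist in one file; the exact snippet in the original post may differ):

// Built-in array: the data lives directly inside the struct
struct vector_array {
    unsigned long size;
    int data[42];
};
int vector_index_array(const struct vector_array *v, unsigned long index) {
    return v->data[index];  // one memory read: at v + 8 + index * 4 (on x86-64)
}

// Pointer: the struct only stores the address of the data
struct vector_pointer {
    unsigned long size;
    int *data;
};
int vector_index_pointer(const struct vector_pointer *v, unsigned long index) {
    return v->data[index];  // two memory reads: first v->data, then v->data + index * 4
}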

But although the syntax is the same, the meaning is different. Look at the assembly code a compiler generates to implement vector_index: indexing the built-in array uses a single MOV instruction; indexing through a pointer uses two MOV instructions.

It makes sense. rdi contains the address of the struct and rsi the value of index. In the first case, we shift by 8 bytes to skip the field size, then we move by index * 4 bytes to select the right array element, then we read the value at this memory location (indicated by the brackets in the Intel assembly syntax). In the second case, we need to first recover the value of the field data, at an offset of 8 bytes from the beginning of the struct, then read at the right offset from the address it contains.

In other words, using a pointer instead of a built-in array is twice as slow! Okay, not really. Any modern CPU is going to recompile, pipeline, prefetch and out-of-order execute this such that it really won’t matter. But it used to! And it might still matter on microcontrollers.

In fact! I did some benchmarks, and, although there was no difference on my laptop (CPU model i5-4210H), I did see a tremendously large difference on my Arduino Leonardo (microcontroller model ATmega32u4). Indeed, the code with the additional indirection through a pointer was 10 % slower.

Yeah, I did expect it to be worse than that. But still, 10 % was not something to scoff at in the previous millennium. Also, CPUs at the time might have seen a larger difference than my ATmega32U4 released not that long ago.

In summary, embedding a built-in C array is fast, but inconvenient. Using dynamic allocation with malloc is more flexible, but not as fast. How can we make our compiler generate faster code while keeping the convenience?

Well, we can lie to it.

Lies. So Many Lies.

Let’s look at how we will typically create a vector with the first approach:

struct vector {
    unsigned long size;
    int data[42];
};
struct vector *vector_new(unsigned long size) {
    assert(size == 42);
    struct vector *ret = malloc(sizeof(struct vector));
    ret->size = size;
    return ret;
}

Notice how malloc is just another C function, that can take any value as an argument at runtime? Notice how we are forcing ourselves to give it a compile-time value (the size of our struct)? What if we actually allocated the size we wanted?

For instance, if we allocated sizeof(struct vector) + 100 bytes, that would mean we would still have the 42 elements from the definition of struct vector, but we would have 100 additional bytes, enough to store 25 more 4-byte integers!

And so, a new idiom was born:

struct vector {
    unsigned long size;
    int data[1];
};
struct vector *vector_new(unsigned long size) {
    struct vector *ret = malloc(
        sizeof(struct vector)
        + sizeof(int) * (size - 1)
    );
    ret->size = size;
    return ret;
}

Since we are going to lie to the compiler about the size of the actual array anyway, we make the base size as small as possible. Since the C standard does not allow 0-length arrays (“An array type describes a contiguously allocated nonempty set of objects”), we just use 1. Then, when we create an instance of our vector, we are going to make sure to actually have enough memory for size elements. Since there is already 1 built in, we need to allocate room for size - 1 additional elements on top of the base struct.

C developers embraced this hack, wrote many fast programs, and lived happily ever after.

Well, kind of. I mean, if you write this, your program is probably going to be fast. Just as fast as a fixed-size array. But maybe it won’t. And maybe it will allow random people to take control of your computer. Or maybe it will blow up the spaceship. Who knows?

You might know where I am going with this: this is undefined behavior.

Yes, we did allocate enough memory to do what we want. But this is not the point. We lied to the compiler. And compilers are very delicate, gullible beings. They will believe you. They will believe that your array is actually of size 1. And they will do unmentionable things to your code, because this assumption is false.

Save one Character, Save the Entire World

When they observe developers writing C, compiler authors and language lawyers alike get the urge to start again from scratch. This is actually the reason why there are so many flood myths through the ages.

This time, however, they recognized that the end justified the means. Compiler authors, too, want the programs they write to be fast (except C++ compiler authors).

And so, GCC introduced a way to do this:

struct vector {
    unsigned long size;
    int data[0];
};
struct vector *vector_new(unsigned long size) {
    struct vector *ret = malloc(
        sizeof(struct vector)
        + sizeof(int) * size
    );
    ret->size = size;
    return ret;
}

Notice the difference? We just replaced 1 by 0 for the size of the array.

But you said the C standard did not allow this!

Someone who pays attention to details

Indeed! But GNU C does! GNU C is the dialect that GCC understands, and it allows for 0-length arrays as an extension.

Of course, if you were using another compiler, you were out of luck… until!

Finally, in the C99 revision of the C standard, a new syntax was introduced!

struct vector {
    unsigned long size;
    int data[];
};
struct vector *vector_new(unsigned long size) {
    struct vector *ret = malloc(
        sizeof(struct vector)
        + sizeof(int) * size
    );
    ret->size = size;
    return ret;
}

Can you see the difference? We just removed the 0.

That’s all?

Someone who wasted too much time reading this; but worry not, it took even longer to write!

Yup. This syntax tells the compiler to not assume anything about the size of the array. You can still use it as usual. But ensuring that you allocate enough memory, and that you do not go out of bounds, is up to you.
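
For example, here is a sketch of what “up to you” means in practice: an accessor that enforces the bound we stored in size ourselves, since the compiler no longer knows anything about the length of data:

#include <assert.h>

struct vector {
    unsigned long size;
    int data[];  // flexible array member: the real length is only known to us
};

int vector_get(const struct vector *v, unsigned long index) {
    // the compiler cannot check this bound for us, so we do it explicitly
    assert(index < v->size);
    return v->data[index];
}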

Less is More

We went from 1, to 0, to nothing. Since there is nothing left to take away, we attained perfection. It’s a general truism with C: the less C, the better.


Why Undefined Behavior Matters

So Many Bugs

So, you’ve been programming for one month, or for ten years, and you have gotten familiar with Murphy’s and Sturgeon’s laws. And you are wondering if there is some way to catch all the bugs, not just spray-and-pray with regression/unit/integration/system/acceptance testing.

Or you are a curious onlooker, wondering why all these programmers cannot make a piece of software that works reliably, what with apps doing unexpected things and with all the security breaches you hear of.

First, what are we talking about? Essentially, a bug is anytime a system does not do what it is expected to:

The Mars Climate Orbiter disintegrated into Mars because a module was using non-metric units and forgot to convert. Seriously.

As you can see, there is quite a spectrum. I like to sort bugs in three categories:

Today, I would like to focus on the last one. It looks like it should be the most tractable, since the scope is well constrained. But it is still a non-trivial problem.

What we want is two-fold:

  1. Specify what we expect the program to do
  2. Verify that the program meets our specification

Even the first step is tricky. For the sake of simplicity, we are going to reduce the scope of what we want to achieve, and focus on a well-delimited set of things our program should do: have a well-defined behavior.

What does this mean? Let us say you wrote a program that does some kind of scientific computation, except you used meters instead of kilometers. Then, the result is going to be wrong, of course, but in a predictable way: the result will be off by a factor of 1,000. If you run the program again the same way, you should get the same result.

On the contrary, if your program contains undefined behavior, it could do unexpected things. It might output the right value; it might output a wrong value; it might crash; it might allow an attacker to take control of your computer. And it might behave differently every time you run it. Worse: it might run perfectly fine during all your tests, but output the wrong value when it is being used to pilot a real plane with real passengers.

How can we know if a program contains undefined behavior? First, we need to talk about programming languages.

Why is C an “Unsafe” Language?

When people talk about “undefined behavior”, they are usually referring to undefined behaviors in the C programming language. There is a reason why this particular language is criticized as being “memory-unsafe”, and why so many security vulnerabilities stem from code written in C. And that reason is that the C language leaves the door open for many bugs.

To understand why, we must first ask ourselves what “the C programming language” refers to. Who decides what it is, and what if someone does things differently?

Historically, C was loosely defined and had many dialects. Every compiler author would do things a bit differently. The people writing a compiler to compile “C” code to 8080 assembly made different choices than the ones targeting PDP-11. The same program could end up doing different things on different processors. Things were messy.

So, the C compiler writers, and the C developers, and the processor manufacturers joined together and agreed on a common definition for what “C” really meant. Thus was born the C standard.

The C standard had to try to align things between people doing very different things. And these people cared a lot about performance. Processors at the time were about a hundred times slower than the ones today. And I am only talking about clock frequency here; nowadays, each CPU core can do much more in a cycle than the CPUs at that time, and there are now many cores in a modern CPU. It makes sense that they really wanted to keep compiling C code to efficient assembly, to have programs run at decent speeds.

Now, what happens if the code tries to access memory it is not supposed to? Well, the compiler cannot just add runtime checks everywhere, since that would have a negative impact on performance. And it cannot always know in advance that this will happen, since it would require it to solve arbitrarily complex mathematical problems.

Because of this, the C standard required neither of these. Instead, if the code does such an invalid operation, it is said to have “undefined behavior”.

So what? If code crashes, I’ll notice the problem. If it outputs the wrong result, my tests will notice it.

Simplicius, my imaginary straw-man C developer

Wait a minute, renegade figment of my imagination! Undefined behavior means that the compiler is allowed to produce code that does anything. In particular, it is allowed to assume that undefined behavior never occurs in the program, and deduce things from that. And you can prove many things if you assume something which is false.

Let’s take an example. The code below uses a toy authorization function that just checks whether the user id (uid) belongs to a list of authorized ids. As you might have already guessed from the topic of this article, there is an undefined behavior in this code. First question: can you see it?

#include <stdio.h>
int authorized_uids[] = {0, 1001, 1005, 1012};
int is_authorized(int uid) {
    for (int i = 0; i <= 4; i++) {
        if (uid == authorized_uids[i]) {
            return 1;
        }
    }
    return 0;
}
int main(void) {
    if (is_authorized(4242)) {
        puts("You are now logged in as root.");
    } else {
        puts("You are a lowly common user.");
    }
}

Spoiler: answer below.

The loop is written as for (int i = 0; i <= 4; i++), so i takes the values 0, 1, 2, 3 and 4. Then we use it as an index to read from an array: authorized_uids. However, this array only contains 4 elements, at indices 0, 1, 2 and 3. There is no index 4 in this array. Trying to access index 4 is an out-of-bounds access, which is a type of undefined behavior.

Second question: what will happen when we compile and run this code? You can actually do it, but do try to have a guess first.
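If you want to try it at home, something like the commands below should do the trick (I am assuming here that the code is saved in a file named auth.c; pick whatever name you like):

$ gcc auth.c -o auth && ./auth
$ gcc -O2 auth.c -o auth && ./auth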

The universal answer is: anything can happen. Since there is an undefined behavior, the program could do anything. However, there are two common things compilers will do in this case.

Option 1. It can translate this C code to assembly in a rather naive way. For an x86 processor, it would translate authorized_uids[i] as an assembly instruction reading memory at the address authorized_uids + i.

This might seem a bit obvious, but the compiler does not have to do that. Even when there is no undefined behavior, the compiler is allowed to emit any code it wants, as long as it produces the same result. For instance, the compiler might find a faster series of instructions that achieves the same results. This is what “optimizing” means.

If we then run the compiled program, the CPU will try to read at authorized_uids + i. Most likely, however, this will still be in the valid memory of the program. Indeed, memory is usually allocated generously (segments for the stack, pages for the heap). So, the assembly instruction will not cause a crash; it will return some value. Most likely, the value will not be 4242, so the test will fail, and the user will not be authorized.

This is the middle column in the compiler explorer below. You can see at the top that the code generated for is_authorized contains various assembly instructions. And, at the bottom, you can see the output of the program: “You are a lowly common user.”.

Option 2. Now, let’s have a look at the rightmost column. The output is “You are now logged in as root.”! This is not what this program was supposed to do at all! And this is not just a fluke. It happens every time you run it!

Then, why is the result different this time? If you look closely, you will notice that the only difference is that we passed the option -O2 to the compiler. It tells the compiler to try harder to optimize the program.

In fact, if you look at the generated assembly, you will see that the compiler optimized the function to something extremely efficient:

is_authorized(int):
        movl    $1, %eax
        ret

This means that the function is_authorized just always returns 1. Wat?

Let’s try to understand the reasoning of the compiler. It sees:

int is_authorized(int uid) {
    for (int i = 0; i <= 4; i++) {
        if (uid == authorized_uids[i]) {
            return 1;
        }
    }
    return 0;
}

This is equivalent to:

int is_authorized(int uid) {
    if (uid == authorized_uids[0]) { return 1; }
    if (uid == authorized_uids[1]) { return 1; }
    if (uid == authorized_uids[2]) { return 1; }
    if (uid == authorized_uids[3]) { return 1; }
    if (uid == authorized_uids[4]) { return 1; }
    return 0;
}

But the compiler is allowed to assume that there is no undefined behavior in the code. So uid == authorized_uids[4] must never be evaluated, which means execution must already have left the function at one of the first 4 return statements. In other words, it always returns 1.
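For completeness, here is what a fixed version could look like. This is my own sketch, not code from the compiler explorer above: once the loop stops at the last valid index, the undefined behavior is gone, and the optimizer has no license to do anything surprising.

#include <stddef.h>

int authorized_uids[] = {0, 1001, 1005, 1012};

int is_authorized(int uid) {
    // Only visit the valid indices 0..3; computing the length from
    // sizeof avoids hard-coding the bound.
    size_t len = sizeof(authorized_uids) / sizeof(authorized_uids[0]);
    for (size_t i = 0; i < len; i++) {
        if (uid == authorized_uids[i]) {
            return 1;
        }
    }
    return 0;
}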

But couldn’t the compiler at least warn about this?

Someone rightfully annoyed with the state of the universe

Fighting Undefined Behavior

If you run gcc with -Wall -Wextra -Wpedantic or even clang with -Weverything, you still get no warnings about the undefined behavior!

Why don’t they warn about this? Because it is harder than it looks. The compiler cannot always know where undefined behavior will happen. Let’s have a look at the example below.

// hour ∈ [0, 23]
// minute ∈ [0, 59]
// second ∈ [0, 59]
int time_to_seconds(int hour, int minute, int second) {
    return hour * 3600 + minute * 60 + second;
}

If a compiler had to ensure the absence of undefined behavior when compiling the above code, it would have to assume that hour, minute and second could contain any possible value from INT_MIN to INT_MAX. The compiler does not understand comments written for humans, and it does not understand variable or function names, even when they are chosen appropriately.

To make the code safe, the compiler would have to introduce runtime checks in the compiled code to make sure that no overflows occur during the computation. In practice, that would imply 2 comparisons and 2 branches before each of the 2 multiplications and each of the 2 additions. But C was not designed for safety. It was designed for speed. All these comparisons and branches add up. On a simple benchmark, I observed that a safe version would be between 2 and 3 times as slow as the unsafe one.
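To make the cost concrete, here is a rough sketch of what such a checked version could look like. This is my own illustration, not code from the article; it relies on the __builtin_mul_overflow and __builtin_add_overflow helpers provided by GCC and Clang instead of writing the comparisons by hand.

#include <stdbool.h>

// Hypothetical checked variant: reports failure instead of overflowing.
// Every multiplication and addition is guarded, which is exactly the
// overhead discussed above.
bool time_to_seconds_checked(int hour, int minute, int second, int *out) {
    int h, m, total;
    if (__builtin_mul_overflow(hour, 3600, &h)) return false;
    if (__builtin_mul_overflow(minute, 60, &m)) return false;
    if (__builtin_add_overflow(h, m, &total)) return false;
    if (__builtin_add_overflow(total, second, &total)) return false;
    *out = total;
    return true;
}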

For the same reason, if the compiler had to warn for every potential undefined behavior, developers would get flooded with false positives. In the example above, the compiler would have to emit a warning for each arithmetic operation. With many false alarms, the developers would have to get in the habit of just ignoring all the warnings.

But I do not care about performance that much! I want runtime checks everywhere. Or at least, I want to see the warnings! I set -Wall -Wextra -Wpedantic -Wconversion -Weverything. I really do want to know when there are potential undefined behaviors in my code.

Let’s address each of these.

Runtime checks. The clang compiler does have the ability to add runtime checks! ASan (Address Sanitizer) looks closely at the way memory is allocated, deallocated and used. UBSan (Undefined Behavior Sanitizer) looks at a few additional things, such as invalid arithmetic operations. Both of them work by adding instructions to the resulting binary executable, which check everything as the program runs. Since the program has more things to do, it runs slower. Valgrind’s memcheck works directly on the compiled program and inserts its own checks when running it; it also makes running the program much slower. These tools catch several classes of issues, but only when the issues actually happen during a run, and it is often non-trivial to understand why they happened. Using the function from the previous example, compiling with clang -fsanitize=undefined time-to-seconds.c and running the program would result in something like:

time-to-seconds.c:5:17: runtime error: signed integer overflow: 2147483647 * 3600 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior time-to-seconds.c:5:17 in 
time-to-seconds.c:5:33: runtime error: signed integer overflow: 2147483647 * 60 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior time-to-seconds.c:5:33 in 
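For reference, the reports above correspond to calling the function with extreme inputs. The article does not show the calling code, but a driver along these lines would trigger reports like these:

#include <limits.h>
#include <stdio.h>

int time_to_seconds(int hour, int minute, int second);

// Hypothetical driver: INT_MAX is far outside the documented ranges,
// so both multiplications overflow and UBSan reports them.
int main(void) {
    printf("%d\n", time_to_seconds(INT_MAX, INT_MAX, 1));
    return 0;
}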

Syntactic checks. Although they do not understand what your program does, compilers are able to notice suspicious patterns in the source code. For instance, they can see if (x = 42) and emit a warning, since this is a common mistake where the developer should have written if (x == 42). Modern compilers now include an extensive set of such patterns to warn the user about potential mistakes. Some tools are specialized in finding such patterns in the code, such as clang-tidy or cppcheck. Since they do not change the way the code is compiled, they incur no slowdown when actually running the program. However, they can only find common mistakes, or they would risk emitting too many false warnings. In the time_to_seconds example, no linter emits any warning, since a warning there would likely be a false positive. However, they do warn about more obvious cases:

warning: integer overflow in expression of type ‘int’ results in ‘1’ [-Woverflow]
   39 |     INT_MAX * INT_MAX;
      |             ^
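The warning above could come from something as simple as the snippet below (my reconstruction; the function name is made up). The overflow is visible in the constant expression itself, so the compiler can flag it without running anything:

#include <limits.h>

// The whole expression is made of compile-time constants, so the
// overflow is obvious to the compiler (-Woverflow).
int obviously_too_big(void) {
    return INT_MAX * INT_MAX;
}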

If we want to go further, we need to look at formal methods. The idea is to use mathematical and systematic approaches to find all the undefined behaviors in a provable way. This means that such methods even allow you to prove the absence of undefined behavior (when there is none, of course).

I want to talk about two techniques from formal methods in particular. I am familiar with them because the company where I work, TrustInSoft, sells an industrial tool to apply them to real code used in critical applications.

Abstract interpretation. The idea for this one is intuitive: we interpret the source code in an abstract machine. This abstract machine executes each statement from the code exactly as specified by the standard, and emits a warning whenever undefined behavior can happen. It can be thought of as an implementation of the abstract machine described by the C standard.

The interesting thing is that you can go even further by having the abstract machine work with sets of values instead of singular values. You can then analyze a function for all possible inputs (the values for each parameter being represented as a set of actual values), instead of just one at a time. As an illustration, let us consider the example below.

int abs(int x) {
    if (x < 0) {
        return -x;
    } else {
        return x;
    }
}

We can analyze this using the Eva plugin of Frama-C, or the Value plugin of TrustInSoft Analyzer.

$ frama-c -eva abs.c -main abs -verbose=0
[eva:alarm] abs.c:3: Warning: signed overflow. assert -x ≤ 2147483647;

This command interprets the source code in an abstract machine, starting from the abs function. By default, it assumes the argument can be any possible value (any integer from INT_MIN to INT_MAX). When checking the program for all these possible values, it notices a problem in one specific case: when x is INT_MIN, computing -x is undefined behavior, because the result would be bigger than INT_MAX. Indeed, usually INT_MIN is −2³¹ and INT_MAX is 2³¹−1, because of the way integers are represented. If you are interested in knowing how these tools manage to work with huge sets of values, look up interval arithmetic.
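To silence the alarm for good, the only value that needs special treatment is INT_MIN. A possible fix (my sketch, not part of the analysis above) is to widen the computation to a type that can represent −INT_MIN:

#include <limits.h>

// Returning long long means -INT_MIN is representable, so negating
// never overflows.
long long abs_ll(int x) {
    long long v = x;
    return v < 0 ? -v : v;
}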

Weakest Precondition. The other technique is much more general and can actually be used to verify arbitrary facts on programs, not just to find undefined behaviors. For instance, with this, you can prove that your implementation of a sorting algorithm actually does what it is supposed to do. However, it also takes much longer to learn to use it, and to analyze a real program. If you are actually interested in it, I can only recommend the excellent tutorial written by Allan Blanchard.

What About Other Languages?

We explained why C is an unsafe language. But are the other languages unsafe?

C++. This is essentially a superset of C, and it adds its own undefined behaviors. It does introduce features that help avoid mistakes common in C code, but the complexity it brings also hands the user a few more footguns.

Java, Python, Ruby. These languages work by being compiled to byte code, which is something similar to assembly, except that it is not run directly by the processor. Instead, it is run by the corresponding virtual machine. In other words, each language targets a single architecture. While C needs to accommodate CPUs whose memory models vary significantly, these languages can specify strictly what should happen. They do not care as much about performance, either. The result is that, if you try to access an array out of bounds, you do not get undefined behavior. You get a nice, clean exception telling you what the problem is, and a call stack telling you where it happened.

JavaScript. JavaScript is the worst client-side language. It is also the only one. Surprisingly, however, it is safe in the sense that there are no undefined behaviors (there were some until ECMAScript 2016).

Rust. Now, Rust is an interesting case. Like C, it is designed to be fast. Unlike C, it is designed to be safe. It achieves this thanks to the concept of unsafety. By default, checks are added wherever they are needed, and dangerous operations are simply prohibited. You can turn off these checks to do lower-level operations and improve performance where needed, but you have to do so explicitly. The point is that unsafety is opt-in, so developers know when to be careful about undefined behavior.

As you can see, we can do quite a bit about undefined behavior. There are very fast, simple-to-use tools that catch common mistakes, and there are advanced, exhaustive tools that require more involvement. And there are languages trying to make safety the default.

There is still a long way to go until software becomes as reliable as it could be, but we’re getting there, one byte at a time.


Categories
Kepler Project

Continuous Integration and Delivery Made Easy

These are not really advanced topics, but I wanted to write about them anyway. So here we go.

Continuous Integration

Sometimes, I report bugs in software. Other times, in libraries. Or I just fix things myself. And so on. Point being: there are lots of bugs everywhere. What do we, software developers, do to avoid writing too many bugs?

We run tests. Unit tests. Integration tests. System tests. Acceptance tests. And where possible, we write software to do these tests for us; usually, that means unit tests and integration tests. Such tests won’t guarantee that our code is without defect, but at least they give us confidence that it works for the cases we do test. If we change something and the tests still pass, we probably have not broken anything major. This lets us iterate quickly without having to double- and triple-check whether a change broke one of many features. But even then, starting the tests and waiting for the results is burdensome, so we might get complacent and forget to run them before releasing our changes.

So we get even more software to remember that for us. That’s what we refer to as “continuous integration”, or CI for short. Whenever we submit a new feature, a bugfix, or even if we just want to slightly change the shadow under a button, the CI will run the tests and make sure we have not broken anything. For instance, in the CI of Kepler Project, I run linters to find obvious defects in code, static analyzers to find less obvious defects, as well as a suite of unit tests. I plan to add integration tests at some point, but it is slightly annoying to run them in graphical software.

Continuous Delivery

Once we are confident our changes will not accidentally launch the nukes, we want to actually release them to our users. But software developers are lazy, and they will spend hours automating a task that only takes minutes. In the end, this is the next step after continuous integration, so we call it “continuous delivery” (CD).

An added bonus of a well-designed CD is that you will be able to tell exactly what code was used to create a specific release. For instance, in Kepler Project, the CD starts whenever I create a new git tag. The release it generates uses the tag as the version number. Since a git tag points to a specific state of the code, I can easily check out the code corresponding to a version, to locate a bug, for instance. I can also browse all the changes between two versions.

CI/CD

If you have any kind of non-trivial project, you will want to at least run a CI. And even a basic CD will probably help you in tracking changes from a released piece of software back to the corresponding code in your code history.