Float Woes in C

The C programming language is a gift that keeps giving. Today, we are going to see how a seemingly banal and common operation can hide unfathomable depths of unmentionable horrors.

Woes with Integer Coercion

What is the problem with the code below?

// returns the closest integer to f
int float2int(float f) {
    return (int) f;
}

It’s written in C.

A Rust enthusiast

That’s… a good point.

But let’s assume we actually want to write C, for some forsaken reason. And we do not care whether we convert up, down or sideways.

There is still a fundamental issue with this code. What happens when we call float2int with INFINITY?

#include <math.h>
#include <stdio.h>

int float2int(float f) {
    return (int) f;
}

int main(void) {
    printf("%d\n", float2int(INFINITY));
    printf("%d\n", float2int(-INFINITY));
    printf("%d\n", float2int(NAN));
}

When we compile this code with gcc and run the result, we get:

$ gcc -O3 test.c && ./a.out
2147483647
-2147483648
0

Makes sense! The largest integer is 2147483647; the smallest is -2147483648, and what would you choose for NAN but 0?

Someone still young and innocent

Now, let’s just remove that -O3 option:

$ gcc test.c && ./a.out
-2147483648
-2147483648
-2147483648


The innocent who got their world shattered

Wait, it gets better. With clang, the output changes on every run:

$ clang -O3 test.c && ./a.out
$ ./a.out
$ ./a.out

But… why?

All hope is gone

Because it can.

Yup, that’s undefined behavior. Converting a non-finite value (i.e. an infinity or a NaN) to an integer is undefined behavior. That is why the result changes with the compiler and the optimization level: with -O3, GCC folds the conversions at compile time using one rule; without it, the conversion happens at run time through the x86 instruction cvttss2si, which returns the “integer indefinite” value INT_MIN for every invalid input.

But that’s not all! What should the largest finite floating-point value (3.4028234664e+38) convert to? It is much bigger than INT_MAX, the largest value an int can represent (and that holds whether you are on a 32-, 64- or 128-bit architecture).

Maybe we could just say that all the floating-point numbers larger than INT_MAX should be clamped to INT_MAX when converted? But alas, that makes for unwelcome surprises down the line, when you realize that the converted value is not always within 1 of the original value.

Thankfully, the C standard always has the solution whenever there is a hint of ambiguity:

When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.

When a value of integer type is converted to a real floating type, if the value being converted can be represented exactly in the new type, it is unchanged. If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. If the value being converted is outside the range of values that can be represented, the behavior is undefined. Results of some implicit conversions may be represented in greater range and precision than that required by the new type (see 6.3.1.8 regarding usual arithmetic conversions and 6.8.6.4 regarding the return statement).

Part 6.3.1.4 of the C standard

In other words, we’re just going to say all the values that are not quite that convenient trigger undefined behavior. The “range of values that can be represented” by int is the interval [INT_MIN, INT_MAX]. Attempting to convert any float value from outside this range to the int type is undefined behavior.

We’ll just have to check for that!

Someone has not learned their lesson

Woes with Range Checking

We are just getting to the actually tricky part. Let’s have a look at a seemingly robust implementation:

#include <limits.h>
#include <math.h>

int float2int(float f) {
    if (isnan(f)) {
        return 0;
    }
    if (f < INT_MIN) {  // also filters out -INFINITY
        return INT_MIN;
    }
    if (f > INT_MAX) {  // also filters out +INFINITY
        return INT_MAX;
    }
    return (int) f;
}

For the sake of simplicity, we just return arbitrary values for the inputs that we cannot convert. Other implementations might instead return a status indicating whether the conversion is possible.

But, of course, there is a bug lurking in there. And it is extremely easy to miss.

And, this time, it is not caused by an undefined behavior, but by another surprising “feature” of C: implicit type conversions.

To understand what I mean by that, we need to look at what the conditions in the code above really mean. The first one is fine; let’s look at the other two: f < INT_MIN and f > INT_MAX. In both, the left operand is a float, and the right operand is an int. Processors rarely have built-in operations for such mixed comparisons, so a conversion must be taking place. The way it happens is described by the “usual arithmetic conversions”:


Otherwise, if the corresponding real type of either operand is float, the other operand is converted, without change of type domain, to a type whose corresponding real type is float.

Part 6.3.1.8, paragraph 1 of the C standard

Thankfully, we only need the part I have quoted here. In short, both operands get converted to float. Let’s look at our conditions again.

if (f < (float) INT_MIN) {  // also filters -INFINITY
    return INT_MIN;
}
if (f > (float) INT_MAX) {  // also filters +INFINITY
    return INT_MAX;
}

Let’s look at the first condition. What is the value of (float) INT_MIN? Let’s assume a 32-bit two’s-complement int type. Then INT_MIN is -2³¹. A float can represent this value exactly (you can check it out with an online converter), so (float) INT_MIN is -2³¹ as well. No problem here: this conversion does not change the value, and the comparison does exactly what we want.

With the same assumption, INT_MAX is 2³¹ – 1. And there comes the catch: a float cannot represent 2³¹ – 1 exactly. If you put “2147483647” in the online converter, you will see that it is rounded to a float whose value is actually “2147483648”. It makes sense: a float trades off being able to represent all integers in [-2³¹, 2³¹ – 1] in order to cover a much wider range and many more magnitudes. In other words, the actual value of (float) INT_MAX is INT_MAX + 1. The condition is actually doing:

if (f > INT_MAX + 1) {  // also filters +INFINITY
    return INT_MAX;
}

Note: if this were C, this code would contain an undefined behavior because of the signed integer overflow that happens when evaluating INT_MAX + 1. This is just pseudocode for illustration purposes.

Since the upper bound is offset by one, the code fails to filter out the half-open interval (INT_MAX, INT_MAX + 1]. If the input parameter f were to take its value from this range, converting it to int would be undefined behavior. Now, is there any float value in this interval? Yes! INT_MAX + 1 of course (it’s the only one).
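We can check both claims with a few assertions. A quick sketch (the function name is mine), assuming a 32-bit int and IEEE 754 binary32 floats with the default round-to-nearest mode:

```c
#include <assert.h>
#include <limits.h>

// Sketch: INT_MIN converts to float exactly, but INT_MAX rounds *up*,
// so 2^31 slips through the buggy upper-bound check.
void demo_rounding(void) {
    assert((float) INT_MIN == -2147483648.0f);  // -2^31 is exactly representable
    assert((float) INT_MAX == 2147483648.0f);   // 2^31 - 1 rounds up to 2^31

    float f = 2147483648.0f;         // 2^31, the one value that slips through
    assert(!(f > (float) INT_MAX));  // the buggy guard does not reject it...
    // ...so the code would go on to evaluate (int) f: undefined behavior.
}
```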


The “robust” implementation does not handle 2³¹ properly. This is not just a theoretical oddity. I encountered this issue in critical code in production while working at TrustInSoft. Such bugs will easily stay under the radar and avoid even the most intensive test campaigns. Only formal methods help in detecting and avoiding such errors.

The worst thing is that, since this is undefined behavior, you might trigger it many times and never know, corrupting data in unknown ways. Things that could have been a runtime error become the entry door for silent bugs and security issues.
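One way to repair the range check is to compare only against bounds that are exactly representable as a float. A corrected sketch (the function name is mine; it assumes a 32-bit int and IEEE 754 floats):

```c
#include <limits.h>
#include <math.h>

// The trick: compare against 2^31 itself, which *is* exactly representable
// as a float, instead of (float) INT_MAX, which is not INT_MAX.
int float2int_checked(float f) {
    if (isnan(f)) {
        return 0;
    }
    if (f < (float) INT_MIN) {  // exact bound -2^31; also filters -INFINITY
        return INT_MIN;
    }
    if (f >= 2147483648.0f) {   // exact bound 2^31; filters +INFINITY and 2^31
        return INT_MAX;
    }
    return (int) f;             // f is now guaranteed to be in range
}
```

The largest float strictly below 2³¹ is 2147483520, which fits in an int, so the final conversion is always defined.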

C likes to keep you on your toes.


Ham Crypto

Amateur Radio

Amateur radio is about tinkering with the electromagnetic field. This may involve designing analog circuits to shape or unveil the modulation of signals, or sometimes using digital ones to recover information from the noise, or operating multi-hundred-watts transceivers to chat directly with the other side of the world, or lower-power ones to go just as far with Morse code, or even bouncing signals off the Moon. Also, lasers.

Yes, this is the 2005 picture from Wikipedia. No, things have not changed that much since. Physics still works the same.

Amateur radio operators, or “hams” for short, benefit from special privileges to this end. The radio spectrum is a crowded place; yet hams are allowed to transmit at various frequencies spread between 135.7 kHz and 250 GHz (not all bands are visible in the chart). Radio equipment is normally strictly regulated to limit interference and operations in unauthorized bands; yet hams are allowed to modify transmitters as they please, or even build and operate their own from scratch.

The allocation of the radio spectrum in the United States. The allocation is similar in other countries. You might notice there is not much free room. Click to enlarge.

It would be very tempting for a commercial operator, for instance, to make use of these privileges for their own benefit. To avoid this, there is one very important restriction: all communications must be done in the clear.

Code is Law

The exact wording from the international regulations is:

Transmissions between amateur stations of different countries shall not be encoded for the purpose of obscuring their meaning

Article 25, paragraph 2A from the Radio Regulations edited by the ITU (emphasis mine)
(yes, the website is very slow, which is a bit sad for the International Telecommunication Union)

Note: you might think that “of different countries” means that hams can encrypt their communications within a country, and that can actually be the case. However, the article is meant to bind countries to a common standard, not to restrict what happens within each country. In practice, each country will decide whether they want to allow it locally. And, in general, they don’t.

Encoding is the mapping of information to a particular representation. For instance, Morse Code and ASCII are encodings that map the letters of the alphabet to short and long signals, and to numbers, respectively. AM and FM are encodings that map analog signals to other analog signals, that is to say modulation.

Encoding is necessary for any form of communication, so it is of course allowed in general. What is forbidden is to represent information in a way that hides its meaning.

The ASCII encoding maps the uppercase and lowercase Latin letters, as well as the Arabic numerals, some punctuation marks, and control characters

Note that the wording does not say “cryptography”. This is intentional. Cryptography is just a set of techniques which can be used to hide information. It does not have to be this way, and this is not the only way to do so. This is about intent.

If you are doing Morse Code, SSB, FT8, or experimenting with your own super-advanced error-correction encoding that is meant to be usable by anyone, you would be in the spirit of the law. On the contrary, if you are hiding information in the low bits of a digital image, or using a custom protocol only you and your buddies know about, or using a proprietary protocol whose specification is not public, you are going against the spirit of the law.

Steganography is the idea of hiding information in plain sight

Note that the difference between “experimenting with a new protocol” and “designing a protocol restricted to just a few people” is slim. What matters is intent. A way to show good faith would be to include a link (using plain voice or Morse code) to an open source repository that anyone can easily install to decode the rest of the communication. To be clear, that’s not an original take.

What about cryptography? It should obviously be banned, right? The entire point of this discipline is to hide information.

My plant in the audience to ask a convenient question at a convenient time for my transition

Mostly, yes. But we are here to tinker, so let us see what this “mostly” entails.

Confidentiality and Authentication

The two main use cases of cryptography are confidentiality and authentication. Almost everything else derives from these.

They are mostly used together. For instance, the most widespread implementation of cryptography is indubitably TLS, the protocol used to secure all communications on the Web. This is the S in HTTPS. It is also what enabled online shopping, and grew the Web from a niche protocol for visiting personal homepages into the societal backbone it is today.

The green padlock is telling you that the communications with the website are made secure using cryptography. It does not tell you that the website itself is trustworthy, however.

Nowadays, when you connect to a website, your Web browser will ensure two things.

  1. Communications are encrypted, so that no one snooping around can read them. That’s the “confidentiality” part.
  2. You are talking to the right website. Otherwise, it would be pretty easy to intercept all your communications and read them. This is how company proxies can monitor what you are doing. That’s the “authentication” part.

Note that authentication is necessary if you want confidentiality. Without at least some basic form of authentication (e.g. sharing a private key), encryption will only give confidentiality against passive attack models.

However, authentication on its own could definitely make sense. And the purpose of authentication is not to obscure the meaning. In particular, a cryptographic signature is a piece of data that anyone can decode and interpret. The information it contains is simply the fact that the associated message was indeed sent by its author, not by someone else.

Note: the usual rebuttal is this would allow an attacker to send the same message again. However, this is a well-known topic and easily solvable.

To be clear, I am not the first person to think about doing this. Nor the second:

But, why?

Why not? The purpose of amateur radio is to experiment with communication technology. Knowledge can be an end in itself. But still, having some ideas of what practical applications could look like can be motivating, so I list some here.

The Glider, the smallest moving pattern in Conway’s Game of Life

Remote Control. Since 2004, Radio Regulations include an exception to article 25.2A for the control of satellites. Indeed, going to orbit is hard¹, so you want to be able to service satellites remotely. But if anyone can remotely turn off your satellite, or make it dive into the atmosphere, you’ve got a problem. So you need some way to authenticate the commands. But you do not actually need encryption. This means that remote control over amateur radio is possible for things other than satellites. Remote base stations, or repeaters, for instance.

¹ In fact, low Earth orbit is halfway to anywhere in the Solar System.

Yes, people are actually building satellites, and actually sending them to space

Authenticated QSOs. Such schemes could be used to authenticate contacts, or QSOs, to prevent people from spoofing call signs. This could serve a similar purpose as PGP or GPG for email.

Contesting. Possibly the most popular activity among hams is contesting. That is, contacting as many other operators in the world in a given time frame. Nowadays, scoring is done by having all the operators send the log of their contacts to the contest’s website. If K1AA says they have contacted F6ZZ at 28.500 MHz around 14:00 UTC, and F6ZZ says they have contacted K1AA at the same frequency at around the same time, we can probably assume that this actually happened. Unless K1AA and F6ZZ are colluding, of course, but this is hard to prevent. But if, for some reason, F6ZZ is unable, or forgets, to upload their logs, then the contest will assume that K1AA made a mistake, and subtract points. With authenticated QSOs, K1AA could provide strong evidence that the contacts happened as described. Contests already routinely require the exchange of serial numbers, so it might not be such a stretch to implement. Instead of an incrementing number, each operator might have a computer generate a short code for each contact.

Wow. Much radio. So frequency. Many QSOs.

Online services. Now that digital modes and packet communications are finally being used in amateur radio, it is possible to use protocols such as HTTP. But, without authentication, the use cases are as limited as the Web in the early 1990s. With authentication, we could envision message boards and editable wikis with secure user accounts.

These are just the few examples I could think of, but there might be other relevant applications. Of course, there are wrinkles to be ironed out, but I do not pretend to have it all figured out.

Null Cipher

How do we use that in practice? We could roll our own crypto, or not. On the one hand, it is deceptively easy to make mistakes when implementing cryptography by oneself. On the other hand, the purpose of amateur radio is to experiment. So we might play around with custom cryptographic protocols over the air, but maybe we should use proven ones for actual persistent use. Let’s see what we could use for this.

TLS. The first obvious candidate is the omnipresent TLS. It is mostly used to authenticate servers, but you can also use it to authenticate clients. This has already been discussed in the past in the context of amateur radio to authenticate clients for remote base stations. Using TLS with the “NULL” cipher is a common topic, and it is not very hard to tell OpenSSL to do so (-cipher NULL-SHA256:@SECLEVEL=0). Modifying some web server to serve content with the NULL cipher is more involved, but probably not that hard.

However, both Mozilla Firefox and Google Chrome have removed the NULL cipher from the list of supported ciphers. In other words, they will refuse to connect without encryption. Fixing this would involve modifying their source code (easy), navigating their complicated build process (annoying), distributing them to the users (hard), and maintaining the fork (who wants to do that anyway?). Unfortunately, this is probably necessary, if we also want to prevent users from inadvertently using encryption by connecting to HTTPS websites. I’ll probably look into it at some point, but this is a significant project.

Me writing some random text through an SSH connection

SSH. Another candidate is SSH, which is already widely used for remote control of virtually everything nowadays (yes, I know, everything is terrible). It authenticates both the server and the client. The source code and the build process are also fairly simple, so it is easy to make a server and a client that can talk to each other without encryption. The nice thing with SSH is that you can then redirect ports through it, use it as a SOCKS proxy, or even as a VPN.

Wireshark intercepting the raw communications for the SSH sessions over the network. As you can see, the text is visible in clear. The [0m, [1m, [27m and so on are ANSI escape sequences. The rest are the SSH handshake and the cryptographic signature, which are easier to read when displayed as such.


To sum up, I like the idea of using cryptographic tools to provide proper authentication for remote control and to enable new use cases through amateur radio. Of course, there is significant work to be done before we get there, but it is nice to know that we do not need to start from scratch.


Flexible Array Members: Typical C Shenanigans

Speed. I am Speed.

The one thing that C cares about is speed. Seriously, you might not get the right results, but at least, you will get them fast.

Let’s look at two common ways to represent a vector in C.

struct vector {
    unsigned long size;
    int data[42];
};

The first one is to just use the built-in array type. This approach is simple, but it implies choosing the size of the vector at compile time. The program cannot pick a size depending on user input or some other runtime parameter.

Note: Variable-Length Arrays (VLA) do allow declaring arrays whose length is decided at run-time. However, this is limited to local variables and is an optional feature since C11.

struct vector {
    unsigned long size;
    int* data;
};

The alternative is to not use an array at all. Instead, we use a pointer to a memory buffer that will contain the data. This memory location can be created with malloc, for instance.

Pointers can be used like arrays almost transparently, which brings back the convenience (the “almost” is why most C programmers are paranoid about using sizeof on expressions). The example below shows how accessing an element at some index uses the same syntax in both cases.

But although the syntax is the same, the meaning is different. Look at the assembly code generated to implement vector_index. Indexing the built-in array uses a single MOV instruction; indexing through a pointer uses two MOV instructions.

It makes sense. rdi contains the address to the struct and rsi the value of index. In the first case, we shift by 8 bytes to skip the field size, then we move by index * 4 bytes to select the right array element, then we read the value at this memory location (indicated by the bracket in the Intel assembly syntax). In the second case, we need to first recover the value of the field data, at an offset of 8 bytes from the beginning of the struct, then read at the right offset from the address it contains.
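For reference, here is a self-contained sketch of the two layouts and accessors being compared (the type and function names are mine):

```c
// The two representations under discussion, side by side.
struct vec_inline { unsigned long size; int data[42]; };  // array inside the struct
struct vec_ptr    { unsigned long size; int *data; };     // pointer to a buffer

// Identical syntax at the C level...
int index_inline(const struct vec_inline *v, unsigned long i) {
    return v->data[i];  // one load: the element sits at a fixed offset in the struct
}

int index_ptr(const struct vec_ptr *v, unsigned long i) {
    return v->data[i];  // two loads: fetch the data pointer, then the element
}
```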

In other words, using a pointer instead of a built-in array is twice as slow! Okay, not really. Any modern CPU is going to pipeline, prefetch and execute out of order such that it really won’t matter. But it used to! And it might still matter on microcontrollers.

In fact! I did some benchmarks, and, although there was no difference on my laptop (CPU model i5-4210H), I did see a tremendously large difference on my Arduino Leonardo (microcontroller model ATmega32u4). Indeed, the code with the additional indirection through a pointer was 10 % slower.

Yeah, I did expect it to be worse than that. But still, 10 % was not something to scoff at in the previous millennium. Also, CPUs at the time might have seen a larger difference than my ATmega32U4 released not that long ago.

In summary, embedding a built-in C array is fast, but inconvenient. Using dynamic allocation with malloc is more flexible, but not as fast. How can we make our compiler generate faster code while keeping the convenience?

Well, we can lie to it.

Lies. So Many Lies.

Let’s look at how we will typically create a vector with the first approach:

struct vector {
    unsigned long size;
    int data[42];
};

struct vector *vector_new(unsigned long size) {
    assert(size == 42);
    struct vector *ret = malloc(sizeof(struct vector));
    ret->size = size;
    return ret;
}

Notice how malloc is just another C function that can take any value as an argument at runtime? Notice how we are forcing ourselves to pass it a compile-time constant (the size of our struct)? What if we actually allocated the size we wanted?

For instance, if we allocated sizeof(struct vector) + 100 bytes, that would mean we would still have the 42 elements from the definition of struct vector, but we would have 100 additional bytes, enough to store 25 more 4-byte integers!

And so, a new idiom was born:

struct vector {
    unsigned long size;
    int data[1];
};

struct vector *vector_new(unsigned long size) {
    struct vector *ret = malloc(
        sizeof(struct vector)
        + sizeof(int) * (size - 1)
    );
    ret->size = size;
    return ret;
}

Since we are going to lie to the compiler about the size of the actual array anyway, we make the base size as small as possible. Since the C standard does not allow 0-length arrays (“An array type describes a contiguously allocated nonempty set of objects”), we just use 1. Then, when we create an instance of our vector, we make sure to actually have enough memory for size elements. Since there is already 1 element built in, we allocate room for size - 1 more in addition to the base struct.

C developers embraced this hack, wrote many fast programs, and lived happily ever after.

Well, kind of. I mean, if you write this, your program is probably going to be fast. Just as fast as a fixed-size array. But maybe it won’t. And maybe it will allow random people to take control of your computer. Or maybe it will blow up the spaceship. Who knows?

You might know where I am going with this: this is undefined behavior.

Yes, we did allocate enough memory to do what we want. But this is not the point. We lied to the compiler. And compilers are delicate, gullible beings: they will believe you. The compiler will believe that your array is actually of size 1. And it will do unmentionable things to your code, because this assumption is false.

Save one Character, Save the Entire World

When they observe developers writing C, compiler authors and language lawyers alike get the urge to start again from scratch. This is actually the reason why there are so many flood myths through the ages.

This time, however, they recognized that the end justified the means. Compiler authors, too, want the programs they write to be fast (except C++ compiler authors).

And so, GCC introduced a way to do this:

struct vector {
    unsigned long size;
    int data[0];
};

struct vector *vector_new(unsigned long size) {
    struct vector *ret = malloc(
        sizeof(struct vector)
        + sizeof(int) * size
    );
    ret->size = size;
    return ret;
}

Notice the difference? We just replaced 1 by 0 for the size of the array.

But you said the C standard did not allow this!

Someone who pays attention to details

Indeed! But GNU C does! GNU C is the dialect that GCC understands, and it allows for 0-length arrays as an extension.

Of course, if you were using another compiler, you were out of luck… until!

Finally, in the C99 revision of the C standard, a new syntax was introduced!

struct vector {
    unsigned long size;
    int data[];
};

struct vector *vector_new(unsigned long size) {
    struct vector *ret = malloc(
        sizeof(struct vector)
        + sizeof(int) * size
    );
    ret->size = size;
    return ret;
}

Can you see the difference? We just removed the 0.

That’s all?

Someone who wasted too much time reading this; but worry not, it took even longer to write!

Yup. This syntax tells the compiler not to assume anything about the size of the array. You can still use it as usual. But ensuring that you allocate enough memory, and that you do not go out of bounds, is up to you.
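Putting it all together, here is a minimal, self-contained sketch of the flexible-array-member idiom in use (the NULL check is my addition):

```c
#include <stdlib.h>

// C99 flexible array member: `data` contributes nothing to
// sizeof(struct vector); the room for elements comes from the allocation.
struct vector {
    unsigned long size;
    int data[];
};

struct vector *vector_new(unsigned long size) {
    struct vector *ret = malloc(
        sizeof(struct vector)
        + sizeof(int) * size
    );
    if (ret != NULL) {
        ret->size = size;
    }
    return ret;
}
```

Since the header and the elements live in a single allocation, a vector created this way is released with a single free(v).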

Less is More

We went from 1, to 0, to nothing. Since there is nothing left to take away, we attained perfection. It’s a general truism with C: the less C, the better.


Why Undefined Behavior Matters

So Many Bugs

So, you’ve been programming for one month, or for ten years, and you have gotten familiar with Murphy’s and Sturgeon’s laws. And you are wondering if there is some way to catch all the bugs, not just spray-and-pray with regression/unit/integration/system/acceptance testing.

Or you are a curious onlooker, wondering why all these programmers cannot make a piece of software that works reliably, what with apps doing unexpected things and all the security breaches you hear of.

First, what are we talking about? Essentially, there is a bug any time a system does not do what it is expected to do:

The Mars Climate Orbiter disintegrated into Mars because a module was using non-metric units and forgot to convert. Seriously.

As you can see, there is quite a spectrum. I like to sort bugs in three categories:

Today, I would like to focus on the last one. It looks like it should be the more tractable, since the scope is well constrained. But it is still a non-trivial problem.

What we want is two-fold:

  1. Specify what we expect the program to do
  2. Verify that the program meets our specification

Even the first step is tricky. For the sake of simplicity, we are going to reduce the scope of what we want to achieve, and focus on a well-delimited set of things our program should do: have a well-defined behavior.

What does this mean? Let us say you wrote a program that does some kind of scientific computation, except you used meters instead of kilometers. Then the result is going to be wrong, of course, but in a predictable way: it will be off by a factor of 1,000. If you run the program again the same way, you should get the same result.

On the contrary, if your program contains undefined behavior, it could do unexpected things. It might output the right value; it might output a wrong value; it might crash; it might allow an attacker to take control of your computer. And it might behave differently every time you run it. Worse: it might run perfectly fine during all your tests, but output the wrong value when it is being used to pilot a real plane with real passengers.

How can we know if a program contains undefined behavior? First, we need to talk about programming languages.

Why is C an “Unsafe” Language?

When people talk about “undefined behavior”, they are usually referring to undefined behaviors in the C programming language. There is a reason why this particular language is criticized as being “memory-unsafe”, and why so many security vulnerabilities stem from code written in C. And that reason is that the C language leaves the door open for many bugs.

To understand why, we must first ask ourselves what “the C programming language” refers to. Who decides what it is, and what if someone does things differently?

Historically, C was loosely defined and had many dialects. Every compiler author would do things a bit differently. The people writing a compiler to compile “C” code to 8080 assembly made different choices than the ones targeting PDP-11. The same program could end up doing different things on different processors. Things were messy.

So, the C compiler writers, and the C developers, and the processors manufacturers joined together and agreed on a common definition for what “C” really meant. Thus was born the C standard.

The C standard had to try to align things between people doing very different things. And these people cared a lot about performance. Processors at the time were about a hundred times slower than the ones today. And I am only talking about clock frequency here; nowadays, each CPU core can do much more in a cycle than the CPUs at that time, and there are now many cores in a modern CPU. It makes sense that they really wanted to keep compiling C code to efficient assembly, to have programs run at decent speeds.

Now, what happens if the code tries to access memory it is not supposed to? Well, the compiler cannot just add runtime checks everywhere, since that would have a negative impact on performance. And it cannot always know in advance that this will happen, since that would require solving arbitrarily complex mathematical problems.

Because of this, the C standard did not require either. Instead, if the code does such an invalid operation, it is said to have “undefined behavior”.

So what? If code crashes, I’ll notice the problem. If it outputs the wrong result, my tests will notice it.

Simplicius, my imaginary straw-man C developer

Wait a minute, renegade figment of my imagination! Undefined behavior means that the compiler is allowed to produce code that does anything. In particular, it is allowed to assume that undefined behavior never occurs in the program, and deduce things from that. And you can prove many things if you assume something which is false.

Let’s take an example. The code below uses a toy authorization function that just checks whether the user id (uid) belongs to a list of authorized ids. As you might have already guessed from the topic of this article, there is an undefined behavior in this code. First question: can you see it?

#include <stdio.h>
int authorized_uids[] = {0, 1001, 1005, 1012};
int is_authorized(int uid) {
    for (int i = 0; i <= 4; i++) {
        if (uid == authorized_uids[i]) {
            return 1;
        }
    }
    return 0;
}

int main(void) {
    if (is_authorized(4242)) {
        puts("You are now logged in as root.");
    } else {
        puts("You are a lowly common user.");
    }
}

Spoiler: answer below.

The loop is written as for (int i = 0; i <= 4; i++), so i takes the values 0, 1, 2, 3 and 4. Then we use it as an index to read from an array: authorized_uids. However, this array only contains 4 elements, at indices 0, 1, 2 and 3. There is no index 4 in this array. Trying to access index 4 is an out-of-bounds access, which is a type of undefined behavior.
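Before looking at what compilers actually do with the buggy version, here is what a fixed loop might look like. A sketch (the function name is mine) where the bound is computed from the array instead of being hard-coded:

```c
#include <stddef.h>

int authorized_uids[] = {0, 1001, 1005, 1012};

// Derive the element count from the array itself, so the loop
// cannot run past the end even if the list changes later.
int is_authorized_fixed(int uid) {
    size_t count = sizeof(authorized_uids) / sizeof(authorized_uids[0]);
    for (size_t i = 0; i < count; i++) {
        if (uid == authorized_uids[i]) {
            return 1;
        }
    }
    return 0;
}
```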

Second question: what will happen when we compile and run this code? You can actually do it, but do try to have a guess first.

The universal answer is: anything can happen. Since there is an undefined behavior, the program could do anything. However, there are two common things compilers will do in this case.

Option 1. It can translate this C code to assembly in a rather naive way. For an x86 processor, it would translate authorized_uids[i] as an assembly instruction reading memory at the address authorized_uids + i.

This might seem a bit obvious, but the compiler does not have to do that. Even when there is no undefined behavior, the compiler is allowed to emit any code it wants, as long as it produces the same result. For instance, the compiler might find a faster series of instructions that achieves the same result. This is what “optimizing” means.

If we then run the compiled program, the CPU will try to read at authorized_uids + i. Most likely, however, this will still be in the valid memory of the program. Indeed, memory is usually allocated generously (segments for the stack, pages for the heap). So, the assembly instruction will not cause a crash; it will return some value. Most likely, the value will not be 4242, so the test will fail, and the user will not be authorized.

This is the middle column in the compiler explorer below. You can see at the top that the source code for is_authorized compiles to various assembly instructions. And, at the bottom, you can see the output of the program: “You are a lowly common user.”

Option 2. Now, let’s have a look at the rightmost column. The output is “You are now logged in as root.”! This is not what this program was supposed to do at all! And this is not just a fluke. It happens every time you run it!

Then, why is the result different this time? If you look closely, you will notice that the only difference is that we passed the option -O2 to the compiler. It tells the compiler to try harder to optimize the program.

In fact, if you look at the generated assembly, you will see that the compiler optimized the function to something extremely efficient:

        movl    $1, %eax

This means that the function is_authorized just always returns 1. Wat?

Let’s try to understand the reasoning of the compiler. It sees:

int is_authorized(int uid) {
    for (int i = 0; i <= 4; i++) {
        if (uid == authorized_uids[i]) {
            return 1;
        }
    }
    return 0;
}

This is equivalent to:

int is_authorized(int uid) {
    if (uid == authorized_uids[0]) { return 1; }
    if (uid == authorized_uids[1]) { return 1; }
    if (uid == authorized_uids[2]) { return 1; }
    if (uid == authorized_uids[3]) { return 1; }
    if (uid == authorized_uids[4]) { return 1; }
    return 0;
}

But we can assume that there is no undefined behavior in the code. So uid == authorized_uids[4] must never be evaluated. So the compiler can safely assume that the function always exits at one of the first 4 return statements. In other words, it always returns 1.
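For reference, the fix is simply to keep the index within the bounds of the array. A minimal sketch (the LEN macro is my own addition, not part of the original code):

```c
#include <stddef.h>

int authorized_uids[] = {0, 1001, 1005, 1012};

/* Deriving the element count from the array itself
 * makes the off-by-one error impossible. */
#define LEN(a) (sizeof(a) / sizeof((a)[0]))

int is_authorized(int uid) {
    for (size_t i = 0; i < LEN(authorized_uids); i++) {
        if (uid == authorized_uids[i]) {
            return 1;
        }
    }
    return 0;
}
```

With the bound derived from the array, adding or removing an authorized uid cannot reintroduce the out-of-bounds access.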

But couldn’t the compiler at least warn about this?

Someone rightfully annoyed with the state of the universe

Fighting Undefined Behavior

If you run gcc with -Wall -Wextra -Wpedantic or even clang with -Weverything, you still get no warnings about the undefined behavior!

Why don’t they warn about this? Because this is harder than it looks. The compiler cannot always know where undefined behavior will happen. Let’s have a look at the example below.

// hour ∈ [0, 23]
// minute ∈ [0, 59]
// second ∈ [0, 59]
int time_to_seconds(int hour, int minute, int second) {
    return hour * 3600 + minute * 60 + second;
}

If a compiler had to ensure the absence of undefined behavior when compiling the above code, they would have to assume that hour, minute and second contain any possible value from INT_MIN to INT_MAX. The compiler does not understand comments written for humans, and it does not understand variable or function names, even when they are chosen appropriately.

To make the code safe, the compiler would have to introduce runtime checks in the compiled code to make sure that no overflows occur during the computation. In practice, that would imply 2 comparisons and 2 branches before each of the 2 multiplications and each of the 2 additions. But C was not designed for safety. It was designed for speed. All these comparisons and branches add up. On a simple benchmark, I observed that a safe version would be between 2 and 3 times as slow as the unsafe one.
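To give an idea of what those runtime checks look like, here is a sketch of a checked variant using the GCC/Clang overflow builtins (these builtins are not standard C, and this is my illustration, not the code used in the benchmark):

```c
#include <stdbool.h>

/* Checked variant: reports failure instead of overflowing.
 * Each __builtin_*_overflow performs the operation and returns
 * true if the mathematical result does not fit in an int. */
bool time_to_seconds_checked(int hour, int minute, int second, int *out) {
    int h, m, r;
    if (__builtin_mul_overflow(hour, 3600, &h)) return false;
    if (__builtin_mul_overflow(minute, 60, &m)) return false;
    if (__builtin_add_overflow(h, m, &r))       return false;
    if (__builtin_add_overflow(r, second, &r))  return false;
    *out = r;
    return true;
}
```

Every arithmetic operation now comes with a comparison and a branch, which is exactly the overhead described above.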

For the same reason, if the compiler had to warn for every potential undefined behavior, developers would get flooded with false positives. In the example above, the compiler would have to emit a warning for each arithmetic operation. With many false alarms, the developers would have to get in the habit of just ignoring all the warnings.

But I do not care about performance that much! I want runtime checks everywhere. Or at least, I want to see the warnings! I set -Wall -Wextra -Wpedantic -Wconversion -Weverything. I really do want to know when there are potential undefined behaviors in my code.

Let’s address each of these.

Runtime checks. The clang compiler does have the ability to add runtime checks! ASan (Address Sanitizer) looks closely at the way memory is allocated, deallocated and used. UBSan (Undefined Behavior Sanitizer) looks at a few additional things, such as invalid arithmetic operations. Both of them work by adding instructions to the resulting binary executable, which will check everything. Since the program has more things to do, it runs slower. Valgrind/memcheck works directly on the compiled program, and inserts its own checks when running it. It also makes running the program much slower. These tools are able to catch a few types of issues every time they actually happen in the program. However, it is often non-trivial to understand why it happened. Using the function from the previous example, compiling with clang -fsanitize=undefined time-to-seconds.c and running the program would result in something like:

time-to-seconds.c:5:17: runtime error: signed integer overflow: 2147483647 * 3600 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior time-to-seconds.c:5:17 in 
time-to-seconds.c:5:33: runtime error: signed integer overflow: 2147483647 * 60 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior time-to-seconds.c:5:33 in 

Syntactic checks. Although they do not understand what your program does, compilers are able to notice suspicious patterns in the source code. For instance, they can see if (x = 42) and emit a warning, since this is a common mistake where the developer should have written if (x == 42). Modern compilers now include an extensive set of such patterns to warn the user about potential mistakes. Some tools are specialized in finding such patterns in the code, such as clang-tidy or cppcheck. Since they do not change the way the code is compiled, they incur no slow down when actually running the program. However, they can only find common mistakes, or they would risk emitting too many false warnings. In the example, no linter emits any warnings, since it would be likely to be a false positive. However, they do warn of more obvious cases:

warning: integer overflow in expression of type ‘int’ results in ‘1’ [-Woverflow]
   39 |     INT_MAX * INT_MAX;
      |             ^

If we want to go further, we need to look at formal methods. The idea is to use mathematical, systematic approaches to find all the undefined behaviors in a provable way. Such methods even allow you to prove the absence of undefined behavior (when there is none, of course).

I want to talk about two techniques from formal methods in particular. I am familiar with them because the company where I work, TrustInSoft, sells an industrial tool to apply them to real code used in critical applications.

Abstract interpretation. The idea for this one is intuitive: we interpret the source code in an abstract machine that executes each statement exactly as specified by the standard, and emits a warning whenever undefined behavior can happen. In essence, it is an implementation of the abstract machine described by the C standard.

The interesting thing is that you can go even further by having the abstract machine work with sets of values instead of singular values. You can then analyze a function for all possible inputs (the values for each parameter being represented as a set of actual values), instead of just one at a time. As an example, let us consider the function below.

int abs(int x) {
    if (x < 0) {
        return -x;
    } else {
        return x;
    }
}

We can analyze this using the Eva plugin of Frama-C, or the Value plugin of TrustInSoft Analyzer.

$ frama-c -eva abs.c -main abs -verbose=0
[eva:alarm] abs.c:3: Warning: signed overflow. assert -x ≤ 2147483647;

This command interprets the source code in an abstract machine, starting from the abs function. By default, it assumes the argument can be any possible value (any integer from INT_MIN to INT_MAX). When checking the program for all these possible values, it notices a problem in one specific case: when x is INT_MIN, computing -x is undefined behavior, because the result would be bigger than INT_MAX. Indeed, INT_MIN is usually −2³¹ while INT_MAX is 2³¹−1, because of the way integers are represented. If you are interested in knowing how these programs manage to work with huge sets of values, look up interval arithmetic.
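One way to silence the alarm is to make the INT_MIN case well-defined, for instance by widening to a larger type before negating. A minimal sketch (my own illustration, assuming long long is wider than int, as it is on common platforms):

```c
#include <limits.h>

/* Returns |x| as a long long. Widening first means that even
 * x == INT_MIN can be negated without signed overflow, so there
 * is no undefined behavior left for the analyzer to flag. */
long long safe_abs(int x) {
    long long v = x;
    return v < 0 ? -v : v;
}
```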

Weakest Precondition. The other technique is much more general and can actually be used to verify arbitrary facts on programs, not just to find undefined behaviors. For instance, with this, you can prove that your implementation of a sorting algorithm actually does what it is supposed to do. However, it also takes much longer to learn to use it, and to analyze a real program. If you are actually interested in it, I can only recommend the excellent tutorial written by Allan Blanchard.

What About Other Languages?

We explained why C is an unsafe language. But are the other languages unsafe?

C++. This is essentially a superset of C. And it adds its own undefined behaviors. It does introduce features to help avoid mistakes that are common in C code, but the complexity it brings also hands another footgun to the user.

Java, Python, Ruby. These languages work by being compiled to byte code, which is something similar to assembly, except that it is not run directly by the processor. Instead, it is run by the corresponding virtual machine. In other words, each language targets a single architecture. While C needs to accommodate CPUs whose memory models vary significantly, these languages can specify strictly what should happen. They do not care as much about performance, either. The result is that, if you try to access out of the bounds of an array, you do not get undefined behavior. You get a nice, clean exception telling you what the problem is, and a call stack telling you where it happened.

JavaScript. JavaScript is the worst client-side language. It is also the only one. Surprisingly, however, it is safe in the sense that there are no undefined behaviors (there were some until ECMAScript 2016).

Rust. Now, Rust is an interesting case. Like C, it is designed to be fast. Unlike C, it is designed to be safe. It achieves this thanks to the concept of unsafety. By default, checks are added wherever they are needed, and dangerous operations are simply prohibited. You can turn off these checks to do lower-level operations and improve performance where needed, but you have to do so explicitly. The point is that unsafety is opt-in, so developers know to be careful with undefined behaviors.

As you can see, we can do quite a bit about undefined behaviors. There are fast, simple-to-use tools that might find common mistakes, and there are advanced, exhaustive tools that require more involvement. And there are languages trying to make safety the default.

There is still a long way to go until software becomes as reliable as it could be, but we’re getting there, one byte at a time.



Continuous Integration and Delivery Made Easy

These are not really advanced topics, but I wanted to write about them anyway. So here it is.

Continuous Integration

Sometimes, I report bugs in software. Other times, in libraries. Or I just fix things myself. Point being: there are lots of bugs everywhere. What do we, software developers, do to avoid writing too many bugs?

We run tests. Unit tests. Integration tests. System tests. Acceptance tests. And, where possible, we write software to run these tests for us. Usually, that means unit tests and integration tests. Such tests won’t guarantee that our code is without defect. But at least they give us confidence that it works for the cases we do test. If we change something and the tests still pass, we probably have not broken anything major. This lets us iterate quickly without having to double- and triple-check whether a change broke one of many features. But even then, starting the tests and waiting for the results is burdensome, so we might get complacent and forget to run them before releasing our changes.

So we get even more software to remember that for us. That’s what we refer to as “continuous integration”, or CI for short. Whenever we submit a new feature, a bugfix, or even if we just want to slightly change the shadow under a button, the CI will run the tests and make sure we have not broken anything. For instance, in the CI of Kepler Project, I run linters to find obvious defects in code, static analyzers to find less obvious defects, as well as a suite of unit tests. I plan to add integration tests at some point, but it is slightly annoying to run them in graphical software.

Continuous Delivery

Once we are confident our changes will not accidentally launch the nukes, we want to actually release them to our users. But software developers are lazy, and they will spend hours automating a task that only takes minutes. In the end, this is the next step after continuous integration, so we call it “continuous delivery” (CD).

An added bonus of a well-designed CD is that you will be able to tell exactly what code was used to create a specific release. For instance, in Kepler Project, the CD starts whenever I create a new git tag. The release it generates uses the tag as the version number. Since a git tag points to a specific state of the code, I can easily check out the code corresponding to a version, to locate a bug, for instance. I can also browse all the changes between two versions.


If you have any kind of non-trivial project, you will want to at least run a CI. And even a basic CD will probably help you in tracking changes from a released piece of software back to the corresponding code in your code history.


Building for Windows without Running Windows

By the way, I use Linux

1998 2008 2018 2022 will be the year of the Linux desktop. Definitely sometime before 2028. Maybe before 2038? Ok, maybe not. In any case, I am the most comfortable working (and studying, and procrastinating, and gaming, and procrastinating) on Linux (yeah, yeah, GNU/Linux or GNU+Linux, I know). Actually, I am a weakling running Debian with the non-tiling WM Xfce4 (I might as well install Xubuntu at this point).

Anyway, the point is that I still need to target Windows as a platform for the 96 % of users who have not yet seen the light. But I do not really want to reboot into a different operating system and maintain multiple development environments. Besides, I need to build my project in the CI/CD pipeline for automatic testing and release building, but I really do not want to deal with a Windows virtual machine in the cloud.

If only I could build for Windows without running Windows…


Well, it’s a thing. Actually, compiling for X without running on X is a thing. It’s even pretty much mandatory when you cannot run a compiler on the target system. So, if you are developing for Arduino, you are most likely compiling from amd64 (Intel and AMD microprocessors) towards avr (Atmel’s microcontrollers). It’s called cross-compiling.

What makes cross-compiling not completely trivial is that the file hierarchy and dependency management are mostly built around the assumption that compiled code will be run on the current system. So, on my computer, binary libraries in /usr/lib are amd64, /usr/bin/gcc compiles to amd64 and apt installs amd64 binaries by default. Debian does support various architectures, so, I can tell apt to install arm64 libraries, and it will install them in /usr/lib/aarch64-linux-gnu/. Binaries of different architectures, however, will conflict: both amd64 and arm64 Firefox will want to be /usr/bin/firefox. But, in any case, I do not want a gcc compiled for arm64! I want a gcc compiled for amd64 that will compile to arm64. For this, I have to install gcc-aarch64-linux-gnu. With this, I can easily set up a working environment where I will be compiling for arm64 from my amd64 computer.

Cross-Compiling to Windows

But wait, I do not want to cross-compile to arm64! I actually want to cross-compile from amd64 to amd64, but for Windows! Does that even make sense? It does! If I install mingw-w64, I get a new gcc at /usr/bin/x86_64-w64-mingw32-gcc (actually a symbolic link to /usr/bin/x86_64-w64-mingw32-gcc-win32). And then:

$ cat hw.c
#include <stdio.h>
int main(void) {
    puts("Hello World!");
}
$ x86_64-w64-mingw32-gcc hw.c
$ file a.exe
a.exe: PE32+ executable (console) x86-64, for MS Windows

I actually do get a PE file, an executable for Windows!

Good, I can compile code for Windows. Now I need to install my project’s dependencies, such as GLFW3. But there is no such thing as libglfw3-mingw-w64-dev. In fact, there are only a few libraries available for this platform in the Debian repositories:

  • libassuan-mingw-w64-dev
  • libgcrypt-mingw-w64-dev
  • libgpg-error-mingw-w64-dev
  • libksba-mingw-w64-dev
  • libz-mingw-w64-dev
  • libnpth-mingw-w64-dev

For everything else, you’ll need to compile the dependencies yourself and install them in /usr/local/x86_64-w64-mingw32/. Yup, no magic wand here. And of course, you will need to cross-compile them to Windows. And fix the bugs you will encounter while doing so.

Pro-tip: write yourself a script to build and install these dependencies. It will serve as documentation for when someone (probably future you) wants to install the development environment. And you will have to do it one way or another if you want to cross-compile in Docker for your CI/CD pipeline.

Running Windows binaries from Linux

How do we run the “Hello World!” example from earlier? Easy!

$ wine a.exe
[useless warnings]
Hello World!

Yup, that’s it.

Okay, okay, it is still useful to test the Windows build on actual Windows or, at the very least, a Windows VM, since it sometimes behaves differently.

Anyway, that’s all for today. And don’t think too much about floating point determinism when cross-compiling.


Astronomical Depth Buffer

Guess what. We’re going to talk about floating point arithmetic and what happens when using distances that range from the millimeter to Pluto’s distance to the Sun. Yes, yet again.

Floating Point Arithmetic

OpenGL uses single precision floating point arithmetic. In any case, that’s what your GPU prefers to use. This means you can be as precise as 60 parts in a billion, but not more. That’s more than enough for drawing human-sized objects: the error is much smaller than the width of a human hair. And it causes no problem when drawing the Solar system: at the scale of Pluto’s orbit, the error is 352 km, which is not visible when looking at the whole thing. But you are going to have issues drawing a human if you do not know whether they are right here, or 352 km away.

We can avoid the main problem by doing most computations in double precision and centering around the camera before switching to single precision for the rendering, as I briefly touched on in a previous article. And that’s fine! We do not need more than single precision to locate a pixel on the screen, after all.

So far so good. But using a wide range of coordinates in the scene surfaces another issue: the naive use of the depth-buffer.

The Depth-Buffer

3D rendering, at its core, is not very complex. You ask to draw some triangles, your GPU does some math to figure out where they land on your screen, and it gives colors to the pixels that end up inside them. And that’s it! Well, almost. There is one tiny remaining problem.

Let’s say there is a green square close by, and a red triangle farther away (behind the green square). On the left side, this is what we expect to see. On the right side, this is what happens if we draw the green square first: when we draw the red triangle, its pixels will overwrite these of the green square. (source)

Each pixel might be in the direction of various objects. But, most of the time, it should be of the color of the closest object. If we just draw all the objects overwriting the pixels each time, we might show something which is supposed to be hidden!

The most intuitive way to fix this is to just draw the objects in the appropriate order! This is z-sorting. But it is not a trivial task, and will not work in all cases (when A is partially behind B, B partially behind C and C partially behind A).

The more robust way is to remember, for each pixel, how far away the surface it currently shows is. When drawing something over this pixel, we check whether the new value is closer. If it is, we use the new value; otherwise, we keep the old one. This is the depth-buffer. For each pixel, we track its current color (red, green, blue and alpha), and its current depth.
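The per-pixel logic can be sketched in a few lines (a toy software version for illustration; the GPU does this in hardware, and the field names here are my own):

```c
/* A minimal software depth-buffer sketch: a fragment is drawn
 * only if it is closer than what the pixel already shows. */
typedef struct {
    float r, g, b, a;  /* current color of the pixel */
    float depth;       /* current depth (smaller = closer, here) */
} Pixel;

void draw_fragment(Pixel *p, float r, float g, float b, float a, float depth) {
    if (depth < p->depth) {
        /* the new fragment is closer: overwrite color and depth */
        p->r = r; p->g = g; p->b = b; p->a = a;
        p->depth = depth;
    }
    /* otherwise the existing, closer fragment is kept */
}
```

With this test in place, the drawing order of the green square and the red triangle no longer matters.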

Precision Issues

By default, OpenGL actually sets the depth-buffer (d) as the inverse of the distance from camera (z): d = 1 / z. This is a secondary effect of the perspective transform, which makes things look actually 3D.

If we divide the range of d into regular intervals (on the left), we notice that they cover vastly different intervals in actual depth (on the bottom); this causes precision issues for objects that are far away (source)

There are various ways to fix this. The Outerra blog offers two approaches for a logarithmic depth buffer:

  1. explicitly set the depth-buffer from z (or w) in the fragment shader
  2. override z when it will not affect perspective (which relies on w) and let OpenGL use it to set the depth-buffer

The first method works very well, but disables various optimizations, so it slows down rendering a bit. The second method lets OpenGL do its thing — so it is fast — but it does not work well with large triangles close to the camera.

For more information, you can also have a look at another article from the Outerra blog.

Writing this article has forced me to better understand the technique, and to fix a bug due to using the second method incorrectly (I had kept the multiplication by w from an earlier method).

With this out of the way, you can draw things very close-by or extremely far away in a single scene! No need for hacks such as splitting the scene in a close-up part and a far-away part!


Solving Kepler’s Equation 5 Million Times a Second

I know it is a controversial opinion, but you might need to know where things are when running a simulation. As a bonus, it helps you know what to draw on the screen.

Of course, you can just use a position vector (x, y, z), where x, y and z are the coordinates of the object. But things have the inconvenient tendency not to stay in the same place for all eternity. So we want to know where an object is at any given time.

Now, the more generic approach is to run a physics simulation. We compute the acceleration of the object, update its speed depending on the acceleration and update its position depending on its speed. By the way, use Runge-Kutta 4 for this, not Euler’s method.

Gravity accelerates the Moon towards the Earth, so its speed will tilt counter-clockwise in the picture (source)
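For reference, a single RK4 step for a generic first-order system looks like this (a scalar sketch of the classical method; the helper names are mine):

```c
/* One classical Runge-Kutta 4 step for y' = f(t, y), scalar case.
 * Much more accurate than Euler's y += h * f(t, y) per step. */
typedef double (*Deriv)(double t, double y);

double rk4_step(Deriv f, double t, double y, double h) {
    double k1 = f(t, y);
    double k2 = f(t + h / 2, y + h / 2 * k1);
    double k3 = f(t + h / 2, y + h / 2 * k2);
    double k4 = f(t + h, y + h * k3);
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4);
}

/* Example derivative: y' = y, whose exact solution is exp(t). */
static double identity_growth(double t, double y) { (void)t; return y; }

/* Integrate y' = y from y(0) = 1 up to t_end with step h;
 * the result should approximate exp(t_end). */
double exp_approx(double t_end, double h) {
    double t = 0.0, y = 1.0;
    while (t < t_end - h / 2) {
        y = rk4_step(identity_growth, t, y, h);
        t += h;
    }
    return y;
}
```

For a physics simulation, y would be the (position, velocity) state vector and f would compute (velocity, acceleration).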

That would work for things in orbit too. But we can do much better in this case. This one guy, Johannes Kepler, found simpler formulas to describe the movement of two orbiting objects (e.g. the Sun and the Earth, or the Earth and the Moon). Basically, the trajectories are ellipses, and we can deduce the position of an object in its orbit at any given time.

The orbits of planet1 and planet2 are ellipses such that one focus is the Sun (source)
The area swept by an orbiting object is always the same during a fixed duration (source)

We can describe the position of the object in its orbit as the angle between the direction of the periapsis (where the planet is closest to its sun) and the direction to the object, taken at the relevant focus. This is the true anomaly. Everything is an illusion, so we also like to pretend the ellipse is actually a circle and define the eccentric anomaly. But these angles are not linear with time. So we also define the mean anomaly, which works like an angle (goes from 0 to 2π) but is linear with time. Once we know the direction of the object, we will only need to know its distance, r.

And here are the formulas to get from time to the true anomaly, and how we then obtain the distance from focus (e is the eccentricity of the ellipse):

  • mean anomaly (M): M = M₀ + n × (t – t₀)
  • eccentric anomaly (E): M = E – e × sin E
  • true anomaly (f): $latex \displaystyle f=2\arctan\left(\sqrt{\frac{1+e}{1-e}}\,\tan\frac{E}{2}\right)$
  • distance from focus (r): $latex \displaystyle r=\frac{p}{1+e\cos f}$

You won’t be calculating this in your head, but the expressions are pretty easy to translate to working code. Well, except for one. Notice how the eccentric anomaly is not expressed as a function of the mean anomaly? Well, that’s the hard part. And it’s called Kepler’s equation.

Kepler’s equation in all its glory

We do not know a simple way to express E as a formula depending only on e and M. Instead, we must proceed by successive approximations until we decide we are close enough. For this, we will need two things:

  • a starting value E₀ for E
  • a way to improve the current approximation of E

Most introductory books and online articles I have found will tell you to take E₀ = M. This has been obsolete since 1978. It works well enough, but fails in corner cases (near-parabolic orbits around periapsis). And modern starting values are close enough to limit the number of improvements you then need to make. I reviewed 21 papers, selected the 3 most relevant ones, and implemented 21 methods for picking starting values (plus 5 methods for the hyperbolic case, from 3 papers).

To improve the starting value, old references will make you do E := M + e sin E, but this is pretty naive. Most modern references will use Newton’s method, which helps you solve an equation faster as long as you know its first derivative. But you can go further by using the second derivative using Halley’s method. You can go arbitrarily far using Householder’s method.

Great, now I have 21 × 4 = 84 approaches to try (5 × 3 = 15 for the hyperbolic case). Which one is the most accurate? Which one is the fastest? And, more importantly, can I find a fast and accurate approach?

You can spend years studying the theoretical properties of these functions. Or, we just try them for real. It is the most efficient way to know how they actually behave on real hardware with real implementations of numbers.

I wrote a benchmark tool to test both the accuracy and the speed of each approach. It turns out that only a handful of approaches are accurate enough for my taste:

# Elliptical case
## Naive method
## Newton's method
## Halley's method
     580 c  3.580e-16  gooding_10
     764 c  3.926e-16  gooding_11
     577 c  3.604e-16  mikkola_1
     742 c  3.653e-16  mikkola_2
## Householder's third order method
     606 c  3.645e-16  gooding_10
     809 c  3.623e-16  gooding_11
     641 c  3.935e-16  mikkola_1
     764 c  3.904e-16  mikkola_2

# Hyperbolic case
## Newton's method
## Halley's method
    1454 c  3.731e-15  mikkola_1
    1578 c  3.725e-15  mikkola_2
## Householder's third order method
    1704 c  3.618e-15  mikkola_1
    1757 c  3.618e-15  mikkola_2

The first column is the number of CPU cycles per solve of Kepler’s equation. The second column is the worst relative error over the test set. The third column is a short name for the method for picking a starting value for E.

As we can see above, no approach seems to work well enough with the naive improvement, or with Newton’s method. And using third derivatives does not do much. gooding_11 and mikkola_2 are relatively slow. This leaves gooding_10 and mikkola_1. I picked mikkola_1 because the paper also treated the hyperbolic case. It comes from A cubic approximation for Kepler’s equation by Seppo Mikkola.

And after all this, I have a very robust way to know where stuff is at any time! Of course, you should not forget to take care of floating point precision issues when implementing all of this…


Appendix: Full benchmark

# Elliptical case
## Naive method
     264 c  2.500e-01  smith_1
     276 c  5.303e-01  smith_2
     360 c  2.000e-01  smith_3
     444 c  9.701e-02  smith_4
     385 c  1.667e-01  smith_5
     480 c  6.495e+01  smith_6
     246 c  2.500e-01  gooding_1
     299 c  2.000e-01  gooding_2
     356 c  1.667e-01  gooding_3
     246 c  5.303e-01  gooding_4
     460 c  9.701e-02  gooding_5
     253 c  2.417e-01  gooding_6
     325 c  7.497e-02  gooding_7
     334 c  7.497e-02  gooding_7b
     378 c  1.277e-01  gooding_8
     423 c  7.499e-02  gooding_9
     455 c  2.376e-01  gooding_10
     608 c  8.020e-02  gooding_11
     286 c  1.439e-01  gooding_12
     448 c  4.342e-01  mikkola_1
     629 c  4.426e-01  mikkola_2
## Newton's method
     522 c  8.517e-03  smith_1
     539 c  1.379e-02  smith_2
     578 c  8.111e-03  smith_3
     697 c  8.111e-03  smith_4
     706 c  8.111e-03  smith_5
     900 c  1.028e+02  smith_6
     481 c  8.517e-03  gooding_1
     583 c  8.111e-03  gooding_2
     561 c  8.111e-03  gooding_3
     476 c  1.379e-02  gooding_4
     686 c  8.111e-03  gooding_5
     537 c  3.528e-02  gooding_6
     536 c  1.378e-02  gooding_7
     585 c  1.378e-02  gooding_7b
     702 c  2.524e-04  gooding_8
     680 c  1.379e-02  gooding_9
     731 c  1.117e-10  gooding_10
     940 c  2.072e-08  gooding_11
     528 c  5.673e-03  gooding_12
     697 c  1.825e-06  mikkola_1
     815 c  1.825e-06  mikkola_2
## Halley's method
     395 c  1.250e-01  smith_1
     419 c  2.553e-03  smith_2
     488 c  6.250e-02  smith_3
     596 c  1.710e-02  smith_4
     557 c  4.167e-02  smith_5
     989 c  5.130e+01  smith_6
     429 c  1.250e-01  gooding_1
     503 c  6.250e-02  gooding_2
     524 c  4.167e-02  gooding_3
     426 c  2.553e-03  gooding_4
     587 c  1.710e-02  gooding_5
     410 c  6.127e-03  gooding_6
     467 c  2.548e-03  gooding_7
     440 c  2.548e-03  gooding_7b
     508 c  9.153e-05  gooding_8
     552 c  2.550e-03  gooding_9
     598 c  3.580e-16  gooding_10
     821 c  3.926e-16  gooding_11
     430 c  1.028e-03  gooding_12
     594 c  3.604e-16  mikkola_1
     742 c  3.653e-16  mikkola_2
## Householder's third order method
     438 c  1.562e-02  smith_1
     439 c  6.779e-04  smith_2
     501 c  7.812e-03  smith_3
     667 c  2.138e-03  smith_4
     533 c  5.208e-03  smith_5
    1053 c  6.535e+01  smith_6
     453 c  1.562e-02  gooding_1
     501 c  7.812e-03  gooding_2
     594 c  5.208e-03  gooding_3
     429 c  6.779e-04  gooding_4
     595 c  2.138e-03  gooding_5
     454 c  1.638e-03  gooding_6
     526 c  6.736e-04  gooding_7
     585 c  6.736e-04  gooding_7b
     583 c  9.493e-09  gooding_8
     652 c  6.741e-04  gooding_9
     681 c  3.645e-16  gooding_10
     846 c  3.623e-16  gooding_11
     486 c  2.724e-04  gooding_12
     651 c  3.935e-16  mikkola_1
     752 c  3.904e-16  mikkola_2

# Hyperbolic case
## Newton's method
     780 c  7.245e+25  basic
    1064 c  1.421e-03  mikkola_1
    1201 c  1.970e-04  mikkola_2
    1014 c  8.559e-01  gooding
     892 c  1.515e+19  pfelger
## Halley's method
    1134 c  1.729e+17  basic
    1396 c  3.731e-15  mikkola_1
    1526 c  3.725e-15  mikkola_2
    1265 c  9.160e+10  gooding
    1237 c  1.791e+13  pfelger
## Householder's third order method
    1260 c  1.323e+07  basic
    1631 c  3.618e-15  mikkola_1
    1595 c  3.618e-15  mikkola_2
    1467 c  2.276e+17  gooding
    1365 c  9.997e+01  pfelger

Floating Point Woes in Three Dimensions

Sometimes, stuff in space rotates[citation needed]. In this article, I am going to explain the basic idea to simulate rotation efficiently, and a few technical hurdles I encountered while implementing it.

I have not yet gotten around to reinventing the wheel and writing a physics engine. Besides, I don’t want to simulate physics just to keep track of the orientation of the object through time. Doing so would drastically limit how much we can time-warp (to maybe ×10,000).

Instead, we can just assume a constant rotation (angular velocity), and deduce the orientation at any given time from the base orientation and the elapsed time. In other words, we set:

Orientation(t) = Orientation(t₀) + AngularVelocity × (t – t₀).

This approach fits very well with planets, whose angular velocity can be safely assumed to be constant.

To handle simple changes in angular velocity, we can just update t₀. So, when the ship fires its RCS thrusters to pivot, we set t₀ to the current time, Orientation(t₀) to the current orientation, and update the angular velocity.
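A minimal sketch of this bookkeeping, in the one-dimensional case where the orientation is a single angle (all names here are hypothetical):

```python
# Sketch of piecewise-constant angular velocity (1-D case: orientation
# is a single angle in radians; names are hypothetical).
class Spin:
    def __init__(self, t0, orientation0, angular_velocity):
        self.t0 = t0                              # reference time (s)
        self.orientation0 = orientation0          # orientation at t0 (rad)
        self.angular_velocity = angular_velocity  # rad/s, assumed constant

    def orientation(self, t):
        # Orientation(t) = Orientation(t0) + AngularVelocity * (t - t0)
        return self.orientation0 + self.angular_velocity * (t - self.t0)

    def set_angular_velocity(self, t, new_velocity):
        # Re-anchor at the current time so the orientation stays continuous.
        self.orientation0 = self.orientation(t)
        self.t0 = t
        self.angular_velocity = new_velocity
```

No integration loop is needed: querying the orientation at any time t is a single multiply-add, no matter how far we time-warp.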

Of course, the assumptions we make are not very accurate for non-spherical objects. For instance, we will not observe the Dzhanibekov effect (also known as tennis racket theorem) with this approach.

One thing remaining: how do we represent an orientation or an angular velocity? There are essentially four ways:

  • Euler angles
  • axis-angle
  • rotation matrix
  • quaternion

Let’s go over each in more detail.

Euler angles use three angles around fixed axes. They are somewhat intuitive, but are very inconvenient to work with.


You can also find the rotation’s axis to get the axis-angle representation. You’ll need three values to describe the axis, and one for the angle.


Rotation matrices can be easily combined. They also adapt well to represent translation and scaling, when using homogeneous coordinates (convenient for 3D rendering). But they need at least 9 parameters.


The theory behind quaternions is hard to grasp, but you do not need to. What matters is that you can easily combine them, extrapolate them, or interpolate between two orientations, and they only use four values.

Actually, an intuitive way to think about them is that the first components x, y and z correspond to the axis of rotation, while the w component represents the angle of the rotation: if your angle of rotation is θ, then w=cos(θ/2). We just need to keep their norm to 1, so we scale x, y and z appropriately.
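A sketch of this construction (a hypothetical helper, not GLM’s actual API): to keep the norm at 1, the axis components are scaled by sin(θ/2), since sin²(θ/2) + cos²(θ/2) = 1.

```python
import math

def quat_from_axis_angle(axis, theta):
    # axis is assumed to be a unit vector (x, y, z); theta in radians
    s = math.sin(theta / 2)
    x, y, z = (a * s for a in axis)   # axis scaled by sin(theta/2)
    w = math.cos(theta / 2)           # w encodes the angle
    return (x, y, z, w)

q = quat_from_axis_angle((0.0, 0.0, 1.0), math.pi / 2)
# x² + y² + z² + w² = sin²(θ/2) + cos²(θ/2) = 1: the norm stays 1
```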


So, clearly, I use quaternions wherever I can. But there is one thing to keep in mind. Quaternions (almost) uniquely represent a rotation, and a rotation by angle x is the same as the rotation by angle 2π + x around the same axis. So, if we try to store an angular velocity of 2π rad/s (one rotation per second) as a quaternion, we’ll just get a quaternion representing an angular velocity of 0 rad/s (well, 0 rotations per second).

The usual way around this is to represent angular velocity (and impulse) as axis-angle, since we can store whatever we want in the angle. But I like edge cases and floating point Eldritch horrors. So, instead, I just scale the angle down by something like K=2¹⁰⁰. We will be fine as long as the angular velocity is smaller than 7.96×10³⁰ rad/s (26 billion billion (no typo) times faster than the fastest spinning object ever).
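That bound is just the wrap-around point: the scaled angle θ/K must stay below 2π, so θ < 2πK. A quick sanity check:

```python
import math

K = 2.0**100
# Wrap-around happens when the scaled angle theta / K reaches 2*pi,
# so the largest representable angular velocity is 2*pi*K.
max_angular_velocity = 2 * math.pi * K
print(max_angular_velocity)  # about 7.96e30 rad/s
```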

To summarize so far, we’ll use a quaternion to represent both the rotation and the angular velocity, except that we will scale the angle of the angular velocity by some gigantic value K.

It means we will create quaternions with very small angles.

It means that the component w will be extremely close to 1 (the distance to 1 will generally be about (1/K)² = 2⁻²⁰⁰, since 1 − cos ε ≈ ε²/2).

It means that the component w will just be equal to 1 (since even double precision floating point numbers won’t have enough precision to tell the difference). We just lost all our precision!

It means that, when retrieving the angle from w (needed when scaling by t – t₀), we will just get 0.

It means that we will multiply 0 by K.

And 0 multiplied by a finite value (even very large) is still zero.
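The whole failure chain can be reproduced in a few lines of plain binary64 arithmetic, independent of any quaternion library:

```python
import math

K = 2.0**100
theta = 2 * math.pi                 # one full turn per second
w = math.cos(theta / (2 * K))       # the scaled half-angle is ~2.5e-30
print(w == 1.0)                     # True: w rounds to exactly 1.0
recovered = 2 * math.acos(w) * K    # retrieve the angle from w...
print(recovered)                    # 0.0: nothing will ever spin
```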

So nothing will ever spin. Great… back to Euler angles I guess?

Not so fast! Although we cannot retrieve small angles from the w component, the information is actually present in the other ones, x, y and z!

But you just told us that the x, y and z components are just the axis of rotation! Did you lie to us? Must my dreams always come crashing down at the altar of floating point numbers? I knew I should have gone with Cthulhu.

A sensible person when confronted with floating point arithmetic

It is true! But an axis is just a direction, while these three values also store a magnitude. Remember when I told you that we had to scale x, y and z so that the norm of the quaternion stays equal to 1? Well, that’s convenient, because it means that √(x² + y² + z²) = sin(θ/2).

And sin(θ/2) is very precise around zero in floating point arithmetic!
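A sketch of the recovery through the vector part, using the same K-scaled angle as in the text:

```python
import math

K = 2.0**100
theta = 2 * math.pi
half = theta / (2 * K)              # exact: dividing by a power of two
x, y, z = math.sin(half), 0.0, 0.0  # rotation around the x axis
norm = math.sqrt(x * x + y * y + z * z)
recovered = 2 * math.asin(norm) * K
print(recovered)                    # ~2*pi, to full double precision
```

Near zero, sin and asin are essentially the identity in floating point, so the tiny angle survives the round trip untouched.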

Which means we don’t lose any precision!

After checking with Herbie and opening a merge request in GLM (the library I use for manipulating quaternions), I am able to represent angular velocity as quaternions!

“And yet it moves!”

Could I just have used the axis-angle representation for angular velocity? Probably? Would it have been fun? Probably not as much.


Looking Up-Close at Orbits

In the previous post, I explained the basic idea that is used to draw orbits using line segments, and a little trick to make even very elliptical orbits look good. So, this is where we are:

Looks good.

But if we zoom in:

The line segments are visible again!

What is going on?! I thought we had taken care of spreading the line segments to hide the angle as much as possible!

A distraught hypothetical reader

Well, true. But we did it assuming we were looking straight down at the ellipse. Here, perspective comes into play. Although it preserves straight lines, perspective has the nasty habit of distorting angles. Here, it magnifies the angles between segments around the point where the orbit appears to bend towards the left. On the contrary, the segments on the left part look pretty much aligned.

But it’s okay. We only need to add a few more points close to the camera to compensate for this. It does mean that they will move as the object follows its orbit.

Here, we can clearly see the additional points moving alongside the focused object on the orbit. When we zoom in, they make sure we don’t see any unsightly corner.

But wait, there is more!

The Earth not being on its own orbit might be a bit of a problem

This is not the path of the Earth-Moon barycenter, which lies below the surface of the Earth anyway. The line really should pass through the center of the Earth.

So, what is going on? We do have a nice smooth-looking orbit, but it is still an approximation. It just so happens that we have looked at a time when the Earth was in the middle of a line segment. In the diagram below, you can imagine that the line segment is BX and the Earth is between the two.


This becomes more apparent as we speed things up:

The Earth follows the actual path of the ellipse while the orbit is just a bunch of line segments. The points between line segments are in the right position, which corresponds to the times when the orbit actually appears to go through Earth’s center.

To avoid this, we just need to ensure that at least one point of our orbit is always in the middle of the Earth. Actually, we can just make it so that it is one of the additional points from earlier!

Now everything looks smooth and the Earth is always on its orbit!

Actually, that was the easy part. Now let’s zoom a bit more for the fun one:

Showing Pluto here, as the effect is more visible. But we cannot ignore it just because it’s not a planet anymore! And the effect is even more visible on smaller objects such as spaceships, because we can zoom closer to the orbit.

To understand what is going on, we need to dive into how we actually calculate the positions of the points of our orbit. Digital computers only know about discrete values (that’s what the “digital” means). The basic unit is the “bit” (0 or 1) and, by using several bits, we can count natural integers fairly easily (0, 1, 10, 11, 100, 101, 110, 111, 1000). Add a sign bit, and we’ve got signed integers. We’re not going to be able to represent arbitrary real numbers short of using symbolic computation, but we can at least get something good enough by using decimals. The idea is that we would use two integers: one for the number’s digits, and one to tell where the decimal point should be. And that’s exactly what computers do. Well, almost.

Every explanation about floating point arithmetic must start with this example. It is known.
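The example in question is presumably the classic one:

```python
# The canonical floating point surprise: 0.1 and 0.2 have no exact
# binary representation, so their sum is not exactly 0.3.
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False
```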

Technically, computers use binary instead of base ten, but the principle is the same. If we really want to use base 10, we can, although it’s slower:
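In Python, for instance, the standard decimal module provides a base-10 format, which makes the previous sum behave as expected:

```python
from decimal import Decimal

# Base-10 arithmetic: 0.1 and 0.2 are now represented exactly.
print(Decimal("0.1") + Decimal("0.2"))                    # 0.3
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
```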

But the important point to keep in mind when working with non-integer values with computers is that they are (usually) only approximations of the number we want.

For instance, the approximation of π used by Python is slightly smaller than the actual value of π. So, if we compute the sine of this approximation, we get something slightly bigger than zero.
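We can check this directly in Python:

```python
import math

# math.pi is the binary64 value closest to pi, slightly below the real pi,
# so the sine of it lands slightly above zero instead of at zero.
print(math.pi)            # 3.141592653589793
print(math.sin(math.pi))  # about 1.2e-16, slightly bigger than zero
```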

We can get a good idea of the approximation. There are two usual formats used to represent non-integers: binary32 and binary64, more commonly known as float and double respectively (although Python’s float is binary64). The relative error of a binary32 is roughly 2⁻²⁴ and the relative error of a binary64 is roughly 2⁻⁵³.

The error in binary32 when dealing with values around the radius of Pluto’s orbit is about 352 km (Pluto’s radius is 1,188 km). With binary64, it’s less than a millimeter.
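These figures follow directly from the relative errors; a quick back-of-the-envelope check:

```python
radius = 5.9e12  # metres, roughly the radius of Pluto's orbit

error32 = radius * 2.0**-24  # binary32 relative error at that magnitude
error64 = radius * 2.0**-53  # binary64 relative error at that magnitude
print(error32 / 1000)        # about 352 km
print(error64 * 1000)        # about 0.66 mm
```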

Rendering to a screen which is no more than 8,000 pixels in size should not require more than binary32. So this is what is normally used for rendering. Using binary64 here would degrade performance a lot. But it does mean that we get imprecision when drawing things naively.

Let’s say the Sun is the center of our system of coordinates, and we use SI units. Then, Pluto is 5.9 Tm from the Sun. For rendering, we are going to compute the positions of every object relative to the Sun, compute the position of the camera (our point of view) relative to the Sun, and subtract the second from the first to find the positions of the objects relative to the camera.

Since this is naturally done during the rendering in the GPU shaders, we use binary32. The error from doing computations in binary32 is roughly 352 km in this case. So, let’s start from the camera position; we are at a distance of 0 (exactly). Then, let’s move to the Sun; we are at a distance of 5.9 Tm ± 352 km (relatively precise). And let’s move back to Pluto; we are at a distance of (5.9 Tm ± 352 km) − (5.9 Tm ± 352 km) = ±352 km (not very precise). This is called loss of significance.
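We can reproduce this on the CPU by rounding every value through binary32, as a stand-in for what the shader does (the 1000 km camera offset is just an illustrative number):

```python
import struct

def f32(x):
    # round a Python float (binary64) to the nearest binary32
    return struct.unpack("f", struct.pack("f", x))[0]

pluto = 5.9e12             # Pluto, relative to the Sun (metres)
camera = 5.9e12 - 1.0e6    # camera 1000 km closer to the Sun
d = f32(pluto) - f32(camera)
print(d)                   # not 1000000.0: off by tens of kilometres
```

Each operand was rounded with an absolute error of up to a few hundred kilometres, and the subtraction cancels the large common part, leaving only those errors behind.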

We really want to avoid doing such a computation in binary32. We must either switch to binary64 in the rendering, or avoid doing this computation in the shaders entirely. The first option is unattractive, since it can slow down the rendering significantly, and not all GPUs even support it.

How do we avoid doing this computation in the shaders, then? The most intuitive approach seems to be to switch the system of coordinates to take Pluto as the center. But this has issues too, since we still have the exact same imprecision when switching from one coordinate system to the other. The solution I opted for is actually simple: I do the imprecise computation using binary64 outside the shaders, which lets me have millimeter precision around Pluto. It can be seen as switching coordinates, but only for part of the rendering.
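The fix can be sketched the same way: do the subtraction in binary64 first, and only round the small result to binary32 (the `f32` helper simulates binary32 rounding):

```python
import struct

def f32(x):
    # round a Python float (binary64) to the nearest binary32
    return struct.unpack("f", struct.pack("f", x))[0]

pluto = 5.9e12             # Pluto, relative to the Sun (metres)
camera = 5.9e12 - 1.0e6    # camera 1000 km closer to the Sun
# Subtract on the CPU in binary64, hand the *small* result to the shader.
d = f32(pluto - camera)
print(d)                   # 1000000.0: the error is now relative to 1000 km
```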

The loss of significance is still here. But it’s now only 1 mm.

Note: I use the same technique when drawing celestial bodies. This is actually why the object being observed stays well positioned at the center of the screen and does not move erratically (unlike in the example below).

This is not all roses, however. For instance, let’s look at Pluto from the Earth.

This requires a FOV of 36 hundredths of an arcsecond. Current telescopes are far from being able to do this, but a software simulation should.

Since the camera is positioned at the center of the Earth, the distance to Pluto is of the order of 6 Tm. With this level of zoom, however, we can distinctly see the imprecision of roughly 350 km, causing the dwarf planet to not be perfectly centered on the screen.

Changing the system of coordinates is not quite enough, since the camera is not close to the observed object. So I will need to find another workaround.

So, seriously, don’t let the floating point bugs bite.