Categories
Programming

Float Woes in C

The C programming language is a gift that keeps giving. Today, we are going to see how a seemingly banal and common operation can hide unfathomable depths of unmentionable horrors.

Woes with Integer Coercion

What is the problem with the code below?

// returns the closest integer to f
int float2int(float f) {
    return (int) f;
}

It’s written in C.

A Rust enthusiast

That’s… a good point.

But let’s assume we actually want to write C, for some forsaken reason. And we do not care whether we convert up, down or sideways.

There is still a fundamental issue with this code. What happens when we call float2int with INFINITY?

#include <math.h>
#include <stdio.h>

int float2int(float f) {
    return (int) f;
}

int main(void) {
    printf("%d\n", float2int(INFINITY));
    printf("%d\n", float2int(-INFINITY));
    printf("%d\n", float2int(NAN));
}

When we compile this code with gcc and run the result, we get:

$ gcc -O3 test.c && ./a.out
2147483647
-2147483648
0

Makes sense! The largest integer is 2147483647; the smallest is –2147483648, and what would you choose for NAN but 0?

Someone still young and innocent

Now, let’s just remove that -O3 option:

$ gcc test.c && ./a.out
-2147483648
-2147483648
-2147483648

What?

The innocent who got their world shattered

Wait, it gets better:

$ clang -O3 test.c && ./a.out
1464089272
1488257696
1488257696
$ ./a.out
1259459480
-1806736736
-1806736736
$ ./a.out
-2098630344
1664811680
1664811680

But… why?

All hope is gone

Because it can.

Yup, that’s undefined behavior. Converting a non-finite value (i.e. an infinite or a NaN) to an integer is undefined behavior.

But that’s not all! What should the largest finite floating-point value (3.4028234664e+38) convert to? It’s much bigger than INT_MAX, the largest value that an int can represent (it does not matter whether you are on a 32-, 64- or 128-bit architecture, really).

Maybe we could just say that any floating-point number larger than INT_MAX should be converted to INT_MAX? But alas, that makes for unwelcome surprises down the line, when you realize that the converted value is not always within 1 of the original value.

Thankfully, the C standard always has the solution whenever there is a hint of ambiguity:

When a finite value of real floating type is converted to an integer type other than _Bool, the
fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral
part cannot be represented by the integer type, the behavior is undefined.

Part 6.3.1.4 paragraph 1 of the C standard

In other words, we’re just going to say that all the values that are not quite that convenient trigger undefined behavior. The values that “can be represented” by int are those in the interval [INT_MIN, INT_MAX]. Converting a float whose integral part falls outside this range to the int type is undefined behavior.

We’ll just have to check for that!

Someone has not learned their lesson

Woes with Range Checking

We are just getting to the actually tricky part. Let’s have a look at a seemingly robust implementation:

#include <limits.h>
#include <math.h>

int float2int(float f) {
    if (isnan(f)) {
        return 0;
    }
    if (f < INT_MIN) {  // also filters out -INFINITY
        return INT_MIN;
    }
    if (f > INT_MAX) {  // also filters out +INFINITY
        return INT_MAX;
    }
    return (int) f;
}

For the sake of simplicity, we just return arbitrary values for the inputs that we cannot convert. Other implementations might choose to return a status indicating whether the conversion is possible or not.

But, of course, there is a bug lurking in there. And it is extremely easy to miss.

And, this time, it is not caused by an undefined behavior, but by another surprising “feature” of C: implicit type conversions.

To understand what I mean by that, we need to look at what the conditions in the code above really mean. The first one is fine; let’s look at the other two: f < INT_MIN and f > INT_MAX. In both, the left operand is a float, and the right operand is an int. Processors rarely have such mixed-type operations built in. So a conversion must be taking place. The way it happens is described in the “usual arithmetic conversions”:

[…]

if the corresponding real type of either operand is float, the other operand is
converted, without change of type domain, to a type whose corresponding real type is float.

[…]

Part 6.3.1.8 paragraph 1 of the C standard

Thankfully, we only need the part I have quoted here. In short, the int operand gets converted to float, so the comparison happens between two floats. Let’s look at our conditions again.

if (f < (float) INT_MIN) {  // also filters -INFINITY
    return INT_MIN;
}
if (f > (float) INT_MAX) {  // also filters +INFINITY
    return INT_MAX;
}

Let’s look at the first condition. What is the value of (float) INT_MIN? Let’s assume a 32-bit two’s complement int type. Then INT_MIN is -2³¹. A float can represent this value exactly. You can check it out with an online converter. So (float) INT_MIN is -2³¹ as well. No problem here, this conversion does not change the value, and this code does exactly what we want.

With the same assumption, INT_MAX is 2³¹ – 1. And here comes the catch: a float cannot represent 2³¹ – 1 exactly. If you put “2147483647” in the online converter, you will see that it is rounded to a float whose value is actually “2147483648”. It makes sense: a float trades off being able to represent all integers in [-2³¹, 2³¹ – 1] in order to cover a much wider range and many more magnitudes. In other words, the actual value of (float) INT_MAX is INT_MAX + 1. The condition is actually doing:

if (f > INT_MAX + 1) {  // also filters +INFINITY
    return INT_MAX;
}

Note: if this were C, this code would contain an undefined behavior because of the signed integer overflow that happens when evaluating INT_MAX + 1. This is just pseudocode for illustration purposes.
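
You can check the rounding for yourself with a couple of printf calls (a quick sanity check, not part of the fix):

#include <limits.h>
#include <stdio.h>

int main(void) {
    printf("%.1f\n", (float) INT_MAX);  // prints 2147483648.0
    printf("%.1f\n", (float) INT_MIN);  // prints -2147483648.0
}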

Since the upper bound is offset by one, the code fails to filter out the half-open interval (INT_MAX, INT_MAX + 1]. If the input parameter f were to take its value from this range, converting it to int would be undefined behavior. Now, is there any float value in this interval? Yes! INT_MAX + 1 of course (it’s the only one).
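
For completeness, here is one way to patch the range check (a sketch, assuming a 32-bit two’s complement int and IEEE 754 floats): instead of comparing against (float) INT_MAX, which rounds up, we compare against 2³¹ directly, a value that a float can represent exactly.

#include <limits.h>
#include <math.h>

int float2int(float f) {
    if (isnan(f)) {
        return 0;
    }
    if (f < INT_MIN) {          // exact: (float) INT_MIN is -2^31; also filters -INFINITY
        return INT_MIN;
    }
    if (f >= 2147483648.0f) {   // 2^31, exactly representable; also filters +INFINITY
        return INT_MAX;
    }
    return (int) f;
}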

Conclusion

The “robust” implementation does not handle 2³¹ properly. This is not just a theoretical oddity. I encountered this issue in critical code in production while working at TrustInSoft. Such bugs will easily stay under the radar and avoid even the most intensive test campaigns. Only formal methods help in detecting and avoiding such errors.

The worst thing is that, since this is undefined behavior, you might encounter it many times and never know, corrupting data in unknown ways. Things that could have been a runtime error become the entry door for silent bugs and security issues.

C likes to keep you on your toes.

Categories
Programming

Flexible Array Members: Typical C Shenanigans

Speed. I am Speed.

The one thing that C cares about is speed. Seriously, you might not get the right results, but at least, you will get them fast.

Let’s look at two common ways to represent a vector in C.

struct vector {
    unsigned long size;
    int data[42];
};

The first one is to just use the built-in array type. This approach is simple, but it means choosing the size of the vector at compile time. The program cannot decide the size depending on user input or some other runtime parameter.

Note: Variable-Length Arrays (VLA) do allow declaring arrays whose length is decided at run-time. However, this is limited to local variables and is an optional feature since C11.

struct vector {
    unsigned long size;
    int* data;
};

The alternative is to not use an array at all. Instead, we use a pointer to a memory buffer that will contain the data. This memory location can be created with malloc, for instance.

Pointers can be used like arrays almost transparently, which keeps the syntax convenient (the “almost” is the reason most C programmers are paranoid about using sizeof on expressions). The sketch below shows how accessing an element at some index uses the same syntax in both cases.
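
A minimal sketch of such an accessor; the body is identical whichever definition of struct vector we pick, built-in array or pointer:

// works the same whether data is `int data[42]` or `int *data`
int vector_index(struct vector *v, unsigned long index) {
    return v->data[index];
}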

But although the syntax is the same, the meaning is different. Look at the assembly code generated to implement vector_index. Indexing the built-in array uses a single MOV instruction; indexing through a pointer uses two MOV instructions.

It makes sense. rdi contains the address to the struct and rsi the value of index. In the first case, we shift by 8 bytes to skip the field size, then we move by index * 4 bytes to select the right array element, then we read the value at this memory location (indicated by the bracket in the Intel assembly syntax). In the second case, we need to first recover the value of the field data, at an offset of 8 bytes from the beginning of the struct, then read at the right offset from the address it contains.

In other words, using a pointer instead of a built-in array is twice as slow! Okay, not really. Any modern CPU is going to recompile, pipeline, prefetch and out-of-order execute this such that it really won’t matter. But it used to! And it might still matter on microcontrollers.

In fact! I did some benchmarks, and, although there was no difference on my laptop (CPU model i5-4210H), I did see a tremendously large difference on my Arduino Leonardo (microcontroller model ATmega32u4). Indeed, the code with the additional indirection through a pointer was 10 % slower.

Yeah, I did expect it to be worse than that. But still, 10 % was not something to scoff at in the previous millennium. Also, CPUs at the time might have seen a larger difference than my ATmega32U4 released not that long ago.

In summary, embedding a built-in C array is fast, but inconvenient. Using dynamic allocation with malloc is more flexible, but not as fast. How can we make our compiler generate faster code while keeping the convenience?

Well, we can lie to it.

Lies. So Many Lies.

Let’s look at how we will typically create a vector with the first approach:

struct vector {
    unsigned long size;
    int data[42];
};
struct vector *vector_new(unsigned long size) {
    assert(size == 42);
    struct vector *ret = malloc(sizeof(struct vector));
    ret->size = size;
    return ret;
}

Notice how malloc is just another C function that can take any value as an argument at runtime? Notice how we are forcing ourselves to give it a compile-time value (the size of our struct)? What if we actually allocated the size we wanted?

For instance, if we allocated sizeof(struct vector) + 100 bytes, that would mean we would still have the 42 elements from the definition of struct vector, but we would have 100 additional bytes, enough to store 25 more 4-byte integers!

And so, a new idiom was born:

struct vector {
    unsigned long size;
    int data[1];
};
struct vector *vector_new(unsigned long size) {
    struct vector *ret = malloc(
        sizeof(struct vector)
        + sizeof(int) * (size - 1)
    );
    ret->size = size;
    return ret;
}

Since we are going to lie to the compiler about the size of the actual array anyway, we make the base size as small as possible. Since the C standard does not allow 0-length arrays (“An array type describes a contiguously allocated nonempty set of objects”), we just use 1. Then, when we create an instance of our vector, we are going to make sure to actually have enough memory for size elements. Since there is already 1 built in, we need to allocate room for size - 1 additional elements on top of the base struct.

C developers embraced this hack, wrote many fast programs, and lived happily ever after.

Well, kind of. I mean, if you write this, your program is probably going to be fast. Just as fast as a fixed-size array. But maybe it won’t. And maybe it will allow random people to take control of your computer. Or maybe it will blow up the spaceship. Who knows?

You might know where I am going with this: this is undefined behavior.

Yes, we did allocate enough memory to do what we want. But this is not the point. We lied to the compiler. And the compiler is a very delicate, gullible being. It will believe you. It will believe that your array is actually of size 1. And it will do unmentionable things to your code, because this assumption is false.

Save one Character, Save the Entire World

When they observe developers writing C, compiler authors and language lawyers alike get the urge to start again from scratch. This is actually the reason why there are so many flood myths through the ages.

This time, however, they recognized that the end justified the means. Compiler authors, too, want the programs they write to be fast (except C++ compiler authors).

And so, GCC introduced a way to do this:

struct vector {
    unsigned long size;
    int data[0];
};
struct vector *vector_new(unsigned long size) {
    struct vector *ret = malloc(
        sizeof(struct vector)
        + sizeof(int) * size
    );
    ret->size = size;
    return ret;
}

Notice the difference? We just replaced 1 by 0 for the size of the array.

But you said the C standard did not allow this!

Someone who pays attention to details

Indeed! But GNU C does! GNU C is the dialect that GCC understands, and it allows for 0-length arrays as an extension.

Of course, if you were using another compiler, you were out of luck… until!

Finally, in the C99 revision of the C standard, a new syntax was introduced!

struct vector {
    unsigned long size;
    int data[];
};
struct vector *vector_new(unsigned long size) {
    struct vector *ret = malloc(
        sizeof(struct vector)
        + sizeof(int) * size
    );
    ret->size = size;
    return ret;
}

Can you see the difference? We just removed the 0.

That’s all?

Someone who wasted too much time reading this; but worry not, it took even longer to write!

Yup. This syntax tells the compiler not to assume anything about the size of the array. You can still use it as usual. But ensuring that you allocate enough memory, and that you do not go out of bounds, is up to you.
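
As a quick usage sketch (assuming the vector_new above), indexing the flexible array member looks exactly like indexing a normal array; staying within the memory we allocated is entirely our job:

#include <stdlib.h>

// assuming struct vector and vector_new from above
int main(void) {
    struct vector *v = vector_new(100);
    for (unsigned long i = 0; i < v->size; i++) {
        v->data[i] = (int) i;  // fine: we allocated room for 100 ints
    }
    // v->data[100] = 0;       // out of bounds: undefined behavior again
    free(v);
}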

Less is More

We went from 1, to 0, to nothing. Since there is nothing left to take away, we attained perfection. It’s a general truism with C: the less C, the better.

Categories
Programming

Why Undefined Behavior Matters

So Many Bugs

So, you’ve been programming for one month, or for ten years, and you have gotten familiar with Murphy’s and Sturgeon’s laws. And you are wondering if there is some way to catch all the bugs, not just spray-and-pray with regression/unit/integration/system/acceptance testing.

Or you are a curious onlooker, wondering why all these programmers cannot make a piece of software that works reliably, what with apps doing unexpected things and with all the security breaches you hear of.

First, what are we talking about? Essentially, a bug is any time a system does not do what it is expected to do:

The Mars Climate Orbiter disintegrated into Mars because a module was using non-metric units and forgot to convert. Seriously.

As you can see, there is quite a spectrum. I like to sort bugs in three categories:

Today, I would like to focus on the last one. It looks like it should be the more tractable, since the scope is well constrained. But it is still a non-trivial problem.

What we want is two-fold:

  1. Specify what we expect the program to do
  2. Verify that the program meets our specification

Even the first step is tricky. For the sake of simplicity, we are going to reduce the scope of what we want to achieve, and focus on a well-delimited set of things our program should do: have a well-defined behavior.

What does this mean? Let us say you wrote a program that does some kind of scientific computation, except you used meters instead of kilometers. Then, the result is going to be wrong, of course, but in a predictable way: the result will be off by a factor of 1,000. If you run the program again the same way, you should get the same result.

On the contrary, if your program contains undefined behavior, it could do unexpected things. It might output the right value; it might output a wrong value; it might crash; it might allow an attacker to take control of your computer. And it might behave differently every time you run it. Worse: it might run perfectly fine during all your tests, but output the wrong value when it is being used to pilot a real plane with real passengers.

How can we know if a program contains undefined behavior? First, we need to talk about programming languages.

Why is C an “Unsafe” Language?

When people talk about “undefined behavior”, they are usually referring to undefined behaviors in the C programming language. There is a reason why this particular language is criticized as being “memory-unsafe”, and why so many security vulnerabilities stem from code written in C. And that reason is that the C language leaves the door open for many bugs.

To understand why, we must first ask ourselves what “the C programming language” refers to. Who decides what it is, and what if someone does things differently?

Historically, C was loosely defined and had many dialects. Every compiler author would do things a bit differently. The people writing a compiler to compile “C” code to 8080 assembly made different choices than the ones targeting PDP-11. The same program could end up doing different things on different processors. Things were messy.

So, the C compiler writers, and the C developers, and the processors manufacturers joined together and agreed on a common definition for what “C” really meant. Thus was born the C standard.

The C standard had to try to align things between people doing very different things. And these people cared a lot about performance. Processors at the time were about a hundred times slower than the ones today. And I am only talking about clock frequency here; nowadays, each CPU core can do much more in a cycle than the CPUs at that time, and there are now many cores in a modern CPU. It makes sense that they really wanted to keep compiling C code to efficient assembly, to have programs run at decent speeds.

Now, what happens if the code tries to access memory it is not supposed to? Well, the compiler cannot just add runtime checks everywhere, since that would have a negative impact on performance. And it cannot always know in advance that this will happen, since it would require it to solve arbitrary complex mathematical problems.

Because of this, the C standard did not require either. Instead, if the code does such an invalid operation, it is said to have “undefined behavior”.

So what? If code crashes, I’ll notice the problem. If it outputs the wrong result, my tests will notice it.

Simplicius, my imaginary straw-man C developer

Wait a minute, renegade figment of my imagination! Undefined behavior means that the compiler is allowed to produce code that does anything. In particular, it is allowed to assume that undefined behavior never occurs in the program, and deduce things from that. And you can prove many things if you assume something which is false.

Let’s take an example. The code below uses a toy authorization function that just checks whether the user id (uid) is in the list of authorized ids. As you might have already guessed from the topic of this article, there is an undefined behavior in this code. First question: can you see it?

#include <stdio.h>
int authorized_uids[] = {0, 1001, 1005, 1012};
int is_authorized(int uid) {
    for (int i = 0; i <= 4; i++) {
        if (uid == authorized_uids[i]) {
            return 1;
        }
    }
    return 0;
}
int main(void) {
    if (is_authorized(4242)) {
        puts("You are now logged in as root.");
    } else {
        puts("You are a lowly common user.");
    }
}

Spoiler: answer below.

The loop is written as for (int i = 0; i <= 4; i++), so i takes the values 0, 1, 2, 3 and 4. Then we use it as an index to read from an array: authorized_uids. However, this array only contains 4 elements, at indices 0, 1, 2 and 3. There is no index 4 in this array. Trying to access index 4 is an out-of-bounds access, which is a type of undefined behavior.

Second question: what will happen when we compile and run this code? You can actually do it, but do try to have a guess first.

The universal answer is: anything can happen. Since there is an undefined behavior, the program could do anything. However, there are two common things compilers will do in this case.

Option 1. It can translate this C code to assembly in a rather naive way. For an x86 processor, it would translate authorized_uids[i] as an assembly instruction reading memory at the address authorized_uids + i.

This might seem a bit obvious, but the compiler does not have to do that. Even when there is no undefined behavior, the compiler is allowed to emit any code it wants, as long as it produces the same result. For instance, the compiler might find a faster series of instructions that achieves the same results. This is what “optimizing” means.

If we then run the compiled program, the CPU will try to read at authorized_uids + i. Most likely, however, this will still be in the valid memory of the program. Indeed, memory is usually allocated generously (segments for the stack, pages for the heap). So, the assembly instruction will not cause a crash; it will return some value. Most likely, the value will not be 4242, so the test will fail, and the user will not be authorized.

This is the middle column in the compiler explorer below. You can see at the top the assembly instructions generated for is_authorized. And, at the bottom, you can see the output of the program: “You are a lowly common user.”.

Option 2. Now, let’s have a look at the rightmost column. The output is “You are now logged in as root.”! This is not what this program was supposed to do at all! And this is not just a fluke. It happens every time you run it!

Then, why is the result different this time? If you look closely, you will notice that the only difference is that we passed the option -O2 to the compiler. It tells the compiler to try harder to optimize the program.

In fact, if you look at the generated assembly, you will see that the compiler optimized the function to something extremely efficient:

is_authorized(int):
        movl    $1, %eax
        ret

This means that the function is_authorized just always returns 1. Wat?

Let’s try to understand the reasoning of the compiler. It sees:

int is_authorized(int uid) {
    for (int i = 0; i <= 4; i++) {
        if (uid == authorized_uids[i]) {
            return 1;
        }
    }
    return 0;
}

This is equivalent to:

int is_authorized(int uid) {
    if (uid == authorized_uids[0]) { return 1; }
    if (uid == authorized_uids[1]) { return 1; }
    if (uid == authorized_uids[2]) { return 1; }
    if (uid == authorized_uids[3]) { return 1; }
    if (uid == authorized_uids[4]) { return 1; }
    return 0;
}

But the compiler is allowed to assume that there is no undefined behavior in the code. So uid == authorized_uids[4] must never be evaluated. So the compiler can safely assume that the function always exits at one of the first 4 return statements. In other words, it always returns 1.

But couldn’t the compiler at least warn about this?

Someone rightfully annoyed with the state of the universe

Fighting Undefined Behavior

If you run gcc with -Wall -Wextra -Wpedantic or even clang with -Weverything, you still get no warnings about the undefined behavior!

Why don’t they warn about this? Because this is harder than it looks. The compiler cannot always know where undefined behavior will happen. Let’s have a look at the example below.

// hour ∈ [0, 23]
// minute ∈ [0, 59]
// second ∈ [0, 59]
int time_to_seconds(int hour, int minute, int second) {
    return hour * 3600 + minute * 60 + second;
}

If a compiler had to ensure the absence of undefined behavior when compiling the above code, they would have to assume that hour, minute and second contain any possible value from INT_MIN to INT_MAX. The compiler does not understand comments written for humans, and it does not understand variable or function names, even when they are chosen appropriately.

To make the code safe, the compiler would have to introduce runtime checks in the compiled code to make sure that no overflows occur during the computation. In practice, that would imply 2 comparisons and 2 branches before each of the 2 multiplications and each of the 2 additions. But C was not designed for safety. It was designed for speed. All these comparisons and branches add up. On a simple benchmark, I observed that a safe version would be between 2 and 3 times as slow as the unsafe one.
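
To make this concrete, here is a sketch of what such a checked version could look like, using the GCC/Clang __builtin_*_overflow intrinsics (an illustration of the overhead, nothing more):

#include <stdbool.h>

// Sketch of a checked version: every operation tests for overflow at runtime
// and reports failure instead of triggering undefined behavior.
bool time_to_seconds_checked(int hour, int minute, int second, int *out) {
    int h, m, hm;
    if (__builtin_mul_overflow(hour, 3600, &h))  return false;
    if (__builtin_mul_overflow(minute, 60, &m))  return false;
    if (__builtin_add_overflow(h, m, &hm))       return false;
    if (__builtin_add_overflow(hm, second, out)) return false;
    return true;
}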

For the same reason, if the compiler had to warn for every potential undefined behavior, developers would get flooded with false positives. In the example above, the compiler would have to emit a warning for each arithmetic operation. With many false alarms, the developers would have to get in the habit of just ignoring all the warnings.

But I do not care about performance that much! I want runtime checks everywhere. Or at least, I want to see the warnings! I set -Wall -Wextra -Wpedantic -Wconversion -Weverything. I really do want to know when there are potential undefined behaviors in my code.

Let’s address each of these.

Runtime checks. The clang compiler does have the ability to add runtime checks! ASan (Address Sanitizer) looks closely at the way memory is allocated, deallocated and used. UBSan (Undefined Behavior Sanitizer) looks at a few additional things, such as invalid arithmetic operations. Both of them work by adding instructions to the resulting binary executable, which will check everything. Since the program has more things to do, it runs slower. Valgrind/memcheck works directly on the compiled program, and inserts its own checks when running it. It also makes running the program much slower. These tools are able to catch a few types of issues every time they actually happen in the program. However, it is often non-trivial to understand why it happened. Using the function from the previous example, compiling with clang -fsanitize=undefined time-to-seconds.c and running the program would result in something like:

time-to-seconds.c:5:17: runtime error: signed integer overflow: 2147483647 * 3600 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior time-to-seconds.c:5:17 in 
time-to-seconds.c:5:33: runtime error: signed integer overflow: 2147483647 * 60 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior time-to-seconds.c:5:33 in 

Syntactic checks. Although they do not understand what your program does, compilers are able to notice suspicious patterns in the source code. For instance, they can see if (x = 42) and emit a warning, since this is a common mistake where the developer should have written if (x == 42). Modern compilers now include an extensive set of such patterns to warn the user about potential mistakes. Some tools are specialized in finding such patterns in the code, such as clang-tidy or cppcheck. Since they do not change the way the code is compiled, they incur no slowdown when actually running the program. However, they can only find common mistakes, or they would risk emitting too many false warnings. In the example above, no linter emits any warning, since it would likely be a false positive. However, they do warn about more obvious cases:

warning: integer overflow in expression of type ‘int’ results in ‘1’ [-Woverflow]
   39 |     INT_MAX * INT_MAX;
      |             ^

If we want to go further, we need to look at formal methods. The idea is to use mathematical and systematical approaches to find all the undefined behaviors in a provable way. This means that such methods even allow you to prove the absence of undefined behavior (when there is none, of course).

I want to talk about two techniques from formal methods in particular. I am familiar with them because the company where I work, TrustInSoft, sells an industrial tool to apply them to real code used in critical applications.

Abstract interpretation. The idea for this one is intuitive. Essentially, we are going to interpret the source code in an abstract machine. This abstract machine will interpret each statement from the code exactly as specified by the standard. If undefined behavior can happen, a warning can be emitted. In short, this can be thought of as an implementation of the abstract machine described by the C standard.

The interesting thing is that you can go even further by having the abstract machine work with sets of values instead of singular values. For instance, you can analyze a function for all possible inputs (the values for each parameter being represented as a set of actual values), instead of just one at a time. As an example, let us consider the code below.

int abs(int x) {
    if (x < 0) {
        return -x;
    } else {
        return x;
    }
}

We can analyze this using the Eva plugin of Frama-C, or the Value plugin of TrustInSoft Analyzer.

$ frama-c -eva abs.c -main abs -verbose=0
[eva:alarm] abs.c:3: Warning: signed overflow. assert -x ≤ 2147483647;

This command interprets the source code in an abstract machine, starting from the abs function. By default, it assumes the argument can be any possible value (any integer from INT_MIN to INT_MAX). When checking the program for all these possible values, it notices a problem in one specific case: when x is INT_MIN, computing -x is undefined behavior, because it would be bigger than INT_MAX. Indeed, usually INT_MIN is -2³¹ and INT_MAX is 2³¹ - 1, because of the way integers are represented. If you are interested in knowing how these programs manage to work with huge sets of values, look up interval arithmetic.

Weakest Precondition. The other technique is much more general and can actually be used to verify arbitrary facts on programs, not just to find undefined behaviors. For instance, with this, you can prove that your implementation of a sorting algorithm actually does what it is supposed to do. However, it also takes much longer to learn to use it, and to analyze a real program. If you are actually interested in it, I can only recommend the excellent tutorial written by Allan Blanchard.

What About Other Languages?

We explained why C is an unsafe language. But are the other languages unsafe?

C++. This is essentially a superset of C. And it adds its own undefined behaviors. It does introduce features to avoid doing mistakes common in C code, but the complexity it brings also hands another footgun to the user.

Java, Python, Ruby. These languages work by being compiled to byte code, which is something similar to assembly, except that it is not run directly by the processor. Instead, it is run by the corresponding virtual machine. In other words, each language targets a single architecture. While C needs to encompass CPUs whose memory models vary significantly, these languages can specify strictly what should happen. They do not care as much about performance, either. The result is that, if you try to access out of the bounds of an array, you do not get undefined behavior. You get a nice, clean exception telling you what the problem is, and a call stack telling you where it happened.

JavaScript. JavaScript is the worst client-side language. It is also the only one. Surprisingly, however, it is safe in the sense that there are no undefined behaviors (there were some until ECMAScript 2016).

Rust. Now, Rust is an interesting case. Like C, it is designed to be fast. Unlike C, it is designed to be safe. It achieves this thanks to the concept of unsafety. By default, checks are added wherever they are needed, and dangerous operations are simply prohibited. You can turn off these checks to do lower level operations and improve performance where needed, but you have to do so explicitly. The point is that unsafety is opt-in, so developers know to be careful with undefined behaviors.

As you can see, we can do quite a bit about undefined behavior. There are very fast, simple-to-use tools that might find common mistakes, and there are advanced, exhaustive tools that require more involvement. And there are languages trying to make safety the default.

There is still a long way to go until software becomes as reliable as it could be, but we’re getting there, one byte at a time.

Further Reading

Categories
Kepler Project

Continuous Integration and Delivery Made Easy

These are not really advanced topics, but I wanted to write about them anyway. So here it is.

Continuous Integration

Sometimes, I report bugs in software. Other times, in libraries. Or I can just fix things by myself. And more. Point being: there are lots of bugs everywhere. What do we, software developers, do to avoid writing too many bugs?

We run tests. Unit tests. Integration tests. System tests. Acceptance tests. And where possible, we write software to do these tests for us. Usually, that means unit tests and integration tests. Usually, such tests won’t guarantee that our code is without defect. But at least they give us confidence that it does work for the cases we do test. If we change something and the tests still pass, we probably have not broken something major. This lets us iterate quickly without having to double and triple check whether a change broke one of many features. But even then, starting the tests and waiting for the results is burdensome, so we might get complacent and forget to run them before releasing our changes.

So we get even more software to remember that for us. That’s what we refer to as “continuous integration”, or CI for short. Whenever we submit a new feature, a bugfix, or even if we just want to slightly change the shadow under a button, the CI will run the tests and make sure we have not broken anything. For instance, in the CI of Kepler Project, I run linters to find obvious defects in code, static analyzers to find less obvious defects, as well as a suite of unit tests. I plan to add integration tests at some point, but it is slightly annoying to run them in graphical software.

Continuous Delivery

Once we are confident our changes will not accidentally launch the nukes, we want to actually release them to our users. But software developers are lazy, and they will spend hours automating a task that only takes minutes. In the end, this is the next step after continuous integration, so we call it “continuous delivery” (CD).

An added bonus of a well-designed CD is that you will be able to tell exactly what code was used to create a specific release. For instance, in Kepler Project, the CD starts whenever I create a new git tag. The release it generates uses the tag as the version number. Since a git tag points to a specific state of the code, I can easily check out the code corresponding to a version, to locate a bug, for instance. I can also browse all the changes between two versions.

CI/CD

If you have any kind of non-trivial project, you will want to at least run a CI. And even a basic CD will probably help you in tracking changes from a released piece of software back to the corresponding code in your code history.

Categories
Kepler Project

Building for Windows without Running Windows

By the way, I use Linux

1998 2008 2018 2022 will be the year of the Linux desktop. Definitely sometime before 2028. Maybe before 2038? Ok, maybe not. In any case, I am the most comfortable working (and studying, and procrastinating, and gaming, and procrastinating) on Linux (yeah, yeah, GNU/Linux or GNU+Linux, I know). Actually I am a weakling running Debian with the non-tiling WM Xfce4 (I could as well install Xubuntu at this point).

Anyway, the point is that I still need to target Windows as a platform for the 96 % of users who have not yet seen the light. But I do not really want to have to reboot to switch to a different operating system and maintain multiple development environments. Besides, I need to build my project in the CI/CD pipeline for automatic testing and release building, but I really do not want to deal with a Windows virtual machine in the cloud.

If only I could build for Windows without running Windows…

Cross-Compiling

Well, it’s a thing. Actually, compiling for X without running on X is a thing. It’s even pretty much mandatory when you cannot run a compiler on the target system. So, if you are developing for Arduino, you are most likely compiling from amd64 (Intel and AMD microprocessors) towards avr (Atmel’s microcontrollers). It’s called cross-compiling.

What makes cross-compiling not completely trivial is that the file hierarchy and dependency management are mostly built around the assumption that compiled code will be run on the current system. So, on my computer, binary libraries in /usr/lib are amd64, /usr/bin/gcc compiles to amd64 and apt installs amd64 binaries by default. Debian does support various architectures, so, I can tell apt to install arm64 libraries, and it will install them in /usr/lib/aarch64-linux-gnu/. Binaries of different architectures, however, will conflict: both amd64 and arm64 Firefox will want to be /usr/bin/firefox. But, in any case, I do not want a gcc compiled for arm64! I want a gcc compiled for amd64 that will compile to arm64. For this, I have to install gcc-aarch64-linux-gnu. With this, I can easily set up a working environment where I will be compiling for arm64 from my amd64 computer.

Cross-Compiling to Windows

But wait, I do not want to cross-compile to arm64! I actually want to cross-compile from amd64 to amd64, but for Windows! Does that even make sense? It does! If I install mingw-w64, I get a new gcc at /usr/bin/x86_64-w64-mingw32-gcc (actually a symbolic link to /usr/bin/x86_64-w64-mingw32-gcc-win32). And then:

$ cat hw.c
#include <stdio.h>
int main(void) {
    puts("Hello World!");
}
$ x86_64-w64-mingw32-gcc hw.c
$ file a.exe 
a.exe: PE32+ executable (console) x86-64, for MS Windows

I actually do get a PE file, an executable for Windows!

Good, I can compile code for Windows. Now I need to install my project’s dependencies, such as GLFW3. But there is no such thing as libglfw3-mingw-w64-dev. In fact, there are only a few libraries available for this platform in the Debian repositories:

  • libassuan-mingw-w64-dev
  • libgcrypt-mingw-w64-dev
  • libgpg-error-mingw-w64-dev
  • libksba-mingw-w64-dev
  • libz-mingw-w64-dev
  • libnpth-mingw-w64-dev

For everything else, you’ll need to compile the dependencies yourself and install them in /usr/local/x86_64-w64-mingw32/. Yup, no magic wand here. And of course, you will need to cross-compile them to Windows. And fix the bugs you will encounter while doing so.

Pro-tip: write yourself a script to build and install these dependencies. It will serve as documentation for when someone (probably future you) wants to install the development environment. And you will have to do it one way or another if you want to cross-compile in Docker for your CI/CD pipeline.

Running Windows binaries from Linux

How do we run the “Hello World!” example from earlier? Easy!

$ wine a.exe
[useless warnings]
Hello World!

Yup, that’s it.

Okay, okay, it is still useful to test the Windows build on an actual Windows or, at the very least, a Windows VM, since it sometimes behaves differently.

Anyway, that’s all for today. And don’t think too much about floating point determinism when cross-compiling.

Categories
Kepler Project

Astronomical Depth Buffer

Guess what. We’re going to talk about floating point arithmetic and what happens when using distances that range from the millimeter to Pluto’s distance to the Sun. Yes, yet again.

Floating Point Arithmetic

OpenGL uses single precision floating point arithmetic. In any case, that’s what your GPU prefers to use. This means you can be as precise as 60 parts in a billion, but not more. That’s more than enough for drawing human-sized objects: the error is much smaller than the width of a human hair. And it causes no problem when drawing the Solar System: the error at Pluto’s orbit is 352 km, which is not visible when looking at the whole thing. But you are going to have issues drawing a human if you do not know whether it is right here or 352 km away.
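
(Where do these numbers come from? A single precision float has a 24-bit significand, so its relative error is about 2⁻²⁴ ≈ 6 × 10⁻⁸. Multiply that by Pluto’s distance to the Sun, roughly 5.9 × 10⁹ km, and you get about 352 km.)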

We can avoid the main problem by doing most computations in double precision and centering around the camera before switching to single precision for the rendering, as I briefly touched on in a previous article. And that’s fine! We do not need more than single precision to locate a pixel on the screen after all.

So far so good. But using a wide range of coordinates in the scene surfaces another issue: the naive use of the depth-buffer.

The Depth-Buffer

3D rendering, at its core, is not very complex. You ask to draw some triangles, your GPU does some math to know where it should draw them on your screen and gives colors to the pixels that end up inside them. And that’s it! Well, almost. There is one tiny remaining problem.

Let’s say there is a green square close by, and a red triangle farther away (behind the green square). On the left side, this is what we expect to see. On the right side, this is what happens if we draw the green square first: when we draw the red triangle, its pixels will overwrite these of the green square. (source)

Each pixel might be in the direction of various objects. But, most of the time, it should be of the color of the closest object. If we just draw all the objects overwriting the pixels each time, we might show something which is supposed to be hidden!

The most intuitive way to fix this is to just draw the objects in the appropriate order! This is z-sorting. But it is not a trivial task, and will not work in all cases (when A is partially behind B, B partially behind C and C partially behind A).

The more robust way is to remember, for each pixel, how far away the thing it currently shows is. When drawing something over this pixel, we check whether the new value is closer. If it is, we use the new value, otherwise we keep the old one. This is the depth-buffer. For each pixel, we’ll track its current color (red, green, blue and alpha), and its current depth.
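
In C-like pseudocode, the bookkeeping looks something like this (a sketch of the idea; the real test happens on the GPU, and the names here are made up):

#include <stdint.h>

#define WIDTH  1920
#define HEIGHT 1080

struct framebuffer {
    float    depth[HEIGHT][WIDTH];   // distance of the closest thing drawn so far
    uint32_t color[HEIGHT][WIDTH];   // its color
};

// Draw a fragment at (x, y) only if it is closer than what is already there.
void draw_fragment(struct framebuffer *fb, int x, int y,
                   uint32_t color, float depth) {
    if (depth < fb->depth[y][x]) {
        fb->depth[y][x] = depth;
        fb->color[y][x] = color;
    }
}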

Precision Issues

By default, OpenGL actually sets the depth-buffer (d) as the inverse of the distance from camera (z): d = 1 / z. This is a secondary effect of the perspective transform, which makes things look actually 3D.

If we divide the range of d into regular intervals (on the left), we notice that they cover vastly different intervals in actual depth (on the bottom); this causes precision issues for objects which are far away (source)

There are various ways to fix this. The Outerra blog offers two approaches for a logarithmic depth buffer:

  1. explicitly set the depth-buffer from z (or w) in the fragment shader
  2. override z when it will not affect perspective (which relies on w) and let OpenGL use it to set the depth-buffer

The first method works very well, but disables various optimizations, so it slows down rendering a bit. The second method lets OpenGL do its thing — so it is fast — but it does not work well with large triangles close to the camera.

For more information, you can also have a look at another article from the Outerra blog.

Writing this article has forced me to better understand the technique, and to fix a bug due to using the second method incorrectly (I had kept the multiplication by w from an earlier method).

With this out of the way, you can draw things very close-by or extremely far away in a single scene! No need for hacks such as splitting the scene in a close-up part and a far-away part!

Categories
Kepler Project

Solving Kepler’s Equation 5 Million Times a Second

I know it is a controversial opinion, but you might need to know where things are when running a simulation. As a bonus, it helps you in knowing what to draw on the screen.

Of course, you can just use a position vector (x, y, z) where x, y and z are the coordinates of the object. But things have the inconvenient tendency not to stay in the same place for all eternity. So we’ll want to know where that object is at any given time.

Now, the more generic approach is to run a physics simulation. We compute the acceleration of the object, update its speed depending on the acceleration and update its position depending on its speed. By the way, use Runge-Kutta 4 for this, not Euler’s method.

Gravity accelerates the Moon towards the Earth, so its speed will tilt counter-clockwise in the picture (source)

That would work for things in orbit too. But we can do much better in this case. This one guy, Johannes Kepler found simpler formulas to describe the movement of two orbiting objects (e.g. the Sun and the Earth, or the Earth and the Moon). Basically, the trajectories will be ellipses, and we can deduce the position of an object in its orbit at any given time.

The orbits of planet1 and planet2 are ellipses such that one focus is the Sun (source)
The area swept by an orbiting object is always the same during a fixed duration (source)

We can describe the position of the object in its orbit as the angle between the direction of the periapsis (where the planet is closest to its sun) and the direction to the object, taken at the relevant focus. This is the true anomaly. Everything is an illusion, so we also like to pretend the ellipse is actually a circle and define the eccentric anomaly. But these angles are not linear with time. So we also define the mean anomaly, which works like an angle (goes from 0 to 2π) but is linear with time. Once we know the direction of the object, we will only need to know its distance, r.

And here are the formulas to get from time to the true anomaly, and how we then obtain the distance from focus (e is the eccentricity of the ellipse, and p its semi-latus rectum):

  • mean anomaly (M): M = M₀ + n × (t – t₀)
  • eccentric anomaly (E): M = E – e × sin E
  • true anomaly (f): f = 2 × arctan(√((1 + e)/(1 − e)) × tan(E/2))
  • distance from focus (r): r = p / (1 + e × cos f)

You won’t be calculating this in your head, but the expressions are pretty easy to translate to working code. Well, except for one. Notice how the eccentric anomaly is not expressed as a function of the mean anomaly? Well, that’s the hard part. And it’s called Kepler’s equation.

Kepler’s equation in all its glory

We do not know a simple way to express E as a formula depending only on e and M. Instead, we must proceed by successive approximations until we decide we are close enough. For this, we will need two things:

  • a starting value E₀ for E
  • a way to improve the current approximation of E

Most introductory books and online articles I have found will tell you to take E₀ = M. This has been obsolete since 1978. It works well enough, but fails in corner cases (near-parabolic orbits around periapsis). And modern starting values are close enough to limit the number of improvements you then need to make. I reviewed 21 papers, selected the 3 most relevant ones, and implemented 21 methods for picking starting values (plus 5 methods for the hyperbolic case, from 3 papers).

To improve the starting value, old references will make you do E := M + e sin E, but this is pretty naive. Most modern references will use Newton’s method, which helps you solve an equation faster as long as you know its first derivative. But you can go further by also using the second derivative, with Halley’s method. You can go arbitrarily far using Householder’s methods.
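
To give an idea of what an improvement step looks like, here is a sketch of Newton’s method applied to Kepler’s equation (just the idea, not the benchmarked implementation; E0 is whatever starting value we picked):

#include <math.h>

// Solve M = E - e*sin(E) for E with Newton's method, starting from E0.
double kepler_newton(double M, double e, double E0) {
    double E = E0;
    for (int i = 0; i < 20; i++) {
        double f  = E - e * sin(E) - M;   // how far we are from a solution
        double fp = 1.0 - e * cos(E);     // derivative of the left-hand side
        double dE = f / fp;
        E -= dE;
        if (fabs(dE) < 1e-15) {
            break;                        // close enough
        }
    }
    return E;
}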

Great, now I have 21 × 4 = 84 approaches to try (5 × 3 = 15 for the hyperbolic case). Which one is the most accurate? Which one is the fastest? And, more importantly, can I find a fast and accurate approach?

You can spend years studying the theoretical properties of these functions. Or we can just try them for real. It is the most efficient way to know how they actually behave on real hardware with real implementations of numbers.

I wrote a benchmark tool to test both the accuracy and the speed of each approach. It turns out that only a handful of approaches are accurate enough for my taste:

# Elliptical case
## Naive method
## Newton's method
## Halley's method
     580 c  3.580e-16  gooding_10
     764 c  3.926e-16  gooding_11
     577 c  3.604e-16  mikkola_1
     742 c  3.653e-16  mikkola_2
## Householder's third order method
     606 c  3.645e-16  gooding_10
     809 c  3.623e-16  gooding_11
     641 c  3.935e-16  mikkola_1
     764 c  3.904e-16  mikkola_2

# Hyperbolic case
## Newton's method
## Halley's method
    1454 c  3.731e-15  mikkola_1
    1578 c  3.725e-15  mikkola_2
## Householder's third order method
    1704 c  3.618e-15  mikkola_1
    1757 c  3.618e-15  mikkola_2

The first column is the number of CPU cycles per solve of Kepler’s equation. The second column is the worst relative error over the test set. The third column is a short name for the method for picking a starting value for E.

As we can see above, no approach seems to work well enough with the naive improvement, or with Newton’s method. And using third derivatives does not do much. gooding_11 and mikkola_2 are relatively slow. This leaves gooding_10 and mikkola_1. I picked mikkola_1 because the paper also treated the hyperbolic case. It comes from A cubic approximation for Kepler’s equation by Seppo Mikkola.

And after all this, I have a very robust way to know where stuff is at any time! Of course, you should not forget to take care of floating point precision issues when implementing all of this…

Appendix: References

Appendix: Full benchmark

# Elliptical case
## Naive method
     264 c  2.500e-01  smith_1
     276 c  5.303e-01  smith_2
     360 c  2.000e-01  smith_3
     444 c  9.701e-02  smith_4
     385 c  1.667e-01  smith_5
     480 c  6.495e+01  smith_6
     246 c  2.500e-01  gooding_1
     299 c  2.000e-01  gooding_2
     356 c  1.667e-01  gooding_3
     246 c  5.303e-01  gooding_4
     460 c  9.701e-02  gooding_5
     253 c  2.417e-01  gooding_6
     325 c  7.497e-02  gooding_7
     334 c  7.497e-02  gooding_7b
     378 c  1.277e-01  gooding_8
     423 c  7.499e-02  gooding_9
     455 c  2.376e-01  gooding_10
     608 c  8.020e-02  gooding_11
     286 c  1.439e-01  gooding_12
     448 c  4.342e-01  mikkola_1
     629 c  4.426e-01  mikkola_2
## Newton's method
     522 c  8.517e-03  smith_1
     539 c  1.379e-02  smith_2
     578 c  8.111e-03  smith_3
     697 c  8.111e-03  smith_4
     706 c  8.111e-03  smith_5
     900 c  1.028e+02  smith_6
     481 c  8.517e-03  gooding_1
     583 c  8.111e-03  gooding_2
     561 c  8.111e-03  gooding_3
     476 c  1.379e-02  gooding_4
     686 c  8.111e-03  gooding_5
     537 c  3.528e-02  gooding_6
     536 c  1.378e-02  gooding_7
     585 c  1.378e-02  gooding_7b
     702 c  2.524e-04  gooding_8
     680 c  1.379e-02  gooding_9
     731 c  1.117e-10  gooding_10
     940 c  2.072e-08  gooding_11
     528 c  5.673e-03  gooding_12
     697 c  1.825e-06  mikkola_1
     815 c  1.825e-06  mikkola_2
## Halley's method
     395 c  1.250e-01  smith_1
     419 c  2.553e-03  smith_2
     488 c  6.250e-02  smith_3
     596 c  1.710e-02  smith_4
     557 c  4.167e-02  smith_5
     989 c  5.130e+01  smith_6
     429 c  1.250e-01  gooding_1
     503 c  6.250e-02  gooding_2
     524 c  4.167e-02  gooding_3
     426 c  2.553e-03  gooding_4
     587 c  1.710e-02  gooding_5
     410 c  6.127e-03  gooding_6
     467 c  2.548e-03  gooding_7
     440 c  2.548e-03  gooding_7b
     508 c  9.153e-05  gooding_8
     552 c  2.550e-03  gooding_9
     598 c  3.580e-16  gooding_10
     821 c  3.926e-16  gooding_11
     430 c  1.028e-03  gooding_12
     594 c  3.604e-16  mikkola_1
     742 c  3.653e-16  mikkola_2
## Householder's third order method
     438 c  1.562e-02  smith_1
     439 c  6.779e-04  smith_2
     501 c  7.812e-03  smith_3
     667 c  2.138e-03  smith_4
     533 c  5.208e-03  smith_5
    1053 c  6.535e+01  smith_6
     453 c  1.562e-02  gooding_1
     501 c  7.812e-03  gooding_2
     594 c  5.208e-03  gooding_3
     429 c  6.779e-04  gooding_4
     595 c  2.138e-03  gooding_5
     454 c  1.638e-03  gooding_6
     526 c  6.736e-04  gooding_7
     585 c  6.736e-04  gooding_7b
     583 c  9.493e-09  gooding_8
     652 c  6.741e-04  gooding_9
     681 c  3.645e-16  gooding_10
     846 c  3.623e-16  gooding_11
     486 c  2.724e-04  gooding_12
     651 c  3.935e-16  mikkola_1
     752 c  3.904e-16  mikkola_2

# Hyperbolic case
## Newton's method
     780 c  7.245e+25  basic
    1064 c  1.421e-03  mikkola_1
    1201 c  1.970e-04  mikkola_2
    1014 c  8.559e-01  gooding
     892 c  1.515e+19  pfelger
## Halley's method
    1134 c  1.729e+17  basic
    1396 c  3.731e-15  mikkola_1
    1526 c  3.725e-15  mikkola_2
    1265 c  9.160e+10  gooding
    1237 c  1.791e+13  pfelger
## Householder's third order method
    1260 c  1.323e+07  basic
    1631 c  3.618e-15  mikkola_1
    1595 c  3.618e-15  mikkola_2
    1467 c  2.276e+17  gooding
    1365 c  9.997e+01  pfelger
Categories
Kepler Project

Floating Point Woes in Three Dimensions

Sometimes, stuff in space rotates[citation needed]. In this article, I am going to explain the basic idea to simulate rotation efficiently, and a few technical hurdles I encountered while implementing it.

I have not yet gotten down to reinventing the wheel by writing a physics engine. Besides, I don’t want to simulate physics just to keep track of the orientation of the object through time. Doing so would drastically limit how much we can time-warp (to maybe ×10,000).

Instead, we can just assume a constant rotation (angular velocity), and deduce the orientation at any given time from the base orientation and the elapsed time. In other words, we set:

Orientation(t) = Orientation(t₀) + AngularVelocity × (t – t₀).

This approach fits very well with planets, whose angular velocity can be safely assumed to be constant.

To handle simple changes in angular velocity, we can just update t₀. So, when the ship fires its RCS to pivot, we set t₀ to the current time, Orientation(t₀) to the current orientation, and update the angular velocity.

Of course, the assumptions we make are not very accurate for non-spherical objects. For instance, we will not observe the Dzhanibekov effect (also known as tennis racket theorem) with this approach.

One thing remaining: how do we represent an orientation or an angular velocity? There are essentially four ways:

  • Euler angles
  • axis-angle
  • rotation matrix
  • quaternion

Let’s go over each in more detail.

Euler angles use three angles around fixed axes. They are somewhat intuitive, but are very inconvenient to work with.

(source)

You can also find the rotation’s axis to get the axis-angle representation. You’ll need three values to describe the axis, and one for the angle.

(source)

Rotation matrices can be easily combined. They also adapt well to represent translation and scaling, when using homogeneous coordinates (convenient for 3D rendering). But they need at least 9 parameters.

(source)

The theory behind quaternions is hard to grasp, but you do not need to. What matters is that you can easily combine them, extrapolate them, or interpolate between two orientations, and they only use four values.

Actually, an intuitive way to think about them is that the first components x, y and z correspond to the axis of rotation, while the w component represents the angle of the rotation: if your angle of rotation is θ, then w=cos(θ/2). We just need to keep their norm to 1, so we scale x, y and z appropriately.

(source)

So, clearly, I use quaternions wherever I can. But there is one thing to keep in mind. Quaternions (almost) uniquely represent a rotation, and a rotation by angle x is the same as the rotation by angle 2π + x around the same axis. So, if we try to store an angular velocity of 2π rad/s (one rotation per second) as a quaternion, we’ll just have a quaternion representing an angular velocity of 0 rad/s (well, 0 rotation per second).

The usual way around this is to represent angular velocity (and impulse) as axis-angle, since we can store whatever we want in the angle. But I like edge cases and floating point Eldritch horrors. So, instead, I just scale the angle down by something like K=2¹⁰⁰. We will be fine as long as the angular velocity is smaller than 7.96×10³⁰ rad/s (26 billion billion (no typo) times faster than the fastest spinning object ever).
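
(That threshold is just 2π × K = 2π × 2¹⁰⁰ ≈ 7.96 × 10³⁰ rad/s: the point at which the scaled-down angle itself wraps around a full turn.)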

To summarize so far, we’ll use a quaternion to represent both the rotation and the angular velocity, except that we will scale the angle of the angular velocity by some gigantic value K.

It means we will create quaternions with very small angles.

It means that the component w will be extremely close to 1 (1 − w will generally be on the order of 1/K², that is, around 2⁻²⁰⁰).

It means that the component w will just be equal to 1 (since even double precision floating point numbers won’t have enough precision to tell the difference). We just lost all our precision!

It means that, when retrieving the angle from w (needed when scaling by t – t₀), we will just get 0.

It means that we will multiply 0 by K.

And 0 multiplied by a finite value (even very large) is still zero.

So nothing will ever spin. Great… back to Euler angles I guess?

Not so fast! Although we cannot retrieve small angles from the w component, the information is actually present in the other ones, x, y and z!

But you just told us that the x, y and z components are just the axis of rotation! Did you lie to us? Must my dreams always come crashing down at the altar of floating point numbers? I knew I should have gone with Cthulhu.

A sensible person when confronted with floating point arithmetic

It is true! But an axis is just a direction, while these three values also store a magnitude. Remember when I told you that we had to scale x, y and z so that the norm of the quaternion stays equal to 1? Well, that’s convenient, because it means that √(x² + y² + z²) = sin(θ/2).

And sin(θ/2) is very precise around zero in floating point arithmetic: since sin(x) ≈ x near zero, tiny values keep their full relative precision!

Which means we don’t lose any precision!

After checking with Herbie and opening a merge request in GLM (the library I use for manipulating quaternions), I am able to represent angular velocity as quaternions!
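The gist of the fix, as a Python sketch (the actual change lives in GLM; these helpers are mine): recover the angle from the vector part rather than from w.

import math

def quat_to_axis_angle(q):
    # q = (x, y, z, w), assumed to be normalized.
    x, y, z, w = q
    half_sin = math.sqrt(x * x + y * y + z * z)   # equals sin(angle / 2)
    if half_sin == 0.0:
        return (1.0, 0.0, 0.0), 0.0               # no rotation: any axis will do
    # acos(w) would return exactly 0 as soon as w rounds to 1.0; atan2 of the
    # vector part keeps the tiny angles our scaled angular velocities produce.
    angle = 2.0 * math.atan2(half_sin, w)
    return (x / half_sin, y / half_sin, z / half_sin), angle

# With the scaling trick above, the real spin rate is K times the recovered angle:
#   axis, scaled_angle = quat_to_axis_angle(stored_angular_velocity)
#   omega = scaled_angle * K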

“And yet it moves!”

Could I just have used the axis-angle representation for angular velocity? Probably? Would it have been fun? Probably not as much.

Categories
Kepler Project

Looking Up-Close at Orbits

In the previous post, I explained the basic idea that is used to draw orbits using line segments, and a little trick to make even very elliptical orbits look good. So, this is where we are:

Looks good.

But if we zoom in:

The line segments are visible again!

What is going on?! I thought we had taken care of spreading the line segments to hide the angle as much as possible!

A distraught hypothetical reader

Well, true. But we did it assuming we were looking straight down at the ellipse. Here, perspective comes into play. Although it preserves straight lines, perspective has the nasty habit of distorting angles. Here, it magnifies the angles between segments around the point where the orbit appears to bend towards the left. On the contrary, the segments on the left part look pretty much aligned.

But it’s okay. We only need to add a few more points close to the camera to compensate for this. It does mean that they will move as the object follows its orbit.

Here, we can clearly see the additional points moving alongside the focused object on the orbit. When we zoom in, they make sure we don’t see any unsightly corner.

But wait, there is more!

The Earth not being on its own orbit might be a bit of a problem

This is not the path of the Earth-Moon barycenter, which lies beneath the surface of the Earth anyway. The line really should pass through the center of the Earth.

So, what is going on? We do have a nice smooth-looking orbit, but it is still an approximation. It just so happens that we have looked at a time when the Earth was in the middle of a line segment. In the diagram below, you can imagine that the line segment is the chord BX and the Earth is on the arc between the two endpoints.

Source: https://en.wikipedia.org/wiki/File:Chord_in_mathematics.svg

This becomes more apparent as we speed things up:

The Earth follows the actual path of the ellipse while the drawn orbit is just a bunch of line segments. The points joining the line segments are in the right positions, which correspond to the times when the orbit actually appears to go through the Earth’s center.

To avoid this, we just need to ensure that at least one point of our orbit is always in the middle of the Earth. Actually, we can just make it so that it is one of the additional points from earlier!

Now everything looks smooth and the Earth is always on its orbit!

Actually, that was the easy part. Now let’s zoom a bit more for the fun one:

Showing Pluto here, as the effect is more visible. But we cannot ignore it just because it’s not a planet anymore! And the effect is even more visible on smaller objects such as spaceships, because we can zoom closer to the orbit.

To understand what is going on, we need to dive into how we actually calculate the positions of the points of our orbit. Digital computers only know about discrete values (that’s what “digital” means). The basic unit is the “bit” (0 or 1) and, by using several bits, we can count natural numbers fairly easily (0, 1, 10, 11, 100, 101, 110, 111, 1000). Add a sign bit, and we’ve got signed integers. We’re not going to be able to represent arbitrary real numbers short of using symbolic computation, but we can at least get something good enough by using decimals. The idea is that we would use two integers: one for the number’s digits, and one to tell where the decimal point should be. And that’s exactly what computers do. Well, almost.
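In a Python prompt, the classic demonstration goes like this:

>>> 0.1 + 0.2
0.30000000000000004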

Every explanation about floating point arithmetic must start with this example. It is known.

Technically, computers use binary instead of base ten, but the principle is the same. If we really want to use base 10, we can, although it is slower:
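With Python’s decimal module, for instance, the arithmetic really does happen in base ten:

>>> from decimal import Decimal
>>> Decimal("0.1") + Decimal("0.2")
Decimal('0.3')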

But the important point to keep in mind when working with non-integer values on a computer is that they are (usually) only approximations of the numbers we want.

For instance, the approximation of π used by Python is slightly smaller than the actual value of π. So, if we compute the sine of this approximation, we get something slightly bigger than zero.
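For instance, in a Python prompt:

>>> import math
>>> math.sin(math.pi)    # a tiny positive number, not zero
1.2246467991473532e-16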

We can get a good idea of the approximation. There are two usual formats used to represent non-integer values: binary32 and binary64, more commonly known as float and double respectively (although Python’s float is a binary64). The relative error of a binary32 is roughly 2⁻²⁴ and the relative error of a binary64 is roughly 2⁻⁵³.

When dealing with values around the radius of Pluto’s orbit, the error in binary32 is about 352 km (for comparison, Pluto’s radius is 1,188 km). With binary64, it’s less than a millimeter.
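These figures are just the relative errors multiplied by the distance; you can check them in a couple of lines (5.9 Tm being Pluto’s rough distance from the Sun):

print(5.9e12 * 2 ** -24)  # binary32: roughly 3.5e5 m, i.e. about 352 km
print(5.9e12 * 2 ** -53)  # binary64: roughly 6.6e-4 m, well under a millimeter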

Rendering to a screen that is no more than 8,000 pixels on a side should not require more than binary32 precision. So this is what is normally used for rendering; using binary64 here would degrade performance a lot. But it does mean that we get imprecision when drawing things naively.

Let’s say the Sun is the center of our system of coordinates, and we use SI units. Then Pluto is 5.9 Tm from the Sun. For rendering, we are going to compute the position of every object relative to the Sun, compute the position of the camera (our point of view) relative to the Sun, and subtract the second from the first to find the position of each object relative to the camera.

Since this is naturally done during rendering, in the GPU shaders, we use binary32. The error from doing computations in binary32 is roughly 352 km in this case. So, let’s start from the camera position; we are at a distance of 0 (exactly). Then, let’s move to the Sun; we are at a distance of 5.9 Tm ± 352 km (relatively precise). And let’s move back to Pluto; we are at a distance of (5.9 Tm ± 352 km) – (5.9 Tm ± 352 km) = ±352 km (not very precise). This is called loss of significance.
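Here is a small demonstration of that cancellation, assuming NumPy is available for its float32 type (the numbers are illustrative):

import numpy as np

pluto_x  = 5.9e12          # Pluto's distance from the Sun, in metres (roughly)
camera_x = 5.9e12 + 500.0  # a camera hovering 500 m further out

# In binary64, the subtraction recovers the 500 m offset exactly.
print(camera_x - pluto_x)                          # 500.0

# In binary32, representable values near 6 Tm are spaced about 500 km apart,
# so both positions typically round to the same number and the offset vanishes.
print(np.float32(camera_x) - np.float32(pluto_x))  # typically 0.0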

We really want to avoid doing such a computation in binary32. So we must either switch to binary64 in the rendering, or avoid doing this computation in the shaders entirely. The first option is unattractive, since it can slow down rendering significantly, and not all GPUs even support binary64.

How do we avoid doing this computation in the shaders, then? The most intuitive approach seems to be switching the coordinate system to take Pluto as the center. But this has issues too, since we get the exact same imprecision when converting from one coordinate system to the other. The solution I opted for is actually simple: I do the cancellation-prone subtraction in binary64, outside the shaders, which lets me have millimeter precision around Pluto. It can be seen as switching coordinates, but only for part of the rendering.

The loss of significance is still here. But it’s now only 1 mm.
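A sketch of the idea (the function name and layout are mine, not the project’s): keep absolute positions in binary64 on the CPU, do the dangerous subtraction there, and only hand small camera-relative offsets to the GPU in binary32.

import numpy as np

def camera_relative_positions(world_positions, camera_position):
    # Both inputs are binary64, in metres, relative to the Sun.
    world_positions = np.asarray(world_positions, dtype=np.float64)
    camera_position = np.asarray(camera_position, dtype=np.float64)
    # This subtraction is where significance would be lost, so do it in binary64...
    offsets = world_positions - camera_position
    # ...and only then downcast: offsets near the camera are small, so binary32's
    # relative error now translates into a tiny absolute error.
    return offsets.astype(np.float32)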

Note: I use the same technique when drawing celestial bodies. This is actually why the object being observed stays properly positioned at the center of the screen and does not move erratically (unlike in the example below).

This is not all roses, however. For instance, let’s look at Pluto from the Earth:

This requires a field of view of 36 hundredths of an arcsecond. Current telescopes are far from being able to resolve this, but a software simulation should manage.

Since the camera is positioned at the center of the Earth, the distance to Pluto is on the order of 6 Tm. At this level of zoom, however, we can distinctly see the imprecision of roughly 350 km, causing the dwarf planet to not be perfectly centered on the screen.

Changing the system of coordinates is not quite enough here, since the camera is not close to the observed object. So I will need to find another workaround.

So, seriously, don’t let the floating point bugs bite.

Categories
Kepler Project

Drawing Nice Orbits

So, I have been working on Kepler Project again, and I thought I would write about some of the interesting challenges I have encountered and how I overcame them.

The Shapes of an Orbit

Today, we will talk about drawing orbits. I will talk about the orbital computations themselves another time.

Here, we can see the full orbits of the four terrestrial (i.e. rocky) planets: from interior to exterior, Mercury, Venus, Earth and Mars. In the middle, of course, is the Sun, and the outermost orbit partially visible here is that of Ceres, a dwarf planet whose orbit lies between those of Mars and Jupiter. The red circles show the current (as of 2021-01-27) locations of the planets in their orbits.

The planets of the Solar System follow ellipses as they revolve around the Sun. These ellipses are fairly close to circles, with the notable exception of Mercury: in the screenshot above, you can tell that the Sun is not in the middle of Mercury’s orbit (the innermost one). In fact, the trajectory of Mercury draws an ellipse with the Sun at one of its foci.

Zooming out a bit, we see that there are quite a few dwarf planets orbiting the Sun from far away, somewhat haphazardly. The four regular orbits in the middle are those of the gas giants: from interior to exterior, Jupiter (whose orbit is partially hidden behind all the red circles of the inner planets), Saturn, Uranus and Neptune. We find the dwarf planet Eris in the bottom-right corner, and Sedna, a bit above, whose orbit is so extended that it takes about ten thousand Earth years to complete a single revolution.

From the first image, and from the usual depictions of the Solar System, you could think that the orbits are always neatly arranged. But this picture becomes increasingly inaccurate as we move farther away from the Sun.

In this side-view, we can better appreciate that the orbits of the eight planets and Ceres lie almost in the same plane, while the orbits of the outer dwarf planets have a much more pronounced inclination.

From this, we know we are going to need to draw ellipses of various shapes (eccentricity) and sizes (I use the periapsis to characterize the size of an orbit) in all kinds of orientations (since there are three spatial dimensions, we need three parameters; I use the longitude of the ascending node, the inclination and the argument of periapsis).

Drawing an Ellipse

GPUs let you draw points, line segments and triangles. And that’s pretty much it. The point is that these primitives let you draw anything if you use enough of them. By restricting ourselves to the simplest primitives, we let the GPU designers focus on doing one thing well. Of course, there is a huge swath of features to help you draw millions of them efficiently, but everything you see in a 3D game is pretty much drawn with textured triangles.

So, how do we draw an ellipse when we can only draw line segments? Well, you pick points on the ellipse and connect them together to make a polygon that approximates the real curve:

Here, with only 8 segments to draw an orbit, you can definitely tell something is off. By the way, notice that, as a consequence, the objects (in the red circles) are not exactly on the orbit we’ve drawn; this will come up later.
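In code, the naive sampling is just a matter of spacing the parameter of the standard ellipse parametrization evenly (a sketch with my own names; a and b are the semi-major and semi-minor axes):

import math

def ellipse_points_naive(a, b, n):
    # Take n evenly spaced values of the parameter t and use the parametric
    # equation of the ellipse centered at the origin: x = a*cos(t), y = b*sin(t).
    return [(a * math.cos(2.0 * math.pi * k / n), b * math.sin(2.0 * math.pi * k / n))
            for k in range(n)]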

If we use enough segments, the illusion becomes good enough:

In Kepler Project, each orbit is drawn using 256 segments. It could become visible at a high enough resolution, but it does the job.

So, we’re done, right? Not quite. We have only looked at near-circular orbits. Look at what happens with a more eccentric (more elliptical) orbit:

I have set the apogee to 924 Mm, at the edge of Earth’s sphere of influence. The orbit has an eccentricity of 0.985.

Drawing a Nice Ellipse

So far so good. But let’s zoom in closer to Earth and the perigee of our orbit:

The line segments are clearly visible on the right. It would only get worse with more eccentric orbits.

This is not good enough for me. I might accept glitches at extreme and unrealistic orbits, but this is not the case here. We could of course raise the number of segments used to draw the ellipse, and it would work (here, 1024 seems enough). But we should ask ourselves why. Why do we need more segments in this case? Let’s have a look:

I switched to only drawing the points (well, big points), and only 64 of them per orbit.

The curve is pretty tame in the middle of the ellipse (on the left of the image), but bends much more sharply near the extremities (in the center of the image). From this, we should notice that we have more than enough points in the middle, and not enough at the extremities. If we distributed them better, we would not need more points!

How do we do that?

I found a suggestion on Stack Overflow, but the reasoning there is not very easy to follow, so I lay it out here. We want to make sure the angle between two consecutive line segments is small enough that it is not visible.

First trick: we can simplify things by reframing the problem as making sure the angle between two consecutive segments is always the same. To understand why this makes sense, imagine the angles between consecutive segments are all equal; then, if you decrease the angle between two consecutive segments, you have to increase at least one angle between two other consecutive segments (for a fixed number of segments, of course), which would make the worst corner more visible, not less.

Second trick: to avoid the annoying part of finding a formula for the angle between two segments depending on the placement of their extremities, we will actually look at the tangents to the ellipse at the points we choose.

Third trick: we already know we want a constant angle difference between two consecutive tangents, which means we already pretty much know this angle: since the tangent direction turns by a total of 2π around the whole ellipse, the difference must be 2π / n, where n is the number of segments (note: strictly, we should not assume one of the points will have an angle of 0; but we’re going to want to have a point at the periapsis anyway).

These tricks allow us to forget entirely about the line segments and just look at the tangent to the orbit at a given point:

The primary sits at one of the foci of the ellipse. The center of the ellipse, though, does not map to anything in particular and could very well be in outer space. E is short for “eccentric anomaly”, which is the angle between the periapsis and the object, measured at the center (hence “eccentric”). Another common angle is the “true anomaly”, which is the angle between the periapsis and the object, measured at the primary, usually the actual point of reference (hence “true”).

In the figure above, we know α, but we need to know E. Once we know E, the actual coordinates of A can be calculated using the parametric formula for the ellipse: x = a cos E and y = b sin E. And now we use:

Just kidding, we use applied mathematics witchcraft:

$latex \begin{aligned}\tan \alpha &= \frac{dy}{dx} = \frac{dy / dE}{dx / dE} = \frac{\frac{d}{dE}\, b \sin E}{\frac{d}{dE}\, a \cos E} \\ &= \frac{b \cos E}{-a \sin E} = -\frac{b}{a \tan E}\end{aligned}&s=4$

Now, solving for E, we get:

$latex E = -\arctan \frac{b}{a \tan \alpha}&s=4$

This formula is not convenient because of the discontinuity around α = 0, so we use some more trigonometric sorcery to get to:

$latex E = \frac{\pi}{2} - \arctan\left(\frac{a}{b} \tan \alpha\right)&s=4$
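Putting it together, here is a Python sketch with my own names; it uses the atan2 form of the tan E = -b / (a tan α) relation derived above, which sidesteps the discontinuity directly:

import math

def ellipse_points_even_tangents(a, b, n):
    """Sample n points on the ellipse x = a*cos(E), y = b*sin(E) such that the
    tangent direction turns by the same angle, 2*pi/n, between consecutive points."""
    points = []
    for k in range(n):
        # Desired tangent direction; starting at pi/2 puts the first point at the periapsis.
        alpha = math.pi / 2.0 + 2.0 * math.pi * k / n
        # atan2 form of tan(E) = -b / (a * tan(alpha)): same relation, but without
        # the discontinuities of tan and arctan.
        E = math.atan2(-b * math.cos(alpha), a * math.sin(alpha))
        points.append((a * math.cos(E), b * math.sin(E)))
    return points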

Let’s look at the effect of using this formula on the repartition of our points:

Still with 64 points. They are much more concentrated around the perigee (and apogee on the other side, not visible here)

And the final result when switching back to line segments:

Still with 256 segments

Now, that’s good enough. Actually, it works well even with extreme orbits:

The orbit shown here has a perihelion of 1 Gm and an aphelion of 1 Zm (or about a hundred thousand light-years). Drawing it with 256 line segments looks fine from above, but I actually increase the number to 4096 in this case, so that looking towards the aphelion from near the perihelion does not look too weird due to perspective (the formula above does not take perspective into account).

To be continued

The formula above can easily be adapted to parabolic and hyperbolic orbits.

There are still a few issues when we look up close (when we actually want to look at our spaceship, for instance). But there is enough to talk about for another article. Until then… don’t let the floating point bugs bite!