
Float Woes in C

The C programming language is a gift that keeps giving. Today, we are going to see how a seemingly banal and common operation can hide unfathomable depths of unmentionable horrors.

Woes with Integer Coercion

What is the problem with the code below?

// returns the closest integer to f
int float2int(float f) {
    return (int) f;
}

It’s written in C.

A Rust enthusiast

That’s… a good point.

But let’s assume we actually want to write C, for some forsaken reason. And we do not care whether we convert up, down or sideways.

There is still a fundamental issue with this code. What happens when we call float2int with INFINITY?

#include <math.h>
#include <stdio.h>

int float2int(float f) {
    return (int) f;
}

int main(void) {
    printf("%d\n", float2int(INFINITY));
    printf("%d\n", float2int(-INFINITY));
    printf("%d\n", float2int(NAN));
}

When we compile this code with gcc and run the result, we get:

$ gcc -O3 test.c && ./a.out
2147483647
-2147483648
0

Makes sense! The largest integer is 2147483647; the smallest is -2147483648, and what would you choose for NAN but 0?

Someone still young and innocent

Now, let’s just remove that -O3 option:

$ gcc test.c && ./a.out
-2147483648
-2147483648
-2147483648

What?

The innocent who got their world shattered

Wait, it gets better:

$ clang -O3 test.c && ./a.out
1464089272
1488257696
1488257696
$ ./a.out
1259459480
-1806736736
-1806736736
$ ./a.out
-2098630344
1664811680
1664811680

But… why?

All hope is gone

Because it can.

Yup, that’s undefined behavior. Converting a non-finite value (i.e. an infinity or a NaN) to an integer is undefined behavior.

But that’s not all! What should the largest finite floating-point value (3.4028234664e+38) convert to? It is much bigger than INT_MAX, the largest value an int can represent (it does not matter whether you are on a 32-, 64- or 128-bit architecture, really).

Maybe we could just say that all the floating-point numbers larger than INT_MAX should be clamped to INT_MAX when converted? But alas, that makes for unwelcome surprises down the line, when you realize that the converted value is not always within 1 of the original value.

Thankfully, the C standard always has the solution whenever there is a hint of ambiguity:

When a finite value of real floating type is converted to an integer type other than _Bool, the
fractional part is discarded (i.e., the value is truncated toward zero). If the value of the
integral part cannot be represented by the integer type, the behavior is undefined.

Part 6.3.1.4 paragraph 1 of the C standard

In other words, we’re just going to say that all the values that are not quite that convenient trigger undefined behavior. The values that an int can represent form the interval [INT_MIN, INT_MAX]. Attempting to convert any float value whose integral part falls outside this range to the int type is undefined behavior.

We’ll just have to check for that!

Someone has not learned their lesson

Woes with Range Checking

We are just getting to the actually tricky part. Let’s have a look at a seemingly robust implementation:

#include <limits.h>
#include <math.h>

int float2int(float f) {
    if (isnan(f)) {
        return 0;
    }
    if (f < INT_MIN) {  // also filters out -INFINITY
        return INT_MIN;
    }
    if (f > INT_MAX) {  // also filters out +INFINITY
        return INT_MAX;
    }
    return (int) f;
}

For the sake of simplicity, we just return arbitrary values for the inputs that we cannot convert. Other implementations might choose to return a status indicating whether the conversion is possible or not.

But, of course, there is a bug lurking in there. And it is extremely easy to miss.

And, this time, it is not caused by an undefined behavior, but by another surprising “feature” of C: implicit type conversions.

To understand what I mean by that, we need to look at what the conditions in the code above really mean. The first one is fine; let’s look at the other two: f < INT_MIN and f > INT_MAX. In both, the left operand is a float, and the right operand is an int. Processors rarely have such mixed built-in operations, so a conversion must be taking place. The way it happens is described by the “usual arithmetic conversions”:

[…]

if the corresponding real type of either operand is float, the other operand is
converted, without change of type domain, to a type whose corresponding real type is float.

[…]

Part 6.3.1.8 paragraph 1 of the C standard

Thankfully, we only need the part I have quoted here. In short, both operands get converted to float. Let’s look at our conditions again.

if (f < (float) INT_MIN) {  // also filters -INFINITY
    return INT_MIN;
}
if (f > (float) INT_MAX) {  // also filters +INFINITY
    return INT_MAX;
}

Let’s look at the first condition. What is the value of (float) INT_MIN? Let’s assume a 32-bit two’s-complement int type. Then INT_MIN is -2³¹. A float can represent this value exactly. You can check it out with an online converter. So (float) INT_MIN is -2³¹ as well. No problem here: this conversion does not change the value, and this code does exactly what we want.

With the same assumption, INT_MAX is 2³¹ - 1. And here comes the catch: a float cannot represent 2³¹ - 1 exactly. If you put “2147483647” in the online converter, you will see that it is rounded to a float whose value is actually “2147483648”. It makes sense: a float trades off being able to represent all integers in [-2³¹, 2³¹ - 1] in order to cover a much wider range and many more magnitudes. In other words, the actual value of (float) INT_MAX is INT_MAX + 1. The condition is actually doing:

if (f > INT_MAX + 1) {  // also filters +INFINITY
    return INT_MAX;
}

Note: if this were C, this code would contain an undefined behavior because of the signed integer overflow that happens when evaluating INT_MAX + 1. This is just pseudocode for illustration purposes.
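You can check the claim about (float) INT_MAX on your own machine. Below is a quick sketch (assuming a 32-bit int), printing through double so that the exact values held by the floats are visible:

#include <limits.h>
#include <stdio.h>

int main(void) {
    // printing as double shows the exact value held by each float
    printf("%.1f\n", (double) (float) INT_MIN);  // -2147483648.0
    printf("%.1f\n", (double) (float) INT_MAX);  // 2147483648.0, i.e. INT_MAX + 1
}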

Since the upper bound is offset by one, the code fails to filter out the half-open interval (INT_MAX, INT_MAX + 1]. If the input parameter f were to take its value from this range, converting it to int would be undefined behavior. Now, is there any float value in this interval? Yes! INT_MAX + 1 of course (it’s the only one).
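Before moving on, here is one possible repair, as a sketch (still assuming a 32-bit int): compare against 2³¹, which a float can represent exactly, instead of INT_MAX, which it cannot.

#include <limits.h>
#include <math.h>

int float2int(float f) {
    if (isnan(f)) {
        return 0;
    }
    if (f < INT_MIN) {  // (float) INT_MIN is exactly -2^31, so no rounding issue
        return INT_MIN;
    }
    if (f >= 2147483648.0f) {  // 2^31, exactly representable; also filters +INFINITY
        return INT_MAX;
    }
    return (int) f;  // f is now in [-2^31, 2^31), safe to convert
}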

Conclusion

The “robust” implementation does not handle 2³¹ properly. This is not just a theoretical oddity. I encountered this issue in critical code in production while working at TrustInSoft. Such bugs easily stay under the radar and evade even the most intensive test campaigns. Only formal methods help in detecting and avoiding such errors.

The worst thing is that, since this is undefined behavior, you might trigger it many times and never know, corrupting data in unknown ways. Things that could have been a runtime error become an entry door for silent bugs and security issues.

C likes to keep you on your toes.


Ham Crypto

Amateur Radio

Amateur radio is about tinkering with the electromagnetic field. This may involve designing analog circuits to shape or unveil the modulation of signals, or sometimes using digital ones to recover information from the noise, or operating multi-hundred-watts transceivers to chat directly with the other side of the world, or lower-power ones to go just as far with Morse code, or even bouncing signals off the Moon. Also, lasers.

Yes, this is the 2005 picture from Wikipedia. No, things have not changed that much since. Physics still works the same.

Amateur radio operators, or “hams” for short, benefit from special privileges to this end. The radio spectrum is a crowded place; yet hams are allowed to transmit at various frequencies spread between 135.7 kHz and 250 GHz (not all bands are visible in the chart). Radio equipment is normally strictly regulated to limit interference and operations in unauthorized bands; yet hams are allowed to modify transmitters as they please, or even build and operate their own from scratch.

The allocation of the radio spectrum in the United States. The allocation is similar in other countries. You might notice there is not much free room. Click to enlarge.

It would be very tempting for a commercial operator, for instance, to make use of these privileges for their own benefit. To avoid this, there is one very important restriction: all communications must be done in the clear.

Code is Law

The exact wording from the international regulations is:

Transmissions between amateur stations of different countries shall not be encoded for the purpose of obscuring their meaning

Article 25, paragraph 2A from the Radio Regulations edited by the ITU (emphasis mine)
(yes, the website is very slow, which is a bit sad for the International Telecommunication Union)

Note: you might think that “of different countries” means that hams can encrypt their communications within a country, and that can actually be the case. However, the article is meant to bind countries to a common standard, not to restrict what happens within each country. In practice, each country will decide whether they want to allow it locally. And, in general, they don’t.

Encoding is the mapping of information to a particular representation. For instance, Morse Code and ASCII are encodings that map the letters of the alphabet to short and long signals, and to numbers, respectively. AM and FM are encodings that map analog signals to other analog signals, that is to say modulation.

Encoding is necessary for any form of communication, so it is of course allowed in general. What is forbidden is to represent information in a way that hides its meaning.

The ASCII encoding maps the uppercase and lowercase Latin letters, as well as the Arabic numerals, some punctuation marks, and control characters

Note that the wording does not say “cryptography”. This is intentional. Cryptography is just a set of techniques which can be used to hide information. It does not have to be this way, and this is not the only way to do so. This is about intent.

If you are doing Morse Code, SSB, FT8, or experimenting with your own super-advanced error-correction encoding that is meant to be usable by anyone, you would be in the spirit of the law. On the opposite, if you are hiding information in the low bits of a digital image, or using a custom protocol only you and your buddies know about, or using a proprietary protocol whose specification is not public, you are going against the spirit of the law.

Steganography is the idea of hiding information in plain sight

Note that the difference between “experimenting with a new protocol” and “designing a protocol restricted to just a few people” is slim. What matters is intent. A way to show good faith would be to include a link (using plain voice or Morse code) to an open source repository that anyone can easily install to decode the rest of the communication. To be clear, that’s not an original take.

What about cryptography? It should obviously be banned, right? The entire point of this discipline is to hide information.

My plant in the audience to ask a convenient question at a convenient time for my transition

Mostly, yes. But we are here to tinker, so let us see what this “mostly” entails.

Confidentiality and Authentication

The two main use cases of cryptography are confidentiality and authentication. Mostly everything else derives from these.

They are mostly used together. For instance, the most widespread implementation of cryptography is indubitably TLS, the protocol used to secure all communications of the Web. This is the S in HTTPS. This is also what enabled online shopping, and grew the Web from a niche protocol used to visit personal homepages into the societal backbone it is today.

The green padlock is telling you that the communications with the website are made secure using cryptography. It does not tell you that the website itself is trustworthy, however.

Nowadays, when you connect to a website, your Web browser will ensure two things.

  1. Communications are encrypted, so that no one snooping around can read them. That’s the “confidentiality” part.
  2. You are talking to the right website. Otherwise, it would be pretty easy to intercept all your communications and read them. This is how company proxies can monitor what you are doing. That’s the “authentication” part.

Note that authentication is necessary if you want confidentiality. Without at least some basic form of authentication (e.g. sharing a private key), encryption will only give confidentiality against passive attack models.

However, authentication on its own could definitely make sense. And the purpose of authentication is not to obscure the meaning. In particular, a cryptographic signature is a piece of data that anyone can decode and interpret. The information it contains is simply the fact that the associated message was indeed sent by its author, not by someone else.

Note: the usual rebuttal is this would allow an attacker to send the same message again. However, this is a well-known topic and easily solvable.

To be clear, I am not the first person to think about doing this. Nor the second.

But, why?

Why not? The purpose of amateur radio is to experiment with communication technology. Knowledge can be an end in itself. But still, having some ideas of what practical applications could look like can be motivating, so I list some here.

The Glider, the smallest moving pattern in Conway’s Game of Life

Remote Control. Since 2004, Radio Regulations include an exception to article 25.2A for the control of satellites. Indeed, going to orbit is hard¹, so you want to be able to service satellites remotely. But if anyone can remotely turn off your satellite, or make it dive into the atmosphere, you’ve got a problem. So you need some way to authenticate the commands. But you do not actually need encryption. This means that remote control over amateur radio is possible for things other than satellites. Remote base stations, or repeaters, for instance.

¹ In fact, low Earth orbit is halfway to anywhere in the Solar System.

Yes, people are actually building satellites, and actually sending them to space

Authenticated QSOs. Such schemes could be used to authenticate contacts, or QSOs, to prevent people from spoofing call signs. This could serve a similar purpose as PGP or GPG for email.

Contesting. Possibly the most popular activity among hams is contesting. That is, contacting as many other operators in the world in a given time frame. Nowadays, scoring is done by having all the operators send the log of their contacts to the contest’s website. If K1AA says they have contacted F6ZZ at 28.500 MHz around 14:00 UTC, and F6ZZ says they have contacted K1AA at the same frequency at around the same time, we can probably assume that this actually happened. Unless K1AA and F6ZZ are colluding, of course, but this is hard to prevent. But if, for some reason, F6ZZ is unable, or forgets, to upload their logs, then the contest will assume that K1AA made a mistake, and subtract points. With authenticated QSOs, K1AA could provide strong evidence that the contacts happened as described. Contests already routinely require the exchange of serial numbers, so it might not be such a stretch to implement. Instead of an incrementing number, each operator might have a computer generate a short code for each contact.

Wow. Much radio. So frequency. Many QSOs.

Online services. Now that digital modes and packet communications are finally being used in amateur radio, it is possible to use protocols such as HTTP. But, without authentication, the use cases are as limited as the Web in the early 1990s. With authentication, we could envision message boards and editable wikis with secure user accounts.

These are just the few examples I could think of, but there might be other relevant applications. Of course, there are wrinkles to be ironed out, but I do not pretend to have it all figured out.

Null Cipher

How do we use that in practice? We could roll our own crypto, or not. On the one hand, it is deceptively easy to make mistakes when implementing cryptography by oneself. On the other hand, the purpose of amateur radio is to experiment. So we might play around with custom cryptographic protocols over the air, but maybe we should use proven ones for actual persistent use. Let’s see what we could use for this.

TLS. The first obvious candidate is the omnipresent TLS. It is mostly used to authenticate servers, but you can also use it to authenticate clients. This has already been discussed in the past in the context of amateur radio to authenticate clients for remote base stations. Using TLS with the “NULL” cipher is a common topic, and it is not very hard to tell OpenSSL to do so (-cipher NULL-SHA256:@SECLEVEL=0). Modifying some web server to serve content with the NULL cipher is more involved, but probably not that hard.
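For reference, asking OpenSSL for the NULL cipher from the command line looks roughly like this (a sketch: example.com is a placeholder, and an ordinary HTTPS server will reject the handshake):

$ openssl s_client -connect example.com:443 -tls1_2 -cipher NULL-SHA256:@SECLEVEL=0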

However, both Mozilla Firefox and Google Chrome have removed the NULL cipher from the list of supported ciphers. In other words, they will refuse to connect without encryption. Fixing this would involve modifying their source code (easy), navigating their complicated build process (annoying), distributing them to the users (hard), and maintaining the fork (who wants to do that anyway?). Unfortunately, this is probably necessary, if we also want to prevent users from inadvertently using encryption by connecting to HTTPS websites. I’ll probably look into it at some point, but this is a significant project.

Me writing some random text through an SSH connection

SSH. Another candidate is SSH, which is already widely used for remote control of virtually everything nowadays (yes, I know, everything is terrible). It authenticates both the server and the client. The source code and the build process are also fairly simple, so it is easy to make a server and a client that can talk to each other without encryption. The nice thing with SSH is that you can then redirect ports through it, use it as a SOCKS proxy, or even as a VPN.
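For instance, with standard OpenSSH flags (illustrative commands; the encryption-free part would of course require the modified build described above):

$ ssh -L 8080:localhost:80 remote  # forward local port 8080 to port 80 on the remote
$ ssh -D 1080 remote               # act as a SOCKS proxy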

Wireshark intercepting the raw communications of the SSH sessions over the network. As you can see, the text is visible in the clear. The [0m, [1m, [27m and so on are ANSI escape sequences. The rest is the SSH handshake and the cryptographic signature, which are easier to read when displayed as such.

Conclusion

To sum up, I like the idea of using cryptographic tools to provide proper authentication for remote control and to enable new use cases over amateur radio. Of course, there is significant work to be done before we get there, but it’s nice to know that we do not need to start from scratch.


Flexible Array Members: Typical C Shenanigans

Speed. I am Speed.

The one thing that C cares about is speed. Seriously, you might not get the right results, but at least, you will get them fast.

Let’s look at two common ways to represent a vector in C.

struct vector {
    unsigned long size;
    int data[42];
};

The first one is to just use the built-in array type. This approach is simple, but it implies choosing the size of the vector at compile time. The program cannot decide the size depending on user input or some other runtime parameter.

Note: Variable-Length Arrays (VLA) do allow declaring arrays whose length is decided at run-time. However, this is limited to local variables and is an optional feature since C11.
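For illustration, this is what a VLA looks like (a quick sketch):

void process(unsigned long n) {
    int data[n];  // length decided at run time, but only for a local variable
    for (unsigned long i = 0; i < n; i++) {
        data[i] = 0;
    }
}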

struct vector {
    unsigned long size;
    int* data;
};

The alternative is to not use an array at all. Instead, we use a pointer to a memory buffer that will contain the data. This memory location can be created with malloc, for instance.
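As a sketch, creating such a vector could look like the function below (this variant of vector_new is my own illustration, with error handling omitted):

#include <stdlib.h>

struct vector *vector_new(unsigned long size) {
    struct vector *ret = malloc(sizeof(struct vector));
    ret->data = malloc(size * sizeof(int));  // a second, separate allocation
    ret->size = size;
    return ret;
}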

Pointers can be used like arrays almost transparently, which brings convenience (the “almost” means that most C programmers are paranoid about using sizeof on expressions). The example below shows how accessing an element at some index uses the same syntax in both cases.

But although the syntax is the same, the meaning is different. Look at the assembly code generated to implement vector_index. Indexing the built-in array uses a single MOV instruction; indexing through a pointer uses two MOV instructions.

It makes sense. rdi contains the address to the struct and rsi the value of index. In the first case, we shift by 8 bytes to skip the field size, then we move by index * 4 bytes to select the right array element, then we read the value at this memory location (indicated by the bracket in the Intel assembly syntax). In the second case, we need to first recover the value of the field data, at an offset of 8 bytes from the beginning of the struct, then read at the right offset from the address it contains.
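The original post shows this in an interactive compiler explorer; here is a sketch of the two accessors (the names are mine, and the exact assembly depends on the compiler and flags):

struct vector_array {
    unsigned long size;
    int data[42];
};

struct vector_pointer {
    unsigned long size;
    int *data;
};

// one memory access, e.g.:  mov eax, [rdi + 8 + rsi*4]
int index_array(struct vector_array *v, unsigned long i) {
    return v->data[i];
}

// two memory accesses, e.g.: mov rax, [rdi + 8]
//                            mov eax, [rax + rsi*4]
int index_pointer(struct vector_pointer *v, unsigned long i) {
    return v->data[i];
}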

In other words, using a pointer instead of a built-in array is twice as slow! Okay, not really. Any modern CPU is going to recompile, pipeline, prefetch and out-of-order execute this such that it really won’t matter. But it used to! And it might still matter on microcontrollers.

In fact! I did some benchmarks, and, although there was no difference on my laptop (CPU model i5-4210H), I did see a tremendously large difference on my Arduino Leonardo (microcontroller model ATmega32u4). Indeed, the code with the additional indirection through a pointer was 10 % slower.

Yeah, I did expect it to be worse than that. But still, 10 % was not something to scoff at in the previous millennium. Also, CPUs at the time might have seen a larger difference than my ATmega32U4 released not that long ago.

In summary, embedding a built-in C array is fast, but inconvenient. Using dynamic allocation with malloc is more flexible, but not as fast. How can we make our compiler generate faster code while keeping the convenience?

Well, we can lie to it.

Lies. So Many Lies.

Let’s look at how we will typically create a vector with the first approach:

struct vector {
    unsigned long size;
    int data[42];
};
struct vector *vector_new(unsigned long size) {
    assert(size == 42);
    struct vector *ret = malloc(sizeof(struct vector));
    ret->size = size;
    return ret;
}

Notice how malloc is just another C function, that can take any value as an argument at runtime? Notice how we are forcing ourselves to give it a compile-time value (the size of our struct)? What if we actually allocated the size we wanted?

For instance, if we allocated sizeof(struct vector) + 100 bytes, that would mean we would still have the 42 elements from the definition of struct vector, but we would have 100 additional bytes, enough to store 25 more 4-byte integers!

And so, a new idiom was born:

struct vector {
    unsigned long size;
    int data[1];
};
struct vector *vector_new(unsigned long size) {
    struct vector *ret = malloc(
        sizeof(struct vector)
        + sizeof(int) * (size - 1)
    );
    ret->size = size;
    return ret;
}

Since we are going to lie to the compiler about the size of the actual array anyway, we make the base size as small as possible. Since the C standard does not allow 0-length arrays (“An array type describes a contiguously allocated nonempty set of objects”), we just use 1. Then, when we create an instance of our vector, we are going to make sure to actually have enough memory for size elements. Since there is already 1 built-in, we need to allocate for a size - 1 elements in addition to the base struct.

C developers embraced this hack, wrote many fast programs, and lived happily ever after.

Well, kind of. I mean, if you write this, your program is probably going to be fast. Just as fast as a fixed-size array. But maybe it won’t. And maybe it will allow random people to take control of your computer. Or maybe it will blow up the spaceship. Who knows?

You might know where I am going with this: this is undefined behavior.

Yes, we did allocate enough memory to do what we want. But this is not the point. We lied to the compiler. And compilers are very delicate, gullible beings. They will believe you. They will believe that your array is actually of size 1. And they will do unmentionable things to your code, because this assumption is false.

Save one Character, Save the Entire World

When they observe developers writing C, compiler authors and language lawyers alike get the urge to start again from scratch. This is actually the reason why there are so many flood myths through the ages.

This time, however, they recognized that the end justified the means. Compiler authors, too, want the programs they write to be fast (except C++ compiler authors).

And so, GCC introduced a way to do this:

struct vector {
    unsigned long size;
    int data[0];
};
struct vector *vector_new(unsigned long size) {
    struct vector *ret = malloc(
        sizeof(struct vector)
        + sizeof(int) * size
    );
    ret->size = size;
    return ret;
}

Notice the difference? We just replaced 1 by 0 for the size of the array.

But you said the C standard did not allow this!

Someone who pays attention to details

Indeed! But GNU C does! GNU C is the dialect that GCC understands, and it allows for 0-length arrays as an extension.

Of course, if you were using another compiler, you were out of luck… until!

Finally, in the C99 revision of the C standard, a new syntax was introduced!

struct vector {
    unsigned long size;
    int data[];
};
struct vector *vector_new(unsigned long size) {
    struct vector *ret = malloc(
        sizeof(struct vector)
        + sizeof(int) * size
    );
    ret->size = size;
    return ret;
}

Can you see the difference? We just removed the 0.

That’s all?

Someone who wasted too much time reading this; but worry not, it took even longer to write!

Yup. This syntax tells the compiler to not assume anything about the size of the array. You can still use it as usual. But ensuring that you allocate enough memory, and that you do not go out of bounds, is up to you.
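As a usage sketch (fill is an illustrative name of mine), with the definition of vector_new above:

void fill(void) {
    struct vector *v = vector_new(100);
    // vector_new allocated room for 100 elements, so indices 0 to 99
    // are valid; nothing stops us from going further, except crashes
    for (unsigned long i = 0; i < v->size; i++) {
        v->data[i] = (int) i;
    }
    free(v);  // a single free releases both the struct and its data
}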

Less is More

We went from 1, to 0, to nothing. Since there is nothing left to take away, we attained perfection. It’s a general truism with C: the less C, the better.


Why Undefined Behavior Matters

So Many Bugs

So, you’ve been programming for one month, or for ten years, and you have gotten familiar with Murphy’s and Sturgeon’s laws. And you are wondering if there is some way to catch all the bugs, not just spray-and-pray with regression/unit/integration/system/acceptance testing.

Or you are a curious onlooker, wondering why all these programmers cannot make a piece of software that works reliably, what with apps doing unexpected things and with all the security breaches you hear of.

First, what are we talking about? Essentially, a bug is anytime a system does not do what it is expected to:

The Mars Climate Orbiter disintegrated into Mars because a module was using non-metric units and forgot to convert. Seriously.

As you can see, there is quite a spectrum. I like to sort bugs in three categories:

Today, I would like to focus on the last one. It looks like it should be the most tractable, since the scope is well constrained. But it is still a non-trivial problem.

What we want is two-fold:

  1. Specify what we expect the program to do
  2. Verify that the program meets our specification

Even the first step is tricky. For the sake of simplicity, we are going to reduce the scope of what we want to achieve, and focus on a well-delimited set of things our program should do: have a well-defined behavior.

What does this mean? Let us say you wrote a program that does some kind of scientific computation, except you used meters instead of kilometers. Then, the result is going to be wrong, of course, but in a predictable way: the result will be off by a factor of 1,000. If you run the program again the same way, you should get the same result.

On the contrary, if your program contains undefined behavior, it could do unexpected things. It might output the right value; it might output a wrong value; it might crash; it might allow an attacker to take control of your computer. And it might behave differently every time you run it. Worse: it might run perfectly fine during all your tests, but output the wrong value when it is being used to pilot a real plane with real passengers.

How can we know if a program contains undefined behavior? First, we need to talk about programming languages.

Why is C an “Unsafe” Language?

When people talk about “undefined behavior”, they are usually referring to undefined behaviors in the C programming language. There is a reason why this particular language is criticized as being “memory-unsafe”, and why so many security vulnerabilities stem from code written in C. And that reason is that the C language leaves the door open for many bugs.

To understand why, we must first ask ourselves what “the C programming language” refers to. Who decides what it is, and what if someone does things differently?

Historically, C was loosely defined and had many dialects. Every compiler author would do things a bit differently. The people writing a compiler to compile “C” code to 8080 assembly made different choices than the ones targeting PDP-11. The same program could end up doing different things on different processors. Things were messy.

So, the C compiler writers, the C developers, and the processor manufacturers joined together and agreed on a common definition for what “C” really meant. Thus was born the C standard.

The C standard had to try to align things between people doing very different things. And these people cared a lot about performance. Processors at the time were about a hundred times slower than the ones today. And I am only talking about clock frequency here; nowadays, each CPU core can do much more in a cycle than the CPUs at that time, and there are now many cores in a modern CPU. It makes sense that they really wanted to keep compiling C code to efficient assembly, to have programs run at decent speeds.

Now, what happens if the code tries to access memory it is not supposed to? Well, the compiler cannot just add runtime checks everywhere, since that would have a negative impact on performance. And it cannot always know in advance that this will happen, since that would require solving arbitrarily complex mathematical problems.

Because of this, the C standard did not require either. Instead, if the code does such an invalid operation, it is said to have “undefined behavior”.

So what? If code crashes, I’ll notice the problem. If it outputs the wrong result, my tests will notice it.

Simplicius, my imaginary straw-man C developer

Wait a minute, renegade figment of my imagination! Undefined behavior means that the compiler is allowed to produce code that does anything. In particular, it is allowed to assume that undefined behavior never occurs in the program, and deduce things from that. And you can prove many things if you assume something which is false.

Let’s take an example. The code below uses a toy authorization function that just checks whether the user id (uid) belongs to a list of authorized users. As you might have already guessed from the topic of this article, there is an undefined behavior in this code. First question: can you see it?

#include <stdio.h>
int authorized_uids[] = {0, 1001, 1005, 1012};
int is_authorized(int uid) {
    for (int i = 0; i <= 4; i++) {
        if (uid == authorized_uids[i]) {
            return 1;
        }
    }
    return 0;
}
int main(void) {
    if (is_authorized(4242)) {
        puts("You are now logged in as root.");
    } else {
        puts("You are a lowly common user.");
    }
}

Spoiler: answer below.

The loop is written as for (int i = 0; i <= 4; i++), so i takes the values 0, 1, 2, 3 and 4. Then we use it as an index to read from an array: authorized_uids. However, this array only contains 4 elements, at indices 0, 1, 2 and 3. There is no index 4 in this array. Trying to access index 4 is an out-of-bounds access, which is a type of undefined behavior.

Second question: what will happen when we compile and run this code? You can actually do it, but do try to have a guess first.

The universal answer is: anything can happen. Since there is an undefined behavior, the program could do anything. However, there are two common things compilers will do in this case.

Option 1. It can translate this C code to assembly in a rather naive way. For an x86 processor, it would translate authorized_uids[i] as an assembly instruction reading memory at the address authorized_uids + i.

This might seem a bit obvious, but the compiler does not have to do that. Even when there is no undefined behavior, the compiler is allowed to emit any code it wants, as long as it produces the same result. For instance, the compiler might find a faster series of instructions that achieves the same results. This is what “optimizing” means.

If we then run the compiled program, the CPU will try to read at authorized_uids + i. Most likely, however, this will still be in the valid memory of the program. Indeed, memory is usually allocated generously (segments for the stack, pages for the heap). So, the assembly instruction will not cause a crash; it will return some value. Most likely, the value will not be 4242, so the test will fail, and the user will not be authorized.

This is the middle column in the compiler explorer below. You can see at the top that the source code for is_authorized compiles to various assembly instructions. And, at the bottom, you can see the output of the program: “You are a lowly common user.”.

Option 2. Now, let’s have a look at the rightmost column. The output is “You are now logged in as root.”! This is not what this program was supposed to do at all! And this is not just a fluke. It happens every time you run it!

Then, why is the result different this time? If you look closely, you will notice that the only difference is that we passed the option -O2 to the compiler. It tells the compiler to try harder to optimize the program.

In fact, if you look at the generated assembly, you will see that the compiler optimized the function to something extremely efficient:

is_authorized(int):
        movl    $1, %eax
        ret

This means that the function is_authorized just always returns 1. Wat?

Let’s try to understand the reasoning of the compiler. It sees:

int is_authorized(int uid) {
    for (int i = 0; i <= 4; i++) {
        if (uid == authorized_uids[i]) {
            return 1;
        }
    }
    return 0;
}

This is equivalent to:

int is_authorized(int uid) {
    if (uid == authorized_uids[0]) { return 1; }
    if (uid == authorized_uids[1]) { return 1; }
    if (uid == authorized_uids[2]) { return 1; }
    if (uid == authorized_uids[3]) { return 1; }
    if (uid == authorized_uids[4]) { return 1; }
    return 0;
}

But we can assume that there is no undefined behavior in the code. So uid == authorized_uids[4] must never be evaluated. So the compiler can safely assume that the function always exits at one of the first 4 return statements. In other words, it always returns 1.
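As an aside, the fix itself is straightforward; a standard idiom is to let the compiler compute the element count rather than hard-coding the bound (reusing the authorized_uids array from above):

int is_authorized(int uid) {
    // sizeof(array) / sizeof(element) yields the element count, here 4
    for (unsigned long i = 0; i < sizeof(authorized_uids) / sizeof(authorized_uids[0]); i++) {
        if (uid == authorized_uids[i]) {
            return 1;
        }
    }
    return 0;
}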

But couldn’t the compiler at least warn about this?

Someone rightfully annoyed with the state of the universe

Fighting Undefined Behavior

If you run gcc with -Wall -Wextra -Wpedantic or even clang with -Weverything, you still get no warnings about the undefined behavior!

Why don’t they warn about this? Because this is harder than it looks. The compiler cannot always know where undefined behavior will happen. Let’s have a look at the example below.

// hour ∈ [0, 23]
// minute ∈ [0, 59]
// second ∈ [0, 59]
int time_to_seconds(int hour, int minute, int second) {
    return hour * 3600 + minute * 60 + second;
}

If a compiler had to ensure the absence of undefined behavior when compiling the above code, they would have to assume that hour, minute and second contain any possible value from INT_MIN to INT_MAX. The compiler does not understand comments written for humans, and it does not understand variable or function names, even when they are chosen appropriately.

To make the code safe, the compiler would have to introduce runtime checks in the compiled code to make sure that no overflows occur during the computation. In practice, that would imply 2 comparisons and 2 branches before each of the 2 multiplications and each of the 2 additions. But C was not designed for safety. It was designed for speed. All these comparisons and branches add up. On a simple benchmark, I observed that a safe version would be between 2 and 3 times as slow as the unsafe one.
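To give an idea, here is what a checked version could look like, as a sketch using the overflow-checking builtins of GCC and Clang (compiler extensions, not standard C, and not necessarily the version benchmarked above):

// returns 1 on success (storing the result in *out), 0 on overflow
int time_to_seconds_checked(int hour, int minute, int second, int *out) {
    int h, m, t;
    if (__builtin_mul_overflow(hour, 3600, &h)) return 0;
    if (__builtin_mul_overflow(minute, 60, &m)) return 0;
    if (__builtin_add_overflow(h, m, &t)) return 0;
    if (__builtin_add_overflow(t, second, out)) return 0;
    return 1;
}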

For the same reason, if the compiler had to warn for every potential undefined behavior, developers would get flooded with false positives. In the example above, the compiler would have to emit a warning for each arithmetic operation. With many false alarms, the developers would have to get in the habit of just ignoring all the warnings.

But I do not care about performance that much! I want runtime checks everywhere. Or at least, I want to see the warnings! I set -Wall -Wextra -Wpedantic -Wconversion -Weverything. I really do want to know when there are potential undefined behaviors in my code.

Let’s address each of these.

Runtime checks. The clang compiler does have the ability to add runtime checks! ASan (Address Sanitizer) looks closely at the way memory is allocated, deallocated and used. UBSan (Undefined Behavior Sanitizer) looks at a few additional things, such as invalid arithmetic operations. Both of them work by adding instructions to the resulting binary executable, which will check everything. Since the program has more things to do, it runs slower. Valgrind/memcheck works directly on the compiled program, and inserts its own checks when running it. It also makes running the program much slower. These tools are able to catch a few types of issues every time they actually happen in the program. However, it is often non-trivial to understand why it happened. Using the function from the previous example, compiling with clang -fsanitize=undefined time-to-seconds.c and running the program would result in something like:

time-to-seconds.c:5:17: runtime error: signed integer overflow: 2147483647 * 3600 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior time-to-seconds.c:5:17 in 
time-to-seconds.c:5:33: runtime error: signed integer overflow: 2147483647 * 60 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior time-to-seconds.c:5:33 in 

Syntactic checks. Although they do not understand what your program does, compilers are able to notice suspicious patterns in the source code. For instance, they can see if (x = 42) and emit a warning, since this is a common mistake where the developer should have written if (x == 42). Modern compilers now include an extensive set of such patterns to warn the user about potential mistakes. Some tools are specialized in finding such patterns in the code, such as clang-tidy or cppcheck. Since they do not change the way the code is compiled, they incur no slow down when actually running the program. However, they can only find common mistakes, or they would risk emitting too many false warnings. In the example, no linter emits any warnings, since it would be likely to be a false positive. However, they do warn of more obvious cases:

warning: integer overflow in expression of type ‘int’ results in ‘1’ [-Woverflow]
   39 |     INT_MAX * INT_MAX;
      |             ^

If we want to go further, we need to look at formal methods. The idea is to use mathematical and systematical approaches to find all the undefined behaviors in a provable way. This means that such methods even allow you to prove the absence of undefined behavior (when there is none, of course).

I want to talk about two techniques from formal methods in particular. I am familiar with them because the company where I work, TrustInSoft, sells an industrial tool to apply them to real code used in critical applications.

Abstract interpretation. The idea for this one is intuitive. Essentially, we are going to interpret the source code in an abstract machine. This abstract machine will interpret each statement from the code exactly as specified by the standard. If undefined behavior can happen, a warning can be emitted. In effect, this can be thought of as an implementation of the abstract machine described by the C standard.

The interesting thing is that you can go even further by having the abstract machine work with sets of values instead of singular values. You can, for instance, analyze a function for all possible inputs (the values for each parameter being represented as a set of actual values), instead of just one at a time. As an example, let us consider the code below.

int abs(int x) {
    if (x < 0) {
        return -x;
    } else {
        return x;
    }
}

We can analyze this using the Eva plugin of Frama-C, or the Value plugin of TrustInSoft Analyzer.

$ frama-c -eva abs.c -main abs -verbose=0
[eva:alarm] abs.c:3: Warning: signed overflow. assert -x ≤ 2147483647;

This command interprets the source code in an abstract machine, starting from the abs function. By default, it assumes the argument can be any possible value (any integer from INT_MIN to INT_MAX). When checking the program for all these possible values, it notices a problem in one specific case: when x is INT_MIN, computing -x is undefined behavior, because the result would be bigger than INT_MAX. Indeed, usually INT_MIN is -2³¹ and INT_MAX is 2³¹ - 1, because of the way integers are represented. If you are interested in knowing how these programs manage to work with huge sets of values, look up interval arithmetic.

Weakest Precondition. The other technique is much more general and can actually be used to verify arbitrary facts on programs, not just to find undefined behaviors. For instance, with this, you can prove that your implementation of a sorting algorithm actually does what it is supposed to do. However, it also takes much longer to learn to use it, and to analyze a real program. If you are actually interested in it, I can only recommend the excellent tutorial written by Allan Blanchard.

What About Other Languages?

We explained why C is an unsafe language. But are the other languages unsafe?

C++. This is essentially a superset of C. And it adds its own undefined behaviors. It does introduce features to avoid doing mistakes common in C code, but the complexity it brings also hands another footgun to the user.

Java, Python, Ruby. These languages work by being compiled to byte code, which is something similar to assembly, except that it is not run directly by the processor. Instead, it is run by the corresponding virtual machine. In other words, each language targets a single architecture. While C needs to encompass CPUs whose memory models vary significantly, these languages can specify strictly what should happen. They do not care as much about performance, either. The result is that, if you try to access out of the bounds of an array, you do not get undefined behavior. You get a nice, clean exception telling you what the problem is, and a call stack telling you where it happened.

JavaScript. JavaScript is the worst client-side language. It is also the only one. Surprisingly, however, it is safe in the sense that there are no undefined behaviors (there were some until ECMAScript 2016).

Rust. Now, Rust is an interesting case. Like C, it is designed to be fast. Unlike C, it is designed to be safe. It achieves this thanks to the concept of unsafety. By default, checks are added wherever they are needed, and dangerous operations are simply prohibited. You can turn off these checks to do lower-level operations and improve performance where needed, but you have to do so explicitly. The point is that unsafety is opt-in, so developers know to be careful with undefined behaviors.

As you can see, we can do quite a bit about undefined behaviors. There are very fast, simple-to-use tools that might find common mistakes, and there are very advanced, exhaustive tools that require more involvement. And there are languages trying to make safety the default.

There is still a long way to go until software becomes as reliable as it could be, but we’re getting there, one byte at a time.


Code Golfing in Python (1)

Code Golf

I spend way too much time playing Clash of Code on CodinGame as you can see in the “featured image” above. I still haven’t caught up with Haskell, and I struggle to stay above Allis. Both are really strong, and seem to know Ruby very well!

My favorite mode is “Shortest”. The goal is to write a program against a given specification within 15 minutes. But, unlike the “Fastest” and “Reverse” modes, the main criterion is the number of characters in the submitted code.

So, if the problem is to write a program that asks for an integer n and then prints n! (the factorial of n), you are not going to submit the following snippet (69 characters):

from math import factorial
n = int(input())
f = factorial(n)
print(f)

At the very least, you are going to remove the extraneous spaces and avoid unnecessary declarations (47 characters):

import math
print(math.factorial(int(input())))

Or you might notice that importing a module and calling a method with a long name costs you many characters and that you are better off implementing it yourself (45 characters):

n=int(input())
p=1
while n:p*=n;n-=1
print(p)

This is less readable and less efficient, but it is terser. And that’s all that counts!

Note: to be more specific, if two people get the same character count for their submissions, the one who submitted it first gets the advantage; however, this seldom happens

Uglifying Python

I am now going to list some of the techniques I commonly use when code golfing in Python.

1. Remove Extraneous Spaces

There is no need to put spaces after or before parentheses, brackets, quotes. You do not need spacing after a number either. Assume you are given an input of the form:

n
a b c d e…

And your program should print True if n+1 is among a, b, c, d, e… and False otherwise. If you follow the PEP 8, you could write:

print(int(input()) + 1 in [int(x) for x in input().split()])

Instead, we can do:

print(int(input())+1in[int(x)for x in input().split()])

Note that no space is needed between 1 and the keyword in.

2. Reduce Indentation

Indentation is semantic in Python. But there is no point in putting more than one space per level of indentation (88 characters):

t=0
for i in range(w):
    for j in range(h):
        t+=i+j
        print(i+j)
print(t)

This could just as well be written as (73 characters):

t=0
for i in range(w):
 for j in range(h):
  t+=i+j
  print(i+j)
print(t)

Better yet, since there are no further blocks (if, for, while) in the inner loop, we can put it in a single line (68 characters):

t=0
for i in range(w):
 for j in range(h):t+=i+j;print(i+j)
print(t)

Note: combining several lines of code with a semi-colon is normally useless, since we save the newline character but pay the price of the semi-colon; however, inside an indented block, we also save the characters of the indentation, which makes it worth it to put several statements on a single line.

But it’s even better if we can get rid of the indent altogether (63 characters):

print(sum(print(i+j)or i+j for i in range(w)for j in range(h)))

Note: here, I combined the print call with the elements of the sum by exploiting the fact that print always returns None, which is falsy (that is, it evaluates as False when evaluated as a boolean); this means that the expression will always evaluate to i+j after printing the value.

3. Exploit Iterables

Earlier, I wrote the following to read a list of integers from the standard input (32 characters):

[int(x)for x in input().split()]

This is a list comprehension, which has the advantage of being easy to read quickly once you understand the syntax. But, before PEP 202, it was already possible to do it without a loop block, using the functional-style map() and filter() built-in methods. Both take as argument a method and an iterable, and apply the method to each element of the iterable. map() simply returns the result of this evaluation, while filter() returns the original element if the result is truthy, otherwise, it skips it. I have rarely found a use for filter() while code golfing, but map() lets me write the following (24 characters):

map(int,input().split())

This expression evaluates to a generator of integers corresponding to the numbers in the line. So, if all you want is their sum, you could just write (29 characters):

sum(map(int,input().split()))

Note that, if you need to iterate over the integers several times, or if you need to access them by index, then you need to exhaust the iterable into a list. It is still shorter than a list comprehension however (30 characters):

list(map(int,input().split()))

But many things are iterable in Python. If you need to count the number of unique characters of a string s, you can just write (11 characters):

len(set(s))

In fact, you can exploit this when you want to iterate a fixed number of times and do not care about the loop index. For instance, instead of writing (40 characters):

for _ in range(99):print(int(input())%2)

You could just as well write (36 characters):

for i in[0]*99:print(int(input())%2)

Note: there used to be a reduce() method in the built-ins of Python, that would let you aggregate an iterable in the fashion of sum(), max(), min(), any(), all(). With it, you could easily have computed the product of the elements of an iterable. However, it was moved into the functools module in Python 3 because comprehensions are deemed more Pythonic.

4. Use Shorthands

Suppose you need to read a single string, then a list of strings (their count on a line, then one string per line), and count how many of the strings contain the initial string. You could do (58 characters):

s=input()
print(sum(s in input()for _ in[0]*int(input())))

Note: the input() of [0]*int(input()) will be evaluated before that of s in input() since it is necessary to start the (generator) comprehension in the first place.

Note: s in input() is a boolean expression but, in Python, True is equal to 1 and False is equal to 0; thus, we can easily count the number of time a predicate is met.

Now, you might notice that we have written input thrice. Since Python methods and classes are themselves objects, we can assign them to a variable and write (54 characters):

I=input
s=I()
print(sum(s in I()for _ in[0]*int(I())))

We save 4 characters for each call to input and assigning it to I costs us 8 characters, so we save 4 characters here. We would break even if there were only two calls to input. But it allows us another neat (horrible) trick (50 characters):

I=input
s=I()
I(sum(s in I()for _ in[0]*int(I())))

Seriously. Here, we are exploiting the fact that input takes an optional positional argument, which will be used as the prompt for the user. For instance, if you are asking the name of the user, you would write: name=input('Name: '), which is essentially equivalent to print('Name: ', end='');name = input(). Since there is nothing more to read, the program will crash, but not before having written the result on the standard output.

Note: this trick only works as the final output of a program (you cannot use it several times in a loop for instance), and you need to give it a single argument (you might need to format the output first).

5. Use f-strings

I referred to the idea of passing several arguments to print at the end of the previous section. This is something very handy, even outside code golf, when you’d like to print several values separated by spaces:

x = 4
y = 6
name = 'Hector'
print(x, y, name)

This will result in printing 4 6 Hector to the standard output. You can easily apply it to an iterable:

l = [1, 2, 3, 4]
print(*l)
# Output:
# 1 2 3 4

If you prefer to separate the elements with a comma, you can use an optional keyword parameter of print:

l = [1, 2, 3, 4]
print(*l, sep=',')
# Output:
# 1,2,3,4

However, if you have a generator comprehension that outputs strings, you can save writing extraneous parentheses with str.join():

l = [1, 2, 3, 4]
print(*(chr(96+i)for i in l),sep=',')
# versus
print(','.join(chr(96+i)for i in l))
# Output:
# a,b,c,d

Where are the f-strings you promised in all this?!

A reader annoyed by my prolixity

Well, at some point, you will want to output something in a particular manner, or you will need to format a string to pass it as a single argument to input() (60 characters):

print('Values are x= '+str(x)+' and y='+str(y)+' for '+name)

Or, using %-formatting (51 characters):

print('Values are x=%d and y=%d for %s'%(x,y,name))

This is advantageously rewritten with an f-string as (47 characters):

print(f'Values are x={x} and y={y} for {name}')

Note: since Python 3.8 (CodinGame uses 3.7.4 as of writing this article), you can write print(f'Values are {x=} and {y=} for {name}'), saving two characters.

But f-strings are much more than string interpolation. They are %-formatting on steroids. So, you can still pad left or right, with spaces or zeroes, toggle the positive sign, manage decimal precision, and convert to hexadecimal (lower or upper case). But you can also convert to binary, center strings and call repr(). So instead of the well-known way to convert to binary (10 characters):

bin(x)[2:]

You can now write (8 characters):

f'{x:b}'