Rust Strings for C Programmers

This article quickly explains the Rust types [T; N], &[T; N], &[T], Vec<T> and &Vec<T> by comparing them with their C equivalents, and then shows what str, &str, String, OsString and CString add on top.

Arrays and Slices


Rust
C
[T; N] (array)
Example: [i32; 100]
Allocated on the stack
T[N]
Example: int[100]
Allocated on the stack
&[T; N] (array reference)
Example: &[i32; 100]
N is tracked at compile time. Bounds checks are done at runtime (opt out using get_unchecked¹).
const T[N] in function parameters or const T*
Example: const int[100] or const int*
N is partially tracked at compile time²; accesses are not bounds-checked at runtime.
&mut [T; N] (exclusive array reference)
Example: &mut [i32; 100]
Same as above, and allows writing to the array.
T[N] in function parameters or T*
Example: int[100] or int*
Same as above, and allows writing to the array.
Box<[T; N]> (boxed array)
Example: Box<[i32; 100]>
Same as &mut [T; N]³, but the underlying array is allocated on the heap. The memory is released when the Box is dropped.
T* from malloc()
Example: int*
Same as T*, but the underlying array is allocated on the heap. The memory must be relinquished manually by calling free().
&[T] (slice reference)
Example: &[i32]
The size is not fixed at compile time. It is tracked in an additional variable stored alongside the base pointer. The two make a “fat pointer”. As with &[T; N], you can opt out of runtime bounds checks using get_unchecked¹.
struct { const T *base; size_t size }
Example: struct { const int *base; size_t size }
In practice, many C functions take the base pointer and the size as two separate parameters. For instance, you will see: memset(base, 0, size). The compiler won’t perform any bounds checks automatically.
&mut [T] (exclusive slice reference)
Example: &mut [i32]
Same as &[T], and allows writing to the slice.
struct { T *base; size_t size }
Example: struct{ int *base; size_t size }
Same as above, and allows writing to the array.
Vec<T> (vector)
Example: Vec<i32>
Lets you push (append) an arbitrary number of elements. &Vec<T> can automatically be coerced to &[T] because Vec<T> implements Deref, and &mut Vec<T> can automatically be coerced to &mut [T] because Vec<T> implements DerefMut⁴.
struct { T *data; size_t size; size_t avail }
Example: struct { int *data; size_t size; size_t avail }
Implementation of a dynamic array using realloc(). You can pass the data and size fields to functions that expect T* and size_t parameters.
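The difference between a thin array reference and a fat slice reference can be observed directly. The following is a minimal sketch, not part of the original comparison; the helper name sum is made up for illustration.

```rust
use std::mem::size_of;

/// Hypothetical helper: sums a slice. The length travels with the pointer,
/// much like the C struct { const int *base; size_t size }.
fn sum(values: &[i32]) -> i32 {
    values.iter().sum()
}

fn main() {
    let array: [i32; 100] = [1; 100]; // lives on the stack
    let boxed: Box<[i32; 100]> = Box::new([2; 100]); // lives on the heap, freed on drop
    let vector: Vec<i32> = vec![3; 100]; // growable, lives on the heap

    // A reference to a fixed-size array is a thin pointer: N is known statically.
    assert_eq!(size_of::<&[i32; 100]>(), size_of::<usize>());
    // A slice reference is a fat pointer: base pointer plus length.
    assert_eq!(size_of::<&[i32]>(), 2 * size_of::<usize>());

    // All three can be viewed as a &[i32].
    assert_eq!(sum(&array), 100);
    assert_eq!(sum(&boxed[..]), 200);
    assert_eq!(sum(&vector), 300);
}
```

Taking &[i32] as the parameter type is the usual choice because arrays, boxed arrays and vectors can all be passed to it.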

¹ Note that this is slice::get_unchecked, which is an unsafe method and must be called inside an unsafe block. Rust lets you coerce a &[T; N] array reference into a &[T] slice (a bit like a const int[N] can decay into a const int*). When you write v.get_unchecked(0), it implicitly means (&v).get_unchecked(0). The compiler then figures out that it can use slice::get_unchecked even though &v is a reference to an array.
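As a rough illustration of this footnote, the sketch below calls both the checked get and the unchecked get_unchecked through an array reference. The function names read_checked and read_unchecked are invented for this example.

```rust
/// Hypothetical helper: checked access, returns None when i is out of range.
fn read_checked(v: &[i32; 100], i: usize) -> Option<i32> {
    v.get(i).copied()
}

/// Hypothetical helper: unchecked access.
///
/// # Safety
/// The caller must guarantee that i < 100; otherwise the behavior is undefined.
unsafe fn read_unchecked(v: &[i32; 100], i: usize) -> i32 {
    // get_unchecked is defined on slices; method resolution turns the
    // array reference into a slice for us.
    unsafe { *v.get_unchecked(i) }
}

fn main() {
    let v = [7; 100];
    assert_eq!(read_checked(&v, 3), Some(7));
    assert_eq!(read_checked(&v, 100), None);
    // The unsafe block is where we promise that the index is in range.
    assert_eq!(unsafe { read_unchecked(&v, 3) }, 7);
}
```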

² From the point of view of the standard, T[N] in a function parameter is just syntactic sugar for T*, but compilers tend to emit warnings when they see an incorrect function call, such as in:

void f(int p[100]);
void g(void) {
    int v[10];
    f(v);
}

³ Technically, the Box<[T; N]> binding itself must be declared mut for you to modify the underlying array:

fn f() {
    let mut v = Box::new([0; 100]);
    v[2] = 1;
}

⁴ This means you can write:

fn f(v: &[i32]) { }
fn g(v: &mut [i32]) { }
fn main() {
    let mut v = vec![1, 2, 3];
    f(&v);
    g(&mut v);
}
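The C struct shown for Vec<T> in the table can also be observed from Rust. This is a sketch under the assumption that the avail field means the allocated capacity; the helper name grow is made up for illustration.

```rust
/// Hypothetical helper: appends n zeros and returns (size, avail),
/// using the field names of the C struct in the table.
fn grow(v: &mut Vec<i32>, n: usize) -> (usize, usize) {
    v.extend(std::iter::repeat(0).take(n));
    (v.len(), v.capacity())
}

fn main() {
    let mut v = Vec::with_capacity(4);
    v.push(1);

    // as_ptr() plays the role of the data field.
    let data: *const i32 = v.as_ptr();
    assert!(!data.is_null());
    assert_eq!(v.len(), 1);
    assert!(v.capacity() >= 4);

    // Growing past the capacity reallocates, much like realloc() in C.
    let (size, avail) = grow(&mut v, 10);
    assert_eq!(size, 11);
    assert!(avail >= size);
}
```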

With this, we can map the following patterns:

Rust         C
&p[..]       base, size
&p[2..]      base + 2, size - 2
&p[..40]     base, 40
&p[2..40]    base + 2, 38

The main difference is that the Rust version performs bounds checks, while the C version does not. You can again use get_unchecked() to opt out of these checks, as it works with ranges just as well as with indices.
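The range patterns can be tried out as follows. This is only a sketch: the helper checked_window is a made-up name, and it uses get() to show a non-panicking alternative to plain indexing.

```rust
/// Hypothetical helper: returns p[start..end], or None when the range
/// does not fit in the slice (instead of panicking).
fn checked_window(p: &[i32], start: usize, end: usize) -> Option<&[i32]> {
    p.get(start..end)
}

fn main() {
    let p: Vec<i32> = (0..50).collect();

    // The sizes match the C column: size, size - 2, 40, 38.
    assert_eq!(p[..].len(), 50);
    assert_eq!(p[2..].len(), 48);
    assert_eq!(p[..40].len(), 40);
    assert_eq!(p[2..40].len(), 38);
    // The first element of p[2..40] is p[2], like base + 2 in C.
    assert_eq!(p[2..40][0], 2);

    // An out-of-range request yields None; &p[2..60] would panic.
    assert!(checked_window(&p, 2, 60).is_none());

    // Opting out of the check: we promise that 2..40 is within bounds.
    let window = unsafe { p.get_unchecked(2..40) };
    assert_eq!(window.len(), 38);
}
```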

Strings

Once you understand arrays and slices well, strings become easy:

  • str is a [u8] which is guaranteed to contain valid UTF-8 data
  • String is a Vec<u8> which is guaranteed to contain valid UTF-8 data
  • CStr is a [u8] which is guaranteed to be null-terminated, with no null bytes before the terminator
  • CString is a Vec<u8> which is guaranteed to be null-terminated, with no null bytes before the terminator
  • OsStr is a [u8] which is guaranteed to contain data valid for the system’s API⁵
  • OsString is a Vec<u8> which is guaranteed to contain data valid for the system’s API⁵
  • Path is an OsStr that is used to represent a path
  • PathBuf is an OsString that is used to represent a path

⁵ To understand why OsStr/OsString is different from CStr/CString, take a look at WTF-8.
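The guarantees listed above are enforced when the values are constructed. Here is a small sketch of that idea; the helper as_text is a made-up name and the sample strings are arbitrary.

```rust
use std::ffi::{CStr, CString, OsStr};
use std::path::Path;

/// Hypothetical helper: views raw bytes as a str, or None if they are
/// not valid UTF-8.
fn as_text(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

fn main() {
    // str: the bytes must be valid UTF-8.
    assert_eq!(as_text(b"caf\xc3\xa9"), Some("caf\u{e9}"));
    assert_eq!(as_text(b"\xff"), None);

    // String owns its bytes; &String coerces to &str, like &Vec<T> to &[T].
    let owned: String = String::from("hello");
    let borrowed: &str = &owned;
    assert_eq!(borrowed.as_bytes(), b"hello");

    // CString appends the terminator and rejects interior null bytes.
    let c_owned: CString = CString::new("hello").expect("no interior null byte");
    let c_borrowed: &CStr = &c_owned;
    assert_eq!(c_borrowed.to_bytes_with_nul(), b"hello\0");
    assert!(CString::new("he\0llo").is_err());

    // Path is a view over an OsStr.
    let path: &Path = Path::new("dir/file.txt");
    let name: Option<&OsStr> = path.file_name();
    assert_eq!(name, Some(OsStr::new("file.txt")));
}
```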

Since str is just a [u8], the patterns below work as well. The difference is that all the resulting &str must still contain valid UTF-8. In other words, you cannot slice in the middle of the UTF-8 encoding of a codepoint. As usual, you can opt out of the automatic checks by using get_unchecked, but you will have undefined behavior if the range you pass cuts in the middle of a codepoint.

Rust         C
&s[..]       base, size
&s[2..]      base + 2, size - 2
&s[..40]     base, 40
&s[2..40]    base + 2, 38
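The UTF-8 restriction on string slices can be seen with a string that contains a multi-byte character. This sketch uses a made-up helper, checked_prefix, and an arbitrary sample string.

```rust
/// Hypothetical helper: returns s[..end], or None when end is out of range
/// or falls inside the encoding of a character.
fn checked_prefix(s: &str, end: usize) -> Option<&str> {
    s.get(..end)
}

fn main() {
    // U+00E9 takes two bytes in UTF-8, so this string is 6 bytes long:
    // 'h' at byte 0, U+00E9 at bytes 1..3, then 'l', 'l', 'o'.
    let s = "h\u{e9}llo";
    assert_eq!(s.len(), 6);

    // Slicing on character boundaries works like slicing a [u8].
    assert_eq!(&s[..1], "h");
    assert_eq!(&s[..3], "h\u{e9}");
    assert_eq!(&s[3..], "llo");

    // Byte 2 is in the middle of U+00E9: &s[..2] would panic.
    assert!(!s.is_char_boundary(2));
    assert_eq!(checked_prefix(s, 2), None);
    assert_eq!(checked_prefix(s, 3), Some("h\u{e9}"));
}
```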
