Mutation - Part 1: Variables

Sun, Feb 17, 2019

What is a variable?

Ok, here there is really a lot that could be said, and a simple blog post will not cover it. I will just talk about two important and distinct thinking patterns involved:

the idea of binding a name to a value;
the idea of reading/writing values from/into memory;

1. Identity binding

The best example in Rust for “just a binding” (i.e., with no backing memory) is const

1type is = i32;
2
3const x          :is= the_answer + 27;
4const the_answer :is= 42;
5

Here we have defined the_answer as an alias for the value 42; meaning that, from now on, everytime we say the_answer what we actually mean is 42.

We are at the meta level: if we ask the program to evaluate x, what we are actually¹ asking the program to do is to evaluate the_answer + 27, i.e., to evaluate 42 + 27.

The same in C, with #define:

1#define x          (the_answer + 27)
2
3#define the_answer (42)

If you look at both code snippets, you can notice that there is no notion of temporality (regarding the definitions):

It is all about essence / identity.

Thus, if the_answer is and will always be 42, there is no room for mutation.

42 cannot mutate and become 43; 42 is and will always be 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0²
in a similar manner, the_answer cannot change either, since it was defined as being 42: it is an identity, and…
and yet … we feel like we should be able to redefine what the_answer is, don’t we?
re-define?
- This implies some form of temporality, and for that there are two solutions:
  1. scoped bindings (and shadowing).
  2. a binding to (runtime) memory.

1.1 Scoped bindings

You can limit the “scope” of your identity-definition:

1// undefined
2{const the_answer :is= 42;
3    // defined
4}
5// undefined
6

This is one of the ways to have some form of temporality with bindings: time progresses as we enter and leave scopes.

1{const the_answer :is= 42;
2    {const x :is= the_answer + 27;
3        // x -> the_answer + 27 -> 42 + 27
4    }
5}
6

But what about re-defining the_answer?

By using a narrower scope, we can indeed re-define the value of the_answer, even though what we are actually doing is defining a brand new the_answer that happens to lead to a name collision. But given the different scopes and their temporality, this is unambiguous and thus allowed:

1{const the_answer :is= 42;
2    // 'the_answer' <-> 42
3    {const the_answer :is= 0;
4        // 'the_answer' <-> 0
5    }
6    // 'the_answer' <-> 42
7}
8

This trick is called shadowing and is a clever way to avoid using mutation when it is not needed.

2. Runtime memory

We are so used to manipulating memory through variables that I have defined some functions to pretend we are manipulating low-level memory; it thus provides “primitives / intrinsics” that allow us to read/write from/into the memory and also to print values:

 1#![allow(non_camel_case_types, non_snake_case)]
 2
 3/// 4Ko RAM?
 4static mut MEMORY: [u8; 4096] = [0; 4096];
 5type address = usize;
 6
 7fn READ (whence: address) -> u8
 8{
 9    unsafe { MEMORY[whence] }
10}
11
12fn WRITE (where_to: address, what: u8)
13{
14    unsafe { MEMORY[where_to] = what }
15}
16
17use ::std::dbg as PRINT;
18
19fn main ()
20{
21    WRITE(1000, 42);
22    WRITE(1001, 27);
23    PRINT!(READ(1000)); // -> 42
24    WRITE(1000, READ(1000) + READ(1001));
25    PRINT!(READ(1000)); // -> 69
26}
27

This way, we can see that we were able to

write at the address 1000 the value 42;
then write at that same address the value 69.

We have successfully mutated something!

Good, now we have a clearer understanding of what mutation actually is: modifying memory contents.

But what have variables to do with any of this?

2.1 Runtime memory with variables

The following sentence explains the relation between bindings and memory:

A (runtime) variable is a (static) binding to some place in memory, and is used to read/write from/into that place in memory

const at_the_answer: address = 1000;
const at_x: address = 1001;

fn main ()
{
    WRITE(at_the_answer, 42);
    WRITE(at_x, 27);
    PRINT(READ(at_the_answer)); // -> 42
    WRITE(at_the_answer, READ(at_the_answer) + READ(at_x));
    PRINT!(READ(at_the_answer)); // -> 69
}

And now with some sugar:

const at_the_answer: address = 1000;
const at_x: address = 1001;

main! {
    the_answer <- 42;
    x <- 27;
    PRINT!(the_answer); // -> 42
    the_answer <- the_answer + x;
    PRINT!(the_answer); // -> 69
}

How does that sugar work? Let’s describe the unsugaring of line 139:

The statement the_answer <- the_answer + x; is identified as an assignment: a value (right hand side) is written to memory (left hand side)
1. First the right hand side of the assignment (the_answer + x) is treated, and every variable occurrence is replaced with READ memory access:
  - the_answer becomes READ(at_the_answer)
  - x becomes READ(at_x)
2. Then the left hand side of the assignment (the_answer) is treated, and the variable is replaced with a WRITE memory access:
  - the_answer <- ... becomes WRITE(at_the_answer, ...)

We end up with the following unsugaring:

WRITE(at_the_answer, READ(at_the_answer) + READ(at_x));

Hence my aforementioned definition of a variable (let’s repeat it):

A (runtime) variable is a (static) binding to some place in memory, and is used to read/write from/into that place in memory

In memory? But where exactly?

`static` vs `let`

In Rust, there are two³ ways to declare a variable:

static, for global variables; the variable refers to a section of memory statically laid out before the program runs, and which is never deallocated.
There often are two such sections:
- The .rodata section, which is read-only. Attempting to write in this section will result in a memory violation, a.k.a. a Segmentation Fault.
- The .data section⁴, which is both readable and writable.
let⁵, for local variables: the variable refers to a section of memory called the stack. It is always readable and writable.
Clever pointer arithmetic using two⁶ CPU registers leads to a very useful abstraction: stack frames, which can be created and accumulated on top of each other using LIFO semantics (hence the name of the section).
They thus become very well suited for scoped memory, such as function locals:
1. When a scope starts, the compiler looks at all the local variables declared within that scope (e.g., x: i32 and y: u8 within foo’s function);
2. Then a new stack frame is created by shifting the stack pointer(s), in order to reserve enough memory for all those variables (e.g., 4 + 1 = 5 bytes are allocated in foo’s stack frame);
3. Within that function’s body, each “variable” is just a statically known offset from the beginning of the stack frame (e.g., at_x and at_y are the offsets 0 and 4 from the beginning⁷ of the stack frame). And as said previously, each “variable access” is actually the result of dereferencing⁸ those offsets, each time;
4. When the scope ends, the frame is automatically and inevitably pop-ped off, i.e., freed.

We finally have the bases required for thinking about mutation in Rust!

Appendix

Full unsugaring code

#![allow(non_upper_case_globals, non_snake_case)]

/// 4Ko RAM?
static mut MEMORY: [u8; 4096] = [0; 4096];

fn READ (whence: address) -> u8
{
    unsafe { MEMORY[whence] }
}

fn WRITE (where_to: address, what: u8)
{
    unsafe { MEMORY[where_to] = what }
}

/// Recursively replaces:
///     the_answer => READ(at_the_answer)
///     x          => READ(at_x)
macro_rules! replace_rvalues {
    (
        [ $($replaced:tt)* ]
        the_answer
        $($rest:tt)*
    ) => (replace_rvalues!(
        [ $($replaced)* READ(at_the_answer) ]
        $($rest)*
    ));

    (
        [ $($replaced:tt)* ]
        x
        $($rest:tt)*
    ) => (replace_rvalues!(
        [ $($replaced)* READ(at_x) ]
        $($rest)*
    ));


    (
        [ $($replaced:tt)* ]
        $tt:tt
        $($rest:tt)*
    ) => (replace_rvalues!(
        [ $($replaced)* $tt ]
        $($rest)*
    ));

    (
        [ $($replaced:tt)* ]
    ) => (
        $($replaced)*
    );

    ( $($tt:tt)* ) => (
        replace_rvalues!([] $($tt)*)
    )
}

macro_rules! exec {
    /* replaces */ (
        the_answer <- $($tt:tt)*
    ) => /* with */ (
        WRITE(at_the_answer,
            replace_rvalues!( $($tt)* )
        )
    );

    /* replaces */ (
        x <- $($tt:tt)*
    ) => /* WITH */ (
        WRITE(at_x,
            replace_rvalues!( $($tt)* )
        )
    );

    /* replaces */ (
        PRINT ( $($tt:tt)* )
    ) => /* with */ (
        println!("{} = {}", stringify!($($tt)*), replace_rvalues!( $($tt)* ));
    );
}

macro_rules! main {
    (@parsing
        [ $($stmts:tt)* ]
        [ $($current:tt)* ]
        ;
        $($rest:tt)*
    ) => (main!{@parsing
        [ $($stmts)* ( $($current)* ) ]
        []
        $($rest)*
    });

    (@parsing
        $stmts:tt
        [ $($current:tt)* ]
        $tt:tt $($rest:tt)*
    ) => (main!{@parsing
        $stmts
        [ $($current)* $tt]
        $($rest)*
    });

    (@parsing
        [ $( ( $($stmts:tt)* ) )* ]
        []
    ) => (
        fn main () {
            $(
                exec!{ $($stmts)* }
            )*
        }
    );

    (
        $($tt:tt)*
    ) => (main!{@parsing
        []
        []
        $($tt)*
    });
}

#[allow(non_camel_case_types)]
type address = usize;

const at_the_answer: address = 1000;
const at_x: address = 1001;

main! {
    the_answer <- 42;
    x <- 27;
    PRINT(the_answer); // -> 42
    the_answer <- the_answer + x;
    PRINT(the_answer); // -> 69
}

i.e., on runtime ^[return]
The representation of a number (here in base ten) vs its true nature is a (very interesting) topic in and on itself, which will most probably deserve its own blog post. One great definition / view of a (natural) number’s nature is Peano’s arithmetic, with which one can prove that 1+1+0 + 1+1+0 = 1+1+1+1+0 ^[return]
const does not declare a variable, but a meta binding. ^[return]
and .bss for initially zero-ed (global) memory ^[return]
and pattern bindings such as match arms or function parameters ^[return]
or can even use just one register (since the allocations are all static) ^[return]
offsets are actually negative and do not start at 0, but those are just implementation details ^[return]
accessing the memory at a given address ^[return]