(This blog covers the implementation details of cargo-call-stack v0.1.2. If you are only interested in using the tool these tweets and the README will give you an idea of what it can do.)

Motivation

So, why would you ever want to analyze the stack usage of your program at compile time?

The obvious answer is when you want to know if your application can stack overflow at runtime without actually running your program. This is important for embedded applications where stack overflows can corrupt memory, and it’s actually a hard requirement for certifying the functional safety of embedded applications.

Of course, there are several stack usage analysis tools out there but they only target C/C++, are closed source and/or cost $$$$$. So here at LTU we set out to build a stack usage analysis tool for Rust: cargo-call-stack. Of course we don’t claim that you can use our tool to certify your Rust applications, at least not yet ;-).

How does it work?

Stack usage analysis consists of three steps: first, computing the stack usage of every function (subroutine) in the program, then computing the full call graph of the program and finally using these two pieces of information to compute the worst-case stack usage.

`.stack_sizes`

For the first step we use -Z emit-stack-sizes. This experimental flag enables an LLVM feature that creates a linker section named .stack_sizes; this section contains the stack usage of every function in the program.

Consider this program:

// template: https://github.com/rust-embedded/cortex-m-quickstart
// rustc: nightly-2019-03-01 (~1.33.0)

#![feature(asm)]
#![no_std]
#![no_main]

extern crate panic_halt; // panic-halt = "=0.2.0"

use cortex_m_rt::entry; // cortex-m-rt = "=0.6.7"

#[entry]
fn main() -> ! {
    foo();
    bar();

    loop {}
}

#[inline(never)]
fn foo() {
    // spill variables onto the stack
    unsafe { asm!("" : : "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5)) }
}

#[inline(never)]
fn bar() {
    // spill even more variables onto the stack
    unsafe { asm!("" : : "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5) "r"(6) "r"(7)) }
}

If we build this example with -Z emit-stack-sizes we’ll get the .stack_sizes section.

$ cargo rustc --bin app --release -- -Z emit-stack-sizes

$ size -Ax target/thumbv7m-none-eabi/release/app | head -n8
target/thumbv7m-none-eabi/release/app  :
section             size         addr
.vector_table      0x400          0x0
.text              0x292        0x400
.rodata              0x0        0x694
.data                0x0   0x20000000
.bss                 0x0   0x20000000
.stack_sizes        0x23          0x0

We can use the cargo-stack-sizes tool to parse the .stack_sizes section and print it in human readable format.

$ cargo stack-sizes --bin app --release
address         stack   name
0x00000414      16      app::bar::ha68263a6ecabd8f4
0x00000400      8       app::foo::h447512bba9d1cb41
0x0000042c      0       main
0x00000436      0       Reset
0x00000686      0       DefaultHandler_
0x00000688      0       DefaultPreInit
0x00000690      0       HardFault_

The number reported here is the stack usage in isolation; it doesn’t consider the call graph of the program.

Call graph

To build the call graph we parse the LLVM-IR of the program, which we get from the --emit=llvm-ir rustc flag. This flag produces a .ll file that contains the LLVM-IR of the crate.

If you are not familiar with LLVM-IR this is what it looks like:

NOTE: I have cleaned up all the LLVM-IR snippets shown in this post by removing uninteresting bits like attributes and metadata with the goal of making them short enough to fit the screen.

; app::baz
; Function Attrs: noinline nounwind
define internal fastcc i32 @_ZN3app3baz17h677ca7ea9c7e9167E() unnamed_addr #0 {
start:
; call app::bar
  tail call fastcc void @_ZN3app3bar17h2e93e4e5ba96c2a6E()
  %0 = load volatile i32, i32* bitcast (<{ [4 x i8] }>* @0 to i32*), align 4
  ret i32 %0
}

This is a function definition. The signature of this function is i32 () in LLVM type system. This function calls a function named app::bar (its mangled name, @_ZN3app3bar17h2e93e4e5ba96c2a6E, appears in the IR), which has signature void ().

The LLVM-IR shown above corresponds to this function:

#[inline(never)]
fn baz() -> i32 {
    bar();

    unsafe { core::ptr::read_volatile(&0) }
}

By reading all the function definitions in the IR we can build an accurate call graph of the program as long as it only contains direct function calls like the one shown above. (Later we’ll see how to deal with function pointers and trait objects)

Once you have the stack usage of all functions and the call graph computing the worst-case stack usage boils down to walking the graph and adding up the individual stack usage numbers.

Consider this modification of the first example.

#[entry]
fn main() -> ! {
    foo();
    // bar(); // CHANGED!

    // NEW!
    // NOTE we use a volatile load here to prevent LLVM from
    // optimizing away the return value
    unsafe {
        core::ptr::read_volatile(&baz());
    }

    loop {}
}

#[inline(never)]
fn foo() {
    // spill variables onto the stack
    unsafe { asm!("" : : "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5)) }
}

#[inline(never)]
fn bar() {
    unsafe { asm!("" : : "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5) "r"(6) "r"(7)) }
}

#[inline(never)]
fn baz() -> i32 {
    bar();

    unsafe { core::ptr::read_volatile(&0) }
}

Let’s see what cargo-call-stack produces

$ # here we choose `main` as the start point of the call graph
$ cargo call-stack --bin app main > cg.dot

$ dot -Tsvg cg.dot > cg.svg

The nodes in this graph are functions and the edges indicate which functions a function may call. The local number is the stack usage of the function that we got from the .stack_sizes section. The max number is the worst-case stack usage of a function; this number includes the stack space used by functions called by the function.

In this case, main’s worst-case stack usage is 32 bytes and occurs when main calls baz and then baz calls bar, so the worst-case call stack is: main, baz, bar.

Why LLVM-IR?

At this point you may be wondering why we are using LLVM-IR to build the call graph and not something more Rust-y like MIR.

The main reason is that LLVM-IR is very close in structure to the final binary and that’s because the .ll file produced by rustc contains the IR after optimizations have been applied.

If we had built the call graph using MIR it would have contained way more nodes than there actually are in the binary. The extra MIR nodes wouldn’t exist in the final binary because LLVM inlined the functions or optimized them away. More work would had been required to remove these extra nodes.

Another advantage of using LLVM-IR is that is a relatively stable format, unlike MIR, so moving to a newer Rust compiler is less likely to break the tool.

Cycles

Finding the worst-case stack usage of an acyclic call graph is straightforward: you walk the graph node by node in reverse topological order. Things get a bit trickier when you have cycles in the graph. Cycles appear when the program uses recursion. Fibonacci is probably the most known example of recursion so let’s use it as an example.

static X: AtomicUsize = AtomicUsize::new(0);

#[entry]
fn main() -> ! {
    X.store(fib(X.load(Ordering::Relaxed)), Ordering::Relaxed);

    loop {}
}

#[inline(never)]
fn fib(n: usize) -> usize {
    if n < 2 {
        1
    } else {
        fib(n - 1) + fib(n - 2)
    }
}

// could change the value of `X` at any time
#[exception]
fn SysTick() {
    X.fetch_add(1, Ordering::Relaxed);
}

The above program produces the following call graph:

The fib function itself uses some stack: 16 bytes. However, because it recursively calls itself we can’t place an upper bound on its stack usage – we don’t how deep the recursion will go. Thus we can only say that calling fib will use at least 16 bytes of stack; that’s why we report >= 16 as the worst-case stack usage. Since main calls this single-node cycle it inherits its unbounded worst-case stack usage.

If you encounter unbounded stack usage due to recursion you should consider removing the recursion to make the program analyzable. For example, fib can be rewritten to use a loop instead of recursion.

#[inline(never)]
fn fib(n: usize) -> usize {
    let (mut a, mut b) = (1, 1);
    for _ in 0..n {
        b = core::mem::replace(&mut a, b) + b;
    }
    a
}

But, does this mean that recursion always implies unbounded stack usage? No. In some (rare?) cases, the stack usage of cycles can be bounded. Here’s one such example:

#[entry]
fn main() -> ! {
    foo();

    quux();

    loop {}
}

// these three functions form a cycle that breaks when `SysTick` runs
#[inline(never)]
fn foo() {
    if X.load(Ordering::Relaxed) {
        bar()
    }
}

#[inline(never)]
fn bar() {
    if X.load(Ordering::Relaxed) {
        baz()
    }
}

#[inline(never)]
fn baz() {
    if X.load(Ordering::Relaxed) {
        foo()
    }
}

#[inline(never)]
fn quux() {
    // spill variables onto the stack
    unsafe { asm!("" : : "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5)) }
}

#[exception]
fn SysTick() {
    X.store(false, Ordering::Relaxed);
}

This program produces the call graph shown at the beginning of this post.

Here foo, bar and baz form a cycle but none of them uses the stack so the worst-case (max) stack usage of the whole cycle is 0 bytes. To compute the worst-case stack usage of main we walk the graph as we usually do but consider each cycle as a single node, or in more technical terms: we walk over the SCCs (Strongly Connected Components) of the graph in reverse topological order.

If you are a bit incredulous about this result (I was too!) check out the cargo-call-stack README. It contains this exact same example but expands on it further by showing the disassembly and the GDB logs of running the example on a Cortex-M3 microcontroller. (Yes, the stack usage is indeed zero!)

Function pointers

We have only seen direct function calls so far; how do we deal with function pointer calls? Turns out that the LLVM-IR representation of function pointers calls include the signature of the function and that helps.

Consider this program:

#[no_mangle] // shorter name in the llvm-ir
static F: AtomicPtr<fn() -> bool> = AtomicPtr::new(foo as *mut _);

#[entry]
#[inline(never)]
fn main() -> ! {
    if let Some(f) = unsafe { F.load(Ordering::Relaxed).as_ref() } {
        // call via function pointer
        f();
    }

    loop {}
}

// ..

Which produces the following LLVM-IR:

; Function Attrs: noreturn nounwind
define void @main() unnamed_addr #2 {
start:
  %0 = load atomic i32, i32* bitcast (<{ i8*, [0 x i8] }>* @F to i32*) monotonic
  %1 = icmp eq i32 %0, 0
  br i1 %1, label %bb6.preheader, label %bb3

bb3:                                              ; preds = %start
  %2 = inttoptr i32 %0 to i1 ()**
  %3 = load i1 ()*, i1 ()** %2, align 4, !nonnull !46
; ↓
  %4 = tail call zeroext i1 %3() #7
  br label %bb6.preheader

bb6.preheader:                                    ; preds = %start, %bb3
  br label %bb6

bb6:                                              ; preds = %bb6.preheader, %bb6
  br label %bb6
}

The function pointer call appears in line %4 = on the right hand side. Value %3 is the function pointer. This line also includes the signature of the function: i1 (), which is the LLVM version of Rust’s fn() -> bool. We can distinguish a function pointer call from a direct function call because the latter uses the name of a function (@foo) as the callee whereas the former uses a value (%3 in this case).

So from the IR we know that main calls a function pointer with type fn() -> bool. If the crate and its dependency graph are written in pure Rust then we know the signatures of all the functions in the final binary. With that information we can (pessimistically) assume that main could call any function with signature fn() -> bool.

So if we had these other functions in our example:

fn foo() -> bool {
    false
}

fn bar() -> bool {
    // spill variables onto the stack
    unsafe { asm!("" : : "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5)) }

    true
}

// this handler can change the function pointer at any time
#[exception]
fn SysTick() {
    F.store(bar as *mut _, Ordering::Relaxed);
}

The call graph reported by the tool would look like this:

Here i1 ()* is a fictitious node that denotes a function pointer call. The outgoing neighbors of this node are the functions that could be invoked as a result of this call.

Trait objects

We can use a similar approach to reason about dynamic dispatch. Calling a trait object method shows in LLVM-IR as calling a function pointer but with the peculiarity that the signature of the function, for example, looks like this: i1 ({}*) where LLVM’s {}* type is basically Rust’s *mut () type.

Consider the following code:

use spin::Mutex; // spin = "=0.5.0"

#[no_mangle] // shorter name in the llvm-ir
static TO: Mutex<&'static (dyn Foo + Sync)> = Mutex::new(&Bar);

#[entry]
fn main() -> ! {
    // trait object dispatch
    (*TO.lock()).foo();

    loop {}
}

trait Foo {
    // default implementation of this method
    fn foo(&mut self) -> bool {
        foo(&mut 0)
    }
}

// ..

Which produces the following LLVM-IR:

; Function Attrs: minsize noreturn nounwind optsize
define void @main() unnamed_addr #4 {
start:
; call <spin::mutex::Mutex<T>>::lock
  tail call fastcc void @"_ZN36_$LT$spin..mutex..Mutex$LT$T$GT$$GT$4lock17h2602c9565791e858E"()
  %0 = load {}*, {}** bitcast (i8** getelementptr inbounds (<{ [4 x i8], i8*, i8*, [0 x i8] }>, <{ [4 x i8], i8*, i8*, [0 x i8] }>* @TO, i32 0, i32 1) to {}**), align 4, !nonnull !23
  %1 = load i1 ({}*)**, i1 ({}*)*** bitcast (i8** getelementptr inbounds (<{ [4 x i8], i8*, i8*, [0 x i8] }>, <{ [4 x i8], i8*, i8*, [0 x i8] }>* @TO, i32 0, i32 2) to i1 ({}*)***), align 4, !nonnull !23
  %2 = getelementptr inbounds i1 ({}*)*, i1 ({}*)** %1, i32 3
  %3 = load i1 ({}*)*, i1 ({}*)** %2, align 4, !invariant.load !23, !nonnull !23
; ↓
  %4 = tail call zeroext i1 %3({}* nonnull align 1 %0) #9
  store atomic i8 0, i8* getelementptr inbounds (<{ [4 x i8], i8*, i8*, [0 x i8] }>, <{ [4 x i8], i8*, i8*, [0 x i8] }>* @TO, i32 0, i32 0, i32 0) release, align 4
  br label %bb5

bb5:                                              ; preds = %bb5, %start
  br label %bb5
}

Line %4 = is the dynamic dispatch. Trait method Foo::foo has signature fn(&self) -> bool; that’s why the signature of the function shows as i1 ({}*) in the LLVM-IR. The type of self has been erased so we only know that the first argument (the receiver) is a pointer.

Which functions could be dispatched by this call? In principle, any with signature i1 (???*) where ??? could be any type: i1 (bool), i32, %crate::module::Struct, etc., but this approach would also include inherent methods and functions whose first argument is a pointer in the list of candidates. For example the following function would also be considered a candidate:

#[inline(never)]
fn foo(x: &mut i32) -> bool {
    // memory clobber to trick LLVM into believing that `x` may be modified
    unsafe { asm!("" : : : "memory") }

    *x > 0
}

We can narrow down the list of candidates a bit further by looking at the name of the function. Methods in trait implementations are named using the format: <crate::module::Trait for crate::module::Type>::method, whereas default trait methods are named using the format: crate::module::Trait::method. The app::foo function shown above doesn’t match either format.

Now let’s see what cargo-call-stack produces when it uses both signature matching and name matching. This is the rest of the program:

struct Bar;

// uses the default method implementation
impl Foo for Bar {}

struct Baz;

impl Foo for Baz {
    // overrides the default method
    fn foo(&self) -> bool {
        // spill variables onto the stack
        unsafe { asm!("" : : "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5)) }

        true
    }
}

// could change the trait object at any time
#[exception]
fn SysTick() {
    *TO.lock() = &Baz;
}

Here i1 ({}*) is a fictitious node that denotes the dynamic dispatch of a trait method with signature fn(&[mut] self) -> bool. This operation could result in either Bar.foo or Baz.foo being called. Note that Bar.foo appears as app::Foo::foo in the graph because Bar uses the default implementation of this method.

Importantly, there’s no edge between i1 ({}*) and app::foo even though app::foo matches the signature of the trait method.

For completeness sake here’s the LLVM-IR of all the functions that match the LLVM type i1 ({}*).

NOTE: Here I have omitted the mangled names since they are long, not really useful and the demangled version already appears in the comments.

; <app::Baz as app::Foo>::foo
; Function Attrs: minsize nounwind optsize
define zeroext i1 @_(%Baz* noalias nonnull align 1) unnamed_addr #2 {
; ..
}

; app::Foo::foo
; Function Attrs: minsize nounwind optsize
define zeroext i1 @_(%Bar* noalias nonnull align 1) unnamed_addr #2 {
; ..
}

; app::foo
; Function Attrs: minsize noinline nounwind optsize
define fastcc zeroext i1 @_(i32* align 4 dereferenceable(4)) unnamed_addr #3 {
; ..
}

Checking correctness

Now that you know how cargo-call-stack works it’s a good time to check if it’s correct by using it on a non-contrived example. The example I’ll use in this section is a CoAP / 6LoWPAN server.

This server exposes an LED resource at path /led; this LED can be controlled and polled using PUT and GET requests. The payloads of PUT requests and Content responses are in JSON format. This application does not use interrupts; all the logic is in an infinite loop (AKA a “super loop”).

For starters, here’s the call graph produced by cargo-call-stack. (Clicking the image should open the SVG in your browser)

The tool reports that the worst-case stack usage of main is 1152 bytes. To check whether this number is correct or not we’ll measure the stack usage by running the program for a bit. To measure the stack usage in a non-invasive way we’ll fill the RAM with some known bit pattern. This can be done in the linker script.

SECTIONS
{
  /* .. */

  /* add this after the last section placed in RAM */

  .fill : ALIGN(4)
  {
    FILL(0xfeedc0de);
    . = ORIGIN(RAM) + LENGTH(RAM) - 4;
    LONG(0xfeedc0de);
  } > RAM AT > RAM

  /* .. */
}

This will tell the linker to create an output section named .fill that contains the bit pattern 0xfeedc0de repeated all over the place. This section will span the part of the RAM that’s not used for static variables, which is where the (call) stack resides.

As this section is marked as loadable (which is the default; NOLOAD can be used to opt out) the debugger will load it into memory ..

$ cargo run --example sixlowpan --release
(..)
Loading section .fill, size 0x4ffc lma 0x20000004
Loading section .vector_table, size 0x130 lma 0x8000000
Loading section .text, size 0x4c18 lma 0x8000130
Loading section .rodata, size 0x520 lma 0x8004d50
(..)

(gdb) _

.. effectively initializing all unused RAM to the specified bit pattern. Examining the memory in GDB confirms that the bit pattern is everywhere. I’m running this program on a device with 20 KiBi (0x5000) of RAM, which is located at address 0x2000_0000.

(gdb) x/16x 0x20004000
0x20004040:     0xdec0edfe      0xdec0edfe      0xdec0edfe      0xdec0edfe
0x20004050:     0xdec0edfe      0xdec0edfe      0xdec0edfe      0xdec0edfe
0x20004060:     0xdec0edfe      0xdec0edfe      0xdec0edfe      0xdec0edfe
0x20004070:     0xdec0edfe      0xdec0edfe      0xdec0edfe      0xdec0edfe

(gdb) x/16x 0x20004fc0
0x20004fc0:     0xdec0edfe      0xdec0edfe      0xdec0edfe      0xdec0edfe
0x20004fd0:     0xdec0edfe      0xdec0edfe      0xdec0edfe      0xdec0edfe
0x20004fe0:     0xdec0edfe      0xdec0edfe      0xdec0edfe      0xdec0edfe
0x20004ff0:     0xdec0edfe      0xdec0edfe      0xdec0edfe      0xfeedc0de

(And yes the pattern appears inverted because the target is little endian.)

What I did next was run the application and use it as I normally would. I sent a few requests and pinged the device to try to hit as many code paths as possible.

$ coap -I lowpan0 GET 'coap://[fe80::2219:220:23:5959]/led'
-> {type: Confirmable, code: Method::Get, message_id: 4324, options: {UriPath: "led"}}
<- {type: Acknowledgement, code: Response::Content, message_id: 4324}
{"led":true}

$ coap -I lowpan0 PUT 'coap://[fe80::2219:220:23:5959]/led' '{"led":false}'
-> {type: Confirmable, code: Method::Put, message_id: 4378, options: {UriPath: "led"}}
<- {type: Acknowledgement, code: Response::Changed, message_id: 4378}

$ coap -I lowpan0 GET 'coap://[fe80::2219:220:23:5959]/led'
-> {type: Confirmable, code: Method::Get, message_id: 1654, options: {UriPath: "led"}}
<- {type: Acknowledgement, code: Response::Content, message_id: 1654}
{"led":false}

$ ping -6 -c2 fe80::2219:220:23:5959%lowpan0
PING fe80::2219:220:23:5959%lowpan0(fe80::2219:220:23:5959%lowpan0) 56 data bytes
64 bytes from fe80::2219:220:23:5959%lowpan0: icmp_seq=1 ttl=64 time=15.0 ms
64 bytes from fe80::2219:220:23:5959%lowpan0: icmp_seq=2 ttl=64 time=16.2 ms

--- fe80::2219:220:23:5959%lowpan0 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 2ms
rtt min/avg/max/mdev = 15.002/15.612/16.222/0.610 ms

After interacting with the server for a bit I paused the program and proceeded to inspect the RAM in GDB. Eventually I found the high-water mark of the stack at address 0x20004dc0.

(gdb) x/16x 0x20004db0
0x20004db0:     0xdec0edfe      0xdec0edfe      0xdec0edfe      0xdec0edfe
0x20004dc0:     0xdec0edfe      0xdec0edfe      0xdec0edfe      0xdec0edfe
0x20004dd0:     0x00000001      0x00000000      0x00000000      0x00000000
0x20004de0:     0x3aaa0001      0x00000000      0x00000000      0x00000000

At its deepest the stack spanned from 0x20005000 to 0x20004dd0 (the stack grows downwards); that’s 560 bytes, which is a much smaller value than the 1152 bytes reported by the tool as the worst-case stack usage. That’s reassuring; it would have been bad if the measured stack usage had been greater than the worst-case stack usage reported by the tool.

Known limitations

So far I have only shown you examples where cargo-call-stack does a good job at building the call graph; however, the tool is far from perfect. In this second part of the post we’ll see where the tool falls short.

LLVM intrinsics

I said that the call @_ instruction in LLVM-IR corresponds to a direct function call. That’s not always the case because LLVM intrinsics are also invoked using the call instruction. One good example is the llvm.memset intrinsic; consider the following code:

#[entry]
fn main() -> ! {
    let x = [0u8; 32];

    // force `x` to be allocated on the stack
    unsafe {
        core::ptr::read_volatile(&x.as_ptr());
    }

    loop {}
}

It produces the following LLVM-IR:

; Function Attrs: noinline noreturn nounwind
define void @main() unnamed_addr #0 {
start:
  %_7 = alloca i8*, align 4
  %x = alloca [32 x i8], align 1
  %0 = getelementptr inbounds [32 x i8], [32 x i8]* %x, i32 0, i32 0
; ↓
  call void @llvm.memset.p0i8.i32(i8* nonnull align 1 %0, i8 0, i32 32, i1 false)
  %_7.0.sroa_cast2 = bitcast i8** %_7 to i8*
  store i8* %0, i8** %_7, align 4
  %_7.0._7.0. = load volatile i8*, i8** %_7, align 4
  br label %bb3

bb3:                                              ; preds = %bb3, %start
  br label %bb3
}

Although you get a call instruction there the actual disassembly contains no function call:

00000400 <main>:
 400:   b089            sub     sp, #36 ; 0x24
 402:   2000            movs    r0, #0
 404:   e9cd 0006       strd    r0, r0, [sp, #24]
 408:   e9cd 0004       strd    r0, r0, [sp, #16]
 40c:   e9cd 0002       strd    r0, r0, [sp, #8]
 410:   e9cd 0000       strd    r0, r0, [sp]
 414:   4668            mov     r0, sp
 416:   9008            str     r0, [sp, #32]
 418:   9808            ldr     r0, [sp, #32]
 41a:   e7fe            b.n     41a <main+0x1a>

Turns out the stack-allocated array was initialized inline using a few register operations and no loops!

But this is not always the case; the output machine code depends on the size of the array and register availability. For example, if we increase the size of the array to 64 we get a call to __aeabi_memclr4.

00000400 <main>:
 400:   b092            sub     sp, #72 ; 0x48
 402:   ac01            add     r4, sp, #4
 404:   2140            movs    r1, #64 ; 0x40
 406:   4620            mov     r0, r4
 408:   f000 f998       bl      73c <__aeabi_memclr4>
 40c:   9411            str     r4, [sp, #68]   ; 0x44
 40e:   9811            ldr     r0, [sp, #68]   ; 0x44
 410:   e7fe            b.n     410 <main+0x10>

Though we see almost no change in the LLVM-IR.

define void @main() unnamed_addr #0 {
start:
  %_7 = alloca i8*, align 4
; <changed>
  %x = alloca [64 x i8], align 1
  %0 = getelementptr inbounds [64 x i8], [64 x i8]* %x, i32 0, i32 0
  call void @llvm.memset.p0i8.i32(i8* nonnull align 1 %0, i8 0, i32 64, i1 false)
; </changed>
  %_7.0.sroa_cast2 = bitcast i8** %_7 to i8*
  store i8* %0, i8** %_7, align 4
  %_7.0._7.0. = load volatile i8*, i8** %_7, align 4
  br label %bb3

bb3:                                              ; preds = %bb3, %start
  br label %bb3
}

We may not always get a call to __aeabi_memclr4; we could get a call to __aeabi_memclr depending on the alignment of the array. So for call @llvm.memset(%0, 0, ..) we could get either a call to __aeabi_memclr, a call to __aeabi_memclr4 or no function call, at least when compiling for ARMv7-M. And this is just one LLVM intrinsic! There are at least a dozen of intrinsics that behave like this and whose behaviors are architecture specific.

ELF to the rescue

How do we deal with these problematic LLVM intrinsics? We could port the relevant LLVM code to cargo-call-stack but that would be a lot of work and likely to get outdated in an LLVM update. The alternative is to ignore all calls to LLVM intrinsics in the IR and instead directly look for function calls in the output binary: the ELF file. As of v0.1.2 cargo-call-stack does the latter but only for ARM Cortex-M binaries.

The idea is simple: we walk through all the subroutines (functions) in the output ELF, take note of all branch instructions (e.g. B and BL) and use that information to add more edges to the call graph. BL (branch link) instructions always result in an edge; B (branch) instructions only result in an edge if they lead to a different subroutine (B instructions are also used to implement if conditionals and for loops within a subroutine). We ignore all BLX (branch link exchange) instructions; these correspond to function pointer calls and dynamic dispatch but provide no more information than what’s contained in the LLVM-IR.

Here’s the call graph of the ARM Cortex-M program that zeroes a 64-byte array.

When dealing with non-Cortex-M binaries the tool doesn’t parse the machine code. However, it still tries to convert calls to llvm intrinsics into edges. For example, when dealing with ARM binaries the tool assumes that call @llvm.memcpy can lower to either __aeabi_memcpy or __aeabi_memcpy4 so it inserts two edges. The tool has hard-coded rules to deal with some intrinsics, but when it sees an intrinsic it doesn’t know about it assumes that it doesn’t produce a function call and prints a warning about it.

Other LLVM instructions

This instruction selection problem is not exclusive to LLVM intrinsics; you can also see it in LLVM instructions other than call. One example is the fadd instruction, which is used to add two floating point numbers. If the compilation target has an FPU (e.g. ARMv7E-M) then this LLVM instruction lowers to the appropriate machine instructions; if not then the LLVM instruction lowers to a call to __aeabi_fadd, at least when compiling for ARM. There are dozens of LLVM instructions that behave like this.

For ARM Cortex-M binaries these LLVM instructions are not a problem because we parse the machine code but for other binaries the call graph will likely be missing edges. Unlike how intrinsics are handled, the tool doesn’t try to reason about these LLVM instructions as that would be a lot of work and parsing the machine code solves the problem much more efficiently.

Here’s an example that showcases one of these problematic LLVM instructions:

#[no_mangle] // shorter name in the llvm-ir
static X: AtomicU32 = AtomicU32::new(0);

#[entry]
fn main() -> ! {
    let x = X.load(Ordering::Relaxed);
    let y = f32::from_bits(x) * 1.1;
    X.store(y.to_bits(), Ordering::Relaxed);

    loop {}
}

// can change the value of `X` at any time
#[exception]
fn SysTick() {
    X.fetch_add(1, Ordering::Relaxed);
}

Here’s the LLVM IR of main. As you can see there’s no call instruction.

; Function Attrs: norecurse noreturn nounwind
define void @main() unnamed_addr #0 {
start:
  %0 = load atomic i32, i32* bitcast (<{ [4 x i8] }>* @X to i32*) monotonic, align 4
  %1 = bitcast i32 %0 to float
  %2 = fmul float %1, 0x3FF19999A0000000
  %3 = bitcast float %2 to i32
  store atomic i32 %3, i32* bitcast (<{ [4 x i8] }>* @X to i32*) monotonic, align 4
  br label %bb5

bb5:                                              ; preds = %bb5, %start
  br label %bb5
}

However, there is a BL instruction in its machine code:

00000400 <main>:
 400:   f240 0400       movw    r4, #0
 404:   f64c 41cd       movw    r1, #52429      ; 0xcccd
 408:   f2c2 0400       movt    r4, #8192       ; 0x2000
 40c:   f6c3 718c       movt    r1, #16268      ; 0x3f8c
 410:   6820            ldr     r0, [r4, #0]
 412:   f000 f9d5       bl      7c0 <__aeabi_fmul>
 416:   6020            str     r0, [r4, #0]
 418:   e7fe            b.n     418 <main+0x18>

As this is a Cortex-M program the tool will parse the machine code and produce an accurate call graph:

Binary blobs

rustc can produce LLVM-IR and .stack_sizes information for all the Rust code it compiles from source. This includes the crates in the std facade even though they come pre-compiled as part of the rust-std component.

There’s an exception though: the compiler-builtins crate. This crate contains only compiler intrinsics. When LLVM can’t lower an instruction like fadd to machine code it lowers it to a call to a compiler intrinsic like __aeabi_fadd. Compiler intrinsics are just Rust functions that come pre-compiled in the form of the libcompiler_builtins.rlib, which is part of the rust-std component.

Unlike the other crates in the std facade compiler-builtins’s metadata is never loaded by rustc when compiling a crate thus no LLVM-IR or .stack_sizes section is produced for it. Loading the metadata is not required because this special crate is always linked in as a separate object, even when LTO is enabled.

Thus, from the point of view of the analysis, the compiler-builtins crate is a binary blob; the tool has no type or stack usage information about the symbols (functions) in that crate. There are other binary blobs that a crate can depend on: C libraries, like libc.a, and pre-compiled assembly files, these are used in the cortex-m crates.

The problems

If you are compiling for ARM Cortex-M the tool will extract call graph information from the binary blobs – the machine code is always available – thus all direct function calls will be accounted for. The tool will not, however, attempt to analyze the indirect function calls that it finds in the machine code as there’s no type information that can be used to build a list of potential callees; you will get warnings whenever such indirect function call is found. If you are compiling for a different architecture the tool will not parse the machine code of these blobs so the graph will be missing both direct and indirect edges.

The lack of stack usage information means that it’s not possible to find the worst-case stack usage of any function that, directly or indirectly, calls into a symbol that resides in a binary blob. In this case the tool will only report a lower bound like it did when it encountered potentially unbounded cycles. Here’s an example:

use cortex_m::asm;

#[entry]
fn main() -> ! {
    asm::nop();

    loop {}
}

As there’s no inline assembly on stable the cortex-m crate implements its asm API using pre-compiled assembly files; these pre-compiled assembly files are binary blobs so all the asm API has unknown stack usage.

Sometimes there’s no type information for all the symbols that reside in binary blobs. If the symbol is called via FFI then there will be type information for that symbol – the type information comes from the extern block in the Rust source code. But if that external function calls another function then there will be no type information for the second function. Also, there’s no type information for compiler intrinsics like __aeabi_memcpy in the LLVM-IR – these don’t appear in extern blocks since they are not explicitly called by the program.

If there’s at least one symbol for which the tool has no type information then the tool must assume that any function pointer call could potentially call into such “untyped” symbol. It does so by adding an edge between an unknown node, labeled "?" in the graph, and every one of the fictitious function pointer call nodes.

What can we do?

We are currently focusing our efforts on making pure Rust embedded applications fully analyzable so let’s start by addressing the problems that affect that use case.

The main problem for this use case is the compiler-builtins crate. We have no type information about the symbols in it, but we can paper over this problem in the tool itself. All compiler intrinsics have stable an interface / ABI so we can hard-code this information in the tool (and we already do). Note that this doesn’t include functions called by the compiler intrinsics; those are implementation details and have no stable interface. So with this trick you get most but not all the type information.

We don’t have stack usage information about the compiler intrinsics but that’s easy to fix! One simply has to recompile the compiler-builtins crate with -Z emit-stack-sizes. I tested this locally and it works. To make this as convenient as possible this should be done in rust-lang/rust CI so the end-user doesn’t need to do any extra step. In the meantime, I have hard-coded some of this information in the tool so, for example, it knows what’s the stack usage of __aeabi_memcpy4 when the target is thumbv7m-none-eabi.

Those two fixes should cover all use cases where the program only contains direct function calls. To fully support programs that do function pointer calls we would need complete type information, and I think the only solution for that would be to include the compiler-builtins crate in the LTO process instead of always treating it as a separate object file. That way we would have access to the LLVM-IR of compiler-builtins.

However, I believe that to be impossible due to how LLVM works: simply put if compiler-builtins participated in the LTO process all its IR would be removed by the optimization passes and it wouldn’t appear in the .ll file that the tool parses. The reason for that is that the IR contains call @llvm.memset rather than call @__aeabi_memcpy4 so the dead code elimination pass will remove the __aeabi_memcpy4 function; the same thing would happen with all the other compiler intrinsics and the internal functions they call.

Now going back to the general case of Rust programs that link to C libraries: in theory it should be possible to (re)compile all C code using clang and the -fstack-size-section flag, which is the equivalent of rustc’s -Z emit-stack-sizes. That would give us stack usage information about all the C functions.

Here’s a contrived example to illustrate the idea:

void foo() {
  // allocate an array on the stack
  char x[8];

  return;
}

void bar() {
  return;
}

$ clang -target thumbv7m-none-eabi -c -fstack-size-section foo.c

$ size -Ax foo.o | head -n5
foo.o  :
section           size   addr
.text              0x8    0x0
.ARM.exidx        0x10    0x0
.stack_sizes       0xa    0x0

$ arm-none-eabi-objdump -d foo.o | tail -n7
00000000 <foo>:
   0:   b082            sub     sp, #8
   2:   b002            add     sp, #8
   4:   4770            bx      lr

00000006 <bar>:
   6:   4770            bx      lr

$ # `stack-sizes` is a tool shipped with `cargo-stack-sizes`
$ stack-sizes foo.o
address         stack   name
0x00000000      8       foo
0x00000006      0       bar

With stack usage information and the ability to extract call graph information from machine code it should be possible to analyze all programs that link to C code and only do direct function calls.

To analyze programs that do indirect function calls we would need type information. Here a potential solution could be cross language LTO. This feature would convert the C code compiled by clang into LLVM-IR that the tool can analyze. I haven’t looked into this yet as pure Rust applications are our priority.

Lossy type conversions

To reason about function pointer calls and trait objects we use LLVM types. LLVM types, specially its primitives, don’t quite match Rust types. For example, there are no unsigned integers in the LLVM type system so a Rust type like u8 gets transformed into LLVM’s i8. This mismatch can cause functions to be labeled with the wrong signature which results in incorrect edges being added to the graph.

Consider the following example, which is a variation of the example shown in the ‘Function pointers’ section.

// return type changed
static F: AtomicPtr<fn() -> u32> = AtomicPtr::new(foo as *mut _);

#[entry]
fn main() -> ! {
    if let Some(f) = unsafe { F.load(Ordering::Relaxed).as_ref() } {
        f();
    }

    // volatile load to preserve the return value of `baz`
    unsafe {
        core::ptr::read_volatile(&baz());
    }

    loop {}
}

fn foo() -> u32 { // return type changed
    0
}

fn bar() -> u32 { // return type changed
    unsafe { asm!("" : : "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5)) }

    1
}

// new
#[inline(never)]
fn baz() -> i32 {
    unsafe { asm!("" : : "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5) "r"(6) "r"(7)) }

    F.load(Ordering::Relaxed) as usize as i32
}

#[exception]
fn SysTick() {
    F.store(bar as *mut _, Ordering::Relaxed);
}

The tool produces the following call graph:

Note that the function pointer call shows as i32 ()*, i.e. fn() -> i32, even though F has type AtomicPtr<fn() -> u32>. The reason is that both Rust types i32 and u32 map to LLVM’s i32 type. The result is that there’s an edge between i32 ()* and baz which shouldn’t be there!

This lossy conversion not only affects integers; it also affects newtypes. For example, the Rust type of struct Int(i32) can be transformed into one of these two LLVM types: i32 or %Int. As far as I have seen you will only get the latter if you use an attribute like #[repr(align(8))] on the newtype; in most cases you will get i32.

For completeness sake here’s the LLVM-IR of functions foo, bar and baz. It will come in handy for the next section.

; app::foo
; Function Attrs: norecurse nounwind readnone
define i32 @_ZN3app3foo17h6b6463a4119b6a8bE() unnamed_addr #0 {
; ..
}

; app::bar
; Function Attrs: nounwind
define i32 @_ZN3app3bar17h81b67229f2291d28E() unnamed_addr #1 {
; ..
}

; app::baz
; Function Attrs: noinline nounwind
define fastcc i32 @_ZN3app3baz17hf28930053d134261E() unnamed_addr #2 {
; ..
}

Metadata

How can we deal with this problem? One potential solution is to make rustc add Rust type information to the LLVM-IR in the form of metadata.

In all the previous LLVM-IR snippets I have been removing the metadata to keep them short but metadata can appear next to function definitions, function calls, function arguments, etc. and usually takes one of two forms: !scope !"string" or !scope !0. The latter form is an alias; the definition of these aliases usually appears at the bottom of the .ll file and has the form: !0 = !"string".

How would metadata help? We can use metadata to tag every function definition with its Rust type and also to tag every function pointer call with its Rust type. Then the tool can use these tags to match every function pointer call to the functions that have the same Rust type.

To illustrate the idea let’s revisit the previous example assuming that we have a custom rustc that injects metadata as described above.

The function definitions would now look like this:

; app::foo
; Function Attrs: norecurse nounwind readnone
define i32 @_ZN3app3foo17h6b6463a4119b6a8bE() unnamed_addr #0 !fn !0 {
;                                                             ^^^^^^ NEW!
; ..
}

; app::bar
; Function Attrs: nounwind
define i32 @_ZN3app3bar17h81b67229f2291d28E() unnamed_addr #1 !fn !0 {
;                                                             ^^^^^^ NEW!
; ..
}

; app::baz
; Function Attrs: noinline nounwind
define fastcc i32 @_ZN3app3baz17hf28930053d134261E() unnamed_addr #2 !fn !1 {
;                                                                    ^^^^^^ NEW!
; ..
}

; ..

; at the bottom
!0 = "fn() -> u32"
!1 = "fn() -> i32"

And the function pointer call would look like this:

define void @main() unnamed_addr #3 {
; ..
  %4 = tail call i32 %3() #8, !fn !0
;                           ^^^^^^^^ NEW!
; ..
}

With these changes the tool would be able to look at the metadata instead of the LLVM type. In this case it would know that line %4 is calling a function with signature fn() -> u32 and that only two functions exactly match that signature: foo and bar. The result would be the following call graph:

What string sits behind the metadata is not really important as long as each different type gets a unique metadata ID but if we use the string version of the type we can use that instead of the LLVM type, which would be an improvement in readability!

Trait misselection

We used LLVM types to reason about dynamic dispatch so naturally the lossy conversion from Rust types to LLVM types will be an issue, but there’s another problem specific to dynamic dispatch: namely, at call site, we don’t know which trait is being called – we only know the signature of the method that’s being called.

Here’s an example that illustrates the issue. It’s a variation of the example shown in the ‘Trait objects’ section.

(I hope that by this point you have grow accustomed to all the inline assembly, volatile operations and atomics required to prevent LLVM from optimizing the program beyond our expectations!)

static TO: Mutex<&'static (dyn Foo + Sync)> = Mutex::new(&Bar);
static X: AtomicI32 = AtomicI32::new(0);

#[entry]
fn main() -> ! {
    (*TO.lock()).foo();

    if X.load(Ordering::Relaxed).quux() {
        unsafe { asm!("NOP" : : : : "volatile") }
    }

    loop {}
}

trait Foo {
    fn foo(&self) -> bool {
        unsafe { asm!("" : : "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5)) }

        false
    }
}

struct Bar;

impl Foo for Bar {}

struct Baz;

impl Foo for Baz {
    fn foo(&self) -> bool {
        unsafe { asm!("" : : "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5) "r"(6) "r"(7)) }

        true
    }
}

trait Quux {
    fn quux(&self) -> bool;
}

impl Quux for i32 {
    #[inline(never)]
    fn quux(&self) -> bool {
        unsafe { core::ptr::read_volatile(self) > 0 }
    }
}

#[exception]
fn SysTick() {
    *TO.lock() = &Baz;
    X.fetch_add(1, Ordering::Relaxed);
}

Here’s the call graph produced by cargo-call-stack:

First, let me remind you what the dynamic dispatch looks like in LLVM-IR:

; Function Attrs: noreturn nounwind
define void @main() unnamed_addr #3 {
; ..
  %8 = tail call zeroext i1 %7({}* nonnull align 1 %4) #8
; ..
}

The type of the first argument ({}*) tells us this is dynamic dispatch and the LLVM type information tells us that the method signature is i1 ({}*), or roughly fn(&[mut] self) -> bool in Rust type system. From that info alone, we don’t know whether this is calling Foo.foo or Quux.quux; both match the i1 ({}*) signature so the tool adds edges to all implementers of these two traits.

Here’s the LLVM-IR of all the Foo.foo and Quux.quux definitions. It will be useful for the next section.

NOTE: I have omitted the mangled function names because they were too long.

; app::Foo::foo
; Function Attrs: nounwind
define zeroext i1 @_(%Bar* nocapture nonnull align 1) unnamed_addr #1 {
; ..
}

; <app::Baz as app::Foo>::foo
; Function Attrs: nounwind
define zeroext i1 @_(%Baz* nocapture nonnull align 1) unnamed_addr #1 {
; ..
}

; <i32 as app::Quux>::quux
; Function Attrs: noinline norecurse nounwind
define fastcc zeroext i1 @_(i32* align 4) unnamed_addr #2 {
; ..
}

More metadata

How can we do better in this case? We can use even more metadata!

The idea is simple: we add a metadata ID to every trait method implementation and every dynamic dispatch site; the IDs will be unique for each trait name, method name pair. The tool then can use these IDs to match the dynamic dispatch to the set of trait implementers.

Let’s illustrate with a re-evaluation of the previous example assuming a custom rustc.

All the trait method implementations would get extra metadata as shown below:

; app::Foo::foo
; Function Attrs: nounwind
define zeroext i1 @_(%Bar* nocapture nonnull align 1) unnamed_addr #1 !dyn !0 {
;                                                                     ^^^^^^^ NEW!
; ..
}

; <app::Baz as app::Foo>::foo
; Function Attrs: nounwind
define zeroext i1 @_(%Baz* nocapture nonnull align 1) unnamed_addr #1 !dyn !0 {
;                                                                     ^^^^^^^ NEW!
; ..
}

; <i32 as app::Quux>::quux
; Function Attrs: noinline norecurse nounwind
define fastcc zeroext i1 @_(i32* align 4) unnamed_addr #2 !dyn !1 {
;                                                         ^^^^^^^ NEW!
; ..
}

; at the bottom
!0 = "(dyn app::Foo).foo"
!1 = "(dyn app::Ouux).quux"

The dynamic dispatch site would also get !dyn metadata:

; Function Attrs: noreturn nounwind
define void @main() unnamed_addr #3 {
; ..
  %8 = tail call zeroext i1 %7({}* nonnull align 1 %4) #8, !dyn 0
;                                                          ^^^^^^ NEW!
; ..
}

Now the tool knows that line %8 is calling method foo on a dyn Foo trait object and will produce the following call graph:

`asm!`

Inline assembly (asm!) is not accounted for in LLVM’s stack-sizes-section analysis. I imagine that’s because asm! strings are free form and LLVM never really parses them, beyond what’s required to do string interpolation.

Here’s a contrived example that demonstrates the problem:

#[inline(never)]
#[no_mangle]
unsafe fn foo() {
    asm!(r#"
sub sp, #8
add sp, #8
"#);
}

$ cargo stack-sizes --bin app --release | head -n2
address         stack   name
0x00000400      0       foo

$ arm-none-eabi-objdump -Cd target/thumbv7m-none-eabi/release/app | sed -n '7,10p'
00000400 <foo>:
 400:   b082            sub     sp, #8
 402:   b002            add     sp, #8
 404:   4770            bx      lr

cargo stack-sizes reports zero stack usage for foo, but if you remember the example C code I used in the ‘Binary blobs’ section you’ll notice that the disassembly is the same and that clang reported 8 bytes of stack usage for the C version. Therefore the number reported for the Rust version of foo is wrong due to the presence of inline assembly.

It’s important to note that LLVM does account for the register allocation and register spilling caused by asm! blocks when it computes stack usage – we have seen this in the previous examples (remember the “spill variables onto the stack” asm! blocks?). It just doesn’t check if the assembly instructions modify the stack pointer; it always assumes that they don’t.

What the tool does today is assume that all asm! blocks use zero stack space, which we believe it’s the most common scenario, and additionally prints a warning that includes the asm! string every time it does. The application author has to inspect these warnings and make sure that all instances of inline assembly are indeed using zero stack space.

Alternative solutions that we could explore here:

Parse the assembly string to make sense out of it and compute its stack usage. This would be a ton of work to get 100% right, and it’s also a bit tricky because asm! strings are not always valid assembly; they are closer to format strings so you would have to deal with strings like "mrs $0, BASEPRI".
Keep a list of assembly strings known to use zero stack space. Only display warnings about assembly strings that are not in this list.
Assume zero stack usage and display warnings as we do today but add an option to let the end user provide stack usage information about assembly strings.

External assembly

I have covered inline assembly but there’s also external assembly! By “external assembly” I mean assembly files that are compiled on the fly or shipped pre-compiled with a crate and then called from Rust code via FFI.

External assembly has been covered in the ‘Binary blobs’ section but in this section I wanted to stress the fact that it’s not possible to get stack information about it from LLVM. For example:

  .global foo
foo:
  sub sp, 8
  add sp, 8
  bx lr

$ clang -target thumbv7m-none-eabi -fstack-size-section -c foo.s
clang-7: warning: argument unused during compilation: '-fstack-size-section'

$ size -Ax foo.o
foo.o  :
section           size   addr
.text              0x6    0x0
.ARM.attributes   0x25    0x0
Total             0x2b

No stack usage information!

I see a few ways to deal with the lack of stack usage information:

We add an option to let the end-user pass extra stack usage information about these external assembly functions.
We make the tool analyze the machine code to compute the stack usage of every subroutine for which we don’t have .stack_sizes information. This would be a ton of work to get 100% right.
We do a much more limited form of option (2) where we only try to determine if a function uses zero stack space or not. If we can’t determine the function uses zero stack space then we consider its stack usage to be unknown. This is much simpler to implement because we only have to confirm the absence of instructions that modify the stack pointer (e.g. push, sub sp, 8, etc.). Although limited in what it can analyze this approach would be enough to handle several uses of external assembly in the wild (cortex_m::asm for example).

`core::fmt`

core::fmt uses this interesting pattern that makes its call graph hard to compute. This is the relevant source code:

// src/libcore/fmt/mod.rs (1.33.0)

// NB. Argument is essentially an optimized partially applied formatting function,
// equivalent to `exists T.(&T, fn(&T, &mut Formatter) -> Result`.

struct Void {
    _priv: (),
    _oibit_remover: PhantomData<*mut dyn Fn()>,
}

/// This struct represents the generic "argument" which is taken by the Xprintf
/// family of functions. It contains a function to format the given value. At
/// compile time it is ensured that the function and the value have the correct
/// types, and then this struct is used to canonicalize arguments to one type.
pub struct ArgumentV1<'a> {
    value: &'a Void,
    formatter: fn(&Void, &mut Formatter) -> Result,
}

impl<'a> ArgumentV1<'a> {
    pub fn new<'b, T>(x: &'b T,
                      f: fn(&T, &mut Formatter) -> Result) -> ArgumentV1<'b> {
        unsafe {
            ArgumentV1 {
                formatter: mem::transmute(f),
                value: mem::transmute(x)
            }
        }
    }

    pub fn from_usize(x: &usize) -> ArgumentV1 {
        ArgumentV1::new(x, ArgumentV1::show_usize)
    }

    fn as_usize(&self) -> Option<usize> {
        if self.formatter as usize == ArgumentV1::show_usize as usize {
            Some(unsafe { *(self.value as *const _ as *const usize) })
        } else {
            None
        }
    }
}

Which gets used like this: (argv1.formatter)(argv1.value, fmt).

ArgumentV1 is very close to a trait object: it has a pointer to the data (value) and a pointer to a vtable (formatter), which in this case only contains a single method. Except that it can represent one of many similar traits. This is easier to understand with an example:

fn main() {
    println!("{} and {:?}", 1i32, 2i64);
}

This program cargo-expands into:

fn main() {
    {
        ::std::io::_print(::std::fmt::Arguments::new_v1(
            &["", " and ", "\n"],
            &match (&1i32, &2i64) {
                (arg0, arg1) => [
                    ::std::fmt::ArgumentV1::new(arg0, ::std::fmt::Display::fmt),
                    ::std::fmt::ArgumentV1::new(arg1, ::std::fmt::Debug::fmt),
                ],
            },
        ));
    };
}

In this case the println! macro creates two instances of ArgumentV1. The first one is basically &dyn fmt::Display where Self == i32 and the second one is &dyn fmt::Debug where Self == i64. This means that when main eventually invokes (argv1.formatter)(argv1.value, fmt) it could result in either <i32 as fmt::Display>::fmt or <i64 as fmt::Debug>::fmt being invoked.

Of course it’s impossible to figure that out from the types: argv1.formatter has type fn(&fmt::Void, &mut fmt::Formatter) but at runtime it holds values that originally had types like fn(&i32, &mut fmt::Formatter) and fn(&i64, &mut fmt::Formatter). This only works because the type of the first argument gets erased using a transmute operation in ArgumentV1::new.

Because core::fmt is so pervasively used the tool has a special case in the code that analyzes function pointer calls to handle fn(&fmt::Void, &mut fmt::Formatter) -> fmt::Result. Basically, it treats the fictitious node of that function pointer call as a dynamic dispatch node and adds edges from it to functions that match the signature (fn(&???, &mut fmt::Formatter) -> fmt::Result) where ??? can be any type.

Here’s an example of a minimal embedded program that uses core::fmt.

// cortex-m-semihosting = "=0.3.2"
use cortex_m_semihosting::hprintln;

#[entry]
fn main() -> ! {
    hprintln!("{}", true).ok();

    loop {}
}

And here’s its call graph

This type erasure pattern can’t be reasoned about in the general case, or at least I don’t know how one would go about it with just type information. It’s also not easy for the tool to detect (but you may be able to spot it when you look at the call graph) so it’s best to avoid using libraries that use this pattern (core is fine) if you want to keep your program analyzable.

What’s next?

Congratulations on reaching this point! This post ended up being way longer than I originally planned.

I hope you got some value out of it. I certainly found very useful to write this post. I originally started writing this last week right after I released v0.1.1 but midway I realized that cargo-call-stack still had some holes that wouldn’t be too hard to fix so I ended up working on it some more and released v0.1.2. Now that I’m about to publish this blog post I still see some low hanging fruit so we’ll probably see v0.1.3 in the near future!

So what’s next for cargo-call-stack?

Fixing more bugs.

Send a PR to rust-lang/rust to build the compiler-builtins crate with -Z emit-stack-sizes, at least when the target is ARM Cortex-M.
Implement option (3) of the possible solutions to the external assembly problem.
Write an RFC / RFE to add an experimental rustc flag that adds Rust type and trait information to the LLVM-IR as described in this post.
Real Time For the Masses (RTFM) integration / inter-operation. In this post we have focused on the stack usage of the main function, but when you have interrupts, which are common in embedded applications, you need to consider preemption to compute the overall worst-case stack usage. To properly factor in preemption you need to know the priorities of all interrupts. RTFM enforces static priorities, which is requirement for being able to statically reason about the stack usage of programs that use interrupts, and knows the priorities of all interrupts. We need some mechanism to pass RTFM information to cargo-call-stack so the latter can compute the overall worst-case stack usage.

Maybe add support for parsing ARM instructions (to get more call graph info) since I’m also working with Cortex-R cores. As of version v0.1.2 the tool only parses Thumb instructions, which is what Cortex-M cores use.

Thank you patrons! ❤️

I want to wholeheartedly thank:

Iban Eguia, Geoff Cant, Harrison Chin, Brandon Edens, whitequark, James Munns, Fredrik Lundström, Kjetil Kjeka, Kor Nielsen, Alexander Payne, Dietrich Ayala, Hadrien Grasland, vitiral, Lee Smith, Florian Uekermann, Ivan Dubrov and 56 more people for supporting my work on Patreon.

Let’s discuss on reddit.

Implementing a static stack usage analysis tool

March 13, 2019 by Jorge Aparicio