Overhead analysis of the RTFM framework

May 23, 2017 by Jorge Aparicio


In the last post I introduced the RTFM framework, and made several claims about it being highly efficient both in memory usage and runtime overhead. In this post I’ll analyze all the RTFM concurrency primitives to back up those claims. To do that I’ll first introduce a non-invasive timing method that’s accurate to a single clock cycle, which is the time the processor spends to execute one of the simplest instructions.

Let’s dive in.

NB All the measurements shown in this post have been performed on a Cortex M3 microcontroller (STM32F100RBT6B) running at 8 MHz with zero Flash memory wait states.

The timing method

DWT and CYCCNT

Cortex-M processors provide plenty of functionality for debugging and profiling programs in the form of core peripherals. One such peripheral is the Data Watchpoint and Trace (DWT) peripheral. This peripheral includes a cycle counter that counts every single core clock cycle. The core clock is the one driving the processor; its frequency is the processor frequency. This counter only makes progress when the processor is running; IOW, the counter will stop when, for example, the debugger halts the processor. The count of this cycle counter is available through the 32-bit CYCCNT register of the DWT peripheral.

How can we use this counter to measure the runtime of routines?

The obvious approach

Let’s say that we want to measure how many clock cycles it takes to execute a NOP (No OPeration) instruction. Let’s start by writing a task that executes only that instruction:

use cortex_m::asm;

fn t1(_task: Exti0Irq, _prio: P1, _thr: T1) {
    asm::nop();
}

This is the disassembly of the task when the program is compiled in release mode:

0800032e <overhead::main::INTERRUPTS::t1>:
 800032e:	bf00      	nop
 8000330:	4770      	bx	lr

The obvious approach to time this NOP is to use the std approach. As long as we don’t modify the CYCCNT register the cycle counter will effectively be a monotonically increasing timer. So, we can take a snapshot of CYCCNT before and after the NOP instruction; the difference between those two values will be the number of clock cycles spent executing the NOP instruction (spoilers well, not exactly).

Here’s a program that does that:

fn init(ref prio: P0, thr: TMax) {
    // NB the cycle counter is disabled by default
    let dwt = DWT.access(prio, thr);
    dwt.enable_cycle_counter();
}

fn t1(_task: Exti0Irq, ref prio: P1, ref thr: T1) {
    let dwt = DWT.access(prio, thr);

    let before = dwt.cyccnt.read();
    asm::nop();
    let after = dwt.cyccnt.read();

    let elapsed = after.wrapping_sub(before);

    // volatile magic to prevent LLVM from optimizing away `elapsed`'s value
    unsafe { ptr::write_volatile(0x2000_0000 as *mut _, elapsed) }

    rtfm::bkpt();
}

(The full code of this program is in the appendix. We’ll perform several modifications to this program during the rest of this post.)

This program will store the difference of the CYCCNT snapshots at address 0x2000_0000. Let’s debug the program and inspect that address.

$ arm-none-eabi-gdb target/thumbv7m-none-eabi/release/(..)
(..)
> continue
66          rtfm::bkpt();

> x 0x20000000
0x20000000:     0x00000002

The result, according to the measurement, is 2 clock cycles. That is not the number of clock cycles spent executing the NOP instruction because that number also includes the time spent reading the CYCCNT register. The correct answer is actually 1 clock cycle; the other cycle was spent reading the register.

This method is rather invasive: it heavily changes the original program and makes use of processor registers that weren’t being used before. Look at the disassembly:

0800032e <overhead::main::INTERRUPTS::t1>:
 800032e:	f241 0004 	movw	r0, #4100	; 0x1004
 8000332:	f2ce 0000 	movt	r0, #57344	; 0xe000
 8000336:	6801      	ldr	r1, [r0, #0]	; read CYCCNT
 8000338:	bf00      	nop
 800033a:	6800      	ldr	r0, [r0, #0]	; read CYCCNT
 800033c:	1a40      	subs	r0, r0, r1
 800033e:	f04f 5100 	mov.w	r1, #536870912	; 0x20000000
 8000342:	6008      	str	r0, [r1, #0]
 8000344:	be00      	bkpt	0x0000
 8000346:	4770      	bx	lr

We can do much better than that.

A better approach

Instead of reading the CYCCNT register in the program itself we can use GDB to read the register. With this approach no processor register has to be used. This approach can also be used in an interactive debug session since the cycle counter will pause its count when the processor is halted.

Let’s revise the program to use this new approach:

fn t1(_task: Exti0Irq, _prio: P1, _thr: T1) {
    rtfm::bkpt(); // read CYCCNT in GDB
    asm::nop();
    rtfm::bkpt(); // read CYCCNT in GDB
}

The disassembly now looks very similar to the original program’s:

0800032e <overhead::main::INTERRUPTS::t1>:
 800032e:	be00      	bkpt	0x0000		; read CYCCNT
 8000330:	bf00      	nop
 8000332:	be00      	bkpt	0x0000		; read CYCCNT
 8000334:	4770      	bx	lr

The CYCCNT register is located at address 0xe000_1004; this location is the same regardless of the microcontroller. Let’s debug this new program and inspect that address.

> continue
> info registers pc
pc             0x800032e        0x800032e <overhead::main::INTERRUPTS::t1>

> x 0xe0001004
0xe0001004:     0x0074e9dc

> continue
> info registers pc
pc             0x8000332        0x8000332 <overhead::main::INTERRUPTS::t1+4>

> x 0xe0001004
0xe0001004:     0x0074e9dd

> print 0x0074e9dd - 0x0074e9dc
$1 = 1

This time the measurement returns the correct answer: Executing NOP took a single clock cycle.

Armed with a proper timing method we can go ahead and start measuring the runtime overhead of the different RTFM primitives.

Tasks

Under the RTFM framework a program is split in tasks; tasks are the unit of concurrency in the RTFM framework. Tasks are usually triggered by events, but their execution can be manually requested as well. Each task is assigned a priority that indicates its urgency. Let’s analyze them first.

Scheduling

The RTFM scheduler is a tickless fully preemptive task scheduler. The scheduler decides which task to execute next depending on its priority: higher priority tasks are more urgent and have to be completed first so those tasks can preempt lower priority ones.

In the Cortex-M implementation of the RTFM framework tasks are interrupts and the Nested Vectored Interrupt Controller (NVIC) is used as the task scheduler. The NVIC takes care of servicing interrupts: that is of launching interrupts handlers (tasks) as events arrive, and also of scheduling the execution of handlers (tasks) according to their priorities. Because RTFM leverages the NVIC no task bookkeeping is done in the program; the NVIC takes care of doing all the scheduling.

IOW, the scheduling overhead is effectively zero. The NVIC, which is hardware independent of the processor, will take care of scheduling tasks, that is of deciding which task must be executed next, freeing the core processor from doing the job.

The scheduling overhead is zero

Context switching

Context switches do use processor time so let’s measure their cost.

Preemption

We’ll start measuring the context switching cost of going from a lower priority task to a higher priority task.

This is the code we’ll use to measure the context switching cost:

fn t1(_task: Exti0Irq, _prio: P1, _thr: T1) {
    rtfm::bkpt();

    // task `t2` will preempt this task
    rtfm::request(t2);

    rtfm::bkpt();
}

fn t2(_task: Exti1Irq, _prio: P2, _thr: T2) {
    rtfm::bkpt();
}

The disassembly is shown below:

08000324 <overhead::main::INTERRUPTS::t1>:
 8000324:	f24e 2000 	movw	r0, #57856	; 0xe200
 8000328:	2180      	movs	r1, #128	; 0x80
 800032a:	be00      	bkpt	0x0000
 800032c:	f2ce 0000 	movt	r0, #57344	; 0xe000
 8000330:	6001      	str	r1, [r0, #0]	; read CYCCNT
 8000332:	be00      	bkpt	0x0000		; read CYCCNT
 8000334:	4770      	bx	lr

08000336 <overhead::main::INTERRUPTS::t2>:
 8000336:	be00      	bkpt	0x0000		; read CYCCNT
 8000338:	4770      	bx	lr

The disassembly contains comments that indicate the points where the CYCCNT register will be read. The debug session of running this code is shown below:

> continue
> stepi
> # PC = 0x08000330, (CYCCNT & 0xff) = 0x3a

> continue
> # PC = 0x08000336, (CYCCNT & 0xff) = 0x45

> continue
> # pc = 0x08000332, (CYCCNT & 0xff) = 0x4f

> print 0x45 - 0x3a
$1 = 11

> print 0x4f - 0x45
$2 = 10

The first difference is 11 cycles; this is the time spent switching from the lower priority task to the higher priority one. In the ARM documentation this switching time is known as interrupt latency 1, the latency between the interrupt signal arrival and the execution of the interrupt handler.

The second difference is 10 cycles; this is the time spent switching from the higher priority task back to the lower priority one. So the total context switching cost is the sum: 21 cycles.

Interrupt latency = 11 cycles

Context switching cost (preemption) = 21 cycles

Extra register stacking

In the previous measurement the higher priority task was an empty function, and didn’t make use of any register. The context switching cost will increase if the higher priority task uses more than 5 registers because registers #6 and higher won’t be automatically saved and restored by the NVIC, so extra instructions are needed to do that. Those extra instructions will be automatically inserted by the compiler as the prologue of the task function. We need to consider the time spent executing the prologue as part of the context switching cost because the prologue will be executed before the task code we wrote is executed.

We can force the task t2 to use more than 5 registers with some assembly:

fn t1(_task: Exti0Irq, _prio: P1, _thr: T1) {
    rtfm::bkpt();

    // task `t2` will preempt this task
    rtfm::request(t2);

    rtfm::bkpt();
}

fn t2(_task: Exti1Irq, _prio: P1, _thr: T1) {
    // load values from 0 to 5 into 6 registers
    unsafe {
        asm!("" :: "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5) :: "volatile");
    }

    rtfm::bkpt();
}

The disassembly of the above code is shown below:

08000324 <overhead::main::INTERRUPTS::t1>:
 8000324:	f24e 2000 	movw	r0, #57856	; 0xe200
 8000328:	2180      	movs	r1, #128	; 0x80
 800032a:	be00      	bkpt	0x0000
 800032c:	f2ce 0000 	movt	r0, #57344	; 0xe000
 8000330:	6001      	str	r1, [r0, #0]	; read CYCCNT
 8000332:	be00      	bkpt	0x0000		; read CYCCNT
 8000334:	4770      	bx	lr

08000336 <overhead::main::INTERRUPTS::t2>:
 8000336:	b580      	push	{r7, lr}
 8000338:	466f      	mov	r7, sp
 800033a:	f04f 0c00 	mov.w	ip, #0		; read CYCCNT
 800033e:	f04f 0e01 	mov.w	lr, #1
 8000342:	2202      	movs	r2, #2
 8000344:	2303      	movs	r3, #3
 8000346:	2004      	movs	r0, #4
 8000348:	2105      	movs	r1, #5
 800034a:	be00      	bkpt	0x0000		; read CYCCNT
 800034c:	bd80      	pop	{r7, pc}

The push and the following mov instructions at address 0x08000336 are the prologue of the function t2. For the measurement we’ll read the CYCCNT register after the prologue of t2 has been executed as indicated in the comments of the disassembly.

The debug session is shown bellow:

> continue
> stepi
> # PC = 0x08000330, (CYCCNT & 0xff) = 0xd4

> break overhead::main::INTERRUPTS::t2
> continue
> # PC = 0x0800033a, (CYCCNT & 0xff) = 0xe3

> continue
> # PC = 0x0800034a, (CYCCNT & 0xff) = 0xe9

> continue
> # PC = 0x08000332, (CYCCNT & 0xff) = 0xf6

> print 0xe3 - 0xd4
$1 = 15

> print 0xf6 - 0xe9
$2 = 13

The interrupt latency increases to 15 cycles and the total switching cost increases to 28 cycles. Let’s update our numbers:

Interrupt latency = 11-15 cycles

Context switching cost (preemption) = 21-28 cycles

vs function calls

Task preemption looks very similar to function calls except that is the NVIC, and not the user, who calls the tasks. Let’s see how the runtime cost of preemption compares to the runtime cost of doing a function call.

Consider the following program:

fn t1(_task: Exti0Irq, _prio: P1, _thr: T1) {
    rtfm::bkpt();

    foo();

    rtfm::bkpt();
}

#[inline(never)]
fn foo() {
    rtfm::bkpt();
}

With disassembly:

0800032e <overhead::foo>:
 800032e:	be00      	bkpt	0x0000		; read CYCCNT
 8000330:	4770      	bx	lr

08000332 <overhead::main::INTERRUPTS::t1>:
 8000332:	b580      	push	{r7, lr}
 8000334:	466f      	mov	r7, sp
 8000336:	be00      	bkpt	0x0000		; read CYCCNT
 8000338:	f7ff fff9 	bl	800032e <overhead::foo>
 800033c:	be00      	bkpt	0x0000		; read CYCCNT
 800033e:	bd80      	pop	{r7, pc}

The debug session reports 4 cycles of overhead:

> continue
> # PC = 0x08000336, (CYCCNT & 0xff) = 0x70

> continue
> # PC = 0x0800032e, (CYCCNT & 0xff) = 0x72

> continue
> # PC = 0x0800033c, (CYCCNT & 0xff) = 0x74

> print 0x74 - 0x70
$1 = 4

Like before we can repeat the measurement but changing foo to use more than 5 registers.

Here’s the revised program:

fn t1(_task: Exti0Irq, _prio: P1, _thr: T1) {
    rtfm::bkpt();

    foo();

    rtfm::bkpt();
}

#[inline(never)]
fn foo() {
    unsafe {
        asm!("" :: "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5) :: "volatile");
    }

    rtfm::bkpt();
}

Disassembly:

0800032e <overhead::foo>:
 800032e:	b580      	push	{r7, lr}
 8000330:	466f      	mov	r7, sp
 8000332:	f04f 0c00 	mov.w	ip, #0		; read CYCCNT
 8000336:	f04f 0e01 	mov.w	lr, #1
 800033a:	2202      	movs	r2, #2
 800033c:	2303      	movs	r3, #3
 800033e:	2004      	movs	r0, #4
 8000340:	2105      	movs	r1, #5
 8000342:	be00      	bkpt	0x0000		; read CYCCNT
 8000344:	bd80      	pop	{r7, pc}

08000346 <overhead::main::INTERRUPTS::t1>:
 8000346:	b580      	push	{r7, lr}
 8000348:	466f      	mov	r7, sp
 800034a:	be00      	bkpt	0x0000		; read CYCCNT
 800034c:	f7ff ffef 	bl	800032e <overhead::foo>
 8000350:	be00      	bkpt	0x0000		; read CYCCNT
 8000352:	bd80      	pop	{r7, pc}

And the interactive session reports 12 cycles of overhead:

> continue
> # PC = 0x0800034a, (CYCCNT & 0xff) = 0x56

> break overhead::foo
> continue
> # PC = 0x08000332, (CYCCNT & 0xff) = 0x5c

> continue
> # PC = 0x08000342, (CYCCNT & 0xff) = 0x62

> continue
> # PC = 0x08000350, (CYCCNT & 0xff) = 0x68

> print (0x5c - 0x56) + (0x68 - 0x62)
$1 = 12

In conclusion,

Function call cost = 4-12 cycles

So context switching due to preemption is about 2x slower than function calls.

Tail chaining

Another case that we need to analyze is when an event arrives during the execution of a task but there’s no preemption. The following program showcases that scenario:

fn t1(_task: Exti0Irq, _prio: P1, _thr: T1) {
    // no preemption (tasks have the same priority)
    rtfm::request(t2);

    rtfm::bkpt();
}

fn t2(_task: Exti1Irq, _prio: P1, _thr: T1) {
    rtfm::bkpt();
}

As both tasks have the same priority no preemption occurs: task t2 will be executed after task t1 ends. This is known as tail chaining because the context will switch from t1 to t2 without returning to idle.

Disassembly below:

08000324 <overhead::main::INTERRUPTS::t1>:
 8000324:	f24e 2000 	movw	r0, #57856	; 0xe200
 8000328:	2180      	movs	r1, #128	; 0x80
 800032a:	f2ce 0000 	movt	r0, #57344	; 0xe000
 800032e:	6001      	str	r1, [r0, #0]
 8000330:	be00      	bkpt	0x0000		; read CYCCNT
 8000332:	4770      	bx	lr

08000334 <overhead::main::INTERRUPTS::t2>:
 8000334:	be00      	bkpt	0x0000		; read CYCCNT
 8000336:	4770      	bx	lr

Debug session:

> continue
> # PC = 0x08000330, CYCCNT = 0x54

> continue
> # PC = 0x08000334, CYCCNT = 0x5a

> print 0x5a - 0x54
$1 = 6

The cost of this kind of context switch is 6 cycles.

Let’s measure again but changing t2 to use more than 5 registers.

Revised program:

fn t1(_task: Exti0Irq, _prio: P1, _thr: T1) {
    rtfm::request(t2);

    rtfm::bkpt();
}

fn t2(_task: Exti1Irq, _prio: P1, _thr: T1) {
    unsafe {
        asm!("" :: "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5) :: "volatile");
    }
}

Disassembly:

08000324 <overhead::main::INTERRUPTS::t1>:
 8000324:	f24e 2000 	movw	r0, #57856	; 0xe200
 8000328:	2180      	movs	r1, #128	; 0x80
 800032a:	f2ce 0000 	movt	r0, #57344	; 0xe000
 800032e:	6001      	str	r1, [r0, #0]
 8000330:	be00      	bkpt	0x0000		; read CYCCNT
 8000332:	4770      	bx	lr

08000334 <overhead::main::INTERRUPTS::t2>:
 8000334:	b580      	push	{r7, lr}
 8000336:	466f      	mov	r7, sp
 8000338:	f04f 0c00 	mov.w	ip, #0		; read CYCCNT
 800033c:	f04f 0e01 	mov.w	lr, #1
 8000340:	2202      	movs	r2, #2
 8000342:	2303      	movs	r3, #3
 8000344:	2004      	movs	r0, #4
 8000346:	2105      	movs	r1, #5
 8000348:	bd80      	pop	{r7, pc}

Debug session:

> continue
> # PC = 0x08000330, CYCCNT = 0x50

> break overhead::main::INTERRUPTS::t2
> continue
> # PC = 0x08000338, CYCCNT = 0x5a

> print 0x5a - 0x50
$1 = 10

10 cycles of overhead in the case where t2 includes a prologue. In conclusion:

Context switching cost (tail chaining) = 6-10 cycles

Which is around the overhead of function calls.

Memory overhead

As the NVIC does all the scheduling the processor doesn’t keep track of the running tasks so there’s no memory use on that front. The RTFM framework uses task priorities in its API, but as priorities remain fixed at runtime they are tracked in the type system and not stored in memory at runtime. In conclusion:

Memory overhead per task = 0 bytes of .bss / .data / .heap memory

Setup cost

There is no memory overhead per task, but there is a small setup cost per task. Let’s take a look at that:

Zero tasks

Consider the following program with zero tasks:

#[inline(never)]
fn init(_prio: P0, _thr: &TMax) {
    // Just to make sure LLVM that doesn't optimize away this function
    rtfm::bkpt();
}

#[inline(never)]
fn idle(_prio: P0, _thr: T0) -> ! {
    // Sleep
    loop {
        rtfm::wfi();
    }
}

Both init and idle have been marked as inline(never) to make the analysis easier. Here’s the disassembly of the program:

08000340 <cortex_m_rt::reset_handler>:
 8000340:	b5d0      	push	{r4, r6, r7, lr}
 8000342:	af02      	add	r7, sp, #8
 8000344:	f240 0000 	movw	r0, #0
 8000348:	f240 0100 	movw	r1, #0
 800034c:	f2c2 0000 	movt	r0, #8192	; 0x2000
 8000350:	f2c2 0100 	movt	r1, #8192	; 0x2000
 8000354:	1a09      	subs	r1, r1, r0
 8000356:	f021 0103 	bic.w	r1, r1, #3
 800035a:	f000 f85a 	bl	8000412 <__aeabi_memclr4>
 800035e:	f240 0000 	movw	r0, #0
 8000362:	f240 0100 	movw	r1, #0
 8000366:	f2c2 0000 	movt	r0, #8192	; 0x2000
 800036a:	f2c2 0100 	movt	r1, #8192	; 0x2000
 800036e:	1a09      	subs	r1, r1, r0
 8000370:	f021 0203 	bic.w	r2, r1, #3
 8000374:	f240 4124 	movw	r1, #1060	; 0x424
 8000378:	f6c0 0100 	movt	r1, #2048	; 0x800
 800037c:	f000 f83f 	bl	80003fe <__aeabi_memcpy4>
 8000380:	f240 0000 	movw	r0, #0
 8000384:	f2c0 0000 	movt	r0, #0
 8000388:	7800      	ldrb	r0, [r0, #0]
 800038a:	f3ef 8410 	mrs	r4, PRIMASK
 800038e:	b672      	cpsid	i
 8000390:	f7ff ffd2 	bl	8000338 <overhead::init>
 8000394:	f014 0f01 	tst.w	r4, #1
 8000398:	d100      	bne.n	800039c <cortex_m_rt::reset_handler+0x5c>
 800039a:	b662      	cpsie	i
 800039c:	f7ff ffce 	bl	800033c <overhead::idle>

reset_handler is the entry point of the program. This routine will call the init function and then idle function, but before it does that it initializes RAM as evidenced by the calls to memclr4 and memcpy4. RAM initialization is required by all programs so we won’t count it as part of the overhead of the RTFM framework.

After RAM initialization init will be called within a global critical section (rtfm::atomic) hence the cpsid i and cpsie i instructions around the call. After init returns the critical section ends and idle gets called.

One task

Let’s now add one task to the program:

#[inline(never)]
fn init(_prio: P0, _thr: &TMax) {
    rtfm::bkpt();
}

#[inline(never)]
fn idle(_prio: P0, _thr: T0) -> ! {
    loop {
        rtfm::wfi();
    }
}

tasks!(stm32f100xx, {
    t1: Task {
        interrupt: Exti0Irq,
        priority: P1,
        enabled: true,
    },
});

fn t1(_task: Exti0Irq, _prio: P1, _thr: T1) {}

The disassembly of reset_handler becomes:

08000338 <cortex_m_rt::reset_handler>:
(..)
 8000386:	b672      	cpsid	i       	; start of critical section
 8000388:	f7ff ffd1 	bl	800032e <overhead::init>
 800038c:	f24e 4006 	movw	r0, #58374	; 0xe406
 8000390:	21f0      	movs	r1, #240	; 0xf0
 8000392:	f014 0f01 	tst.w	r4, #1
 8000396:	f2ce 0000 	movt	r0, #57344	; 0xe000
 800039a:	7001      	strb	r1, [r0, #0]
 800039c:	f24e 1000 	movw	r0, #57600	; 0xe100
 80003a0:	f04f 0140 	mov.w	r1, #64	; 0x40
 80003a4:	f2ce 0000 	movt	r0, #57344	; 0xe000
 80003a8:	6001      	str	r1, [r0, #0]
 80003aa:	d100      	bne.n	80003ae <cortex_m_rt::reset_handler+0x76>
 80003ac:	b662      	cpsie	i   		; end of critical section
 80003ae:	f7ff ffc0 	bl	8000332 <overhead::idle>

There’s now extra code between the call to init and the end of the global critical section. This extra code takes care of assigning priorities to tasks (interrupts), and also of enabling the tasks (interrupts) that were declared as enabled: true in the tasks! macro.

The RTFM framework doesn’t add any extra code to tasks. Only the code you have written will be executed. Here’s the disassembly of the empty task t1 as a proof:

0800032e <overhead::main::INTERRUPTS::t1>:
 800032e:	4770      	bx	lr

N tasks

Let’s add one more task:

tasks!(stm32f100xx, {
    t1: Task {
        interrupt: Exti0Irq,
        priority: P1,
        enabled: true,
    },
    t2: Task {
        interrupt: Exti1Irq,
        priority: P2,
        enabled: true,
    },
});

fn t1(_task: Exti0Irq, _prio: P1, _thr: T1) {}

fn t2(_task: Exti1Irq, _prio: P2, _thr: T2) {}

With this change the disassembly of reset_handler becomes:

08000330 <cortex_m_rt::reset_handler>:
(..)
 800037e:	b672      	cpsid	i   		; start of critical section
 8000380:	f7ff ffd0 	bl	8000324 <overhead::init>
 8000384:	f24e 4006 	movw	r0, #58374	; 0xe406
 8000388:	21f0      	movs	r1, #240	; 0xf0
 800038a:	f014 0f01 	tst.w	r4, #1
 800038e:	f2ce 0000 	movt	r0, #57344	; 0xe000
 8000392:	7001      	strb	r1, [r0, #0]
 8000394:	f04f 01e0 	mov.w	r1, #224	; 0xe0
 8000398:	7041      	strb	r1, [r0, #1]
 800039a:	f24e 1000 	movw	r0, #57600	; 0xe100
 800039e:	f04f 0140 	mov.w	r1, #64	; 0x40
 80003a2:	f2ce 0000 	movt	r0, #57344	; 0xe000
 80003a6:	6001      	str	r1, [r0, #0]
 80003a8:	f04f 0180 	mov.w	r1, #128	; 0x80
 80003ac:	6001      	str	r1, [r0, #0]
 80003ae:	d100      	bne.n	80003b2 <cortex_m_rt::reset_handler+0x82>
 80003b0:	b662      	cpsie	i   		; end of critical section
 80003b2:	f7ff ffb9 	bl	8000328 <overhead::idle>

If you keep adding tasks and measure the time that takes to go from the end of init to the start of idle you’ll find out that the task setup code takes O(N) cycles where N is the number of tasks. The exact linear constant depends on whether all the task have the same priority, or each one has a different priority but 4 to 6 cycles per task is usual.

Setup runtime cost = O(N) cycles where N = number of tasks

Shared call stack

How does the RTFM framework makes use of stack memory? How are tasks allocated on the stack? Let’s see what threaded systems do first.

On threaded systems each thread is assigned its own call stack as shown below:

Multithreaded stack

The image depicts three threads: T1, T2 and T3. The stack of each thread can grow independently so its possible for the stack of one thread to overflow into the stack of the next thread, corrupting it. For this reason each thread is assigned a maximum stack size, depicted in the above figure by the black boundaries between each thread stack. To enforce these boundaries usually the Memory Protection Unit (MPU) is used; the MPU can detect stack overflows and raise an exception when they occur.

The maximum stack size of a thread must be chosen carefully. A large stack size limits the number of threads that can be running at a given time. For example, in the above figure only 3 threads fit in memory. OTOH, a small stack size makes threads more prone to stack overflows.

Under the RTFM framework all tasks share a single call stack as shown below:

RTFM stack

This looks like a compacted version of the multithreaded memory layout.

Under the RTFM scheduler once a high priority task starts its execution all the other lower priority tasks can’t resume execution until after the high priority task is over. Because of this the stacks of suspended tasks never grow so it’s not necessary to reserve space for each task. That’s why we don’t have this empty space between the stack of each task as in the multithreaded case.

Although less likely stack overflows are still possible: the shared call stack can overflow into the heap region. Again, one can use the MPU to protect against such condition.

So how much stack space does each task use? Let’s find out.

Consider the following program:

fn idle(_prio: P0, _thr: T0) -> ! {
    rtfm::bkpt();
    rtfm::request(t1);

    // Sleep
    loop {
        rtfm::wfi();
    }
}

fn t1(_task: Exti0Irq, _prio: P1, _thr: T1) {
    let x = 42;

    rtfm::bkpt();

    // `t2` will immediately preempt this task
    rtfm::request(t2);

    // Force LLVM to allocate `x` on the stack
    unsafe {
        ptr::read_volatile(&x);
   }
}

fn t2(_task: Exti1Irq, _prio: P2, _thr: T2) {
    let y = 24;

    rtfm::bkpt();

    // Force LLVM to allocate `y` on the stack
    unsafe {
        ptr::read_volatile(&y);
    }
}

Disassembly:

08000324 <overhead::main::INTERRUPTS::t1>:
 8000324:	b081      	sub	sp, #4
 8000326:	202a      	movs	r0, #42
 8000328:	2180      	movs	r1, #128
 800032a:	9000      	str	r0, [sp, #0]	; x = 42
 800032c:	f24e 2000 	movw	r0, #57856
 8000330:	be00      	bkpt	0x0000  	; t1's BKPT
 8000332:	f2ce 0000 	movt	r0, #57344
 8000336:	6001      	str	r1, [r0, #0]	; rtfm::request
 8000338:	9800      	ldr	r0, [sp, #0]	; ptr::read_volatile
 800033a:	b001      	add	sp, #4
 800033c:	4770      	bx	lr

0800033e <overhead::main::INTERRUPTS::t2>:
 800033e:	b081      	sub	sp, #4
 8000340:	2018      	movs	r0, #24
 8000342:	9000      	str	r0, [sp, #0]	; y = 24
 8000344:	be00      	bkpt	0x0000  	; t2's BKPT
 8000346:	9800      	ldr	r0, [sp, #0]	; ptr::read_volatile
 8000348:	b001      	add	sp, #4
 800034a:	4770      	bx	lr

0800034c <cortex_m_rt::reset_handler>:
(..)
 80003d6:	b662      	cpsie	i
 80003d8:	f44f 4152 	mov.w	r1, #53760
 80003dc:	be00      	bkpt	0x0000      	; idle's BKPT
 80003de:	f840 c001 	str.w	ip, [r0, r1]	; rtfm::request
 80003e2:	bf30      	wfi
 80003e4:	e7fd      	b.n	80003e2 <cortex_m_rt::reset_handler+0x96>

Now let’s debug it:

> continue
> # IDLE: PC = 0x080003dc, SP = 0x20001ff8

We hit idle’s breakpoint; just before rtfm::request(t1) is called. Let’s print the values of some registers at this point. We’ll see in a bit why they are relevant.

> info registers r0 r1 r2 r3 r12 lr pc xPSR
r0             0xe0001000
r1             0xd200
r2             0x0
r3             0xd100
r12            0x40
lr             0x800038d
pc             0x80003dc
xPSR           0x61000000

We continue the program execution and reach t1’s breakpoint.

> continue
> # T1: PC = 0x08000330, SP = 0x20001fd4

At this point x has already been allocated on the stack so if we inspect the stack around the stack pointer (SP) we should see its value:

> x/12x $sp
0x20001fd4:    (0x0000002a)    [0xe0001000      0x0000d200      0x00000000
0x20001fe4:     0x0000d100      0x00000040      0x0800038d      0x080003e4
0x20001ff4:     0x61000000]     0x20001ff8      0xffffffff      0x00000000

NB The parentheses and square brackets were added by me; they are not part of GDB’s output.

At address 0x20001fd4 we see (0x0000002a); this is the local variable x. Next to that value we see some familiar looking values: [0xe0001000 .. 0x61000000]; these values match the output of the info registers command executed before. Those values are, in fact, a snapshot of idle’s state that was pushed into the stack by the NVIC. They’re there because once t1 returns the NVIC will restore those values to their corresponding registers to resume idle’s execution.

Here’s a snapshot of the same registers at PC = 0x08000336:

> stepi
> # T1: PC = 0x08000336, SP = 0x20001fd4

> info registers r0 r1 r2 r3 r12 lr pc xPSR
r0             0xe000e200
r1             0x80
r2             0x0
r3             0xd100
r12            0x40
lr             0xfffffff9
pc             0x8000336
xPSR           0x21000016

These values don’t (necessarily) match the ones that are stored in the stack.

Let’s now skip to t2’s breakpoint:

> continue
> # T2: PC = 0x08000344, SP = 0x20001fb0

Here’s the state of the stack at that point:

> x/20x $sp
0x20001fb0:    (0x00000018)    [0x0000002a      0x00000080      0x00000000
0x20001fc0:     0x0000d100      0x00000040      0xfffffff9      0x0800033a
0x20001fd0:     0x21000016]    (0x0000002a)    [0xe0001000      0x0000d200
0x20001fe0:     0x00000000      0x0000d100      0x00000040      0x0800038d
0x20001ff0:     0x080003e4      0x61000000]     0x20001ff8      0xffffffff

(0x00000018) is the local variable y, and [0x0000002a .. 0x21000016] is a snapshot of t1’s state. If you are a careful reader then you probably noticed that not all those values match the output of the last info register command. The difference is due to when the registers were stacked: the stacked PC value is 0x0800033a and the PC value from the info register’s output is 0x08000336 so the stacking happened after the info register command was issued.

Next on the stack is (0x0000002a), t1’s x value, and finally [0xe0001000 .. 0x61000000], idle’s state from before.

So in conclusion to preserve the state of the lower priority task during preemption at least 8 words of information have to be stored on the stack. Remember the push instruction in function prologues? That instruction pushes more registers, the ones that are not stacked by default, into the stack. So, function with prologues use more stack space.

Stack usage per suspended task = at least 8 words (32 bytes)

That’s all for the realm of tasks. Let’s now move onto the data abstractions.

Task local data

The task local data abstraction, Local, is used to preserve state across the different runs of a task.

Local.borrow

Local provides two methods to access the inner data: borrow and borrow_mut. Both methods have zero synchronization overhead as Local data is confined to a single task.

Let’s confirm this claim by comparing this program which increases a counter:

fn t1(ref mut task: Exti0Irq, _prio: P1, _thr: T1) {
    static COUNTER: Local<u32, Exti0Irq> = Local::new(0);

    let state = COUNTER.borrow_mut(task);
    *state += 1;
}

Disassembly:

0800032e <overhead::main::INTERRUPTS::t1>:
 800032e:	f240 0000 	movw	r0, #0
 8000332:	f2c2 0000 	movt	r0, #8192	; 0x2000
 8000336:	6801      	ldr	r1, [r0, #0]
 8000338:	3101      	adds	r1, #1
 800033a:	6001      	str	r1, [r0, #0]
 800033c:	4770      	bx	lr

Against its unsynchronized, memory unsafe version shown below:

fn t1(_task: Exti0Irq, _prio: P1, _thr: T1) {
    static mut COUNTER: u32 = 0;

    unsafe { COUNTER += 1 }
}

Disassembly:

0800032e <overhead::main::INTERRUPTS::t1>:
 800032e:	f240 0000 	movw	r0, #0
 8000332:	f2c2 0000 	movt	r0, #8192	; 0x2000
 8000336:	6801      	ldr	r1, [r0, #0]
 8000338:	3101      	adds	r1, #1
 800033a:	6001      	str	r1, [r0, #0]
 800033c:	4770      	bx	lr

Both versions produce exactly the same code, but the Local version is verified to be memory safe by the compiler. Note how the task token doesn’t appear in the disassembly; as the task token is a zero sized type it doesn’t exit at runtime.

Memory overhead

Local is just a newtype over the protected data and imposes no overhead in terms of memory.

You can confirm that with the following program:

fn t1(_task: Exti0Irq, _prio: P1, _thr: T1) {
    static L1: Local<(), Exti0Irq> = Local::new(());
    static L2: Local<u8, Exti0Irq> = Local::new(0);
    static L3: Local<u16, Exti0Irq> = Local::new(0);
    static L4: Local<u32, Exti0Irq> = Local::new(0);
    static L5: Local<u64, Exti0Irq> = Local::new(0);

    hprintln!("{}", mem::size_of_val(&L1));
    hprintln!("{}", mem::size_of_val(&L2));
    hprintln!("{}", mem::size_of_val(&L3));
    hprintln!("{}", mem::size_of_val(&L4));
    hprintln!("{}", mem::size_of_val(&L5));
}

which prints:

$ openocd -f (..)
(..)
0
1
2
4
8

Conclusion

In conclusion the Local abstraction is no different than using an unsafe static mut variable in terms of memory usage and the runtime cost of accessing it. In Rust we call these abstractions zero cost abstractions. Local is a zero cost abstraction that makes global variables memory safe by pinning them to a single task.

Resources

The RTFM framework provides a Resource abstraction that can be used to safely share data between tasks. Let’s analyze their overhead.

Resource.access

Resource provides an access method that grants access to its inner data. Some conditions need to be met for access to work; if any of those conditions is not met then the program doesn’t compile. Once the conditions are met the access method, itself, is zero cost.

Let’s confirm that by comparing the counting program ported to use a Resource:

static COUNTER: Resource<Cell<u32>, C1> = Resource::new(Cell::new(0));

fn t1(_: Exti0, ref prio: P1, ref thr: T1) {
    let counter = COUNTER.access(prio, thr);
    counter.set(counter.get() + 1);
}

Disassembly:

0800032e <overhead::main::INTERRUPTS::t1>:
 800032e:	f240 0000 	movw	r0, #0
 8000332:	f2c2 0000 	movt	r0, #8192	; 0x2000
 8000336:	6801      	ldr	r1, [r0, #0]
 8000338:	3101      	adds	r1, #1
 800033a:	6001      	str	r1, [r0, #0]
 800033c:	4770      	bx	lr

Against its unsynchronized, memory unsafe version (which we already saw in the Local.borrow section):

fn t1(_task: Exti0Irq, _prio: P1, _thr: T1) {
    static mut COUNTER: u32 = 0;

    unsafe { COUNTER += 1 }
}

Disassembly:

0800032e <overhead::main::INTERRUPTS::t1>:
 800032e:	f240 0000 	movw	r0, #0
 8000332:	f2c2 0000 	movt	r0, #8192	; 0x2000
 8000336:	6801      	ldr	r1, [r0, #0]
 8000338:	3101      	adds	r1, #1
 800033a:	6001      	str	r1, [r0, #0]
 800033c:	4770      	bx	lr

The produced code is exactly the same. Again, the tokens don’t exist at runtime.

Threshold.raise

When a resource is accessed by two tasks that have different priorities the lowest priority task will have to create a critical section using the Threshold.raise method. For the span of this critical section the task’s preemption threshold is temporarily raised to prevent the higher priority task from preempting the lower priority one. Only within this critical section can the lower priority task access the resource in a memory safe manner that’s free of data races.

Let’s see what’s the overhead of a Threshold.raise critical section by timing the following program:

static R1: Resource<(), C2> = Resource::new(());

fn t1(_: Exti0, _: P1, thr: T1) {
    rtfm::bkpt(); // before

    thr.raise(
        &R1, |_thr| {
            rtfm::bkpt(); // inside
        }
    );

    rtfm::bkpt(); // after
}

Disassembly:

0800032e <overhead::main::INTERRUPTS::t1>:
 800032e:	21e0      	movs	r1, #224	; 0xe0
 8000330:	be00      	bkpt	0x0000		; before
 8000332:	f3ef 8011 	mrs	r0, BASEPRI
 8000336:	f381 8812 	msr	BASEPRI_MAX, r1
 800033a:	be00      	bkpt	0x0000		; inside
 800033c:	f380 8811 	msr	BASEPRI, r0
 8000340:	be00      	bkpt	0x0000		; after
 8000342:	4770      	bx	lr

In the disassembly you can see the secret sauce of the RTFM framework: the BASEPRI register. The value of this register is the preemption threshold of the system. The critical section is started by raising the preemption threshold, using the msr BASEPRI_MAX instruction, and then finished by restoring the previous preemption threshold, using the msr BASEPRI instruction.

So how long does it take to enter and exit the critical section?

> continue
> # PC = 0x08000330, (CYCCNT & 0xff) = 0xdb

> continue
> # PC = 0x0800033a, (CYCCNT & 0xff) = 0xde

> continue
> # PC = 0x08000340, (CYCCNT & 0xff) = 0xdf

> print 0xdf - 0xdb
$1 = 4

Entering and leaving the critical section takes only 4 cycles. This overhead is the same regardless of the number of tasks that get blocked by the critical section. So, O(1) runtime cost.

Critical section overhead = 4 cycles

vs rtfm::atomic

How does this overhead compare to the overhead of a global 2 critical section (rtfm::atomic)? Let’s find out with this program:

fn t1(_task: Exti0Irq, _prio: P1, ref thr: T1) {
    rtfm::bkpt(); // before

    rtfm::atomic(
        |_| {
            rtfm::bkpt(); // inside
        }
    );

    rtfm::bkpt(); // after
}

Here’s the disassembly of the program:

0800032e <overhead::main::INTERRUPTS::t1>:
 800032e:	be00      	bkpt	0x0000		; before
 8000330:	f3ef 8010 	mrs	r0, PRIMASK
 8000334:	b672      	cpsid	i
 8000336:	be00      	bkpt	0x0000		; inside
 8000338:	f010 0f01 	tst.w	r0, #1
 800033c:	d100      	bne.n	8000340 <overhead::main::INTERRUPTS::t1+0x12>
 800033e:	b662      	cpsie	i
 8000340:	be00      	bkpt	0x0000		; after
 8000342:	4770      	bx	lr

Global critical sections use the PRIMASK register to block all tasks. The critical section is started by disabling all the tasks, using the cpsid i instruction (sets PRIMASK to 1), and finished by re-enabling them, using the cpsie i instruction (sets PRIMASK to 0). There’s a catch here: if the tasks were already disabled before the critical section started then they should not be re-enabled when the critical section ends – if you were wondering: this situation occurs when rtfm::atomic sections are nested. This is why PRIMASK is read before starting the critical section: to check whether the tasks were already disabled or not.

How do critical sections fare in terms of runtime overhead?

> continue
> # PC = 0x0800032e, (CYCCNT & 0xff) = 0xe7

> continue
> # PC = 0x08000336, (CYCCNT & 0xff) = 0xea

> continue
> # PC = 0x08000340, (CYCCNT & 0xff) = 0xed

> print 0xed - 0xe7
$1 = 6

Entering and leaving a global critical section takes 6 cycles; this is slower than the Threshold.raise critical section!

In conclusion RTFM style critical sections not only impose less task blocking than global critical sections; they also have a lower runtime overhead.

Memory overhead

Resources have zero memory overhead. Like Local, Resource is just a newtype over the protected data. The ceiling of each resource is fixed, and it’s tracked in the type system so it’s not stored in memory at runtime.

You can confirm that with the following program:

static R1: Resource<(), C1> = Resource::new(());
static R2: Resource<u8, C1> = Resource::new(0);
static R3: Resource<u16, C1> = Resource::new(0);
static R4: Resource<u32, C1> = Resource::new(0);
static R5: Resource<u64, C1> = Resource::new(0);

fn t1(_task: Exti0Irq, _prio: P1, _thr: T1) {
    hprintln!("{}", mem::size_of_val(&R1));
    hprintln!("{}", mem::size_of_val(&R2));
    hprintln!("{}", mem::size_of_val(&R3));
    hprintln!("{}", mem::size_of_val(&R4));
    hprintln!("{}", mem::size_of_val(&R5));
}

which prints:

$ openocd -f (..)
(..)
0
1
2
4
8

A nonzero cost pattern

There is catch with resources: they don’t hand out mutable references (&mut-) to their inner data, only shared references. To achieve mutation through shared references either a Cell or a RefCell must be used. If you are dealing with primitives like i32 and bool then Cell gives you a zero cost way to mutate the data, but anything more complex that requires mutation via &mut self will require a RefCell. The problem with RefCells is that they have obligatory runtime checks to enforce that Rust borrowing rules are preserved.

access_mut?

From the POV of concurrency if a task meets the conditions to access a resource then it has exclusive access to that resource as no other task can preempt it and access the same resource. So, from the POV of concurrency access returning a mutable reference is perfectly valid. However, access returning a mutable reference doesn’t sit well with Rust borrowing rules: mutable aliasing is a problem even within a single thread / context / task as it can lead to pointer invalidation 3 4.

The task local data abstraction faces the same situation, but Local does provide a safe borrow_mut method that hands out mutable references. Why is that method safe? Because of its signature:

impl<DATA, TASK> Local<DATA, TASK> {
    pub fn borrow_mut<'task>(
        &'static self,
        _task: &'task mut TASK,
    ) -> &'task mut DATA {
        ..
    }
}

borrow_mut takes a mutable reference to the task token; this freezes the task token making it impossible to use borrow_mut for as long as the returned mutable reference is in scope. This does prevent mutable aliasing as shown below:

fn t1(ref mut task: Exti0Irq, _prio: P1, _thr: T1) {
    static STATE: Local<i32, Exti0Irq> = Local::new(0);

    let state: &mut i32 = STATE.borrow_mut(task);
    let aliased_state: &mut i32 = STATE.borrow_mut(task);
    //~^ error: cannot borrow `*task` as mutable more than once at a time
}

However this safety mechanism is also rather restrictive because you can’t mutably borrow two different resources within the same scope:

fn t1(ref mut task: Exti0Irq, _prio: P1, _thr: T1) {
    static A: Local<i32, Exti0Irq> = Local::new(0);
    static B: Local<i32, Exti0Irq> = Local::new(0);

    // this is valid and safe but ...
    let a: &mut i32 = A.borrow_mut(task);
    let b: &mut i32 = B.borrow_mut(task);
    //~^ error: cannot borrow `*task` as mutable more than once at a time

    *a += 1;
    *b += 2;
}

This can, somehow, be worked around by minimizing the spans of the mutable borrows as shown below:

fn t1(ref mut task: Exti0Irq, _prio: P1, _thr: T1) {
    static A: Local<i32, Exti0Irq> = Local::new(0);
    static B: Local<i32, Exti0Irq> = Local::new(0);

    *A.borrow_mut(task) += 1;
    *B.borrow_mut(task) += 2;
}

But this workaround is rather unergonomic, and doesn’t work when you need to pass two mutable references to a function / method.

fn t1(ref mut task: Exti0Irq, _prio: P1, _thr: T1) {
    static A: Local<i32, Exti0Irq> = Local::new(0);
    static B: Local<i32, Exti0Irq> = Local::new(0);

    mem::swap(A.borrow_mut(task), B.borrow_mut(task));
    //~^ error: cannot borrow `*task` as mutable more than once at a time
}

So can we add an access_mut method to Resource that behaves like Local.borrow_mut? Turns out making the lifetime constraints work out is tricky because the access takes two arguments and already has lifetime constraint wrt to the threshold token. It’s also likely that access_mut will hit the borrow restrictions shown above much more often that in the Local case so I’m not sure if access_mut will actually help or rather only cause more frustration 5.

RefCell overhead

Until we find a proper solution to the borrow restriction we have to use RefCells but how much overhead do they impose over the ideal solution that has no runtime checks? Let’s check with this program:

static COUNTER: Resource<RefCell<u32>, C1> = Resource::new(RefCell::new(0));

fn t1(_task: Exti0Irq, ref prio: P1, ref thr: T1) {
    rtfm::bkpt();

    let counter = COUNTER.access(prio, thr);
    *counter.borrow_mut() += 1;

    rtfm::bkpt();
}

This is our running example of increasing a counter. Here’s the disassembly of the RefCell version:

08000336 <overhead::main::INTERRUPTS::t1>:
 8000336:	f240 0000 	movw	r0, #0
 800033a:	be00      	bkpt	0x0000
 800033c:	f2c2 0000 	movt	r0, #8192	; 0x2000
 8000340:	6801      	ldr	r1, [r0, #0]
 8000342:	b931      	cbnz	r1, 8000352 <overhead::main::INTERRUPTS::t1+0x1c>
 8000344:	6841      	ldr	r1, [r0, #4]
 8000346:	2200      	movs	r2, #0
 8000348:	3101      	adds	r1, #1
 800034a:	e9c0 2100 	strd	r2, r1, [r0]
 800034e:	be00      	bkpt	0x0000
 8000350:	4770      	bx	lr
 8000352:	b580      	push	{r7, lr}
 8000354:	466f      	mov	r7, sp
 8000356:	f7ff feeb 	bl	8000130 <core::result::unwrap_failed>

And the runtime cost is 12 cycles.

> continue
> # PC = 0x0800033a, (CYCCNT & 0xff) = 0xb9

> continue
> # PC = 0x0800034e, (CYCCNT & 0xff) = 0xc5

> print 0xc5 - 0xb9
$1 = 12

Remember that we measured the Cell version and the unsafe static mut version of this example before and they both took 6 cycles. So is the overhead a fixed cost of 6 cycles? Let’s confirm with another program:

static COUNTER: Resource<RefCell<()>, C1> = Resource::new(RefCell::new(()));

fn t1(_task: Exti0Irq, ref prio: P1, ref thr: T1) {
    rtfm::bkpt();

    let counter = COUNTER.access(prio, thr);
    counter.borrow_mut();

    rtfm::bkpt();
}

Disassembly:

08000336 <overhead::main::INTERRUPTS::t1>:
 8000336:	f240 0000 	movw	r0, #0
 800033a:	be00      	bkpt	0x0000
 800033c:	f2c2 0000 	movt	r0, #8192	; 0x2000
 8000340:	6801      	ldr	r1, [r0, #0]
 8000342:	b919      	cbnz	r1, 800034c <overhead::main::INTERRUPTS::t1+0x16>
 8000344:	2100      	movs	r1, #0
 8000346:	6001      	str	r1, [r0, #0]
 8000348:	be00      	bkpt	0x0000
 800034a:	4770      	bx	lr
 800034c:	b580      	push	{r7, lr}
 800034e:	466f      	mov	r7, sp
 8000350:	f7ff feee 	bl	8000130 <core::result::unwrap_failed>

Debug session:

> continue
> # PC = 0x0800033a, (CYCCNT & 0xff) = 0x93

> continue
> # PC = 0x08000348, (CYCCNT & 0xff) = 0x9a

> print 0x9a - 0x93
$1 = 7

The measurement says 7 cycles of overhead for doing a no-op borrow_mut(). So around 6 or 7 cycles seems to be the overhead of using RefCell.borrow_mut.

The worst part of RefCells is that they inhibit optimizations. If we rewrite our program like this:

static COUNTER: Resource<RefCell<u32>, C1> = Resource::new(RefCell::new(0));

fn t1(_task: Exti0Irq, ref prio: P1, ref thr: T1) {
    rtfm::bkpt();

    let counter = COUNTER.access(prio, thr);
    let curr = *counter.borrow();
    *counter.borrow_mut() = curr + 1;

    rtfm::bkpt();
}

We get a much worse disassembly:

0800033e <overhead::main::INTERRUPTS::t1>:
 800033e:	b580      	push	{r7, lr}
 8000340:	466f      	mov	r7, sp
 8000342:	f240 0000 	movw	r0, #0
 8000346:	be00      	bkpt	0x0000
 8000348:	f2c2 0000 	movt	r0, #8192	; 0x2000
 800034c:	6801      	ldr	r1, [r0, #0]
 800034e:	b931      	cbnz	r1, 800035e <overhead::main::INTERRUPTS::t1+0x20>
 8000350:	6841      	ldr	r1, [r0, #4]
 8000352:	2200      	movs	r2, #0
 8000354:	3101      	adds	r1, #1
 8000356:	e9c0 2100 	strd	r2, r1, [r0]
 800035a:	be00      	bkpt	0x0000
 800035c:	bd80      	pop	{r7, pc}
 800035e:	1c48      	adds	r0, r1, #1
 8000360:	d101      	bne.n	8000366 <overhead::main::INTERRUPTS::t1+0x28>
 8000362:	f7ff fee9 	bl	8000138 <core::result::unwrap_failed>
 8000366:	f7ff fee3 	bl	8000130 <core::result::unwrap_failed>

If you do the measurement:

> continue
> # PC = 0x08000346, (CYCCNT & 0xff) = 0x95

> continue
> # PC = 0x0800035a, (CYCCNT & 0xff) = 0xa1

> print 0xa1 - 0x95
$1 = 12

You get 12 cycles as before but that’s not the full story. The task now has a prologue, which increases the context switching cost, and now has two panic branches instead of one.

RefCells also impose a memory overhead of 1 word (4 bytes) per resource:

static COUNTER: Resource<RefCell<()>, C1> = Resource::new(RefCell::new(()));

fn t1(_task: Exti0Irq, ref prio: P1, ref thr: T1) {
    hprintln!("{}", mem::size_of_val(&COUNTER));
}

This program outputs:

$ openocd -f (.)
(..)
4

Less overhead using unsafe

Can we improve the situation somehow? Well, one can use unsafe to optimize away the RefCell runtime check. But is that actually safe?

Resources have the following property: “once a task has accessed a resource then no other task that may access the same resource can start”, where “start” here means preempt the current task. You can flip that property into: “by the time a task accesses a resource no other outstanding borrows to the resource data can exist in other tasks”. Taking this to the RefCell context: “the first time a task accesses a RefCell resource the borrow count of the RefCell is zero”. Or conversely: “as long as Rust borrowing rules are not broken within a task the dynamic borrows of a RefCell can’t fail”.

Translating that into code:

static COUNTER: Resource<RefCell<u32>, C1> = Resource::new(RefCell::new(0));

fn t1(_task: Exti0Irq, ref prio: P1, ref thr: T1) {
    rtfm::bkpt();

    // first access to COUNTER in this task
    let counter = COUNTER.access(prio, thr);

    match counter.try_borrow_mut() {
        Ok(mut counter) => *counter += 1,
        // we know that this is not reachable because no other task can have
        // a reference to COUNTER's inner data and this is the first dynamic
        // borrow of COUNTER in this task
        Err(_) => unsafe { intrinsics::unreachable() },
    }

    rtfm::bkpt();
}

Disassembly:

0800032e <overhead::main::INTERRUPTS::t1>:
 800032e:	f240 0000 	movw	r0, #0
 8000332:	be00      	bkpt	0x0000
 8000334:	2200      	movs	r2, #0
 8000336:	f2c2 0000 	movt	r0, #8192	; 0x2000
 800033a:	6841      	ldr	r1, [r0, #4]
 800033c:	3101      	adds	r1, #1
 800033e:	e9c0 2100 	strd	r2, r1, [r0]
 8000342:	be00      	bkpt	0x0000
 8000344:	4770      	bx	lr

The measurement says that this program takes 9 cycles. Remember than the ideal version took 6 cycles.

> continue
> # PC = 0x08000332, (CYCCNT & 0xff) = 0x9f

> continue
> # PC = 0x08000342, (CYCCNT & 0xff) = 0xa8

> print 0xa8 - 0x9f
$1 = 9

This helps a bit with the runtime overhead, but doesn’t remove the borrow counter from the resource so the resource still has a memory overhead of one word. At the end it’s probably not worth to lose memory safety to reduce a bit of runtime overhead.

Outro

Wow, that’s was a lot (again). Sorry, I probably overdid myself a little up there. I hope you now have a better idea of what RTFM does under the hood; it actually does very little! And I hope that’s also clear that most of the guarantees that RTFM provides are enforced at compile time and don’t involve runtime checks. Hopefully we’ll get rid of RefCell at some point.

Next up: measuring performance at runtime, and then, finally, some applications.


Thank you patrons! ❤️

I want to wholeheartedly thank Iban Eguia, Aaron Turon, Geoff Cant, Harrison Chin, Brandon Edens, whitequark, J. Ryan Stinnett and 14 more people for supporting my work on Patreon.


Let’s discuss on reddit.

Enjoyed this post? Like my work on embedded stuff? Consider supporting my work on Patreon!

Follow me on twitter for even more embedded stuff.

The embedded Rust community gathers on the #rust-embedded IRC channel (irc.mozilla.org). Join us!


Appendix

Initial version of the program used throughout this post:

#![feature(const_fn)]
#![feature(used)]
#![no_std]

// version = "0.2.7"
extern crate cortex_m;
// version = "0.2.0"
extern crate cortex_m_rt;
// version = "0.1.0"
#[macro_use]
extern crate cortex_m_rtfm as rtfm;
// git = "https://github.com/japaric/vl"
extern crate vl;

use core::ptr;

use cortex_m::asm;
use rtfm::{P0, P1, T0, T1, TMax};
use vl::stm32f100xx::interrupt::Exti0Irq;
use vl::stm32f100xx;

// RESOURCES
peripherals!(stm32f100xx, {
    DWT: Peripheral {
        register_block: Dwt,
        ceiling: C1,
    },
});

// INITIALIZATION PHASE
fn init(ref prio: P0, thr: &TMax) {
    // NB the cycle counter is disabled by default
    let dwt = DWT.access(prio, thr);
    dwt.enable_cycle_counter();
}

// IDLE LOOP
fn idle(_prio: P0, _thr: T0) -> ! {
    // Start task `t1`
    rtfm::request(t1);

    // Sleep
    loop {
        rtfm::wfi();
    }
}

// TASKS
tasks!(stm32f100xx, {
    t1: Task {
        interrupt: Exti0Irq,
        priority: P1,
        enabled: true,
    },
});

fn t1(_task: Exti0Irq, ref prio: P1, ref thr: T1) {
    let dwt = DWT.access(prio, thr);

    let before = dwt.cyccnt.read();
    asm::nop();
    let after = dwt.cyccnt.read();

    let elapsed = after.wrapping_sub(before);

    unsafe { ptr::write_volatile(0x2000_0000 as *mut _, elapsed) }

    rtfm::bkpt();
}

  1. The technical reference manual (warning big PDF file), in figure 5-2, indicates a worst case scenario of 12 cycles for the interrupt latency.

    [return]
  2. Remember that Threshold.raise critical sections are not global as they don’t block all the other tasks; instead they only block tasks that could cause data races.

    [return]
  3. http://manishearth.github.io/blog/2015/05/17/the-problem-with-shared-mutability/ [return]
  4. http://smallcultfollowing.com/babysteps/blog/2013/06/11/on-the-connection-between-memory-management-and-data-race-freedom/ [return]
  5. You can find more details about access_mut in this thread [return]

Contents

Creative Commons License
Jorge Aparicio