TiLDA March: how to run native code from MicroPython

Posted by HEx 2016-08-13 at 15:51

Another EMF, another badge! This one is the TiLDA MK3, an STM32-based board running MicroPython. It's similar enough to the pyboard that most of the following might be of use even to non-EMF attendees.

The MK3 is very much an embedded platform, with 128K of RAM, 1MByte of flash and an 80MHz Cortex-M4. How do you get performance out of such a device? Not by writing python, that's for sure.

We have two requirements for running native code. We need to be able to load code at a known address, and we need to be able to branch to it. The built-in assembler is terrible. It only supports a subset of instructions (and in a non-standard format, too). Nonetheless it has a data() function to insert arbitrary bytes into the instruction stream. Problem solved! Just write a script to convert gcc output into a series of data() calls and we're done. Right?

@micropython.asm_thumb
def arbitrary_code():
    data(4,0x11111111,0x22222222,0x33333333...)

This works. But it's kind of inefficient, and since we have so little space it would be foolish to incur a 300% bloat penalty[1]. But it is enough to investigate the environment in which we find ourselves.
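For reference, the conversion script described above could look something like the following. This is only a sketch to run on the host, not on the badge. The function name to_data_calls and the words_per_call limit of 8 are made up for illustration, since the real limit on data() arguments (if any) is not known here.

```python
import struct

def to_data_calls(blob, words_per_call=8, name="arbitrary_code"):
    """Render a flat binary as an asm_thumb function made of data() calls."""
    # Pad to a whole number of 32-bit words so struct can unpack it.
    blob = blob + b"\x00" * (-len(blob) % 4)
    # Cortex-M is little-endian, so read the words that way.
    words = struct.unpack("<%dI" % (len(blob) // 4), blob)
    lines = ["@micropython.asm_thumb", "def %s():" % name]
    # words_per_call is a hypothetical limit; split into several statements.
    for i in range(0, len(words), words_per_call):
        chunk = words[i:i + words_per_call]
        lines.append("    data(4,%s)" % ",".join("0x%08x" % w for w in chunk))
    return "\n".join(lines)

# Two example words, matching the placeholder values used above.
source = to_data_calls(struct.pack("<2I", 0x11111111, 0x22222222))
print(source)
```

Each 4-byte word turns into 11 characters of source text, which is where the bloat comes from.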


Here's a useful function. It doesn't do anything itself.[2] But nonetheless you can pass arbitrary python objects to it and get their addresses back.

@micropython.asm_thumb
def addr(r0):
    mov(r0,r0)

We also have stm.mem* which lets us read and write arbitrary memory locations from Python. So we can build ourselves a hex dumper:

import stm

def dump(addr):
    # copy memory just in case it changes out from under us[3]
    a = bytearray(16)
    for i in range(16):
        a[i] = stm.mem8[addr+i]
    out = "%08x: " % addr
    # hex column
    for b in a:
        out = out + ("%02x " % b)
    # printable-ASCII column
    for b in a:
        out = out + (("%c" % b) if b > 31 and b < 127 else ".")
    print(out)

addr(addr) returns its own address, which turns out to be some kind of struct with a pointer to the code at offset 8.

>>> dump(addr(addr))
20004a10: 4c fb 06 08 01 00 00 00 70 9f 00 20 02 00 00 00 L.......p.. ....
>>> dump(stm.mem32[addr(addr)+8])
20009f70: f2 b5 00 46 f2 bd 00 00 00 00 00 00 00 00 00 00 ...F............
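To see why offset 8 is the interesting one, the first dump line can be decoded as four little-endian words. This is a host-side sketch using only the bytes printed above. What the other three words mean is a guess.

```python
import struct

# The 16 bytes printed by dump(addr(addr)).
raw = bytes.fromhex("4cfb0608" "01000000" "709f0020" "02000000")

# Cortex-M is little-endian: read four unsigned 32-bit words.
words = struct.unpack("<4I", raw)
for offset, word in zip(range(0, 16, 4), words):
    print("+%d: 0x%08x" % (offset, word))

code_ptr = words[2]  # the word at offset 8
```

The word at offset 8 is 0x20009f70, the RAM address passed to the second dump. The word at offset 0 lies in the STM32 flash region (0x08000000 and up), so it is presumably a pointer to the object's type.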

Disassemble the code and we get to see the prologue and epilogue:

$ perl -e 'print chr hex for @ARGV' f2 b5 00 46 f2 bd >/tmp/dump
$ arm-none-eabi-objdump -b binary -m arm -M force-thumb,reg-names-std -D /tmp/dump
[...]
   0: b5f2 push {r1, r4, r5, r6, r7, lr}
   2: 4600 mov r0, r0
   4: bdf2 pop {r1, r4, r5, r6, r7, pc}

Which looks fairly standard.
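If objdump isn't to hand, the prologue and epilogue are simple enough to decode by hand. Here is a small host-side sketch covering only the 16-bit Thumb PUSH and POP encodings. decode_push_pop is a name invented for this example.

```python
import struct

def decode_push_pop(halfword):
    """Decode a 16-bit Thumb PUSH or POP into (mnemonic, register names)."""
    top = halfword & 0xFE00
    if top == 0xB400:
        mnemonic, extra = "push", "lr"   # bit 8 adds lr to a push
    elif top == 0xBC00:
        mnemonic, extra = "pop", "pc"    # bit 8 adds pc to a pop
    else:
        raise ValueError("not a 16-bit push/pop: 0x%04x" % halfword)
    # Bits 0-7 select r0-r7.
    regs = ["r%d" % n for n in range(8) if halfword & (1 << n)]
    if halfword & 0x100:
        regs.append(extra)
    return mnemonic, regs

# The six bytes from the dump, read as little-endian halfwords.
prologue, nop, epilogue = struct.unpack("<3H", bytes.fromhex("f2b50046f2bd"))
print(decode_push_pop(prologue))
print(decode_push_pop(epilogue))
```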

The goal though is to be able to get the address of a buffer that contains code we want to run. It turns out that python bytearrays are passed directly to assembly functions. Handy.

>>> test = bytearray([0x2a,0x20,0x70,0x47]) # "movs r0, #42; bx lr"
>>> dump(addr(test))
2000f950: 2a 20 70 47 00 00 00 00 00 00 00 00 00 00 00 00 * pG............
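The four bytes in test can be derived from the Thumb encodings rather than taken on trust. A host-side sketch follows. The helper names movs_imm8 and bx are invented here.

```python
import struct

def movs_imm8(rd, imm8):
    """Thumb 'movs rd, #imm8' (rd in r0-r7, imm8 in 0-255)."""
    assert 0 <= rd <= 7 and 0 <= imm8 <= 255
    return 0x2000 | (rd << 8) | imm8

def bx(rm):
    """Thumb 'bx rm' (rm in r0-r15)."""
    assert 0 <= rm <= 15
    return 0x4700 | (rm << 3)

LR = 14  # lr is r14

# Each instruction is one little-endian halfword.
code = struct.pack("<2H", movs_imm8(0, 42), bx(LR))
print(code.hex())  # 2a207047
```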

Now we can read a buffer from file and branch to it. So, assuming we would actually like to return control to python afterwards, how do we do the branching?

blx is the standard way to jump to an address and return afterwards. blx requires the instruction set (ARM or Thumb) to be encoded in the LSB of the address; while the Cortex-M4 supports only Thumb mode, the processor does indeed verify that this is the case, so we must set the LSB correctly or Bad Things happen.[4] Sadly the older bl instruction (which doesn't change instruction set) can't branch to an address in a register. We could always push the return address ourselves and load the PC directly, but then we'd have to set the LSB on the return address!

The assembler refuses to assemble blx(r0) so we have to do it ourselves.

@micropython.asm_thumb
def call(r0):
    data(2,0x4780) # blx r0

Then we can call code in our test buffer.

>>> assert(not(addr(test)&3)) # ensure the buffer is aligned
>>> call(addr(test)+1) # set the LSB and branch
42


Of course, if we do the incrementing from within the call function we can dispense with addresses entirely:

@micropython.asm_thumb
def call_inc(r0):
    data(2,0x3001,0x4780) # adds r0,#1; blx r0

buf = open("code").read()
call_inc(buf) # assumes buf is word-aligned, which seems to always be the case
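The magic numbers handed to data() can be checked the same way, along with the address fix-up. This is a host-side sketch. blx, adds_imm8 and thumb_target are illustrative names, and the word-alignment test simply mirrors the assert used earlier rather than any documented requirement.

```python
def blx(rm):
    """Thumb 'blx rm' (register form; r15 is excluded as unpredictable)."""
    assert 0 <= rm <= 14
    return 0x4780 | (rm << 3)

def adds_imm8(rdn, imm8):
    """Thumb 'adds rdn, #imm8' (rdn in r0-r7, imm8 in 0-255)."""
    assert 0 <= rdn <= 7 and 0 <= imm8 <= 255
    return 0x3000 | (rdn << 8) | imm8

def thumb_target(address):
    """Value to hand to blx for Thumb code stored at `address`."""
    if address & 3:
        raise ValueError("buffer is not word-aligned")
    return address | 1  # LSB set means "stay in Thumb mode"

print("0x%04x 0x%04x" % (adds_imm8(0, 1), blx(0)))
print("0x%08x" % thumb_target(0x2000f950))
```

Because the buffer address is even, adding 1 (as call_inc does) and OR-ing in 1 give the same result.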

Now to get gcc on board.

To C

Here's “hello world” without the hello (or the world).

$ cat test.c
int test(void) {
    return 42;
}
$ arm-none-eabi-gcc -mthumb -mcpu=cortex-m4 -ffreestanding -nostdlib -fPIC test.c -o test
ld: warning: cannot find entry symbol _start; defaulting to 0000000000008000
$ arm-none-eabi-objcopy test -O binary test-raw

(Why not just get gcc to output a flat binary directly?

$ arm-none-eabi-gcc -Wl,--oformat=binary -mthumb -mcpu=cortex-m4 -ffreestanding -nostdlib -fPIC test.c -o test
ld: error: Cannot change output format whilst linking ARM binaries.

Apparently this is deliberate. I have no idea why.)

Let's have a look at the result:

$ arm-none-eabi-objdump -b binary -m arm -M force-thumb,reg-names-std -D test-raw
[...]
   0: b480      push {r7}
   2: af00      add r7, sp, #0
   4: 232a      movs r3, #42 ; 0x2a
   6: 4618      mov r0, r3
   8: 46bd      mov sp, r7
   a: f85d 7b04 ldr.w r7, [sp], #4
   e: 4770      bx lr

A bit verbose perhaps, but we didn't ask for any optimization. Try it:

>>> f = open("test-raw")
>>> test2 = f.read()
>>> f.close()
>>> call_inc(test2)
42

Looking good. Things go badly very quickly though:

$ cat count.c
int arr[] = { 1, 2, 3, 4, 5 };

int count(void) {
    int i, total=0;
    for(i=0; i<5; i++) {
        total += arr[i];
    }
    return total;
}
$ arm-none-eabi-gcc -Wall -mthumb -mcpu=cortex-m4 -ffreestanding -nostdlib -fPIC count.c -o count
ld: warning: cannot find entry symbol _start; defaulting to 0000000000008000
$ arm-none-eabi-objcopy count -O binary count-raw

Again we compile without optimization, in this case to avoid gcc simply returning a constant instead of performing calculations at runtime.

Suddenly our binary is 64K in size! And it contains some code that will never work:

$ arm-none-eabi-objdump -b binary -m arm -M force-thumb,reg-names-std -D count-raw
[...]
    0: b480 push {r7}
    2: b083 sub sp, #12
    4: af00 add r7, sp, #0
    6: 4a0e ldr r2, [pc, #56] ; (0x40)
    8: 447a add r2, pc
[...]
   14: 4b0b ldr r3, [pc, #44] ; (0x44)
   16: 58d3 ldr r3, [r2, r3]
[...]
   40: 003c movs r4, r7
   42: 0001 movs r1, r0
   44: 000c movs r4, r1
[...]
10054: 8058 strh r0, [r3, #2]
10056: 0001 movs r1, r0

Here it is computing an address relative to the PC, loading the word stored at offset 0x10054 (which is 0x18058), and later reading from that address. Disassembling the original ELF output reveals what's going on: ldr r2, [pc, #56]; add r2, pc is attempting to locate the global offset table. GCC has assumed (despite -ffreestanding!) that we're running in an environment with virtual memory where code and data are held in separate pages.
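The address arithmetic is easy to get wrong by eye, so here is a host-side sketch that replays the four instructions on the words visible in the listing. One assumption: the halfword at 0x46 is elided from the listing and is taken to be zero. File offsets stand in for addresses, which is fine because every step is PC-relative.

```python
# Words from the flat binary, keyed by file offset (little-endian halfword pairs).
words = {
    0x40: 0x0001003C,     # halfwords 003c, 0001
    0x44: 0x0000000C,     # halfword 000c; upper half ASSUMED to be 0000
    0x10054: 0x00018058,  # halfwords 8058, 0001
}

def literal_address(insn_addr, imm):
    """Address used by Thumb 'ldr rt, [pc, #imm]': Align(PC, 4) + imm."""
    return ((insn_addr + 4) & ~3) + imm

r2 = words[literal_address(0x06, 56)]  # ldr r2, [pc, #56]
r2 = (r2 + 0x08 + 4) & 0xFFFFFFFF      # add r2, pc (PC reads as insn + 4)
r3 = words[literal_address(0x14, 44)]  # ldr r3, [pc, #44]
got_slot = r2 + r3                     # address used by ldr r3, [r2, r3]
r3 = words[got_slot]

print(hex(r2), hex(got_slot), hex(r3))  # 0x10048 0x10054 0x18058
```

So R2 ends up pointing 0x10048 bytes past the start of the code, and the pointer fetched from the table is an absolute link-time address. Neither means anything once the blob has been read into a bytearray at some arbitrary location in RAM.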

We need to create a custom linker script to tell it to not do that. Not being any kind of ld expert, here's what I cribbed together:

$ cat linkscr
SECTIONS
{
    .text 0 : AT(0) {
        code = .;
        *(.text.startup)
        *(.text)
        *(.rodata)
        *(.data)
    }
}

108 bytes, plus our warning has gone away! That's more like it. We also need to replace -fPIC with -fPIE. The tempting -mpic-data-is-text-relative option is, happily, already enabled by default in gcc 4.8 and up.

$ arm-none-eabi-gcc -Tlinkscr -Wall -mthumb -mcpu=cortex-m4 -nostdlib -fPIE count.c -o count
$ arm-none-eabi-objcopy count -O binary count-raw

>>> call_inc(open("count-raw").read())
15


So having got this far I made a little app, a quick-and-dirty port of some existing C code I've been working on.

What's the code? And why is it so arcane? That's a story for another day.

Code is on github.

[1] Four bytes written in hex is "0x12345678," (11 bytes of source, counting the comma). This will allocate another four bytes for the result, for a total of 15 bytes. Then there's the space for data() (which almost certainly has a limit on the number of parameters, so multiple statements would be required). Result: about four times the space.

[2] Indeed, we only need the no-op at all to avoid confusing Python with an empty function body.

[3] Sadly the implementation of stm.mem* contains the following line:

// TODO support slice index to read/write multiple values at once

so this somewhat unpythonic code is actually the best we can do.

[4] Bad Things include but are not limited to: LEDs flashing ominously, unresponsiveness (necessitating a reboot), filesystem corruption (necessitating a reinstall, with loss of data), and nasal demons.
