Memory mapping the TiLDA

Posted by HEx 2016-09-15 at 01:31

Last time I looked at the TiLDA and how to make it run native code. Anyone who's played with this device for any length of time (read: five minutes) knows that there's a serious memory shortage. There's only 128K of RAM total and half of that's used by MicroPython. By the time your app is running you have perhaps 30K to play with.

It gets worse. Because the garbage collector can't move blocks around, the heap gets fragmented incredibly quickly. You want to load a 10K file? Nope. 5K? On a good day.

Here gc.mem_free() is reporting 30944 bytes free, which is technically correct (the best kind of correct!), but pyb.info(1) reveals the true story. Each line is one kilobyte, and free space is represented by dots:

GC memory layout; from 20003c80: 00000: MDhhhhBLhhDh======BBBhh=============h=ShSh=ShSh=Sh=BhBBhBB=BBBBh 00400: =====h=======h=====BTh===BhDBhhh=B=h============================ 00800: ====================================================h========h== 00c00: ===BBBB=hB=BhBhBhBBhShBhBhBhhBh=BhSBhSSB=BBSh===B=hSB=hB=h=B=BBB 01000: Sh========Bh=BBhhhShShhShSDh=hShSFhShFShSFh=hLhShh=ShhBSh===hSSh 01400: ===========hShShh=====ShShShShDh=ShLhSShShShShShShShShShShShShSh 01800: Sh=========================h==Sh==Sh=Sh=Sh=h==================== 01c00: h==ShShShShShShh=hh=hh===hh=hh============h=Sh====ShShShShSh=h== 02000: hh==hh=hh==============h=Sh====ShShShShSh=h======hSh=Sh=h=hh=hh= 02400: ===================================hShh=h=h=h=hShShShShShShSh=hh 02800: hShShShShSh=h=Bh=hhBh=hBhhh=hBhShSh=hhShSh=hShShShShShShShShShSh 02c00: ShShShShShhhh=====hShShShShShShShShShShShShShShShShSh=hBhhBh.... 03000: ...h=....h=..................................................... (2 lines all free) 03c00: ...................h........MD.................................. (6 lines all free) 05800: ............................................MD.h=............... 05c00: .BD..B.B=BB.B..B=Bh===B=h====B..h===.BBBBh=======.........h===== 06000: =====B.B=.SMD.h=======.DDF.Bh=====....MD.BBB....MD...MDF........ 06400: ...hSFFhh.hhhh...hh.....h.Bh=======h.B=B.......BDDBhh=======h=== 06800: ===..hh..hh.h===hh..B=B.Bh=======h=======B=............h==hh==== 06c00: ===.....h========B=BBBBM......h======h============hh============ 07000: ============================h=======h=hh=======h=======h======== 07400: ===M.h==h==hh=======hSh===h==hh==hh==hh=====hh=hh=hD..D.Mh=..... 07800: DBB..Lhh=======h.h=====B...hBh=D...Bh=.BBB.....h===========..h== 07c00: =====BB.....B=B....Sh..h=hhShh=====h=====h========h==S.......BMB 08000: D.h=S.....h=======h====================h========...............h 08400: ===.......h==hSh==.......h==============hh==hh===.h=..........h= 08800: ========.............h=======h====h====h==h.h=======h===h=B==... 08c00: ........B==.B==..................h====h====h====h=======h=....h= 09000: =h============h==h=====h==hh======.h==h=h=hF............B==DB..h 09400: S....MDB=.BB....hBBh===h===.....hSh=h=h=====...h========.h====== 09800: =h==hShFh========....hh......................................... 09c00: ........h....................................h===hShh=........h= 0a000: h==h====h===h=================================================== 0a400: ==...h========h=====h===========h=h=Sh=======h==========hhS..h== 0a800: ==.....................................................h=======. (2 lines all free) 0b400: ..............................................Sh=======......... 0b800: ................................................................ 0bc00: h================hh===========h=h=======h=hh==h================= 0c000: ================================================================ 0c400: ================================h==Sh=ShSh===Sh==Sh=ShShSh===h== 0c800: =hShSh....h.......h...................................Sh........ 0cc00: ................................................................ 0d000: ....................................h=======.................... 0d400: ................................................................ 0d800: ..................................h...................h========= 0dc00: ===================================================............. 0e000: .................................hh............................. 0e400: h=======.....h.................................................. 0e800: ..............h===h====h==hh=====hSh==h==hh==hh=====hh==h======h 0ec00: h=hh=hh==hh===hh==hh==hh==hh===hh==hh========hh==hh=====hSh==h=h 0f000: h=hh===hh==hh==h................................................ 0f400: ................................................................ 0f800: ...........h==h===h=h===========================h=h============= 0fc00: ========================h=h==================================h=. 10000: .......h================hh===hh=hh=hh=hh=hh==h.h=h=.....

We'd be lucky to find 2K of contiguous memory free! Once you start delving into nativity, how objects are laid out in memory starts to become really important. Last time's app was reduced with a bit of effort to 3200 bytes (from about twice that), but they still need to be consecutive bytes. Writing anything bigger would be a problem. In particular graphics take up an awful lot of space.

FAT

But stepping back a moment. The data we want is already stored in the filesystem. The filesystem is in flash. Flash is memory-mapped. Furthermore, the filesystem is FAT, which is primitive and doesn't have any annoying gotchas such as compression. So the data we want should be there for the taking.

The FAT implementation in micropython is a generic device-independent filesystem. It can use filesystems stored anywhere (e.g. on an SD card), for any device for which a "get block" and optionally "put block" function are provided. It assumes that block reads and writes are slow and will cache blocks in RAM when it thinks this is useful. For the case of our flash filesystem, this means a lot of unnecessary block copies and memory usage. There is of course no mmap() function, and it's not even clear how the filesystem could usefully provide one with the existing block device API.

But maybe we can make our own.

Thus the following idea. Open a file and read its contents. Loop through the entire flash (it's only a megabyte) looking for where it's stored. The cluster size is 512 bytes, so files will start on a 512-byte boundary.

def mmap(file): buf=bytearray(open(file).read()) for ptr in range(0x8000000,0x8100000,512): ## it'd be so great if this worked #if stm.mem8[ptr:ptr+len(buf)]==buf: # return ptr for i in range(len(buf)): if stm.mem8[ptr+i]!=buf[i]: break else: return ptr

Because of the previously-mentioned limitations of the stm.mem* functions we have to iterate by hand rather than using slices, which is hideously slow. But we can of course write a native version1:

def mmap(file): @micropython.asm_thumb def findblk(r0,r1,r2): data(2,0xf500,0x7000,0xf1b0,0x6f01,0xd208,0x2300,0x58c4,0x58cd) data(2,0x42ac,0xd1f5,0x3304,0x4293,0xd1f8,0xe000,0x2000) buf=open(file).read() return findblk(1<<27,buf,len(buf))

There are complications. Firstly, filesystems suffer fragmentation just as the heap does, so files may not be contiguous either. Secondly, reading the entire file contents takes up the same amount of RAM, temporarily, as the file size. The second problem could be solved by only reading and comparing small chunks of the file at a time, but then either we require that the first chunk of the file uniquely identifies it, or we're willing to backtrack (possibly multiple times!) if we found a false positive early on in the search. Also possible is precomputing a hash for the file rather than actually reading it in (we wouldn't need to bother with filenames at all!), but then we'd need to pick an algorithm, and implement it, and hope there are no collisions... Messy, all told.

The first is the biggie though, and there's no easy solution. But we can mitigate both problems simultaneously by simply having small files, but lots of them—say, one per sprite. This turns out to work beautifully in practice.

Is it hacky? Why yes. But this is a microcontroller. If you're not prepared to get your hands dirty you're doing it wrong!

Next time: let's put mmap to good use by writing a game.


[1] There are a couple of gotchas with this code. Word comparisons are used, so ensure your files are a multiple of four bytes long. It takes an address to start searching at, but the first block is skipped. Here's the actual code encoded in the hex constants:

0: f500 7000 add.w r0, r0, #512 ; 0x200 4: f1b0 6f01 cmp.w r0, #135266304 ; 0x8100000 8: d208 bcs.n 0x1c a: 2300 movs r3, #0 c: 58c4 ldr r4, [r0, r3] e: 58cd ldr r5, [r1, r3] 10: 42ac cmp r4, r5 12: d1f5 bne.n 0x0 14: 3304 adds r3, #4 16: 4293 cmp r3, r2 18: d1f8 bne.n 0xc 1a: e000 b.n 0x1e 1c: 2000 movs r0, #0

TiLDA March: how to run native code from MicroPython

Posted by HEx 2016-08-13 at 15:51

Another EMF, another badge! This one is the TiLDA MK3, an STM32-based board running MicroPython. It's similar enough to the pyboard that most of the following might be of use even to non-EMF attendees.

The MK3 is very much an embedded platform, with 128K of RAM, 1MByte of flash and an 80MHz Cortex-M4. How do you get performance out of such a device? Not by writing python, that's for sure.

We have two requirements for running native code. We need to be able to load code at a known address, and we need to be able to branch to it. The built-in assembler is terrible. It only supports a subset of instructions (and in a non-standard format, too). Nonetheless it has a data() function to insert arbitrary bytes into the instruction stream. Problem solved! Just write a script to convert gcc output into a series of data() calls and we're done. Right?

@micropython.asm_thumb def arbitrary_code(): data(4,0x11111111,0x22222222,0x33333333...)

This works. But it's kind of inefficient, and since we have so little space it would be foolish to incur a 300% bloat penalty1. But it is enough to investigate the environment in which we find ourselves.

Addresses

Here's a useful function. It doesn't do anything itself.2 But nonetheless you can pass arbitrary python objects to it and get their addresses back.

@micropython.asm_thumb def addr(r0): mov(r0,r0)

We also have stm.mem* which lets us read and write arbitrary memory locations from Python. So we can build ourselves a hex dumper:

def dump(addr): # copy memory just in case it changes out from under us3 a = bytearray(16) for i in range(16): a[i] = stm.mem8[addr+i] out = "%08x: " % addr for b in a: out = out + ("%02x " % b) for b in a: out = out + (("%c" % b) if b > 31 and b < 127 else ".") print(out)

addr(addr) returns its own address, which turns out to be some kind of struct with a pointer to the code at offset 8.

>>> dump(addr(addr)) 20004a10: 4c fb 06 08 01 00 00 00 70 9f 00 20 02 00 00 00 L.......p.. .... >>> dump(stm.mem32[addr(addr)+8]) 20009f70: f2 b5 00 46 f2 bd 00 00 00 00 00 00 00 00 00 00 ...F............

Disassemble the code and we get to see the prologue and epilogue:

$ perl -e 'print chr hex for @ARGV' f2 b5 00 46 f2 bd >/tmp/dump $ arm-none-eabi-objdump -b binary -m arm -M force-thumb,reg-names-std -D /tmp/dump [...] 0: b5f2 push {r1, r4, r5, r6, r7, lr} 2: 4600 mov r0, r0 4: bdf2 pop {r1, r4, r5, r6, r7, pc}

Which looks fairly standard.

The goal though is to be able to get the address of a buffer that contains code we want to run. It turns out that python bytearrays are passed directly to assembly functions. Handy.

>>> test = bytearray([0x2a,0x20,0x70,0x47]) # "movs r0, #42; bx lr" >>> dump(addr(test)) 2000f950: 2a 20 70 47 00 00 00 00 00 00 00 00 00 00 00 00 * pG............

Now we can read a buffer from file and branch to it. So, assuming we would actually like to return control to python afterwards, how do we do the branching?

blx is the standard way to jump to an address and return afterwards. blx requires the instruction set (ARM or Thumb) to be encoded in the LSB of the address; while the Cortex-M4 supports only Thumb mode, the processor does indeed verify that this is the case, so we must set the LSB correctly or Bad Things happen.4 Sadly the older bl instruction (which doesn't change instruction set) can't branch to an address in a register. We could always push the return address ourselves and load the PC directly, but then we'd have to set the LSB on the return address!

The assembler refuses to assemble blx(r0) so we have to do it ourselves.

@micropython.asm_thumb def call(r0): data(2,0x4780) # blx r0

Then we can call code in our test buffer.

>>> assert(not(addr(test)&3)) # ensure the buffer is aligned >>> call(addr(test)+1) # set the LSB and branch 42

Yes!

Of course, if we do the incrementing from within the call function we can dispense with addresses entirely:

def call_inc(r0): data(2,0x3001,0x4780) # adds r0,#1; blx r0 buf = open("code").read() call_inc(buf) # assumes buf is word-aligned, which seems to always be the case

Now to get gcc on board.

To C

Here's “hello world” without the hello (or the world).

$ cat test.c int test(void) { return 42; } $ arm-none-eabi-gcc -mthumb -mcpu=cortex-m4 -ffreestanding -nostdlib -fPIC test.c -o test ld: warning: cannot find entry symbol _start; defaulting to 0000000000008000 $ arm-none-eabi-objcopy play -O binary test-raw

(Why not just get gcc to output a flat binary directly?

$ arm-none-eabi-gcc -Wl,--oformat=binary -mthumb -mcpu=cortex-m4 -ffreestanding -nostdlib -fPIC test.c -o test ld: error: Cannot change output format whilst linking ARM binaries.

Apparently this is deliberate. I have no idea why.)

Let's have a look at the result:

$ arm-none-eabi-objdump -b binary -m arm -M force-thumb,reg-names-std -D test-raw [...] 0: b480 push {r7} 2: af00 add r7, sp, #0 4: 232a movs r3, #42 ; 0x2a 6: 4618 mov r0, r3 8: 46bd mov sp, r7 a: f85d 7b04 ldr.w r7, [sp], #4 e: 4770 bx lr

A bit verbose perhaps, but we didn't ask for any optimization. Try it:

>>> f = open("test-raw") >>> test2 = f.read() >>> f.close() >>> call_inc(test2) 42

Looking good. Things go badly very quickly though:

$ cat count.c int arr[] = { 1, 2, 3, 4, 5 }; int count(void) { int i, total=0; for(i=0; i<5; i++) { total += arr[i]; } return total; } $ arm-none-eabi-gcc -Wall -mthumb -mcpu=cortex-m4 -ffreestanding -nostdlib -fPIC count.c -o count ld: warning: cannot find entry symbol _start; defaulting to 0000000000008000 $ arm-none-eabi-objcopy count -O binary count-raw

Again we compile without optimization, in this case to avoid gcc simply returning a constant instead of performing calculations at runtime.

Suddenly our binary is 64K in size! And it contains some code that will never work:

$ arm-none-eabi-objdump -b binary -m arm -M force-thumb,reg-names-std -D count-raw [...] 0: b480 push {r7} 2: b083 sub sp, #12 4: af00 add r7, sp, #0 6: 4a0e ldr r2, [pc, #56] ; (0x40) 8: 447a add r2, pc [...] 14: 4b0b ldr r3, [pc, #44] ; (0x44) 16: 58d3 ldr r3, [r2, r3] [...] 40: 003c movs r4, r7 42: 0001 movs r1, r0 44: 000c movs r4, r1 [...] 10054: 8058 strh r0, [r3, #2] 10056: 0001 movs r1, r0

Here it is trying to set R2 to the contents of PC+0x10054, which is 0x18058, and later reading from that address. Disassembling the original ELF output reveals what's going on: ldr r2, [pc, #56]: add r2, pc is attempting to load the global offset table. GCC has assumed (despite -ffreestanding!) that we're running in an environment with virtual memory where code and data are held in separate pages.

We need to create a custom linker script to tell it to not do that. Not being any kind of ld expert, here's what I cribbed together:

$ cat linkscr SECTIONS { .text 0 : AT(0) { code = .; *(.text.startup) *(.text) *(.rodata) *(.data) } }

108 bytes, plus our warning has gone away! That's more like it. We also need to replace -fPIC with -fPIE. The tempting -mpic-data-is-text-relative option is, happily, already enabled by default in gcc 4.8 and up.

$ arm-none-eabi-gcc -Tlinkscr -Wall -mthumb -mcpu=cortex-m4 -nostdlib -fPIE count.c -o count $ arm-none-eabi-objcopy count -O binary count-raw >>> call_inc(open("count-raw").read()) 15

Awesome.

So having got this far I made a little app, a quick-and-dirty port of some existing C code I've been working on.

What's the code? And why is it so arcane? That's a story for another day.

Code is on github.


[1] 4 bytes written in hex is 0x12345678, (11 bytes). This will allocate another four bytes for the result, for a total of 15 bytes. Then there's the space for data() (which almost certainly has a limit on the number of parameters, so multiple statements would be required). Result: about four times the space.

[2] Indeed, we only need the no-op at all to avoid confusing Python with an empty function body.

[3] Sadly the implementation of stm.mem* contains the following line:

// TODO support slice index to read/write multiple values at once

so this somewhat unpythonic code is actually the best we can do.

[4] Bad Things include but are not limited to: LEDs flashing ominously, unresponsiveness (necessitating a reboot), filesystem corruption (necessitating a reinstall, with loss of data), and nasal demons.