ARM Bare-metal Programming

ARM Cortex processors and microcontrollers are ubiquitous; it’s a hugely successful processor architecture. It’s useful to understand how they work and how to use them as tools. When getting started with embedded development on ARM Cortex, it might seem like a complex and difficult platform because a lot of the work is usually done for you, i.e. you download tools that set up projects for you, configure and initialize the devices for you and pull in dependencies with some API to use the device peripherals. It’s not obvious what is going on “under the hood” during those steps and what some of the resulting code is doing. Here I’ll document the steps to get an ARM Cortex microcontroller up and running with only simple tools and eventually make it perform more advanced tasks.

Source code is available at my GitHub. In this post an LED is toggled. In the repository there is code that does more advanced tasks such as configuring the microcontroller as a USB PC mouse, sending I2C data or running code asynchronously using interrupt service routines.

Article summary: I program an STM32 microcontroller (ARM Cortex-M3) using nothing more than the GCC-ARM toolchain and a few basic tools.

The tools needed are the GCC-ARM toolchain, a text editor and a way to program the microcontroller. On Ubuntu the toolchain is installed by executing

apt install gcc-arm-none-eabi binutils-arm-none-eabi

To get a basic hello world program that blinks an LED on and off running, the steps are the following.

  • Define the microcontroller memory layout for the compiler using a linker-script.

  • Create the startup-routine. It’s a sort of bootloader.

  • Create the main application which is hello world.

  • Compile the programs and flash them to the microcontroller.

    The microcontroller I’ll be using is the STM32F103C8. The datasheet is available here.

    stm32chip-300x289.png

    Generic STM32 in a LQFP64 package

    Linker script

    The linker script is used to define the memory layout of the output binary, that is the program flashed onto the microcontroller and what it will execute on boot.

    The image below is an excerpt from the datasheet and shows at which addresses flash memory and SRAM are located. Because the memories are accessed at different addresses, the linker script is needed: data must be placed at addresses where there is actual hardware memory. Certain types of data should also be stored in different types of memory. For example, data in SRAM is lost when power is removed, but SRAM is faster to access than flash memory, which makes it suited to variables that change at run time. Flash memory is non-volatile and is the right place for read-only data such as the actual program instructions.

    STM32F103C8F6_memory.png

    Page 34 of the datasheet shows the memory mapping. The information needed is the start address of the flash memory and SRAM, that is 0x0800 0000 and 0x2000 0000. It’s possible to use 0x0000 0000 for the flash memory address start since it can be aliased to the flash memory.
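
The address arithmetic behind the memory map can be sketched in C. This is only an illustration that runs on a host machine: the sizes are the 64K of flash and 20K of SRAM of the STM32F103C8, while the macro names and the two helper functions are hypothetical and not part of any library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Start addresses from the datasheet memory map; sizes for the STM32F103C8. */
#define FLASH_ORIGIN 0x08000000u
#define FLASH_LENGTH (64u * 1024u)
#define SRAM_ORIGIN  0x20000000u
#define SRAM_LENGTH  (20u * 1024u)

/* The initial stack pointer is the first address past the end of SRAM,
 * because the Cortex-M stack grows downwards. */
#define STACK_TOP (SRAM_ORIGIN + SRAM_LENGTH)

static_assert(STACK_TOP == 0x20005000u, "20K of SRAM ends at 0x20005000");

/* Hypothetical helpers: true if addr lies inside the given region. */
static bool in_flash(uint32_t addr) {
    return addr >= FLASH_ORIGIN && addr - FLASH_ORIGIN < FLASH_LENGTH;
}

static bool in_sram(uint32_t addr) {
    return addr >= SRAM_ORIGIN && addr - SRAM_ORIGIN < SRAM_LENGTH;
}
```

The stack top itself lies just outside SRAM, which is fine because the processor decrements the stack pointer before it stores anything.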

    Data will be segmented depending on its type. Here data is segmented using “text”, “rodata”, “data” and “bss”. This Wikipedia article summarizes the difference between some of these. A custom segment called “vectors” will also be used. It is an architecture-specific data region that contains the address of the stack top and the addresses of the exception handlers. Read more about it here.
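
As a rough guide to which C constructs end up in which segment, here is a small sketch. The variable and function names are made up for illustration, and the exact placement depends on the compiler and its flags, so treat the comments as the typical GCC behaviour rather than a guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* Read-only data: typically placed in .rodata, which stays in flash. */
const uint8_t lookup_table[4] = {1, 2, 4, 8};

/* Initialized, writable data: .data (stored in flash, copied to SRAM at startup). */
uint32_t blink_count = 3;

/* No initializer: .bss (or COMMON with older compiler defaults), zeroed at startup. */
uint32_t tick_counter;

static_assert(sizeof(lookup_table) == 4, "four one-byte entries");

/* Machine code: .text, executed from flash. */
uint32_t next_delay(uint32_t step) {
    tick_counter++;
    return lookup_table[step % 4u] * blink_count;
}
```

Running `arm-none-eabi-nm` on the compiled object file is one way to check where each symbol actually landed.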

    Here is the actual linker script:

    link_script.ld
    /* Linkfile for a STM32F103C8*
     * Robin Isaksson 2016 
     */ 
    
    MEMORY { 
        SRAM :       ORIGIN = 0x20000000, LENGTH = 20K
        FLASH :      ORIGIN = 0x08000000, LENGTH = 64K 
    } 
    
    SECTIONS {
        _STACK_TOP = 0x20000000 + 20K; /* Set the stack-top symbol to the end of SRAM */
        . = 0x00000000; /* Set current address to the origin of the flash memory */
       .text : {
           KEEP ( * (vectors) ); /* Custom segment, ARM Cortex vector table */
           * (.text); 
        } > FLASH
    
        .rodata : {
            * (.rodata);
        } > FLASH 
        FLASH_DATA_START = .; /*Where we will load from flash when writing to SRAM */ 
    
        /* Now we'll configure where we will keep our data outside of flash */ 
        . = 0x20000000; /* Set current address to the origin of the SRAM */
        SRAM_DATA_START = .; 
    
        .data : AT (FLASH_DATA_START) {
            * (.data);
        } > SRAM
        SRAM_DATA_END = .;
        SRAM_DATA_SIZE = SRAM_DATA_END - SRAM_DATA_START;
    
        BSS_START = .;
        .bss : {
            * (.bss);
            * (COMMON);
        } > SRAM
        BSS_END = .;
        BSS_SIZE = BSS_END - BSS_START; 
    } 
    
    ENTRY(main)

    The memory regions are defined in the MEMORY block. The starting addresses and lengths of the flash and SRAM regions are taken from the datasheet: the flash memory is 64K and the SRAM is 20K. Next the data segments are defined in the SECTIONS block. Some useful memory-address symbols are defined, such as the top of the stack, the position in flash memory right after the text and rodata segments (FLASH_DATA_START) and some similar symbols for SRAM. Those symbols will be used by the startup program to initialize and load data.

    By using the “AT” keyword a load address can be specified, that is, data can be stored in one part of the binary and used from another address at run time. This way all data is flashed onto the flash memory, but the data segment is relocated to SRAM at startup. This is something that the startup routine needs to do.
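
To make the load-address versus run-address idea concrete, here is a host-side sketch in C of the work the startup routine has to do. The arrays stand in for flash and SRAM, and every name in it (flash_image, sram_data, sram_bss, simulated_startup and the two helpers) is hypothetical. On the target the same loops would run over the regions delimited by the linker-script symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-ins for the two memories. */
static const uint32_t flash_image[3] = {0x11111111u, 0x22222222u, 0x33333333u}; /* load address */
static uint32_t sram_data[3];                                   /* run address of .data */
static uint32_t sram_bss[2] = {0xDEADBEEFu, 0xDEADBEEFu};       /* pretend power-up garbage */

static_assert(sizeof(flash_image) == sizeof(sram_data), "load and run images must match");

/* Copy `size` bytes, one 32-bit word at a time, from load address to run address. */
static void copy_data(const uint32_t *load, uint32_t *run, uint32_t size) {
    for (uint32_t i = 0; i < size / 4u; i++) {
        run[i] = load[i];
    }
}

/* Zero `size` bytes of .bss, one 32-bit word at a time. */
static void zero_bss(uint32_t *start, uint32_t size) {
    for (uint32_t i = 0; i < size / 4u; i++) {
        start[i] = 0u;
    }
}

/* Mirrors what a reset handler does before it calls main(). */
static void simulated_startup(void) {
    copy_data(flash_image, sram_data, (uint32_t)sizeof(sram_data));
    zero_bss(sram_bss, (uint32_t)sizeof(sram_bss));
}
```

The sketch assumes sizes that are multiples of four bytes. A real startup routine either relies on the linker script to align the segments or handles the remainder separately.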

    Startup routine

    The startup routine is needed for a few tasks. It defines the custom “vectors” segment containing the vector table mentioned in the previous section, and it defines the symbols whose addresses fill that table’s entries, i.e. the exception handler functions. It also copies initialized data from flash memory to SRAM and sets the variables in the bss segment to zero (for example, global variables in C without an explicit initializer must start out as zero).

    The image below shows the format of the vector table. The beginning of the startup routine defines this data segment.

    vector_table.svg

    The format of the vector table is available from the Cortex-M3 Devices Generic User Guide. The vector table begins with the stack pointer reset value, followed by the addresses of the exception handlers.
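
The layout of the first 16 entries can be written down as a C struct to double-check the offsets. This is a sketch for illustration only: the type and field names are hypothetical, and the post itself builds the table in assembly rather than C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct mirroring the first 16 words of the Cortex-M3 vector
 * table. Every entry is a 32-bit value on the target. */
typedef struct {
    uint32_t initial_sp;     /* 0x00: stack pointer reset value */
    uint32_t reset;          /* 0x04 */
    uint32_t nmi;            /* 0x08 */
    uint32_t hard_fault;     /* 0x0C */
    uint32_t mem_manage;     /* 0x10 */
    uint32_t bus_fault;      /* 0x14 */
    uint32_t usage_fault;    /* 0x18 */
    uint32_t reserved_a[4];  /* 0x1C to 0x28 */
    uint32_t sv_call;        /* 0x2C */
    uint32_t debug_monitor;  /* 0x30 */
    uint32_t reserved_b;     /* 0x34 */
    uint32_t pend_sv;        /* 0x38 */
    uint32_t sys_tick;       /* 0x3C */
} vector_table_t;

static_assert(offsetof(vector_table_t, sv_call) == 0x2C, "SVCall is at offset 0x2C");
static_assert(offsetof(vector_table_t, sys_tick) == 0x3C, "SysTick is at offset 0x3C");
static_assert(sizeof(vector_table_t) == 16 * 4, "16 entries of 4 bytes each");

/* Byte offset of exception number n within the table (1 = reset, 15 = SysTick). */
static uint32_t vector_offset(uint32_t exception_number) {
    return exception_number * 4u;
}
```

Device-specific interrupt vectors follow directly after the SysTick entry, which is why they can be appended to the table later.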

    The startup routine begins with some assembler directives that specify the syntax, the architecture and the use of the Thumb instruction set. Then the vectors segment is defined by allocating long (32-bit) values contiguously. I’ve only included the generic vectors so far. More device-specific vectors need to be added later to implement useful hardware interrupts, such as an interrupt handler that executes code when a timer reaches a certain value.

    startup.s
    /* Startup code for a STM32F103C8
     * Robin Isaksson 2017
     */ 
    .syntax unified
    .arch armv7-m 
    .thumb 
    
    /* Initial vector table */
    .section "vectors" 
        .long    _STACK_TOP
        .long    _reset_Handler
        .long    _NMI_Handler
        .long    _HardFault_Handler
        .long    _MemManage_Handler
        .long    _BusFault_Handler
        .long    _UsageFault_Handler
        .long    0 /* Reserved */
        .long    0 /* Reserved */
        .long    0 /* Reserved */
        .long    0 /* Reserved */
        .long    _SVCall_Handler
        .long    _DebugMonitor_Handler
        .long    0 /* Reserved */
        .long    _PendSV_Handler
        .long    _SysTick_Handler
    
    /* The macro below defines each entry as a weak symbol which can be overridden by
     * defining a function with the same name in a C file (for example) */
    .macro                 def_rewritable_handler   handler 
        .thumb_func
        .weak    \handler
        .type    \handler, %function
        \handler:   b  . @@ Branch forever in default state
    .endm
           
    def_rewritable_handler  _NMI_Handler
    def_rewritable_handler  _HardFault_Handler
    def_rewritable_handler  _MemManage_Handler
    def_rewritable_handler  _BusFault_Handler
    def_rewritable_handler  _UsageFault_Handler
    def_rewritable_handler  _SVCall_Handler
    def_rewritable_handler  _DebugMonitor_Handler
    def_rewritable_handler  _PendSV_Handler
    def_rewritable_handler  _SysTick_Handler

    Next follows the actual startup program logic. I won’t go into the details of the assembler logic. The program copies data to SRAM using the symbols from the linker script, initializes the variables in the bss segment to zero and then calls the main function. The main function is not defined yet and will be defined in a C program. The “_reset_Handler” symbol from the vector table is defined here, which means that whenever the reset handler is called the startup routine runs again. This is useful as the device will be in a predictable state when it resets.

    startup.s continued
    /* Startup code */
    .text
    .thumb_func
    _reset_Handler: 
    
            /* Copy data from flash to RAM */
    _startup: 
            @@ Copy data to RAM 
            ldr     r0, =FLASH_DATA_START
            ldr     r1, =SRAM_DATA_START
            ldr     r2, =SRAM_DATA_SIZE 
            @@ If the data size is zero then do not copy 
            cmp     r2, #0 
            beq     _BSS_init 
            /* Initialize data by loading it to SRAM */
            ldr     r3, =0 @@ i = 0 
    _RAM_copy: 
            ldr    r4, [r0, r3] @@ r4 = FLASH_DATA_START[i] 
            str    r4, [r1, r3] @@ SRAM_DATA_START[i] = r4 
            adds    r3, #4       @@ i += 4 (advance one word) 
            cmp     r3, r2       @@ compare i with SRAM_DATA_SIZE 
            blt     _RAM_copy    @@ loop again while i < SRAM_DATA_SIZE 
    
            /* Initialize uninitialized variables to zero (required for C) */
    _BSS_init: 
            ldr     r0, =BSS_START 
            ldr     r1, =BSS_END
            ldr     r2, =BSS_SIZE 
            /* If BSS size is zero then do not initialize */
            cmp     r2, #0
            beq     _end_of_init
            /* Initialize BSS variables to zero */
            ldr     r3, =0x0 @@ i = 0 
            ldr     r4, =0x0 @@ r4 = 0 
    _BSS_zero: 
            str     r4, [r0, r3] @@ BSS_START[i] = 0
            adds    r3, #4       @@ i += 4 (advance one word)
            cmp     r3, r2       @@ compare i with BSS_SIZE
            blt     _BSS_zero        @@ loop again while i < BSS_SIZE 
    
            /* Here control is given to the C-main function. Take note that
             * no heap is initialized. */
    _end_of_init:   bl   main 
    _halt:  b       _halt 
    
    .end

    Hello world

    The hello world program is written in C and will simply cycle an output pin between low and high states. I’ll be using a small LED connected to that pin, and it will be possible to see it blink when the program is running. Turning an LED on and off is the equivalent of hello world in embedded development.

    I’m including the library header file called “stm32f10x.h” since it contains preprocessor macros for register memory addresses and defines some helpful structs to work with registers. This makes the code far more readable, and no tedious copying of register addresses from the datasheet is required.
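
The idea behind such a header can be sketched as follows: a struct whose fields sit at the same offsets as the hardware registers is overlaid on the peripheral’s base address. The type name gpio_regs_t and the helper set_pin_high are hypothetical and only loosely modelled on what the real header provides. The register order and the port C base address 0x40011000 are taken from ST’s reference manual as I understand it, so verify them there before relying on them. To keep the sketch runnable on a host, the pointer refers to ordinary memory instead of the real peripheral.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct mirroring the STM32F1 GPIO register block. */
typedef struct {
    volatile uint32_t CRL;   /* 0x00: configuration, pins 0-7  */
    volatile uint32_t CRH;   /* 0x04: configuration, pins 8-15 */
    volatile uint32_t IDR;   /* 0x08: input data               */
    volatile uint32_t ODR;   /* 0x0C: output data              */
    volatile uint32_t BSRR;  /* 0x10: bit set/reset            */
    volatile uint32_t BRR;   /* 0x14: bit reset                */
    volatile uint32_t LCKR;  /* 0x18: configuration lock       */
} gpio_regs_t;

static_assert(offsetof(gpio_regs_t, CRH) == 0x04, "CRH offset");
static_assert(offsetof(gpio_regs_t, BSRR) == 0x10, "BSRR offset");
static_assert(offsetof(gpio_regs_t, BRR) == 0x14, "BRR offset");

/* On the target this would be ((gpio_regs_t *)0x40011000) for port C.
 * Here it points at a plain variable so the code can run anywhere. */
static gpio_regs_t fake_port_c;
static gpio_regs_t *const PORT_C = &fake_port_c;

/* Writing a 1 to bit n of BSRR drives pin n high on real hardware. */
static void set_pin_high(gpio_regs_t *port, unsigned pin) {
    port->BSRR = 1u << pin;
}
```

The volatile qualifier matters on real hardware: it stops the compiler from caching or dropping accesses to registers that change outside the program’s control.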

    main.c
    /* LED blinker for STM32F103C8
     * Robin Isaksson 2017
     */
    #include <stm32f10x.h> 
    #include <stdint.h>
    
    void delay(void);
    
    int main(void) {
        /* Enable clock to IO port C */
        RCC->APB2ENR |= RCC_APB2ENR_IOPCEN;
        /* Set pin C13 as an output */
        GPIOC->CRH &= ~GPIO_CRH_MODE13; //Clear bits
        GPIOC->CRH |= GPIO_CRH_MODE13;  //Output mode, max 50 MHz
        GPIOC->CRH &= ~GPIO_CRH_CNF13;  //GPIO output push-pull
    
        while(1) {
            GPIOC->BSRR = (1U << 13U); //Set pin HIGH
            delay();
            GPIOC->BRR = (1U << 13U); //Reset pin to LOW
            delay();
        }
    }
    
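    /* Primitive delay function. The counters are volatile so that the empty
     * loops are not removed if compiler optimization is enabled. */
    void delay(void) {
        volatile uint8_t i; 
        volatile uint8_t j;
    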
        for (i = 0; i < 0xFF; i++) {
            for (j = 0; j < 0xFF; j++) { 
            }
        } 
    }

    The code is simple. The clock to general purpose input/output port C is first enabled and then pin 13 of the port is configured as an output. In an infinite while loop pin 13 is repeatedly set high and then reset back to low, with a delay in between. The delay function is a pair of nested loops that count to a fixed number.
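
The pin configuration boils down to a little bit arithmetic on the CRH register, which can be tried out on a host. This sketch is not the code from main.c: it clears the whole 4-bit field of the pin in one step and then sets the mode bits, which should give the same result as the three statements above. The function names are hypothetical, and the CRH reset value 0x44444444 used in the example below is taken from the reference manual as I understand it.

```c
#include <assert.h>
#include <stdint.h>

/* Each pin owns a 4-bit field in CRL/CRH: MODE in the low two bits, CNF in
 * the high two. For pins 8-15 the field lives in CRH at (pin - 8) * 4. */
static uint32_t crh_shift(unsigned pin) {
    assert(pin >= 8u && pin <= 15u);
    return (pin - 8u) * 4u;
}

/* Return a CRH value with `pin` set to push-pull output, max 50 MHz
 * (MODE = 0b11, CNF = 0b00), leaving all other pins untouched. */
static uint32_t crh_push_pull_50mhz(uint32_t crh, unsigned pin) {
    uint32_t shift = crh_shift(pin);
    crh &= ~(0xFu << shift);  /* clear MODE and CNF for this pin */
    crh |= 0x3u << shift;     /* MODE = 0b11, CNF stays 0b00     */
    return crh;
}
```

Starting from 0x44444444, calling crh_push_pull_50mhz for pin 13 changes only the field at bits 23:20, from 0x4 (floating input) to 0x3.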

    Compiling and flashing the program

    To compile the project the ARM-GCC toolchain is used. I’ve written a makefile that compiles all source files separately and then links them into an ELF file, which is then converted to Intel HEX format, ready to be flashed to the microcontroller. I’m using the “ST-Link” programmer which has a command-line program called “st-flash” that is used to flash the microcontroller.

    # The output- and intermediary files will be named $(TARGET) 
    TARGET = blinky
    DEBUG = -g
    
    TOOLCHAIN = arm-none-eabi
    CC = $(TOOLCHAIN)-gcc
    AS = $(TOOLCHAIN)-as
    LD = $(TOOLCHAIN)-ld
    OCP = $(TOOLCHAIN)-objcopy
    GDB = $(TOOLCHAIN)-gdb
    
    LDSCRIPT = $(wildcard *.ld)
    SRCS = $(wildcard *.s)
    SRCC = $(wildcard *.c)
    OBJS = $(SRCS:.s=.o)
    OBJS +=$(SRCC:.c=.o)
    
    CFLAGS = -Wall -Wextra -mcpu=cortex-m3 -mthumb -c -Ilib $(DEBUG)
    AFLAGS = -g -mcpu=cortex-m3 -mthumb
    LDFLAGS = -g -T $(LDSCRIPT) 
    OCPFLAGS_HEX = -O ihex
    
    default: $(TARGET).hex
    
    %.o: %.s	
    	$(AS) $(AFLAGS) $< -o $@
    
    %.o: %.c	
    	$(CC) $(CFLAGS) $< -o $@ 
    
    $(TARGET).elf: $(OBJS) $(LDSCRIPT)
    	$(LD) $(LDFLAGS) $(OBJS) -o $@
    
    $(TARGET).hex: $(TARGET).elf
    	$(OCP) $(OCPFLAGS_HEX) $< $@
    
    clean:
    	rm -f ./*.o ./*.elf ./*.bin ./*.syms ./*.hex
    
    symbols: $(TARGET).elf
    	$(TOOLCHAIN)-nm -n $<
    
    flash:	$(TARGET).hex
    	st-flash --format ihex write $(TARGET).hex
    
    debug:	$(TARGET).elf 
    	$(GDB) --eval-command="target extended-remote :4242" $(TARGET).elf

    The makefile provides the following targets:

  • make builds the project and produces the output .hex file that is ready to be flashed.

  • make flash tries to flash the program to the microcontroller using the ST-Link utilities.

  • make symbols outputs the symbols in the .elf file, which is useful for debugging.

  • make debug tries to run the GNU debugger with the program.

    The next step is creating a useful program. Take a look at my GitHub repository for code and libraries. I have continued this work and written:

  • USART implementation

  • I2C implementation

  • USB HID implementation

  • Use of hardware timers

  • Asynchronous programs using interrupt handlers

  • Setup with a 72 MHz system clock using PLL with a crystal oscillator

    A more advanced project using all of these components and libraries is my USB inertial headtracker.