ST Edge AI Core

STM32 Arm® Cortex® M - Relocatable binary (or runtime loadable) model support

for STM32 target, based on ST Edge AI Core Technology 3.0.0

r2.6

Introduction

What is a relocatable binary model (runtime loadable model)?

A relocatable binary (or runtime-loadable) model designates a binary object that can be installed and executed anywhere in an STM32 memory subsystem. It contains a compiled version of the generated neural network (NN) C-files, including the requested forward kernel functions and the weights. The principal objective is to provide a flexible way to upgrade an AI-based application without re-generating and flashing the entire end-user firmware. This is the primary element used, for example, in firmware over-the-air (FOTA) technology.

The generated binary object is a lightweight plug-in. It can execute from any address (position-independent code) and store its data anywhere in memory (position-independent data). A simple and efficient AI relocatable runtime enables its instantiation and usage. No complex or resource-intensive dynamic linker for the Arm® Cortex®-M MCU is embedded in the STM32 firmware. The generated object is a self-contained entity, and no external symbols or functions are required at runtime.

Runtime-loadable model

In this article, the term “static” approach refers to the scenario where the generated neural network (NN) C-files are compiled and linked with the end-user application stack.

Limitations

  • No support for managing the state of Keras stateful LSTM/GRU layers.
  • No support for the STM32 series based on an Arm® Cortex®-M0 or Cortex®-M0+ core (STM32L0, STM32F0, STM32G0).
  • Initial support for custom layers: only self-contained C-files are supported. Lambda layers are supported.

Comparison with LiteRT for Microcontrollers solution

The TensorFlow Lite (LiteRT) for Microcontrollers environment also provides a way to upgrade an AI-based application. The TFLite converter utility allows deploying a network and its associated parameters through a simple container: a TFLite file (*.tflite). Based on the FlatBuffers technology, it is interpreted at run time to create an executable instance. The main difference is that the code of the forward kernel functions and the associated interpreter must already be available in the initial firmware image. With the ST Edge AI Core relocatable solution, the code of the kernels is also embedded in the container.

Getting started

Generating a relocatable binary model

To build a relocatable binary file for a given STM32 series, the --relocatable/--reloc/-r option is used with the generate command. Note that the specific options (to compress the model or to place the I/O buffers in the activations buffer) should always be applied exactly as for the “standard” approach.

Important

A GNU Arm Embedded toolchain (arm-none-eabi- prefix) must be available in the PATH before launching the command.

$ stedgeai generate -m <model_file_path> <gen_options> --relocatable --target stm32h7

ST Edge AI Core v3.0.0
Used root dir: $STEDGEAI_CORE_DIR\Middlewares\ST\AI

Generating files for relocatable binary model..
...

 Runtime Loadable Model - Memory Layout (series="stm32h7")
 --------------------------------------------------------------------------------

 v2.0 'Embedded ARM GCC' cpuid=C27 fpu=True float-abi=2

 XIP size             = 7,280      data(6,708)+got(268)+bss(300) sections
 COPY size            = 46,812     +ro(39,532) sections
 extra sections       = 2,348      got:268(3.7%) header+rel:2,080(5.3%)
 params size          = 24,368     included in the binary file
 acts size            = 18,252
 binary file size     = 72,860
 params file size     = 0

 Generated files (10)
 --------------------------------------------------------------------------------
 <output-directory-path>\network_data_params.h
 <output-directory-path>\network_data_params.c
 <output-directory-path>\network_data.h
 <output-directory-path>\network_data.c
 <output-directory-path>\network_config.h
 <output-directory-path>\network.h
 <output-directory-path>\network.c
 <output-directory-path>\network_rel.bin
 <output-directory-path>\ai_reloc_network.c
 <output-directory-path>\ai_reloc_network.h

Creating report file <output-directory-path>\network_generate_report.txt

Options

The --relocatable/--reloc/-r option can be used with the following parameters:

 parameter     description
 ------------  ---------------------------------------------------------------
 <none>        Default. A single binary file is generated:
               network + kernels + weights.
 "split"       Two binary files are generated: one with the model
               (network + kernels) and another with only the weights.
 "gen-c-file"  The binary file is also generated as a C-array (for debugging
               purposes).

The parameters can be combined using a comma separator.

$ stedgeai generate -m <model_file_path> <gen_options> --relocatable split,gen-c-file --target stm32h7
...
 <output-directory-path>\network_data.bin
 <output-directory-path>\network_img_rel.c
 <output-directory-path>\network_img_rel.h

Supported STM32 series

 supported series                 description
 -------------------------------  --------------------------------------------
 stm32f4/stm32f3/stm32g4/stm32wb  Default series. All STM32F4xx/STM32F3xx/
                                  STM32G4xx/STM32WBxx devices with an Arm®
                                  Cortex®-M4 core and FPU support enabled
                                  (single precision).
 stm32l4/stm32l4r                 All STM32L4xx/STM32L4Rxx devices with an Arm®
                                  Cortex®-M4 core and FPU support enabled
                                  (single precision).
 stm32l5/stm32u5/stm32u3          All STM32L5xx/STM32U5xx/STM32U3xx devices
                                  with an Arm® Cortex®-M33 core and FPU support
                                  enabled (single precision).
 stm32f7                          All STM32F7xx devices with an Arm® Cortex®-M7
                                  core and FPU support enabled (single
                                  precision).
 stm32h7                          All STM32H7xx devices with an Arm® Cortex®-M7
                                  core and FPU support enabled (double
                                  precision).
 stm32n6                          All STM32N6xx devices with an Arm®
                                  Cortex®-M55 core.

Generated files

 file                   description
 ---------------------  -----------------------------------------------------
 <network>_rel.bin      Main binary file (that is, the “relocatable binary
                        model”). It contains the compiled version of the
                        model, including the used forward kernel functions
                        and, by default, the weights. It also embeds the
                        additional sections (.header/.got/.rel) required to
                        install the model.
 <network>_data.bin     Optional file. If the 'split' parameter is used, the
                        weights are generated in a separate binary file.
 ai_reloc_network.c/.h  AI relocatable runtime API files. These files are
                        copied from the ST Edge AI Core pack and are not
                        specific to the generated model. They should be
                        compiled with the application files to use the
                        relocatable binary model.

Debug/test purpose

 file                    description
 ----------------------  ----------------------------------------------------
 <network>.c/.h          For debugging purposes - generated network C-files
                         used to build the relocatable binary model.
 <network>_data.c/.h     For debugging purposes - generated network data
                         C-files used to build the relocatable binary model.
                         <network>_data.c is an empty file.
 <network>_img_rel.c/.h  For debugging purposes - they facilitate the
                         deployment of the relocatable binary model in a test
                         framework. They contain additional macros and a C
                         byte array (image of the binary file) which can be
                         used by the AI relocatable runtime API to install
                         and use the model. The --no-c-files option flag can
                         be used to avoid generating these additional files.
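
In a test framework, the C-array image can then be passed to the AI relocatable runtime API in place of a flash address. The following is a minimal sketch; the array symbol name network_img_rel is hypothetical, check the generated <network>_img_rel.h file for the actual symbol and helper macros.

#include "network_img_rel.h"  /* generated C-array image (debug/test only) */

/* 'network_img_rel' is a hypothetical symbol name: the C byte array is the
   image of the <network>_rel.bin file. */
ai_rel_network_info rt_info;
ai_error err = ai_rel_network_rt_get_info((const void *)network_img_rel,
                                          &rt_info);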

Memory layout information

The reported memory layout information complements the provided ROM and RAM memory size metrics (refer to “Memory-related metrics” section) with the AI memory resources, including the specific AI code and data sections required to run the AI stack. Apart from the additional sections used to manage the relocatable binary model (.header, .got and .rel sections), the size of the other sections is similar to the static code generation approach, where the neural network (NN) C-files are compiled and linked with the application code. The requested size for the I/O buffers is not reported here.

MCU AI memory layout for runtime loadable model

The following table summarizes the differences in terms of memory layout (in bytes) between the static and relocatable approaches. The size of the network/kernels sections depends on the topology complexity (number of nodes) and on the different forward kernel functions which are used. Activations and weights are always the same.

 AI object                static   reloc    typically placed in
 -----------------------  -------  -------  -------------------------------
 activations              18,252   18,252   RAM type (rw), .bss section
 weights                  24,368   24,368   FLASH type (ro), .rodata section
 network/kernels (FLASH)  41,966   46,512   FLASH type (rx), .text/.rodata/
                                            (.data) sections
 network/kernels (RAM)    6,212    7,280    RAM type (rw), .data/.bss
                                            sections

The XIP and COPY sizes indicate the requested size of the RAM region used to install the model (in COPY mode, this region must be executable).

Final requested memory layout

 memory type  total size                             comment
 -----------  -------------------------------------  ------------------------
 FLASH        binary file size (+ params file size)  Contains the model
                                                     code/kernels and the
                                                     weights. May be split
                                                     into one or two
                                                     non-volatile memory
                                                     chunks depending on the
                                                     configuration (for
                                                     example, 'split' option).
 RAM          XIP or COPY size + acts size           RAM region where the
              (+ IO buffer size)                     model is installed
                                                     (either execute-in-place
                                                     or copied), plus the
                                                     activations buffer.
                                                     Note: the requested
                                                     stack size is not
                                                     included.

Notes:

  • The code/data size related to the ai_reloc_network.c/.h files is considered negligible and is therefore not included in the size estimation.
  • The IO buffer size depends on the application’s input/output buffering requirements and may vary.
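
As an illustration, applying this layout to the example report above (stm32h7 model, IO buffer size excluded) gives the following budget:

 FLASH           = binary file size + params file size = 72,860 + 0      = 72,860 bytes
 RAM (XIP mode)  = XIP size + acts size                = 7,280 + 18,252  = 25,532 bytes
 RAM (COPY mode) = COPY size + acts size               = 46,812 + 18,252 = 65,064 bytes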

Upgrading the firmware image

The firmware upgrade process (flashing the new image) is application-dependent and out of scope for this article. It is assumed that the relocatable binary model file (<network>_rel.bin) has already been flashed into a memory-mapped region accessible by the MCU. To run a relocatable model, a specific AI relocatable runtime API is required to install and use it. This API is provided as a simple C source file within the ST Edge AI Core pack and should be integrated during the generation of the firmware. Note that only the AI runtime header files are required; the network_runtime.a library is not necessary.

# core/FPU flags for the target series (here, a Cortex-M4 based device)
CFLAGS += -mcpu=cortex-m4 -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard

# AI relocatable runtime source file
C_SOURCES += $STEDGEAI_CORE_DIR/Middlewares/ST/AI/Reloc/Src/ai_reloc_network.c

# AI runtime header files
CFLAGS += -I$STEDGEAI_CORE_DIR/Middlewares/ST/AI/Inc
CFLAGS += -I$STEDGEAI_CORE_DIR/Middlewares/ST/AI/Reloc/Inc

Creating an instance

Once the model (binary file) has been updated inside the STM32 device at address BIN_ADDRESS, the following code sequence can be used to create and install an instance of the generated model.

#include <ai_reloc_network.h>

ai_error err;
ai_rel_network_info rt_info;

err = ai_rel_network_rt_get_info(BIN_ADDRESS, &rt_info);

This retrieves part of the meta-information embedded in the header of the binary.

...
printf("Load a relocatable binary model, located at the address 0x%08x\r\n",
       (int)BIN_ADDRESS);
printf(" model name                : %s\r\n", rt_info.c_name);
printf(" weights size              : %d bytes\r\n", (int)rt_info.weights_sz);
printf(" activations size          : %d bytes (minimum)\r\n", (int)rt_info.acts_sz);
printf(" compiled for a Cortex-Mx  : 0x%03X\r\n",
         (int)AI_RELOC_RT_GET_CPUID(rt_info.variant));
printf(" FPU should be enabled     : %s\r\n",
         AI_RELOC_RT_FPU_USED(rt_info.variant)?"yes":"no");
printf(" RT RAM minimum size       : %d bytes (%d bytes in COPY mode)\r\n",
        (int)rt_info.rt_ram_xip,
        (int)rt_info.rt_ram_copy);
...

To create an executable instance of the C-model, a dedicated memory buffer (also called AI RT RAM) should be provided. The minimum requested size depends on the model and on the execution mode. For the XIP execution mode (AI_RELOC_RT_LOAD_MODE_XIP), only a buffer (read-write memory-mapped region) for the data sections is requested (minimum size = rt_info.rt_ram_xip). Note that the allocated buffer should be 4-byte aligned. For the COPY execution mode (AI_RELOC_RT_LOAD_MODE_COPY), a minimum size of rt_info.rt_ram_copy is requested so that the code sections can also be copied. In this last case, the provided memory region must be executable.

ai_error err;
ai_handle net = AI_HANDLE_NULL;

uint8_t *rt_ai_ram = malloc(rt_info.rt_ram_xip);

err = ai_rel_network_load_and_create(BIN_ADDRESS, rt_ai_ram, rt_info.rt_ram_xip,
                                     AI_RELOC_RT_LOAD_MODE_XIP, &net);
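
For the COPY execution mode, the same call can be used with the larger buffer size. A minimal sketch is shown below; RT_EXEC_RAM_ADDR is a hypothetical, application-defined address of a 4-byte aligned, executable RAM region of at least rt_info.rt_ram_copy bytes (for example, a low-latency SRAM bank).

/* COPY mode sketch: RT_EXEC_RAM_ADDR is application-defined (hypothetical) */
uint8_t *rt_exec_ram = (uint8_t *)RT_EXEC_RAM_ADDR;

err = ai_rel_network_load_and_create(BIN_ADDRESS, rt_exec_ram,
                                     rt_info.rt_ram_copy,
                                     AI_RELOC_RT_LOAD_MODE_COPY, &net);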

Before the instance is installed and set up, the compatibility between the STM32 platform and the provided binary is verified: the Cortex-Mx ID is checked, and the FPU must be enabled if the binary requests it. If everything is OK, an instance of the model is ready to be initialized and a handle is returned (net parameter).
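
A minimal error check after the call might look as follows (AI_ERROR_NONE is the generic “no error” type defined in ai_platform.h):

if (err.type != AI_ERROR_NONE) {
  /* the binary is invalid or not compatible with this platform:
     do not use the returned handle */
  printf("E: unable to install the model (type=0x%x, code=0x%x)\r\n",
         (int)err.type, (int)err.code);
}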

As for the “static” approach, the next step is to complete the internal data structure with the activations buffer and the weights buffer. Only the addresses of the associated buffers need to be provided. If the weights are loaded as a separate file ('split' parameter), WEIGHTS_ADDRESS indicates the location where the weights have been placed.

ai_handle weights_addr;
ai_bool res;

ai_handle act_addr = malloc(rt_info.acts_sz);

if (rt_info.weights)
  weights_addr = rt_info.weights;
else
  weights_addr = WEIGHTS_ADDRESS;

res = ai_rel_network_init(net, &weights_addr, &act_addr);

At this stage, the instance is fully ready to be used. To retrieve all the attributes of the instantiated model, the ai_rel_network_get_info() function can be used.

ai_bool res;
ai_network_report net_info;

res = ai_rel_network_get_info(net, &net_info);

To avoid allocating the model-dependent memory regions through a system heap, a pre-allocated memory region can be used (AI_RT_ADDR address).

rt_ai_ram = (uint8_t *)AI_RT_ADDR;
act_addr = rt_ai_ram + AI_RELOC_ROUND_UP(rt_info.rt_ram_xip);
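
Alternatively, the region can be reserved at build time. A sketch, assuming AI_RT_RAM_SIZE is a hypothetical, application-defined constant at least equal to the rounded-up rt_ram_xip size plus the activations size:

/* application-defined, 4-byte aligned buffer (AI_RT_RAM_SIZE is hypothetical) */
static uint8_t ai_rt_ram[AI_RT_RAM_SIZE] __attribute__((aligned(4)));

rt_ai_ram = ai_rt_ram;
act_addr  = rt_ai_ram + AI_RELOC_ROUND_UP(rt_info.rt_ram_xip);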

Note

Whereas the “static” approach allows only one instance at a time, there is no limitation here on the number of instances created for the same generated model. Each instance can be created with its own AI RT RAM area and is initialized with its own activations buffer; concurrent use cases can therefore be implemented without specific synchronization, as shown in the sketch below.
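
A sketch of two concurrent instances of the same binary, each with its own AI RT RAM (rt_ram_a/rt_ram_b are hypothetical, distinct buffers of rt_info.rt_ram_xip bytes; buffer allocation is omitted for brevity):

ai_handle net_a = AI_HANDLE_NULL, net_b = AI_HANDLE_NULL;

err = ai_rel_network_load_and_create(BIN_ADDRESS, rt_ram_a, rt_info.rt_ram_xip,
                                     AI_RELOC_RT_LOAD_MODE_XIP, &net_a);
err = ai_rel_network_load_and_create(BIN_ADDRESS, rt_ram_b, rt_info.rt_ram_xip,
                                     AI_RELOC_RT_LOAD_MODE_XIP, &net_b);
/* each instance is then initialized with its own activations buffer */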

Running an inference

The function to run an inference is fully similar to the “static” case. The following code snippet illustrates the case where the generated model is defined with a single input tensor and a single output tensor.

static int ai_run(void *data_in, void *data_out)
{
  ai_i32 batch;

  ai_buffer *ai_input = net_info.inputs;
  ai_buffer *ai_output = net_info.outputs;

  ai_input[0].data = AI_HANDLE_PTR(data_in);
  ai_output[0].data = AI_HANDLE_PTR(data_out);

  batch = ai_rel_network_run(net, ai_input, ai_output);
  if (batch != 1) {
    ai_log_err(ai_rel_network_get_error(net),
        "ai_rel_network_run");
    return -1;
  }

  return 0;
}
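
A possible usage, assuming net and net_info have been set up as in the previous sections and that the application allocates the IO buffers itself (MODEL_IN_SIZE/MODEL_OUT_SIZE are hypothetical, model-dependent constants):

static float data_in[MODEL_IN_SIZE];
static float data_out[MODEL_OUT_SIZE];

if (ai_run(data_in, data_out) == 0) {
  /* data_out[] contains the predicted values */
}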

Tip

Properties of the input or output tensors are fully accessible through the ai_network_report struct, as for the “static” approach (refer to the “IO tensor description” [API] section). Payloads can be allocated in the activations buffer without restriction.

Generation flow

The following figure illustrates the flow to generate a relocatable binary model. The first step imports the model and generates the NN C-files; it is identical to the static model generation approach, except that the <network>_data.c/.h files are not fully generated. The second step compiles and links the generated NN C-files against a specific AI runtime library; this library is simply compiled with the relocatable options and embeds the requested mathematical and memcpy/memset functions. The last post-processing step generates the binary file by appending a specific section (.rel section) and various pieces of information which are used by the AI relocatable run-time API. The weights are appended as a .weights binary section at the end of the file.

Generation of the relocatable binary model

Note

The code is compiled exclusively with a GCC Arm® Embedded toolchain, using the -fpic and -msingle-pic-base options. The Arm® Cortex®-M r9 register is designated as the platform register for the global offset table (GOT); the AI relocatable runtime updates the r9 register before calling the code. The generated relocatable binary object is independent of the Arm® embedded toolchain used to build the end-user application. Consequently, for the same memory placement and hardware settings, the inference time remains consistent.

AI run time execution modes

XIP execution mode

This execution mode is the default use case, where the code and weight sections are stored in the flash memory. Regarding memory placement, this approach is similar to the static method. Only a read-write memory region is required to store the requested data and BSS sections.

XIP execution mode

COPY execution mode

This alternative execution mode must be considered when the weights are required to be stored in an external memory device because they cannot fit in the internal flash memory. Copying the code from a non-efficient executable memory region to a low-latency executable region significantly improves inference time. Note that the required AI RT RAM size is larger, and the associated memory region must be executable. Another limitation of the Cortex-M4-based architecture (no Core I/D cache available) is the contention caused by code and data memory accesses, which can degrade performance. To mitigate this limitation, the next use case must be considered.

COPY execution mode

XIP execution mode and separated weight binary file

This mode represents an optimal case where the weights must reside in an external memory device (<network>_data.bin file). This configuration requires a second internal or embedded flash region to store the code (<network>_rel.bin file). The benefit is to offload large weight data to external memory, freeing internal flash space. In this scenario, the critical code executes in place. However, the drawback is the need to manage two binary files during upgrades.

XIP execution mode with a separated weight binary file

AI relocatable run-time API

The proposed API, referred to as the AI relocatable runtime API, manages the relocatable binary model and is comparable to the embedded inference client API (refer to API) for the standard approach. Only the create and initialize functions have been enhanced to account for the specificities. All functions are prefixed with ai_rel_network_ and are independent of the C name of the model. They are defined and implemented in the ai_reloc_network.c/.h files, located in the $STEDGEAI_CORE_DIR/Middlewares/ST/AI/Reloc/ folder.

ai_rel_network_rt_get_info()

ai_error ai_rel_network_rt_get_info(const void* obj, ai_rel_network_info* rt);

Allows retrieving the dimensioning information required to instantiate a relocatable binary model.

  • {AI_ERROR_INVALID_HANDLE, AI_ERROR_CODE_INVALID_PTR} errors are returned if the referenced object is not valid (for example, invalid signature or the address is not aligned on 4 bytes).

The table below describes the fields available in the returned ai_rel_network_info C-struct:

 field                 description
 --------------------  ------------------------------------------------------
 c_name                Pointer to the user C-name of the generated model (for
                       debugging purposes).
 variant               32-bit word encoding the AI RT version, the requested
                       Cortex-M ID, and so on.
 code_sz               Code size in bytes (excluding the weights section).
 weights / weights_sz  Address/size (in bytes) of the weights section, if
                       available.
 acts_sz               Requested activations size (RAM metric) to run the
                       model.
 rt_ram_xip            Requested RAM size (in bytes) to install the model in
                       XIP mode.
 rt_ram_copy           Requested RAM size (in bytes) to install the model in
                       COPY mode.

ai_rel_network_load_and_create()

ai_error ai_rel_network_load_and_create(const void* obj, ai_handle ram_addr,
    ai_size ram_size, uint32_t mode, ai_handle* hdl);
ai_handle ai_rel_network_destroy(ai_handle hdl);

Creates and installs an instance of the relocatable binary model referenced by obj. A read-write (RW) memory buffer, defined by ram_addr and ram_size, must be provided to create the data sections (.data, .bss, .got) and resolve internal references during the relocation process. The mode parameter specifies the expected execution mode. The expected size for the AI RT RAM buffer can be retrieved with the ai_rel_network_rt_get_info() function.

 mode                        description
 --------------------------  ---------------------------------
 AI_RELOC_RT_LOAD_MODE_XIP   XIP execution mode is requested.
 AI_RELOC_RT_LOAD_MODE_COPY  COPY execution mode is requested.

  • ai_handle references a run-time context (opaque object) which must be used with the other functions.
  • Before creating the instance, the Cortex-M ID is verified. If requested, the function also checks that the FPU is enabled.

Note

If ram_addr and/or ram_size is NULL, a default allocation is performed through the system heap. This behavior can be overridden in the ai_reloc_network.c file (see the AI_RELOC_MALLOC macro definition).
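
When an instance is no longer needed, it can be released with ai_rel_network_destroy(). A sketch, assuming the function returns AI_HANDLE_NULL once the instance is released:

net = ai_rel_network_destroy(net);
/* 'net' is assumed to be AI_HANDLE_NULL here; the AI RT RAM buffer (if
   allocated by the application) can now be reused or freed */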

ai_rel_network_init()

ai_bool ai_rel_network_init(ai_handle hdl, const ai_handle *weights,
    const ai_handle *act);

Finalizes the initialization of the instance with the addresses of the weights and the activations buffer.

  • If the weights are stored in the relocatable binary object, ai_rel_network_rt_get_info() should be used to retrieve their address.
  • As for the “static” approach, one or multiple activations buffers should also be provided:
ai_handle weights_addr;
ai_handle activations;
...
const ai_handle acts[] = { activations };
res = ai_rel_network_init(net, &weights_addr, acts);
...

ai_rel_network_get_info()

ai_bool ai_rel_network_get_info(ai_handle hdl, ai_network_report* report);

Allows retrieving the run-time data attributes of an instantiated model. Refer to the ai_platform.h file for the details of the returned ai_network_report C-struct. It should be called after ai_rel_network_init().

ai_rel_network_get_error()

ai_error ai_rel_network_get_error(ai_handle hdl);

Returns the first error reported during the execution of an ai_rel_network_xxx() function.

  • See the ai_platform.h file for the list of the returned error types (ai_error_type) and associated codes (ai_error_code). Typical usage is shown below.
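
ai_error e = ai_rel_network_get_error(net);
if (e.type != AI_ERROR_NONE)
  printf("E: type=0x%x, code=0x%x\r\n", (int)e.type, (int)e.code);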

ai_rel_network_run()

ai_i32 ai_rel_network_run(ai_handle hdl, const ai_buffer* input, ai_buffer* output);

Performs an inference. The input and output buffer parameters (ai_buffer type) are used respectively to provide the input tensors and to store the predicted output tensors (refer to the “IO tensor description” [API] section). On success, the returned value corresponds to the number of processed batches.

ai_rel_platform_observer_register()

ai_bool ai_rel_platform_observer_register(ai_handle hdl,
    ai_observer_node_cb cb, ai_handle cookie, ai_u32 flags);
ai_bool ai_rel_platform_observer_unregister(ai_handle hdl,
    ai_observer_node_cb cb, ai_handle cookie);
ai_bool ai_rel_platform_observer_node_info(ai_handle hdl,
    ai_observer_node *node_info);

As for the “static” approach, these functions allow the registration of a user callback to be notified before or after the execution of a c-node. There are no restrictions on using the Platform Observer API with a relocatable binary model.
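
A minimal registration sketch is given below. The callback prototype and the AI_OBSERVER_POST_EVT flag are assumed to match the “static” Platform Observer API; refer to the embedded inference client API documentation for the exact definitions.

/* called by the AI runtime around each c-node execution (sketch) */
static ai_u32 node_cb(const ai_handle cookie, const ai_u32 flags,
                      const ai_observer_node *node)
{
  /* 'flags' indicates a pre- or post-execution event; return 0 to continue */
  (void)cookie; (void)flags; (void)node;
  return 0;
}
...
ai_rel_platform_observer_register(net, node_cb, NULL, AI_OBSERVER_POST_EVT);
...
ai_rel_platform_observer_unregister(net, node_cb, NULL);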