STM32 Arm® Cortex®-M - Relocatable binary (or runtime-loadable) model support
for STM32 target, based on ST Edge AI Core Technology 3.0.0
r2.6
Introduction
What is a relocatable binary model (runtime loadable model)?
A relocatable binary (or runtime-loadable) model designates a binary object that can be installed and executed anywhere in an STM32 memory subsystem. It contains a compiled version of the generated neural network (NN) C-files, including the requested forward kernel functions and the weights. The principal objective is to provide a flexible way to upgrade an AI-based application without re-generating and flashing the entire end-user firmware. This is the primary element used, for example, in firmware over-the-air (FOTA) technology.
The generated binary object is a lightweight plug-in. It can execute from any address (position-independent code) and store its data anywhere in memory (position-independent data). A simple and efficient AI relocatable runtime enables its instantiation and usage. No complex or resource-intensive dynamic linker for the Arm® Cortex®-M MCU is embedded in the STM32 firmware. The generated object is a self-contained entity, and no external symbols or functions are required at runtime.
In this article, the term “static” approach refers to the scenario where the generated neural network (NN) C-files are compiled and linked with the end-user application stack.
Limitations
- No support for managing the state of Keras stateful LSTM/GRU layers
- No support for STM32 series based on the Cortex-M0 or Cortex-M0+ core (STM32L0, STM32F0, STM32G0)
- Initial support for custom layers: only self-contained C-files are supported. Lambda layers are supported.
Comparison with LiteRT for Microcontrollers solution
The TensorFlow Lite (LiteRT) for Microcontrollers environment provides a way to upgrade an AI-based application. The TFLite converter utility allows deploying a network and its associated parameters through a simple container: a TFLite file (*.tflite file). Based on the flat buffer technology, it is interpreted at run-time to create an executable instance. The main difference is that the code of the forward kernel functions and the associated interpreter must already be available in the initial firmware image. With the ST Edge AI Core relocatable solution, the code of the kernels is also embedded in the container.
Getting started
Generating a relocatable binary model
To build a relocatable binary file for a given STM32 series, the --relocatable/--reloc/-r option is used with the generate command. Note that the specific options to compress the weights or to place the IO buffers in the activations buffer should always be applied, as for the “standard” approach.
Important
A GNU Arm Embedded toolchain (arm-none-eabi- prefix) must be available in the PATH before launching the command.
$ stedgeai generate -m <model_file_path> <gen_options> --relocatable --target stm32h7
ST Edge AI Core v3.0.0
Used root dir: $STEDGEAI_CORE_DIR\Middlewares\ST\AI
Generating files for relocatable binary model..
...
Runtime Loadable Model - Memory Layout (series="stm32h7")
--------------------------------------------------------------------------------
v2.0 'Embedded ARM GCC' cpuid=C27 fpu=True float-abi=2
XIP size = 7,280 data(6,708)+got(268)+bss(300) sections
COPY size = 46,812 +ro(39,532) sections
extra sections = 2,348 got:268(3.7%) header+rel:2,080(5.3%)
params size = 24,368 included in the binary file
acts size = 18,252
binary file size = 72,860
params file size = 0
Generated files (10)
--------------------------------------------------------------------------------
<output-directory-path>\network_data_params.h
<output-directory-path>\network_data_params.c
<output-directory-path>\network_data.h
<output-directory-path>\network_data.c
<output-directory-path>\network_config.h
<output-directory-path>\network.h
<output-directory-path>\network.c
<output-directory-path>\network_rel.bin
<output-directory-path>\ai_reloc_network.c
<output-directory-path>\ai_reloc_network.h
Creating report file <output-directory-path>\network_generate_report.txt
Options
The -r/--reloc/--relocatable option can be used with the following parameters:
| parameter | description |
|---|---|
| <none> | Default. A simple model binary file is generated: network + kernels + weights |
| “split” | Two binary files are generated: one with the model (network + kernels) and another only with the weights. |
| “gen-c-files” | A binary file is generated as a C-array (for debugging purposes). |
The parameters can be combined using a comma separator.
$ stedgeai generate -m <model_file_path> <gen_options> --relocatable split,gen-c-files --target stm32h7
...
<output-directory-path>\network_data.bin
<output-directory-path>\network_img_rel.c
<output-directory-path>\network_img_rel.h
- The --ihex/--address options can be used to generate a file in the Intel Hexadecimal Object File Format.
- The --split-weights and --copy-weights-at features are not supported with the --relocatable option.
Supported STM32 series
| supported series | description |
|---|---|
| stm32f4/stm32f3/stm32g4/stm32wb | Default series. All STM32F4xx/STM32F3xx/STM32G4xx/STM32WBxx devices with an Arm Cortex-M4 core and FPU support enabled (single precision). |
| stm32l4/stm32l4r | All STM32L4xx/STM32L4Rxx devices with an Arm Cortex-M4 core and FPU support enabled (single precision). |
| stm32l5/stm32u5/stm32u3 | All STM32L5xx/STM32U5xx/STM32U3xx devices with an Arm Cortex-M33 core and FPU support enabled (single precision). |
| stm32f7 | All STM32F7xx devices with an Arm Cortex-M7 core and FPU support enabled (single precision). |
| stm32h7 | All STM32H7xx devices with an Arm Cortex-M7 core and FPU support enabled (double precision). |
| stm32n6 | All STM32N6xx devices with an Arm Cortex-M55 core. |
Generated files
| file | description |
|---|---|
| <network>_rel.bin | Main binary file (that is, the “relocatable binary model”). It contains the compiled version of the model, including the used forward kernel functions and, by default, the weights. It also embeds the additional sections (.header/.got/.rel) required to be able to install the model. |
| <network>_data.bin | Optional file. If the 'split' parameter is used, the weights are generated in a separate binary file. |
| ai_reloc_network.c/.h | AI relocatable runtime API files. These files are copied from the ST Edge AI Core pack and are not specific to the generated model. They should be compiled with the application files to use the relocatable binary model. |
Debug/test purpose
| file | description |
|---|---|
| <network>.c/.h | For debugging purposes: generated network C-files which are used to generate the relocatable binary model. |
| <network>_data.c/.h | For debugging purposes: generated network data C-files which are used to generate the relocatable binary model. <network>_data.c is an empty file. |
| <network>_img_rel.c/.h | For debugging purposes: they facilitate the deployment of the relocatable binary model in a test framework. They contain additional macros and a C byte array (image of the binary file) which can be used by the AI relocatable runtime API to install and use the model. The --no-c-files option flag can be used to avoid generating these additional files. |
Memory layout information
The reported memory layout information complements the provided
ROM and RAM memory size metrics (refer to “Memory-related
metrics” section) with the AI memory resources, including
the specific AI code and data sections required to run the AI stack.
Apart from the additional sections used to manage the relocatable
binary model (.header, .got and
.rel sections), the size of the other sections is
similar to the static code generation approach, where the neural
network (NN) C-files are compiled and linked with the application
code. The requested size for the I/O buffers is not reported
here.
The following table summarizes the differences in terms of memory layout (in bytes) between the static and the relocatable approaches. The size of the network/kernels sections depends on the topology complexity (number of nodes) and on the different forward kernel functions which are used. Activations and weights are always the same.
| AI object | static | reloc | typically placed in |
|---|---|---|---|
| activations | 18,252 | 18,252 | RAM type (rw), .bss section |
| weights | 24,368 | 24,368 | FLASH type (ro), .rodata section |
| network/kernels (FLASH) | 41,966 | 46,512 | FLASH type (rx), .text/.rodata/(.data) sections |
| network/kernels (RAM) | 6,212 | 7,280 | RAM type (rw), .data/.bss sections |
The XIP and COPY sizes indicate the requested size of the RAM region used to install the model (in COPY mode, this region must be executable).
Final requested memory layout
| Memory type | Total size | Comment |
|---|---|---|
| FLASH | binary file size (+ params file size) | Contains model code/kernels and weights. May be split into one or two non-volatile memory chunks depending on configuration (e.g., split option). |
| RAM | XIP or COPY size + acts size (+ IO buffer size) | RAM region where the model is installed (either execute-in-place or copied), plus the activations buffer. Note: the requested stack size is not included in this calculation. |
Notes:
- The code/data size related to the ai_reloc_network.c/.h files is considered negligible and thus not included in the size estimation.
- IO buffer size depends on the application’s input/output buffering requirements and may vary.
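For example, applying this layout to the report shown above (stm32h7 target, XIP mode, weights embedded in the binary) gives the following estimate: FLASH = binary file size = 72,860 bytes; RAM = XIP size + acts size = 7,280 + 18,252 = 25,532 bytes, plus the IO buffers and the stack.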
Upgrading the firmware image
The firmware upgrade process (flashing the new image) is application-dependent and out of scope for this article. It is assumed that the relocatable binary model file (<network>_rel.bin) has already been flashed into a memory-mapped region accessible by the MCU. To run a relocatable model, a specific AI relocatable runtime API is required to install and use it. This API is provided as a simple C source file within the ST Edge AI Core pack and should be integrated when building the firmware. Note that only the AI runtime header files are required; the network_runtime.a library is not necessary.
CFLAGS += -mcpu=cortex-m4 -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard
C_SOURCES += $STEDGEAI_CORE_DIR/Middlewares/ST/AI/Reloc/Src/ai_reloc_network.c
CFLAGS += -I$STEDGEAI_CORE_DIR/Middlewares/ST/AI/Inc
CFLAGS += -I$STEDGEAI_CORE_DIR/Middlewares/ST/AI/Reloc/Inc
Creating an instance
Once the model (binary file) has been updated inside the STM32 device at the address BIN_ADDRESS, the following code sequence can be used to create and install an instance of the generated model.
#include <ai_reloc_network.h>
ai_error err;
ai_rel_network_info rt_info;
err = ai_rel_network_rt_get_info(BIN_ADDRESS, &rt_info);
This retrieves part of the meta-information embedded in the header of the binary.
...
printf("Load a relocatable binary model, located at the address 0x%08x\r\n",
(int)BIN_ADDRESS);
printf(" model name : %s\r\n", rt_info.c_name);
printf(" weights size : %d bytes\r\n", (int)rt_info.weights_sz);
printf(" activations size : %d bytes (minimum)\r\n", (int)rt_info.acts_sz);
printf(" compiled for a Cortex-Mx : 0x%03X\r\n",
(int)AI_RELOC_RT_GET_CPUID(rt_info.variant));
printf(" FPU should be enabled : %s\r\n",
AI_RELOC_RT_FPU_USED(rt_info.variant)?"yes":"no");
printf(" RT RAM minimum size : %d bytes (%d bytes in COPY mode)\r\n",
(int)rt_info.rt_ram_xip,
(int)rt_info.rt_ram_copy);
...
To create an executable instance of the C-model, a dedicated memory buffer (also called AI RT RAM) should be provided. The minimum requested size depends on the model and on the execution mode. For the XIP execution mode (AI_RELOC_RT_LOAD_MODE_XIP), only a buffer (read-write memory-mapped region) for the data sections is requested (minimum size = rt_info.rt_ram_xip). Note that the allocated buffer should be 4-byte aligned. For the COPY execution mode (AI_RELOC_RT_LOAD_MODE_COPY), the rt_info.rt_ram_copy minimum size is requested, to be able to also copy the code sections. In this last case, the provided memory region should be executable.
ai_error err;
ai_handle net = AI_HANDLE_NULL;
uint8_t *rt_ai_ram = malloc(rt_info.rt_ram_xip);
err = ai_rel_network_load_and_create(BIN_ADDRESS, rt_ai_ram, rt_info.rt_ram_xip,
                                     AI_RELOC_RT_LOAD_MODE_XIP, &net);
Before installing and setting up the instance, the compatibility of the provided binary with the STM32 platform is verified, checking the Cortex-Mx ID and whether the FPU is enabled (if requested by the binary). If all is OK, an instance of the model is ready to be initialized and a handle is returned (net parameter).
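For comparison, a minimal sketch of the COPY mode follows. The reloc_exec_ram buffer and its linker placement are illustrative assumptions; the provided region must be executable, 4-byte aligned, and at least rt_info.rt_ram_copy bytes.
/* Assumption: 'reloc_exec_ram' is placed by the linker script in an
   executable RAM region (for example, ITCM or a dedicated SRAM bank). */
extern uint8_t reloc_exec_ram[];
err = ai_rel_network_load_and_create(BIN_ADDRESS, reloc_exec_ram,
                                     rt_info.rt_ram_copy,
                                     AI_RELOC_RT_LOAD_MODE_COPY, &net);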
As for the “static” approach, the next step is to complete the internal data structure with the activations buffer and the weights buffer. Only the addresses of the associated buffers should be provided. If the weights are loaded as a separate file ('split' parameter), WEIGHTS_ADDRESS indicates the location where the weights have been placed.
ai_handle weights_addr;
ai_bool res;
uint8_t *act_addr = malloc(rt_info.acts_sz);
if (rt_info.weights)
weights_addr = rt_info.weights;
else
weights_addr = WEIGHTS_ADDRESS;
res = ai_rel_network_init(net, &weights_addr, &act_addr);
At this stage, the instance is fully ready to be used. To
retrieve all the attributes of the instantiated model, the ai_rel_network_get_report() function can be used.
ai_bool res;
ai_network_report net_info;
res = ai_rel_network_get_report(net, &net_info);
To avoid allocating the model-dependent memory regions through a system heap, a pre-allocated memory region can be used (AI_RT_ADDR address).
rt_ai_ram = (uint8_t *)AI_RT_ADDR;
act_addr = rt_ai_ram + AI_RELOC_ROUND_UP(rt_info.rt_ram_xip);
Note
While the “static” approach allows only one instance at a time, there is no limitation here on the number of instances created for the same generated model. Each instance can be created with its own AI RT RAM area and is initialized with its own activations buffer, so concurrent use cases can be implemented without specific synchronization.
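As an illustration, a minimal sketch creating two independent instances of the same binary (heap allocation kept for brevity, error checking omitted):
/* Two independent instances of the same relocatable model, each with
   its own AI RT RAM area. */
ai_handle net_a = AI_HANDLE_NULL, net_b = AI_HANDLE_NULL;
uint8_t *rt_a = malloc(rt_info.rt_ram_xip);
uint8_t *rt_b = malloc(rt_info.rt_ram_xip);
ai_rel_network_load_and_create(BIN_ADDRESS, rt_a, rt_info.rt_ram_xip,
                               AI_RELOC_RT_LOAD_MODE_XIP, &net_a);
ai_rel_network_load_and_create(BIN_ADDRESS, rt_b, rt_info.rt_ram_xip,
                               AI_RELOC_RT_LOAD_MODE_XIP, &net_b);
/* each instance is then initialized with its own activations buffer
   through ai_rel_network_init() */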
Running an inference
Running an inference is fully similar to the “static” case. The following code snippet illustrates the case where the generated model is defined with a single input and a single output tensor.
static int ai_run(void *data_in, void *data_out)
{
ai_i32 batch;
ai_buffer *ai_input = net_info.inputs;
ai_buffer *ai_output = net_info.outputs;
ai_input[0].data = AI_HANDLE_PTR(data_in);
ai_output[0].data = AI_HANDLE_PTR(data_out);
batch = ai_rel_network_run(net, ai_input, ai_output);
if (batch != 1) {
ai_log_err(ai_rel_network_get_error(net),
"ai_rel_network_run");
return -1;
}
return 0;
}
Tip
Properties of the input or output tensors are fully accessible through the ai_network_report struct, as for the “static” approach (refer to the “IO tensor description” [API] section). Payloads can be allocated in the activations buffer without restriction.
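For example, a minimal sketch enumerating the input tensors; the n_inputs and inputs field names are assumed from the ai_network_report definition in ai_platform.h:
/* Sketch: enumerate the input tensors of the instantiated model
   (fields assumed per the ai_network_report struct, ai_platform.h). */
for (int i = 0; i < net_info.n_inputs; i++) {
  const ai_buffer *in = &net_info.inputs[i];
  printf("input[%d] data=%p\r\n", i, (void *)in->data);
}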
Generation flow
The following figure illustrates the flow to generate a relocatable binary model. The first step is to import the model and to generate the NN C-files; it is identical to the static model generation approach, except that the <network>_data.c/.h files are not fully generated. The second step compiles and links the generated NN C-files against a specific AI runtime library, which is simply compiled with the relocatable options and embeds the requested mathematical and memcpy/memset functions. The last post-processing step generates the binary file by appending a specific section (.rel section) and various information used by the AI relocatable runtime API. The weights are appended as a .weights binary section at the end of the file.
Note
The code is compiled exclusively with a GCC Arm® Embedded toolchain. It is compiled using the -fpic and -msingle-pic-base options. The Arm® Cortex®-M r9 register is designated as the platform register for the global offset table (GOT). The AI relocatable runtime function updates the r9 register before calling the code. The generated relocatable binary object is independent of the end-user Arm® embedded toolchain used to build the end-user application. Consequently, for the same memory placement and hardware settings, the inference time remains consistent.
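For illustration, the position-independent build options resemble the following hypothetical command line; the exact flag set used by the pack may differ:
arm-none-eabi-gcc -mcpu=cortex-m7 -mthumb -mfpu=fpv5-d16 -mfloat-abi=hard \
    -fpic -msingle-pic-base -c network.c -o network.o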
AI run time execution modes
XIP execution mode
This execution mode is the default use case, where the code and weight sections are stored in the flash memory. Regarding memory placement, this approach is similar to the static method. Only a read-write memory region is required to store the requested data and BSS sections.
COPY execution mode
This alternative execution mode must be considered when the weights are required to be stored in an external memory device because they cannot fit in the internal flash memory. Copying the code from a non-efficient executable memory region to a low-latency executable region significantly improves inference time. Note that the required AI RT RAM size is larger, and the associated memory region must be executable. Another limitation of the Cortex-M4-based architecture (no Core I/D cache available) is the contention caused by code and data memory accesses, which can degrade performance. To mitigate this limitation, the next use case must be considered.
XIP execution mode and separated weight binary file
This mode represents an optimal case where the weights must
reside in an external memory device
(<network>_data.bin file). This configuration
requires a second internal or embedded flash region to store the
code (<network>_rel.bin file). The benefit is to
offload large weight data to external memory, freeing internal flash
space. In this scenario, the critical code executes in place.
However, the drawback is the need to manage two binary files during
upgrades.
AI relocatable run-time API
The proposed API, referred to as the AI relocatable runtime API,
manages the relocatable binary model and is comparable to the
embedded inference client API (refer to API) for the standard
approach. Only the create and initialize functions have been
enhanced to account for the specificities. All functions are
prefixed with ai_rel_network_ and are independent of
the C name of the model. They are defined and implemented in the
ai_reloc_network.c/.h files, located in the
$STEDGEAI_CORE_DIR/Middlewares/ST/AI/Reloc/ folder.
ai_rel_network_rt_get_info()
ai_error ai_rel_network_rt_get_info(const void* obj, ai_rel_network_info* rt);
Allows retrieving the dimensioning information required to instantiate a relocatable binary model.
{AI_ERROR_INVALID_HANDLE, AI_ERROR_CODE_INVALID_PTR} errors are returned if the referenced object is not valid (for example, an invalid signature or an address not aligned on 4 bytes).
The table below describes the fields available in the returned
ai_rel_network_info C-struct:
| field | description |
|---|---|
| c_name | Pointer to the user C-name of the generated model (for debugging purposes). |
| variant | 32-bit word. Handles the AI RT version, requested Cortex-M ID, and so on. |
| code_sz | Code size in bytes (excluding the weight section). |
| weights/weights_sz | Address/size (in bytes) of the weight section, if available. |
| acts_sz | Requested activations size (RAM metric) to run the model. |
| rt_ram_xip | Requested RAM size (in bytes) to install the model in XIP mode. |
| rt_ram_copy | Requested RAM size (in bytes) to install the model in COPY mode. |
ai_rel_network_load_and_create()
ai_error ai_rel_network_load_and_create(const void* obj, ai_handle ram_addr,
ai_size ram_size, uint32_t mode, ai_handle* hdl);
ai_handle ai_rel_network_destroy(ai_handle hdl);
Creates and installs an instance of the relocatable binary model referenced by obj. A read-write (RW) memory buffer, defined by ram_addr and ram_size, must be provided to create the data sections (.data, .bss, .got) and to resolve the internal references during the relocation process. The mode parameter specifies the expected execution mode. The expected size of the AI runtime (RT) RAM buffer can be retrieved with the ai_rel_network_rt_get_info() function.
| mode | description |
|---|---|
| AI_RELOC_RT_LOAD_MODE_XIP | XIP execution mode is requested |
| AI_RELOC_RT_LOAD_MODE_COPY | COPY execution mode is requested |
- ai_handle references a run-time context (opaque object) which must be used for the other functions.
- Before creating the instance, the Cortex-M ID is verified. If requested, the function also checks that the FPU is enabled.
Note
If ram_addr and/or ram_size are NULL, a default allocation is done through the system heap. This behavior can be overwritten in the ai_reloc_network.c file; see the AI_RELOC_MALLOC macro definition.
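When an instance is no longer needed, it can be released with ai_rel_network_destroy(). A minimal sketch, assuming the handle return convention follows the embedded client API (AI_HANDLE_NULL returned on success):
/* Sketch: release the instance and invalidate the handle
   (return convention assumed from the embedded client API). */
net = ai_rel_network_destroy(net);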
ai_rel_network_init()
ai_bool ai_rel_network_init(ai_handle hdl, const ai_handle *weights,
                            const ai_handle *act);
Finalizes the initialization of the instance with the addresses of the weights and the activations buffer.
- If the weights are stored in the relocatable binary object, ai_rel_network_rt_get_info() should be used to retrieve their address.
- As for the “static” approach, one (or multiple) activations buffer(s) should also be provided.
ai_handle weights_addr;
ai_handle activations;
...
const ai_handle acts[] = { activations };
res = ai_rel_network_init(net, &weights_addr, acts);
...
ai_rel_network_get_info()
ai_bool ai_rel_network_get_info(ai_handle hdl, ai_network_report* report);
Allows retrieving the run-time data attributes of an instantiated model. Refer to the ai_platform.h file for the details of the returned ai_network_report C-struct. It should be called after ai_rel_network_init().
ai_rel_network_get_error()
ai_error ai_rel_network_get_error(ai_handle hdl);
Returns the first error reported during the execution of an ai_rel_network_xxx() function.
- See the ai_platform.h file for the list of the returned error types (ai_error_type) and associated codes (ai_error_code).
ai_rel_network_run()
ai_i32 ai_rel_network_run(ai_handle hdl, const ai_buffer* input, ai_buffer* output);
Performs one (or more) inferences. The input and output buffer parameters (ai_buffer type) allow providing the input tensors and storing the predicted output tensors, respectively (refer to the “IO tensor description” [API] section).
ai_rel_platform_observer_register()
ai_bool ai_rel_platform_observer_register(ai_handle hdl,
ai_observer_node_cb cb, ai_handle cookie, ai_u32 flags);
ai_bool ai_rel_platform_observer_unregister(ai_handle hdl,
ai_observer_node_cb cb, ai_handle cookie);
ai_bool ai_rel_platform_observer_node_info(ai_handle hdl,
                                           ai_observer_node *node_info);
As for the “static” approach, these functions allow the registration of a user callback to be notified before or after the execution of a c-node. There are no restrictions on using the Platform Observer API with a relocatable binary model.
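As an illustration, a minimal sketch of a node callback; the callback signature and the AI_OBSERVER_PRE_EVT/AI_OBSERVER_POST_EVT flags are assumed from the platform observer API definitions in ai_platform.h:
/* Sketch: user callback notified around each c-node execution
   (signature and flags assumed per the platform observer API). */
static ai_u32 node_cb(const ai_handle cookie, const ai_u32 flags,
                      const ai_observer_node *node)
{
  (void)cookie;
  (void)node;
  if (flags & AI_OBSERVER_POST_EVT) {
    /* for example, collect per-node execution time here */
  }
  return 0;
}
...
ai_rel_platform_observer_register(net, node_cb, NULL,
                                  AI_OBSERVER_PRE_EVT | AI_OBSERVER_POST_EVT);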