ST Neural-ART NPU - Runtime loadable model support


ST Edge AI Core

for STM32 target, based on ST Edge AI Core Technology 3.0.0



r1.3

Introduction

What is a runtime loadable model?

A runtime loadable model, also known as a relocatable binary model, is a binary object that can be installed and executed at runtime within an STM32 memory subsystem. This model includes a compiled version of the generated neural network (NN) C files, encompassing the necessary forward kernel functions and weights/parameters. Its primary purpose is to offer a flexible method for upgrading AI-based applications without the need to regenerate and flash the entire end-user firmware. This approach is particularly useful for technologies such as firmware over-the-air (FOTA). This feature allows the following:

  • Models and their associated memory contents to be loaded dynamically at runtime.
  • Flexibility to update or swap models without rebuilding the entire application.
  • Clear separation between application logic and model data.

The binary object can be seen as a lightweight plug-in, capable of running from any address (position-independent code) and storing its data anywhere in memory (position-independent data). An efficient and minimal dynamic/runtime loader enables the instantiation and usage of this model. Unlike traditional systems, the firmware does not embed a complex and resource-intensive dynamic linker for Arm® Cortex®-M microcontrollers. The generated object is mainly self-contained, requiring limited and well-defined external symbols (NPU runtime dependent) at runtime.

In comparison with the software-only solution (fully self-contained object), the runtime loadable model for the Neural-ART accelerator must be installed inside an NPU runtime software stack. This allows supporting multiple instances, including the static version. No specific synchronization mechanism is embedded in the relocatable part; instead, the scheduling and access to the hardware resources (NPU subsystem) are managed by the stack itself (static part).

NPU stack with runtime loadable models

The generated runtime loadable model is a container that includes relocatable code (text, data, and rodata sections) to configure various neural processing unit (NPU) epochs. It primarily comprises a compiled version (microcontroller unit (MCU)-dependent) of the specialized network.c file, the low-level (LL) ATON driver functions used, and the code for delegated software operations, which are part of the optimized network runtime library. For hybrid epochs (calls to LL_ATON_LIB_xxx functions), a specific callback-based mechanism is implemented to directly invoke the services of the stack. These system callback functions are provided by the static part of the NPU stack and are registered during the installation phase of the model (see the ll_aton_reloc_install() function).

Solution overview

Memory model

As illustrated in the following figure, to use runtime loadable models, a portion of the internal RAMs (read/write AI regions) is fixed or reserved at absolute addresses. These regions must not be used by the application during inference. A non-fixed executable RAM region is required to install a given model. This read/write region is used to resolve the relocatable references during the installation process at runtime.

Platform memory layout
Item Definition
executable RAM region Designates the memory region used to install a model at runtime (txt/data sections). This shared memory-mapped region, located anywhere in the memory subsystem, can be reserved at runtime by the application. The minimum requested size is returned by the ll_aton_reloc_get_info() function (rt_ram_xip or rt_ram_copy fields). Attributes: MCU RWX, NPU RO. For performance reasons, the region should be fully cached (MCU). (*)
RW AI regions Designates the memory regions which are reserved for the activations/parameters. Base addresses are absolute and defined in the mem-pool descriptor files. Attributes: MCU/NPU memory-mapped, MCU RW, NPU RW. To check which memory regions are used, the ll_aton_reloc_get_mem_pool_desc() function can be used. (**)
RW AI region (external RAM) If requested, designates the memory region which is allocated/reserved to place the requested part of the memory-pool defined for the external RAM. The base address can be relative (placed anywhere in the external RAM) or absolute. ONLY one relative region is supported; it can be reserved at runtime by the application. The requested size (ext_ram_sz field) is returned by the ll_aton_reloc_get_info() function. Attributes: MCU RWX, NPU RW (**).
Model X (weights/params) Designates the memory regions where the relocatable binary models are stored. Base addresses (file_ptr) are relative and dependent on the application managing the FOTA-like mechanism. Attributes: MCU/NPU memory-mapped, MCU RO, NPU RO.
App Designates the memory regions which are reserved for the application (static part); it embeds a minimal part (NN load service) allowing a relocatable model to be loaded/installed and executed.

(*) This region must also be accessible by the NPU in the case where the epoch controller is enabled.
(**) These regions should also be accessible by the MCU in the case where an operator/epoch is delegated to the MCU.

Limitations

  • Only the STM32N6-based series with Neural-ART accelerator is supported.
  • Addresses of the used internal/on-chip memory-pools are fixed/absolute (USEMODE_ABSOLUTE attribute). Only addresses relative to the off-chip memories (flash and RAM type) can be relocatable/relative (USEMODE_RELATIVE attribute).
  • Only two relocatable/relative memory pools are supported: one for the RO region holding the weights/params and another for an external RW memory region. Note that the external RAM region can also be absolute.
  • Secure mode generation and XIP mode are not supported with the Arm Clang/LLVM toolchain (CMSE is not compatible with ROPI/RWPI).
  • Encrypted weights/params are not supported through the ST Edge AI Core CLI.

Embedded compiler options for relocatable mode

To generate relocatable objects, according to the Arm® Embedded toolchain used, two sets of compilation options are considered (see makefiles from the $STEDGEAI_CORE_DIR/scripts/N6_reloc/resources folder):

  • For a GCC Arm® Embedded toolchain, all the code, including the network runtime library, is compiled using the -fpic, -msingle-pic-base, and -mno-pic-data-is-text-relative options. The Arm® Cortex®-M r9 register is designated as the platform register for the global offset table (GOT). The primary task of the dynamic/runtime loader is to update the GOT table and the indirect references in the data structure. Both XIP & COPY modes are fully supported.
  • For a CLANG/LLVM Arm® Embedded toolchain, all the code is compiled with the options -fropi and -frwpi. The r9 register is also used; however, some constant/text parts must also be updated by the dynamic/runtime loader, which does not allow support for the XIP mode.

Prerequisites - Setting up a work environment

The generation of a runtime-loadable model is integrated into the ST Edge AI Core command-line interface (CLI). A specific pass calls the dedicated Python scripts located in the $STEDGEAI_CORE_DIR/scripts/N6_reloc directory. The npu_driver.py script acts as the entry point for generating the runtime-loadable models and can be used directly. By default, a GNU Arm Embedded toolchain supporting the Arm® Cortex®-M55 (with the ‘arm-none-eabi-’ prefix) is used. The user must ensure that the executables, including a Make utility, are available in the PATH. The custom option allows customizing the environment.

%STEDGEAI_CORE_DIR% represents the root location where the ST Edge AI Core components are installed, typically in a path like "<tools_dir>/STEdgeAI/<version>/".

Using the Python scripts directly

To use the scripts directly, Python 3.9+ and the following Python modules are required:

pyelftools==0.27  
tabulate  
colorama  

Note

The Python interpreter available in the ST Edge AI Core pack can be used directly (all requested Python modules are already installed).

Optional tools for built-in validation workflow

For validating a runtime loadable model on the STM32N6570-DK board, you may optionally install:

Note: the 'STM32_CUBE_IDE_DIR' system environment variable must be set to indicate the installation folder of the STM32CubeIDE pack.

Getting started

Generating a runtime loadable model

The -r/--reloc/--relocatable [rel-option] option allows generating the runtime-loadable model. The standard generation workflow is extended with additional steps to preprocess and compile the specialized C files.

Generation of the runtime loadable model

Here is a typical output log example that you might see when running the generate command for an NPU model deployment.

$ stedgeai generate -m <model-name>.tflite/onnx --target stm32n6 --st-neural-art <profile>@<usr_neural_art_reloc>.json --reloc
 ST Edge AI Core v3.0.0
...
 Neural ART - Package for runtime loadable model - v 1.4.0
...
 Generated files (8)
 --------------------------------------------------------------------------------
 <output-directory-path>\<model-name>_OE_3_3_1.onnx
 <output-directory-path>\<model-name>_OE_3_3_1_Q.onnx
 <output-directory-path>\network.c
 <output-directory-path>\network.h
 <output-directory-path>\network_atonbuf.xSPI2.raw
 <output-directory-path>\network_c_info.json
 <output-directory-path>\network_rel.bin
 <output-directory-path>\network_generate_rel.json

Creating report file <output-directory-path>\network_generate_report.txt

Generated files

file description
<network>_rel.bin The main binary file contains the compiled version of the model, including the forward kernel functions and weights by default. It also embeds additional sections (.header, .got, .rel) to enable the installation of the model.
<network>_data.bin Optional file. If the split parameter is used, the weights are generated in a separated binary file.
<network>_generate_rel.json Extra report file containing the main information of the generated runtime loadable model (JSON format)

Note that memory initializers are not required for the memory regions which are used only for the activations.

NPU compiler and CLI options

There are no restrictions on the NPU compiler options used for this step, except for the usage of the --all-buffers-info option, which provides detailed information about the intermediate and activation buffers. Note that this additional information is removed during the generation of the runtime loadable model. For the memory-pool descriptor file, the following attributes are mandatory:

Relocatable options

The -r/--reloc/--relocatable option can be used with the following parameters. These parameters are mainly forwarded to specialized Python scripts.

parameter description
<none> Default behavior. Generates a single model binary file containing network, kernels, and weights.
“split” Generates two separate binary files: one with the model (network + kernels) and another containing only the weights. Useful for flexible memory management.
“gen-c-file” Generates the binary file as a C array instead of a raw binary. Mainly for debugging purposes.
“st-clang” Use the ST Arm Clang toolchain for compilation instead of the default Arm GCC-based toolchain.
“llvm” Use the LLVM toolchain for compilation instead of the default Arm GCC-based toolchain.
“ecblob-in-params” Places the ecblobs (epoch controller blobs) together with the model weights/parameters in the binary.
“no-secure” Compile the binary with non-secure flags, useful for non-secure execution environments. Forced behavior with llvm-clang compiler.
“no-dbg-info” The LL_ATON_EB_DBG_INFO C-define is not set during the compilation of the model, avoiding embedding the debug-only information.
“custom[=<usr-file>.json]” Use a custom JSON configuration file to override default environment settings. If no file is specified, uses custom.json by default.

The parameters can be combined using a comma separator.

$ stedgeai generate -m <model-name>.tflite/onnx <gen_options> --target stm32n6 --st-neural-art ... --reloc split,gen-c-file
 ...
 <output-directory-path>\network_rel.bin
 <output-directory-path>\network_data.bin
 <output-directory-path>\network_img_rel.c
 <output-directory-path>\network_img_rel.h

“split” parameter

The split option allows generating two separate files. In this case, to deploy the model, both files should be deployed on the target, and the address of the "network_rel_params.bin" should be passed during the installation process at runtime (see ll_aton_reloc_install() function).

“ecblob-in-params” parameter

The ecblob-in-params option indicates that the ecblobs are stored with the params/weights. This feature allows reducing the exec memory region requested to execute the model in COPY mode (see the “Epoch controller consideration” section). Ecblobs are placed at the beginning of the params/weights memory segment. The split option is still supported; in this case, the params file includes the ecblobs.

“no-dbg-info” parameter

The no-dbg-info option allows the removal of the LL_ATON_EB_DBG_INFO C-define to generate the runtime loadable model. This option removes debug information, including the intermediate description of buffers and additional debug fields in the epoch descriptor, thereby reducing the size of the final binary. These debug details are mandatory for using the built-in validation stack.

“custom” parameter

The custom option allows overriding certain default environment variables used to build the relocatable binary model. By default, a custom.json file from the current working directory is used to retrieve the expected values. Alternatively, the parameter can be extended with a specific file name (for example: --reloc custom=myconfig.json).

supported key description
“runtime_network_lib” Used to indicate the absolute path of the used network runtime library (default: %STEDGEAI_CORE_DIR%/Middlewares/ST/AI/Lib/GCC/ARMCortexM55/NetworkRuntime1100_CM55_GCC_PIC.a )
“extra_system_path” Used to prefix the system PATH so that the expected Arm compiler can be retrieved. By default, the arm-none-eabi-gcc executable is used from the PATH
“llvm_install_path” llvm target only. Used to set the LLVM_COMPILER_PATH value in the llvm makefile to indicate the root directory of the toolchain. Mandatory for the llvm target.
“target_triplet” llvm target only. Set the TARGET_TRIPLET value in the llvm makefile. Default: thumbv8m.main-unknown-none-eabihf
“llvm_sysroot” llvm target only. Set the LLVM_SYSROOT value in the llvm makefile. Default: ${LLVM_COMPILER_PATH}/lib/clang-runtimes/newlib/arm-none-eabi/armv8m.main_hard_fp
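A minimal custom.json might look as follows. The values are placeholders to adapt to the local installation; only the keys are the ones documented above.

```json
{
  "extra_system_path": "C:/tools/arm-gnu-toolchain/bin",
  "runtime_network_lib": "C:/libs/NetworkRuntime_CM55_GCC_PIC.a"
}
```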

Memory layout information

The following information is reported in the log and also in a well-formatted JSON file: <network>_generate_rel.json.

 ...
 Runtime Loadable Model - Memory Layout (series="stm32n6npu")
 --------------------------------------------------------------------------------
 v8.0 dbg=True async=True sec=True 'Embedded ARM GCC' cpuid=D22 fpu=True float-abi=2

 XIP size             = 3,344      data(2620)+got(708)+bss(16) sections
 COPY size            = 32,544     +ro(29200) sections
 extra sections       = 1,112      got:708(21.2%) header+rel:404(1.4%)
 params size          = 25,713
 acts size            = 56,496
 binary file size     = 32,832
 params file size     = 25,720

 +--------------------------+----------------------------------+------+------------+-------+
 | name (addr)              | flags                            | foff | dst        | size  |
 +--------------------------+----------------------------------+------+------------+-------+
 | xSPI2 (0x20004c64)       | 0x01010500 RELOC.PARAM.0.RCACHED | 0    | 0x00000000 | 25713 |
 | AXISRAM5 (0x20004c6a)    | 0x03020200 RESET.ACTIV.WRITE     | 0    | 0x342e0000 | 56496 |
 | <undefined> (0x00000000) | 0x00000000 UNUSED                | 0    | 0x00000000 | 0     |
 +--------------------------+----------------------------------+------+------------+-------+
  Table: mempool c-descriptors (off=40000a00, 3 entries, from RAM)

 rt_ctx: c_name="network", acts_sz=56,496, params_sz=25,713, ext_ram_sz=0
 rt_ctx: rt_version_desc="atonn-v1.1.3-5-gdabeb3b4d (RELOC.GCC)"
item description
XIP size Size (in bytes) of the executable RAM memory region when using Execute-In-Place (XIP) mode. This is the memory region where code is executed directly without copying.
COPY size Size (in bytes) of the executable RAM memory region when using COPY mode, where code or data is copied from non-volatile memory to RAM before execution.
params size Total size (in bytes) of the used part of the memory-pools related to the weights and parameters. Detailed information is provided in mempool c-descriptors table.
acts size Total size (in bytes) of the used part of the memory-pools related to the activations needed during inference. Detailed information is provided in mempool c-descriptors table.
binary file size Size (in bytes) of the generated binary file containing the compiled model and its data.
params file size Size (in bytes) of the parameters file when the split option is used to separate weights from the main binary.
rt_ctx/v8.0.. Binary header information indicating details such as the Embedded ARM toolchain version used, compilation flags, and other metadata. This header is used by the loader/install function to check if the binary is compliant with the static part of the runtime.
Layout of the runtime loadable module

The “mempool c-descriptors” table indicates the memory regions (part of the user memory-pools) and the flags which are considered by the ll_aton_reloc_install() function at runtime.

flag description
RELOC Indicates a relocatable region; the address (dst=0) is resolved at runtime during the installation process.
PARAM/ACTIV/MIXED Indicates the type of contents: PARAM: params/weights only, ACTIV: activations only, MIXED: mixed
RCACHED/WCACHED Indicates that a part of the memory region can be accessible through the NPU cache. RCACHED is associated with a RELOC.PARAM/read-only region
WRITE Indicates that the memory region is a read-write memory region.
RESET Indicates that the memory region can be cleared if the AI_RELOC_RT_LOAD_MODE_CLEAR option is used.
COPY Indicates that the region is initialized/copied during the installation process.
UNUSED Last entry in the mempool c-descriptors
  • The number 0 or 1 indicates the ID of the relocatable memory regions. Currently, only two regions are supported: 0 for a parameters/weights-only region in external flash and 1 for a read/write memory region in the external RAM.
  • foff specifies the offset in the parameters/weights section to locate the associated memory initializer when requested.
  • dst specifies the destination address. If not equal to zero, the address is an absolute address; otherwise, the region is a relocatable region.
  • size specifies the size in bytes.
Example with a tiny model using only the internal NPU RAM for the activations and weights/params.

During the installation process at runtime, AXISRAM5 (absolute address) is initialized with the contents of the parameters/weights section (COPY.MIXED region). AXISRAM3, AXISRAM4, and AXISRAM6 are used exclusively for the activations.

 +--------------------------+------------------------------+------+------------+-------+
 | name (addr)              | flags                        | foff | dst        | size  |
 +--------------------------+------------------------------+------+------------+-------+
 | AXISRAM6 (0x20004c64)    | 0x03020200 RESET.ACTIV.WRITE | 0    | 0x34350000 | 500   |
 | AXISRAM5 (0x20004c6d)    | 0x02030200 COPY.MIXED.WRITE  | 0    | 0x342e0000 | 33728 |
 | AXISRAM4 (0x20004c76)    | 0x03020200 RESET.ACTIV.WRITE | 0    | 0x34270000 | 40496 |
 | AXISRAM3 (0x20004c7f)    | 0x03020200 RESET.ACTIV.WRITE | 0    | 0x34200000 | 16000 |
 | <undefined> (0x00000000) | 0x00000000 UNUSED            | 0    | 0x00000000 | 0     |
 +--------------------------+------------------------------+------+------------+-------+
  Table: mempool c-descriptors (off=40000428, 5 entries, from RAM)
Example with a model using only the external RAM/FLASH (internal RAMs are not used)

During runtime in the installation process, the references related to xSPI1 and xSPI2 (relative address) are resolved to the external RAM address (reserved by the application) and the parameters/weights section (part of the installed relocatable module), respectively.

 +--------------------------+----------------------------------+------+------------+-------+
 | name (addr)              | flags                            | foff | dst        | size  |
 +--------------------------+----------------------------------+------+------------+-------+
 | xSPI1 (0x20004c62)       | 0x01020601 RELOC.ACTIV.1.WCACHED | 0    | 0x00000000 | 64496 |
 | xSPI2 (0x20004c68)       | 0x01010500 RELOC.PARAM.0.RCACHED | 0    | 0x00000000 | 32625 |
 | <undefined> (0x00000000) | 0x00000000 UNUSED                | 0    | 0x00000000 | 0     |
 +--------------------------+----------------------------------+------+------------+-------+
  Table: mempool c-descriptors (off=40001618, 3 entries, from RAM)

Epoch controller consideration

There is no functional limitation on using the epoch controller with the runtime-loadable model. By default, the generated command streams (also called ecblobs) are stored as constants in the read-only data (rodata) section. However, as they reference addresses that are not known at generation time (that is, weights/params buffers), a dedicated relocatable mechanism is implemented to patch the ecblobs during the initialization of the model. This mechanism requires an additional SRAM memory area in the uninitialized data (bss) section to copy the ecblobs before patching. The -v 2 option of the ST Edge AI Core CLI allows reporting detailed information about the rodata/bss sections related to the ecblob objects.

$ stedgeai generate -m <model-name>.tflite/onnx --target stm32n6 --st-neural-art <profile>@<usr_neural_art_reloc>.json --reloc -v 2
...
 +--------------------+--------+---------+-------+
 |        Name        |  bss   | ro data | reloc |
 +--------------------+--------+---------+-------+
 | _ec_blob_network_1 | 19,776 | 19,904  | r:ptr |
 |                    |        |         |       |
 |       total        | 19,776 | 19,904  |       |
 +--------------------+--------+---------+-------+
 Table: EC blob objects (1)

Consequently, to install a model, the size of the required executable RAM (XIP or COPY mode) becomes significantly more critical. The following tables illustrate the different required sizes according to the configuration.

  • Weights/params are placed in the external flash (relative address)
Configuration XIP size COPY size params size binary file size
no EC 3,344 32,544 25,713 58,552
with EC 20,416 53,640 22,065 55,984
with EC + ecblob-in-params 20,416 33,712 22,065 55,960
  • Weights/params are placed in the internal RAM (fixed/absolute address).
Configuration XIP size COPY size params size binary file size
no EC 1,888 32,600 25,713 66,496
with EC 632 33,088 22,065 55,832
with EC + ecblob-in-params 632 13,312 22,065 55,832

Weights/params encryption consideration

If the model is generated to support encrypted weights/params (with the NPU compiler option '--encrypt-weights'), the weights/params file (network_atonbuf.xSPI2.raw) should be encrypted before generating the relocatable binary model, as for the non-relocatable model. This workflow is not integrated in the ST Edge AI Core CLI; the Python scripts (npu_driver.py script) must be used directly to generate the relocatable binary model. The split option is also preferable to be able to fix the address of the weights/params in the flash memory, because the encryption depends on the storage location.

To use a model with encrypted weights/params, the index/keys for the different bus interfaces should be set before executing the model.

...
  LL_ATON_RT_Reset_Network(&nn_instance);

  // Set bus interface keys -- used for encrypted inference only
  LL_Busif_SetKeys ( 0 , 0 , BUSIF_LSB_KEY , BUSIF_MSB_KEY );
  LL_Busif_SetKeys ( 0 , 1 , BUSIF_LSB_KEY , BUSIF_MSB_KEY );
  LL_Busif_SetKeys ( 1 , 0 , BUSIF_LSB_KEY , BUSIF_MSB_KEY );
  LL_Busif_SetKeys ( 1 , 1 , BUSIF_LSB_KEY , BUSIF_MSB_KEY );

  do {
    /* Execute first/next step of Cube.AI/ATON runtime */
    ll_aton_rt_ret = LL_ATON_RT_RunEpochBlock(&nn_instance);
    ...

Format of the relocatable binary model

The following figure illustrates the layout of the generated relocatable binary model ("network_rel.bin" file). By default, the memory initializers (params/weights sections) are included in the image. If the split option is used, the params/weights sections are generated in a separated binary file ("network_rel_params.bin" file).

  • If the epoch controller is enabled, the generated bitstreams are included in the rodata/data sections by default. The --ecblob-in-params option can be used to store the ecblobs with the params section.
  • The description of the params/weights sections is defined in the rodata section.
  • When the split option is used, the address ('ext_params_add') is passed and defined by the application when the ll_aton_reloc_install() function is called.

Evaluating the RT loadable model (STM32N6570-DK board)

A ready-to-use environment for the STM32N6570-DK board (DEV mode) is delivered in the ST Edge AI Core pack. It allows performing a classical validation workflow, either through the validate command or through the stm_ai_runner Python package. Note that an STM32CubeIDE environment must be installed.

Warning

Set the boot mode to development mode (BOOT1 switch position is 1-3, BOOT0 switch position does not matter). After the loading phase, the board must NOT be switched off or disconnected in order to perform the validation.

After the generation of the relocatable binary model, the st_load_and_run.py Python script is used to flash the binary files at the fixed addresses, and to load and run a built-in validation firmware. After these steps, a single inference is executed, reporting the performance.

[Details] st_load_and_run.py script
usage: st_load_and_run.py [-h] [--input [STR ...]] [--board STR]
                          [--address STR] [--mode STR] [--cube-ide-dir STR]
                          [--log [STR]] [--verbosity [{0,1,2}]] [--debug] [--no-color]

NPU Utility - ST Load and run (dev environment) v2.0

optional arguments:
  -h, --help            show this help message and exit
  --input [STR ...], -i [STR ...]
                        location of the binary files (default: build/network_rel.bin)
  --board STR           ST development board (default: stm32n6570-dk)
  --address STR         destination address - model(,params) (default: 0x71000000,0x71800000)
  --mode STR            firmware variants & mode: copy,xip[no-flash,no-overlay,no-run,usbc,ext]
  --cube-ide-dir STR    installation directory of STM32CubeIDE tools (ex. ~/ST/STM32CubeIDE_1.19.0/STM32CubeIDE)
  --log [STR]           log file
  --verbosity [{0,1,2}], -v [{0,1,2}]
                        set verbosity level
  --debug               enable internal log (DEBUG PURPOSE)
  --no-color            disable log color support

Supported features/limitations

  • Only the STM32N6570-DK board is supported. Destination addresses are fixed.
  • Multiple models can be deployed.
    • By default, they share the same execution RAM memory region. The no-overlay mode can be used to force the creation of a dedicated memory region.
    • Execution of the models is always sequential; the activation regions overlap.
  • The --no-inputs/outputs-allocation options are not supported.
  • copy or xip mode can be selected. If the xip mode is not supported, the copy mode is automatically used.
  • The execution RAM region is located in the internal SRAM first; if more memory is requested, the external RAM is used.

Deploy a model

$ $STEDGEAI_CORE_DIR/Utilities/windows/python $STEDGEAI_CORE_DIR/scripts/N6_reloc/st_load_and_run.py
   -i <output-directory-path>\network_rel.bin
NPU Utility - ST Load and run (dev environment) (version 2.0)
Creating date : ...

model info     : st_ai_output\network_rel.bin: size=55,832
                 cpuid=0xd22 c_name='network' 'Embedded ARM GCC' ll_aton=1.1.3.3
                 acts/params=57,087/22,065 xip/copy=632/13,312 ext_ram=0 split=False
                 secure=True
board          : 'stm32n6570-dk'
mode           : ['copy']

board info     : 'stm32n6570-dk' baudrate=921600 eflash_loader='MX66UW1G45G_STM32N6570-DK.stldr'
                   eflash[sec/pg]=64.0KB/4.0KB
                 exec_ram[int/ext]=655,360/4,194,304 ext_ram=28,311,552 addrs=0x70FFF000,0x71000000,0x71800000
use clang      : False
install mode   : 'copy'
total (1)      : overlay bin=65,536 xip=640 [copy=13,312] 'installed in int exec_ram' ext_ram=0
 flash@        : 0x71000000
 flash_params@ : 0x00000000

Resetting the board.
Flashing 'header (nb_entries=1)' at address 0x70FFF000 (size=20)..
Flashing 'st_ai_output\network_rel.bin' at address 0x71000000 (size=55832)..
Loading & start the validation application 'stm32n6570-dk-validation-reloc'..
Deployed model is started and ready to be used.
Executing the deployed model (desc=serial:921600)..
...
  Inference time per node
  -------------------------------------------------------------------------------------------------------------------
  c_id    m_id   type                dur (ms)       %    cumul  CPU cycles                      name
  -------------------------------------------------------------------------------------------------------------------
  0       -      epoch (EC)             0.210   96.6%    96.6%  [     896  166,877      573 ]   EpochBlock_1 -> 14
  1       -      epoch (SW)             0.007    3.4%   100.0%  [     110       40    5,704 ]   EpochBlock_15
  -------------------------------------------------------------------------------------------------------------------
  total                                 0.218                   [   1,006  166,917    6,277 ]
                                4592.41 inf/s                   [    0.6%    95.8%     3.6% ]
  -------------------------------------------------------------------------------------------------------------------

Evaluate the model

The default validation workflow can be used to evaluate the deployed model.

$ stedgeai validate -m <quantize_model> --target stm32n6 --mode target -d serial:921600
...
Evaluation report (summary)
 -------------------------------------------------------------------------------------------------------------
 Output       acc    rmse         ...  std        nse        cos        tensor
 -------------------------------------------------------------------------------------------------------------
 X-cross #1   n.a.   0.007084151  ...  0.007081   0.999671   0.999910   10 x uint8(1x3087x6)
 -------------------------------------------------------------------------------------------------------------

Deploy multiple models

Generate the models: model1 and model2

stedgeai generate -m <model1> --target stm32n6 --st-neural-art test@$STEDGEAI_CORE_DIR/scripts/N6_reloc/test/neural_art_reloc.json -r -n model1

stedgeai generate -m <model2> --target stm32n6 --st-neural-art test@$STEDGEAI_CORE_DIR/scripts/N6_reloc/test/neural_art_reloc.json -r -n model2

Warning

Each deployed model must have its own c-name.

Deploy the models for evaluation

$ $STEDGEAI_CORE_DIR/Utilities/windows/python $STEDGEAI_CORE_DIR/scripts/N6_reloc/st_load_and_run.py
   -i <output-directory-path>\model1_rel.bin <output-directory-path>\model2_rel.bin

Evaluate the model1 model

$ stedgeai validate -m <model1> --target stm32n6 --mode target -d serial:921600 -n model1

Deploy and use a relocatable model

There is no specific service to deploy the model on a target; FOTA-like mechanisms and other stacks managing the firmware or the models are application-specific. However, once the relocatable model is flashed on the target at a given memory-mapped address (file_ptr), the ll_aton_reloc_install() function must be called to install the model.

LL_ATON stack configuration

LL_ATON_XX C-defines comment
LL_ATON_PLATFORM 'LL_ATON_PLAT_STM32N6' mandatory
LL_ATON_EB_DBG_INFO mandatory
LL_ATON_RT_RELOC mandatory - Enables the code paths and functionalities required to manage the relocatable mode.
LL_ATON_RT_MODE LL_ATON_RT_ASYNC is recommended but LL_ATON_RT_POLLING can be used.
LL_ATON_OSAL no restriction
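
As an illustration, these C-defines are typically passed as compiler options or grouped in a project configuration header. The sketch below assumes such a header (its name and the grouping are project-specific assumptions; only the define names and values come from the table above):

```c
/* app_ll_aton_config.h - hypothetical project configuration header (sketch) */
#define LL_ATON_PLATFORM     LL_ATON_PLAT_STM32N6   /* mandatory */
#define LL_ATON_EB_DBG_INFO                         /* mandatory */
#define LL_ATON_RT_RELOC                            /* mandatory: enables the relocatable mode */
#define LL_ATON_RT_MODE      LL_ATON_RT_ASYNC       /* recommended; LL_ATON_RT_POLLING also possible */
```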

Minimal code

The following code snippet illustrates how to install and use a runtime loadable model within a bare-metal environment: a single network with a single input tensor and a single output tensor (with or without the epoch controller). The user_model_mgr() function is used to retrieve the address where the model has been flashed.

#include "ll_aton_reloc_network.h"

static NN_Instance_TypeDef nn_instance;  /* LL ATON handle */

uint8_t *input_0, *prediction_0;
uint32_t input_size_0, prediction_size_0;

int ai_init(const uintptr_t file_ptr, const uintptr_t file_params_ptr)
{
  const LL_Buffer_InfoTypeDef *ll_buffer;
  ll_aton_reloc_info rt;
  int res;

  /* Retrieve the requested RAM size to install the model */
  res = ll_aton_reloc_get_info(file_ptr, &rt);
  /* Reserve executable memory region to install the model */
  uintptr_t exec_ram_addr = reserve_exec_memory_region(rt.rt_ram_copy);
  /* Reserve external read/write memory region for external RAM region */
  uintptr_t ext_ram_addr = 0;
  if (rt.ext_ram_sz > 0)
    ext_ram_addr = reserve_ext_memory_region(rt.ext_ram_sz);

  /* Create and install an instance of the relocatable model */
  ll_aton_reloc_config config;
  config.exec_ram_addr = exec_ram_addr;
  config.exec_ram_size = rt.rt_ram_copy;
  config.ext_ram_addr = ext_ram_addr;
  config.ext_ram_size = rt.ext_ram_sz;
  config.ext_param_addr = 0;  /* or @ of the weights/params if split mode is used */
  config.mode = AI_RELOC_RT_LOAD_MODE_COPY; // | AI_RELOC_RT_LOAD_MODE_CLEAR;

  res = ll_aton_reloc_install(file_ptr, &config, &nn_instance);

  if (res == 0)
  {
    /* Retrieve the addresses of the input/output buffers */
    ll_buffer = ll_aton_reloc_get_input_buffers_info(&nn_instance, 0);
    input_0 = LL_Buffer_addr_start(ll_buffer);
    input_size_0 = LL_Buffer_len(ll_buffer);
    ll_buffer = ll_aton_reloc_get_output_buffers_info(&nn_instance, 0);
    prediction_0 = LL_Buffer_addr_start(ll_buffer);
    prediction_size_0 = LL_Buffer_len(ll_buffer);

    /* Init the LL ATON stack and the instantiated model */
    LL_ATON_RT_RuntimeInit();
    LL_ATON_RT_Init_Network(&nn_instance);
  }
  return res;
}

void ai_deinit(void)
{
  LL_ATON_RT_DeInit_Network(&nn_instance);
  LL_ATON_RT_RuntimeDeInit();
}

void ai_run(void) {
  LL_ATON_RT_RetValues_t ll_aton_ret;
  LL_ATON_RT_Reset_Network(&nn_instance);
  do {
    ll_aton_ret = LL_ATON_RT_RunEpochBlock(&nn_instance);
    if (ll_aton_ret == LL_ATON_RT_WFE) {
      LL_ATON_OSAL_WFE();
    }
  } while (ll_aton_ret != LL_ATON_RT_DONE);
}

void main(void)
{
  uintptr_t file_ptr, file_params_ptr;

  user_system_init();  /* HAL, clocks, NPU sub-system... */

  user_model_mgr(&file_ptr, &file_params_ptr);
  if (ai_init(file_ptr, file_params_ptr)) {
    /*... installation issue ..*/
  }

  while (user_app_not_finished()) {
    /* Fill input buffers */
    user_fill_inputs(input_0);
    /* If requested, perform the NPU/MCU cache operations to guarantee the coherency of the memory. */
    //  LL_ATON_Cache_MCU_Clean_Invalidate_Range(input_0, input_size_0);
    //  LL_ATON_Cache_MCU_Invalidate_Range(prediction_0, prediction_size_0);
    /* Perform a complete inference */
    ai_run();
    /* Post-process the predictions */
    user_post_process(prediction_0);
  }
  ai_deinit();
  //...
}

XIP/COPY execution modes

XIP/COPY mode support

XIP execution mode

This execution mode is the preferred mode: the code and weight sections remain stored in the flash memory. Regarding memory placement, this approach is similar to the static method. This mode is efficient in terms of memory usage; only the executable RAM region to store the data/bss/got sections is required.

COPY execution mode

This mode involves copying the code to a different memory location before execution, which is useful for optimizing performance or managing memory access speeds. The required size for the executable RAM region is larger: along with the data, bss, and GOT sections, it also includes the text and rodata sections.
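
For illustration, the choice between the two modes can be driven at runtime by the rt_ram_xip and rt_ram_copy sizes reported by ll_aton_reloc_get_info(). The sketch below uses a stand-in structure mirroring only those two documented fields and a hypothetical helper function; it is not part of the ST API:

```c
#include <stdint.h>

/* Stand-in for the two relevant ll_aton_reloc_info fields (sketch only) */
typedef struct {
  uint32_t rt_ram_xip;  /* minimum RAM size to install the model, XIP mode  */
  uint32_t rt_ram_copy; /* minimum RAM size to install the model, COPY mode */
} reloc_ram_info;

enum load_mode { LOAD_MODE_NONE = 0, LOAD_MODE_XIP, LOAD_MODE_COPY };

/* Prefer COPY (code fetched from RAM) when the executable RAM budget allows it,
 * fall back to XIP (smaller RAM footprint), else report failure. */
static enum load_mode choose_load_mode(const reloc_ram_info *info, uint32_t ram_budget)
{
  if (ram_budget >= info->rt_ram_copy)
    return LOAD_MODE_COPY;
  if (ram_budget >= info->rt_ram_xip)
    return LOAD_MODE_XIP;
  return LOAD_MODE_NONE; /* not enough executable RAM */
}
```

In a real application, the selected mode would also dictate the exec_ram_size passed in the ll_aton_reloc_config structure (rt_ram_copy or rt_ram_xip, respectively).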

Example of NPU compiler configuration files

Memory-pool descriptor files

The "%STEDGEAI_CORE_DIR%/scripts/N6_reloc/test" contains a set of configuration and memory-pool descriptor files which can be used. Here are two main points to consider:

  • If a nonsecure context is used to execute the deployed model, the base @ of the different memory-pools should be set to a nonsecure address (for example, 0x24350000 instead of 0x34350000 for the AXIRAM6 memory).
  • For the memory-pools representing an off-chip device, the "USEMODE_RELATIVE" attribute should be used.

For test purposes with the STM32N6570-DK board, different ready-to-use configurations are provided.

memory-pool desc. file description
stm32n6_reloc.mpool Full memories. All NPU rams (AXIRAM3..6), AXIRAM2 and external RAM/flash are defined. NPU cache can be used for the external memories.
stm32n6_npuram.mpool NPU memories only. Only the NPU rams (AXIRAM3..6) are defined.
stm32n6_int.mpool Internal memories only. Only the NPU RAMs (AXIRAM3..6 and AXIRAM2) are defined.
stm32n6_int2.mpool Internal memories only. Similar to stm32n6_int.mpool but the AXIRAM2 is privileged for the weights.
stm32n6_reloc_ext.mpool External memories only. Only the external RAM/flash are defined. NPU cache can be used for the external memories.

For information and test purposes, the 'non-reloc' memory-pool descriptor files are also provided; they can be used with a normal deployment flow.

Part of the ./test/mpool/stm32n6_reloc.mpool file:

    ...
    {
        "fname": "AXISRAM6",
        "name":  "npuRAM6",
        "fformat": "FORMAT_RAW",
        "prop":   { "rights": "ACC_WRITE", "throughput": "HIGH", "latency": "LOW",
                    "byteWidth": 8, "freqRatio": 1.25, "read_power": 18.531, "write_power": 16.201 },
        "offset": { "value": "0x34350000", "magnitude":  "BYTES" },
        "size":   { "value": "448",        "magnitude": "KBYTES" }
    },
    {
        "fname": "xSPI1",
        "name":  "hyperRAM",
        "fformat": "FORMAT_RAW",
        "prop":   { "rights": "ACC_WRITE", "throughput": "MID", "latency": "HIGH",
                    "byteWidth": 2, "freqRatio": 5.00, "cacheable": "CACHEABLE_ON",
                    "read_power": 380, "write_power": 340.0, "constants_preferred": "true" },
        "offset": { "value": "0x90000000", "magnitude":  "BYTES" },
        "size":   { "value": "32",         "magnitude": "MBYTES" },
        "mode":   "USEMODE_RELATIVE"
    },
    {
        "fname": "xSPI2",
        "name":  "octoFlash",
        "fformat": "FORMAT_RAW",
        "prop":   { "rights": "ACC_READ",  "throughput": "MID", "latency": "HIGH", 
                    "byteWidth": 1, "freqRatio": 6.00, "cacheable": "CACHEABLE_ON",
                    "read_power": 110, "write_power": 400.0, "constants_preferred": "true" },
        "offset": { "value": "0x70400000", "magnitude":  "BYTES" },
        "size":   { "value": "64",         "magnitude": "MBYTES" },
        "mode":   "USEMODE_RELATIVE"
    }
    ...

“neural_art.json” file

The "%STEDGEAI_CORE_DIR%/scripts/N6_reloc/test" contains two examples of configuration file using the requested memory-pool descriptor files. They provide different profiles using different memory configurations which are aligned with the generic AI test validation.

profile description
test Default profile. Full memories configuration; the epoch controller is not enabled
test-ec Default profile + support of the epoch controller
test-int Internal profile. NPU memories only configuration; the epoch controller is not enabled
test-int-ec Internal profile + support of the epoch controller
test-ext External profile. External memories only configuration; the epoch controller is not enabled
test-ext-ec External profile + support of the epoch controller

Part of the ./test/neural_art_reloc.json file:

  ...
   "test" : {
    "memory_pool": "./mpools/stm32n6_reloc.mpool",
      "options": "--native-float --cache-maintenance --Ocache-opt --enable-virtual-mem-pools --Os
                    --optimization 3 --Oauto-sched --all-buffers-info --csv-file network.csv"
   },
   "test-ec" : {
    "memory_pool": "./mpools/stm32n6_reloc.mpool",
        "options": "--native-float --cache-maintenance --Ocache-opt --enable-virtual-mem-pools --Os
                    --optimization 3 --Oauto-sched --all-buffers-info --csv-file network.csv --enable-epoch-controller"
   },
  ...

Performance impacts

Accuracy

No difference with the nonrelocatable or static implementation.

Set up and inference time

For the inference time (after the install/init steps), no significant difference is expected versus a nonrelocatable or static implementation. Only the set-up time to install/create an instance of a given model is impacted. The installation time is directly proportional to the number of relocations and to the size of the code/data sections, depending on the COPY/XIP mode used. Note that if the memory initializers need to be copied into the internal RAMs, this extra time is equivalent to the static implementation.

Install/init time overhead
  • STM32N6570-DK board (DEV mode), overdrive clock setting (NPU 1 GHz, MCU 800 MHz).
  • By default, the external flash is used to store the weights/params.
  • Internal RAMs (AXIRAM3,..6) are used to store the activations.
  • The executable RAM region is located in the internal AXIRAM1 (this is equivalent to the static case where the application is also executed from the AXIRAM1).
yolov5 224 (nano) static mode (absolute @ only) reloc mode (copy) reloc mode (xip) (*)
inference time (w/ ec) 10.4 ms, 95.45 inf/s 10.5 ms, 94.9 inf/s 10.6 ms, 94.6 inf/s
inference time (w/o ec) 13.2 ms, 75.35 inf/s 13.0 ms, 77.1 inf/s 14.3 ms, 69.6 inf/s
install/init time (ms) (w/ ec) 0.0 / 0.03 11.7 / 0.73 10.9 / 0.93
install/init time (ms) (w/o ec) 0.0 / 0.03 11.2 / 0.03 10.8 / 0.03

(*) With the epoch controller, as the blob must be updated, it is fetched from the executable RAM region (AXIRAM1). An impact is only observed in the case where the configuration code is fetched from the external flash, that is, without epoch controller support.

For the reloc mode with epoch controller support, the 'install/init' time is mainly due to the copy of the blobs in the AXIRAM1 and the requested relocations to resolve the weights/params addresses in the different blobs.

Case where only the AXIRAMx is used for the activations and the weights/params (~2 Mbytes)

yolov5 224 (nano) static mode (absolute @ only) reloc mode (copy) reloc mode (xip)
inference time (w/ ec) 9.5 ms, 105.3 inf/s 9.5 ms, 105.8 inf/s 9.6 ms, 104.6 inf/s
install/init time (ms) (w/ ec) 17+ / 0.03 18.1 / 0.03 18.1 / 0.03

The 'install/init' time is similar in both cases. It is mainly due to the copy of the memory initializers from the flash location to the internal RAMs. No extra relocation for the weights/activations is required (all weights/activations addresses are absolute).

Memory layout overhead

In comparison with a static implementation, the relocatable mode involves two additional sections, GOT/REL, which are used to support position-independent code/data. Their size is directly proportional to the number of relocated references.

LL ATON runtime API extension

To enable the support of a runtime loadable model, the LL_ATON files should be compiled with the following C-define:

LL_ATON_RT_RELOC

The LL_ATON_RT_RELOC C-define activates the code paths and functionalities required to manage and install runtime loadable models. Ensuring that this macro is defined during compilation is crucial for the successful deployment and execution of runtime loadable models.

ll_aton_reloc_install()

int ll_aton_reloc_install(const uintptr_t file_ptr, const ll_aton_reloc_config *config,
                          NN_Instance_TypeDef *nn_instance);

Description

The ll_aton_reloc_install() function acts as a runtime dynamic loader. It is used to install and to create an instance of a memory-mapped runtime loadable module. By providing the model image pointer (file_ptr), configuration details, and neural network instance, users can set up the model for execution. The function performs compatibility checks, initializes memory pools, and installs/relocates code and data sections as needed.

Parameters

  • file_ptr: A uintptr_t value representing the pointer to the image of the runtime loadable model. This parameter specifies the location of the model image to be installed.
  • config: A pointer to an ll_aton_reloc_config structure. This parameter provides the configuration details for how the model should be installed, including memory addresses and sizes.
  • nn_instance: A pointer to an NN_Instance_TypeDef structure. This parameter is updated to handle the installed model, creating an instance of the neural network.

Return Value

  • The function returns an integer value. A return value of 0 typically indicates success, while a nonzero value indicates that an error occurred during the installation process.

Steps Executed

  1. Checking step
    • This step checks the compatibility of the binary object against the runtime environment (static part of the firmware). The main points checked include:
      • The version and content of the binary file header.
      • MCU type and whether the FPU (floating-point unit) is enabled (context/setting of the caller is used).
      • Secure or nonsecure context.
      • Whether the binary module has been compiled with the LL_ATON_EB_DBG_INFO or LL_ATON_RT_ASYNC C-defines.
      • Version of the used LL_ATON files.
  2. Memory-Pool initialization step
    • If requested, this step initializes the used memory regions for the given model. Specifically:
      • If read/write memory pools handle the params/weights section, the associated memory region is initialized with values from the params/weights section.
      • Optionally, if the AI_RELOC_RT_LOAD_MODE_CLEAR flag is set, the read/write memory region handling the activations is cleared.
  3. Code/Data installation and relocation step
    • According to the AI_RELOC_RT_LOAD_MODE_COPY or AI_RELOC_RT_LOAD_MODE_XIP flag:
      • The code/data sections are copied into the executable RAM region.
      • The relocation process is performed to update references.
      • The default callbacks are registered.

ll_aton_reloc_config C-struct

The purpose of the ll_aton_reloc_config C structure is to provide the parameters which are requested to install a runtime loadable model.

typedef struct _ll_aton_reloc_config {
    uintptr_t exec_ram_addr;  /* base@ of the exec memory region to place the relocatable code/data (8-Bytes aligned) */
    uint32_t exec_ram_size;   /* max size in byte of the exec memory region */
    uintptr_t ext_ram_addr;   /* base@ of the external memory region to place the external pool (if requested) */
    size_t ext_ram_size;      /* max size in byte of the external memory region */
    uintptr_t ext_param_addr; /* base@ of the param memory region (if requested) */
    uint32_t mode;
  } ll_aton_reloc_config;
  • 'exec_ram_addr'/'exec_ram_size': These members indicate the base address (8-byte aligned) and the maximum size of the read/write executable RAM memory region. These parameters are mandatory. To determine the required size at runtime, the ll_aton_reloc_get_info function can be used.
  • 'ext_ram_addr'/'ext_ram_size': These members indicate the base address (8-byte aligned) and the maximum size of the read/write external RAM memory region, if requested. To determine the required size at runtime, the ll_aton_reloc_get_info function can be used.
  • 'ext_param_addr': This member indicates the base address (8-byte aligned) of the memory region containing the parameters/weights of the deployed model. This option is required when the split option is used; otherwise, it must be set to NULL.
  • 'mode': This member indicates the expected execution mode. Or-ed flags can be used. The AI_RELOC_RT_LOAD_MODE_XIP or AI_RELOC_RT_LOAD_MODE_COPY flag is mandatory. The AI_RELOC_RT_LOAD_MODE_CLEAR flag is optional.
mode description
AI_RELOC_RT_LOAD_MODE_XIP XIP (Execute In Place) execution mode
AI_RELOC_RT_LOAD_MODE_COPY COPY execution mode
AI_RELOC_RT_LOAD_MODE_CLEAR Reset the used activation memory regions

ll_aton_reloc_set_callbacks()

int ll_aton_reloc_set_callbacks(const NN_Instance_TypeDef *nn_instance, const struct ll_aton_reloc_callback *cbs)

Description

The ll_aton_reloc_set_callbacks function is used to overwrite the default registration of the callbacks done in the ll_aton_reloc_install function. This function is optional.

Callback services description
assert/lib error to implement the management of the errors generated by the embedded LL ATON functions
NPU/MCU cache maintenance operations to implement the NPU/MCU cache maintenance operations
LL_ATON_LIB_xxx to implement the LL ATON LIB services to support the hybrid epochs

Parameters

  • nn_instance: A pointer to the neural network instance (NN_Instance_TypeDef).
  • cbs: A pointer to a ll_aton_reloc_callback structure (see ll_aton_reloc_network.h file).

Return Value

  • The function returns an integer value. A return value of 0 typically indicates success, while a nonzero value indicates that an error occurred during the installation process.

ll_aton_reloc_get_info()

int ll_aton_reloc_get_info(const uintptr_t file_ptr, ll_aton_reloc_info *rt);

Description

The ll_aton_reloc_get_info function is used to obtain the main dimensioning information from the image of a runtime loadable model. This information can include details such as the size, memory requirements, and other relevant attributes of the model. By providing a pointer to the model image and a reference to an ll_aton_reloc_info structure, users can retrieve and store the necessary information to properly configure and manage the runtime loadable model.

This function is particularly useful for setting up the memory regions and ensuring that the model can be correctly loaded and executed within the available resources.

Parameters

  • file_ptr: A uintptr_t value representing the pointer to the image of the runtime loadable model. This parameter specifies the location of the model image from which the information will be retrieved.
  • rt: A pointer to an ll_aton_reloc_info structure. This parameter is used to store the retrieved dimensioning information of the runtime loadable model. The function fills this structure with the relevant details.

Return Value

  • The function returns an integer value. A return value of 0 typically indicates success, while a nonzero value indicates that an error occurred during the operation.

ll_aton_reloc_info C-struct

  typedef struct _ll_aton_reloc_info
  {
    const char *c_name;          /* c-name of the model */
    uint32_t variant;            /* 32-b word to handle the reloc rt version,
                                    the used ARM Embedded compiler,
                                    Cortex-Mx (CPUID) and if the FPU is requested */
    uint32_t code_sz;            /* size of the code (header + txt + rodata + data + got + rel sections) */
    uint32_t params_sz;          /* size (in bytes) of the weights */
    uint32_t acts_sz;            /* minimum requested RAM size (in bytes) for the activations buffer */
    uint32_t ext_ram_sz;         /* requested external ram size for the activations (and params) */
    uint32_t rt_ram_xip;         /* minimum requested RAM size to install it, XIP mode */
    uint32_t rt_ram_copy;        /* minimum requested RAM size to install it, COPY mode */
    const char *rt_version_desc; /* rt description */
    uint32_t rt_version;         /* rt version */
    uint32_t rt_version_extra;   /* rt version extra */
  } ll_aton_reloc_info;
member description
c_name indicates the name of the model.
variant or-ed 32-bit value indicating the used Arm compiler, the CPUID of the Cortex®-M, etc. (see the ll_aton_reloc_network.h file)
code_sz size in bytes of all code/data sections representing the model: header+txt+rodata+data+got+rel sections
params_sz total size (in bytes) of the params/weights section
acts_sz total size (in bytes) of the activations
ext_ram_sz requested size (in bytes) of the external RAM memory
rt_ram_xip requested size (in bytes) of read/write execution memory region (XIP mode)
rt_ram_copy requested size (in bytes) of read/write execution memory region (COPY mode)
rt_version_desc (debug info) string describing the used LL runtime version
rt_version LL runtime version: major << 24 | minor << 16 | sub << 8
rt_version_extra (debug info) extra dev version value
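
For instance, the packed rt_version word can be decoded back into its components. The helper below is a sketch (the function name is illustrative; only the major << 24 | minor << 16 | sub << 8 layout comes from the table above):

```c
#include <stdint.h>

/* Decode the documented rt_version layout: major << 24 | minor << 16 | sub << 8 */
static void decode_rt_version(uint32_t rt_version,
                              uint8_t *major, uint8_t *minor, uint8_t *sub)
{
  *major = (uint8_t)(rt_version >> 24);
  *minor = (uint8_t)(rt_version >> 16);
  *sub   = (uint8_t)(rt_version >> 8);
}
```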

ll_aton_reloc_get_mem_pool_desc()

ll_aton_reloc_mem_pool_desc *ll_aton_reloc_get_mem_pool_desc(const uintptr_t file_ptr, int index);

Description

The ll_aton_reloc_get_mem_pool_desc function allows users to obtain information about parts of the memory pools used for a given model. By providing a pointer to the model image and an index, users can retrieve the necessary information through the returned ll_aton_reloc_mem_pool_desc structure.

Parameters

  • file_ptr: A uintptr_t value representing the pointer to the image of the runtime loadable model. This parameter specifies the location of the model image from which the information will be retrieved.
  • index: Index of the requested descriptor.

Return Value

  • The function returns a pointer to a ll_aton_reloc_mem_pool_desc object. If the specified index is out of range, the function may return NULL.

Example

A typical snippet of code to display memory pool C-descriptors.

  ll_aton_reloc_mem_pool_desc *mem_c_desc;
  int index = 0;

  while ((mem_c_desc = ll_aton_reloc_get_mem_pool_desc((uintptr_t)bin, index)))
  {
    printf(" %d: flags=%x foff=%d dst=%x s=%d\n", index, mem_c_desc->flags,
           mem_c_desc->foff, mem_c_desc->dst, mem_c_desc->size);
    index++;
  }

ll_aton_reloc_mem_pool_desc C-struct

typedef struct _ll_aton_reloc_mem_pool_desc
{
  const char *name; /* name */
  uint32_t flags;   /* type definition: 32b:4x8b <type><data_type><reserved><id> */
  uint32_t foff;    /* offset in the binary file */
  uint32_t dst;     /* dst @ */
  uint32_t size;    /* real size */
} ll_aton_reloc_mem_pool_desc;

The AI_RELOC_MPOOL_GET_XXX(flags) macros (see the ll_aton_reloc_network.h file) can be used to retrieve the attributes of the memory pool.
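
As an illustration of the '32b:4x8b <type><data_type><reserved><id>' packing, the sketch below unpacks the individual bytes. The macro names and the assumed byte order are illustrative only; real code should use the official AI_RELOC_MPOOL_GET_XXX() macros:

```c
#include <stdint.h>

/* Hypothetical unpacking of the '32b:4x8b <type><data_type><reserved><id>' flags
 * word (sketch); the assumed byte positions are NOT taken from the ST header. */
#define MPOOL_GET_TYPE(flags)      (((uint32_t)(flags) >> 24) & 0xFFu)
#define MPOOL_GET_DATA_TYPE(flags) (((uint32_t)(flags) >> 16) & 0xFFu)
#define MPOOL_GET_ID(flags)        ((uint32_t)(flags) & 0xFFu)
```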

ll_aton_reloc_get_input/output_buffers_info()

const LL_Buffer_InfoTypeDef *ll_aton_reloc_get_input_buffers_info(const NN_Instance_TypeDef *nn_instance,
                                                                  int32_t num);
const LL_Buffer_InfoTypeDef *ll_aton_reloc_get_output_buffers_info(const NN_Instance_TypeDef *nn_instance,
                                                                   int32_t num);

Description

The ll_aton_reloc_get_input/output_buffers_info function is used to obtain information about a specific input/output buffer of a neural network instance. This can be useful for understanding the structure and requirements of the input data for the neural network. By providing the neural network instance and the index of the desired input buffer, users can retrieve detailed information about the buffer, such as its size, type, and memory location.

Parameters

  • nn_instance: A pointer to the neural network instance (NN_Instance_TypeDef). This parameter specifies the neural network instance for which the input/output buffer information is to be retrieved.
  • num: An integer specifying the index of the input/output buffer whose description is to be retrieved. The index is zero-based, meaning that num = 0 refers to the first input buffer, num = 1 refers to the second input buffer, and so on.

Return Value

  • The function returns a pointer to a LL_Buffer_InfoTypeDef structure, which contains the description of the specified input/output buffer. If the specified buffer index is out of range, the function may return NULL.

ll_aton_reloc_set_input/output()

LL_ATON_User_IO_Result_t ll_aton_reloc_set_input(const NN_Instance_TypeDef *nn_instance, uint32_t num, void *buffer,
                                                 uint32_t size);
LL_ATON_User_IO_Result_t ll_aton_reloc_set_output(const NN_Instance_TypeDef *nn_instance, uint32_t num, void *buffer,
                                                  uint32_t size);

Description

Both ll_aton_reloc_set_input and ll_aton_reloc_set_output functions are used to configure the address of the input and output buffers for a neural network instance, respectively. By providing the neural network instance, buffer index, buffer pointer, and buffer size, users can set up the necessary memory regions for input and output data.

Warning

These functions should only be used when the deployed model is generated with the '--no-inputs-allocation' and/or '--no-outputs-allocation' option, respectively.

Parameters

  • nn_instance: A pointer to the neural network instance (NN_Instance_TypeDef). This parameter specifies the neural network instance for which the input/output buffer is to be set.
  • num: An unsigned integer specifying the index of the input/output buffer to be set. The index is zero-based.
  • buffer: A pointer to the buffer that will hold the input/output data. This parameter specifies the memory location where the data is stored.
  • size: An unsigned integer specifying the size of the input/output buffer in bytes.

Return Value

  • The function returns a value of type LL_ATON_User_IO_Result_t. This return value indicates the result of the operation, such as success or an error code.

ll_aton_reloc_get_input/output()

void *ll_aton_reloc_get_input(const NN_Instance_TypeDef *nn_instance, uint32_t num);
void *ll_aton_reloc_get_output(const NN_Instance_TypeDef *nn_instance, uint32_t num);

Description

Both ll_aton_reloc_get_input and ll_aton_reloc_get_output functions are used to retrieve pointers to the input and output buffers for a neural network instance, respectively. By providing the neural network instance and buffer index, users can obtain direct access to the memory regions used for input and output data.

Warning

These functions should only be used when the deployed model is generated with the '--no-inputs-allocation' and/or '--no-outputs-allocation' option, respectively.

Parameters

  • nn_instance: A pointer to the neural network instance (NN_Instance_TypeDef). This parameter specifies the neural network instance for which the input/output buffer pointer is to be retrieved.
  • num: An unsigned integer specifying the index of the input/output buffer to be retrieved. The index is zero-based.

Return Value

  • The function returns a pointer to the input/output buffer. If the specified buffer index is out of range or an error occurs, the function may return NULL.

Use directly the Python scripts

Generating the relocatable binary model

The npu_driver.py script is the entry point for executing the steps required to generate the loadable model. The '--input/-i' option is used to specify the location of the generated "network.c" file and the associated memory initializers (*.raw files). The default value is ./st_ai_output. The '--output/-o' option is used to specify the output folder (default: ./build).

[Details] npu_driver.py script
usage: npu_driver.py [-h] [--input STR] [--output STR] [--name STR] [--no-secure] [--no-dbg-info] [--ecblob-in-params]
                     [--split] [--llvm] [--st-clang] [--compatible-mode] [--custom [STR]] [--cross-compile STR]
                     [--gen-c-file] [--parse-only] [--no-clean] [--log [STR]] [--json [STR]] [--verbosity [{0,1,2}]]
                     [--debug] [--no-color]

NPU Utility - Relocatable model generator v1.4

optional arguments:
  -h, --help            show this help message and exit
  --input STR, -i STR   location of the generated c-files (or network.c file path)
  --output STR, -o STR  output directory
  --name STR, -n STR    basename of the generated c-files (default=<network-file-name>)
  --no-secure           generate binary model for non secure context
  --no-dbg-info         generate binary model without LL_ATON_EB_DBG_INFO
  --ecblob-in-params    place the EC blob in param section
  --split               generate a separate binary file for the params/weights
  --llvm                use LLVM compiler and libraries (default: GCC compiler is used)
  --st-clang            use ST CLANG compiler and libraries (default: GCC compiler is used)
  --compatible-mode     set the compible option (target dependent)
  --custom [STR]        config file for custom build (default: custom.json)
  --cross-compile STR   prefix of the ARM tool-chain (CROSS_COMPILE env variable can be used)
  --gen-c-file          generate c-file image (DEBUG PURPOSE)
  --parse-only          parsing only the generated c-files
  --no-clean            Don't clean the intermediate files
  --log [STR]           log file
  --json [STR]          Generate result file (json format)
  --verbosity [{0,1,2}], -v [{0,1,2}]
                        set verbosity level
  --debug               Enable internal log (DEBUG PURPOSE)
  --no-color            Disable log color support

“–name/-n” option

The --name/-n option allows specifying or overriding the expected c-name/file-name of the loadable runtime model. By default, the name of the generated network files is used.

Default behavior.

$ python npu_driver.py -i <gen-dir>/network.c 
...
Generating files...
    creating "build\network_rel.bin" (size=..)
$ python npu_driver.py -i <gen-dir>/my_model.c
...
Generating files...
    creating "build\network_rel.bin" (size=..)
$ python npu_driver.py -i <gen-dir>/network.c -n toto
...
Generating files...
    creating "build\toto_rel.bin" (size=..)