ST Neural-ART NPU - Runtime loadable model support
for STM32 target, based on ST Edge AI Core Technology 3.0.0
r1.3
Introduction
What is a runtime loadable model?
A runtime loadable model, also known as a relocatable binary model, is a binary object that can be installed and executed at runtime within an STM32 memory subsystem. This model includes a compiled version of the generated neural network (NN) C files, encompassing the necessary forward kernel functions and weights/parameters. Its primary purpose is to offer a flexible method for upgrading AI-based applications without the need to regenerate and flash the entire end-user firmware. This approach is particularly useful for technologies such as firmware over-the-air (FOTA). This feature allows the following:
- Models and their associated memory contents to be loaded dynamically at runtime.
- Flexibility to update or swap models without rebuilding the entire application.
- Clear separation between application logic and model data.
The binary object can be seen as a lightweight plug-in, capable of running from any address (position-independent code) and storing its data anywhere in memory (position-independent data). An efficient and minimal dynamic/runtime loader enables the instantiation and usage of this model. Unlike traditional systems, the firmware does not embed a complex and resource-intensive dynamic linker for Arm® Cortex®-M microcontrollers. The generated object is mainly self-contained, requiring limited and well-defined external symbols (NPU runtime dependent) at runtime.
In comparison with the software-only solution (fully self-contained object), the runtime loadable model for the Neural-ART accelerator must be installed inside an NPU runtime software stack. This allows supporting multiple instances, including the static version. No specific synchronization mechanism is embedded in the relocatable part; instead, the scheduling and access to the hardware resources (NPU subsystem) are managed by the stack itself (static part).
The generated runtime loadable model is a container that includes
relocatable code (text, data, and rodata sections) to configure
various neural processing unit (NPU) epochs. It primarily comprises
a compiled version (microcontroller unit (MCU)-dependent) of the
specialized network.c file, the low-level (LL) ATON
driver functions used, and the code for delegated software
operations, which are part of the optimized network runtime library.
For hybrid epochs (calls to LL_ATON_LIB_xxx functions),
a specific callback-based mechanism is implemented to directly
invoke the services of the stack. These system callback functions
are provided by the static part of the NPU stack and are registered
during the installation phase of the model (see the ll_aton_reloc_install()
function).
Solution overview
Memory model
As illustrated in the following figure, to use runtime loadable models, a portion of the internal RAMs (read/write AI regions) is fixed or reserved at absolute addresses. These regions must not be used by the application during inference. A non-fixed executable RAM region is required to install a given model. This read/write region is where the relocatable references are resolved during the installation process at runtime.
| Item | Definition |
|---|---|
| executable RAM region | Designates the memory region used to install a model at runtime (txt/data sections). This shared memory-mapped region, located anywhere in the memory subsystem, can be reserved at runtime by the application. The minimum requested size is returned by the ll_aton_reloc_get_info() function (rt_ram_xip or rt_ram_copy fields). Attributes: MCU RWX, NPU RO. For performance reasons, the region should be fully cached (MCU). (*) |
| RW AI regions | Designates the memory regions reserved for the activations/parameters. Base addresses are absolute and are defined in the mem-pool descriptor files. Attributes: MCU/NPU memory-mapped, MCU RW, NPU RW. To know/check the memory regions which are used, the ll_aton_reloc_get_mem_pool_desc() function can be used. (**) |
| RW AI region (external RAM) | If requested, designates the memory regions allocated/reserved to place the requested part of the memory-pool defined for the external RAM. The base address can be relative (placed anywhere in the external RAM) or absolute. Only one relative region is supported; it can be reserved at runtime by the application. The requested size (ext_ram_sz field) is returned by the ll_aton_reloc_get_info() function. Attributes: MCU RWX, NPU RW. (**) |
| Model X (weights/params) | Designates the memory regions where the relocatable binary models are stored. Base addresses (file_ptr) are relative and dependent on the application managing the FOTA-like mechanism. Attributes: MCU/NPU memory-mapped, MCU RO, NPU RO. |
| App | Designates the memory regions reserved for the application (static part). It embeds a minimal part (NN load service) used to load/install and execute a relocatable model. |
(*) This region must also be accessible by the NPU in the case where the epoch controller is enabled.
(**) These regions should also be accessible by the MCU in the case where an operator/epoch is delegated to the MCU.
Limitations
- Only the STM32N6-based series with Neural-ART accelerator is supported.
- Addresses of the used internal/on-chip memory-pools are fixed/absolute (USEMODE_ABSOLUTE attribute). Only the addresses of the off-chip memories (flash and RAM type) can be relocatable/relative (USEMODE_RELATIVE attribute).
- Only two relocatable/relative memory pools are supported: one for the RO regions handling the weights/params and another for an external RW memory region. Note that the external RAM region can also be absolute.
- Secure mode generation and XIP mode are not supported with the Arm Clang/LLVM toolchain (cmse is not compatible with ROPI/RWPI).
- Encrypted weights/params are not supported through the ST Edge AI Core CLI.
Embedded compiler options for relocatable mode
To generate relocatable objects, according to the Arm® Embedded
toolchain used, two sets of compilation options are considered (see
makefiles from the
$STEDGEAI_CORE_DIR/scripts/N6_reloc/resources
folder):
- For a GCC Arm® Embedded toolchain, all the code, including the network runtime library, is compiled using the -fpic, -msingle-pic-base, and -mno-pic-data-is-text-relative options. The Arm® Cortex®-M r9 register is designated as the platform register for the global offset table (GOT). The primary task of the dynamic/runtime loader is to update the GOT table and the indirect references in the data structure. Both XIP & COPY modes are fully supported.
- For a CLANG/LLVM Arm® Embedded toolchain, all the code is compiled with the options -fropi and -frwpi. The r9 register is also used; however, some constant/text parts must also be updated by the dynamic/runtime loader, which does not allow support for the XIP mode.
Prerequisites - Setting up a work environment
The generation of a runtime-loadable model is integrated into the
ST Edge AI Core command-line
interface (CLI). A specific pass calls the dedicated Python
scripts located in the
$STEDGEAI_CORE_DIR/scripts/N6_reloc directory. The
npu_driver.py script acts as the entry point for
generating the runtime-loadable models, which can be used directly.
By default, a GNU Arm Embedded toolchain supporting the Arm® Cortex®-M55 (with the ‘arm-none-eabi-’ prefix) is used. The user must ensure that the executables are available in the PATH, including a Make utility. The custom option allows customizing the environment.
%STEDGEAI_CORE_DIR% represents the root location where the ST Edge AI Core components are installed, typically in a path like "<tools_dir>/STEdgeAI/<version>/".
Using the Python scripts directly
To use the scripts directly, Python 3.9+ and the following Python modules are required:
pyelftools==0.27
tabulate
colorama
Note
The Python interpreter available in the ST Edge AI Core pack can be directly used (all required Python modules are already installed).
Optional tools for built-in validation workflow
For validating a runtime loadable model on the STM32N6570-DK board, you may optionally install:
- STM32CubeIDE
supporting the STM32N6 series (version 1.17.0 or higher)
- stm_ai_runner Python package
Note: The 'STM32_CUBE_IDE_DIR' system environment variable must be set to indicate the installation folder of the STM32CubeIDE pack.
Getting started
Generating a runtime loadable model
The -r/--reloc/--relocatable [rel-option] option enables generation of the runtime-loadable model. The standard generation workflow is extended with additional steps to pre-process and compile the specialized C files.
Here is a typical output log example that you might see when running the generate command for an NPU model deployment.
$ stedgeai generate -m <model-name>.tflite/onnx --target stm32n6 --st-neural-art <profile>@<usr_neural_art_reloc>.json --reloc
ST Edge AI Core v3.0.0
...
Neural ART - Package for runtime loadable model - v 1.4.0
...
Generated files (8)
--------------------------------------------------------------------------------
<output-directory-path>\<model-name>_OE_3_3_1.onnx
<output-directory-path>\<model-name>_OE_3_3_1_Q.onnx
<output-directory-path>\network.c
<output-directory-path>\network.h
<output-directory-path>\network_atonbuf.xSPI2.raw
<output-directory-path>\network_c_info.json
<output-directory-path>\network_rel.bin
<output-directory-path>\network_generate_rel.json
Creating report file <output-directory-path>\network_generate_report.txt
Generated files
| file | description |
|---|---|
| <network>_rel.bin | The main binary file. It contains the compiled version of the model, including the forward kernel functions and, by default, the weights. It also embeds additional sections (.header, .got, .rel) to enable the installation of the model. |
| <network>_data.bin | Optional file. If the split parameter is used, the weights are generated in a separate binary file. |
| <network>_generate_rel.json | Extra report file containing the main information of the generated runtime loadable model (JSON format). |
Note that memory initializers are not required for the memory regions used only for the activations.
NPU compiler and CLI options
There are no restrictions on the NPU compiler options
used for this step, except for the usage of the
--all-buffers-info option, which provides detailed
information about the intermediate and activation buffers. Note that
this additional information is removed during the generation of the
runtime loadable model. For the memory-pool descriptor file, the
following attributes are mandatory:
- “mode”: “USEMODE_RELATIVE” for external memory pools only. This is mandatory for the external flash part and optional for the external RAM part. Otherwise, use “USEMODE_ABSOLUTE”.
- “fformat”: “FORMAT_RAW” for all memory initializers.
Relocatable options
The -r/--reloc/--relocatable option can be used with the following parameters. These parameters are mainly forwarded to specialized Python scripts.
| parameter | description |
|---|---|
| <none> | Default behavior. Generates a single model binary file containing network, kernels, and weights. |
| “split” | Generates two separate binary files: one with the model (network + kernels) and another containing only the weights. Useful for flexible memory management. |
| “gen-c-file” | Generates the binary file as a C array instead of a raw binary. Mainly for debugging purposes. |
| “st-clang” | Use the ST Arm Clang toolchain for compilation instead of the default ARM GCC based toolchain. |
| “llvm” | Use the LLVM toolchain for compilation instead of the default ARM GCC based toolchain. |
| “ecblob-in-params” | Places the ecblobs (epoch controller blobs) together with the model weights/parameters in the binary. |
| “no-secure” | Compile the binary with non-secure flags, useful for non-secure execution environments. Forced behavior with llvm-clang compiler. |
| “no-dbg-info” | The LL_ATON_EB_DBG_INFO C-define is not set during the compilation of the model, which avoids embedding the debug-only information. |
| “custom[=<usr-file>.json]” | Use a custom JSON
configuration file to override default environment
settings. If no file is specified, uses custom.json by
default. |
The parameters can be combined using a comma separator.
$ stedgeai generate -m <model-name>.tflite/onnx <gen_options> --target stm32n6 --st-neural-art ... --reloc split,gen-c-file
...
<output-directory-path>\network_rel.bin
<output-directory-path>\network_data.bin
<output-directory-path>\network_img_rel.c
<output-directory-path>\network_img_rel.h
“split” parameter
The split option allows generating two separate
files. In this case, to deploy the model, both files should be
deployed on the target, and the address of the
"network_rel_params.bin" should be passed during the
installation process at runtime (see ll_aton_reloc_install()
function).
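For illustration, here is a minimal sketch of how the parameters address is forwarded to the installer when the split option is used. The model_addr and params_addr symbols are hypothetical, application-defined memory-mapped locations where "network_rel.bin" and "network_rel_params.bin" have been flashed; exec_ram_addr, rt, and nn_instance are prepared as shown in the "Minimal code" section.
/* Sketch only: model_addr/params_addr are hypothetical, application-defined addresses
   of the flashed "network_rel.bin" and "network_rel_params.bin" files. */
ll_aton_reloc_config config = {0};
config.exec_ram_addr  = exec_ram_addr;           /* executable RAM region reserved by the app */
config.exec_ram_size  = rt.rt_ram_copy;          /* size reported by ll_aton_reloc_get_info() */
config.ext_param_addr = (uintptr_t)params_addr;  /* required because the 'split' option is used */
config.mode           = AI_RELOC_RT_LOAD_MODE_COPY;
int res = ll_aton_reloc_install((uintptr_t)model_addr, &config, &nn_instance);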
“ecblob-in-params” parameter
The ecblob-in-params option indicates that the ecblobs are stored with the params/weights. This feature allows reducing the size of the exec memory region requested to execute the model in COPY mode (see the “Epoch controller consideration” section). Ecblobs are placed at the beginning of the params/weights memory segment. The split option remains supported; in this case, the params file includes the ecblobs.
“no-dbg-info” parameter
The no-dbg-info option allows the removal of the
LL_ATON_EB_DBG_INFO C-define to generate the runtime
loadable model. This option removes debug information, including the
intermediate description of buffers and additional debug fields in
the epoch descriptor, thereby reducing the size of the final binary.
These debug details are mandatory for using the built-in validation
stack.
“custom” parameter
The custom option allows overriding certain default environment variables used to build the relocatable binary model. By default, a custom.json file from the current working directory is used to retrieve the expected values. Alternatively, the parameter can be extended with a specific file name (for example: --reloc custom=myconfig.json). The supported keys are listed below; an illustrative example follows the table.
| supported key | description |
|---|---|
| “runtime_network_lib” | Used to indicate the absolute path of the used network runtime library (default: %STEDGEAI_CORE_DIR%/Middlewares/ST/AI/Lib/GCC/ARMCortexM55/NetworkRuntime1100_CM55_GCC_PIC.a ) |
| “extra_system_path” | Used to prepend a directory to the system PATH so that the used ARM compiler can be found. By default, the arm-none-eabi-gcc executable is used from the PATH |
| “llvm_install_path” | llvm target only. Used to set
the LLVM_COMPILER_PATH value in the llvm makefile to
indicate the root directory of the tools chain. Mandatory
for the llvm target. |
| “target_triplet” | llvm target only. Set the
TARGET_TRIPLET value in the llvm makefile. Default:
thumbv8m.main-unknown-none-eabihf |
| “llvm_sysroot” | llvm target only. Set the
LLVM_SYSROOT value in the llvm makefile. Default:
${LLVM_COMPILER_PATH}/lib/clang-runtimes/newlib/arm-none-eabi/armv8m.main_hard_fp |
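For example, a possible custom.json could look as follows (a sketch only; the keys are those documented above, a flat key/value layout is assumed, and the paths are purely illustrative and must be adapted to the local installation):
{
  "runtime_network_lib": "C:/tools/STEdgeAI/3.0/Middlewares/ST/AI/Lib/GCC/ARMCortexM55/NetworkRuntime1100_CM55_GCC_PIC.a",
  "extra_system_path": "C:/tools/gcc-arm-none-eabi/bin",
  "llvm_install_path": "C:/tools/llvm-arm-toolchain",
  "target_triplet": "thumbv8m.main-unknown-none-eabihf",
  "llvm_sysroot": "${LLVM_COMPILER_PATH}/lib/clang-runtimes/newlib/arm-none-eabi/armv8m.main_hard_fp"
}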
Memory layout information
The following information is reported in the log and also in a well-formatted JSON file: <network>_generate_rel.json.
...
Runtime Loadable Model - Memory Layout (series="stm32n6npu")
--------------------------------------------------------------------------------
v8.0 dbg=True async=True sec=True 'Embedded ARM GCC' cpuid=D22 fpu=True float-abi=2
XIP size = 3,344 data(2620)+got(708)+bss(16) sections
COPY size = 32,544 +ro(29200) sections
extra sections = 1,112 got:708(21.2%) header+rel:404(1.4%)
params size = 25,713
acts size = 56,496
binary file size = 32,832
params file size = 25,720
+--------------------------+----------------------------------+------+------------+-------+
| name (addr) | flags | foff | dst | size |
+--------------------------+----------------------------------+------+------------+-------+
| xSPI2 (0x20004c64) | 0x01010500 RELOC.PARAM.0.RCACHED | 0 | 0x00000000 | 25713 |
| AXISRAM5 (0x20004c6a) | 0x03020200 RESET.ACTIV.WRITE | 0 | 0x342e0000 | 56496 |
| <undefined> (0x00000000) | 0x00000000 UNUSED | 0 | 0x00000000 | 0 |
+--------------------------+----------------------------------+------+------------+-------+
Table: mempool c-descriptors (off=40000a00, 3 entries, from RAM)
rt_ctx: c_name="network", acts_sz=56,496, params_sz=25,713, ext_ram_sz=0
rt_ctx: rt_version_desc="atonn-v1.1.3-5-gdabeb3b4d (RELOC.GCC)"
| item | description |
|---|---|
| XIP size | Size (in bytes) of the executable RAM memory region when using Execute-In-Place (XIP) mode. This is the memory region where code is executed directly without copying. |
| COPY size | Size (in bytes) of the executable RAM memory region when using COPY mode, where code or data is copied from non-volatile memory to RAM before execution. |
| params size | Total size (in bytes) of the used part of the memory-pools related to the weights and parameters. Detailed information is provided in mempool c-descriptors table. |
| acts size | Total size (in bytes) of the used part of the memory-pools related to the activations needed during inference. Detailed information is provided in mempool c-descriptors table. |
| binary file size | Size (in bytes) of the generated binary file containing the compiled model and its data. |
| params file size | Size (in bytes) of the parameters file when the split option is used to separate weights from the main binary. |
| rt_ctx/v8.0.. | Binary header information indicating details such as the Embedded ARM toolchain version used, compilation flags, and other metadata. This header is used by the loader/install function to check if the binary is compliant with the static part of the runtime. |
The “mempool c-descriptors” table indicates the memory regions (part of the user memory-pools) and the flags which are considered by the ll_aton_reloc_install() function at runtime.
| flag | description |
|---|---|
| RELOC | Indicates a relocatable region, the address (dst=0) will be resolved at runtime during the installation process. |
| PARAM/ACTIV/MIXED | Indicates the type of contents: PARAM: params/weights only, ACTIV: activations only, MIXED: mixed |
| RCACHED/WCACHED | Indicates that a part of the memory region can be accessible through the NPU cache. RCACHED is associated with a RELOC.PARAM/read-only region |
| WRITE | Indicates that the memory region is a read-write memory region. |
| RESET | Indicates that the memory region can be cleared if the AI_RELOC_RT_LOAD_MODE_CLEAR option is used. |
| COPY | Indicates that the region is initialized/copied during the installation process. |
| UNUSED | Last entry in the mempool c-descriptors |
- The number 0 or 1 indicates the ID of the relocatable memory regions. Currently, only two regions are supported: 0 for a parameters/weights-only region in external flash and 1 for a read/write memory region in the external RAM.
- foff specifies the offset in the parameters/weights section to locate the associated memory initializer when requested.
- dst specifies the destination address. If not equal to zero, the address is an absolute address; otherwise, the region is a relocatable region.
- size specifies the size in bytes.
Example with a tiny model using only the internal NPU RAM for the activations and weights/params.
During the installation process at runtime, the AXISRAM5 region (absolute address, COPY.MIXED) is initialized with the contents of the parameters/weights section. AXISRAM3, AXISRAM4, and AXISRAM6 are used exclusively for the activations.
+--------------------------+------------------------------+------+------------+-------+
| name (addr) | flags | foff | dst | size |
+--------------------------+------------------------------+------+------------+-------+
| AXISRAM6 (0x20004c64) | 0x03020200 RESET.ACTIV.WRITE | 0 | 0x34350000 | 500 |
| AXISRAM5 (0x20004c6d) | 0x02030200 COPY.MIXED.WRITE | 0 | 0x342e0000 | 33728 |
| AXISRAM4 (0x20004c76) | 0x03020200 RESET.ACTIV.WRITE | 0 | 0x34270000 | 40496 |
| AXISRAM3 (0x20004c7f) | 0x03020200 RESET.ACTIV.WRITE | 0 | 0x34200000 | 16000 |
| <undefined> (0x00000000) | 0x00000000 UNUSED | 0 | 0x00000000 | 0 |
+--------------------------+------------------------------+------+------------+-------+
Table: mempool c-descriptors (off=40000428, 5 entries, from RAM)
Example with a model using only the external RAM/FLASH (internal RAMs are not used)
During runtime in the installation process, the references related to xSPI1 and xSPI2 (relative address) are resolved to the external RAM address (reserved by the application) and the parameters/weights section (part of the installed relocatable module), respectively.
+--------------------------+----------------------------------+------+------------+-------+
| name (addr) | flags | foff | dst | size |
+--------------------------+----------------------------------+------+------------+-------+
| xSPI1 (0x20004c62) | 0x01020601 RELOC.ACTIV.1.WCACHED | 0 | 0x00000000 | 64496 |
| xSPI2 (0x20004c68) | 0x01010500 RELOC.PARAM.0.RCACHED | 0 | 0x00000000 | 32625 |
| <undefined> (0x00000000) | 0x00000000 UNUSED | 0 | 0x00000000 | 0 |
+--------------------------+----------------------------------+------+------------+-------+
Table: mempool c-descriptors (off=40001618, 3 entries, from RAM)
Epoch controller consideration
There is no functional limitation on using the
epoch
controller with the runtime-loadable model. By default, the
generated command streams (also called ecblobs) are stored as
constants in the read-only data (rodata) section. However, as they
reference addresses that are not known at generation time
(i.e. weights/params buffers), a dedicated relocatable mechanism is
implemented to patch the ecblobs during the initialization of the
model. This mechanism requires an additional SRAM memory area in the
uninitialized data (bss) section to copy the ecblobs before
patching. The -v 2 option of the ST Edge AI Core CLI reports detailed information about the rodata/bss sections related to the ecblob objects.
$ stedgeai generate -m <model-name>.tflite/onnx --target stm32n6 --st-neural-art <profile>@<usr_neural_art_reloc>.json --reloc -v 2
...
+--------------------+--------+---------+-------+
| Name | bss | ro data | reloc |
+--------------------+--------+---------+-------+
| _ec_blob_network_1 | 19,776 | 19,904 | r:ptr |
| | | | |
| total | 19,776 | 19,904 | |
+--------------------+--------+---------+-------+
Table: EC blob objects (1)
Consequently, when the epoch controller is used, the size of the executable RAM required to install a model (XIP or COPY mode) becomes significantly more critical. The tables below illustrate the different required sizes according to the configuration.
- Weights/params are placed in the external flash (relative address)
| Configuration | XIP size | COPY size | params size | binary file size |
|---|---|---|---|---|
| no EC | 3,344 | 32,544 | 25,713 | 58,552 |
| with EC | 20,416 | 53,640 | 22,065 | 55,984 |
| with EC + ecblob-in-params | 20,416 | 33,712 | 22,065 | 55,960 |
- Weights/params are placed in the internal RAM (fixed/absolute address).
| Configuration | XIP size | COPY size | params size | binary file size |
|---|---|---|---|---|
| no EC | 1,888 | 32,600 | 25,713 | 66,496 |
| with EC | 632 | 33,088 | 22,065 | 55,832 |
| with EC + ecblob-in-params | 632 | 13,312 | 22,065 | 55,832 |
Weights/params encryption consideration
If the model is generated to support encrypted weights/params (with the NPU compiler option '--encrypt-weights'), the weights/params file (network_atonbuf.xSPI2.raw) must be encrypted before generating the relocatable binary model, as for the non-relocatable model. This workflow is not integrated in the ST Edge AI Core CLI; the Python scripts (npu_driver.py script) must be used directly to generate the relocatable binary model. The split option is also preferable in order to fix the address of the weights/params in the flash memory, because the encryption depends on the storage location.
To use a model with encrypted weights/params, the index/keys for the different bus interfaces must be set before executing the model.
...
LL_ATON_RT_Reset_Network(&nn_instance);
// Set bus interface keys -- used for encrypted inference only
LL_Busif_SetKeys ( 0 , 0 , BUSIF_LSB_KEY , BUSIF_MSB_KEY );
LL_Busif_SetKeys ( 0 , 1 , BUSIF_LSB_KEY , BUSIF_MSB_KEY );
LL_Busif_SetKeys ( 1 , 0 , BUSIF_LSB_KEY , BUSIF_MSB_KEY );
LL_Busif_SetKeys ( 1 , 1 , BUSIF_LSB_KEY , BUSIF_MSB_KEY );
do {
/* Execute first/next step of Cube.AI/ATON runtime */
ll_aton_rt_ret = LL_ATON_RT_RunEpochBlock(&nn_instance);
...
Format of the relocatable binary model
The following figure illustrates the layout of the generated
relocatable binary model ("network_rel.bin" file). By
default, the memory initializers (params/weights sections) are
included in the image. If the split
option is used, the params/weights sections are generated in a
separated binary file ("network_rel_params.bin"
file).
- If the epoch controller is enabled, the generated bitstreams are included in the rodata/data sections by default. The --ecblob-in-params option can be used to store the ecblobs with the params section.
- The description of the params/weights sections is defined in the rodata section.
- When the split option is used, the address ('ext_param_addr') is passed and defined by the application when the ll_aton_reloc_install() function is called.
Evaluating the RT loadable model (STM32N6570-DK board)
A ready-to-use environment for the STM32N6570-DK board (DEV mode) is delivered in the ST Edge AI Core pack. It allows performing a classical validation workflow, either through the validate command or through the stm_ai_runner Python package. Note that an STM32CubeIDE environment must be installed.
Warning
Set the boot mode to development mode (BOOT1 switch position is 1-3, BOOT0 switch position does not matter). After the loading phase, the board must not be switched off or disconnected in order to perform the validation.
After the generation of the relocatable binary model, the
st_load_and_run.py Python script is used to flash the
binary files at the fixed addresses, to load, and to run a built-in
validation firmware. After these steps, a single inference is
executed reporting the performance.
[Details] st_load_and_run.py script
usage: st_load_and_run.py [-h] [--input [STR ...]] [--board STR]
[--address STR] [--mode STR] [--cube-ide-dir STR]
[--log [STR]] [--verbosity [{0,1,2}]] [--debug] [--no-color]
NPU Utility - ST Load and run (dev environment) v2.0
optional arguments:
-h, --help show this help message and exit
--input [STR ...], -i [STR ...]
location of the binary files (default: build/network_rel.bin)
--board STR ST development board (default: stm32n6570-dk)
--address STR destination address - model(,params) (default: 0x71000000,0x71800000)
--mode STR firmware variants & mode: copy,xip[no-flash,no-overlay,no-run,usbc,ext]
--cube-ide-dir STR installation directory of STM32CubeIDE tools (ex. ~/ST/STM32CubeIDE_1.19.0/STM32CubeIDE)
--log [STR] log file
--verbosity [{0,1,2}], -v [{0,1,2}]
set verbosity level
--debug enable internal log (DEBUG PURPOSE)
--no-color disable log color support
Supported features/limitations
- Only the STM32N6570-DK board is supported. Destination addresses are fixed.
- Multiple models can be deployed.
  - By default, they share the same execution RAM memory region. The no-overlay mode can be used to force the creation of a dedicated memory region.
  - Execution of the models is always sequential; the activation regions are overlapped.
- The --no-inputs/outputs-allocation options are not supported.
- copy or xip mode can be selected. If the xip mode is not supported, copy mode is automatically used.
- The execution RAM region is located in the internal SRAM first; if more memory is requested, the external RAM is used.
Deploy a model
$ $STEDGEAI_CORE_DIR/Utilities/windows/python $STEDGEAI_CORE_DIR/scripts/N6_reloc/st_load_and_run.py
-i <output-directory-path>\network_rel.bin
NPU Utility - ST Load and run (dev environment) (version 2.0)
Creating date : ...
model info : st_ai_output\network_rel.bin: size=55,832
cpuid=0xd22 c_name='network' 'Embedded ARM GCC' ll_aton=1.1.3.3
acts/params=57,087/22,065 xip/copy=632/13,312 ext_ram=0 split=False
secure=True
board : 'stm32n6570-dk'
mode : ['copy']
board info : 'stm32n6570-dk' baudrate=921600 eflash_loader='MX66UW1G45G_STM32N6570-DK.stldr'
eflash[sec/pg]=64.0KB/4.0KB
exec_ram[int/ext]=655,360/4,194,304 ext_ram=28,311,552 addrs=0x70FFF000,0x71000000,0x71800000
use clang : False
install mode : 'copy'
total (1) : overlay bin=65,536 xip=640 [copy=13,312] 'installed in int exec_ram' ext_ram=0
flash@ : 0x71000000
flash_params@ : 0x00000000
Resetting the board.
Flashing 'header (nb_entries=1)' at address 0x70FFF000 (size=20)..
Flashing 'st_ai_output\network_rel.bin' at address 0x71000000 (size=55832)..
Loading & start the validation application 'stm32n6570-dk-validation-reloc'..
Deployed model is started and ready to be used.
Executing the deployed model (desc=serial:921600)..
...
Inference time per node
-------------------------------------------------------------------------------------------------------------------
c_id m_id type dur (ms) % cumul CPU cycles name
-------------------------------------------------------------------------------------------------------------------
0 - epoch (EC) 0.210 96.6% 96.6% [ 896 166,877 573 ] EpochBlock_1 -> 14
1 - epoch (SW) 0.007 3.4% 100.0% [ 110 40 5,704 ] EpochBlock_15
-------------------------------------------------------------------------------------------------------------------
total 0.218 [ 1,006 166,917 6,277 ]
4592.41 inf/s [ 0.6% 95.8% 3.6% ]
-------------------------------------------------------------------------------------------------------------------
Evaluate the model
The default validation workflow can be used to evaluate the deployed model.
$ stedgeai validate -m <quantized_model> --target stm32n6 --mode target -d serial:921600
...
Evaluation report (summary)
-------------------------------------------------------------------------------------------------------------
Output acc rmse ... std nse cos tensor
-------------------------------------------------------------------------------------------------------------
X-cross #1 n.a. 0.007084151 ... 0.007081 0.999671 0.999910 10 x uint8(1x3087x6)
-------------------------------------------------------------------------------------------------------------
Deploy multiple models
Generate the models: model1 and
model2
stedgeai generate -m <model1> --target stm32n6 --st-neural-art test@$STEDGEAI_CORE_DIR/scripts/N6_reloc/test/neural_art_reloc.json -r -n model1
stedgeai generate -m <model2> --target stm32n6 --st-neural-art test@$STEDGEAI_CORE_DIR/scripts/N6_reloc/test/neural_art_reloc.json -r -n model2
Warning
Each deployed model must have its own c-name.
Deploy the models for evaluation
$ $STEDGEAI_CORE_DIR/Utilities/windows/python $STEDGEAI_CORE_DIR/scripts/N6_reloc/st_load_and_run.py
-i <output-directory-path>\model1_rel.bin <output-directory-path>\model2_rel.bin
Evaluate the model1 model
$ stedgeai validate -m <model1> --target stm32n6 --mode target -d serial:921600 -n model1
Deploy and use a relocatable model
There is no specific service for deploying the model on a target; FOTA-like mechanisms and other stacks used to manage the firmware or the models are application-specific. However, when the relocatable model is flashed on the target at a given memory-mapped address (file_ptr), the ll_aton_reloc_install() function must be called to install the model.
LL_ATON stack configuration
| LL_ATON_XX C-defines | comment |
|---|---|
| LL_ATON_PLATFORM | 'LL_ATON_PLAT_STM32N6'
mandatory |
| LL_ATON_EB_DBG_INFO | mandatory |
| LL_ATON_RT_RELOC | mandatory - Enables the code paths and functionalities required to manage the relocatable mode. |
| LL_ATON_RT_MODE | LL_ATON_RT_ASYNC is
recommended but LL_ATON_RT_POLLING can be used. |
| LL_ATON_OSAL | no restriction |
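As an illustration, these C-defines could be provided as a project-level configuration (a sketch only; the exact mechanism, compiler command line or configuration header, is build-system dependent, and only the defines listed in the table above are shown).
/* Sketch of a possible build-time configuration for the static part of the NPU stack. */
#define LL_ATON_PLATFORM     LL_ATON_PLAT_STM32N6   /* mandatory */
#define LL_ATON_EB_DBG_INFO                         /* mandatory */
#define LL_ATON_RT_RELOC                            /* mandatory - enables relocatable model support */
#define LL_ATON_RT_MODE      LL_ATON_RT_ASYNC       /* recommended (LL_ATON_RT_POLLING also possible) */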
Minimal code
The following code snippet illustrates how to install and use a runtime loadable model within a bare-metal environment, for a single network with a single input tensor and a single output tensor (with or without the epoch controller). The user_model_mgr() function is used to retrieve the address where the module has been flashed.
#include "ll_aton_reloc_network.h"
static NN_Instance_TypeDef nn_instance; /* LL ATON handle */
uint8_t *input_0, *prediction_0;
uint32_t input_size_0, prediction_size_0;
int ai_init(const uintptr_t file_ptr, const uintptr_t file_params_ptr)
{
const LL_Buffer_InfoTypeDef *ll_buffer;
ll_aton_reloc_info rt;
int res;
/* Retrieve the requested RAM size to install the model */
res = ll_aton_reloc_get_info(file_ptr, &rt);
/* Reserve executable memory region to install the model */
uintptr_t exec_ram_addr = reserve_exec_memory_region(rt.rt_ram_copy);
/* Reserve external read/write memory region for external RAM region */
uintptr_t ext_ram_addr = 0;
if (rt.ext_ram_sz > 0)
ext_ram_addr = reserve_ext_memory_region(rt.ext_ram_sz);
/* Create and install an instance of the relocatable model */
ll_aton_reloc_config config;
config.exec_ram_addr = exec_ram_addr;
config.exec_ram_size = rt.rt_ram_copy;
config.ext_ram_addr = ext_ram_addr;
config.ext_ram_size = rt.ext_ram_sz;
config.ext_param_addr = NULL; /* or @ of the weights/params if split mode is used */
config.mode = AI_RELOC_RT_LOAD_MODE_COPY; // | AI_RELOC_RT_LOAD_MODE_CLEAR;
res = ll_aton_reloc_install(file_ptr, &config, &nn_instance);
if (res == 0)
{
/* Retrieve the addresses of the input/output buffers */
ll_buffer = ll_aton_reloc_get_input_buffers_info(&nn_instance, 0);
input_0 = LL_Buffer_addr_start(ll_buffer);
input_size_0 = LL_Buffer_len(ll_buffer);
ll_buffer = ll_aton_reloc_get_output_buffers_info(&nn_instance, 0);
prediction_0 = LL_Buffer_addr_start(ll_buffer);
prediction_size_0 = LL_Buffer_len(ll_buffer);
/* Init the LL ATON stack and the instantiated model */
LL_ATON_RT_RuntimeInit();
LL_ATON_RT_Init_Network(&nn_instance);
}
return res;
}
void ai_deinit(void)
{
LL_ATON_RT_DeInit_Network(&nn_instance);
LL_ATON_RT_RuntimeDeInit();
}
void ai_run(void) {
LL_ATON_RT_RetValues_t ll_aton_ret;
LL_ATON_RT_Reset_Network(&nn_instance);
do {
ll_aton_ret = LL_ATON_RT_RunEpochBlock(&nn_instance);
if (ll_aton_ret == LL_ATON_RT_WFE) {
LL_ATON_OSAL_WFE();
}
} while (ll_aton_ret != LL_ATON_RT_DONE);
}
void main(void)
{
uintptr_t file_ptr, file_params_ptr;
user_system_init(); /* HAL, clocks, NPU sub-system... */
user_model_mgr(&file_ptr, &file_params_ptr);
if (ai_init(file_ptr, file_params_ptr)) {
/*... installation issue ..*/
}
while (user_app_not_finished()) {
/* Fill input buffers */
user_fill_inputs(input_0);
/* If requested, perform the NPU/MCU cache operations to guarantee the coherency of the memory. */
// LL_ATON_Cache_MCU_Clean_Invalidate_Range(input_0, input_size_0);
// LL_ATON_Cache_MCU_Invalidate_Range(prediction_0, prediction_size_0);
/* Perform a complete inference */
ai_run();
/* Post-process the predictions */
user_post_process(prediction_0);
}
ai_deinit();
//...
}
XIP/COPY execution modes
XIP execution mode
This execution mode is the preferred mode: the code and weight sections are stored in (and executed from) the flash memory. Regarding memory placement, this approach is similar to the static method. This mode is efficient in terms of memory usage; only the executable RAM region storing the data/bss/got sections is required.
COPY execution mode
This mode involves copying the code to a different memory location before execution. This process is useful for optimizing performance or managing memory access speeds. The requested size for the executable RAM region is critical. Along with the data, bss, and GOT sections, it also includes the text and rodata sections.
Example of NPU compiler configuration files
Memory-pool descriptor files
The "%STEDGEAI_CORE_DIR%/scripts/N6_reloc/test"
contains a set of configuration and memory-pool
descriptor files which can be used. Here are two main points to
consider:
- If nonsecure context is used to execute the deployed model, the
base @ of the different memory-pools should be set with a nonsecure
address (ex.
0x24350000instead of0x34350000for the AXIRAM6 memory). - For the memory-pools representing an off-chip device, the
"USEMODE_RELATIVE"attribute should be used.
For test purpose with STM32N6570-DK board, different ready-to-use configurations are provided.
| memory-pool desc. file | description |
|---|---|
| stm32n6_reloc.mpool | Full memories. All NPU rams (AXIRAM3..6), AXIRAM2 and external RAM/flash are defined. NPU cache can be used for the external memories. |
| stm32n6_npuram.mpool | NPU memories only. Only the NPU rams (AXIRAM3..6) are defined. |
| stm32n6_int.mpool | Internal memories only. Only the NPU RAMs (AXIRAM3..6 and AXIRAM2) are defined. |
| stm32n6_int2.mpool | Internal memories only. Similar to stm32n6_int.mpool but the AXIRAM2 is privileged for the weights. |
| stm32n6_reloc_ext.mpool | External memories only. Only the external RAM/flash are defined. NPU cache can be used for the external memories. |
For information and test purposes, the "non-reloc" memory-pool descriptor files are also provided; they can be used with a normal deployment flow.
Part of the
./test/mpool/stm32n6_reloc.mpool file:
...
{
"fname": "AXISRAM6",
"name": "npuRAM6",
"fformat": "FORMAT_RAW",
"prop": { "rights": "ACC_WRITE", "throughput": "HIGH", "latency": "LOW",
"byteWidth": 8, "freqRatio": 1.25, "read_power": 18.531, "write_power": 16.201 },
"offset": { "value": "0x34350000", "magnitude": "BYTES" },
"size": { "value": "448", "magnitude": "KBYTES" }
},
{
"fname": "xSPI1",
"name": "hyperRAM",
"fformat": "FORMAT_RAW",
"prop": { "rights": "ACC_WRITE", "throughput": "MID", "latency": "HIGH",
"byteWidth": 2, "freqRatio": 5.00, "cacheable": "CACHEABLE_ON",
"read_power": 380, "write_power": 340.0, "constants_preferred": "true" },
"offset": { "value": "0x90000000", "magnitude": "BYTES" },
"size": { "value": "32", "magnitude": "MBYTES" },
"mode": "USEMODE_RELATIVE"
},
{
"fname": "xSPI2",
"name": "octoFlash",
"fformat": "FORMAT_RAW",
"prop": { "rights": "ACC_READ", "throughput": "MID", "latency": "HIGH",
"byteWidth": 1, "freqRatio": 6.00, "cacheable": "CACHEABLE_ON",
"read_power": 110, "write_power": 400.0, "constants_preferred": "true" },
"offset": { "value": "0x70400000", "magnitude": "BYTES" },
"size": { "value": "64", "magnitude": "MBYTES" },
"mode": "USEMODE_RELATIVE"
}
...
“neural_art.json” file
The "%STEDGEAI_CORE_DIR%/scripts/N6_reloc/test"
contains two examples of configuration file using the requested
memory-pool descriptor files. They provide different profiles using
different memory configurations which are aligned with the generic
AI test validation.
| profile | description |
|---|---|
| test | Default profile. Full memories configuration; the epoch controller is not enabled |
| test-ec | Default profile + support of the epoch controller |
| test-int | Internal profile. NPU memories only configuration; the epoch controller is not enabled |
| test-int-ec | Internal profile + support of the epoch controller |
| test-ext | External profile. External memories only configuration; the epoch controller is not enabled |
| test-ext-ec | External profile + support of the epoch controller |
Part of the ./test/neural_art_reloc.json
file:
...
"test" : {
"memory_pool": "./mpools/stm32n6_reloc.mpool",
"options": "--native-float --cache-maintenance --Ocache-opt --enable-virtual-mem-pools --Os
--optimization 3 --Oauto-sched --all-buffers-info --csv-file network.csv"
},
"test-ec" : {
"memory_pool": "./mpools/stm32n6_reloc.mpool",
"options": "--native-float --cache-maintenance --Ocache-opt --enable-virtual-mem-pools --Os
--optimization 3 --Oauto-sched --all-buffers-info --csv-file network.csv --enable-epoch-controller"
},
...
Performance impacts
Accuracy
No difference with the nonrelocatable or static implementation.
Set up and inference time
For the inference time (after the install/init steps), no significant difference is expected versus a nonrelocatable or static implementation. Only the set-up time, needed to install/create an instance of a given model, is impacted. The installation time is directly proportional to the number of relocations and to the size of the code/data sections, according to the used COPY/XIP mode. Note that if the memory initializers need to be copied into the internal RAMs, this extra time is equivalent to the static implementation.
- STM32N6570-DK
board (DEV mode), overdrive clock setting (NPU 1 GHz, MCU 800
MHz).
- By default, the external flash is used to store the
weights/params.
- Internal RAMs (AXIRAM3,..6) are used to store the activations.
- The executable RAM region is located in the internal AXIRAM1 (this is equivalent to the static case where the application is also executed from the AXIRAM1).
| yolov5 224 (nano) | static mode (absolute @ only) | reloc mode (copy) | reloc mode (xip) (*) |
|---|---|---|---|
| inference time (w/ ec) | 10.4 ms, 95.45 inf/s | 10.5 ms, 94.9 inf/s | 10.6 ms, 94.6 inf/s |
| inference time (w/o ec) | 13.2 ms, 75.35 inf/s | 13.0 ms, 77.1 inf/s | 14.3 ms, 69.6 inf/s |
| install/init time (ms) (w/ ec) | 0.0 / 0.03 | 11.7 / 0.73 | 10.9 / 0.93 |
| install/init time (ms) (w/o ec) | 0.0 / 0.03 | 11.2 / 0.03 | 10.8 / 0.03 |
(*) with the epoch controller, as the blob should be updated, it is fetched from the executable RAM region (AXIRAM1). We only observe an impact in the case where the configuration code is fetched from the external FLASH, w/o epoch controller support.
For the reloc mode with epoch controller support, the
'install/init' time is mainly due to the copy of the
blobs in the AXIRAM1 and the requested relocations to resolve the
weights/params addresses in the different blobs.
Case where only the AXIRAMx is used for the activations and the weights/params (~2 Mbytes)
| yolov5 224 (nano) | static mode (absolute @ only) | reloc mode (copy) | reloc mode (xip) |
|---|---|---|---|
| inference time (w/ ec) | 9.5 ms, 105.3 inf/s | 9.5 ms, 105.8 inf/s | 9.6 ms, 104.6 inf/s |
| install/init time (ms) (w/ ec) | 17+ / 0.03 | 18.1 / 0.03 | 18.1 / 0.03 |
The 'install/init' time is similar in both cases. It mainly corresponds to the copy of the memory initializers from the FLASH location to the internal RAMs. No extra relocation for the weights/activations is required (all weights/activations addresses are absolute).
Memory layout overhead
In comparison with a static implementation, the relocation mode involves two additional sections, GOT/REL, which are used to support the position-independent code/data. Their size is directly proportional to the number of relocated references.
LL ATON runtime API extension
To enable the support of a runtime loadable model, the LL_ATON files should be compiled with the following C-define:
LL_ATON_RT_RELOC
The LL_ATON_RT_RELOC C-define activates the code
paths and functionalities required to manage and install runtime
loadable models. Ensuring that this macro is defined during
compilation is crucial for the successful deployment and execution
of runtime loadable models.
ll_aton_reloc_install()
int ll_aton_reloc_install(const uintptr_t file_ptr, const ll_aton_reloc_config *config,
NN_Instance_TypeDef *nn_instance);
Description
The ll_aton_reloc_install() function acts as a
runtime dynamic loader. It is used to install and to create an
instance of a memory-mapped runtime loadable module. By providing
the model image pointer (file_ptr), configuration
details, and neural network instance, users can set up the model for
execution. The function performs compatibility checks, initializes
memory pools, and installs/relocates code and data sections as
needed.
Parameters
- file_ptr: A uintptr_t value representing the pointer to the image of the runtime loadable model. This parameter specifies the location of the model image to be installed.
- config: A pointer to an ll_aton_reloc_config structure. This parameter provides the configuration details for how the model should be installed, including memory addresses and sizes.
- nn_instance: A pointer to an NN_Instance_TypeDef structure. This parameter is updated to handle the installed model, creating an instance of the neural network.
Return Value
- The function returns an integer value. A return value of 0 typically indicates success, while a nonzero value indicates that an error occurred during the installation process.
Steps Executed
- Checking step
  - This step checks the compatibility of the binary object against the runtime environment (static part of the firmware). The main points checked include:
    - The version and content of the binary file header.
    - MCU type and whether the FPU (floating-point unit) is enabled (context/setting of the caller is used).
    - Secure or nonsecure context.
    - Whether the binary module has been compiled with the LL_ATON_EB_DBG_INFO or LL_ATON_RT_ASYNC C-defines.
    - Version of the used LL_ATON files.
- Memory-pool initialization step
  - If requested, this step initializes the used memory regions for the given model. Specifically:
    - If read/write memory pools handle the params/weights section, the associated memory region is initialized with values from the params/weights section.
    - Optionally, if the AI_RELOC_RT_LOAD_MODE_CLEAR flag is set, the read/write memory region handling the activations is cleared.
- Code/data installation and relocation step
  - According to the AI_RELOC_RT_LOAD_MODE_COPY or AI_RELOC_RT_LOAD_MODE_XIP flag:
    - The code/data sections are copied into the executable RAM region.
    - The relocation process is performed to update references.
- Register the call-backs
  - The system callback functions provided by the static part of the NPU stack are registered for the installed instance (the default registration can be overwritten with the ll_aton_reloc_set_callbacks() function).
ll_aton_reloc_config C-struct
The purpose of the ll_aton_reloc_config C structure is to provide the parameters required to install a runtime loadable model.
typedef struct _ll_aton_reloc_config {
uintptr_t exec_ram_addr; /* base@ of the exec memory region to place the relocatable code/data (8-Bytes aligned) */
uint32_t exec_ram_size; /* max size in byte of the exec memory region */
uintptr_t ext_ram_addr; /* base@ of the external memory region to place the external pool (if requested) */
size_t ext_ram_size; /* max size in byte of the external memory region */
uintptr_t ext_param_addr; /* base@ of the param memory region (if requested) */
uint32_t mode;
} ll_aton_reloc_config;
- 'exec_ram_addr'/'exec_ram_size': These members indicate the base address (8-byte aligned) and the maximum size of the read/write executable RAM memory region. These parameters are mandatory. To determine the required size at runtime, the ll_aton_reloc_get_info() function can be used.
- 'ext_ram_addr'/'ext_ram_size': These members indicate the base address (8-byte aligned) and the maximum size of the read/write external RAM memory region, if requested. To determine the required size at runtime, the ll_aton_reloc_get_info() function can be used.
- 'ext_param_addr': This member indicates the base address (8-byte aligned) of the memory region containing the parameters/weights of the deployed model. This parameter is required when the split option is used; otherwise, it must be set to NULL.
- 'mode': This member indicates the expected execution mode. Or-ed flags can be used. The AI_RELOC_RT_LOAD_MODE_XIP or AI_RELOC_RT_LOAD_MODE_COPY flag is mandatory; the AI_RELOC_RT_LOAD_MODE_CLEAR flag is optional.
| mode | description |
|---|---|
| AI_RELOC_RT_LOAD_MODE_XIP | XIP (Execute In Place) execution mode |
| AI_RELOC_RT_LOAD_MODE_COPY | COPY execution mode |
| AI_RELOC_RT_LOAD_MODE_CLEAR | Reset the used activation memory regions |
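For example, a minimal sketch requesting the XIP execution mode with cleared activation regions (assuming rt has been filled by ll_aton_reloc_get_info() and exec_ram_addr points to a region reserved by the application):
/* Sketch: install in XIP mode and clear the activation memory regions at load time. */
ll_aton_reloc_config config = {0};
config.exec_ram_addr = exec_ram_addr;   /* executable RAM region reserved by the application */
config.exec_ram_size = rt.rt_ram_xip;   /* XIP-mode size reported by ll_aton_reloc_get_info() */
config.mode = AI_RELOC_RT_LOAD_MODE_XIP | AI_RELOC_RT_LOAD_MODE_CLEAR;
int res = ll_aton_reloc_install(file_ptr, &config, &nn_instance);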
ll_aton_reloc_set_callbacks()
int ll_aton_reloc_set_callbacks(const NN_Instance_TypeDef *nn_instance, const struct ll_aton_reloc_callback *cbs);
Description
The ll_aton_reloc_set_callbacks function is used to
overwrite the default registration of the callbacks done in the
ll_aton_reloc_install function. This
function is optional.
| Callback services | description |
|---|---|
| assert/lib error | to implement the management of the errors generated by the embedded LL ATON functions |
| NPU/MCU cache maintenance operations | to implement the NPU/MCU cache maintenance operations |
| LL_ATON_LIB_xxx | to implement the LL ATON LIB services to support the hybrid epochs |
Parameters
- nn_instance: A pointer to the neural network instance (NN_Instance_TypeDef).
- cbs: A pointer to a ll_aton_reloc_callback structure (see the ll_aton_reloc_network.h file).
Return Value
- The function returns an integer value. A return value of 0 typically indicates success, while a nonzero value indicates that an error occurred during the operation.
ll_aton_reloc_get_info()
int ll_aton_reloc_get_info(const uintptr_t file_ptr, ll_aton_reloc_info *rt);
Description
The ll_aton_reloc_get_info function is used to
obtain the main dimensioning information from the image of a runtime
loadable model. This information can include details such as the
size, memory requirements, and other relevant attributes of the
model. By providing a pointer to the model image and a reference to
an ll_aton_reloc_info structure, users can retrieve and
store the necessary information to properly configure and manage the
runtime loadable model.
This function is particularly useful for setting up the memory regions and ensuring that the model can be correctly loaded and executed within the available resources.
Parameters
- file_ptr: A uintptr_t value representing the pointer to the image of the runtime loadable model. This parameter specifies the location of the model image from which the information will be retrieved.
- rt: A pointer to an ll_aton_reloc_info structure. This parameter is used to store the retrieved dimensioning information of the runtime loadable model. The function fills this structure with the relevant details.
Return Value
- The function returns an integer value. A return value of 0 typically indicates success, while a nonzero value indicates that an error occurred during the operation.
ll_aton_reloc_info C-struct
typedef struct _ll_aton_reloc_info
{
const char *c_name; /* c-name of the model */
uint32_t variant; /* 32-b word to handle the reloc rt version,
the used ARM Embedded compiler,
Cortex-Mx (CPUID) and if the FPU is requested */
uint32_t code_sz; /* size of the code (header + txt + rodata + data + got + rel sections) */
uint32_t params_sz; /* size (in bytes) of the weights */
uint32_t acts_sz; /* minimum requested RAM size (in bytes) for the activations buffer */
uint32_t ext_ram_sz; /* requested external ram size for the activations (and params) */
uint32_t rt_ram_xip; /* minimum requested RAM size to install it, XIP mode */
uint32_t rt_ram_copy; /* minimum requested RAM size to install it, COPY mode */
const char *rt_version_desc; /* rt description */
uint32_t rt_version; /* rt version */
uint32_t rt_version_extra; /* rt version extra */
} ll_aton_reloc_info;
| member | description |
|---|---|
| c_name | Indicates the name of the model. |
| variant | Or-ed 32-bit value indicating the used Arm compiler, the CPUID of the Cortex®-M, and so on (see the ll_aton_reloc_network.h file). |
| code_sz | Size in bytes of all code/data sections representing the model: header+txt+rodata+data+got+rel sections. |
| params_sz | Total size (in bytes) of the params/weights section. |
| acts_sz | Total size (in bytes) of the activations. |
| ext_ram_sz | Requested size (in bytes) of the external RAM memory. |
| rt_ram_xip | Requested size (in bytes) of the read/write execution memory region (XIP mode). |
| rt_ram_copy | Requested size (in bytes) of the read/write execution memory region (COPY mode). |
| rt_version_desc | (debug info) String describing the used LL runtime version. |
| rt_version | LL runtime version: major << 24 \| minor << 16 \| sub << 8. |
| rt_version_extra | (debug info) Extra dev version value. |
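A short usage sketch, assuming printf is available, showing how these fields can be inspected before deciding how to install a flashed model:
/* Sketch: query and report the main dimensioning information of a flashed model. */
ll_aton_reloc_info rt;
if (ll_aton_reloc_get_info(file_ptr, &rt) == 0) {
  printf("model '%s': params=%u acts=%u ext_ram=%u xip=%u copy=%u (%s)\n",
         rt.c_name, (unsigned)rt.params_sz, (unsigned)rt.acts_sz, (unsigned)rt.ext_ram_sz,
         (unsigned)rt.rt_ram_xip, (unsigned)rt.rt_ram_copy, rt.rt_version_desc);
}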
ll_aton_reloc_get_mem_pool_desc()
ll_aton_reloc_mem_pool_desc *ll_aton_reloc_get_mem_pool_desc(const uintptr_t file_ptr, int index);
Description
The ll_aton_reloc_get_mem_pool_desc function allows users to obtain information about the parts of the memory pools used for a given model. By providing a pointer to the model image and an index, users can retrieve the necessary information through the returned ll_aton_reloc_mem_pool_desc structure.
Parameters
- file_ptr: A uintptr_t value representing the pointer to the image of the runtime loadable model. This parameter specifies the location of the model image from which the information will be retrieved.
- index: Index of the requested descriptor.
Return Value
- The function returns a reference to a ll_aton_reloc_mem_pool_desc object. If the specified index is out of range, the function may return NULL.
Example
A typical snippet of code to display memory pool C-descriptors.
ll_aton_reloc_mem_pool_desc *mem_c_desc;
int index = 0;
while ((mem_c_desc = ll_aton_reloc_get_mem_pool_desc((uintptr_t)bin, index)))
{
printf(" %d: flags=%x foff=%d dst=%x s=%d\n", index, mem_c_desc->flags,
mem_c_desc->foff, mem_c_desc->dst, mem_c_desc->size);
index++;
}
ll_aton_reloc_mem_pool_desc C-struct
typedef struct _ll_aton_reloc_mem_pool_desc
{
const char *name; /* name */
uint32_t flags; /* type definition: 32b:4x8b <type><data_type><reserved><id> */
uint32_t foff; /* offset in the binary file */
uint32_t dst; /* dst @ */
uint32_t size; /* real size */
} ll_aton_reloc_mem_pool_desc;
The AI_RELOC_MPOOL_GET_XXX(flags) macros (see the ll_aton_reloc_network.h file) can be used to know the attributes of the memory pool.
ll_aton_reloc_get_input/output_buffers_info()
const LL_Buffer_InfoTypeDef *ll_aton_reloc_get_input_buffers_info(const NN_Instance_TypeDef *nn_instance,
int32_t num);
const LL_Buffer_InfoTypeDef *ll_aton_reloc_get_output_buffers_info(const NN_Instance_TypeDef *nn_instance,
int32_t num);
Description
The ll_aton_reloc_get_input/output_buffers_info
function is used to obtain information about a specific input/output
buffer of a neural network instance. This can be useful for
understanding the structure and requirements of the input data for
the neural network. By providing the neural network instance and the
index of the desired input buffer, users can retrieve detailed
information about the buffer, such as its size, type, and memory
location.
Parameters
- nn_instance: A pointer to the neural network instance (NN_Instance_TypeDef). This parameter specifies the neural network instance for which the input/output buffer information is to be retrieved.
- num: An integer specifying the index of the input/output buffer whose description is to be retrieved. The index is zero-based, meaning that num = 0 refers to the first input buffer, num = 1 refers to the second input buffer, and so on.
Return Value
- The function returns a pointer to a LL_Buffer_InfoTypeDef structure, which contains the description of the specified input/output buffer. If the specified buffer index is out of range, the function may return NULL.
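As an illustration, a sketch enumerating the input buffers of an installed instance, relying on the NULL return value to stop the iteration (LL_Buffer_addr_start() and LL_Buffer_len() are the helpers already used in the "Minimal code" section; printf is used for illustration only):
/* Sketch: enumerate the input buffers of an installed model instance. */
const LL_Buffer_InfoTypeDef *buf;
int32_t idx = 0;
while ((buf = ll_aton_reloc_get_input_buffers_info(&nn_instance, idx)) != NULL) {
  printf("input #%d: addr=%p size=%u bytes\n", (int)idx,
         (void *)LL_Buffer_addr_start(buf), (unsigned)LL_Buffer_len(buf));
  idx++;
}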
ll_aton_reloc_set_input/output()
LL_ATON_User_IO_Result_t ll_aton_reloc_set_input(const NN_Instance_TypeDef *nn_instance, uint32_t num, void *buffer,
uint32_t size);
LL_ATON_User_IO_Result_t ll_aton_reloc_set_output(const NN_Instance_TypeDef *nn_instance, uint32_t num, void *buffer,
uint32_t size);
Description
Both ll_aton_reloc_set_input and
ll_aton_reloc_set_output functions are used to
configure the address of the input and output buffers for a neural
network instance, respectively. By providing the neural network
instance, buffer index, buffer pointer, and buffer size, users can
set up the necessary memory regions for input and output data.
Warning
These functions should only be used when the deployed model is generated with the '--no-inputs-allocation' and/or '--no-outputs-allocation' options, respectively.
Parameters
- nn_instance: A pointer to the neural network instance (NN_Instance_TypeDef). This parameter specifies the neural network instance for which the input/output buffer is to be set.
- num: An unsigned integer specifying the index of the input/output buffer to be set. The index is zero-based.
- buffer: A pointer to the buffer that will hold the input/output data. This parameter specifies the memory location where the data is stored.
- size: An unsigned integer specifying the size of the input/output buffer in bytes.
Return Value
- The function returns a value of type LL_ATON_User_IO_Result_t. This return value indicates the result of the operation, such as success or an error code.
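For example, a minimal sketch registering an application-allocated input buffer for index 0 (my_input_buffer and my_input_size are hypothetical, application-defined values; the returned LL_ATON_User_IO_Result_t value should be checked before running an inference):
/* Sketch: register an application-allocated input buffer
   (model generated with --no-inputs-allocation). */
LL_ATON_User_IO_Result_t io_res;
io_res = ll_aton_reloc_set_input(&nn_instance, 0, my_input_buffer, my_input_size);
/* check io_res before filling the buffer and running the inference */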
ll_aton_reloc_get_input/output()
void *ll_aton_reloc_get_input(const NN_Instance_TypeDef *nn_instance, uint32_t num);
void *ll_aton_reloc_get_output(const NN_Instance_TypeDef *nn_instance, uint32_t num);
Description
Both ll_aton_reloc_get_input and
ll_aton_reloc_get_output functions are used to retrieve
pointers to the input and output buffers for a neural network
instance, respectively. By providing the neural network instance and
buffer index, users can obtain direct access to the memory regions
used for input and output data.
Warning
These functions should only be used when the deployed model is generated with the '--no-inputs-allocation' and/or '--no-outputs-allocation' options, respectively.
Parameters
- nn_instance: A pointer to the neural network instance (NN_Instance_TypeDef). This parameter specifies the neural network instance for which the input/output buffer pointer is to be retrieved.
- num: An unsigned integer specifying the index of the input/output buffer to be retrieved. The index is zero-based.
Return Value
- The function returns a pointer to the input/output buffer. If the specified buffer index is out of range or an error occurs, the function may return NULL.
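A short usage sketch (assuming output index 0; the data type of the predictions is model-dependent, and user_post_process() is the application-defined helper used in the "Minimal code" section):
/* Sketch: retrieve the current output buffer pointer of the installed instance. */
void *out = ll_aton_reloc_get_output(&nn_instance, 0);
if (out != NULL) {
  user_post_process(out);  /* application-defined post-processing of the predictions */
}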
Use directly the Python scripts
Generating the relocatable binary model
The npu_driver.py script is the entry point for executing the steps required to generate the loadable model. The '--input/-i' option is used to specify the location of the generated "network.c" file and associated memory initializers (*.raw files). The default value is ./st_ai_output. The '--output/-o' option is used to specify the output folder (default: ./build).
[Details] npu_driver.py script
usage: npu_driver.py [-h] [--input STR] [--output STR] [--name STR] [--no-secure] [--no-dbg-info] [--ecblob-in-params]
[--split] [--llvm] [--st-clang] [--compatible-mode] [--custom [STR]] [--cross-compile STR]
[--gen-c-file] [--parse-only] [--no-clean] [--log [STR]] [--json [STR]] [--verbosity [{0,1,2}]]
[--debug] [--no-color]
NPU Utility - Relocatable model generator v1.4
optional arguments:
-h, --help show this help message and exit
--input STR, -i STR location of the generated c-files (or network.c file path)
--output STR, -o STR output directory
--name STR, -n STR basename of the generated c-files (default=<network-file-name>)
--no-secure generate binary model for non secure context
--no-dbg-info generate binary model without LL_ATON_EB_DBG_INFO
--ecblob-in-params place the EC blob in param section
--split generate a separate binary file for the params/weights
--llvm use LLVM compiler and libraries (default: GCC compiler is used)
--st-clang use ST CLANG compiler and libraries (default: GCC compiler is used)
--compatible-mode set the compible option (target dependent)
--custom [STR] config file for custom build (default: custom.json)
--cross-compile STR prefix of the ARM tool-chain (CROSS_COMPILE env variable can be used)
--gen-c-file generate c-file image (DEBUG PURPOSE)
--parse-only parsing only the generated c-files
--no-clean Don't clean the intermediate files
--log [STR] log file
--json [STR] Generate result file (json format)
--verbosity [{0,1,2}], -v [{0,1,2}]
set verbosity level
--debug Enable internal log (DEBUG PURPOSE)
--no-color Disable log color support
“--name/-n” option
The --name/-n option allows specifying/overwriting the expected c-name/file-name of the loadable runtime model. By default, the name of the generated network files is used.
Default behavior.
$ python npu_driver.py -i <gen-dir>/network.c
...
Generating files...
creating "build\network_rel.bin" (size=..)$ python npu_driver.py -i <gen-dir>/my_model.c
...
Generating files...
creating "build\network_rel.bin" (size=..)$ python npu_driver.py -i <gen-dir>/network.c -n toto
...
Generating files...
creating "build\toto_rel.bin" (size=..)