ST Neural-ART - How to deploy/manage the NPU memory initializers
for STM32 target, based on ST Edge AI Core Technology 3.0.0
r1.0
Introduction
The article discusses two primary end-to-end deployment flows for NPU (Neural Processing Unit) memory initializers. These initializers are essential for setting the initial content of memory blocks (e.g., RAM, ROM, flash) that the NPU requires to run inference tasks. They contain the initial data that must be loaded into memory regions before the NPU can execute inference. These memory blocks may include:
- Weights and biases
- Configuration parameters
- Lookup tables
- Other static data needed by the model
The ST Edge AI Core and NPU Compiler are the main tools that generate and export these memory initializers. They offer flexibility in how memory contents are exported, allowing adaptation to different workflows or tooling environments. Two deployment flows controlled through memory-pool descriptor files are considered:
- Based on an external loader
- Based on the application build system
External loader scenario
In this scenario, the memory initializers are exported as separate binary blobs or files. An external loader (e.g., bootloader or firmware component) is used to load the NPU memory initializers. The application, including the model and the AI runtime stack, and the memory initializers are deployed separately.
Note
Be aware that this scenario does not imply that the memory initializers are decoupled from the main application. The memory initializers are tightly coupled with the model layout embedded in the application. This means that the memory contents (weights, parameters, etc.) and the model structure must match exactly. Any mismatch between the model and the memory initializers causes runtime errors or incorrect inference. For true decoupling, the Runtime loadable model feature must be considered.
Application build system scenario
Memory initializers are generated as C objects, which can be directly imported and compiled by the application build process. A simple image is created embedding the entire AI runtime stack and the generated model. The memory contents are embedded into the application binary or linked as part of the firmware. The application itself handles memory initialization during startup.
This approach is suitable when:
- You want a single monolithic firmware image.
- You want a simplified deployment without external dependencies.
Memory pool descriptor attributes
When generating memory initializers for an NPU system, the memory pool descriptor contains key attributes that guide how memory buffers are positioned, addressed, and output to files.
| Attribute | Description |
|---|---|
| "mode" | Specifies whether the buffer address is absolute or relative to a C symbol. |
| "offset" | Specifies the memory start address (base address). Its interpretation depends on the mode. |
| "fformat" | Specifies the file format used when emitting memory contents to a file. |
“mode”
This attribute indicates how the base address is interpreted.
- 'USEMODE_ABSOLUTE': The buffer uses an absolute (fixed/hardcoded) address.
- 'USEMODE_RELATIVE': The buffer address is relative to a C symbol, resolved later at link time.
- 'USEMODE_AUTO': The interpretation depends on the offset value: if 'offset != 0', absolute mode is used; otherwise, relative mode is assumed.
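The 'USEMODE_AUTO' rule can be sketched in a few lines of C. This is only an illustration of the rule as stated above, not code from the NPU compiler: the `use_mode` type and the `resolve_use_mode` function are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the descriptor "mode" values (illustration only). */
typedef enum {
    USEMODE_ABSOLUTE,
    USEMODE_RELATIVE,
    USEMODE_AUTO
} use_mode;

/* Returns the mode that is effectively applied to a memory pool.
 * ABSOLUTE and RELATIVE are returned unchanged; AUTO is resolved from
 * the offset: non-zero means absolute, zero means relative. */
static use_mode resolve_use_mode(use_mode mode, uint64_t offset)
{
    assert(mode == USEMODE_ABSOLUTE || mode == USEMODE_RELATIVE ||
           mode == USEMODE_AUTO);

    if (mode != USEMODE_AUTO) {
        return mode;
    }
    return (offset != 0u) ? USEMODE_ABSOLUTE : USEMODE_RELATIVE;
}
```

With this sketch, `resolve_use_mode(USEMODE_AUTO, 0x20010000u)` yields `USEMODE_ABSOLUTE`, while `resolve_use_mode(USEMODE_AUTO, 0)` yields `USEMODE_RELATIVE`.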
“offset”
This attribute specifies the memory start address. Depending on
the 'mode', this value may or may not be used by the
NPU compiler when emitting code.
“fformat”
This attribute indicates the requested file format for the memory initializer. Three categories are considered:
- FORMAT_RAW: A raw binary file containing only data without any metadata.
- FORMAT_HEX, FORMAT_HEX16/32/64 and FORMAT_IHEX: Hexadecimal text formats that include the target memory start address (*).
- FORMAT_C: Generates a C source file with a C array containing the data. Useful for embedding data directly in firmware source code.
(*) In this article, the RAW and HEX/IHEX formats are considered equivalent.
Example use case
- To generate a memory initializer file for weights located at absolute address 0x20010000 in raw binary format, set:
  - mode = USEMODE_ABSOLUTE
  - offset = 0x20010000
  - fformat = FORMAT_RAW
- To link the weights relative to a C symbol (e.g., __weights_start), resolved at link time, and output them as a C array, set:
  - mode = USEMODE_RELATIVE
  - offset = 0 (or undefined)
  - fformat = FORMAT_C
Management of activation buffers
The required memory regions to handle activations are considered scratch buffers. These memory regions do not need to be filled with zeros before inference. The application must ensure the mapping of these regions by the neural processing unit (NPU) and microcontroller unit (MCU) subsystems during inference. Regarding the weights, the memory pool must be accessible by the MCU to support the software fallback mechanism. In most cases, defining the base addresses as absolute (hardcoded) is recommended to facilitate deployment. To avoid unpredictable situations, memory regions reserved for activations should be excluded from the application memory layout, such as the scatter memory file.
In scenarios where a simple NPU subsystem, such as a unique static random-access memory (SRAM), is shared with the MCU subsystem, a simple memory pool can be defined with a relative address resolved at linking time. In this case, the underlying region is treated as a non-initialized section by the C linker. In C linker terminology, this corresponds to a section like .bss or a custom section marked as NOLOAD. This approach enhances flexibility, making the memory layout more portable and easier to adapt without modifying hardcoded addresses.
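As an illustration of the relative case, a scratch region can be declared as a plain non-initialized C object and left to the linker to place. This is a minimal sketch under stated assumptions: the buffer name `npu_act_pool`, the size `NPU_ACT_POOL_SIZE`, the 64-byte alignment, and the `NPU_ACT_POOL_SECTION` hook are all hypothetical and are not generated by the tools.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pool size; the real value depends on the compiled model. */
#define NPU_ACT_POOL_SIZE (512u * 1024u)

/* Hypothetical hook: the build system may define it, for example as
 * __attribute__((section(".noinit"))), so that the linker script maps the
 * buffer to a NOLOAD output section. Empty by default (plain .bss). */
#if !defined(NPU_ACT_POOL_SECTION)
#define NPU_ACT_POOL_SECTION /* Empty */
#endif

/* Assumed alignment requirement for this example. */
static_assert((NPU_ACT_POOL_SIZE % 64u) == 0u,
              "pool size is expected to be a multiple of the alignment");

/* No initializer: the object occupies no space in the flash image. */
NPU_ACT_POOL_SECTION _Alignas(64) uint8_t npu_act_pool[NPU_ACT_POOL_SIZE];
```

Because the object has no initializer, moving it to another SRAM bank only requires a change in the linker script or scatter file, not in the source code.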
Placement of the memory initializers
Weights and/or activations
In most cases (the nominal or default case), a single memory initializer is generated containing the constant values representing the weights and/or parameters of the entire model. However, in some particular cases, for example, when only the internal NPU RAMs are defined, weights and activations can also be mixed in the same memory pool or spread over multiple memory pools. The placement fully depends on the memory pool descriptor file provided by the user and the underlying heuristics of the NPU compiler.
Weights fetched from flash
The memory initializer is stored in non-volatile memory (or flash). The associated region is memory-mapped into the address space of the NPU and the MCU. During inference, the weights are fetched directly from their storage region. The base address is fixed either at generation time or at link time. No additional runtime or initialization steps are needed since the weights are directly accessible.
To achieve this, the following typical memory pool descriptor is
used. The "prop" key is used to indicate
that the constants (weights/parameters of the model) should
preferably be placed in this RO memory region (with
"rights": "ACC_READ" and
"constants_preferred": "true").
Warning
Be aware that if the size of the memory pool is not sufficient, the NPU compiler can use other memory pools to place the parameters without specific notification.
{ ..
"prop": { "rights": "ACC_READ", "constants_preferred": "true", ..},
"size": { "value": "64", "magnitude": "MBYTES" },
.. }
The "mode" and "offset" attributes indicate how the base address is managed. It can be absolute
("mode": "USEMODE_ABSOLUTE") or relative
("mode": "USEMODE_RELATIVE").
{ ..
"prop": { "rights": "ACC_READ", "constants_preferred": "true", ..},
"offset": { "value": "<address_in_flash>", "magnitude": "BYTES" },
"size": { "value": "64", "magnitude": "MBYTES" },
"mode": "USEMODE_ABSOLUTE" // "USEMODE_RELATIVE"
.. }
In the first case, the "offset" key should be provided. Its value is used directly to configure the processing units of the NPU (hardcoded value), as illustrated by the following extract from the generated network.c file.
..
.cache_allocate = 0,
.addr_base = {(unsigned char *)(0x70000000UL) /* Equivalent hex address = <address_in_flash>UL */}, /* Conv2D_7_weights */
.offset_start = 1219712,
..
If "USEMODE_RELATIVE" is defined, a specific C label is generated to reference the base address of the weights/params. The "offset" value is not considered.
extern unsigned char _mem_pool_<pool_name>_Default[]; /* [1276737]; */
..
.cache_allocate = 0,
.addr_base = {(unsigned char *)ATON_LIB_VIRTUAL_TO_PHYSICAL_ADDR((uintptr_t)_mem_pool_<pool_name>_Default)}, /* Conv2D_7_weights */
.offset_start = 1197920,
..
To complete the memory pool descriptor, the
"fformat" attribute indicates the format of the
generated memory initializer file.
{ ..
"fformat": "FORMAT_RAW", // "FORMAT_C"
"prop": { "rights": "ACC_READ", "constants_preferred": "true", ..},
"offset": { "value": "<address_in_flash>", "magnitude": "BYTES" },
"size": { "value": "64", "magnitude": "MBYTES" },
"mode": "USEMODE_ABSOLUTE" // "USEMODE_RELATIVE"
.. }
The following table defines how to deploy the memory initializer
according to the "fformat" and "mode" values:
| conf. | “fformat” | “mode” | deployment process |
|---|---|---|---|
| 1 | RAW | ABSOLUTE | The raw memory initializer file must be placed at a fixed absolute address. An external loader is required to load it at the expected address (note 1,3). |
| 2 | RAW | RELATIVE | The raw memory initializer file is placed at an address resolved at linking time. An external loader is still required to deploy it at the resolved address (note 2). |
| 3 | C | ABSOLUTE | The memory initializer is generated as a C source file containing an array. The C linker MUST place this array at the fixed absolute address using scatter file directives or linker scripts. A single monolithic firmware image is generated (note 3). |
| 4 | C | RELATIVE | The C file containing the initializer array is placed at an address resolved at linking time by the linker. A single monolithic firmware image is generated. |
Note 1: Configuration 1 is the default setting used by the proposed STM32N6 evaluation environment. The loading and initialization of the memory initializers are fully managed by the "n6_loader" Python script (including the loading of the weights into the internal SRAM).
Note 2: Deployment can be facilitated if the symbol is placed at a fixed address by a linker directive.
Note 3: For absolute placement, no mechanism is in place to verify that the chosen address does not overlap with other critical application sections.
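The four configurations of the table can be summarized as a small decision helper. This is only a sketch of the table logic for documentation purposes; the types `init_format`, `base_mode`, `deploy_plan` and the function `plan_deployment` are hypothetical and are not part of the generated code or of the tools.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified views of the "fformat" and "mode" attributes. */
typedef enum { INIT_FORMAT_RAW, INIT_FORMAT_C } init_format;
typedef enum { BASE_MODE_ABSOLUTE, BASE_MODE_RELATIVE } base_mode;

typedef struct {
    int  conf;                     /* configuration number in the table   */
    bool needs_external_loader;    /* RAW file deployed by a loader       */
    bool linker_must_pin_address;  /* C array forced to a fixed address   */
    bool monolithic_image;         /* initializer is part of the firmware */
} deploy_plan;

/* Maps a ("fformat", "mode") pair to the deployment process of the table. */
static deploy_plan plan_deployment(init_format fmt, base_mode mode)
{
    deploy_plan plan;

    assert(fmt == INIT_FORMAT_RAW || fmt == INIT_FORMAT_C);
    assert(mode == BASE_MODE_ABSOLUTE || mode == BASE_MODE_RELATIVE);

    plan.conf = ((fmt == INIT_FORMAT_RAW) ? 1 : 3) +
                ((mode == BASE_MODE_RELATIVE) ? 1 : 0);
    plan.needs_external_loader = (fmt == INIT_FORMAT_RAW);
    plan.linker_must_pin_address =
        (fmt == INIT_FORMAT_C) && (mode == BASE_MODE_ABSOLUTE);
    plan.monolithic_image = (fmt == INIT_FORMAT_C);
    return plan;
}
```

For example, `plan_deployment(INIT_FORMAT_C, BASE_MODE_ABSOLUTE)` describes configuration 3: no external loader, but the linker script or scatter file must place the array at the fixed address.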
Typical generated C file containing the C array declared as const:
LL_ATON_ALIGNED(64) const uint64_t _mem_pool_<pool_name>_Default[159593] = {
0xFEF508EAFC98FD05ULL,0xF92742D0E51C39F5ULL,0x09220BE6F428F21CULL,0xD6DF0858DF1CE62BULL,
...
0x4A0E00FD20D7FA94ULL,0x35030B14B04748BEULL,0x0000000000000053ULL,
};
Weights fetched from SRAM
For small models or for performance reasons, if the SRAM region is large enough to hold both the weights and the activations, it can be beneficial to place the weights in a low-latency/high-throughput memory region. Lower power consumption can also be a criterion. To support this, the associated memory initializers must be copied from the storage location (e.g., flash) to the execution location before inference execution.
The default C startup code can be used to copy the weights from the flash to the SRAM if the weights are generated as a non-const C array. To do this, the associated memory pool can be defined as follows:
{ ..
"fformat": "FORMAT_C",
"prop": { "rights": "ACC_WRITE", "constants_preferred": "true", ..},
"offset": { "value": "<address_in_ram>", "magnitude": "BYTES" },
"size": { "value": "1", "magnitude": "MBYTES" },
"mode": "USEMODE_RELATIVE"
.. }
The "constants_preferred" attribute creates
a single preferred region to handle the weights. As the pool is
marked as RW ("ACC_WRITE" attribute), a non-const
C array is created, and the embedded target
C toolchain treats it as an initialized data section.
LL_ATON_ALIGNED(64) uint64_t _mem_pool_<pool_name>_Default[15593] = {
0xFEF508EAFC98FD05ULL,0xF92742D0E51C39F5ULL,0x09220BE6F428F21CULL,0xD6DF0858DF1CE62BULL,
...
0x4A0E00FD20D7FA94ULL,0x35030B14B04748BEULL,0x0000000000000053ULL,
};
Be aware that when the C startup code is used, the destination memory must be enabled and clocked before the copy operation (data section initialization) takes place. If this cannot be guaranteed, custom startup code is required to manage this sequence safely and to prevent memory access faults during the startup or post-init sequence. Alternatively, specific application code can be designed to perform these copy operations before running the inference.
Important
It is the application’s responsibility to implement the logic to copy weights to the execution memory region. Outside the Runtime loadable model support, no extra mechanism is generated to manage these initialization or copying operations.
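When the copy is handled by application code rather than by the C startup code, it can be as simple as the following sketch. The function name `npu_copy_initializer` and its return convention are assumptions made for this example; the load and execution addresses would typically come from linker symbols or from the memory pool descriptor, and any clock enabling or cache maintenance needed on the target is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies a memory initializer from its storage location (e.g., flash) to
 * its execution location (e.g., SRAM), then reads it back to verify it.
 * The two regions must not overlap.
 * Returns 0 on success, -1 on invalid arguments or verification failure. */
static int npu_copy_initializer(void *exec_addr, const void *load_addr,
                                size_t size)
{
    if (exec_addr == NULL || load_addr == NULL || size == 0u) {
        return -1;
    }

    /* On a real target, the destination memory must already be enabled
     * and clocked at this point. */
    memcpy(exec_addr, load_addr, size);

    /* A read-back comparison catches a destination that is not writable. */
    return (memcmp(exec_addr, load_addr, size) == 0) ? 0 : -1;
}
```

The application would call such a function once, before initializing the model and before the first inference.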
Weights/activations mixed in SRAM
This case is similar to the previous one, but no preferred
memory pool is defined to handle the weights
(the "constants_preferred" attribute is not used and all
memory pools have the "ACC_WRITE" attribute). It is the
application's responsibility to implement the logic to copy the
weights according to the "mode"/"fformat" used, before
initializing the model and executing the
inference.
Placement of the epoch controller blobs
The memory blobs that handle the different bitstreams are generated as standard constant C arrays. They are treated as standard read-only data (rodata) objects by the embedded target C toolchain. The application is responsible for ensuring that the associated memory regions are memory-mapped and accessible by the NPU subsystem. Note that if the application is loaded from flash into an executable memory region, and if the MCU cache is enabled, the MCU cache entries covering these memory regions must be cleaned and invalidated.
To place the blob memory sections in specific memory regions, the
ECBLOB_CONST_SECTION and/or
ECBLOB_RUNTIME_SECTION macros can be overridden by the build
system.
// network_ecblobs.h file
...
#if !defined(ECBLOB_CONST_SECTION)
#define ECBLOB_CONST_SECTION /* Empty */
#endif
#if !defined(ECBLOB_RUNTIME_SECTION)
#define ECBLOB_RUNTIME_SECTION /* Empty */
#endif
Tips and attention points
NPU cache consideration
For the NPU subsystem of the STM32N6, the NPU
cache can be used to accelerate access to external memories. If
the associated memory pool descriptors set the
"cacheable" attribute to "CACHEABLE_ON",
the end-to-end deployment MUST ensure that the
memory initializers are effectively placed in the external memories.
Otherwise, placement in internal memory or flash without proper
mapping can cause unpredictable inference
execution.
RELATIVE mode and virtual memory pool support
When physically contiguous memory regions are mapped to different
memory pools, the --enable-virtual-mem-pools option is
recommended. This option places large buffers across these pools and
optimizes memory placement to improve parallel
access performance. Consequently, for these memory pools,
"USEMODE_RELATIVE" cannot be used because a single
relative base address is not valid for buffers that span several pools.
RELATIVE mode and Epoch controller
Bitstreams generated for the epoch controller are constant
objects generated offline, so they cannot directly contain indirect or relative
addresses. With "USEMODE_RELATIVE", the
addresses of the referenced objects are only resolved/fixed during
the linking phase. To manage these non-constant values, a specific
mechanism is in place to update or patch the bitstreams at runtime
with the fixed/resolved addresses. To do this, extra SRAM regions
(bss section) are created to hold the copied and patched bitstreams.
This operation is done (see the generated network_ecblobs.h
file) during the initialization phase of the model. This operation
does not impact inference time, but the additional SRAM
size must be considered.
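The copy-and-patch principle can be sketched as follows. This is a conceptual illustration only: the real blob layout and the patching code belong to the generated files and are not described in this article, so the 32-bit word layout, the list of patch positions, the pool base value, and the function name `ecblob_copy_and_patch` are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies a constant blob into a writable SRAM buffer, then adds the
 * link-time-resolved pool base address to every word listed in patch_idx.
 * Each listed word is assumed to hold an offset relative to the pool base.
 * Returns 0 on success, -1 on invalid arguments. */
static int ecblob_copy_and_patch(uint32_t *dst, const uint32_t *src,
                                 size_t n_words, const size_t *patch_idx,
                                 size_t n_patches, uint32_t pool_base)
{
    size_t i;

    if (dst == NULL || src == NULL || (n_patches > 0u && patch_idx == NULL)) {
        return -1;
    }
    /* Validate all patch positions before touching the destination. */
    for (i = 0u; i < n_patches; ++i) {
        if (patch_idx[i] >= n_words) {
            return -1;
        }
    }

    memcpy(dst, src, n_words * sizeof(uint32_t));
    for (i = 0u; i < n_patches; ++i) {
        dst[patch_idx[i]] += pool_base;  /* relative offset -> address */
    }
    return 0;
}
```

The sketch also makes the memory cost visible: the destination buffer is as large as the constant blob it duplicates, which is why the additional SRAM size has to be budgeted.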
Memory pool properties
Using relative mode is inherently more flexible:
it relaxes the dependency between the generation phase and the
integration into the final firmware. However, be aware that in
non-homogeneous memory subsystems like the STM32N6,
the defined properties (such as "throughput",
"latency", "byteWidth", etc.) in the
memory pool descriptors are the key factors used by the NPU compiler
to optimize the placement of the buffers and guarantee
performance. Consequently, the application build system must ensure
that the placement of the different memory initializers matches the
memory pool descriptors used during compilation.