ST Neural-ART - How to deploy/manage the NPU memory initializers

for STM32 target, based on ST Edge AI Core Technology 4.0.0

r1.0

STM32 devices with Neural-ART accelerator™

Introduction

This article describes two end-to-end deployment flows for NPU (Neural Processing Unit) memory initializers. A memory initializer holds the initial content of a memory region (e.g., RAM, ROM, flash) that must be loaded before the NPU can execute an inference. Memory initializers may include:

Weights and biases
Configuration parameters
Lookup tables
Other static data needed by the model

The ST Edge AI Core and NPU Compiler are the main tools that generate and export these memory initializers. They offer flexibility in how memory contents are exported, allowing adaptation to different workflows or tooling environments. Two deployment flows controlled through memory-pool descriptor files are considered:

Based on an external loader
Based on the application build system

External loader scenario

In this scenario, the memory initializers are exported as separate binary blobs or files. An external loader (e.g., bootloader or firmware component) is used to load the NPU memory initializers. The application, including the model and the AI runtime stack, and the memory initializers are deployed separately.

NPU memory initializers with external Loader

Note

Be aware that this scenario does not decouple the memory initializers from the main application. The memory initializers remain tightly coupled with the model layout embedded in the application: the memory contents (weights, parameters, etc.) and the model structure must match exactly. Any mismatch between the model and the memory initializers causes runtime errors or incorrect inference. For true decoupling, consider the Runtime loadable model feature.

Application build system scenario

Memory initializers are generated as C objects, which can be directly imported and compiled by the application build process. A simple image is created embedding the entire AI runtime stack and the generated model. The memory contents are embedded into the application binary or linked as part of the firmware. The application itself handles memory initialization during startup.

NPU memory initializers through C-linker

This approach is suitable when:

You want a single monolithic firmware image.
You want a simplified deployment without external dependencies.

Memory pool descriptor attributes

When generating memory initializers for an NPU system, the memory pool descriptor contains key attributes that guide how memory buffers are positioned, addressed, and output to files.

Attribute	Description
`"mode"`	Specifies whether the buffer address is absolute or relative to a C symbol.
`"offset"`	Specifies the memory start address (base address). Its interpretation depends on the mode.
`"fformat"`	Specifies the file format used when emitting memory contents to a file.

“mode”

This attribute indicates how the base address is interpreted.

'USEMODE_ABSOLUTE': The buffer uses an absolute (fixed/hardcoded) address.
'USEMODE_RELATIVE': The buffer address is relative to a C symbol, resolved later at link time.
'USEMODE_AUTO': The interpretation depends on the offset value. If 'offset != 0', absolute mode is used; otherwise, relative mode is assumed.
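
The resolution rule for 'USEMODE_AUTO' can be illustrated with a minimal C sketch. The enum and the function name are hypothetical and exist only for this illustration; they are not part of the ST Edge AI Core or NPU compiler API.

```c
#include <stdint.h>

/* Hypothetical mirror of the descriptor "mode" values (names taken from the text). */
typedef enum {
    USEMODE_ABSOLUTE,
    USEMODE_RELATIVE,
    USEMODE_AUTO
} usemode_t;

/* Returns the effective mode for a pool.
 * Explicit modes are returned unchanged; AUTO becomes ABSOLUTE when the
 * offset is non-zero and RELATIVE when the offset is zero. */
static usemode_t resolve_usemode(usemode_t mode, uint64_t offset)
{
    if (mode != USEMODE_AUTO) {
        return mode;
    }
    return (offset != 0u) ? USEMODE_ABSOLUTE : USEMODE_RELATIVE;
}
```

A practical consequence of this rule is that a pool that really starts at address 0 cannot be expressed with 'USEMODE_AUTO'; an explicit 'USEMODE_ABSOLUTE' would be needed in that case.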

“offset”

This attribute specifies the memory start address. Depending on the 'mode', this value may or may not be used by the NPU compiler when emitting code.

“fformat”

This attribute indicates the requested file format for the memory initializer. Three categories are considered:

FORMAT_RAW: A raw binary file containing only data without any metadata.
FORMAT_HEX, FORMAT_HEX16/32/64 and FORMAT_IHEX: Hexadecimal text formats that include the target memory start address (*).
FORMAT_C: Generates a C source file with a C array containing the data. Useful for embedding data directly in firmware source code.

(*) In this article, the RAW and HEX/IHEX formats are considered equivalent.
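
To illustrate why a hexadecimal text format carries the target address while a raw file does not, the following sketch formats records in the standard Intel HEX layout. It assumes that FORMAT_IHEX refers to standard Intel HEX; the function name is hypothetical, and this is not the emitter used by the NPU compiler.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Formats one Intel HEX record ":LLAAAATT<data>CC" into 'out'.
 * 'out' must hold at least 12 + 2*len characters (terminator included).
 * Returns the number of characters written, or -1 on invalid arguments. */
static int ihex_record(char *out, size_t out_size, uint8_t type,
                       uint16_t addr, const uint8_t *data, uint8_t len)
{
    size_t needed = 12u + 2u * (size_t)len;
    if (out == NULL || out_size < needed || (len > 0u && data == NULL)) {
        return -1;
    }
    /* The checksum covers length, address, type, and data bytes. */
    unsigned sum = (unsigned)len + (unsigned)(addr >> 8) + (addr & 0xFFu) + type;
    int n = snprintf(out, out_size, ":%02X%04X%02X",
                     (unsigned)len, (unsigned)addr, (unsigned)type);
    for (uint8_t i = 0; i < len; i++) {
        sum += data[i];
        n += snprintf(out + n, out_size - (size_t)n, "%02X", (unsigned)data[i]);
    }
    /* Two's complement of the low byte of the sum. */
    n += snprintf(out + n, out_size - (size_t)n, "%02X", (0u - sum) & 0xFFu);
    return n;
}
```

With a record type of 0x04 (extended linear address) and the two data bytes 0x70, 0x00, the function produces ":0200000470008A", which tells a loader that the following data records target the 0x7000xxxx range. A raw binary has no such record, so the load address must be supplied separately to the external loader.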

Example use case

To generate a memory initializer file for weights located at absolute address 0x20010000 in raw binary format, set:
- mode = USEMODE_ABSOLUTE
- offset = 0x20010000
- fformat = FORMAT_RAW

To link the weights relative to a C symbol (e.g., __weights_start) that is resolved at link time, and to output them as a C array, set:
- mode = USEMODE_RELATIVE
- offset = 0 (or undefined)
- fformat = FORMAT_C

Management of activation buffers

The required memory regions to handle activations are considered scratch buffers. These memory regions do not need to be filled with zeros before inference. The application must ensure the mapping of these regions by the neural processing unit (NPU) and microcontroller unit (MCU) subsystems during inference. Regarding the weights, the memory pool must be accessible by the MCU to support the software fallback mechanism. In most cases, defining the base addresses as absolute (hardcoded) is recommended to facilitate deployment. To avoid unpredictable situations, memory regions reserved for activations should be excluded from the application memory layout, such as the scatter memory file.

In scenarios where a simple NPU subsystem, such as a unique static random-access memory (SRAM), is shared with the MCU subsystem, a simple memory pool can be defined with a relative address resolved at linking time. In this case, the underlying region is treated as a non-initialized section by the C linker. In C linker terminology, this corresponds to a section like .bss or a custom section marked as NOLOAD. This approach enhances flexibility, making the memory layout more portable and easier to adapt without modifying hardcoded addresses.
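
As a minimal sketch of this relative approach, the application could provide the buffer itself and let the linker place it in a non-initialized section. The pool name "activations", the section name ".npu_noinit", and the size are assumptions made for this illustration; the symbol must match the label expected by the generated code, and the linker script (or scatter file) must map the section to the shared SRAM and mark it as NOLOAD.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pool size; use the size declared in your memory pool descriptor. */
#define ACT_POOL_SIZE_BYTES (256u * 1024u)

#if defined(__GNUC__) && defined(__ELF__)
/* ".npu_noinit" is an assumed section name. The linker script must map it to
 * the shared SRAM and mark it NOLOAD so that no image data is emitted for it. */
#define NPU_NOINIT __attribute__((section(".npu_noinit")))
#else
#define NPU_NOINIT /* adapt to your toolchain */
#endif

/* Scratch buffer for a relative-mode pool named "activations" (assumed name).
 * Its content is undefined at startup, which is acceptable for activations. */
_Alignas(64) NPU_NOINIT unsigned char _mem_pool_activations_Default[ACT_POOL_SIZE_BYTES];
```

The 64-byte alignment is an assumption borrowed from the alignment of the generated weight arrays; check the alignment required by your configuration.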

Placement of the memory initializers

Weights and/or activations

In most cases (the nominal or default case), a unique memory initializer is generated containing constant values representing the weights and/or parameters of the entire model. However, in some particular cases, for example, when only the internal NPU RAMs are defined, weights and activations can also be mixed in the same memory pool or multiple memory pools. The placement fully depends on the memory pools descriptor file provided by the user and the underlying heuristics of the NPU compiler.

Weights fetched from flash

The memory initializer is stored in non-volatile memory (or flash). The associated region is memory-mapped into the address space of the NPU and the MCU. During inference, the weights are fetched directly from their storage region. The base address is absolute, fixed either at generation time or at link time. No additional runtime or initialization steps are needed since the weights are directly accessible.

To achieve this, the following typical memory pool descriptor is used. The "prop" key indicates that the constants (weights/parameters of the model) should preferably be placed in this read-only memory region (with "rights": "ACC_READ" and "constants_preferred": "true").

Warning

Be aware that if the size of the memory pool is not sufficient, the NPU compiler can use other memory pools to place the parameters without specific notification.

{ ..
    "prop": { "rights": "ACC_READ",  "constants_preferred": "true", ..},
    "size":   { "value": "64",         "magnitude": "MBYTES" },
.. }

The "mode" and "offset" attributes indicate how the base address is managed. It can be absolute ("mode": "USEMODE_ABSOLUTE") or relative ("mode": "USEMODE_RELATIVE").

{ ..
    "prop": { "rights": "ACC_READ",  "constants_preferred": "true", ..},
    "offset": { "value": "<address_in_flash>", "magnitude":  "BYTES" },
    "size":   { "value": "64",         "magnitude": "MBYTES" },
    "mode": "USEMODE_ABSOLUTE"  // "USEMODE_RELATIVE"
.. }

In the first case, the "offset" key should be provided. Its value is used directly (as a hardcoded value) to configure the processing units of the NPU, as illustrated by the following extract from the generated network.c file.
```
..
.cache_allocate = 0,
.addr_base = {(unsigned char *)(0x70000000UL) /* Equivalent hex address = <address_in_flash>UL */}, /* Conv2D_7_weights */
.offset_start = 1219712,
..
```

If "USEMODE_RELATIVE" is defined, a specific C-label is generated to reference the base address of the weights/params. The "offset" value is not considered.

extern unsigned char _mem_pool_<pool_name>_Default[]; /* [1276737]; */
..
.cache_allocate = 0,
.addr_base = {(unsigned char *)ATON_LIB_VIRTUAL_TO_PHYSICAL_ADDR((uintptr_t)_mem_pool_<pool_name>_Default)}, /* Conv2D_7_weights */
.offset_start = 1197920,
..

To complete the memory pool descriptor, the "fformat" attribute indicates the format of the generated memory initializer file.

{ ..
    "fformat": "FORMAT_RAW", // "FORMAT_C"
    "prop": { "rights": "ACC_READ",  "constants_preferred": "true", ..},
    "offset": { "value": "<address_in_flash>", "magnitude":  "BYTES" },
    "size":   { "value": "64",         "magnitude": "MBYTES" },
    "mode": "USEMODE_ABSOLUTE"  // "USEMODE_RELATIVE"
.. }

The following table defines how to deploy the memory initializer according to the "fformat" and "mode" values:

conf.	“fformat”	“mode”	deployment process
1	RAW	ABSOLUTE	The raw memory initializer file must be placed at a fixed absolute address. An external loader is required to load it at the expected address (note 1,3).
2	RAW	RELATIVE	The raw memory initializer file is placed at an address resolved at linking time. An external loader is still required to deploy it at the resolved address (note 2).
3	C	ABSOLUTE	The memory initializer is generated as a C source file containing an array. The C linker MUST place this array at the fixed absolute address using scatter file directives or linker scripts. A single monolithic firmware image is generated (note 3).
4	C	RELATIVE	The C file containing the initializer array is placed at an address resolved at linking time by the linker. A single monolithic firmware image is generated.

Note 1: Configuration 1 is the default setting used by the proposed STM32N6 evaluation environment. The loading and initialization of the memory initializers are fully managed by the "n6_loader" Python script (including the loading of the weights into the internal SRAM).

Note 2: Deployment can be facilitated if the symbol is placed at a fixed address by a linker directive.

Note 3: For absolute placement, no mechanism is in place to verify that the chosen address does not overlap with other critical application sections.
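
Because no overlap check exists for absolute placement, the application or its build scripts can add one. The following is a minimal sketch of such a check; the type and function names are hypothetical, and the region values have to be taken from the memory pool descriptor and from the application memory layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of a memory region: base address and size in bytes. */
typedef struct {
    uint64_t base;
    uint64_t size;
} mem_region_t;

/* Returns true when the two regions share at least one byte.
 * Regions are treated as half-open intervals [base, base + size). */
static bool regions_overlap(mem_region_t a, mem_region_t b)
{
    if (a.size == 0u || b.size == 0u) {
        return false; /* an empty region cannot overlap anything */
    }
    return (a.base < b.base + b.size) && (b.base < a.base + a.size);
}
```

For example, a 64-Mbyte pool at a hypothetical base address of 0x70000000 overlaps an application section starting at 0x70100000, but not one starting at 0x74000000.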

Typical generated C file containing the C array declared as const:

LL_ATON_ALIGNED(64) const uint64_t _mem_pool_<pool_name>_Default[159593] = {
  0xFEF508EAFC98FD05ULL,0xF92742D0E51C39F5ULL,0x09220BE6F428F21CULL,0xD6DF0858DF1CE62BULL,
  ...
  0x4A0E00FD20D7FA94ULL,0x35030B14B04748BEULL,0x0000000000000053ULL,
};

Weights fetched from SRAM

For small models or for performance reasons, if the SRAM region is large enough to hold both the weights and the activations, it can be advantageous to place the weights in a low-latency/high-throughput memory region. Lower power consumption can also be a criterion. To support this, the associated memory initializers must be copied from the storage location (e.g., flash) to the execution location before the inference is executed.

The default C-startup code can be used to copy the weights from the flash to the SRAM if the weights are generated as a non-const C-array. To do this, the associated memory pool can be defined as follows:

{ ..
    "fformat": "FORMAT_C",
    "prop": { "rights": "ACC_WRITE",  "constants_preferred": "true", ..},
    "offset": { "value": "<address_in_ram>", "magnitude":  "BYTES" },
    "size":   { "value": "1",         "magnitude": "MBYTES" },
    "mode": "USEMODE_RELATIVE"
.. }

The "constants_preferred" attribute creates a single preferred region to hold the weights. As the pool is marked as RW ("ACC_WRITE" attribute), a non-const C-array is generated, and the embedded target C-toolchain treats it as an initialized data section.

LL_ATON_ALIGNED(64) uint64_t _mem_pool_<pool_name>_Default[15593] = {
  0xFEF508EAFC98FD05ULL,0xF92742D0E51C39F5ULL,0x09220BE6F428F21CULL,0xD6DF0858DF1CE62BULL,
  ...
  0x4A0E00FD20D7FA94ULL,0x35030B14B04748BEULL,0x0000000000000053ULL,
};

Be aware that the C-startup code can only perform this copy (data section initialization) if the destination memory is already enabled and clocked. If it is not, custom startup code is required to sequence these steps safely and to prevent memory access faults during startup or post-init. Alternatively, specific application code can be designed to perform the copy before the inference is executed.
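
When the copy is handled by application code instead of the C-startup, the sequence can be sketched as follows. The function name, the error codes, and the enable hook are hypothetical; enabling the destination memory is device-specific and is only represented here by a callback.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical hook that enables (clocks/powers) the destination SRAM.
 * Returns 0 on success. */
typedef int (*mem_enable_fn)(void);

/* Copies a memory initializer from its storage location (e.g., flash)
 * to its execution location (e.g., SRAM) before the model is initialized.
 * Returns 0 on success, a negative value otherwise. */
static int npu_copy_weights(void *dst_sram, const void *src_flash,
                            size_t size, mem_enable_fn enable_dst)
{
    if (dst_sram == NULL || src_flash == NULL) {
        return -1;
    }
    /* The destination must be accessible before the first write. */
    if (enable_dst != NULL && enable_dst() != 0) {
        return -2;
    }
    memcpy(dst_sram, src_flash, size);
    /* Read back to detect a destination that did not retain the data. */
    return (memcmp(dst_sram, src_flash, size) == 0) ? 0 : -3;
}
```

On the target, if the MCU data cache is enabled, a cache clean of the destination range is typically also needed so that the NPU reads the copied data.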

Important

It is the application’s responsibility to implement the logic to copy weights to the execution memory region. Outside the Runtime loadable model support, no extra mechanism is generated to manage these initialization or copying operations.

Weights/activations mixed in SRAM

This case is similar to the previous one, but no preferred memory pool is defined to hold the weights (the "constants_preferred" attribute is not used and all memory pools have the "ACC_WRITE" attribute). It is the application’s responsibility to implement the logic that copies the weights, according to the "mode"/"fformat" used, before initializing the model and executing the inference.

Placement of the epoch controller blobs

The memory blobs that handle the different bitstreams are generated as standard constant C arrays. They are treated as standard read-only data (rodata) objects by the embedded target C toolchain. The application is responsible for ensuring that the associated memory regions are memory-mapped and accessible by the NPU subsystem. Note that if the application is loaded from flash into an executable memory region, and if the MCU cache is enabled, these associated memory regions must be invalidated and cleaned.

To place the blob memory sections in specific memory regions, the ECBLOB_CONST_SECTION and/or ECBLOB_RUNTIME_SECTION macros can be overridden by the build system.

// network_ecblobs.h file
...
#if !defined(ECBLOB_CONST_SECTION)
#define ECBLOB_CONST_SECTION /* Empty */
#endif
#if !defined(ECBLOB_RUNTIME_SECTION)
#define ECBLOB_RUNTIME_SECTION /* Empty */
#endif
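
A possible override is sketched below. It assumes that the generated header applies ECBLOB_CONST_SECTION as an attribute on the blob array declarations, which should be checked against the generated network_ecblobs.h file. The section name ".ecblob_rodata" and the array are hypothetical stand-ins.

```c
#include <stdint.h>

/* Assumed override, normally provided by the build system (for example with a
 * -D compiler option) before network_ecblobs.h is included.
 * ".ecblob_rodata" is a hypothetical section name that the linker script
 * must map to a region accessible by the NPU. */
#if defined(__GNUC__) && defined(__ELF__)
#define ECBLOB_CONST_SECTION __attribute__((section(".ecblob_rodata")))
#else
#define ECBLOB_CONST_SECTION /* adapt to your toolchain */
#endif

/* Stand-in for a generated blob; the real arrays come from the generated header. */
ECBLOB_CONST_SECTION static const uint64_t example_ecblob[2] = {
    0x0000000000000001ULL, 0x0000000000000002ULL
};
```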

Tips and attention points

NPU cache consideration

For an NPU subsystem from the STM32N6, the NPU cache can be used to accelerate access to external memories. If the associated memory pool descriptors set the "cacheable" attribute to "CACHEABLE_ON", the end-to-end deployment MUST ensure that the memory initializers are effectively placed in the external memories. Otherwise, placement in internal memory or flash without proper mapping can cause unpredictable inference execution.

RELATIVE mode and virtual memory pool support

When physically contiguous memory regions are mapped to different memory pools, the --enable-virtual-mem-pools option is recommended. This option places large buffers across these pools and optimizes memory placement to improve parallel access performance. Consequently, for these memory pools, "USEMODE_RELATIVE" cannot be used because a single relative base address is invalid.

RELATIVE mode and Epoch controller

Bitstreams generated for the epoch controller are constant objects generated offline; they cannot directly contain indirect or relative addresses. With "USEMODE_RELATIVE", the addresses of the referenced objects are only resolved during the linking phase. To manage these non-constant values, a specific mechanism updates (patches) the bitstreams at runtime with the resolved addresses. To do this, extra SRAM regions (bss section) are created to hold the copied and patched bitstreams. This operation is performed during the initialization phase of the model (see the generated network_ecblobs.h file). It does not impact inference time, but the additional SRAM size must be taken into account.
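
The copy-and-patch principle can be illustrated with a purely conceptual sketch. The relocation table layout, the 32-bit word format, and all names are assumptions made for this illustration; the actual mechanism and data layout are those of the generated code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical relocation entry: which word of the blob holds an address,
 * and the offset of the referenced object inside the memory pool. */
typedef struct {
    size_t   word_index;
    uint32_t pool_offset;
} blob_reloc_t;

/* Copies a constant blob into RAM, then writes the resolved addresses
 * (pool base + offset) into the words listed in the relocation table. */
static void blob_copy_and_patch(uint32_t *ram_copy, const uint32_t *const_blob,
                                size_t n_words, const blob_reloc_t *relocs,
                                size_t n_relocs, uint32_t pool_base)
{
    memcpy(ram_copy, const_blob, n_words * sizeof(uint32_t));
    for (size_t i = 0; i < n_relocs; i++) {
        if (relocs[i].word_index < n_words) { /* ignore out-of-range entries */
            ram_copy[relocs[i].word_index] = pool_base + relocs[i].pool_offset;
        }
    }
}
```

The sketch also shows where the extra SRAM cost comes from: each patched bitstream needs a RAM copy of the same size as the constant original.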

Memory pool properties

Using relative mode is inherently more flexible; it relaxes the dependency between the generation phase and the integration into the final firmware. However, be aware that in non-homogeneous memory subsystems like the STM32N6, the properties defined in the memory pool descriptors (such as "throughput", "latency", "byteWidth", etc.) are the key factors used by the NPU compiler to optimize the placement of the buffers and guarantee performance. Consequently, the application build system must ensure that the placement of the different memory initializers matches the memory pool descriptor files used during compilation.


Information in this document is provided solely in connection with ST products. The contents of this document are subject to change without prior notice. © Copyright STMicroelectronics 2025. All rights reserved. www.st.com