ST Neural-ART NPU concepts
for STM32 target, based on ST Edge AI Core Technology 2.2.0
r1.1
Introduction to Neural-ART accelerator™
Overview
The Neural-ART Accelerator™ is a branded family of design-time parametric and runtime reconfigurable neural processing unit (NPU) cores. It is specifically designed to accelerate the inference execution of a wide range of quantized convolutional neural network (CNN) models in area-constrained embedded and IoT devices. The STM32N6xx device embeds the first generation of NPU: Neural-ART 14, which includes four convolution accelerators (CAs, see the “Specialized hardware processing units” section).
The NPU subsystem is designed to optimize the performance of AI/ML models. Intensive computations are offloaded from the main CPU for efficient execution of neural network inferences, reducing latency and power consumption. It comprises the ST Neural-ART NPU and its associated memory subsystem, which stores working buffers, parameters/weights, and the “command stream,” whether privileged or not. Additionally, the system includes a HOST subsystem (based on an Arm Cortex®-M), which orchestrates the platform. The HOST subsystem manages the main application, middleware, and runtime/driver components to handle the NPU IP. It also serves as an additional processing resource, enabling the deployment of non-hardware-assisted operators, such as floating-point-based operators.
Moreover, the flexibility of the HOST subsystem to handle non-hardware assisted operators means it can manage tasks requiring floating-point precision or other specialized computing tasks not suited for the NPU. This dual capability enhances the versatility and applicability of the ST Neural-ART NPU subsystem in various AI/ML applications.
Note
For the STM32N6 target, the HOST subsystem is based on the AI-capable Arm Cortex®-M55 core featuring the M-Profile vector extension MVE (also referred to as Arm Helium™ technology). Combined with the NPU, it allows users to efficiently deploy advanced AI/ML applications.
Unique power domain
A single power domain (VDD core) is associated with all internal blocks used by the NPU subsystem. Multiple frequency domains enable fine-grained control of the clocks, allowing for various use-case-oriented implementations to address low-power scenarios. Each block, including embedded memories, NPU IP, and specialized hardware processing units, can be individually enabled and clocked.
Note that the NPU runtime software stack controls only the clocks for the specialized hardware processing units. The application must handle the setting of different clocks and frequency values.
Streaming-based architecture
The ST Neural-ART NPU is a reconfigurable and scalable inference engine. It implements a flexible data-flow streaming processing engine in hardware, with specialized hardware accelerators (also called processing units) that can be dynamically connected to each other at runtime.
The stream engine units function as intelligent half-DMAs, capable of reading and writing data to and from external memory. Once the NPU is configured and initiated, it autonomously handles memory-to-memory transfers. It fetches data from external memory to supply the connected processing units, and after computation, the processed data is pushed back to the external memory.
The inputs/sources can include activations/features and parameters (constant data in memory-mapped nonvolatile memory) representing weights. Once the data-stream processing pipes are configured and started, the settings remain immutable until end-of-transfer event notifications are received from all DMA outputs/sinks. This immutability ensures consistent and reliable data processing, preventing midtransfer changes that could lead to errors or data corruption.
The NPU’s autonomous handling of memory-to-memory transfers allows for efficient data management, freeing up the main processor for other tasks. This streaming-based design enhances overall system performance and reliability. It is ideal for applications requiring high-speed data processing and real-time inference.
Specialized hardware processing units
For the STM32N6xx device, the Neural-ART 14 configuration includes the following processing units to address a wide range of quantized convolutional neural network (CNN) models:
Processing unit | description |
---|---|
CONV_ACC (x4) | The main processing unit performing the convolution operations with up to 72 8x8 multiply-accumulate operations per cycle (or 18 16x16 MACC/cycle). This allows a theoretical peak processing of 72 x 4 x 2 (addition and multiplication) = 576 GOPS @ 1 GHz, or 600+ GOPS including the operations from the ACTIV/ARITH/POOL units. |
POOL_ACC (x2) | Performs the pooling operations such as local 2D windowed (NxN) min, max, or average pooling, as well as global min, max, or average pooling. |
ACTIV_ACC (x2) | Performs the activation functions associated with convolutional neural networks: Logistic, TanH, ReLU, PReLU, and so on. |
ARITH_ACC (x4) | Performs the arithmetic operations: element-wise addition/subtraction/multiplication or any other affine operation. |
STREAM_ENG (2x5) | Key unit (smart half-DMA engine) used to fetch/push data from/to the memory subsystem; five per master port. |
- All processing units are based on integer arithmetic with 8-bit, 16-bit, and 24-bit data path widths. They support both signed and unsigned formats.
- They are designed to support a fixed-point format, enhanced by shifters with rounding and saturation capabilities to adjust values. There is no hardware support for floating-point arithmetic (no floating-point unit).
- The absence of floating-point support means that the NPU relies solely on fixed-point arithmetic. This is generally more efficient in terms of power consumption and computational speed.
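As an illustration of this fixed-point scheme, the sketch below shows how a 32-bit accumulator produced by 8-bit multiply-accumulate operations can be brought back to an 8-bit output with an integer multiplier, a rounding shift, and a final saturation. The function and its parameters are illustrative only; they are not part of the NPU runtime API.

```c
#include <stdint.h>

/* Illustrative sketch only: rescale a 32-bit accumulator (sum of int8 x int8
   products) to an int8 output with an integer multiplier, a rounding right
   shift (shift > 0 assumed), and a final saturation - the kind of fixed-point
   adjustment performed by the hardware shifters. */
int8_t requantize_s8(int32_t acc, int32_t multiplier, int shift)
{
    int64_t v = (int64_t)acc * (int64_t)multiplier; /* fixed-point rescaling */
    v += (int64_t)1 << (shift - 1);                 /* rounding (half up)    */
    v >>= shift;                                    /* back to 8-bit range   */
    if (v > INT8_MAX) v = INT8_MAX;                 /* saturation            */
    if (v < INT8_MIN) v = INT8_MIN;
    return (int8_t)v;
}
```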
The following figure illustrates various typical data processing paths that can be implemented:
- The number of independent chains is limited by the available hardware resources: it cannot exceed half the number of available stream engines. On Neural-ART 14, a maximum of five chains is possible.
Additional specific units are also provided to integrate efficiently with the STM32N6 memory subsystem and to improve memory bandwidth and latency.
Specific unit | description |
---|---|
DEC_UNIT (x2) | Performs on-the-fly decompression of the compressed kernel weights (data are compressed offline). |
RECBUF_UNIT (x1) | Internal unit inserting a partial buffer between two processing units to avoid potential deadlocks or starvation. |
EPOCH_CTRL (x1) | A finite state machine able to decode simple microinstructions from a “command stream” (blob object) to execute a set of epochs (see “Epoch definition”). |
DEBUG & TRACE UNIT (x1) | Debug unit allowing the monitoring of various internal signals (a set of specific 32-bit event counters). |
NPU memory subsystem
The NPU memory subsystem plays a crucial role in this architecture by ensuring that data is readily available for processing, thus minimizing bottlenecks. The amount of memory required by the NPU depends on the complexity (number of layers, size of the input data) and on the target performance of the selected neural network model. To achieve peak performance, the system memory is organized with enough independent memory banks. These banks have separate AXI slave ports on the system interconnect to ensure maximum internal memory bandwidth and fully leverage AXI bus parallelism.
In addition to the on-chip memory, dedicated external memory interfaces are needed to provide access to nonvolatile memory. This nonvolatile memory hosts the model parameters (weights and biases) and additional RAM to accommodate larger neural networks that cannot fit entirely in internal (on-chip) memory.
The following figure shows the memory instance (yellow boxes) accessible and usable by the NPU IP. The HOST subsystem and NPU subsystem share the same physical 4 GB memory address space. No virtual address space or specific remapping/aliasing mechanism is defined.
- If the NPU cache is not enabled, the associated memory (256 KB) can be used as a normal memory.
- The D-TCM can also be used to share the data without potential cache maintenance operations.
Warning
No hardware mechanism is in place to ensure coherence between the NPU domain and the HOST domain when a memory region is marked as cached and shared. During write operations, data is written into the data caches and is not always automatically transferred to the destination memory. The lack of a hardware coherence mechanism means that software must ensure data consistency between the NPU and HOST domains. After write operations, it is necessary to ensure that all data is effectively available in the physical memory. During inference, the NPU runtime and NPU compiler take precautions to exclude such inconsistencies by calling, when requested, the NPU/MCU cache maintenance operations.
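For example, on the HOST side, the standard CMSIS cache maintenance functions of the Cortex-M55 can be used to clean the data cache after writing an input buffer and to invalidate it before reading the NPU results. This is a minimal sketch, not the runtime’s own maintenance code; the buffer names and sizes below are placeholders.

```c
#include "stm32n6xx.h"  /* CMSIS device header, brings in the Cortex-M55 core functions */

#define IN_SIZE  (224 * 224 * 3)   /* placeholder sizes, model dependent */
#define OUT_SIZE (1000)

static int8_t input_buffer[IN_SIZE];    /* assumed to be located in a memory */
static int8_t output_buffer[OUT_SIZE];  /* region shared with the NPU        */

void exchange_buffers_with_npu(void)
{
    /* ... fill input_buffer with the preprocessed data ... */

    /* Push the CPU-written data out of the M55 data cache so that the NPU
       fetches up-to-date values from the physical memory. */
    SCB_CleanDCache_by_Addr((void *)input_buffer, sizeof(input_buffer));

    /* ... start the inference and wait for its completion ... */

    /* Drop potentially stale cache lines before the CPU reads the results
       produced by the NPU. */
    SCB_InvalidateDCache_by_Addr((void *)output_buffer, sizeof(output_buffer));
}
```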
Processing-bound vs. memory-bound operation
The response time (latency) and available bandwidth of the NPU master interfaces are limiting factors and become critical for large neural networks. They are the deciding factors in the achievable frame rate, efficiency, and utilization of NPU computing units. The two master ports of the NPU connect to a high-speed local interconnect (NIC) that provides four 448 KB banks of fast memory (NPU RAMs) with privileged, high-throughput, low-latency, asynchronous access. The NIC also connects to the main network-on-chip (NoC) through an asynchronous bridge, which provides access to medium-throughput, medium-latency on-chip memories: 400 KB of FlexMem, and 624 KB and 1 MB of system SRAM. Additional low-throughput, high-latency external memories are accessed through the FMC and XSPI controllers.
Given the different frequency domains, the placement of the NPU buffers in the different memories is critical. The local memories close to the NPU IP (NPU RAMs and NPU cache RAM) are the privileged memories. They are used first to place the critical NPU buffers (activations and/or parameters); the other memories are then used according to the needs. For large models, the NPU subsystem can be considered memory-bound, that is, limited by memory accesses. The latency and throughput attributes are the main parameters used by the ST Neural-ART compiler to optimize the placement of the buffers.
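As an example of explicit placement on the application side, a user-owned scratch buffer can be pinned to one of the fast NPU RAMs through a dedicated linker section. The section name “.npu_ram” below is an assumption that must match the project linker script; buffers generated by the ST Neural-ART compiler are placed automatically through the memory-pool descriptors and do not need this.

```c
#include <stdint.h>

/* Hypothetical example: force an application buffer into a linker section
   mapped onto one of the 448 KB NPU RAMs. The section name ".npu_ram" is an
   assumption and must be defined in the project linker script. */
__attribute__((section(".npu_ram"), aligned(32)))
int8_t fast_scratch_buffer[64 * 1024];
```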
NPU cache
The NPU cache (also named AXICACHE IP) can only be used to support the NPU’s cacheable accesses to and from the external memories. Note that the HOST subsystem cannot use the NPU cache and consequently the host cannot access the NPU cacheable memory regions without NPU cache maintenance operations.
The entire buffer managed by a given stream engine unit is marked as cacheable or noncacheable. The ST Neural-ART compiler is responsible for setting the final property. However, to enable this feature, the associated memory pool describing the external memory must have the CACHEABLE_ON property (refer to the “Neural-ART compiler primer” article). If this property is not set, the noncacheable path is considered.
When an NPU buffer has the cacheable attribute (.cacheable = 1 C-field), only two policies are considered: allocate or no-allocate (.cache_allocate = [1, 0] C-field).
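Purely to illustrate these two C-fields, the fragment below mimics what a buffer descriptor could look like. The structure layout, type name, address, and size are simplified assumptions for illustration; they do not reproduce the actual generated code.

```c
#include <stdint.h>

/* Simplified, hypothetical view of a buffer descriptor: only the two
   cache-related C-fields discussed above are shown. */
typedef struct {
    uint32_t addr;           /* physical start address of the buffer       */
    uint32_t size;           /* buffer size in bytes                       */
    uint8_t  cacheable;      /* 1: accesses go through the NPU cache       */
    uint8_t  cache_allocate; /* 1: allocate policy, 0: no-allocate policy  */
} npu_buffer_desc_t;

/* Example: a weight buffer in external flash fetched through the NPU cache
   with the allocate policy (address and size are placeholders). */
static const npu_buffer_desc_t weights_desc = {
    .addr           = 0x70000000u,
    .size           = 512u * 1024u,
    .cacheable      = 1,
    .cache_allocate = 1,
};
```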
Warning
Note that when a model is generated with NPU cache support, the application is in charge of ensuring that the NPU cache IP is correctly configured.
Virtual memory pool
The buffers manipulated by the NPU should be physically contiguous and memory-mapped. For large neural networks, the allocation or mapping of large buffers can be a constraint, preventing optimal performance. To mitigate this constraint, the ST Neural-ART compiler implements multiple heuristics. The system architecture is also designed to provide up to 4 MB of contiguous system memory (2x1 MB + 4x448 KB + 256 KB). This memory can be seamlessly shared between the host subsystem and the NPU subsystem, while also offering the NPU privileged access to a meaningful high-speed portion of these 4 MB.
The compiler can create one or more virtual memory pools, which can group two or more of the contiguous memories accessible through the bus. This decision is frequency and protocol agnostic and only considers address ranges. A virtual memory pool can group as many memory banks as needed and can extend from system AXRAM1 to NPU RAM6.
- Accessible external memories are not contiguous with any on-chip memory; they cannot be part of a virtual memory pool.
Security considerations
The NPU is not TrustZone® aware. Therefore, multicontext and multitenancy are supported through the implementation of isolation compartments that are separated by CIDs. The AHB control interface is protected by a RISUP placed upstream. A RIMU is placed downstream of the AXI master interfaces to assign CID values, so that the NPU can only access the memories protected with that CID. Refer to the security chapter of the “RM0408 Reference Manual - STM32N647/657xx Arm®-based 32-bit MCUs” for further information.
Regarding encryption, for latency reasons and block granularity issues, the MCE engine is not considered suitable. A low overhead/latency encryption/decryption unit based on the Keccak-p[200] SHA-3 algorithm cipher, with a programmable number of rounds, is integrated into the NPU bus interfaces. It can be shared between different stream engines and supports both weights and activations decryption and encryption. All data to and from all accessible memories can be encrypted, particularly the external memories, which are more vulnerable to side-channel attacks than internal memories (traffic on the interface is accessible).
Programming model
Epoch definition
Generally, an entire model cannot fit on the available NPU hardware resources. It must be split into elementary subsets, named NPU epochs (or ‘epoch’), which fit the NPU’s available resources. The model is compiled offline, producing the settings for the different epochs needed to execute the whole model.
Four kinds of epochs are defined:
Name | description |
---|---|
HW epoch | designates a case where the operations related to a part of the model are fully mapped on the NPU HW resources. |
SW epoch | designates a case where the operation is delegated to the HOST. It is not hardware-accelerated. |
Hybrid epoch | designates a specific case where part of the operation is executed in software with the support of predefined HW epochs. |
Meta epoch | designates a set of HW epochs controlled by a command stream thanks to the epoch controller unit. |
Scheduling
As part of the NPU runtime software stack, a lightweight scheduler engine is responsible for executing the list of different epochs. Each epoch is considered an atomic operation, and its execution order is fixed to respect the data dependencies across the entire computational graph. For each type of epoch, three phases are defined. After an initialization phase (called ‘pre-op’), the data-stream processing pipes are enabled and started (‘hw-op’ phase). The scheduler engine then waits for end-of-transfer event notifications from all DMA outputs before performing the deinitialization phase (called ‘post-op’) of the used resources. Note that no internal NPU hardware state is preserved between epochs; only the external memory subsystem is used to store intermediate results. Consequently:
- If the ‘hw-op’ phase is suspended, the epoch cannot be restarted (the fetched memory can be overwritten with a partial result).
- A stop/resume mode is possible, but no context is saved.
The NPU runtime software stack schedules epochs as illustrated in the following figures. You can manage end-of-epoch completion either by polling or using interrupt mode. The offline ST Neural-ART tools generate one or more specialized files to initialize the NPU runtime software stack. These specialized files define the hardware configuration for each epoch and the execution order.
The MCU workload and the number of epochs depend on the deployed model. In a scenario where all operators are mapped to the NPU and scheduled in software, the MCU workload is approximately 10-15% of the inference time. As illustrated in the following figure, this ratio depends on the MCU frequency relative to the NPU frequency. To limit this overhead and offload the MCU, a hardware-assisted mode (based on the Epoch controller unit) can be used.
- For a SW epoch, the ‘pre-op’ and ‘hw-op’ phases are NULL; only the ‘post-op’ is executed to perform the software operation.
- Note that the caller executes all ‘pre-op’ and ‘post-op’ phases. This means that a higher-priority task can preempt the execution.
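The sketch below illustrates how an application typically drives this epoch-by-epoch scheduling in asynchronous (interrupt) mode. The LL_ATON_RT_* symbols follow the naming convention of the NPU runtime software stack, but the exact names and signatures shown here are assumptions to be checked against the generated project.

```c
#include "ll_aton_runtime.h"   /* header name of the NPU runtime stack (assumed) */

/* Network instance exported by the generated specialized C-files
   (the actual instance name depends on the generation options). */
extern NN_Instance_TypeDef NN_Instance_Default;

void run_inference(void)
{
    LL_ATON_RT_RuntimeInit();                       /* initialize the runtime once   */
    LL_ATON_RT_Init_Network(&NN_Instance_Default);  /* initialize the model instance */

    LL_ATON_RT_RetValues_t ret;
    do {
        /* Execute the next epoch: the scheduler runs the 'pre-op', 'hw-op'
           and 'post-op' phases of the current epoch descriptor. */
        ret = LL_ATON_RT_RunEpochBlock(&NN_Instance_Default);
        if (ret == LL_ATON_RT_WFE) {
            LL_ATON_OSAL_WFE();  /* sleep until the end-of-epoch interrupt */
        }
    } while (ret != LL_ATON_RT_DONE);

    LL_ATON_RT_DeInit_Network(&NN_Instance_Default);
    LL_ATON_RT_RuntimeDeInit();
}
```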
Epoch controller mode
The epoch controller unit is a finite state machine (FSM) capable of decoding simple microinstructions from a binary blob, known as the “command stream”. This command stream includes basic flow control instructions such as read, store, poll, step by step, and stop. Using these instructions, the hardware epoch controller can configure all processing units involved in model execution directly, without MCU support. This allows the MCU to be freed up for other tasks.
When enabled, the hardware epoch controller does not create new HW epochs. Instead, the ST Neural-ART NPU compiler merges or concatenates a set of HW epochs into a simple command stream (or blob), called a ‘metaepoch’. The NPU runtime software stack is always required to manage the epochs, particularly if a software operation is delegated to the HOST system. As illustrated in the following figure, a given HW epoch is configured (in the ‘pre-op’ phase) by the NPU itself (FSM) rather than by the MCU, improving inference time. However, the number of cycles required to execute a HW epoch (‘hw-op’ phase) remains unchanged. As the binary blob is fetched through the main interface, which provides a larger bandwidth than the control interface, the overall inference time is reduced. The interactions between the NPU and the MCU are based on interrupts, and for a given model, multiple binary blobs can be used.
Multiple models support
Multiple models can be supported within the same firmware, but only two modes are considered. The NPU runtime software stack does not use a dedicated thread to implement a server design for supporting the execution of multiple instances. Instead, it employs a lock mechanism (RTOS port) to ensure a thread-safe environment, ensuring secure access to the NPU hardware resources.
Serial mode
Inferences of each model are serialized: before executing the next model, the execution of the first model must be completely finished. In this case, the memory regions requested to store the activation buffers can be shared or overlapped, and the parameters are stored at different addresses. No specific service is provided by the NPU runtime software stack; the application (whether bare metal or RTOS-based) must control and implement this behavior. Note that for input buffers, the --no-inputs-allocation option allows filling the input buffer of one model during the execution of another.
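A minimal bare-metal sketch of this serial scheme, reusing the hedged API names of the previous example, could look as follows. The Model0/Model1 instance names are placeholders for the instances exported by the generated files of each model.

```c
#include "ll_aton_runtime.h"   /* header name of the NPU runtime stack (assumed) */

/* Instances exported by the generated files of each model (names are placeholders). */
extern NN_Instance_TypeDef NN_Instance_Model0;
extern NN_Instance_TypeDef NN_Instance_Model1;

/* Run a complete inference of one model before releasing the NPU. */
static void run_model(NN_Instance_TypeDef *inst)
{
    LL_ATON_RT_Init_Network(inst);
    LL_ATON_RT_RetValues_t ret;
    do {
        ret = LL_ATON_RT_RunEpochBlock(inst);
        if (ret == LL_ATON_RT_WFE) {
            LL_ATON_OSAL_WFE();
        }
    } while (ret != LL_ATON_RT_DONE);
    LL_ATON_RT_DeInit_Network(inst);
}

void serial_scheduling(void)
{
    LL_ATON_RT_RuntimeInit();
    for (;;) {
        run_model(&NN_Instance_Model0);  /* model 0 is fully executed first       */
        run_model(&NN_Instance_Model1);  /* then model 1; activations may overlap */
    }
}
```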
Epoch level mode
In an application with an RTOS running, multiple application threads can contain different ML workloads. Based on the semaphore/mutex mechanism, the NPU runtime software stack ensures that only one application thread can access the ST Neural-ART NPU at a time. The priority is implicitly determined by the caller thread’s priority. Since the system cannot preempt or suspend an epoch, the synchronization point to access the NPU hardware resources is the end of the currently executed epoch. At the end of each epoch, the NPU resource can be used for another model according to the priority of the caller. In this case, the memory regions required to store the activation buffers cannot be shared.
This mode helps limit latency in a multimodel environment where the execution of a specific model is critical. However, as shown in the figure above, this latency is inherently dependent on the other models. To control this latency, model 0 can be generated without the epoch hardware controller support if requested. This approach limits the time during which the NPU is used without notifying the host.
Zero setup time and latency considerations
The code allowing the execution of the deployed model is mainly generated offline (specialized C-files). It includes the setting of the epochs (in C-array form) and the placement of the associated buffers (fixed addresses). By default, at compile time, all references are resolved. Addresses of the buffers/params point directly to the memory-mapped memory regions, avoiding the need to copy that data. During the initialization phase (creation of the instance), there is no dynamic allocation or specific complex code required to set up a deployed model before performing the inference. Only a couple of pointers associated with a given instance are reset. If the parameter/weight buffers are placed in internal SRAM, additional setup time may be needed to copy the associated buffers before execution.
Software support
Ahead-Of-Time (AOT) flow
The proposed end-to-end flow to develop an AI/DL application using the NPU subsystem is based on the Ahead-Of-Time (AOT) flow. It can be divided into three parts:
The first part involves a collection of popular offline tools. They create, prepare, and optimize a DL model, including a quantization process to address resource-constrained targets.
The ST Neural-ART compiler, part of the ST Edge AI Core CLI tools, generates configuration files for execution on an NPU-based subsystem. It identifies operators in the model that the ST Neural-ART NPU can handle. If an operator is not supported, the delegate/fall-back mechanism calls an optimized function running on the host.
The third part is an NPU runtime software stack consisting of a range of software components. They run on the hardware target using the generated specialized C-files with minimal overhead.
The generated specialized files (epoch configurations, network weights, memory initializers) and a generic NPU runtime software stack are integrated into a classical embedded C project to generate the firmware image. No interpreter-based engine is embedded in the target to minimize and optimize the usage of hardware resources. All optimizations are done offline. This process allows the model to be efficiently mapped onto the NPU subsystem, including support for software and hybrid epochs.
Intermediate files
The ST Neural-ART compiler is integrated as a specific back-end in the ST Edge AI Core CLI. The import and export ONNX passes act as converters, transforming the original model (TFLite or ONNX QDQ format) into an internal representation. This internal representation includes a formal ONNX file and a JSON description file, providing metadata and quantization parameters for the tensors. These passes can perform necessary quantization scheme conversions and apply specific graph optimizations and transformations. For example, they can manage I/O data type modifications (refer to the article “How to change the I/O data type or layout (NHWC vs NCHW)”). It is important to note that the intermediate ONNX file is not a self-contained ONNX file; it requires the accompanying JSON file to be used.
Warning
To call the ST Neural-ART compiler through the ST Edge AI Core CLI, the ‘--st-neural-art [profile@conf.json]’ option is required. Otherwise, the imported model is optimized for the HOST subsystem (Arm Cortex®-M55 core).
Two-step usage
The ST Edge AI Core CLI provides an ‘export-onnx’ command to generate the specialized C-files in two steps.
Generate the intermediate files
$ stedgeai export-onnx --model <my_model>.[onnx|tflite] --target stm32n6 -o <output_dir> [--no-outputs/inputs-allocation ...]
Then call the NPU compiler directly
$ atonn --load-mpool my_mpool.mpool --onnx-input my_model_OE_2_3_2.onnx --json-quant-file my_model_OE_2_3_2_Q.json --cache-maintenance --Ocache-opt --mvei --native-float --enable-virtual-mem-pools --load-mdesc stm32n6
NPU compiler
The NPU compiler, or ST Neural-ART compiler, is a platform-agnostic compiler for the Neural-ART Accelerator™. It is used to compile, transform, and optimize a higher-level computational graph into an optimized lower-level “language” (C-code or command stream). Based on the ONNX libraries, the NPU compiler takes an ONNX model, applies various optimizations (generic and hardware-specific), and transforms it into a proprietary internal representation. It then performs low-level phases (graph scheduling, buffer allocation, etc.) to emit the C-code and memory initializers.
In addition to the model itself (ONNX and JSON files), the main entry points that drive the heuristics of the NPU compiler for performing different passes are:
- A machine description file (JSON format) defines the available processing units and the ST Neural-ART NPU configuration (Neural-ART 14). The user should not modify this fixed file.
- The memory-pool descriptor file (defined by the user) indicates the memory resources, their properties (clock ratio, byte width, latency, throughput, and so on), and which of them can be used to place the requested activations/params buffers.
- User options
Fallback and delegate mechanism
For hybrid models (that is, models that are not fully quantized), the execution of the operator is directly delegated to the host, and the optimized AI runtime library is called. Calls to the MCU/NPU cache maintenance operations are also generated before and after the operation, if necessary.
When an operator cannot be mapped to the NPU (due to an unsupported operator or configuration, see the “ST Neural-ART NPU - Supported operators and limitations” article), the following fallback mechanism is applied:
- When a DMA-based operation is not fully supported/optimized in hardware, a hybrid implementation is called (LL_ATON_LIB_XX functions).
- Use the optimized integer software implementation from the AI runtime library, if available.
- Use the optimized floating-point software implementation from the AI runtime library, if available. Dequantize and quantize operations are automatically inserted.
- Report an error if neither implementation is available.
Optimization objectives
The primary objective of the NPU compiler is to minimize inference time while balancing power consumption and memory peak usage. To achieve this, multiple heuristics are implemented to maximize the usage of NPU RAMs and leverage parallel accesses.
Tip
The use of the --enable-virtual-mem-pools option is important to indicate to the compiler that a large buffer can be placed across different contiguous memory pools. However, in some corner cases, the compiler may prioritize parallel accesses and place the large buffer in a less efficient memory pool. To override this behavior, it can be beneficial to create a simple memory pool that encompasses the four NPU RAMs, relaxing the constraint and prioritizing the placement of the buffer.
NPU runtime software stack
The NPU runtime software stack is a simple, configurable, and portable C-level codebase that supports various platforms through the OSAL/port layer. It is lightweight enough to run directly in bare-metal environments with no operating system, dynamic memory, or threads. A lightweight and efficient engine schedules the generated list of epoch descriptors, which are exported as specialized C-files. The stack provides a set of low-level functions for initializing the system and performing inference.
It is important to note that the NPU runtime software stack is not strictly a driver. The application is responsible for managing certain system resources, such as the clock, power domain, and memory, required by the NPU subsystem.
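For instance, before the first inference the application usually enables the NPU clock and the clocks of the RAM banks used by the deployed model. The HAL macro names below follow the usual __HAL_RCC_<periph>_CLK_ENABLE() pattern but are assumptions to be verified against the STM32N6 HAL and the reference getting-started projects.

```c
#include "stm32n6xx_hal.h"   /* STM32N6 HAL header (name assumed) */

void npu_platform_init(void)
{
    /* Enable the NPU clock (macro name assumed from the usual HAL pattern). */
    __HAL_RCC_NPU_CLK_ENABLE();

    /* Enable the clocks of the NPU RAM banks used by the deployed model
       (macro names assumed, one per RAM instance). */
    __HAL_RCC_AXISRAM3_MEM_CLK_ENABLE();
    __HAL_RCC_AXISRAM4_MEM_CLK_ENABLE();
    __HAL_RCC_AXISRAM5_MEM_CLK_ENABLE();
    __HAL_RCC_AXISRAM6_MEM_CLK_ENABLE();
}
```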
Minimal stack configuration for STM32N6xx device
-DLL_ATON_PLATFORM=LL_ATON_PLAT_STM32N6 -DLL_ATON_OSAL=LL_ATON_OSAL_BARE_METAL
LL_ATON_RT_MODE=LL_ATON_RT_ASYNC is defined by default; alternatively, LL_ATON_RT_POLLING can be used but is not recommended.
OSAL/port layer
Provides the minimal runtime functions (IRQ registration, MCU cache maintenance operations, and so on) to support a bare-metal environment. It also provides the adaptation code for a ThreadX (LL_ATON_OSAL == LL_ATON_OSAL_THREADX) or FreeRTOS (LL_ATON_OSAL == LL_ATON_OSAL_FREERTOS) environment.
LL driver
The LL (Low-Level) driver layer provides functions and C-types to configure and enable the NPU processing units. It also includes the entry point for the ‘hybrid’ epoch.
Hybrid or DMA-based operations:
- LL_ATON_LIB_Split()
- LL_ATON_LIB_Slice()
- LL_ATON_LIB_SW_SpaceToDepth()
- LL_ATON_LIB_SW_DepthToSpace()
- LL_ATON_LIB_Transpose()
- LL_ATON_LIB_Pad()
- LL_ATON_LIB_Cast...
Software library mapper
Optimized implementations of the C-kernels running on the host system are supported through a simple software library wrapper. The optimized AI network runtime library is the same used to deploy a model on the host.
Profiling/accuracy considerations
Why bit exactness is practically impossible
Variations in hardware, such as accumulator sizes and rounding schemes, as well as differences in software runtimes, can lead to significant discrepancies. These discrepancies are noticeable between different versions of TensorFlow/ONNX runtime and across operating systems (Linux/Windows). Additionally, in a quantized model and its optimized inference engine, computed confidence levels exhibit discrete steps rather than continuous values due to the quantization process. Internally, creating intermediate ONNX/JSON files can introduce approximations during the mapping of TFLite operators to ONNX operators. Furthermore, there are approximations in mapping operations to specialized and optimized hardware processing units.
However, similar performance is expected.
How to check it?
No NPU emulator or simulator running on the host machine is available to validate or test the generated specialized C-files; a physical board must be used. This can be done using the default validation flow to:
- Compare the performance (in terms of accuracy) of the deployed C-model vs. the original model. A set of standard task-agnostic metrics (RMSE, MAE, etc.) provides quick indicators. They compare the predictions generated by the original model on the host system with those produced by the deployed model on a physical board.
- Inject user or random input data, and save the outputs for postprocessing (including the intermediate results).
- Report the global inference time or the per-epoch times.
Alternatively, the ai_runner Python package can be used (refer to the “Getting started - How to evaluate a model on STM32N6 board” article), allowing a “real” validation process with the associated task-oriented metrics.
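As a reminder of what these quick indicators compute, the generic fragment below evaluates RMSE and MAE between a reference output (original model executed on the host) and the output produced by the deployed model. It is a self-contained sketch, independent of the validation tooling.

```c
#include <math.h>
#include <stddef.h>

typedef struct { double rmse; double mae; } quick_metrics_t;

/* Element-wise comparison of two prediction vectors of length n. */
quick_metrics_t compare_outputs(const float *ref, const float *tgt, size_t n)
{
    double sum_sq = 0.0, sum_abs = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)ref[i] - (double)tgt[i];
        sum_sq  += d * d;      /* accumulate squared error  */
        sum_abs += fabs(d);    /* accumulate absolute error */
    }
    quick_metrics_t m = { sqrt(sum_sq / (double)n), sum_abs / (double)n };
    return m;
}
```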