ST Edge AI Core - Command-line Interface
ST Edge AI Core Technology 2.2.0
r8.2
Overview
The 'stedgeai' application is a console utility. It provides a complete and unified Command Line Interface (CLI) for compiling a pretrained deep learning (DL) or machine learning (ML) model into an optimized library. This library can run on an ST device/target, enabling edge AI on microcontrollers (MCUs) with or without the ST Neural ART NPU, microprocessors (MPUs), and smart sensors. The CLI consists of three main commands: analyze, validate, and generate. Each command can be used independently of the others with the same set of common options (model files, compression factor, output directory…) and specific options. The supported-ops command lists the supported operators and associated constraints for a given deep learning framework.
Supported ST device/target
target | description |
---|---|
stm32[xx] | The STM32 family of 32-bit microcontrollers based on the Arm Cortex®-M processor. 'stm32xx' must be specified to select a specific STM32 series (see the “Supported STM32 series” section); otherwise 'stm32h7' is used. |
stellar-e | Series of MCUs based on the Arm Cortex®-M7 processor and tailored to the specific requirements of electrified vehicles, ensuring efficient actuation of power conversion and e-drive train applications. |
stellar-pg[xx] | Series of MCUs (P and G families), based on Arm Cortex®-R52+ and Arm Cortex®-M4 processors, tailored to their application domains to offer optimized and rational solutions for the needs of the next generation of vehicles. 'Stellar P', designed to meet the demands of integrating the next generation of drivetrains, electrification solutions, and domain-oriented systems, delivers a new level of real-time performance, safety, and determinism. 'Stellar G', addressing the key challenges of next-generation body integration and zone-oriented vehicle architectures, ensures performance, safety, and power efficiency combined with wide connectivity and high security. 'stellar-pg[xx]' must be specified to select the stellar-pg core architecture (-r52 or -m4, see the “Supported STELLAR-PG series” section); otherwise 'stellar-pg-r52' is used. |
ispu | A new generation of MEMS sensors featuring an embedded intelligent sensor processing unit (ISPU). |
mlc | MEMS sensors embedding the machine learning core (MLC). |
stm32mp | The STM32 family of general-purpose 32-bit microprocessors (MPUs) provides developers with greater design flexibility. They are based on single or dual Arm Cortex®-A cores, combined with a Cortex®-M core. 'stm32mpxx' must be specified to select a specific STM32MPU series (see the “Supported STM32MPU series” section). |
Synopsis
usage: stedgeai --model FILE --target stm32|stellar-e|stellar-pg|ispu|mlc [--type keras|onnx|tflite] [--name STR]
[--compression none|lossless|low|medium|high] [--allocate-inputs] [--allocate-outputs] [--no-inputs-allocation] [--no-outputs-allocation] [--input-memory-alignment INT]
[--output-memory-alignment INT] [--workspace DIR] [--output DIR]
[--split-weights] [--optimization OBJ] [--memory-pool FILE] [--no-onnx-optimizer]
[--use-onnx-simplifier] [--fix-parametric-shapes FIX_PARAMETRIC_SHAPES]
[--input-data-type float32|int8|uint8] [--output-data-type float32|int8|uint8]
[--inputs-ch-position chfirst|chlast] [--outputs-ch-position chfirst|chlast]
[--prefetch-compressed-weights] [--custom FILE] [--c-api st-ai|legacy]
[--cut-input-tensors CUT_INPUT_TENSORS] [--cut-output-tensors CUT_OUTPUT_TENSORS]
[--cut-input-layers CUT_INPUT_LAYERS] [--cut-output-layers CUT_OUTPUT_LAYERS]
[--allocate-activations] [--allocate-states] [--st-neural-art [ST_NEURAL_ART]] [--quantize [FILE]]
[--binary] [--dll] [--ihex] [--address ADDR] [--copy-weights-at ADDR] [--relocatable]
[--lib DIR] [--no-c-files] [--batch-size INT] [--mode host|target|host-io-only|target-io-only]
[--desc DESC] [--val-json FILE] [--valinput FILE [FILE ...]] [--valoutput FILE [FILE ...]]
[--range MIN MAX [MIN MAX ...]] [--full] [--io-only] [--save-csv] [--classifier]
[--no-check] [--no-exec-model] [--seed SEED] [--with-report] [--no-report]
[--no-workspace] [-h]
[--version] [--tools-version] [--verbosity [0|1|2|3]] [--quiet]
analyze|generate|validate|supported-ops
A short description of the options can be displayed with the following command:
$ stedgeai --help
...
To know the versions of the main Python modules which are used:
$ stedgeai --tools-version
stedgeai - ST Edge AI Core v2.2.0
- Python version : 3.9.13
- Numpy version : 1.26.4
- TF version : 2.18.0
- TF Keras version : 3.7.0
- ONNX version : 1.15.0
- ONNX RT version : 1.18.1
Options for a given target
The '--target' option can be specified to display the options which are available for a given target.
$ stedgeai --target mlc --help
usage: stedgeai [--target STR] [--output DIR] --device DEVICE [--script FILE]
[--json FILE] [--type {arff,ucf}] [--port COM] [--ucf FILE]
[--logs FILE | DIR] [--ignore-zero] [--tree FILE] [--arff FILE]
[--meta FILE] [--no-report] [--help] [--version] [--tools-version]
[--verbosity [{0,1}]]
generate|validate|analyze
ST Edge AI Core v1.0.0 (MLC 1.0.0)
...
Command workflow
For each command, the same preliminary steps are applied. A report (txt file) is systematically created and fully or partially displayed. Additional JSON files (dictionary based) are generated in the workspace directory; they can be parsed by external tools/scripts to retrieve the results. Note that they can also be used by a nonregression environment. The format of these files is out of the scope of this document.
<workspace-directory-path>\<name>_c_info.json
<output-directory-path>\<name>_<cmd_name>_report.txt
'analyze' workflow
- import the model
- map, render, and optimize internally the model
- log and display a report
'validate' workflow
- import the model
- map, render, and optimize internally the model
- execute the generated C-model (on the desktop or on the board)
- execute the original model using the original deep learning runtime framework for x86
- evaluate the metrics
- log and display a report
'generate' workflow
- import the model
- map, render, and optimize internally the model
- export the specialized C-files
- log and display a report
Enable AutoML pipeline for resource-constrained environment
The CLI can be integrated into an automatic or manual pipeline. It makes it possible to design a deployable and effective neural network architecture for a resource-constrained environment (that is, with low memory/computational resources and/or a critical power consumption budget). The main loop can be extended with a post-analyzing/validating step of the pretrained models. The candidates are checked against the end-user target constraints thanks to the respective analyze and validate commands.
- Checking the budgeted memory (ROM/RAM) can be done in the inner loop (topology selection/definition) before the time-consuming training (or retraining) process, to preconstrain the choices of the neural network architecture according to the memory budgets.
- Note that the “analyze” and “host validate” steps can be merged; the “analyze” information is also available in the “validate” reports.
Error handling
During the execution of a given command, if an error is raised after the parsing of the arguments, the stedgeai application returns -1 (else 0 is returned). A category and a short description prefix the error message.
category | description |
---|---|
CLI ERROR | specific CLI error |
LOAD ERROR | error during the load/import of the model or the connection with the board - OSError, IOError |
NOT IMPLEMENTED | expected feature is not implemented - NotImplementedError |
INTERRUPT | indicates that the execution of the command has been interrupted (CTRL-C or kill system signal) - KeyboardInterrupt, SystemExit |
TOOLS ERROR, INTERNAL ERROR | internal error - ImportError, RuntimeError, ValueError |
Note
Particular attention is paid to providing explicit and relevant short descriptions in the error messages. Unfortunately, this is not always the case; do not hesitate to contact the local support or to use the product ST Community channel/forum, “Edge AI”.
Example of specific error
$ stedgeai validate -m model....tflite --target ispu -t keras
E102(CliArgumentError): Wrong model files for 'keras'
Analyze command
Description
The 'analyze' command is the primary command to import, parse, and check an uploaded pretrained model. A detailed report provides the main metrics to know if the generated code can be deployed on the targeted device. It also includes the rendering information by layer or/and operator (see the “C-graph description” section). It provides the requested RT memory size to store the kernel and specific network binary objects. After completion, the user can be fully confident in the imported model in terms of supported layers/operators.
Examples
Analyze a model
$ stedgeai analyze -m <model_file_path> --target <target>
Analyze a Keras model saved in two separate files: model.json + model.hdf5
$ stedgeai analyze -m <model_file_path>.json -m <model_file_path>.hdf5 --target <target>
Analyze a 32b float model with compression request
$ stedgeai analyze -m <model_file_path> -c low --target <target>
Analyze a model with the input tensors placed in the activations buffer
$ stedgeai analyze -m <model_file_path> --allocate-inputs --target <target>
Common options
This section describes the common options for the analyze, validate, and generate commands. The specific options are described in the respective command section.
-m/--model FILE
Path of the model files (see the “Deep Learning (DL) framework detection” section). Note that the same -m argument should also be used to indicate the weights file if necessary - Mandatory
Details
Deep Learning (DL) framework detection
The extensions of the model files are used to identify the DL framework which should be used to import the model. If the autodetection is ambiguous, the '--type/-t' option should be used to define the correct framework.
DL framework | type (--type/-t) | file extension |
---|---|---|
Keras | keras | .h5 or .hdf5 and .json |
TensorFlow lite | tflite | .tflite |
ONNX | onnx | .onnx |
--target STR
Set the targeted device - Mandatory
-t/--type STR
Indicate the type of the original DL framework when the extension of the model files does not allow it to be inferred (see the “DL framework detection” section) - Optional
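For example, a possible invocation forcing the Keras importer when autodetection is ambiguous:
$ stedgeai analyze -m <model_file_path> -t keras --target stm32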
-w/--workspace DIR
Indicate a working/temporary directory for the intermediate/temporary files (default: "./st_ai_ws/" directory) - Optional
-o/--output DIR
Indicate the output directory for the generated C-files and report files (default: "./st_ai_output/" directory) - Optional
-n/--name STR
Indicate the C-name (C-string type) for the imported model. This name is used to prefix the names of the specialized NN C-files and the API functions. It is also used for the temporary files, allowing you to use the same workspace/output directories for different models (default: "network") - Optional
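For instance, to generate C-files prefixed with a custom name in a dedicated output directory (name and path illustrative):
$ stedgeai generate -m <model_file_path> --target stm32 -n my_network -o ./my_output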
-c/--compression STR
- supported target: stm32xx, stellar-e, stellar-pg[xx]
- unsupported target: stm32n6 with NPU, mlc, ispu, stm32mp
Indicate the expected compression level applied to the different operators. Supported values: none|lossless|low|medium|high (default: lossless) to apply the same level to all operators, or a simple JSON file to define a compression level per operator. - Optional
Details
During the optimization passes of the imported model, different compression levels can be applied. The underlying compression algorithms depend on the selected level and the configuration of the operator/layer itself.
level | description |
---|---|
none | no compression |
lossless | applied algorithms ensuring the accuracy (structural compression) |
low | applied algorithms trying to reduce the size of the parameters with a minimum of accuracy loss |
medium | more aggressive algorithms, the final accuracy loss can be more important |
high | extremely aggressive algorithms (not used) |
Supported compressions by operator
- Floating-point dense or fully connected layers: 'low/medium/high' enables the compression of the weights or/and bias. With 'low', a targeted compression factor of '4' is applied, while with 'medium/high' the targeted compression factor is '8'.
- ONNX-ML TreeEnsembleClassifier operator: with the 'none' level, no compression or optimization is applied. 'lossless' enables a first level of compression without loss of accuracy. 'low','medium','high' enable weight compression.
Warning
Only float32 (or float) values are supported by the code generator; this implies that, during the import of the operator, the float64 (or double) values are converted to float32.
Specify a compression level by operator
By default, the compression process tries to apply the same compression level globally to all eligible operators. If the global accuracy is too much impacted, or to force the compression, the user has the possibility to refine the expected compression level layer-by-layer.
A JSON file must be defined to indicate the compression level which must be applied to a given layer. The original name specifies the name of the layer.
{
  "layers": {
    "dense_2": {"factor": "high"}
  }
}
The option -c/--compression
can be used to pass the
configuration file.
$ stedgeai analyze -m <model_file> --target stm32 -c <conf_file>.json
Note
Be aware that a specific layer may not be compressed if the gain in weight size is not sufficient.
--no-inputs-allocation
- supported target: stm32xx, stm32n6 with NPU, stellar-e, stellar-pg[xx], ispu
- unsupported target: stm32mp, mlc
If defined, this flag indicates that no space is reserved in the “activations” buffer to store the input buffers. The application should allocate them separately in the user memory space and provide them to the execution engine before performing the inference. Refer to the “I/O buffers into activations buffer” section. - Optional
--no-outputs-allocation
- supported target: stm32xx, stm32n6 with NPU, stellar-e, stellar-pg[xx], ispu
- unsupported target: stm32mp, mlc
If defined, this flag indicates that no space is reserved in the “activations” buffer to store the output buffers. The application should allocate them separately in the user memory space and provide them to the execution engine before performing the inference. Refer to “I/O buffers into activations buffer” section. - Optional
--input-memory-alignment INT
- supported target: stm32xx, stellar-e, stellar-pg[xx], ispu
- unsupported target: stm32n6 with NPU, stm32mp, mlc
If defined, set the memory-alignment constraint in bytes (multiple of 2) to allocate the input buffer inside the “activations” buffer. By default, 4 bytes (or 8 bytes, depending on the system bus width) are used. Refer to the “I/O buffers into activations buffer” section. - Optional
--output-memory-alignment INT
- supported target: stm32xx, stellar-e, stellar-pg[xx], ispu
- unsupported target: stm32n6 with NPU, stm32mp, mlc
If defined, set the memory-alignment constraint in bytes (multiple of 2) to allocate the output buffer inside the “activations” buffer. By default, 4 bytes (or 8 bytes, depending on the system bus width) are used. Refer to the “I/O buffers into activations buffer” section. - Optional
--allocate-inputs
DEPRECATED (enabled by default) - If defined, this flag indicates that space is reserved in the “activations” buffer to store the input buffers. Otherwise, they should be allocated separately in the user memory space. Depending on the size of the input data, the “activations” buffer may be larger, but overall smaller than the sum of the separate activations buffer plus the input buffers. To retrieve the addresses of the associated input buffers, refer to the “I/O buffers into activations buffer” section. - Optional
--allocate-outputs
DEPRECATED (enabled by default) - If defined, this flag indicates that space is reserved in the “activations” buffer to store the output buffers. Otherwise, they should be allocated separately in the user memory space (refer to the “I/O buffers into activations buffer” section). - Optional
--memory-pool FILE
- supported target: stm32xx, stellar-e, stellar-pg[xx]
- unsupported target: stm32n6 with NPU, mlc, ispu, stm32mp
Indicate the file path of the memory pool descriptor file. It describes the memory regions, allowing the multiheap use case to be supported. - Optional
Details
Description of the memory pools
For advanced use cases, the user has the possibility to pass (through the '--memory-pool' option) a text file (JSON file format) specifying some properties of the targeted device. The “1.0” JSON version only allows providing the descriptions of the memory pools.
{
"version": "1.0",
"memory": {
"mempools": [
{
"name": "sram",
"size": "128KB",
"usable_size": "96KB"
}
]
}
}
key | description |
---|---|
"version" | version/format of the JSON file. Only the "1.0" value is supported - mandatory |
"memory" | key to describe the memory properties (dict) - optional |
"mempools" | key to describe the memory pools (list of dict) - optional |
Memory pool description ("mempools" item)
key | description |
---|---|
"name" | user name; if not defined, a generic name is generated: pool_{pos} - optional |
"size" | indicate the total size - mandatory |
"usable_size" | indicate the maximum size which can be used - if not defined, size is used - optional |
"address" | indicate the base address of the memory pool (ex. 0x2000000) - optional |
- a value is defined as a string; "B,KB,MB" can be used to indicate the size, and the "0x" prefix indicates a value in hexadecimal.
- no target device database is embedded in the CLI to check that the provided memory pool descriptors are valid.
- Note that if the "address" attribute is defined and if the associated memory pool is used, the value is used as-is to define the default address of the activations buffer in the generated code (see the generated <network>_data.c file).
A typical example of a JSON file indicating that two budgeted memory pools can be used to place the activations buffer. The "dtcm" memory pool is privileged to place the critical buffers.
{
"version": "1.0",
"memory": {
"mempools": [
{
"name": "dtcm",
"size": "128KB",
"usable_size": "64KB"
},
{
"name": "ram_d1",
"size": "512KB",
"usable_size": "256KB"
},
{
"name": "default"
}
]
}
}
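The descriptor file is then passed with the '--memory-pool' option, for example (file name illustrative):
$ stedgeai analyze -m <model_file_path> --target stm32h7 --memory-pool my_mempools.json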
--fix-parametric-shapes
- supported target: stm32xx, stm32n6 with NPU, stellar-e, stellar-pg[xx], ispu
- unsupported target: stm32mp, mlc
Set parametric dimensions in the shapes of the input tensors (default: parametric dimensions are set to 1) - Optional
Details
The accepted formats are:
format | description |
---|---|
List of tuples | A list of tuples specifying the input shape of each input tensor. The tensor order is the same as shown by Netron. Example: [(1,2,3),(1,3,4)] |
Dictionary (tensors) | A dictionary which associates an input tensor name with its value. Example: {'input1':(1,5,6),'input0':(1,4,5)} |
Dictionary (dimensions) | A dictionary which associates a dimension name with its value. Example: {'batch_size':1,'sequence_length':10,'sample_size':12} |
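For example, a possible invocation fixing the named dimensions of an ONNX model (dimension names illustrative):
$ stedgeai analyze -m <model_file_path>.onnx --target stm32 --fix-parametric-shapes "{'batch_size':1,'sequence_length':10}"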
--split-weights
- supported target: stm32xx, stellar-e, stellar-pg[xx], ispu
- unsupported target: stm32n6 with NPU, stm32mp, mlc
If defined, this flag indicates that one C-array is generated per weights/bias data tensor instead of a unique C-array (“weights” buffer) for the whole model (default: disabled) (refer to the “Split weights buffer” section) - Optional
-O/--optimization STR
- supported target: stm32xx, stellar-e, stellar-pg[xx]
- unsupported target: stm32n6 with NPU, mlc, ispu, stm32mp
Indicate the objective of the applied optimization passes - Optional
Details
Optimization objectives
The '-O/--optimization' option is used to indicate the objective of the optimization passes which are applied to deploy the c-model. Note that the accuracy/precision of the generated model is not impacted. By default (without the option), a trade-off (that is, balanced) is considered.
objective | description |
---|---|
time | apply the optimization passes to reduce the inference time (or latency). In this case, the size of the used RAM (activations buffer) can be impacted. |
ram | apply the optimization passes to reduce the RAM used for the activations. In this case, the inference time can be impacted. |
balanced | trade-off between the 'time' and 'ram' objectives. Reduces RAM usage while minimizing the impact on inference time. |
The following figure illustrates the usage of the optimization option. It is based on the 'Nucleo-H743ZI@480MHz' board with the small MLPerf Tiny quantized models from https://github.com/mlcommons/tiny/tree/master/benchmark/training.
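For example, to favor the inference time over the RAM usage:
$ stedgeai analyze -m <model_file_path> --target stm32 -O time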
--c-api STR
- supported target: stm32xx, stellar-e, stellar-pg[xx] (for ispu, only st-ai is supported)
- unsupported target: stm32n6 with NPU, mlc, stm32mp
Select the generated embedded c-api: 'legacy' or 'st-ai'. 'legacy' is the c-api supported by default for the STM32 and Stellar targets (refer to the “Embedded Inference Client API” and “Embedded Inference Client ST Edge AI API” articles for details) - Optional
Note that for the next release, the default value will be aligned to 'st-ai' for all targets.
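For example, to generate the C-files based on the ST Edge AI API:
$ stedgeai generate -m <model_file_path> --target stm32 --c-api st-ai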
--allocate-activations
- supported target: stm32xx, stellar-e, stellar-pg[xx], ispu
- unsupported target: stm32n6 with NPU, stm32mp, mlc
(Experimental) Supported only with the st-ai c-api, this option indicates that the runtime must allocate the memory buffers to store the activations. Otherwise, the application must provide them (default behavior) - Optional
--allocate-states
- supported target: stm32xx, stellar-e, stellar-pg[xx], ispu
- unsupported target: stm32n6 with NPU, stm32mp, mlc
(Experimental) Supported only with the st-ai c-api, this option indicates that the runtime must allocate the memory buffers to store the states. Otherwise, the application must provide them (default behavior) - Optional
--input-data-type
- supported target: stm32xx, stm32n6 with NPU, stellar-e, stellar-pg[xx], ispu
- unsupported target: stm32mp, mlc
For quantized models, indicates the expected input data types of the generated implementation. Multiple input definitions are supported: in_data_type_1,in_data_type_2,… If one data type is given, it is applied to all inputs (possible values: float32|int8|uint8) - Optional
Details
model type | supported options |
---|---|
Keras (float) | int8 and uint8 data types are not supported. float32 can be used but the original data type is unchanged. |
ONNX (float) | int8 and uint8 data types are not supported. float32 can be used but the original data type is unchanged. |
TFlite (float) | int8 and uint8 data types are not supported. float32 can be used but the original data type is unchanged. |
TFlite (quantized) | int8, uint8, and float32 are supported. According to the original data types, a converter is inserted. |
ONNX (quantized)* | int8, uint8, and float32 are supported. According to the original data types, a converter is inserted. |
(*) By default for this type of model, the original I/O data type (float32) is converted to the int8 data type to feed the int8 kernels directly, allowing the deployed ONNX QDQ models to be supported efficiently (see the “QDQ format deployment” section).
--output-data-type
- supported target: stm32xx, stm32n6 with NPU, stellar-e, stellar-pg[xx], ispu
- unsupported target: stm32mp, mlc
For quantized models, indicates the expected output data types of the generated implementation. Multiple output definitions are supported: out_data_type_1,out_data_type_2,… If one data type is given, it is applied to all outputs (possible values: float32|int8|uint8) - Optional
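For example, a possible invocation keeping float32 I/O for a quantized TFLite model:
$ stedgeai generate -m <quantized_model_file>.tflite --target stm32 --input-data-type float32 --output-data-type float32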
--cut-input-tensors
For TFLite models, a single tensor location or a comma-separated list of tensor locations by which to cut the input model should be specified. For ONNX models, a single tensor name or a comma-separated list of tensor names by which to cut the input model should be specified.
--cut-output-tensors
For TFLite models, a single tensor location or a comma-separated list of tensor locations by which to cut the input model should be specified. For ONNX models, a single tensor name or a comma-separated list of tensor names by which to cut the input model should be specified.
--cut-input-layers
For TFLite models, a single layer location or a comma-separated list of layer locations by which to cut the input model should be specified. For Keras models, a single layer index or a comma-separated list of layer indexes by which to cut the input model should be specified.
--cut-output-layers
For TFLite models, a single layer location or a comma-separated list of layer locations by which to cut the input model should be specified. For Keras models, a single layer index or a comma-separated list of layer indexes by which to cut the input model should be specified.
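For example, a possible invocation cutting an ONNX model after a given tensor (tensor name illustrative):
$ stedgeai analyze -m <model_file_path>.onnx --target stm32 --cut-output-tensors my_tensor_name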
--inputs-ch-position
- supported target: stm32xx, stm32n6 with NPU, stellar-e, stellar-pg[xx], ispu
- unsupported target: stm32mp, mlc
Indicate the expected NCHW (channel first) or NHWC (channel last) data layout for the inputs (refer to the “How to change the I/O data type or layout (NHWC vs NCHW)” article for more details). Note that this option is applied to all inputs if there are multiple inputs - possible values: chfirst|chlast - Optional
--outputs-ch-position
- supported target: stm32xx, stm32n6 with NPU, stellar-e, stellar-pg[xx], ispu
- unsupported target: stm32mp, mlc
Indicate the expected NCHW (channel first) or NHWC (channel last) data layout for the outputs (refer to the “How to change the I/O data type or layout (NHWC vs NCHW)” article for more details). Note that this option is applied to all outputs if there are multiple outputs - possible values: chfirst|chlast - Optional
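For example, to request a channel-last layout for the inputs of the generated model:
$ stedgeai generate -m <model_file_path> --target stm32 --inputs-ch-position chlast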
--no-onnx-optimizer
- supported target: stm32xx, stm32n6 with NPU, stellar-e, stellar-pg[xx], ispu
- unsupported target: stm32mp, mlc
Disable the ONNX optimizer pass before importing the ONNX model - Optional
--use-onnx-simplifier
- supported target: stm32xx, stm32n6 with NPU, stellar-e, stellar-pg[xx], ispu
- unsupported target: stm32mp, mlc
Enable the ONNX simplifier pass before importing the ONNX model (default: False) - Optional
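For example, to simplify an ONNX model before importing it:
$ stedgeai analyze -m <model_file_path>.onnx --target stm32 --use-onnx-simplifier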
-q/--quantize FILE
- supported target: stm32xx, stm32n6 with NPU, stellar-e, stellar-pg[xx], ispu
- unsupported target: stm32mp, mlc
Path of the configuration file (JSON file) to define the tensor format configuration.
--st-neural-art STR
- supported target: stm32n6 with ST Neural-ART NPU
- unsupported target: stm32xx, stellar-e, stellar-pg[xx], ispu, stm32mp, mlc
Set the selected profile (including the ST Neural-ART compiler options, refer to “ST Neural ART compiler primer” article) from a well-defined configuration file.
--custom FILE
- supported target: stm32xx, stm32n6 with NPU, stellar-e, stellar-pg[xx], ispu
- unsupported target: stm32mp, mlc
Path of the configuration file (JSON file) to support the custom layers (refer to “Keras Lambda/custom layer support” article) - Optional
-v/--verbosity {0,1,2,3}
Set the level of verbosity (or level of displayed information). Supported values: 0,1,2,3 (default: 1) - Optional
--quiet
Disable the display of the progress bar during the execution of the command - Optional
Out-of-the-box information
The first part of the log shows the used arguments and the main metrics of the C implementation.
$ stedgeai analyze -m ds_cnn.h5 --target stm32
..
Exec/report summary (analyze)
-------------------------------------------------------------------------------------------
model file : <model-path>\ds_cnn.h5
type : keras
c_name : network
compression : lossless
optimization : balanced
target/series : stm32h7
workspace dir : <workspace-directory-path>
output dir : <output-directory-path>
model_fmt : float
model_name : ds_cnn
model_hash : 0xb773f449281f9d970d5b982fb57db61f
params # : 40,140 items (156.80 KiB)
-------------------------------------------------------------------------------------------
input 1/1 : 'input_0', f32(1x49x10x1), 1.91 KBytes, user
output 1/1 : 'dense_1', f32(1x12), 48 Bytes, user
macc : 4,833,792
weights (ro) : 158,768 B (155.05 KiB) (1 segment) / -1,792(-1.1%) vs float model
activations (rw) : 55,552 B (54.25 KiB) (1 segment)
ram (total) : 57,560 B (56.21 KiB) = 55,552 + 1,960 + 48
-------------------------------------------------------------------------------------------
...
The initial subsection recalls the CLI arguments. Note that the
full raw command line is saved at the beginning of the generated
report file:
<output-directory-path>\network_<cmd>_report.txt
field | description |
---|---|
model file | reports the full path of the original model files (-m/--model). If multiple files, there is one line per file. |
type | reports the -t/--type value or the inferred DL framework type |
c_name | reports the expected C-name for the generated C-model (-n/--name) |
compression | reports the applied compression level (-c/--compression) |
optimization | reports the selected objective: balanced (default), ram or time (-O/--optimization) |
target/series | reports the selected target/series (--target) |
workspace dir | full path of the workspace directory (-w/--workspace) |
output dir | full path of the output directory (-o/--output) |
The second part shows the results of the importing and rendering stages.
field | description |
---|---|
model_fmt | designates the main format of the generated model: float, ss/sa, dqnn,.. |
model_name | designates the name of the provided model. This is generally the name of the model file. |
model_hash | provides the computed MD5 signature of the imported model files. |
input | indicates the name, the format, the shape, and the size in bytes of an input tensor. There is one line per input. The 'inputs (total)' field indicates the total size (in bytes) of the inputs. |
output | indicates the name, the format, the shape, and the size of an output tensor. There is one line per output. The 'outputs (total)' field indicates the total size (in bytes) of the outputs. |
params # | indicates the total number of parameters of the original model and its associated size in bytes. |
macc | indicates the whole computational complexity of the original model. The value is defined in MACC operations: Multiply ACCumulated operations, refer to “Computational complexity: MACC and cycles/MACC” |
weights (ro) | indicates the requested size (in bytes) for the generated constant RO parameters (weights and bias tensors). The size is 4-byte aligned. If the value is different from the original model files, the ratio is also reported (refer to the “Memory-related metrics” section) |
activations (rw) | indicates the requested size (in bytes) for the working RW memory buffer (also called activations buffer). It is mainly used as an internal heap for the activations and temporary results (refer to the “Memory-related metrics” section) |
ram (total) | indicates the requested total size (in bytes) for the RAM, including the input and output buffers. |
Note that when the --memory-pool option is passed, the next part, 'Memory-pools summary', summarizes the usage of the memory pools.
Memory-pools summary (activations/ domain)
--------------------------- ---- -------------------------- ---------
name id used buffer#
--------------------------- ---- -------------------------- ---------
sram 0 54.25 KiB (10.8%) 34
weights_array 1 155.05 KiB (15876800.0%) 35
input_0_output_array_pool 2 1.91 KiB (196000.0%) 1
dense_1_output_array_pool 3 48 B (4800.0%) 1
--------------------------- ---- -------------------------- ---------
Example of ‘input/output’ description
'input_0', f32(1x49x10x1), 1.91 KBytes, user
Indicates that the input_0 tensor has a size of 490 floating-point items (size in bytes = 490 x 4 B = 1,960 B = 1.91 KiB) with a (1x49x10x1) shape; the associated memory chunk will be provided by the application ('user' domain) (refer to the “I/O tensor description” section). On the contrary, in:
'input_0', f32(1x49x10x1), 1.91 KBytes, activations
The description is similar; however, thanks to the --allocate-inputs option, a specific region is reserved in the activations buffer for the input ('activations' domain).
Compressed floating-point model example
For a “compressed” floating-point model, the compression gain for the 'weights' size, here -72.9%, is the global difference between the 32b float model and the generated “compressed” C-model. Note that only the fully-connected or dense layers can be compressed.
$ stedgeai analyze -m dnn.h5 -c low --target stm32
...
compression : low
...
input 1/1 : 'input_0', f32(1x490), 1.91 KBytes, user
output 1/1 : 'dense_4', f32(1x12), 48 Bytes, user
macc : 114,816
weights (ro) : 123,792 B (120.89 KiB) (1 segment) / -333,024(-72.9%) vs float model
activations (rw) : 1,152 B (1.12 KiB) (1 segment)
ram (total) : 3,160 B (3.09 KiB) = 1,152 + 1,960 + 48
...
Quantized TFLite model example - integer format
The following report shows the case where a TensorFlow lite quantized model is imported and the inputs are placed in the activations buffer. Note that for each input (or output), the type/scale and zero-point value are reported. Additional information is displayed in the “IR graph description” section.
$ stedgeai analyze -m <quantized_model_file>.tflite --allocate-inputs --target stm32
...
input 1/1 : 'Reshape_1', int8(1x1960), 1.91 KBytes, QLinear(0.101715684,-128,int8), activations
output 1/1 : 'nl_3', int8(1x4), 4 Bytes, QLinear(0.003906250,-128,int8), user
macc : 336,072
weights (ro) : 16,688 B (16.30 KiB) (1 segment) / -49,920(-74.9%) vs float model
activations (rw) : 12,004 B (11.72 KiB) (1 segment) *
ram (total) : 12,008 B (11.73 KiB) = 12,004 + 0 + 4
(*) input buffers can be used from the activations buffer
...
IR graph description
The outlined “graph” section (table form) provides a summary of the topology of the network which is considered before the optimization, render, and generation stages. The 'id' column indicates the index of the operator from the original graph. It is generated by the importer. The described graph is an internal platform-independent representation (or IR) created during the import of the model. Only the training operators are ignored. Note that if no input operator is defined, an “input” layer is added, and the nonlinearity functions are unfused.
field | description |
---|---|
id | indicates the layer/operator index in the original model. |
layer (type) | designates the name and type of the operator. The name is inferred from the original name. In the case where a nonlinearity function is unfused, the new IR-node is created with the original name suffixed with '_nl' (see the next figure with the first layer) |
shape | indicates the output shape of the layer. Follows the “HWC” layout or channel-last representation (refer to the “I/O tensor” section) |
param/size | indicates the number of parameters and their sizes in bytes (4-byte aligned) |
macc | designates the complexity in multiply-accumulate operations, refer to “Computational complexity: MACC and cycles/MACC” |
connected to | designates the names of the incoming operators/layers |
The right side of the table ('c_*' columns) reports the generated C-objects after the optimization and rendering stages.
field | description |
---|---|
c_size | indicates the difference in bytes of the size for the implemented weights/params tensors. If nothing is indicated, the size is unchanged compared to the original size ('-/size' field) |
c_macc | indicates the difference in MACC. If nothing is displayed, the final complexity of the C-operator is comparable to the complexity of the original layer/operator ('macc' field). |
c_type | indicates the type of the c-operator. The value between square brackets is the index in the c-graph. The value between parentheses is the data type: '()' indicates a float32 type, '(i)' an integer type, '(c4, c8)' a compressed floating-point layer (the size also includes the associated dictionary). Multiple c-operators can be generated for an original operator. |
The footer summarizes the differences for the whole model, including the requested RAM size for the activations buffer and for the I/O tensors.
/c-model: macc=369,672/369,688 +16(+0.0%) weights=18,288/18,288
model=--/6,032 io=--/2,111 activations
In the case where the optimizer engine has folded or/and fused the IR nodes, the 'c_type' is empty.
The following figure is an example of an IR graph with a residual neural network. As for the multiple branches, no specific information is added; the 'connected to' column allows the connections to be known.
Warning
For a compressed or quantized model, the MACC values (by layer or globally) are unchanged since the number of operations is always the same. Only the associated number of CPU cycles per MACC changes, in particular for the quantized models.
Number of operations per c-layer
The number of operations per generated C-layer ('c_id') according to the type of data is provided. Together with the synthesis by operation type for the entire model, this information makes it possible to know the partitioning of the operations in relation to the data types.
Number of operations per c-layer
----------------------------------------------------------------------------------------------
c_id m_id name (type) #op (type)
----------------------------------------------------------------------------------------------
0 1 quant_conv2d_conv2d (conv2d_dqnn) 230,416 (smul_s8_s8)
1 3 quant_conv2d_1_conv2d (conv2d_dqnn) 1,843,200 (sxor_s1_s1)
...
14 25 quant_depthwise_conv2d_3_conv2d (conv2d_dqnn) 28,800 (sxor_s1_s1)
...
16 28 quant_conv2d_7_conv2d (conv2d_dqnn) 1,638,400 (sxor_s1_s1)
17 30 activation (nl) 6,400 (op_f32_f32)
18 32 conv2d_conv2d (conv2d) 76,812 (smul_f32_f32)
----------------------------------------------------------------------------------------------
total 10,067,228
Number of operation types
---------------------------------------------
smul_s8_s8 230,416 2.3%
sxor_s1_s1 9,740,800 96.8%
op_s1_s1 12,800 0.1%
op_f32_f32 6,400 0.1%
smul_f32_f32 76,812 0.8%
operation | description |
---|---|
smul_f32_f32 | floating-point macc-type operation |
smul_s8_s8 | 8-bit signed integer macc-type operation |
op_f32_f32 | floating-point operation (nonlinearity, elementwise op…) |
conv_s8_f32 | converter operation; s8 -> f32 |
sxor_s1_s1 | binary operation (~macc) |
Complexity report per layer
The last part of the report summarizes the relative network complexity in terms of MACC and associated ROM size by layer. Note that only the operators which contribute to the global 'c_macc' and 'c_rom' metrics are reported. 'c_id' indicates the index of the associated c-node.
Complexity report per layer - macc=18,752,688 weights=7,552 act=3,097,600 ram_io=602,184
---------------------------------------------------------------------------------------------------
id name c_macc c_rom c_id
---------------------------------------------------------------------------------------------------
1 separable_conv1 || 1.8% || 1.6% [0]
1 separable_conv1_conv2d ||| 3.2% |||| 3.4% [1]
2 depthwise_conv2d_1 ||||||||| 10.3% |||||||| 8.5% [2]
3 conv2d_1 |||||||||||||||| 17.6% |||||||||||||| 14.4% [3]
5 dw_conv_branch1 |||||||| 9.3% |||||||| 8.5% [7]
6 pw_branch1 |||||||||||||||| 17.6% |||||||||||||| 14.4% [8]
7 dw_conv_branch0 |||||||| 9.3% |||||||| 8.5% [6]
8 batch_normalization_1 || 2.1% || 1.7% [9]
9 separable_conv1_branch2 |||||||| 9.3% |||||||| 8.5% [4]
9 separable_conv1_branch2_conv2d ||||||||||||||| 16.5% |||||||||||||| 14.4% [5]
10 add_1 || 2.1% | 0.0% [10, 11]
11 global_average_pooling2d_1 | 1.0% | 0.0% [12]
12 dense_1 | 0.0% |||||||||||||||| 16.2% [13]
12 dense_1_nl | 0.0% | 0.0% [14]
C-graph description
An additional “Generated C-graph summary” section is included in the report (also displayed with the '-v 2' argument). It summarizes the main computational and associated elements (c-objects) used by the C-inference engine (runtime library). It is based on the c-structures generated inside the '<name>.c' file. A complete graphic representation is available through the UI (refer to [UM]).
The first part recalls the main structural elements: the c-name, the number of c-nodes, the number of C-arrays for the data storage of the associated tensors, and the names of the input and output I/O tensors.
Generated C-graph summary
---------------------------------------------------------------------------------------------------
model name : microspeech_01
c-name : network
c-node # : 5
c-array # : 11
activations size : 4352
weights size : 16688
macc : 336084
inputs : ['Reshape_1_output_array']
outputs : ['nl_2_fmt_output_array']
As illustrated in the following figure, the implemented c-graph (legacy API) can be considered a sequential graph, managed as a simple linked list. A fixed execution order is defined by the C-code optimizer according to two main criteria: data-path dependencies (or tensor dependencies) and the minimization of the RAM memory peak usage.
Each computational c-node is entirely defined by:
- operation type, parameters
- input tensors list: [I]
- optional weights/bias tensors list: [W]
- optional scratches tensors list: [S]
- outputs tensors list: [O]
C-Arrays table
The 'C-Arrays' table lists the objects handling the base address, size, and metadata of the data memory segments for the different tensors. For each item, the number of items and size in bytes ('item/size'), memory segment location ('mem-pool'), type ('c-type'), and short format description ('fmt') are reported.
C-Arrays (11)
---------------------------------------------------------------------------------------------------
c_id name (*_array) item/size mem-pool c-type fmt comment
---------------------------------------------------------------------------------------------------
0 conv2d_0_scratch0 352/352 activations uint8_t ua8
1 dense_1_bias 4/16 weights const int32_t ss32
2 dense_1_weights 16000/16000 weights const uint8_t ua8
3 conv2d_0_bias 8/32 weights const int32_t ss32
4 conv2d_0_weights 640/640 weights const uint8_t ua8
5 Reshape_1_output 1960/1960 user uint8_t ua8 /input
6 conv2d_0_output 4000/4000 activations uint8_t ua8
7 dense_1_output 4/4 activations uint8_t ua8
8 dense_1_fmt_output 4/16 activations float float
9 nl_2_output 4/16 activations float float
10 nl_2_fmt_output 4/4 user uint8_t ua8 /output
---------------------------------------------------------------------------------------------------
mem_pool | description |
---|---|
activations | part of the activations buffer |
weights | part of a ROM segment |
user | part of a memory segment owned by the user (client application level) |
fmt | format description |
---|---|
float | 32b float numbers |
s1/packed | binary format |
bool | boolean format |
c4/c8 | compressed 32b float numbers. The size includes the dictionary. |
s, u, ua, ss, sa | integer or/and quantized format (refer to the “Quantized models support” article). '/ch(n)' indicates that a per-channel scheme is used (else per-tensor). |
C-Layers table
The 'C-Layers' table lists the c-nodes. For each node, the c-name (name), type, macc, rom, and associated tensors (with the shape for the I/O tensors) are reported. An associated c-array can be found with its name (or array id).
C-Layers (5)
---------------------------------------------------------------------------------------------------
c_id name (*_layer) id type macc rom tensors shape (array id)
---------------------------------------------------------------------------------------------------
0 conv2d_0 0 conv2d 320008 672 I: Reshape_1_output [1, 49, 40, 1] (5)
                                           S: conv2d_0_scratch0
                                           W: conv2d_0_weights
                                           W: conv2d_0_bias
                                           O: conv2d_0_output [1, 25, 20, 8] (6)
---------------------------------------------------------------------------------------------------
1 dense_1 1 dense 16000 16016 I: conv2d_0_output [1, 1, 1, 4000] (6)
                                           W: dense_1_weights
                                           W: dense_1_bias
                                           O: dense_1_output [1, 1, 1, 4] (7)
---------------------------------------------------------------------------------------------------
2 dense_1_fmt 1 nl 8 0 I: dense_1_output [1, 1, 1, 4] (7)
                                           O: dense_1_fmt_output [1, 1, 1, 4] (8)
---------------------------------------------------------------------------------------------------
3 nl_2 2 nl 60 0 I: dense_1_fmt_output [1, 1, 1, 4] (8)
                                           O: nl_2_output [1, 1, 1, 4] (9)
---------------------------------------------------------------------------------------------------
4 nl_2_fmt 2 nl 8 0 I: nl_2_output [1, 1, 1, 4] (9)
                                           O: nl_2_fmt_output [1, 1, 1, 4] (10)
---------------------------------------------------------------------------------------------------
'id' designates the layer/operator index from the original model, allowing the link with the implemented node ('c_id') to be retrieved.
The following figure illustrates a quantized model where the softmax operator is implemented in float, requiring two converters to be inserted. Note that this is just an example; the softmax operator is fully supported in int8.
Runtime memory size
“Runtime” identifies all the involved kernel objects (software components) which are requested to execute the deployed c-model on a given device (also called the runtime AI-stack). To compute this information, the '--target' option is used to know the targeted device, and an embedded gcc-based compiler application should be available in the PATH.
The first part indicates the final contribution by module (generated c-file or library) and by type of memory segment. The 'RT total' line sums up the different contributors. 'lib (toolchain)' indicates the contribution of the used toolchain objects (typically including the low-level floating-point operations from the libm/libgcc libraries). The extra lines weights/activations/io recall the requested sizes for, respectively, the weights, the activations buffer, and the payload for the input/output tensors (refer to the “memory-related metrics” section from the “Evaluation report and metrics” article).
segment | description |
---|---|
text | size in bytes for the code |
rodata | size in bytes for the const data (usually stored in a nonvolatile memory device, FLASH type, except for ISPU) |
data | size in bytes for the initialized data (stored in a volatile memory device like embedded RAM; the initial values are stored in FLASH, except for ISPU) |
bss | size in bytes for the zero-initialized data (stored in RAM) |
$ stedgeai analyze -m <model_path> --target stm32h7 --c-api legacy
...
Requested memory size by section - "stm32h7" target
----------------------------- -------- -------- ------- --------
module text rodata data bss
----------------------------- -------- -------- ------- --------
NetworkRuntime910_CM7_GCC.a 19,100 0 0 0
network.o 482 213 1,520 116
network_data.o 48 16 88 0
lib (toolchain)* 104 0 0 0
----------------------------- -------- -------- ------- --------
RT total** 19,734 229 1,608 116
----------------------------- -------- -------- ------- --------
weights 0 16,688 0 0
activations 0 0 0 12,004
io 0 0 0 1,964
----------------------------- -------- -------- ------- --------
TOTAL 19,734 16,917 1,608 14,084
----------------------------- -------- -------- ------- --------
* toolchain objects (libm/libgcc*)
** RT AI runtime objects (kernels+infrastructure)
module | description |
---|---|
NetworkRuntime910_CM7_GCC.a | kernel objects implementing the requested operators |
network.o | specialized code/data to manage the c-model |
network_data.o | specialized code/data to manage the weight/activation buffers |
Note that the '<network>_params_data.o' file does not appear in the table because it contains only the values of the weights (c-array form), which are represented by the 'weights' extra line.
The last part summarizes the whole requested memory size per type of memory. It also illustrates the breakdown between the RT objects and the main dimensioning memory-related metrics of the deployed c-model (that is, ROM/RAM metrics).
Summary - "stm32h7" target
---------------------------------------------------
FLASH (ro) %* RAM (rw) %
---------------------------------------------------
RT total 21,571 56.4% 1,724 11.0%
---------------------------------------------------
TOTAL 38,259 15,692
---------------------------------------------------
* rt/total
ISPU example
The following log illustrates an example for the
'ispu'
target. In the final summary, as the firmware is
loaded in the internal RAM through a serial interface by a host
processor, the requested size to store the initialized value of the
.data
section is not considered.
$ stedgeai analyze -m <model_path> --target ispu --c-api stai
...
Requested memory size by section - "ispu" target
------------------- -------- -------- ------ --------
module text rodata data bss
------------------- -------- -------- ------ --------
network_runtime.a 10,970 0 4 0
network.o 1,968 80 0 0
lib (toolchain)* 1,844 428 0 0
------------------- -------- -------- ------ --------
RT total** 14,782 508 4 0
------------------- -------- -------- ------ --------
weights 0 16,688 0 0
activations 0 0 0 12,004
states 0 0 0 0
io 0 0 0 1,964
------------------- -------- -------- ------ --------
TOTAL 14,782 17,196 4 13,968
------------------- -------- -------- ------ --------
* toolchain objects (libm/libgcc*)
** RT AI runtime objects (kernels+infrastructure)
Summary - "ispu" target
----------------------------------------------------------
Code RAM (ro) %* Data RAM (rw) %
----------------------------------------------------------
RT total 15,290 47.8% 4 0.0%
----------------------------------------------------------
TOTAL 31,978 13,972
----------------------------------------------------------
* rt/total
Validate command
Description
The 'validate' command allows validating the generated/deployed model. Two modes (--mode option) are considered: host and target. The detailed descriptions of the used metrics are given in the “Evaluation report and metrics” article.
Validation on host
Option: '--mode host'
(Default)
The specialized NN generated c-files are compiled on the host and linked with a specific network-runtime library implementing the reference C-kernels, close to the target implementation.
Validation on target
Option: '--mode target -d <desc>'
This mode allows validating the deployed model on the associated board. Before executing the 'validate' command, the board should be flashed with a specific validation firmware including a specific COM stack and the deployed C-model. For each target, the way to deploy the model on the associated development board can be specific.
- How-to create an STM32 validation firmware
- How to evaluate a model deployed on the Neural ART accelerator™ on an STM32N6 board
- How-to create an ISPU validation firmware
- How-to create a STELLAR validation firmware
When the board is flashed and started, the same validation process is applied, only the execution of the deployed c-model is delegated to the target.
Examples
Minimal command to validate a 32b float model with the self-generated random input data (“Validation on desktop”).
$ stedgeai validate -m <model_f32p_file_path> --target stm32
Minimal command to validate a 32b float model on an STM32 target. Note that a complete profiling report including the execution time by layer is generated by default.
$ stedgeai validate -m <model_f32p_file_path> --mode target --target stm32
Validation of a 32b float model with a compression factor (“Validation on desktop”)
$ stedgeai validate -m <model_f32p_file_path> -c medium --target stm32
Validate a model with a custom dataset
$ stedgeai validate -m <model_file_path> -vi test_data.csv --target stm32
Validate a model with only 20 randomly selected samples from a large custom dataset
$ stedgeai validate -m <model_file_path> -vi test_data.csv -b 20 --target stm32
Specific options
--mode
Indicates the mode of validation - Optional
mode | description |
---|---|
'host' | default value - performs a validation on the host. |
'target' | performs a validation on the target. |
'host-io-only' | alias equivalent to '--mode host --io-only' - deprecated - default behavior |
'target-io-only' | alias equivalent to '--mode target --io-only' |
--val-json
Indicates to the tool to use the user JSON file to perform the validation on target, bypassing all the unnecessary generation process to perform the validation faster (refer to the “c_info.json code-generation output report” section) - Optional
-vi/--valinput
Indicates the custom test dataset which must be used. If not defined, an internal self-generated random dataset is used (refer to the “Input validation files” section) - Optional
-vo/--valoutput
Indicates the expected custom output values. If the data are already provided in a simple file ('*.npz') through the '-vi' option, this argument is skipped - Optional
-b/--batches
Indicates how many random data samples are generated (default: '10') or how many custom test data samples are used (default: all) - Optional
-d/--desc
Describes the protocol and associated parameters to communicate with the deployed c-model. Syntax: '<driver>[:parameters]'. This option is required if --mode target is specified. It describes the COM port which is used to communicate with a target board (see the “Serial COM port configuration” section) - Optional
--full
- supported target: stm32xx, stellar-e, stellar-pg[xx]
- unsupported target: ispu, stm32n6 with NPU, stm32mp, mlc
DEPRECATED - Apply an extended validation process to report the L2r error layer-by-layer (only supported for floating-point Keras models, experimental for the other models). Otherwise, the L2r is evaluated only on the last or output layers. - Optional
Note that this option will be removed in the next release.
--io-only
Force the execution of the deployed model without instrumentation to retrieve the intermediate data (alias to the 'host-io-only' and 'target-io-only' modes) - Optional
--classifier
Consider the provided model as a classifier. This implies that the 'CM' and 'ACC' metrics are computed; otherwise, an autodetection mechanism is used to evaluate whether the model is a classifier or not. - Optional
--no-check
Combined with the 'target' mode, reduces, for debug purposes, the full preliminary check-list making sure that the flashed target C-model has been generated with the same tools and options. Only the c-name and the network I/O shape/format are checked. - Optional
--no-exec-model
Do not execute the original model on the host with a deep learning framework runtime. Only the generated c-model is executed (see “Evaluation report and metrics” article)- Optional
--range
Indicates the min and max values (in float) for the generated random data; the default is '[0.0, 1.0['. To generate the data randomly and uniformly between '-1.0' and '1.0', the following parameters should be passed: '--range -1 1' (refer to the “Random data generation” section) - Optional
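For example, to validate with random input data drawn between -1.0 and 1.0:
$ stedgeai validate -m <model_file_path> --target stm32 --range -1 1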
--seed
Define the seed which is used to initialize the pseudorandom number generator for the random data generation. Otherwise, a fixed seed is used - Optional
--save-csv
Save the whole data in the respective '*.csv' files. By default, for performance reasons, only a limited part is saved. - Optional
For the 'ispu' target, an additional option is defined to specify the file needed to load the ISPU program (see the “Validate command extension” section of the ISPU-specific documentation).
At the end of the process, results are summarized in a simple table (see “Evaluation report and metrics” for a detailed description of the results).
Evaluation report (summary)
----------------------------------------------------------------------------------------------------------
Mode acc rmse mae l2r tensor
----------------------------------------------------------------------------------------------------------
x86 C-model #1 92.68% 0.053623 0.005785 0.340042 dense_4_nl [ai_float, [(1, 1, 36)], m_id=[10]]
original model #1 92.68% 0.053623 0.005785 0.340042 dense_4_nl [ai_float, [(1, 1, 36)], m_id=[10]]
X-cross #1 100.00% 0.000000 0.000000 0.000000 dense_4_nl [ai_float, [(1, 1, 36)], m_id=[10]]
----------------------------------------------------------------------------------------------------------
Serial COM port configuration
The '-d/--desc' option should be used to indicate how to configure the serial COM driver to access the board. By default, an autodetection mechanism is applied to discover a connected board at 115200 bauds (default value), or 921600 for ISPU.
Set the baud rate to 921600:
$ stedgeai validate -m <model_file_path> --target stm32 --mode target -d serial:921600
Set the COM port to COM16 (Windows case) or /dev/ttyACM0 (Linux case):
$ stedgeai validate -m <model_file_path> --target stm32 --mode target -d serial:COM16
$ stedgeai validate -m <model_file_path> --target stm32 --mode target -d serial:/dev/ttyACM0
Set the COM port to COM16 and the baud rate to 921600:
$ stedgeai validate -m <model_file_path> --target stm32 --mode target -d serial:COM16:921600
Extended complexity report per layer
If the '-v 2' option is used, the “Complexity report per layer” table is extended with a specific column to report the metric according to the data type: 'l2r' for the floating-point models and 'rmse' for the integer or quantized models.
$ stedgeai validate -m <model_f32p_file_path> --target stm32 -v 2
...
Complexity report per layer - macc=4,013 weights=15,560 act=192 ram_io=416
---------------------------------------------------------------------------------------------------------
id name c_macc c_rom c_id c_dur l2r (X-CROSS)
---------------------------------------------------------------------------------------------------------
0 dense_1 |||||||||||||||| 82.2% |||||||||||||||| 84.8% [0] 11.3%
1 activation_1 | 0.8% | 0.0% [1] 13.3%
2 dense_2 ||| 12.7% ||| 13.1% [2] 16.5%
3 activation_2 | 0.4% | 0.0% [3] 17.7%
4 dense_3 | 2.0% | 2.1% [4] 19.4%
5 activation_3 | 1.9% | 0.0% [5] 21.9% 3.95458301e-07 *
...
(*) indicates the max value
By default, the metric is computed only on the last layers (outputs of the model); however, for a Keras floating-point model, the '--full' option allows computing this error layer by layer.
$ stedgeai validate -m <model_f32p_file_path> --target stm32 --full
...
Complexity report per layer - macc=4,013 weights=15,560 act=192 ram_io=416
---------------------------------------------------------------------------------------------------------
id name c_macc c_rom c_id c_dur l2r (X-CROSS)
---------------------------------------------------------------------------------------------------------
0 dense_1 |||||||||||||||| 82.2% |||||||||||||||| 84.8% [0] 11.0% 5.62010030e-08
1 activation_1 | 0.8% | 0.0% [1] 13.3% 5.57235715e-08
2 dense_2 ||| 12.7% ||| 13.1% [2] 16.3% 8.20674515e-08
3 activation_2 | 0.4% | 0.0% [3] 18.0% 8.00048383e-08
4 dense_3 | 2.0% | 2.1% [4] 19.6% 1.32168850e-07
5 activation_3 | 1.9% | 0.0% [5] 21.9% 3.95458301e-07 *
...
Warning
The '--full' option can also be used for validation on target ('--mode target') to report the L2r error per layer; however, be aware that the validation time is significantly increased due to the download of the intermediate results.
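For example, the following call reports the per-layer L2r error directly on the board (expect a longer validation time):
$ stedgeai validate -m <model_file_path> --target stm32 --mode target --full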
Execution time per layer
Validation on target
The validation on target provides a full and accurate profiling report, including:
- inference time
- number of CPU cycles per MACC
- execution time per layer
- device HW settings/configurations (clock frequency, memory configuration)
...
Running the ST.AI c-model (AI RUNNER)...(name=network, mode=TARGET)
Proto-buffer driver v2.0 (msg v3.1) (Serial driver v1.0 - COM4:115200) ['network']
Summary 'network' - ['network']
-----------------------------------------------------------------------------------
I[1/1] 'input_1' : int8[1,1,28,28], 784 Bytes, QLinear(0.012722839,-95,int8),
activations
O[1/1] 'output_1' : f32[1,10], 40 Bytes,
activations
n_nodes : 9
activations : 32640
weights : 1200584
macc : 12052856
hash : 0x00f1e2478590bea3e6ed23bba954f39f
compile_datetime : Nov 5 2024 11:58:56
-----------------------------------------------------------------------------------
protocol : Proto-buffer driver v2.0 (msg v3.1)
(Serial driver v1.0 - COM4:115200)
tools : ST.AI (st-ai api) v2.0.0
runtime lib : v10.0.0-9a75ee0c compiled with GCC 12.3.1 (GCC)
capabilities : IO_ONLY, PER_LAYER, PER_LAYER_WITH_DATA, SELF_TEST
device.desc : stm32 family - 0x450 - STM32H743/53/50xx and
STM32H745/55/47/57xx @480/240MHz
device.attrs : fpu,art_lat=4,core_icache,core_dcache
-----------------------------------------------------------------------------------
ST.AI Profiling results v2.0 - "network"
---------------------------------------------------------------
nb sample(s) : 10
duration : 28.016 ms by sample (28.010/28.023/0.004)
macc : 12052856
cycles/MACC : 1.12
CPU cycles : [13,447,454]
---------------------------------------------------------------
Inference time per node
----------------------------------------------------------------------------------------------
c_id m_id type dur (ms) % cumul CPU cycles name
----------------------------------------------------------------------------------------------
0 11 Conv2D (0x103) 1.255 4.5% 4.5% [ 602,299 ] ai_node_0
1 17 Conv2dPool (0x109) 20.223 72.2% 76.7% [ 9,707,063 ] ai_node_1
2 20 Transpose (0x10a) 0.795 2.8% 79.5% [ 381,426 ] ai_node_2
3 20 NL (0x107) 0.580 2.1% 81.6% [ 278,565 ] ai_node_3
4 23 Dense (0x104) 5.147 18.4% 99.9% [ 2,470,516 ] ai_node_4
5 26 Dense (0x104) 0.009 0.0% 100.0% [ 4,214 ] ai_node_5
6 26 NL (0x107) 0.001 0.0% 100.0% [ 292 ] ai_node_6
7 29 Softmax (0x10c) 0.003 0.0% 100.0% [ 1,652 ] ai_node_7
8 30 NL (0x107) 0.003 0.0% 100.0% [ 1,427 ] ai_node_8
----------------------------------------------------------------------------------------------
n/a n/a Inter-nodal 0.000 0.0% 100.0% n/a
----------------------------------------------------------------------------------------------
total 28.016 [ 13,447,454 ]
----------------------------------------------------------------------------------------------
Statistic per tensor
----------------------------------------------------------------------------------
tensor # type[shape]:size min max mean std name
----------------------------------------------------------------------------------
I.0 10 i8[1,1,28,28]:784 -128 127 -1.681 73.679 input_1
O.0 10 f32[1,10]:40 -7.937 -0.033 -4.356 1.900 output_1
----------------------------------------------------------------------------------
...
This report can be used to identify the main contributors in terms of inference time and to refine the model accordingly. The 'c_id' column references the index of the c-node (see the “C-graph description” section), and the 'm_id' column identifies the index from the original model.
Out-of-the-box execution
When the 'target-io-only' mode or the '--io-only' option is used, the deployed model is only executed out-of-the-box; the execution time and l2r error per layer are no longer computed. This can be used to limit the traffic between the host and the target, reducing the validation time.
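For example, assuming both forms are equivalent as indicated above, an out-of-the-box run on target can be requested as follows:
$ stedgeai validate -m <model_file_path> --target stm32 --mode target --io-only
$ stedgeai validate -m <model_file_path> --target stm32 --mode target-io-only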
...
ST.AI Profiling results v2.0 - "network"
------------------------------------------------------------------
nb sample(s) : 10
duration : 28.016 ms by sample (28.007/28.044/0.010)
macc : 12052856
cycles/MACC : 1.12
CPU cycles : [13,447,610]
used stack/heap : 1300/0 bytes
------------------------------------------------------------------
Statistic per tensor
----------------------------------------------------------------------------------
tensor # type[shape]:size min max mean std name
----------------------------------------------------------------------------------
I.0 10 i8[1,1,28,28]:784 -128 127 -1.681 73.679 input_1
O.0 10 f32[1,10]:40 -7.937 -0.033 -4.356 1.900 output_1
----------------------------------------------------------------------------------
...
Validation on host
For validation on host, the relative execution time per layer is not reported by default; the '-v 2' option should be used to display it. Nevertheless, it is important to note that these values are only indicators: they depend on the implementation of the kernels, which are not optimized, and on the workload of the desktop/host machine (see the device.desc field). This contrasts with the inference times reported for validation on the target.
$ stedgeai validate -m <model_file_path> --target stm32 -v 2 [--mode host]
...
Running the ST.AI c-model (AI RUNNER)...(name=network, mode=HOST)
DLL Driver v2.0 - Direct Python binding
(<workspace-directory-path>\inspector_network\workspace\lib\libai_network.dll) ['network']
Summary 'network' - ['network']
-----------------------------------------------------------------------------------
I[1/1] 'input_1' : int8[1,28,28,1], 784 Bytes, QLinear(0.012722839,-95,int8),
in activations buffer
O[1/1] 'output_1' : f32[1,1,1,10], 40 Bytes, in activations buffer
n_nodes : 9
activations : 32640
weights : 1200584
macc : 12052856
hash : 0x00f1e2478590bea3e6ed23bba954f39f
compile_datetime : Nov 15 2024 12:49:14
-----------------------------------------------------------------------------------
protocol : DLL Driver v2.0 - Direct Python binding
tools : ST.AI (legacy api) v2.0.0
runtime lib : v10.0.0
capabilities : IO_ONLY, PER_LAYER, PER_LAYER_WITH_DATA
device.desc : AMD64, Intel64 Family 6 Model 165 Stepping 2, GenuineIntel,
Windows
-----------------------------------------------------------------------------------
NOTE: The duration and execution time per layer are just indications. They depend
on the host machine's workload.
ST.AI Profiling results v2.0 - "network"
------------------------------------------------------------------
nb sample(s) : 10
duration : 6.068 ms by sample (5.698/6.571/0.223)
macc : 12052856
------------------------------------------------------------------
DEVICE duration : 7.066 ms by sample (including callbacks)
HOST duration : 0.074 s (total)
used mode : Mode.PER_LAYER
number of c-node : 9
------------------------------------------------------------------
Inference time per node
--------------------------------------------------------------------------------
c_id m_id type dur (ms) % cumul name
--------------------------------------------------------------------------------
0 11 Conv2D (0x103) 0.144 2.4% 2.4% ai_node_0
1 17 Conv2dPool (0x109) 5.142 84.7% 87.1% ai_node_1
2 20 Transpose (0x10a) 0.035 0.6% 87.7% ai_node_2
3 20 NL (0x107) 0.009 0.2% 87.8% ai_node_3
4 23 Dense (0x104) 0.731 12.0% 99.9% ai_node_4
5 26 Dense (0x104) 0.002 0.0% 99.9% ai_node_5
6 26 NL (0x107) 0.001 0.0% 99.9% ai_node_6
7 29 Softmax (0x10c) 0.002 0.0% 100.0% ai_node_7
8 30 NL (0x107) 0.001 0.0% 100.0% ai_node_8
--------------------------------------------------------------------------------
n/a n/a Inter-nodal 0.001 0.0% 100.0% n/a
--------------------------------------------------------------------------------
total 6.068
--------------------------------------------------------------------------------
Statistic per tensor
----------------------------------------------------------------------------------
tensor # type[shape]:size min max mean std name
----------------------------------------------------------------------------------
I.0 10 i8[1,28,28,1]:784 -128 127 -1.681 73.679 input_1
O.0 10 f32[1,1,1,10]:40 -7.937 -0.033 -4.356 1.900 output_1
----------------------------------------------------------------------------------
...
'c_id'
designates the c-layer index in the “C-graph description”.
Generate command
Description
The 'generate' command is used to generate the specialized network and data C-files. Depending on the '--c-api' option, the selected target, and other additional options, the generated files can differ.
Generated files with “legacy” C-API option
With the 'legacy'
C-API, the following files are
generated:
$ stedgeai generate -m <model_file_path> --target stm32 -o <output-directory-path> [--c-api legacy]
...
Generated files (7)
-----------------------------------------------------------
<output-directory-path>\<name>_config.h
<output-directory-path>\<name>.c
<output-directory-path>\<name>_data.c
<output-directory-path>\<name>_data_params.c
<output-directory-path>\<name>.h
<output-directory-path>\<name>_data.h
<output-directory-path>\<name>_data_params.h
Creating report file <output-directory-path>\network_generate_report.txt
...
- '<name>.c/.h' files contain the topology of the C-model (C-struct definition of the tensors and the operators), including the embedded inference client API (refer to the “Embedded Inference Client API” article) to use the generated c-model on top of the optimized inference runtime library.
- '<name>_data_params.c/.h' files contain by default a simple C-array with the data of the weight/bias tensors. However, the '--split-weights' option allows having a C-array per tensor (refer to the “Split weights buffer” section) and the '--binary' option creates a binary file with the data of the weight/bias tensors. The '--relocatable/-r' option (available only for stm32) allows generating a relocatable binary model including the topology definition, the requested kernels, and the weights in a single binary file (refer to the “Relocatable binary model support” article).
- '<name>_data.c/.h' files contain the intermediate functions requested by the specialized init function to manage the C-array with the weights.
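For illustration, the weight-layout options mentioned above can be requested as follows (sketches only; the exact list of generated files depends on the model and the options):
$ stedgeai generate -m <model_file_path> --target stm32 --split-weights
$ stedgeai generate -m <model_file_path> --target stm32 --binary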
Generated files with “st-ai” C-API option
With the 'st-ai'
C-API, the following files are
generated:
$ stedgeai generate -m <model_file_path> --target stellar-e -o <output-directory-path> --c-api st-ai
or
$ stedgeai generate -m <model_file_path> --target stellar-pg -o <output-directory-path> --c-api st-ai
...
Generated files (5)
-----------------------------------------------------------
<output-directory-path>\<name>.c
<output-directory-path>\<name>_data.c
<output-directory-path>\<name>.h
<output-directory-path>\<name>_data.h
<output-directory-path>\<name>_details.h
Creating report file <output-directory-path>\network_generate_report.txt
...
- '<name>.c/.h' files contain the topology of the C-model (C-struct definition of the tensors and the operators), including the embedded inference client API (refer to the “Embedded Inference Client ST Edge AI API” article) to use the generated c-model on top of the optimized inference runtime library.
- '<name>_data.c/.h' files contain by default a simple C-array with the data of the weight/bias tensors. However, the '--split-weights' option allows having a C-array per tensor (refer to the “Split weights buffer” section).
- '<name>_details.h' file contains the debug information about the intermediate tensors (for debug/advanced purposes).
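As an illustrative variant, the '--split-weights' option can also be combined with the 'st-ai' C-API:
$ stedgeai generate -m <model_file_path> --target stellar-e --c-api st-ai --split-weights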
For the ISPU target, the generated output also contains the runtime library and its header files, and is structured so as to correctly populate the provided templates. For more details, refer to the “Generate command extension” section of the ISPU-specific documentation.
Examples
Generate the specialized NN C-files (default options).
$ stedgeai generate -m <model_file_path> --target stellar-e
or
$ stedgeai generate -m <model_file_path> --target stellar-pg
Generate the specialized NN C-files for a 32-bit float model with a compression factor.
$ stedgeai generate -m <model_file_path> --target stm32 -c medium
Specific options
- For the 'stm32' target, a set of specific options is defined (see the “Generate command extension” section) to address additional use cases:
  - generation of a shared library to run the model locally (on the host machine) through a specific Python module (see the “How to use the AiRunner package” article)
  - generation of a relocatable binary object to be installed and executed anywhere in an STM32 device (see the “Relocatable binary model support” article), as illustrated after this list
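A sketch of the relocatable use case, using the '--relocatable/-r' option described earlier:
$ stedgeai generate -m <model_file_path> --target stm32 --relocatable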
Supported-ops command
Description
The 'supported-ops' command is used to display the list of the supported operators for a given deep learning framework selected with the '-t/--type' option. Otherwise, by default, the operators for all frameworks are listed.
Specific arguments
--with-report
If defined, this flag allows generating a report file (Markdown format) with the list of the operators and associated constraints. - Optional
This option has been used to generate the following articles: “Keras toolbox support”, “TFLite toolbox support”, and “ONNX toolbox support”.
Examples
Generate the list of the supported operators (default)
$ stedgeai supported-ops
ST Edge AI Core v1.0.0
281 operators found
Abs (ONNX), ABS (TFLITE), Acos (ONNX), Acosh (ONNX), Activation (KERAS), Abs (KERAS), Add (KERAS), Add (ONNX), ADD (TFLITE), ActivityRegularization (KERAS), And (ONNX), ARG_MAX (TFLITE), ARG_MIN (TFLITE), AlphaDropout (ONNX), ArgMin (ONNX), ArrayFeatureExtractor (ONNX), Asin (ONNX), ArgMax (ONNX), Asinh (ONNX), ...
Generate the list of the supported Keras operators
$ stedgeai supported-ops -t keras
ST Edge AI Core v1.0.0
Parsing operators for KERAS toolbox
62 operators found
Activation, ActivityRegularization, Add, AlphaDropout, Average, AveragePooling1D, AveragePooling2D, BatchNormalization, Bidirectional, Concatenate, Conv1D, Conv2D, Conv2DTranspose, Cropping1D, Cropping2D, Dense, DepthwiseConv2D, Dropout, ELU, Flatten, GaussianDropout, GaussianNoise, GlobalAveragePooling1D, GlobalAveragePooling2D, GlobalMaxPooling1D, GlobalMaxPooling2D, GRU, ..., InputLayer
30 custom operators found
Abs, Acos, Acosh, Asin, Asinh, Atan, Atanh, Ceil, Clip, Cos, Exp, Fill, FloorDiv, FloorMod, Gather, Log, Pow, Reshape, Round, Shape, Sign, Sin, Split, Sqrt, Square, Tanh, Unpack, Where, CustomLambda, TFOpLambda
Generate the list of the supported ONNX operators
$ stedgeai supported-ops -t onnx
Generate the list of the supported tflite operators
$ stedgeai supported-ops -t tflite
Generate the list of the supported Keras operators with a full report
$ stedgeai supported-ops -t keras --with-report
...
Building report...
creating file <output-directory-path>/supported_ops_keras.md
...