NNTool Python API Introduction

The NNTool Python API exposes all of the capabilities of the command line interface and more. It is very useful for more complicated experiments and graph validation scenarios. The main interface to the NNTool API is the nntool.api.NNGraph class, returned when you open a graph via load_graph(). If a method or a class is not documented here, you should consider it private and subject to change.

Open and inspect a graph

Once the model is loaded, the first thing to do is to inspect its structure and check that it is coherent with the original model. Then you can prepare it for the subsequent steps. adjust_order() changes the tensor order to CHW (the default). Do not worry: the order may change again in later steps if you require a different backend (e.g. NE16 or HWC SW). fusions() parses the graph and fuses together nodes that the backend supports in a more efficient fused form.

from nntool.api import NNGraph

G = NNGraph.load_graph('mygraph.tflite')
# original topology (tflite HWC, onnx CHW)
G.draw()
print(G.show())

G.adjust_order()
G.fusions("scaled_match_group")
# Topology changed after these commands
G.draw()
print(G.show())

Selecting Nodes

The API refers to operators in the graph as nodes. All nodes inherit from the nntool.api.types.NNNodeBase class. The NNGraph nodes() method can be used to select nodes by node classes. The NNGraph instance can be indexed by step index or node name to provide the instance of a particular node in the graph. For example:

from nntool.api import NNGraph
from nntool.api.types import Conv2DNode

G = NNGraph.load_graph('mygraph.tflite')

# get node by step index
print(G[12].name)

# get node by name
print(G["my_conv"].step_idx)

# get nodes by class. node_classes can be a single class or a tuple of classes
for node in G.nodes(node_classes=Conv2DNode):
    print(node.name)

Setting Node Options

Most node options are set through a node's at_options property. They are identical to the options you can set with the nodeoptions command, and are exposed as properties of the NodeOptions class that at_options returns. For example:

from nntool.api import NNGraph

G = NNGraph.load_graph('mygraph.tflite')

G[12].at_options.dump_tensors = 1
G[0].at_options.allocate = 1
G[0].at_options.fixed_order = 1

They can also be set and read via the nntool.api.NNGraph.set_node_option() and nntool.api.NNGraph.get_node_option() methods. The available options are exposed via nntool.api.types.NNNodeBase.get_valid_at_options().
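
For example (a sketch: the exact signatures of these methods are assumptions here, check the API reference):

# assumed signatures: set_node_option(node, option_name, value) and
# get_node_option(node, option_name)
G.set_node_option(G[12], 'dump_tensors', 1)
print(G.get_node_option(G[12], 'dump_tensors'))
# list the options a node accepts
print(G[12].get_valid_at_options())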

IMPORTANT NOTES:

  • allocate:

    this option allows the autotiler to allocate the space for the selected input/output. This lets you make better use of the chip's available memory, since the input/output space can be reused during execution. If set to 0 the buffer is left untouched (a static block in memory). Note that if you allow the autotiler to reallocate an input space, the input can no longer be read after the network has executed (its content might have changed). This option is particularly useful for networks with big input/output tensors (such as image networks).

  • fixed_order:

    this fixes the order of an input/output so that NNTool will not change it when further adjust commands are run. It is useful if you have an input coming from a sensor in a fixed order and you don't want NNTool to change it. Both options are illustrated in the sketch below.
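
For example, to apply both options to the graph inputs (a sketch using the at_options property shown above):

# let the autotiler reuse the input buffers and keep the sensor's fixed order
for inp_node in G.input_nodes():
    inp_node.at_options.allocate = 1
    inp_node.at_options.fixed_order = 1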

Collecting Statistics

To quantize a graph, NNTool needs statistics on the dynamic range of each activation. If the graph is imported with the load_quantization parameter set to True (i.e. a quantized TFLITE graph or an NNCF-quantized ONNX graph) then these statistics are already present. If not, you need to collect them with the nntool.api.NNGraph.collect_statistics() method. It takes a dataloader, which is just an iterable that returns normalized input tensors (as numpy float arrays) if the graph has one input, or a sequence of tensors if it has more than one. nntool.api.FileImporter is an example of a dataloader for sounds and images. Sometimes you just want to see what the performance of a network will be and are not worried about accuracy; in this case the nntool.api.RandomIter class can be used.

from nntool.api import NNGraph
from nntool.api.utils import RandomIter

G = NNGraph.load_graph('mygraph.tflite')

stats = G.collect_statistics(RandomIter.fake(G))

Here is an example of a custom data loader:

from PIL import Image
import numpy as np

class MyDataLoader():
    def __init__(self, image_files, max_idx=None, transpose_to_chw=True):
        self._file_list = image_files
        self._idx = 0
        self._max_idx = max_idx if max_idx is not None else len(image_files)
        self._transpose_to_chw = transpose_to_chw

    def __iter__(self):
        self._idx = 0
        return self

    def __next__(self):
        if self._idx >= self._max_idx:
            raise StopIteration()
        filename = self._file_list[self._idx]

        # Here we read the image and make it a numpy array
        image = Image.open(filename)
        img_array = np.array(image)

        # Apply some preprocessing
        img_array = img_array / 128 - 1.0

        # Transpose to CHW
        if self._transpose_to_chw:
            img_array = img_array.transpose(2, 0, 1)

        self._idx += 1
        return img_array

If you don’t want to recompute the statistics, cache them to disk with numpy.save() and numpy.load() or with pickle:

import os
import pickle

# trained_model (path of the original model file) and images (list of input
# files) are assumed to be defined
stats_file = f"/tmp/{os.path.splitext(os.path.basename(trained_model))[0]}.pickle"
if os.path.exists(stats_file):
    print(f"Loading stats dictionary from {stats_file}")
    with open(stats_file, 'rb') as fp:
        stats = pickle.load(fp)
else:
    stats = G.collect_statistics(MyDataLoader(images))
    print(f"Saving stats dictionary to {stats_file}")
    with open(stats_file, 'wb') as fp:
        pickle.dump(stats, fp, protocol=pickle.HIGHEST_PROTOCOL)
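
The numpy route mentioned above works similarly (a sketch: the statistics object is assumed to be a plain pickleable dict, hence allow_pickle):

import numpy as np

np.save("/tmp/stats.npy", stats, allow_pickle=True)
stats = np.load("/tmp/stats.npy", allow_pickle=True).item()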

Quantization Options

Quantization options can be set at graph or node level using the node_options and graph_options optional arguments to the quantize() method. Graph options are just a dict of option names and values. Node options are a dict mapping node names to dicts of options. The helper function nntool.api.utils.quantization_options() can be used to build the options dictionaries.

from nntool.api import NNGraph
from nntool.api.utils import quantization_options

G = NNGraph.load_graph('mygraph.tflite')

# ... generate stats

G.quantize(
   stats,
   node_options={
      'conv1': quantization_options(weight_bits=2)
   },
   schemes=['scaled'],
   graph_options=quantization_options(use_ne16=True)
)

Once quantized, you can inspect the quantization of the graph via:

# show the quantization of each tensor instead of its shape
G.draw(quant_labels=True)
# also show the arithmetic quantization inside expression nodes
G.draw(expressions="quantized")

# for a more literal inspection
print(G.qshow())

Executing a graph in NNTool

NNTool features a bit-accurate numpy backend library that lets you test the deployable graph in a pure Python environment. In this way you can easily test the graph just as you would in the original ML framework. nntool.api.NNGraph.execute() returns the output tensors of every layer of the network, so you can inspect the QSNR degradation layer by layer if needed.

import numpy as np
import onnxruntime as ort

from nntool.api.utils import qsnr

# my_data_loader is assumed to yield (input, target) pairs
inp, target = next(my_data_loader)

# take output from onnx/tflite
sess = ort.InferenceSession(model_path)
input_name = sess.get_inputs()[0].name
out_onnx = sess.run(None, {input_name: inp})
out_onnx = np.asarray(out_onnx)

# in nntool the tensor order might have changed (HWC/CHW order)
# depending on the options the user provided. The input must
# be prepared accordingly
nntool_inp = inp.transpose((1,2,0)) # CHW (onnx/pytorch) -> HWC (e.g. ne16)
# full precision execution
out_f32 = G.execute(nntool_inp)
out_nntool_f32 = out_f32[-1][0]

print(f"SNR onnx vs nntool full precision: {qsnr(out_onnx, out_nntool_f32)}")
print("(This should be very high, if not there might be an issue in the input preparation or a bug in the NNTool :) Please contact us in this case)")

# Now we run in quantized mode (whatever it is, the input is the same full precision)
out_quant = G.execute(nntool_inp, dequantize=True, quantize=True)
out_nntool_quant = out_quant[-1][0]

# The output has been dequantized and can be directly compared to the fp32 one
print(f"SNR nntool full precision vs nntool quantized: {qsnr(out_nntool_quant, out_nntool_f32)}")
print("(It really depends on the application but generally this should be >=10 otherwise you might want to change the quantization options e.g. using fp16 or mixed fp16/int8)")

layers_qsnrs = G.dict_qsnrs(out_f32, out_quant)
for k, v in layers_qsnrs.items():
    print(f"{k:50}: {v}")

Executing a graph on the simulator or chip

A loaded and quantized graph can be used to build a project and execute the graph on the GVSOC SoC simulator, or on the actual chip, directly from the API. Both the performance data and all of the tensors from the execution can be returned. All this functionality is provided through the NNGraph execute_on_target() method.

IMPORTANT NOTES:

  • the returned object is a CompletedProcess

  • the most important argument is settings: it takes a dictionary with all the settings for the autotiler code generation. Use the nntool.api.utils.model_settings() API for documentation on the available options.

  • the input_tensors provided here should be fully quantized

  • if you want to inspect the compilation process while it is running, pass print_output=True

  • if the output_tensors argument is provided, the corresponding tensors (selected by the bitmask shown in the example below) are dumped during the target run and returned in res.output_tensors

from nntool.api.utils import model_settings

# nntool_inp is the same as before (fp32)
qout = G.execute([nntool_inp], quantize=True, dequantize=False)
res = G.execute_on_target(
    directory=worker_dir,
    input_tensors=[qout[n.step_idx] for n in G.input_nodes()], # list of all the inputs but fully quantized (coming from nntool execution)
    output_tensors=4, # 1: constants, 2: input, 4: output, (6 = 4|2: input+output, 7: all)
    at_loglevel=2,
    at_log=True,
    print_output=True,
    settings=model_settings(
        tensor_directory="tensors",
        model_directory="model_dir",
        l1_size=128000,
        l2_size=1000000,
        graph_const_exec_from_flash=True
    ),
)
# if execution is not successful res.returncode will not be 0
assert res.returncode == 0, f"Something went wrong {res.stderr}"

# qsnrs comparison between output_tensors and fully quantized execution
layers_qsnrs = G.dict_qsnrs(qout, res.output_tensors)
for k, v in layers_qsnrs.items():
    print(f"{k:50}: {v}")
print("This should be inf if fully integer, if float quantization is used the qsnrs won't be inf but should be very high")
print("(Floating point cannot be fully emulated in nntool as the target since the order of computation changes the results so the nntool vs target are slightly different)")

# show performance layer-by-layer
print(res.performance)
print(res.raw_performance)

Generate a template project

While execute_on_target() is useful for prototyping and getting performance metrics, the project it generates is quite messy and not really suitable for the final application code. gen_project() is intended for that: it generates a folder with everything you need to build your real application.
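
For example (a minimal sketch: the destination-directory argument is an assumption, check the gen_project() docstring for the exact signature):

G.gen_project('my_project')  # generate the application template into ./my_project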

Once the folder is created you can go inside and do:

mkdir build && cd build
cmake ../
make run -j

By default the build will raise an error saying that you need to adapt the Python script that generates the autotiler model. You only need to edit the generated template script (named gen_{your_project}_at_model.py), adapting it with the code that you used to prototype the model and replacing execute_on_target() with gen_at_model(), so that the script only generates the autotiler model instead of launching an execution. Edit that file and adapt it to your needs.
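
The change typically looks something like this (a sketch: gen_at_model() is assumed to accept the same settings dictionary as execute_on_target(); check its docstring for the exact signature):

# in gen_{your_project}_at_model.py: generate the autotiler model, no execution
G.gen_at_model(
    settings=model_settings(
        l1_size=128000,
        l2_size=1000000,
    ),
)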

NOTE: if you change the memory configuration in the model_settings of gen_at_model(), you will also need to change the corresponding CMake settings in CMakeLists.txt (e.g. PRI1_FLASH)

Longer example of NNTool Python API

Here is a longer example of some of the functionality in the NNTool Python API.

import numpy as np
from glob import glob

from nntool.api import NNGraph

model = NNGraph.load_graph(
            'face_detection_front.tflite',
            load_quantization=False # Whether tflite quant should be loaded or not (default: False)
        )

# Model show returns a table of information on the Graph
print(model.show())

# Model draw can open or save a PDF with a visual representation of the graph
model.draw()

# The equivalent of the adjust command
model.adjust_order()

# The equivalent of the fusions --scale8 command. The fusions method can be given a series of fusions to apply
# fusions('name1', 'name2', etc)
model.fusions('scaled_match_group')

# draw the model here again to see the adjusted and fused graph
model.draw()

# The executor returns all the layer outputs. Each layer output is a list of the tensors from each output of a layer.
# Generally layers have one output but some (like a split for example) can have multiple outputs.
# Here we select the first output of the last layer which, in a graph with one output, will always be the
# graph output
data_loader = MyDataLoader(glob("input_images/*"))
layer_outputs = model.execute(data_loader)
last_layer = layer_outputs[-1][0]

# Now let's quantize the graph
statistics = model.collect_statistics(data_loader)
# The resulting statistics contain detailed information for each layer
print(statistics['input_1'])

name_layer_2 = model[2].name
# quantize the model. quantization options can be supplied for a layer or for the whole model
model.quantize(
    statistics,
    schemes=['scaled'], # Schemes present in the graph
    graph_options={
        "use_ne16": True,
        "hwc": False,
        "force_output_size": 16,
    }, # QUANT_OPTIONS applied graph-wise
    node_options={
        name_layer_2: {
            "use_ne16": False,
            "hwc": True
        }
    }, # QUANT_OPTIONS applied layer-wise
    )
test_image = next(iter(data_loader)) # restart the loader and take the first sample

# Now execute the quantized graph, outputting quantized values
print("execute model without dequantizing data")
print(model.execute(test_image, quantize=True)[-1][0])

# Now execute the graph twice with float and quantized versions and compare the results
print("execute model comparing float and quantized execution and showing Cosine Similarity")
cos_sim = model.cos_sim(model.execute(test_image), model.execute(test_image, quantize=True, dequantize=True))
print(cos_sim)

# the step idx can be used to index the model to find the layer with the worst cos_sim
print(model[np.argmin(cos_sim)])

Coming from command line NNTool

Here are some of the relationships between the command line interface and the NNGraph methods.

  • open - load_graph()

  • adjust - adjust_order()

  • fusions - fusions()

  • aquant and qtune - collect_statistics() and quantize()

  • dump - execute()
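
As a quick reference, the typical command sequence maps to the following calls (a sketch reusing names from the examples above):

G = NNGraph.load_graph('mygraph.tflite')                   # open
G.adjust_order()                                           # adjust
G.fusions('scaled_match_group')                            # fusions
stats = G.collect_statistics(data_loader)                  # aquant (statistics)
G.quantize(stats, schemes=['scaled'])                      # aquant / qtune
outputs = G.execute(inp, quantize=True, dequantize=True)   # dump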