Autotiler: Mnist

Introduction

This document describes an MNIST handwritten digit recognition application, based on a convolutional neural network (CNN), deployed on the GAP9 processor using the Autotiler.

NOTE: For a more complete use of the NN deployment tools, we suggest looking into the nn/nntool examples, where the GAPflow is used in its entirety to deploy a NN, including quantization, performance checks, etc. Here we want to show how Autotiler graphs are described and what the possible options are (also available in NNTool scripting).

In the example, 28x28 pixel images are loaded from the host PC into the L2 memory of the fabric controller. The images are then processed by the CNN on the 8 cores of the cluster. The CNN classifies the images into 10 categories (digits 0 to 9) by selecting the maximum confidence value at the output as the inferred category.

The network topology (number of layers, number of inputs and outputs for the layers) has been taken from an open source project on the Internet. In this example, we demonstrate the implementation of this network on GAP9 using adapted data management through DMA and optimized parallelized filter kernels taking advantage of the 8 cores of the cluster.

The CNN has been trained off-line using the MNIST dataset. The recognition ratio reached after training is 96%, which is very close to the asymptotic ratio announced by the authors. More sophisticated topologies can achieve a better rate, but this example is intended as a simple tutorial on CNN implementation principles for the GAP9 architecture. Specifically, it demonstrates how to perform data management between the L2 memory in the fabric controller and the L1 cluster shared memory so that no memory overflow occurs while keeping all the compute resources busy. It also shows how the basic filter kernels (2D convolution filter and linear filters) can be described to use the 8 cores of the cluster efficiently so as to reduce computation time and hence the power consumption of the application.

App Modes

This app runs an MNIST application using a graph generated by the Autotiler. The following sections show in detail how to program the Autotiler and how it works. The app shows the user how to run Autotiler graphs in different modes, specifically:

Standard/Warm Mode: Mnist.c

  • Autotiler Graph Standard

  • Autotiler Graph w/ Warm Construct (menuconfig: “APP Mode -> Graph warm Xstructors”)

Reentrant Mode: MnistReentrant.c

  • Autotiler Graph w/ Reentrant mode (menuconfig: “App Mode -> Graph Reentrant”)

  • Autotiler Graph w/ Reentrant mode and custom yield hook (menuconfig: “App Mode -> Graph Reentrant” + “Use custom hook in Reentrant Mode”)

Both modes support the encryption of constants: (menuconfig: “Graph Encrypt Constants”)

  • This option enables the AES encryption of constants that have been promoted to L2, which are then decrypted before use

In warm mode the model has special constructor and destructor functions that allow the user to tune everything at the application level (allocate/deallocate specific memory levels). Mnist.c contains the main application code. It is the same as in standard mode, but the construct/destruct functions take a list of booleans that select whether each memory area needs to be initialized/allocated or not.

    *** PMSIS Mnist Test ***

NN Warm Construct
Calling Cluster
Node Conv5x5ReLUMaxPool2x2_0 starts
Node Conv5x5ReLUMaxPool2x2_1 starts
Node LinearLayerReLU_0 starts
Node SoftMax_0 starts
Recognized number : 6

Test success with 0 error(s) !

Compile and run the app

mkdir build
cd build
cmake ..
make menuconfig # choose your options here
make run -j # run the app

Automatic Layer Code Generation

The low level code for the CNN layers has been generated using the Autotiler tool available in the SDK. The generator assumes that the data and weights are stored in the L2 memory (in the Fabric Controller). The available size of L1 shared memory for processing is specified to the Autotiler.

The data and weights are split into tiles, forming data sets that will be loaded through DMA into the L1 shared memory of the cluster. The cores process the data and, when finished, the results are stored back into L2. The tool takes the "high level" parameters of the layers as input and generates the low level functions describing the data management and the run-time functions to execute the layers in parallel on the eight cores of the cluster.

The Autotiler model can also describe how the layers are connected to each other (the graph structure). This allows it to compute the memory used at graph level by each tensor, according to its liveness in the graph.

The following files are automatically generated:

  • MnistKernels.c: contains the generated code for the layers, with the calls to run time DMA process and processor synchronization and the Graph functions to construct/destruct/run the graph

  • MnistKernels.h: contains the headers for the layer and graph functions

  • Mnist_L3_Flash_Const.h: file containing the constants used by the graph

How to automatically generate the MNIST CNN code with tiling

The input file containing the "high level" description of the layers used in MNIST is MnistModel.c. The following describes the different sections of the input file:

SetInlineMode(ALWAYS_INLINE);

This instruction specifies that a function will be generated for each occurrence of the layer. To generate a single parameterized function, the SetInlineMode function must be called with “SINGLE_INLINE” parameter.

The following instructions specify the names of the generated memory symbols (SetSymbolNames), the names of the library header files (SetUsedFilesNames), the names of the generated files (SetGeneratedFilesNames), and the amount of L1 memory available to the tiler generator (SetL1MemorySize).

SetSymbolNames("Mnist_L1_Memory", "Mnist_L2_Memory");
// Standard data types are used, we import CNN basic kernels
SetUsedFilesNames(0, 1, "CNN_BasicKernels.h");
// Auto Tiler generated files
SetGeneratedFilesNames("MnistKernels.c", "MnistKernels.h");
// L1 shared memory given to Auto Tiler
SetL1MemorySize(L1Memory);

There are some options that can be enabled at graph level, for instance trace execution, or generate warm construct/destruct functions. These options can be defined here and will be applied to the whole graph:

AT_SetGraphCtrl(AT_GRAPH_PRODUCE_NODE_NAMES, AT_OPT_ON);
AT_SetGraphCtrl(AT_GRAPH_TRACE_EXEC, AT_OPT_ON);
#ifdef WARM_GRAPH
AT_SetGraphCtrl(AT_GRAPH_WARM_CONSTRUCT, AT_OPT_VAL(3));
#endif

The kernel libraries are loaded for usage by the tiler:

LoadCNNLibrary();

The layer parameters are defined by the code below:

// 5x5 Convolution followed by ReLu and then by 2x2 Max pooling. Pure SW.
// 1 input plane [28x28], 32 output planes [12x12]
CNN_ConvolutionPoolReLU("Conv5x5ReLUMaxPool2x2_0",   0, 2,2,2,2, 0,0,0,0,   1,32,28,28,  KOP_CONV, 5,5, 1,1, 1,1, 0,  KOP_MAXPOOL,   2,2, 1,1, 2,2, 1, KOP_RELU);
// 5x5 Convolution followed by ReLu and then by 2x2 Max pooling. Pure SW.
// 32 input planes [12x12], 64 output planes [4x4]
CNN_ConvolutionPoolReLU("Conv5x5ReLUMaxPool2x2_1",   0, 2,2,2,2, 0,0,0,0,   32,64,12,12,  KOP_CONV, 5,5, 1,1, 1,1, 0,  KOP_MAXPOOL,   2,2, 1,1, 2,2, 1, KOP_RELU);
// Linear Layer, Input, Output and Coeffs on 16bits.
// Input 64 x [4x4], Output 10
CNN_LinearReLU         ("LinearLayerReLU_1",        0, 2,2,2,2, 0,0,0,0,   64*4*4,10, KOP_LINEAR, KOP_NONE);

Then there is the graph topology description, i.e. how the nodes are connected together and how input/outputs/constants are passed through the graph.

//Open Graph Creation
CreateGraph("MnistCNN",
/* Arguments either passed or globals */
//Here go the input, the output and the layer parameters (weights and biases)
//The ConstInfo function takes as input a tensor in CxHxW format and creates a single binary file loaded into the GAP9 flash
//It can take float input and automatically convert it to fixed point format
CArgs(8,
    TCArgInfo("short int *__restrict__", "Input0", ARG_SCOPE_ARG, ARG_DIR_IN, AT_MEM_L2, AT_MEM_UNDEF, 0),
    TCArgInfo("short int *__restrict__", "Step1Weights", ARG_SCOPE_GLOBAL, ARG_DIR_CONSTIN, AT_MEM_L3_DEFAULTFLASH, AT_MEM_UNDEF, ConstInfo("Step1Weights.tensor", 1, 1, 16, 0)),
    ....
);
);
//Node Connections with arguments
AddNode("Conv5x5ReLUMaxPool2x2_0", //Name of the Generated Layer
Bindings(4,                //Number of parameters of the generated layer that need to be connected
    //void Conv5x5ReLUMaxPool2x2_0(
    GNodeArg(GNA_IN, "Input0", 0),          //short int * __restrict__ In,
    GNodeArg(GNA_IN, "Step1Weights", 0),    //short int * __restrict__ Filter,
    GNodeArg(GNA_IN, "Step1Biases", 0),     //short int * __restrict__ Bias,
    GNodeArg(GNA_OUT, "OutputStep2", 0)    //short int * __restrict__ Out
)
);
....