Autotiler: Mnist

Introduction

This document describes an MNIST handwritten digit recognition application, based on a convolutional neural network (CNN), deployed on the GAP9 processor using the Autotiler.

NOTE: For a more complete use of the NN deployment tools, we suggest looking into the nn/nntool examples, where the GAPflow is used in its entirety to deploy a NN, including quantization, performance checks, etc. Here we want to show how Autotiler graphs are described and what the possible options are (also available in NNTool scripting).

In the example, 28x28 pixel images are loaded from the host PC into the L2 memory of the fabric controller. The images are then processed by the CNN on the 8 cores of the cluster. The CNN classifies the images into 10 categories (digits 0 to 9) by selecting the maximum confidence value at the output as the inferred category.

The network topology (number of layers, number of inputs and outputs for the layers) has been taken from an open source project on the Internet. In this example, we demonstrate the implementation of this network on GAP9 using adapted data management through DMA and optimized parallelized filter kernels taking advantage of the 8 cores of the cluster.

The CNN has been trained off-line using the MNIST dataset. The recognition ratio reached after training is 96%, which is very close to the asymptotic ratio announced by the authors. More sophisticated topologies can achieve a better rate, but this example is intended as a simple tutorial on CNN implementation principles for the GAP9 architecture. Specifically, it demonstrates how to perform data management between the L2 memory in the fabric controller and the L1 cluster shared memory so that no memory overflow occurs while keeping all the compute resources busy. It also shows how the basic filter kernels (2D convolution filter and linear filters) can be described to use the 8 cores of the cluster efficiently so as to reduce computation time and hence the power consumption of the application.

App Modes

This app runs an MNIST application using a graph generated by the Autotiler. The following sections show in detail how to program the Autotiler and how it works. The app shows the user how to run Autotiler graphs in different modes, specifically:

Standard/Warm Mode: Mnist.c

  • Autotiler Graph Standard

  • Autotiler Graph w/ Warm Construct (menuconfig: “APP Mode -> Graph warm Xstructors”)

Reentrant Mode: MnistReentrant.c

  • Autotiler Graph w/ Reentrant mode (menuconfig: “App Mode -> Graph Reentrant”)

  • Autotiler Graph w/ Reentrant mode and custom yield hook (menuconfig: “App Mode -> Graph Reentrant” + “Use custom hook in Reentrant Mode”)

Both modes support the encryption of constants: (menuconfig: “Graph Encrypt Constants”)

  • This option enables the AES encryption of constants that have been promoted to L2, which are then decrypted before use

In warm mode the model has special constructor and destructor functions that allow the user to tune everything at the application level (allocate/deallocate specific memory levels). Mnist.c contains the main application code. It is the same as in standard mode, but the construct/destruct functions take a list of booleans that select whether each memory area needs to be initialized/allocated or not.

    *** PMSIS Mnist Test ***

NN Warm Construct
Calling Cluster
Node Conv5x5ReLUMaxPool2x2_0 starts
Node Conv5x5ReLUMaxPool2x2_1 starts
Node LinearLayerReLU_0 starts
Node SoftMax_0 starts
Recognized number : 6

Test success with 0 error(s) !

Compile and run the app

mkdir build
cd build
cmake ..
make menuconfig # choose your options here
make run -j # run the app

Automatic Layer Code Generation

The low level code for the CNN layers has been generated using the Autotiler tool available in the SDK. The generator assumes that the data and weights are stored in the L2 memory (in the Fabric Controller). The available size of L1 shared memory for processing is specified to the Autotiler.

The data and weights are split into tiles, forming data sets that will be loaded through DMA into the L1 shared memory of the cluster. The cores process the data and, when finished, the results are stored back into L2. The tool takes the "high level" parameters of the layers as input and generates the low level functions describing the data management and the run-time functions to execute the layers in parallel on the eight cores of the cluster.

The Autotiler model can also describe how the layers are connected to each other (the graph structure). This allows it to compute the memory used at graph level by each tensor, according to its liveness in the graph.

The following files are automatically generated:

  • MnistKernels.c: contains the generated code for the layers, with the calls to run time DMA process and processor synchronization and the Graph functions to construct/destruct/run the graph

  • MnistKernels.h: contains the headers for the layer and graph functions

  • Mnist_L3_Flash_Const.h: file containing the constants used by the graph

How to automatically generate the MNIST CNN code with tiling

The input file containing the "high level" description of the layers used in MNIST is MnistModel.c. The following describes the different sections of the input file:

SetInlineMode(ALWAYS_INLINE);

This instruction specifies that a function will be generated for each occurrence of the layer. To generate a single parameterized function, the SetInlineMode function must be called with “SINGLE_INLINE” parameter.

The following instructions specify the names of the generated memory symbols (SetSymbolNames), the names of the library header files (SetUsedFilesNames), the names of the generated files (SetGeneratedFilesNames), and the amount of L1 memory available to the tiler generator (SetL1MemorySize).

SetSymbolNames("Mnist_L1_Memory", "Mnist_L2_Memory");
// Standard data types are used, we import CNN basic kernels
SetUsedFilesNames(0, 1, "CNN_BasicKernels.h");
// Auto Tiler generated files
SetGeneratedFilesNames("MnistKernels.c", "MnistKernels.h");
// L1 shared memory given to Auto Tiler
SetL1MemorySize(L1Memory);

There are some options that can be enabled at graph level, for instance trace execution, or generate warm construct/destruct functions. These options can be defined here and will be applied to the whole graph:

AT_SetGraphCtrl(AT_GRAPH_PRODUCE_NODE_NAMES, AT_OPT_ON);
AT_SetGraphCtrl(AT_GRAPH_TRACE_EXEC, AT_OPT_ON);
#ifdef WARM_GRAPH
AT_SetGraphCtrl(AT_GRAPH_WARM_CONSTRUCT, AT_OPT_VAL(3));
#endif

The kernel libraries are loaded for usage by the tiler:

LoadCNNLibrary();

The layer parameters are defined by the code below:

// 5x5 Convolution followed by ReLu and then by 2x2 Max pooling. Pure SW.
// 1 input plane [28x28], 32 output planes [12x12]
CNN_ConvolutionPoolReLU("Conv5x5ReLUMaxPool2x2_0",   0, 2,2,2,2, 0,0,0,0,   1,32,28,28,  KOP_CONV, 5,5, 1,1, 1,1, 0,  KOP_MAXPOOL,   2,2, 1,1, 2,2, 1, KOP_RELU);
// 5x5 Convolution followed by ReLu and then by 2x2 Max pooling. Pure SW.
// 32 input planes [12x12], 64 output planes [4x4]
CNN_ConvolutionPoolReLU("Conv5x5ReLUMaxPool2x2_1",   0, 2,2,2,2, 0,0,0,0,   32,64,12,12,  KOP_CONV, 5,5, 1,1, 1,1, 0,  KOP_MAXPOOL,   2,2, 1,1, 2,2, 1, KOP_RELU);
// Linear Layer, Input, Output and Coeffs on 16bits.
// Input 64 x [4x4], Output 10
CNN_LinearReLU         ("LinearLayerReLU_1",        0, 2,2,2,2, 0,0,0,0,   64*4*4,10, KOP_LINEAR, KOP_NONE);

Then there is the graph topology description, i.e. how the nodes are connected together and how input/outputs/constants are passed through the graph.

//Open Graph Creation
CreateGraph("MnistCNN",
/* Arguments either passed or globals */
//Here go the input, the output and the layer parameters (weights and biases)
//The ConstInfo function takes as input a tensor in CxHxW format and creates a single binary file loaded into the GAP9 flash
//It can take float input and automatically convert it to fixed point format
CArgs(8,
    TCArgInfo("short int *__restrict__", "Input0", ARG_SCOPE_ARG, ARG_DIR_IN, AT_MEM_L2, AT_MEM_UNDEF, 0),
    TCArgInfo("short int *__restrict__", "Step1Weights", ARG_SCOPE_GLOBAL, ARG_DIR_CONSTIN, AT_MEM_L3_DEFAULTFLASH, AT_MEM_UNDEF, ConstInfo("Step1Weights.tensor", 1, 1, 16, 0)),
    ....
);
);
//Node Connections with arguments
AddNode("Conv5x5ReLUMaxPool2x2_0", //Name of the Generated Layer
Bindings(4,                //Number of parameters of the generated layer that need to be connected
    //void Conv5x5ReLUMaxPool2x2_0(
    GNodeArg(GNA_IN, "Input0", 0),          //short int * __restrict__ In,
    GNodeArg(GNA_IN, "Step1Weights", 0),    //short int * __restrict__ Filter,
    GNodeArg(GNA_IN, "Step1Biases", 0),     //short int * __restrict__ Bias,
    GNodeArg(GNA_OUT, "OutputStep2", 0)    //short int * __restrict__ Out
)
);
....