Example Applications

Canny Edge Detection

Introduction

Canny Edge Detection is a popular edge detection algorithm. It is a technique to extract useful structural information from different vision objects and dramatically reduce the amount of data to be processed. It has been widely applied in various computer vision systems.

With this application example, we show how this algorithm can be implemented on GAP8. We compare the performance of the algorithm when run on 1, 2, 4 and 8 cores with the same image. You can disable this test mode and run the application on 8 cores only by setting the variable "CoreCountDynamic" to 0.

Thanks to GAP8's extended instruction set, a multiply-accumulate (MAC) operation on two 4-element vectors completes in a single cycle. To achieve maximum performance we exploit this by vectorizing the main operations of the algorithm.

For instance, the first step of the algorithm applies a 5 x 5 Gaussian filter to smooth the image; we use seven 4-element vectors (7 x 4 = 28 slots for the 25 coefficients) to represent the 5 x 5 filter. Please read the comments in the source code for further information.
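As a hedged illustration of the vectorized MAC pattern (the `v4s` type is a standard GCC vector extension; the builtin name and the guard macro below are assumptions about the PULP toolchain, so check the SDK headers for the exact intrinsic):

```c
typedef signed char v4s __attribute__((vector_size(4)));

/* Accumulate the dot product of two 4-element byte vectors into Acc.
   On GAP8 the intrinsic maps to a single-cycle sum-of-products
   instruction; the builtin name is an assumption, verify it in the SDK. */
static inline int Dot4Acc(v4s Pix, v4s Coeff, int Acc)
{
#ifdef __pulp__ /* guard macro is illustrative */
    return __builtin_pulp_sdotsp4(Pix, Coeff, Acc);
#else
    /* portable scalar fallback for host builds */
    for (int i = 0; i < 4; i++) Acc += Pix[i] * Coeff[i];
    return Acc;
#endif
}
```

With the 25 filter coefficients packed into seven such vectors, one output pixel of the 5 x 5 convolution costs seven of these MAC steps instead of 25 scalar multiply-adds.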

What is in this folder:

| Name | Description |
|------|-------------|
| CannyDetect.c | Source code of the application |
| Makefile | Makefile of this application |
| Mills.h | A pre-defined image buffer |
| README.md | This README file |

How to build and execute the application

Please configure your shell for the SDK by running the GAP command as indicated in the GAP8 SDK Getting Started Guide.

To run the example using the simulator type:

make clean all run

Result

The result shows the number of cycles used in each part of the algorithm when running on 1, 2, 4 and 8 cores.

An example of the output is:

Start of application
cluster master start
Canny Edge Detector running on 1 cores, Source Mono image[W=360, H=202]
Entering Master, W=360, H=202, 15 iterations, read stripes: 21 lines, wrote stripe: 13 lines
Conv5x5_Byte_Vectorial : 1984659 Cycles
EdgeIntensityAndOrientation_Vectorial : 3280079 Cycles
CannyRemoveNonMax : 2723484 Cycles
CannyBlobAnalysis : 1293926 Cycles
CannyRemoveWeakEdges : 116464 Cycles
Total : 9398612 Cycles
Total with Master : 9403548 Cycles
Canny Edge Detector running on 2 cores, Source Mono image[W=360, H=202]
Entering Master, W=360, H=202, 15 iterations, read stripes: 21 lines, wrote stripe: 13 lines
Conv5x5_Byte_Vectorial : 994220 Cycles
EdgeIntensityAndOrientation_Vectorial : 1643662 Cycles
CannyRemoveNonMax : 1365630 Cycles
CannyBlobAnalysis : 656683 Cycles
CannyRemoveWeakEdges : 58784 Cycles
Total : 4718979 Cycles
Total with Master : 4723521 Cycles
Canny Edge Detector running on 4 cores, Source Mono image[W=360, H=202]
Entering Master, W=360, H=202, 15 iterations, read stripes: 21 lines, wrote stripe: 13 lines
Conv5x5_Byte_Vectorial : 500909 Cycles
EdgeIntensityAndOrientation_Vectorial : 829353 Cycles
CannyRemoveNonMax : 689137 Cycles
CannyBlobAnalysis : 338531 Cycles
CannyRemoveWeakEdges : 29796 Cycles
Total : 2387726 Cycles
Total with Master : 2392387 Cycles
Canny Edge Detector running on 8 cores, Source Mono image[W=360, H=202]
Entering Master, W=360, H=202, 15 iterations, read stripes: 21 lines, wrote stripe: 13 lines
Conv5x5_Byte_Vectorial : 256015 Cycles
EdgeIntensityAndOrientation_Vectorial : 417930 Cycles
CannyRemoveNonMax : 348603 Cycles
CannyBlobAnalysis : 175752 Cycles
CannyRemoveWeakEdges : 15290 Cycles
Total : 1213590 Cycles
Total with Master : 1218337 Cycles

With a core frequency of 100 MHz, the 8-core run would execute in about 12 ms (1218337 cycles / 100 MHz ≈ 12.2 ms).

Multiscale Pedestrian Detection

Introduction

The pedestrian detection example included in the GAP8 SDK is implemented using a fixed-point implementation of Histograms of Oriented Gradients (HOG) (see citation 1 below). The machine learning scheme used to classify the HoG descriptor is a window-based boosting technique (see citation 2 below). Figure 1 shows an overview of the algorithm.

Figure 1 - Algorithm overview

The diagram shows the main steps of pedestrian detection on GAP8. First the image is sub-sampled multiple times to create a pyramid of images. Then the HOG features are computed on each layer of the pyramid. A fixed-size sliding window detector is run on each layer of the feature pyramid to determine whether the feature window belongs to a pedestrian. The detector compares the features against a pre-trained model using a fast boosted decision tree scheme.
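For intuition, a minimal scalar sketch of one pyramid level built by 2 x 2 averaging (the resampling scheme here is illustrative; the example may use a different scaling step):

```c
/* Halve an 8-bit grayscale image by averaging 2x2 pixel blocks. */
void Downsample2x2(const unsigned char *In, int W, int H, unsigned char *Out)
{
    int OW = W / 2, OH = H / 2;
    for (int y = 0; y < OH; y++)
        for (int x = 0; x < OW; x++) {
            int Sum = In[(2*y)*W + 2*x]   + In[(2*y)*W + 2*x + 1]
                    + In[(2*y+1)*W + 2*x] + In[(2*y+1)*W + 2*x + 1];
            Out[y*OW + x] = (unsigned char)(Sum >> 2); /* divide by 4 */
        }
}
```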

Histogram of Oriented Gradients (HoG)

Histogram of oriented gradients (HOG) is a well-known feature extraction algorithm. The algorithm is composed of two hierarchical parts. The image is divided into tiles (in this example 8 x 8 pixels) called cells, and on each cell a histogram of gradients is computed over 9 different angles, so the histogram has 9 bins.

Once this step is done for each cell, the cells are grouped into blocks (in this example 2 x 2 cells with a stride of one cell). Each block's feature is normalized, producing an output of 36 bins (4 * 9).
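To make the cell step concrete, here is a minimal scalar sketch of the 9-bin histogram for one interior cell (floating point for clarity; the example uses an optimized fixed-point implementation, and the exact binning scheme is an assumption):

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define CELL_SIZE 8
#define NBINS     9

/* Accumulate a 9-bin unsigned-orientation histogram for the cell at
   cell coordinates (cx, cy); assumes the cell does not touch the border. */
void CellHistogram(const unsigned char *Img, int W,
                   int cx, int cy, float Hist[NBINS])
{
    for (int b = 0; b < NBINS; b++) Hist[b] = 0.0f;
    for (int y = cy * CELL_SIZE; y < (cy + 1) * CELL_SIZE; y++)
        for (int x = cx * CELL_SIZE; x < (cx + 1) * CELL_SIZE; x++) {
            float Gx = (float)Img[y*W + x + 1] - (float)Img[y*W + x - 1];
            float Gy = (float)Img[(y+1)*W + x] - (float)Img[(y-1)*W + x];
            float Ang = atan2f(Gy, Gx);
            if (Ang < 0.0f) Ang += (float)M_PI;   /* unsigned gradient: 0..pi */
            int Bin = (int)(Ang * NBINS / (float)M_PI);
            if (Bin >= NBINS) Bin = NBINS - 1;
            Hist[Bin] += sqrtf(Gx*Gx + Gy*Gy);    /* magnitude-weighted vote */
        }
}
```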

For a more detailed explanation of the algorithm, refer to the citations at the end of this section.

Sliding Window Detector

The sliding window detector tests for the presence of a pedestrian at each location of the input feature with a given stride. The window is 128 x 64 pixels, i.e. 15 x 7 blocks of 36 bins each, which gives a feature size of 15 * 7 * 36 = 3780 values. The machine learning algorithm used to classify pedestrians is based on small boosted trees.

Algorithm configuration:

The detector is configured to recognize pedestrians from 96 x 48 pixels up to 176 x 88 pixels. This is achieved using 6 pyramid layers.

How to Build and Execute the Application

Please configure your shell for the SDK by running the GAP command as indicated in the GAP8 SDK Getting Started Guide.

To run the example type:

make clean all run

Change Input-Output Configuration

The example can run with two input-output configurations:

  1. Input and Output from file on host PC (default configuration)
  2. Input from Camera and output to LCD screen

There are no additional requirements for mode 1. For mode 2, the Himax camera must be connected to the CPI camera interface and an additional LCD screen must be mounted on the Arduino connector. The driver included in this example supports the 2.8" TFT LCD Shield based on the ILI9341 controller.

To select between configurations 1 and 2, a define must be changed in the file HoGEstim.c. To select configuration 1:

#define FROM_FILE 1

and to select configuration 2

#define FROM_FILE 0
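The define typically gates the I/O path at compile time. A hedged sketch of the pattern (the helper names below are hypothetical, not the SDK API):

```c
#if FROM_FILE
    /* configuration 1: read the test image from the host file system */
    LoadImageFromHost(ImageIn);            /* hypothetical helper */
#else
    /* configuration 2: grab a frame from the Himax camera over CPI
       and display the annotated result on the ILI9341 LCD */
    CameraGrabFrame(ImageIn);              /* hypothetical helper */
    LcdDisplayFrame(ImageOut);             /* hypothetical helper */
#endif
```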

After changing the configuration, the project must be rebuilt and run:

make clean all run

Change Input Image

To change the input image, change the name of the input file in the main.c file:

char *Imagefile = "MyNewImage.ppm";

Currently only images of size 324 x 244 are supported.

Citations

  1. Dalal, Triggs, "Histograms of Oriented Gradients for Human Detection", Link
  2. Y. Freund, R.E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting", Link

CIFAR10 image recognition

Introduction

This document describes a CIFAR10 image recognition application based on a convolutional neural network (CNN) on the GAP8 processor. The example can be compiled and executed on the virtual model of GAP8 available in the SDK.

In the example, 32x32 pixel images are stored in the L2 memory of the fabric controller. The images are then processed by the CNN on the 8 cores of the cluster. The CNN classifies the images into 10 categories, selecting the maximum confidence value at the output as the inferred category.
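Selecting the inferred category amounts to an argmax over the 10 output confidences; a minimal sketch (the helper name and the 16-bit output type are assumptions):

```c
/* Return the index of the largest confidence value,
   e.g. InferredCategory(Out, 10) for CIFAR10. */
static int InferredCategory(const short *Confidence, int NClasses)
{
    int Best = 0;
    for (int i = 1; i < NClasses; i++)
        if (Confidence[i] > Confidence[Best]) Best = i;
    return Best;
}
```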

The network topology (number of layers, number of inputs and outputs for the layers) has been taken from an open source project on the Internet. In this example, we demonstrate the implementation of this network on GAP8 using adapted data management through DMA and optimized parallelized filter kernels that take advantage of the 8 cores of the cluster.

The CNN has been trained off-line using the CIFAR10 dataset. The recognition ratio reached after training is 68%, which is very close to the asymptotic ratio announced by the authors. More sophisticated topologies can achieve a better rate, but this example is intended as a simple tutorial on CNN implementation principles for the GAP8 architecture. Specifically, it demonstrates how to perform data management between L2 memory in the fabric controller and the L1 cluster shared memory so that no memory overflow occurs while keeping all the compute resources busy. It also shows how the basic filter kernels (2D convolution filter and linear filters) can be described to use the 8 cores of the cluster efficiently so as to reduce computation time and hence power consumption of the application.

In the next sections, we will describe the structure of the example directory, the structure of the CNN and its software components and finally how to run the application on the GAP8 simulator.

Structure of the directory

In the GAP8 installation repository, the cifar10 example is located in:

examples/pulp-examples/autotiler_examples/Cifar10

The directory contains the following files:

Application files:

| Name | Description |
|------|-------------|
| Cifar10.c | Application entry code |
| Makefile | Makefile of the application |
| CifarCoeff.def | CNN coefficients |
| CifarData.h | Test image |

Auto-tiler files:

| Name | Description |
|------|-------------|
| Cifar10Model.c | Auto-tiler model |

Automatic layer code generation

Cifar10 with automatic tiling generation

For this cifar10 example, the low level code for the CNN layers (convolutional and dense linear layers) has been generated using the auto-tiling tool available in this SDK.

The generator assumes that the data and weights are stored in the L2 memory (in the fabric controller). The available size of L1 shared memory for processing is specified to the tiler. The data and weights are split into tiles, forming data sets that are loaded through DMA into the L1 shared memory of the cluster. The cores process the data and, when finished, the results are stored back into L2. The tool takes the "high level" parameters of the layers as input and generates the low-level functions describing the data management and the synchronization run-time functions needed to execute the layers in parallel on the eight cores of the cluster.

Data transfers and processing are pipelined, so the DMA transfers and the data processing can proceed in parallel on different data sets to optimize the execution time.
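The pipelining follows the classic double-buffering pattern. A hedged skeleton (dma_load/dma_wait, L1Buf and the tile helpers are placeholders, not the actual generated run-time calls):

```c
/* While the cores process tile i in one L1 buffer, the DMA prefetches
   tile i+1 from L2 into the other buffer, then the buffers swap roles. */
for (int i = 0; i < NTiles; i++) {
    int Cur = i & 1, Nxt = Cur ^ 1;
    if (i + 1 < NTiles)
        dma_load(L1Buf[Nxt], L2TileAddr(i + 1), TileSize); /* prefetch */
    ProcessTileOnCluster(L1Buf[Cur]);                      /* compute  */
    dma_wait();                                            /* join DMA */
}
```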

The following files are automatically generated:

  • Cifar10Kernels.c : contains the generated code for the layers, with the calls to the run-time DMA and processor-synchronization functions
  • Cifar10Kernels.h : contains the headers for the ConvLayer and dense-layer functions
  • Cifar10KernelsInit.c : contains the parameters for the DMA transfers and processing, including the tiling management and pipelining
  • Cifar10KernelsInit.h : contains the data-type definitions for the parameter structures

How to automatically generate the CIFAR10 CNN code with tiling

The input file containing the "high level" description of the layers used in cifar10 is Cifar10Model.c.

Following is the description of the different sections in the input file:

SetInlineMode(SINGLE_INLINE);

This instruction specifies that a single function is generated even if the layer is instantiated several times.

In this case, a structure containing the execution parameters is passed to the function; the parameters are defined in the CnnKernelsInit.c file. To generate one function per call, pass "ALWAYS_INLINE" to the SetInlineMode function.

The following instructions specify the name of the library header files (SetUsedFilesNames), the names of the generated files (SetGeneratedFilesNames), and the declaration of the amount of L1 memory to be used by the auto-tiler generator.

SetUsedFilesNames("KernelLibStdTypes.h",1 , "CNN_BasicKernels.h");
SetGeneratedFilesNames("CnnKernelsInit.c", "CnnKernelsInit.h", "CnnKernels.c", "CnnKernels.h");
SetL1MemorySize(L1Memory);

The kernel libraries are loaded for usage by the tiler:

CNN_LoadSoftwareKernelLibrary();
CNN_LoadHWCEKernelLibrary();

The layer parameters are defined by the code below:

// cifar10 config
CNN_TiledConvNxNReLUPool2x2_SW_fp("ConvLayer", 5, 1, 8, 32, 32, 3);
CNN_TiledConvNxNReLUPool2x2_SW_fp("ConvLayer", 5, 8, 12, 14, 14, 3);
CNN_TiledLinearLayer ("Dense", 12, 5, 5, 10, 1, 0, 0);

Compile and run the application

Single execution

To execute the application on the specific image contained in RGB.img, just type the following in the cifar10 directory:

make clean all run

The output of the program is a set of 10 soft confidence values, one per category. The maximum value indicates the category detected by the CNN.

Start execution on GAP8
FC Launched
Allocating 3136: Ok
Allocating 600: Ok
Allocating 20: Ok
Layer0: 41003 Cycles, 34691 instruments
Layer1: 82442 Cycles, 74336 instruments
Layer2: 1421 Cycles, 412 instruments
Test success
Loop exited
commands completed

Face Detection

Introduction

The face detection example included in the GAP8 SDK is based on the Viola-Jones face detection algorithm.

Algorithm kernels:

  1. Image Resize
  2. Integral Image (a scalar sketch follows this list)
  3. Squared Integral Image
  4. Cascade Detector
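As a concrete illustration of kernel 2, a minimal scalar integral-image sketch (the example itself uses an optimized, parallelized version):

```c
/* II[y*W + x] holds the sum of all pixels above and to the left of
   (x, y), inclusive; computed row by row with a running sum. */
void IntegralImage(const unsigned char *In, unsigned int *II, int W, int H)
{
    for (int y = 0; y < H; y++) {
        unsigned int RowSum = 0;
        for (int x = 0; x < W; x++) {
            RowSum += In[y*W + x];
            II[y*W + x] = RowSum + (y > 0 ? II[(y-1)*W + x] : 0);
        }
    }
}
```

The squared integral image (kernel 3) follows the same recurrence with each pixel value squared; together they let the cascade evaluate box sums and variances in constant time per window.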

Algorithm configuration:

The detector is configured to detect faces from 64 x 48 px down to 40 x 30 px. This is achieved using 3 pyramid layers.

How to Build and Execute the Application

Please configure your shell for the SDK by running the GAP command as indicated in the GAP8 SDK Getting Started Guide.

To run the example type:

make clean all run

Change Input-Output Configuration

The example can run with two input-output configurations:

  1. Input and Output from file on host PC (default configuration)
  2. Input from Camera and output to LCD screen

There are no additional requirements for mode 1. For mode 2, the Himax camera must be connected to the CPI camera interface and an additional LCD screen must be mounted on the Arduino connector. The driver included in this example supports the 2.8" TFT LCD Shield based on the ILI9341 controller.

To select between configurations 1 and 2, a define must be changed in the file main.c. To select configuration 1 (default) set the macro:

#define FROM_FILE 1

while to select configuration 2:

#define FROM_FILE 0

After changing the configuration, the project must be rebuilt and run:

make clean all run

Change Input Image

To change the input image, change the name of the input file in the main.c file:

char *Imagefile = "MyNewImage.ppm";

Currently only images of size 324 x 244 are supported.

Pedestrian Detection (HoG)

Introduction

The pedestrian detection example included in the GAP8 SDK is implemented using a fixed-point implementation of Histograms of Oriented Gradients (HOG) (see citation 1 below). The machine learning scheme used to classify the HoG descriptor is a window-based boosting technique (see citation 2 below). Figure 1 shows an overview of the algorithm.

Figure 1 - Algorithm overview

The diagram shows the main steps of pedestrian detection on GAP8. First the image is sub-sampled multiple times to create a pyramid of images. Then the HOG features are computed on each layer of the pyramid. A fixed-size sliding window detector is run on each layer of the feature pyramid to determine whether the feature window belongs to a pedestrian. The detector compares the features against a pre-trained model using a fast boosted decision tree scheme.

Histogram of Oriented Gradients (HoG)

Histogram of oriented gradients (HOG) is a well-known feature extraction algorithm. The algorithm is composed of two hierarchical parts. The image is divided into tiles (in this example 8 x 8 pixels) called cells, and on each cell a histogram of gradients is computed over 9 different angles, so the histogram has 9 bins.

Once this step is done for each cell, the cells are grouped into blocks (in this example 2 x 2 cells with a stride of one cell). Each block's feature is normalized, producing an output of 36 bins (4 * 9).

For a more detailed explanation of the algorithm, refer to the citations at the end of this section.

Sliding Window Detector

The sliding window detector tests for the presence of a pedestrian at each location of the input feature with a given stride. The window is 128 x 64 pixels, i.e. 15 x 7 blocks of 36 bins each, which gives a feature size of 15 * 7 * 36 = 3780 values. The machine learning algorithm used to classify pedestrians is based on small boosted trees.

Structure of the directory

Application files:

| Name | Description |
|------|-------------|
| Main.c | Application entry code |
| Makefile | Makefile of the application |
| HoGParameters.h | HOG configuration parameters |
| Pedestrian.ppm | A sample input image |
| HoGIO {.c .h} | PPM/PGM image format handling |
| HyperFlashFS {.c .h} | Flash API to load the image |
| helper.h | I/O helper functions |
| BoostModel.h | Boosting model |
| Boosting.h | Boosting typedefs |
| README.md | This README file |

Auto-tiler files:

| Name | Description |
|------|-------------|
| HoGBoostingModelExample.c | Generator caller |
| HoGBoostingGenerator {.c .h} | Auto-tiler model |
| HoGBasicKernels {.c .h} | HOG basic kernels |
| HoGBoostingBasicKernels {.c .h} | Boosting basic kernels |

How to Build and Execute the Application

Please configure your shell for the SDK by running the GAP command as indicated in the GAP8 SDK Getting Started Guide.

To run the example using the simulator type:

make clean all run

The makefile calls the auto-tiler to generate the kernel calls and then runs the application on the simulator.

Auto-Tiler Code Generation

To generate the code for the auto-tiler we need to consider the data flow of the algorithm and how it can be partitioned into tiles.

The application is composed of two main parts: the HoG feature computation and the detection performed by a boosting algorithm on several feature images. These two parts are presented in the following sections.

HOG

Auto-tiler Model

The HoG feature extraction model is divided into two user kernels grouped in a kernel group. We explain this architecture below.

HoG works on two different tile sizes. The first is the cell, usually (but not limited to) 8 x 8 pixels, on which the histograms are computed. The second is the block, a group of cells (usually but not limited to 2 x 2 cells, so 16 x 16 px). To obtain a HoG feature we need to compute the histogram for each cell and then group the cells into blocks.

There are two main ways to compute the HoG feature from an image. The first is to create two separate user kernels: one that computes all the cells and stores them in memory, and one that then computes the blocks and the HoG feature. The second is to compute the blocks while the cells are being computed. The latter is faster in terms of cycles and requires less memory, and thus fewer memory transfers. This is the solution described here.

The tiler, by its nature, works on a fixed-size portion of the input image in each loop of each user kernel and produces a fixed-size output, so the relationship between input and output is fixed. To compute the blocks while computing the cells we need a start-up user kernel, which we call Prime. This user kernel computes the first n lines of cells, where n is the vertical size of a block in cells. Then, in the body, for each new line of cells we can compute a new line of blocks and so a new chunk of the output features.

Prime and Body User Kernel

The prime kernel is in charge of computing the first two lines of cells and then, in the epilog, the first line of blocks. After this initialization step the body user kernel is called. The body user kernel computes a line of cells and a line of blocks, producing a chunk of the output feature in each loop.

Basic Kernels

HoG is composed of two basic kernels:

  1. KerProcessCellLine, in charge of computing a line of Cells.

Lines from the input image (In) come in full-width (W) strips; the height of each strip is CellLineCount * HOG_CELLSIZE + 2, where the extra 2 lines come from the Sobel filter extension.

Before calling the function that computes the cells (ProcessOneCell), the kernel reorders the output lines. LineCellArray contains HOG_BLOCK_SIZE evaluated lines of cells. Each time a new line is evaluated it is pushed into LineCellArray; if CellLineIndex is >= HOG_BLOCK_SIZE, all the lines are first shifted down by one position. A sketch of this update follows.
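A hedged sketch of that update, treating LineCellArray as a small FIFO of line pointers (types and names follow the description above but are illustrative):

```c
#define HOG_BLOCK_SIZE 2          /* block height in cells (example value) */

typedef unsigned short CellBin;   /* illustrative cell-histogram element */

/* Push a newly evaluated line of cells; once the array is full, shift
   the older lines down by one position before appending. */
void PushCellLine(CellBin *LineCellArray[], int *CellLineIndex, CellBin *NewLine)
{
    if (*CellLineIndex >= HOG_BLOCK_SIZE) {
        for (int i = 1; i < HOG_BLOCK_SIZE; i++)
            LineCellArray[i - 1] = LineCellArray[i];
        LineCellArray[HOG_BLOCK_SIZE - 1] = NewLine;
    } else {
        LineCellArray[(*CellLineIndex)++] = NewLine;
    }
}
```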

  2. KerProcessBlockLine, in charge of computing a line of blocks and the corresponding output feature.

Before producing the output feature, the kernel normalizes all the cells of the block. This is implemented in the ProcessOneBlock function.
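A minimal sketch of the normalization step, using an L2 norm in floating point for clarity (ProcessOneBlock in the example works in fixed point, and the exact norm used is an assumption):

```c
#include <math.h>

#define BLOCK_BINS 36   /* 2 x 2 cells x 9 bins */

void NormalizeBlock(const float In[BLOCK_BINS], float Out[BLOCK_BINS])
{
    float Sum = 1e-6f;  /* small epsilon avoids division by zero */
    for (int i = 0; i < BLOCK_BINS; i++) Sum += In[i] * In[i];
    float Inv = 1.0f / sqrtf(Sum);
    for (int i = 0; i < BLOCK_BINS; i++) Out[i] = In[i] * Inv;
}
```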

Boosting Window Detection

The boosting window detection passes over each fixed-size window of the feature output (in this example 128 x 64 px, i.e. 15 x 7 blocks), checking the response of the boosting model.

Auto-tiler Model

The auto-tiler model for this kernel is quite simple: a single user kernel takes the feature and the model as inputs and returns a response map, i.e. the response of the model on each window. The user kernel is tiled horizontally to save dynamic memory.

Basic Kernels

The boosting window detection is composed of a single basic kernel, KerBoost. This kernel receives one vertical slice of the feature map at a time and computes the model's response for each window.
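For intuition, a hedged sketch of a boosted response built from depth-1 trees (stumps); the real model layout lives in BoostModel.h and will differ:

```c
typedef struct {
    int   FeatIdx;       /* which feature bin the stump looks at */
    short Thresh;        /* split threshold */
    short Left, Right;   /* votes for the two branches */
} Stump;

/* Sum the weak-learner votes over one window's feature vector;
   the caller compares the score against a detection threshold. */
int WindowResponse(const short *Feat, const Stump *Model, int NStumps)
{
    int Score = 0;
    for (int i = 0; i < NStumps; i++)
        Score += (Feat[Model[i].FeatIdx] < Model[i].Thresh)
                 ? Model[i].Left : Model[i].Right;
    return Score;
}
```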

Main code: Image Loading and Kernel Calls

Image loading is done with the HoGIO API:

unsigned char *ReadImageFromFile(char *ImageName, unsigned int *W, unsigned int *H,
unsigned char *InBuffer, unsigned int BuffSize);

This function takes care of reading an image by calling the HyperFlashFS API.
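A typical call might look like this (the buffer size is illustrative):

```c
#define MAX_IMG_BYTES (324 * 244)            /* illustrative capacity */
static unsigned char ImageBuffer[MAX_IMG_BYTES];

/* inside main(): */
unsigned int W, H;
unsigned char *Img = ReadImageFromFile("Pedestrian.ppm", &W, &H,
                                       ImageBuffer, sizeof(ImageBuffer));
if (Img == 0) { /* handle the load failure */ }
```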

All the kernel calls are in the function:

int ProcessOneLevel(unsigned char *ImageIn, unsigned int W, unsigned int H,
unsigned short *HoGFeatures, int *EstimOut, Kernel_T *ConfHOG, Kernel_T *ConfHOGEstim)

Change Input Image

To change the input image, change the name of the input file in the main.c file:

char *ImageName = "MyNewImage.ppm";

Citations

  1. Dalal, Triggs, "Histograms of Oriented Gradients for Human Detection", Link
  2. Y. Freund, R.E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting", Link

MNIST image recognition

Introduction

This document describes an MNIST handwritten digit recognition application based on a convolutional neural network (CNN) on the GAP8 processor.

In the example, 28x28 pixel images are loaded from the host PC into the L2 memory of the fabric controller. The images are then processed by the CNN on the 8 cores of the cluster. The CNN classifies the images into 10 categories (digits from 0 to 9), selecting the maximum confidence value at the output as the inferred category.

The network topology (number of layers, number of inputs and outputs for the layers) has been taken from an open source project on the Internet. In this example, we demonstrate the implementation of this network on GAP8 using adapted data management through DMA and optimized parallelized filter kernels that take advantage of the 8 cores of the cluster.

The CNN has been trained off-line using the MNIST dataset. The recognition ratio reached after training is 96%, which is very close to the asymptotic ratio announced by the authors. More sophisticated topologies can achieve a better rate, but this example is intended as a simple tutorial on CNN implementation principles for the GAP8 architecture. Specifically, it demonstrates how to perform data management between L2 memory in the fabric controller and the L1 cluster shared memory so that no memory overflow occurs while keeping all the compute resources busy. It also shows how the basic filter kernels (2D convolution filter and linear filters) can be described to use the 8 cores of the cluster efficiently so as to reduce computation time and hence power consumption of the application.

In the next sections, we will describe the structure of the example directory, the structure of the CNN and its software components and finally how to run the application on the GAP8 simulator.

Structure of the directory

In the GAP8 installation repository, the mnist example is located in examples/pulp-examples/autotiler_examples/Mnist

Application files:

| Name | Description |
|------|-------------|
| Mnist.c | Application entry code |
| Makefile | Makefile of the application |
| MnistCoeffs.def | CNN coefficients |
| MnistCoeffs_HWCE.def | CNN HWCE coefficients |
| 136.ppm | A sample input image |
| ImgIO {.c .h} | PPM/PGM image format handling |
| test_img/ | Additional test images |

Auto-tiler files:

| Name | Description |
|------|-------------|
| MnistModel.c | Auto-tiler model |

Automatic Layer Code Generation

The low level code for the CNN layers has been generated using the auto-tiler tool available in the SDK. The generator assumes that the data and weights are stored in the L2 memory (in the Fabric Controller). The available size of L1 shared memory for processing is specified to the auto-tiler.

The data and weights are split into tiles, forming data sets that are loaded through DMA into the L1 shared memory of the cluster. The cores process the data and, when finished, the results are stored back into L2. The tool takes the "high level" parameters of the layers as input and generates the low-level functions describing the data management and the run-time functions needed to execute the layers in parallel on the eight cores of the cluster.

In the MNIST example, a different version is generated for each instance, and the parameters are hard-coded in the function.

The following files are automatically generated:

  • MnistKernels.c : contains the generated code for the layers, with the calls to the run-time DMA and processor-synchronization functions
  • MnistKernels.h : contains the headers for the ConvLayer and dense-layer functions
  • MnistKernelsInit.c : contains the parameters for the DMA transfers and processing, including the tiling management and pipelining
  • MnistKernelsInit.h : contains the data-type definitions for the parameter structures

How to automatically generate the MNIST CNN code with tiling

The input file containing the "high level" description of the layers used in MNIST is MnistModel.c. Following is the description of the different sections in the input file:

SetInlineMode(ALWAYS_INLINE);

This instruction specifies that a function is generated for each occurrence of the layer. To generate a single parameterized function, call SetInlineMode with the "SINGLE_INLINE" parameter instead.

The following instructions specify the name of the library header files (SetUsedFilesNames), the names of the generated files (SetGeneratedFilesNames), and the declaration of the amount of L1 memory to be used by the tiler generator.

SetUsedFilesNames(0, 1, "CNN_BasicKernels.h");
SetGeneratedFilesNames("MnistKernelsInit.c", "MnistKernelsInit.h", "MnistKernels.c", "MnistKernels.h");
SetL1MemorySize(L1Memory);

The kernel libraries are loaded for usage by the tiler:

LoadCNNLibrary();
CNN_LoadHWCEKernelLibrary();

The layer parameters are defined by the code below:

// MNIST HWCE
CNN_TiledConvNxNReLUPool2x2_HWCE_fp ("Conv5x5ReLUMaxPool2x2_HWCE_0", 5, 1, 32, 28, 28, 1);
CNN_TiledConvNxNReLUPool2x2_HWCE_fp ("Conv5x5ReLUMaxPool2x2_HWCE_1", 5, 32, 64, 12, 12, 1);
CNN_LinearLayerReLU ("LinearLayerReLU_1", HALF_WORD,HALF_WORD,HALF_WORD,HALF_WORD, FROM_L2,FROM_L2,FROM_L2,TO_L2, 64*4*4, 10, NO_RELU);
// MNIST SOFT
CNN_SmallParOutFeatConvolutionPoolReLU("Conv5x5ReLUMaxPool2x2_0", HALF_WORD,HALF_WORD,HALF_WORD,HALF_WORD, FROM_L2, FROM_L2, FROM_L2, TO_L2, 1, 32, 28, 28, 5, 1, NO_PADDING, NO_RELU, 2, 2, NO_PADDING, RELU, MAX_POOLING);
CNN_SmallParOutFeatConvolutionPoolReLU("Conv5x5ReLUMaxPool2x2_1", HALF_WORD,HALF_WORD,HALF_WORD,HALF_WORD, FROM_L2, FROM_L2, FROM_L2, TO_L2, 32, 64, 12, 12, 5, 1, NO_PADDING, NO_RELU, 2, 2, NO_PADDING, RELU, MAX_POOLING);
CNN_LinearLayerReLU ("LinearLayerReLU_1", HALF_WORD,HALF_WORD,HALF_WORD,HALF_WORD, FROM_L2,FROM_L2,FROM_L2,TO_L2, 64*4*4, 10, NO_RELU);

The files generated from the MnistModel.c present in the MNIST directory are placed in the example root directory.

Compile and run the application

To execute the application on the specific image contained in RGB.img, just type the following in the MNIST directory:

make clean all run

If you want to execute the code on the Hardware Convolutional Engine (HWCE) you can change this line in the Makefile:

USE_HARDWARE_CE = -DRT_HAS_HWCE=1

The output of the program is a set of 10 soft confidence values, one per category. The maximum value indicates the category detected by the CNN.

Entering main controller
Image ../../../test_img/4/1301.pgm, [W: 28, H: 28], Gray, Size: 784 bytes, Loaded sucessfully
Allocating 36864: Ok
Allocating 2048: Ok
Allocating 20: Ok
Recognized: 4
Layer0: 69756 Cycles
Layer1: 747266 Cycles
Layer2: 9702 Cycles
Total: 826724 Cycles
Test success
Loop exited
commands completed