MatMul Autotiler

This example computes a matrix multiplication in several data types (float32, float16 and 16-bit fixed point). The kernel code is generated by the Autotiler to maximize performance.

The matrices are initially stored in L2 memory. The generated code tiles them and double-buffers the data movement between L2 and L1. To check the results, the main code also computes a sequential, non-optimized MatMul on the FC core.
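The sequential check described above can be sketched as a plain triple loop. This is only an illustration, not the code shipped with the example: the function name `matmul_ref_f32` is hypothetical, and row-major storage with Mat1 of size H_M1 x W_M1 and Mat2 of size W_M1 x W_M2 is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference MatMul (not the example's actual checker).
 * Assumes row-major storage:
 *   m1  is h_m1 rows x w_m1 columns
 *   m2  is w_m1 rows x w_m2 columns
 *   out is h_m1 rows x w_m2 columns */
void matmul_ref_f32(const float *m1, const float *m2, float *out,
                    size_t w_m1, size_t h_m1, size_t w_m2)
{
    assert(m1 && m2 && out);
    for (size_t i = 0; i < h_m1; i++) {
        for (size_t j = 0; j < w_m2; j++) {
            float acc = 0.0f;
            /* Dot product of row i of m1 with column j of m2. */
            for (size_t k = 0; k < w_m1; k++)
                acc += m1[i * w_m1 + k] * m2[k * w_m2 + j];
            out[i * w_m2 + j] = acc;
        }
    }
}
```

A result produced by the generated kernel could then be compared element by element against `out`, with a small tolerance for the floating point types.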

The example covers both a standard matrix multiplication (float32, float16 and 16-bit fixed point) and a complex matrix multiplication (float32 and float16) between two complex matrices. Complex elements are stored interleaved, in this order: real0, imag0, real1, imag1, …
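To make the interleaved layout concrete, here is a minimal sketch of a sequential complex MatMul over buffers laid out as real0, imag0, real1, imag1, … The function name `cplx_matmul_ref_f32` is hypothetical, and row-major ordering of the complex elements is an assumption; it is not the Autotiler kernel itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference complex MatMul (illustration only).
 * Each complex element occupies two consecutive floats: {real, imag}.
 * Assumes row-major ordering of the complex elements:
 *   m1  is h_m1 x w_m1, m2 is w_m1 x w_m2, out is h_m1 x w_m2. */
void cplx_matmul_ref_f32(const float *m1, const float *m2, float *out,
                         size_t w_m1, size_t h_m1, size_t w_m2)
{
    assert(m1 && m2 && out);
    for (size_t i = 0; i < h_m1; i++) {
        for (size_t j = 0; j < w_m2; j++) {
            float re = 0.0f, im = 0.0f;
            for (size_t k = 0; k < w_m1; k++) {
                const float *a = &m1[2 * (i * w_m1 + k)];
                const float *b = &m2[2 * (k * w_m2 + j)];
                /* (a_re + i*a_im) * (b_re + i*b_im) */
                re += a[0] * b[0] - a[1] * b[1];
                im += a[0] * b[1] + a[1] * b[0];
            }
            out[2 * (i * w_m2 + j)]     = re;
            out[2 * (i * w_m2 + j) + 1] = im;
        }
    }
}
```

Note that each buffer holds twice as many floats as the matrix has elements, which matters when sizing the L2 allocations.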

Both the standard and the complex MatMul come in a normal mode and a transposed mode. In the transposed mode the second matrix is stored already transposed, so the basic kernel does not have to transpose it and runs faster, especially for small matrices.
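The effect of the transposed mode on indexing can be illustrated with a sequential sketch. The function name `matmul_ref_transposed_f32` is hypothetical and row-major storage is an assumption. With Mat2 stored transposed (W_M2 rows of W_M1 elements), both operands of the inner product are read from contiguous memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference MatMul with a pre-transposed second matrix.
 * Assumes row-major storage:
 *   m1  is h_m1 x w_m1
 *   m2t is the transpose of Mat2, stored as w_m2 rows x w_m1 columns
 *   out is h_m1 x w_m2 */
void matmul_ref_transposed_f32(const float *m1, const float *m2t, float *out,
                               size_t w_m1, size_t h_m1, size_t w_m2)
{
    assert(m1 && m2t && out);
    for (size_t i = 0; i < h_m1; i++) {
        const float *row1 = &m1[i * w_m1];
        for (size_t j = 0; j < w_m2; j++) {
            /* Row j of m2t is column j of the original Mat2. */
            const float *row2 = &m2t[j * w_m1];
            float acc = 0.0f;
            for (size_t k = 0; k < w_m1; k++)
                acc += row1[k] * row2[k];
            out[i * w_m2 + j] = acc;
        }
    }
}
```

The output is identical to the normal mode; only the storage of the second operand changes.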

How to run

To run the code with Mat1 [64x64] and Mat2 [64x64] -> Out [64x64]:

mkdir build && cd build && cmake ../
make run -j

To change the size of the matrices, modify the following lines in the CMakeLists.txt file, or set them via menuconfig:

list(APPEND MATMUL_OPT_FLAG -DW_M1=64
                            -DH_M1=64
                            -DW_M2=64)
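Only three dimensions are configurable because a matrix product fixes the others: the height of Mat2 must equal the width of Mat1, and the output takes its height from Mat1 and its width from Mat2. A minimal sketch of how C code might derive the remaining sizes from these defines follows. The names `H_M2`, `W_OUT`, `H_OUT` and `matmul_total_elems` are assumptions made for illustration and may not match the example's sources.

```c
#include <assert.h>
#include <stddef.h>

/* Defaults mirror the values passed by CMakeLists.txt in this README. */
#ifndef W_M1
#define W_M1 64
#endif
#ifndef H_M1
#define H_M1 64
#endif
#ifndef W_M2
#define W_M2 64
#endif

/* Derived dimensions (hypothetical names, not taken from the example). */
#define H_M2  W_M1   /* inner dimensions must match */
#define W_OUT W_M2
#define H_OUT H_M1

static_assert(W_M1 > 0 && H_M1 > 0 && W_M2 > 0,
              "matrix dimensions must be positive");

/* Total number of elements across Mat1, Mat2 and Out (real-valued case). */
static size_t matmul_total_elems(void)
{
    return (size_t)W_M1 * H_M1 + (size_t)W_M2 * H_M2 + (size_t)W_OUT * H_OUT;
}
```

With the default 64x64 inputs this amounts to three matrices of 4096 elements each.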

Via menuconfig you can also choose to test the complex matrix multiplication available in the Autotiler library.