Implementing a large matrix-matrix multiplication on FPGA

Approach

Using divide-and-conquer techniques to describe the matrix multiplication algorithm and then using SDSoC for high-level synthesis

Benefits

High-performance implementation, short time-to-market design

Credit

This work has been done under the ENPOWER project (funded by EPSRC) at the University of Bristol.

Matrix multiplication is one of the operators that have a wide range of applications in image processing, scientific computing, simulation, robotics, and so on. Therefore, providing a fast speed implementation using CPU, GPU or FPGA has always been a challenge.

Here, I briefly explain how to implement this operator on FPGA.

As FPGAs have limited resources in terms of internal memory or logics, transferring all the data to an FPGA and then performing the multiplication is not possible. Therefore, the FPGA should collaborate with the main memory to complete the task. However, the low latency of the data access in the main memory is the main bottleneck in this collaboration. To tackle this problem, the approach explained here tries to minimize this collaboration overhead.

The key idea is using the divide-and-conquer technique. The following figure shows this approach by horizontally and vertically dividing matrix A and B, respectively.

It implements without problems but after running the software and hardware results do not match:

Exeution time running matrix multiplication in software: 5.85704e+06 msec
Error at index=256 sw = 1021.82, hw=0
Speed up: 4974.81
TEST FAILED
Bye Matrix Mult

Please enclose expressions defined by macros (e.g, #define) in parenthesis, then everything would be OK. and the output would be as follows
——————————————–
sh-4.3# ./large_latmux_sdsoc_linux_new.elf
Hello Matrix Mult
floating point matrix multiplication
Exeution time running matrix multiplication in hardware: 300918 msec
Exeution time running matrix multiplication in software: 6.04064e+06 msec
Speed up: 20.074
TEST PASSED
Bye Matrix Mult
sh-4.3#
———————————————-
….
#define N (1024*4)
#define M (1024*4)
#define P (1024*4)

#define A_HEIGHT N
#define A_WIDTH M

#define B_HEIGHT M
#define B_WIDTH P

#define C_HEIGHT N
#define C_WIDTH P

#define A_HEIGHT_BLOCK (512/4)
#define B_WIDTH_BLOCK (128/4)
…..
Thanks,
Mohammad

For some reason when I enclose in parenthesis compilation fails in my sdsoc 2016.2 but if I just write the number directly so 4096 instead of (1024*4) it compiles and works.

I have realised this problem in the SDSoC 2016.2 before.
If you add “#pragma SDS data sys_port(A:ACP, B:ACP, C:ACP)” (or similar pragma) before the “mmult_accel” function prototype in the matrix_mult.h then it can be compiled even with complex expressions passed to the zero_copy pragma.
It seems the new version of the SDSoC cannot decide on the memory port when the array size in the zero_copy pragma is an expression. Probably they will correct that in future versions.

Hi.
I was trying your method.
What is the difference between standalone and under linux code?
Is there any difference in approach or method used?
Please explain.

Hi,
Conceptually, both are the same. However, in Linux we had a limitation for allocating the cached contiguous memory to be accessed in the accelerator. It is possible to use the standalone version in the Linux if cache coherency is not necessary.

Jose Nunez-YanezI have tried with a matrix with size 4096×4096. I changed the parameters like:

#define N 1024*4

#define M 1024*4

#define P 1024*4

#define A_HEIGHT N

#define A_WIDTH M

#define B_HEIGHT M

#define B_WIDTH P

#define C_HEIGHT N

#define C_WIDTH P

#define A_HEIGHT_BLOCK 512/4

#define B_WIDTH_BLOCK 128/4

It implements without problems but after running the software and hardware results do not match:

Exeution time running matrix multiplication in software: 5.85704e+06 msec

Error at index=256 sw = 1021.82, hw=0

Speed up: 4974.81

TEST FAILED

Bye Matrix Mult

MohammadPlease enclose expressions defined by macros (e.g, #define) in parenthesis, then everything would be OK. and the output would be as follows

——————————————–

sh-4.3# ./large_latmux_sdsoc_linux_new.elf

Hello Matrix Mult

floating point matrix multiplication

Exeution time running matrix multiplication in hardware: 300918 msec

Exeution time running matrix multiplication in software: 6.04064e+06 msec

Speed up: 20.074

TEST PASSED

Bye Matrix Mult

sh-4.3#

———————————————-

….

#define N (1024*4)

#define M (1024*4)

#define P (1024*4)

#define A_HEIGHT N

#define A_WIDTH M

#define B_HEIGHT M

#define B_WIDTH P

#define C_HEIGHT N

#define C_WIDTH P

#define A_HEIGHT_BLOCK (512/4)

#define B_WIDTH_BLOCK (128/4)

…..

Thanks,

Mohammad

Jose Nunez-YanezThanks,

For some reason when I enclose in parenthesis compilation fails in my sdsoc 2016.2 but if I just write the number directly so 4096 instead of (1024*4) it compiles and works.

MohammadI have realised this problem in the SDSoC 2016.2 before.

If you add “#pragma SDS data sys_port(A:ACP, B:ACP, C:ACP)” (or similar pragma) before the “mmult_accel” function prototype in the matrix_mult.h then it can be compiled even with complex expressions passed to the zero_copy pragma.

It seems the new version of the SDSoC cannot decide on the memory port when the array size in the zero_copy pragma is an expression. Probably they will correct that in future versions.

Large Matrix-Matrix Multiplication on Dual-Core Cortex-A9+NEON – High-Level Synthesis & Embedded Systems Blog[…] one of my previous posts I introduced an implementation for the large matrix-matrix multiplication on FPGA and it was mush […]

Shikha GoelHi.

I was trying your method.

What is the difference between standalone and under linux code?

Is there any difference in approach or method used?

Please explain.

MohammadHi,

Conceptually, both are the same. However, in Linux we had a limitation for allocating the cached contiguous memory to be accessed in the accelerator. It is possible to use the standalone version in the Linux if cache coherency is not necessary.