Matrix multiplication is an essential linear algebra operation in many engineering applications, and a variety of hardware structures have been proposed to implement it. On FPGA-based platforms, the systolic array is one of the most important of these structures. However, because of resource constraints, existing systolic arrays can only handle matrices of small scale. This paper first proposes a parallel block algorithm based on systolic arrays that can efficiently handle large-scale matrix multiplication. Then, by optimizing the data flow and increasing data reuse, we reduce the data access overhead. Finally, we use the matrix multiplication unit to build a complete matrix multiplication acceleration system and deploy it on a Xilinx 325T FPGA. Experimental results show that the accelerator achieves a maximum frequency of 125 MHz with a 25×25 systolic array and can efficiently process convolution operations in neural networks. Compared with the traditional systolic array, our structure has stronger data processing capability. Compared with linear arrays and other structures, it offers low complexity and high efficiency.
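
The block idea behind handling matrices larger than a fixed-size array can be illustrated with a small software model. The sketch below is not the accelerator's actual algorithm or data flow; it only shows, under stated assumptions, how a large product can be split into tile-sized sub-products that a fixed array could process one at a time. The names `tile_matmul`, `blocked_matmul`, and the parameter `tile` are hypothetical, and `tile_matmul` is a plain-Python stand-in for one pass through the array (the default of 25 mirrors the 25×25 array size mentioned above).

```python
def tile_matmul(a_tile, b_tile, acc):
    """Hypothetical stand-in for one pass through a fixed-size array.

    Accumulates a_tile x b_tile into acc in place. A real systolic array
    would do this with pipelined multiply-accumulate cells; timing is
    not modelled here.
    """
    for i, a_row in enumerate(a_tile):
        for k, a_val in enumerate(a_row):
            b_row = b_tile[k]
            for j, b_val in enumerate(b_row):
                acc[i][j] += a_val * b_val


def blocked_matmul(a, b, tile=25):
    """Multiply a (m x k) by b (k x n) using tiles of at most tile x tile."""
    m, k_dim, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):          # tile row of the output
        for j0 in range(0, n, tile):      # tile column of the output
            rows = min(tile, m - i0)
            cols = min(tile, n - j0)
            # The partial sum for this output tile is kept local while all
            # tiles along the shared dimension are streamed through.
            acc = [[0] * cols for _ in range(rows)]
            for k0 in range(0, k_dim, tile):
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                tile_matmul(a_tile, b_tile, acc)
            # Write the finished output tile back once.
            for i in range(rows):
                c[i0 + i][j0:j0 + cols] = acc[i]
    return c
```

For example, `blocked_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]], tile=2)` splits the shared dimension into two tiles and returns the same result as an unblocked product. Keeping the accumulator for each output tile local until every contributing tile has been processed is one simple way to reuse data and write each output element only once; the paper's own data-flow optimization may differ.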