News

Chen Zhang Successfully Defended his Ph.D. Thesis

2017-06-08

Chen Zhang successfully defended his Ph.D. thesis at CECA on June 5, 2017. Congratulations!
For more information (in Chinese), please refer to: 
http://ceca.pku.edu.cn/news.php?action=detail&article_id=691

 

Title: High Performance and Energy-efficient Computing for Deep Learning Algorithms

Abstract: With the recent advancement of artificial intelligence, deep learning has achieved remarkable success in many areas, especially in visual content understanding and classification. Deep learning applications, including deep neural networks and convolutional neural networks, as well as many applications built on deep learning algorithms (such as object tracking and recognition), involve a large volume of computation and thus place overwhelming pressure on computer systems, dramatically increasing their performance and energy requirements. At the same time, owing to the slowdown of Moore's Law and the limitations of the von Neumann architecture, existing general-purpose processors cannot compute deep learning algorithms efficiently. First, general-purpose architectures cannot fully exploit the algorithms' internal parallelism. Second, the cache hierarchy of general-purpose architectures handles artificial intelligence algorithms inefficiently. Third, processors that are overly specialized for a particular algorithm cannot meet the flexibility demanded by real application scenarios. Fourth, how to fully exploit the potential of reconfigurable computing in large-scale clusters remains an important open issue. To address these problems, this dissertation focuses on customized computer architecture design for deep learning algorithms, covering the balance between computation and memory access, hardware/software co-design, and design automation for dedicated accelerators. We further explore the optimization of specialized hardware accelerators for deep learning algorithms on large-scale distributed reconfigurable computing systems. The main contributions of this dissertation are as follows:

 

1. Reconfigurable and extensible deep learning accelerator microarchitecture and cache design optimization. Designing the microarchitecture of a specialized accelerator for neural network algorithms involves many difficulties. On the one hand, multi-dimensional tensor computation exhibits a large degree of parallelism together with complex data dependencies. We thoroughly analyze all the dimensions involved in tensor computation and propose a dedicated processing-unit design based on these data dependencies, together with a loop tiling and scheduling method based on the roofline model to optimize large tensor computations (see the sketch below). On the other hand, multi-dimensional parallel computation on tensors also requires matching optimizations of the on-chip cache design and of the data organization in main memory. We therefore propose a specialized cache design matched to the processing units, and an efficient memory data layout based on the DRAM access characteristics of the FPGA platform together with efficient tensor partitioning and scheduling.
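To make the roofline-based tiling search concrete, here is a minimal Python sketch of this kind of design-space exploration. The layer shape, clock frequency, bandwidth, compute and buffer limits, and the traffic model are illustrative assumptions, not the dissertation's actual parameters.

```python
# Sketch: roofline-guided search over loop-tiling factors for one conv layer.
# All constants below are assumed values for illustration only.
from itertools import product

# Layer shape: M output maps, N input maps, R x C output pixels, K x K kernels
M, N, R, C, K = 64, 64, 32, 32, 3
FREQ_GHZ = 0.1              # assumed accelerator clock (100 MHz)
PEAK_BW_GWORDS = 1.0        # assumed off-chip bandwidth (Gwords/s)
MAX_PE = 900                # assumed number of multiply-accumulate units
ON_CHIP_WORDS = 512 * 1024  # assumed on-chip buffer capacity (words)

def ceil_div(a, b):
    return (a + b - 1) // b

best = None
for Tm, Tn, Tr, Tc in product(range(1, M + 1), range(1, N + 1),
                              range(4, R + 1, 4), range(4, C + 1, 4)):
    if Tm * Tn > MAX_PE:
        continue            # unroll factor exceeds available compute units
    # On-chip buffers for one input tile, one weight tile, one output tile
    buf = Tn * (Tr + K - 1) * (Tc + K - 1) + Tm * Tn * K * K + Tm * Tr * Tc
    if buf > ON_CHIP_WORDS:
        continue
    total_ops = 2 * M * N * R * C * K * K
    trips = (ceil_div(M, Tm) * ceil_div(N, Tn) *
             ceil_div(R, Tr) * ceil_div(C, Tc))
    # Simplified traffic model: input and weight tiles are re-read for every
    # tile iteration, outputs are written once.
    traffic = (trips * (Tn * (Tr + K - 1) * (Tc + K - 1) + Tm * Tn * K * K)
               + M * R * C)
    ctc = total_ops / traffic             # ops per word moved off-chip
    comp_roof = 2 * Tm * Tn * FREQ_GHZ    # compute-bound GOPS
    bw_roof = ctc * PEAK_BW_GWORDS        # bandwidth-bound GOPS
    attainable = min(comp_roof, bw_roof)
    if best is None or attainable > best[0]:
        best = (attainable, (Tm, Tn, Tr, Tc))

print("best tiling (Tm, Tn, Tr, Tc):", best[1],
      "attainable GOPS:", round(best[0], 2))
```

Each candidate tiling's attainable performance is the minimum of its computational roof and its bandwidth roof (communication-to-computation ratio times peak bandwidth); the search keeps the tiling with the highest attainable performance that fits the assumed compute and buffer budgets.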

 

2. Computing model for neural networks and hardware/software co-design for specialized hardware. Running artificial intelligence algorithms on existing computer architectures presents a huge challenge. On the one hand, current general-purpose processor architectures cannot process artificial intelligence algorithms efficiently. On the other hand, artificial intelligence algorithms keep changing as scenarios differ and new algorithmic improvements appear, so an overly customized accelerator architecture cannot cope with the flexibility requirements of real application scenarios. This dissertation therefore proposes a hardware/software co-designed accelerator that retains programmability while exploiting the underlying parallelism in deep learning algorithms as much as possible.
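As one purely hypothetical illustration of how such a co-design can retain programmability, software can lower each network layer into a compact descriptor that a fixed accelerator datapath interprets; the field names below are invented for this sketch and are not the dissertation's interface.

```python
# Sketch: per-layer "instructions" generated by software and executed by a
# fixed accelerator datapath. Field names are hypothetical.
from dataclasses import dataclass

@dataclass
class LayerInstruction:
    op: str            # "CONV", "POOL", "FC", ...
    in_addr: int       # DRAM address of the input feature map
    out_addr: int      # DRAM address of the output feature map
    weight_addr: int   # DRAM address of the weights
    in_maps: int       # number of input feature maps
    out_maps: int      # number of output feature maps
    kernel: int        # kernel size
    stride: int        # stride

# Software lowers a new network into a sequence of such instructions; the
# accelerator executes them without being re-synthesized for each model.
program = [
    LayerInstruction("CONV", 0x0000, 0x8000, 0x4000, 3, 64, kernel=3, stride=1),
    LayerInstruction("POOL", 0x8000, 0xC000, 0x0000, 64, 64, kernel=2, stride=2),
]
```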

 

3. Compilation flow from a neural network specification to custom hardware, and accelerator hardware design automation. To make the specialized hardware accelerator easier to use, we designed an automated flow that compiles a neural network model defined in the industry-standard Protocol Buffers format into a dedicated accelerator and wraps it in an easy-to-use standard C++ interface, so that users can conveniently invoke the dedicated hardware to accelerate their computations. For FPGA hardware developers, different FPGA devices offer different resources, so the specialized accelerator hardware differs from device to device. To facilitate accelerator design across FPGA platforms, we provide a high-level synthesis (HLS) code base built on a hardware template library; through a hardware parameter configuration file and the HLS flow, specialized FPGA hardware can be instantiated very conveniently.
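The snippet below is a minimal sketch of the template-instantiation idea: a parsed, Caffe-style layer description is turned into a parameter header that a templated HLS kernel can include. The layer dictionaries, macro names, and output file name are assumptions made for illustration; the actual flow consumes a Protocol Buffers model and a richer hardware template library.

```python
# Sketch: emit an HLS parameter header from a parsed layer description.
# Layer fields, macro names, and the file name are hypothetical.
layers = [
    {"name": "conv1", "type": "Convolution", "num_output": 64, "kernel": 3},
    {"name": "conv2", "type": "Convolution", "num_output": 128, "kernel": 3},
]

def emit_hls_params(layers, tile_m=32, tile_n=32, path="accel_params.h"):
    """Write a header that a templated HLS kernel #includes at synthesis time."""
    convs = [l for l in layers if l["type"] == "Convolution"]
    lines = [
        "#ifndef ACCEL_PARAMS_H",
        "#define ACCEL_PARAMS_H",
        f"#define TILE_M {tile_m}",   # unroll factor over output feature maps
        f"#define TILE_N {tile_n}",   # unroll factor over input feature maps
        f"#define NUM_CONV_LAYERS {len(convs)}",
    ]
    for i, layer in enumerate(convs):
        lines.append(f"#define L{i}_OUT_MAPS {layer['num_output']}")
        lines.append(f"#define L{i}_KERNEL   {layer['kernel']}")
    lines.append("#endif")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

emit_hls_params(layers)
```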

 

4. Design and optimization of a large-scale distributed reconfigurable computing architecture for neural networks. Driven by the computational performance requirements and power limits of data centers in recent years, large-scale distributed reconfigurable computing has attracted strong interest from many companies; Microsoft, Baidu, Tencent, and others have deployed large numbers of FPGA devices in their data centers. However, how to deploy a neural network on a large-scale FPGA cluster has not yet been well studied. Because each machine can be optimized for the computation of particular layers, customized neural network accelerators have further room for optimization on a distributed reconfigurable computing architecture. We build a model for mapping a multi-layer neural network onto multiple FPGAs and propose a dynamic-programming algorithm that finds hardware designs with minimum computation latency and maximum throughput.
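A minimal sketch of one such dynamic-programming formulation: given estimated per-layer latencies, partition consecutive layers into one pipeline stage per FPGA so that the slowest stage (the inverse of the pipeline throughput) is minimized. The latency numbers and the cost model are assumptions for illustration; the real mapping problem also involves per-FPGA resources and inter-board communication, which this sketch ignores.

```python
# Sketch: min-max partition of consecutive layers into k pipeline stages
# (one stage per FPGA), minimizing the bottleneck stage latency.
import math

def partition_layers(latency, k):
    """latency[i] = estimated time of layer i on one FPGA; k = number of FPGAs."""
    n = len(latency)
    prefix = [0.0]
    for t in latency:
        prefix.append(prefix[-1] + t)
    def cost(i, j):                       # total latency of layers i..j-1
        return prefix[j] - prefix[i]

    # dp[j][s] = minimal bottleneck when the first j layers use s stages
    dp = [[math.inf] * (k + 1) for _ in range(n + 1)]
    cut = [[0] * (k + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for j in range(1, n + 1):
        for s in range(1, min(j, k) + 1):
            for i in range(s - 1, j):     # last stage covers layers i..j-1
                bottleneck = max(dp[i][s - 1], cost(i, j))
                if bottleneck < dp[j][s]:
                    dp[j][s], cut[j][s] = bottleneck, i
    # Recover the stage boundaries
    stages, j, s = [], n, k
    while s > 0:
        i = cut[j][s]
        stages.append((i, j))
        j, s = i, s - 1
    return dp[n][k], list(reversed(stages))

bottleneck, stages = partition_layers([3.0, 1.0, 4.0, 1.5, 5.0, 2.0], k=3)
print("bottleneck latency:", bottleneck, "stage ranges:", stages)
```

Minimizing the bottleneck maximizes steady-state throughput, while the sum of stage latencies bounds the end-to-end latency of a single input.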

 

Neural networks are highly compute- and data-intensive algorithms, and meeting the latency, throughput, and power-consumption requirements of many applications in real scenarios is a very complex problem. This dissertation finds that customized accelerator design is a promising approach: with hardware accelerators customized for neural network algorithms, the latency and power consumption needed to perform the same computation can be greatly reduced. In addition, hardware/software co-optimization gives customized hardware a certain degree of programmability to meet the practical need for flexibility across application scenarios, and the automated design flow increases the efficiency of accelerator deployment on FPGAs. This dissertation proposes an accelerator microarchitecture, a compilation flow, and a design automation process for neural network algorithms. The optimization method based on the roofline model allows us to find the optimal design for an FPGA platform under its compute and bandwidth constraints. Related chapters of this dissertation have been published in the areas of reconfigurable computing and design automation, and are expected to benefit the further development of customized computing and deep learning applications. Part of this work was done during the author's internship at Falcon Computing Solutions Inc. and has been transferred to real products.