On May 31, 2019, the Center for Energy-Efficient Computing and Applications (CECA) held the PhD dissertation defense of Xiuhong Li in Conference Room 410 of the 5th Science Building. Xiuhong's advisor is Prof. Yun Liang, a tenured associate professor at CECA. The defense committee included Prof. Xiaobing Feng from the Institute of Computing Technology, Chinese Academy of Sciences; Prof. Jidong Zhai from Tsinghua University; Prof. Ming Jiang from the School of Mathematics at Peking University; and Prof. Yun Liang, Prof. Guojie Luo, and Prof. Guangyu Sun from CECA.
Xiuhong defended his dissertation, titled "GPU Optimization Technology for Irregular Applications". He gave a clear presentation of the research background and significance of the dissertation and of the novel techniques it proposes, and answered the committee's questions convincingly. The defense committee unanimously agreed to pass the defense and recommended that Xiuhong Li be awarded the doctorate of science.
During his PhD program, Xiuhong published 5 academic papers, including papers at PPoPP and ICS, top conferences in parallel and high-performance computing. His innovative achievements in GPU high-performance computing were unanimously recognized by the defense committee. Xiuhong also won the National Scholarship and the Peking University Excellence Award, among other honors, during his PhD, and earned the title of "SenseTime Internship Star" during his internship at SenseTime.
Attachment: Abstract of the PhD Dissertation
In recent years, a large number of new applications have emerged, placing ever higher demands on the computing capability of hardware. The architecture of the traditional central processing unit (CPU) emphasizes generality and struggles to fully exploit the parallelism inherent in applications. The general-purpose graphics processing unit (GPU), characterized by high parallelism and a many-core structure, arose to meet this need. The GPU redesigns the chip area that a CPU devotes to complex control logic and on-chip caches into a large number of computing cores, and executes thousands of threads in parallel to achieve very high computing throughput and memory bandwidth. GPUs are therefore well suited to applications with abundant data parallelism and contiguous, regular memory access. For applications with insufficient data parallelism or irregular memory access behavior, GPUs can still provide acceleration, but it is very difficult to fully utilize their compute and memory resources and approach peak performance.
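The bandwidth point above can be made concrete with a toy model. The sketch below (my own illustration, not from the dissertation) counts how many 128-byte memory transactions a 32-thread warp needs for one 4-byte load per thread: contiguous accesses fit in one transaction, while strided accesses scatter across many, wasting bandwidth. The segment and warp sizes are typical NVIDIA GPU values, assumed here for illustration.

```python
# Toy model (illustrative assumption): a warp of 32 threads each loads one
# 4-byte element, and the memory system serves requests in 128-byte segments.
# Regular (contiguous) access coalesces into one transaction; strided access
# touches one segment per thread.

SEGMENT = 128   # bytes per memory transaction (assumed)
WORD = 4        # bytes per element

def warp_transactions(indices):
    """Number of distinct 128-byte segments touched by one warp's loads."""
    return len({(i * WORD) // SEGMENT for i in indices})

coalesced = list(range(32))              # thread t reads element t
strided = [t * 32 for t in range(32)]    # thread t reads element 32*t

print(warp_transactions(coalesced))      # 1 transaction
print(warp_transactions(strided))        # 32 transactions
```

The 32x difference in transactions is exactly the kind of effective-bandwidth loss the abstract attributes to irregular memory access.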
This dissertation first argues that the mismatch between application characteristics and GPU architectural characteristics is the root cause of irregularity on the GPU. Its core is therefore to analyze the architectural features of the GPU in depth, identify the crux of optimizing irregular applications on it, and, around this core, propose a set of thread-level parallelism management schemes. The dissertation analyzes the architecture-related characteristics of irregular applications from three aspects: insufficient parallelism, irregular memory access, and unbalanced resource utilization.

First, when the problem size is small, the application conflicts with the GPU's reliance on massive thread-level parallelism, so the GPU's computing power cannot be fully utilized; we call this the insufficient-parallelism problem. To address it, the dissertation proposes a batched execution technique for large numbers of small tasks of the same type, introducing a unified thread structure and a blocking decision algorithm, with small matrix multiplication as the running example.

Second, when the application exhibits fragmented, unaligned memory access patterns, it conflicts with the GPU's requirement for coalesced memory access, so the GPU's high bandwidth cannot be fully exploited; we call this the irregular-memory-access problem. To address it, the dissertation proposes a thread mapping technique that establishes the mapping between threads and tasks through a unified graph algorithm, reducing the performance loss caused by irregular access.
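The batched-execution idea can be sketched in a few lines. This is a minimal illustration of the unified-thread-structure concept, not the dissertation's actual algorithm: instead of launching one tiny kernel per small matrix multiply (too little parallelism per launch), all same-shaped tasks are fused into one flat index space, the way a single GPU grid would cover the whole batch; the shapes and the loop-as-thread model are assumptions for clarity.

```python
def batched_matmul(pairs, n=2, k=2, m=2):
    """One flat 'grid' covers the whole batch: virtual thread tid computes
    output element (b, i, j), so available parallelism grows with the batch
    size instead of being capped by one tiny matrix."""
    B = len(pairs)
    out = [[[0] * m for _ in range(n)] for _ in range(B)]
    for tid in range(B * n * m):          # each iteration ~ one GPU thread
        b, rem = divmod(tid, n * m)       # which task in the batch
        i, j = divmod(rem, m)             # which output element of that task
        a, w = pairs[b]
        out[b][i][j] = sum(a[i][p] * w[p][j] for p in range(k))
    return out

# 1000 identical small tasks become 1000 * 2 * 2 = 4000 units of work.
batch = [([[1, 2], [3, 4]], [[5, 6], [7, 8]])] * 1000
print(batched_matmul(batch)[0])   # [[19, 22], [43, 50]]
```

The blocking decision algorithm mentioned above would then choose how these flat indices are grouped into thread blocks, a step omitted in this sketch.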
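The thread-mapping idea can likewise be illustrated with a toy model. The dissertation formulates the thread-to-task mapping as a graph problem; the sketch below substitutes a much simpler stand-in (sorting tasks by the address each one reads) purely to show the effect: grouping tasks with nearby addresses into the same warp reduces the number of memory segments each warp touches. All constants and the random workload are assumptions for illustration.

```python
import random

SEGMENT = 128   # bytes per memory transaction (assumed)
WORD = 4        # bytes per data element
WARP = 32       # threads that issue memory requests together

def total_transactions(task_addrs, mapping):
    """Total memory transactions when thread t executes task mapping[t]."""
    total = 0
    for w in range(0, len(mapping), WARP):
        warp_tasks = mapping[w:w + WARP]
        total += len({(task_addrs[t] * WORD) // SEGMENT for t in warp_tasks})
    return total

random.seed(0)
addrs = random.sample(range(4096), 256)              # scattered per-task indices
identity = list(range(256))                          # default: thread t -> task t
remapped = sorted(identity, key=lambda t: addrs[t])  # locality-aware mapping

# The remapped schedule needs far fewer transactions than the identity one.
print(total_transactions(addrs, identity), total_transactions(addrs, remapped))
```

A real implementation would also have to keep the mapping cheap to compute and respect data dependences, which is where the unified graph formulation comes in.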
Finally, because individual applications prefer compute resources or memory resources to different degrees, while the GPU provides abundant compute and memory resources simultaneously, a single application often cannot utilize GPU resources in a balanced way; we call this the unbalanced-resource-utilization problem. To address it, the dissertation proposes a concurrent execution technique for different tasks, achieving balanced, full utilization of GPU resources by co-scheduling applications with complementary resource demands.
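The complementarity argument above reduces to a simple capacity model. The sketch below is my own toy illustration, not the dissertation's scheduler: the kernel names and demand percentages are invented, and each kernel's load on compute and bandwidth is expressed as a fraction of capacity. Running a compute-bound kernel alone leaves bandwidth idle; pairing it with a memory-bound kernel drives both resources toward full utilization.

```python
# Hypothetical per-kernel resource demands, in % of GPU capacity (invented
# numbers for illustration).
kernels = {
    "gemm": {"compute": 90, "bandwidth": 20},   # compute-bound
    "spmv": {"compute": 20, "bandwidth": 80},   # memory-bound
}

def utilization(*names):
    """Per-resource demand when the named kernels run concurrently,
    capped at 100% of capacity."""
    return {r: min(100, sum(kernels[n][r] for n in names))
            for r in ("compute", "bandwidth")}

print(utilization("gemm"))           # {'compute': 90, 'bandwidth': 20}
print(utilization("gemm", "spmv"))   # {'compute': 100, 'bandwidth': 100}
```

A real co-scheduler must also divide thread-level resources between the kernels and manage interference, which is the substance of the dissertation's technique; this model only shows why complementary pairings are worth seeking.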
In fact, these three techniques not only revolve around the same core, the mismatch between GPU architectural characteristics and application characteristics; they also operate at the same level of optimization, thread-level parallelism management. First, batched execution increases thread-level parallelism; second, thread mapping adjusts the correspondence between threads and tasks in a targeted way; finally, concurrent multi-task execution allocates threads across tasks to balance complementary resource use and mitigate contention. The three techniques are therefore closely related, and they are general rather than tied to any single application. Finally, the dissertation takes deep neural networks as a case study, applying all three techniques to DNN performance optimization and further validating their generality and synergy.