News

Dr. Zizhong Chen from UC Riverside Visited CECA

2016-12-08

Dec. 8, 2016 - Dr. Zizhong Chen from the University of California, Riverside (UC Riverside) visited CECA and gave a talk titled "Reliable Matrix Computations via Algorithm-Based Fault Tolerance".

 

 

Abstract: Errors are common in today's computer systems. When an error occurs, if the affected application continues, we call it a fail-continue error. Otherwise, we call it a fail-stop error. In this talk, I will discuss our recent work on algorithm-based fault tolerance for reliable matrix computations. We have developed some highly efficient error correction techniques for selected widely used matrix computation algorithms to tolerate both fail-continue and fail-stop errors according to their specific algorithmic characteristics. By leveraging the algorithmic characteristics of these algorithms, the proposed techniques can achieve much higher efficiency than the traditional general techniques (i.e., Triple Modular Redundancy for fail-continue errors and checkpoint/restart for fail-stop errors).
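To illustrate what exploiting an algorithm's characteristics can look like, the sketch below uses the classic checksum scheme for matrix multiplication, one well-known form of algorithm-based fault tolerance. It is not taken from the talk: the function names are hypothetical, it assumes at most one corrupted entry in the result, and it uses integer matrices so that checksums can be compared exactly (floating-point code would need a tolerance).

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b one extra entry holding that row's sum."""
    return [row + [sum(row)] for row in b]


def locate_error(c_full):
    """Return (row, col) of a single corrupted data entry, or None.

    c_full is the product of a column-checksum matrix and a row-checksum
    matrix, so its last row and last column hold checksums of the data part.
    """
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None


def correct_error(c_full, i, j):
    """Rebuild data entry (i, j) from the row checksum of row i."""
    m = len(c_full[0]) - 1
    others = sum(c_full[i][k] for k in range(m) if k != j)
    c_full[i][j] = c_full[i][m] - others


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# The checksums are carried through the multiplication itself.
c_full = matmul(with_column_checksum(a), with_row_checksum(b))

# Simulate a fail-continue error: one entry is silently corrupted.
c_full[0][1] += 100

position = locate_error(c_full)   # (0, 1)
if position is not None:
    correct_error(c_full, *position)

# Strip the checksum row and column to recover the plain product.
c = [row[:-1] for row in c_full[:-1]]
```

The redundancy here is one extra row and one extra column rather than two additional full copies of the computation, which is the intuition behind the efficiency comparison with Triple Modular Redundancy in the abstract.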

Biography: Dr. Zizhong (Jeffrey) Chen is a faculty member in the Department of Computer Science and Engineering at the University of California, Riverside. His research interests include high performance computing, parallel and distributed systems, big data analytics, cluster and cloud computing, algorithm-based fault tolerance (ABFT), power- and energy-efficient computing, numerical algorithms and software, and large-scale computer simulations. His research has been supported by the National Science Foundation, the Department of Energy, the CMG Reservoir Simulation Foundation, Abu Dhabi National Oil Company, Nvidia, and Microsoft Corporation. He has published over 70 papers, many of them in highly competitive conferences and journals such as HPDC, PPoPP, SC, ICS, IPDPS, TPDS, TC, JPDC, PARCO, SIMAX, SISC, and IBMRD. He has received a CAREER Award from the U.S. National Science Foundation and a Best Paper Award from the International Supercomputing Conference. Dr. Chen is a Senior Member of the IEEE and a Life Member of the ACM. He currently serves as Subject Area Editor for Elsevier journal and Associate Editor for IEEE Transactions on Parallel and Distributed Systems.

 
