S Williams et al 2008 J. Phys.: Conf. Ser. 125 012038 doi:10.1088/1742-6596/125/1/012038
S Williams, K Datta, J Carter, L Oliker, J Shalf, K Yelick and D Bailey
We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to sparse matrix-vector multiplication (SpMV), the explicit heat equation PDE on a regular grid (Stencil), and a lattice Boltzmann application (LBMHD). We explore one of the broadest sets of multicore architectures in the high-performance computing literature, including the Intel Xeon Clovertown, AMD Opteron Barcelona, Sun Victoria Falls, and the Sony-Toshiba-IBM (STI) Cell. Rather than hand-tuning each kernel for each system, we develop a code generator for each kernel that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned kernel applications often achieve a better than 4× improvement compared with the original code. Additionally, we analyze a Roofline performance model for each platform to reveal hardware bottlenecks and software challenges for future multicore systems and applications.
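As a rough illustration of the Roofline bound mentioned above, the following minimal sketch computes attainable performance as the smaller of peak compute throughput and memory bandwidth times arithmetic intensity. The function name and all machine and kernel numbers are hypothetical placeholders chosen for illustration; they are not measurements or parameters from the paper.

```python
def roofline_gflops(peak_gflops, peak_bandwidth_gbs, arithmetic_intensity):
    """Upper bound on attainable GFLOP/s under the basic Roofline model.

    arithmetic_intensity is in flops per byte of memory traffic.
    """
    return min(peak_gflops, peak_bandwidth_gbs * arithmetic_intensity)


# Hypothetical machine parameters (assumptions, not values from the paper).
PEAK_GFLOPS = 64.0        # peak floating-point rate, GFLOP/s
PEAK_BANDWIDTH_GBS = 16.0  # peak memory bandwidth, GB/s

# Intensity at which the bound switches from bandwidth-limited to compute-limited.
ridge_point = PEAK_GFLOPS / PEAK_BANDWIDTH_GBS

# Hypothetical kernels with illustrative arithmetic intensities (flops/byte).
kernels = [("low-intensity kernel", 0.25), ("high-intensity kernel", 8.0)]

for name, intensity in kernels:
    bound = roofline_gflops(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS, intensity)
    limiter = "memory bandwidth" if intensity < ridge_point else "compute"
    print(f"{name}: bound {bound:.1f} GFLOP/s, limited by {limiter}")
```

Under this model, a kernel whose arithmetic intensity falls below the ridge point cannot reach peak compute no matter how well its inner loop is tuned, which is why the bound is useful for identifying hardware bottlenecks.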
07.05.Kf Data analysis: algorithms and implementation; data management
02.30.Jr Partial differential equations
07.05.Rm Data presentation and visualization: algorithms and implementation
Issue 1 (2008)