Material for the Lecture and Exercises

The material for the lecture and the exercises is available on an external website.

Introduction to High Performance Computing (Softwaregrundlagen HPC) - SS 2016

In High Performance Computing we are interested in implementing numerical algorithms in such a way that we get as close as possible to the theoretical peak performance of the underlying hardware. This requires adapting the numerical algorithms to the hardware, a topic that the usual lectures on numerical analysis or numerical linear algebra do not cover.

It turns out that the performance depends mainly on a few elementary linear algebra operations. These operations became known as BLAS (Basic Linear Algebra Subprograms). So an efficient BLAS implementation is crucial for scientific computing and scientific applications. Both commercial (Intel MKL, AMD ACML, Sun Performance Library, ...) and open source (BLIS, ATLAS, ...) implementations of BLAS are available.
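To give an impression of what such elementary operations look like, here is a minimal C sketch of two vector operations in the style of BLAS: the update y <- alpha*x + y (known as "axpy") and the dot product. The function names `ref_daxpy` and `ref_ddot` and their signatures are made up for this illustration and are not the official BLAS interface. The sketch also assumes positive strides only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine modeled on the BLAS operation "axpy":
   y <- alpha*x + y for vectors of length n.
   incx and incy are the strides between consecutive vector elements.
   Assumption of this sketch: both strides are positive. */
void ref_daxpy(size_t n, double alpha,
               const double *x, size_t incx,
               double *y, size_t incy)
{
    assert(incx > 0 && incy > 0);
    for (size_t i = 0; i < n; ++i) {
        y[i * incy] += alpha * x[i * incx];
    }
}

/* Hypothetical reference routine for the dot product:
   returns the sum of x_i * y_i for i = 0, ..., n-1. */
double ref_ddot(size_t n,
                const double *x, size_t incx,
                const double *y, size_t incy)
{
    assert(incx > 0 && incy > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i * incx] * y[i * incy];
    }
    return sum;
}
```

The stride parameters allow the same routine to work on a row or a column of a matrix without copying, which is one reason why this small set of operations is so widely reusable.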

So if a lecture is called High Performance Computing it has to deal with BLAS! One way of dealing with BLAS is merely reading papers and using existing BLAS implementations as a black box. But if you really want to understand it you have to implement your own BLAS! We will call this implementation ulmBLAS and it will be on par with commercial implementations.

It is important to know that developing your own ulmBLAS means that you will start with an empty source file. So this is a hands-on class. The development during the semester happens step-by-step (of course these steps are guided) and in two phases:

Phase 1: In the first half of the semester the students develop a pure C implementation of the most important BLAS functions. In this phase students learn about the relevant numerical algorithms, the programming language C, programming tools and fundamental concepts of modern hardware architecture. At the end of this phase the implementation reaches about 30 percent of the theoretical peak performance. The code for the matrix-matrix product (one of the most important BLAS functions) has fewer than 450 lines.
Phase 2: We now focus on hardware optimization using techniques like loop unrolling, instruction pipelining and SIMD (single instruction, multiple data). In this phase students get introduced to assembly language for more sophisticated low-level programming. At the end we can achieve about 98 percent of the performance of Intel MKL, which is the fastest implementation for the Intel platform. In this phase about 10 lines of the original C code were replaced by about 400 lines of assembly code.
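As a rough illustration of the starting point and of one of the techniques named above, the following C sketch shows a straightforward matrix-matrix product C <- beta*C + alpha*A*B and a variant whose inner loop is unrolled by a factor of four. This is not the ulmBLAS code. The names `naive_dgemm` and `unrolled_dgemm`, the row-major storage with leading dimensions `ldA`, `ldB`, `ldC`, and the unrolling factor are assumptions made for this example. A tuned implementation would in addition use cache blocking, packing and SIMD instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: combines the old entry of C with the new product.
   For beta == 0 the old value is not read at all. */
static double update(double beta, double c_old, double alpha, double sum)
{
    return (beta == 0.0) ? alpha * sum : beta * c_old + alpha * sum;
}

/* Naive triple loop: C <- beta*C + alpha*A*B.
   A is m x k, B is k x n, C is m x n, all stored row-major.
   ldA, ldB, ldC are the distances between consecutive rows. */
void naive_dgemm(size_t m, size_t n, size_t k, double alpha,
                 const double *A, size_t ldA,
                 const double *B, size_t ldB,
                 double beta, double *C, size_t ldC)
{
    assert(ldA >= k && ldB >= n && ldC >= n);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t l = 0; l < k; ++l) {
                sum += A[i * ldA + l] * B[l * ldB + j];
            }
            C[i * ldC + j] = update(beta, C[i * ldC + j], alpha, sum);
        }
    }
}

/* Same operation, inner loop unrolled by four with independent
   accumulators, so that the additions do not all depend on each other.
   Rounding may differ slightly from the naive version because the
   summation order changes. */
void unrolled_dgemm(size_t m, size_t n, size_t k, double alpha,
                    const double *A, size_t ldA,
                    const double *B, size_t ldB,
                    double beta, double *C, size_t ldC)
{
    assert(ldA >= k && ldB >= n && ldC >= n);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
            size_t l = 0;
            for (; l + 4 <= k; l += 4) {
                s0 += A[i * ldA + l]     * B[l * ldB + j];
                s1 += A[i * ldA + l + 1] * B[(l + 1) * ldB + j];
                s2 += A[i * ldA + l + 2] * B[(l + 2) * ldB + j];
                s3 += A[i * ldA + l + 3] * B[(l + 3) * ldB + j];
            }
            double sum = (s0 + s1) + (s2 + s3);
            for (; l < k; ++l) {          /* remainder when k % 4 != 0 */
                sum += A[i * ldA + l] * B[l * ldB + j];
            }
            C[i * ldC + j] = update(beta, C[i * ldC + j], alpha, sum);
        }
    }
}
```

Both functions compute the same operation; the unrolled variant only exposes more independent work per loop iteration. Whether it is actually faster depends on the compiler and the hardware, so it has to be measured.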