In this paper, several blocking techniques are applied to matrices that do not have a strong blocked structure. The aim is to use vectorization efficiently on current CPUs, even for matrices without an explicit block structure of nonzero elements. Different approaches are known for finding fixed- or variable-sized blocks of nonzero elements in a matrix. We present a new matrix format for 2D rectangular blocks of variable size, allowing fill-ins per block of explicit zero values up to a user-definable threshold. We give a heuristic to detect such 2D blocks in a sparse matrix. The performance of a Sparse Matrix Vector Multiplication for chosen block formats is measured and compared. Results show that the benefit of blocking formats depends, as expected, on the structure of the matrix and that variable-sized block formats can have advantages over fixed-size formats.
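The fill-in threshold can be illustrated with a small acceptance test for a candidate block. This is only a sketch under stated assumptions, not the heuristic from the paper: the names `block_nnz`, `block_accepted`, and `max_fill_ratio` are hypothetical, the threshold is assumed to be the maximum allowed fraction of explicit zeros per block, and the matrix is passed in dense row-major form to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Count the nonzeros inside the rectangle [r0, r0+h) x [c0, c0+w) of a dense
 * row-major matrix `a` with `n_cols` columns. A dense input is an assumption
 * made for brevity; a real detector would scan a sparse format instead. */
size_t block_nnz(const double *a, size_t n_cols,
                 size_t r0, size_t c0, size_t h, size_t w)
{
    size_t nnz = 0;
    for (size_t r = r0; r < r0 + h; ++r)
        for (size_t c = c0; c < c0 + w; ++c)
            if (a[r * n_cols + c] != 0.0)
                ++nnz;
    return nnz;
}

/* Hypothetical acceptance rule: a block of size h x w holding `nnz` nonzeros
 * is accepted if the fraction of explicit zeros (fill-in) does not exceed
 * max_fill_ratio. Returns 1 if accepted, 0 otherwise. */
int block_accepted(size_t nnz, size_t h, size_t w, double max_fill_ratio)
{
    assert(h > 0 && w > 0 && nnz <= h * w);
    size_t fill = h * w - nnz;
    return (double)fill <= max_fill_ratio * (double)(h * w);
}
```

With a threshold of 0.25, a 2x2 candidate containing three nonzeros would be accepted, whereas one containing only two nonzeros would be rejected.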

The SpMV operation, the multiplication of a sparse matrix with a dense vector, is used as a computational kernel in many simulations in the natural and engineering sciences. This kernel is performance critical because it is executed many times in a simulation run, e.g., inside a linear solver. Such performance-critical kernels may be optimized on several levels, ranging from a coarse-grained and convenient single compiler optimization switch down to exploiting architectural features by explicitly using special instructions on an assembler level. This paper discusses a selection of program optimization techniques in this spectrum applied to the SpMV operation. The achievable performance gain as well as the additional programming effort are discussed. It is shown that low-effort optimizations can improve the performance of the SpMV operation compared to a basic implementation. Beyond that, more complex low-level optimizations have a higher impact on performance, but they significantly change the original program and affect its readability and maintainability.
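For orientation, a basic SpMV kernel might look like the following sketch. It assumes the widely used CSR (compressed sparse row) layout, which the abstract does not name; the function name `spmv_csr` and its parameters are illustrative and not taken from the paper.

```c
#include <assert.h>
#include <stddef.h>

/* y = A * x for a matrix A stored in CSR form (assumed layout).
 * row_ptr has n_rows + 1 entries; the nonzeros of row i are stored in
 * val[row_ptr[i] .. row_ptr[i+1]-1], with their column indices in col_idx. */
void spmv_csr(size_t n_rows, const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    assert(row_ptr != NULL && y != NULL);
    for (size_t i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        /* Indirect access x[col_idx[k]] is what makes this kernel hard to
         * vectorize and memory bound. */
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```

Compiling such a kernel with a single optimization switch such as `-O3` would correspond to the low-effort end of the spectrum described above.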

The Sparse Matrix Vector Multiplication is an important operation on sparse matrices. It is the most time-consuming operation in iterative solvers, and its efficient execution is therefore of great importance for many applications. Numerous storage formats that store sparse matrices efficiently have already been established. Often, these storage formats exploit the sparsity pattern of a matrix in an appropriate manner. For one class of sparse matrices, the nonzero values occur in small dense blocks, and appropriate block storage formats are well suited for such patterns. On the other hand, these formats often perform poorly on general matrices without an explicit or regular block structure. In this paper, the newly developed sparse matrix format DynB is introduced. The aim is to efficiently use several optimization approaches and vectorization with current processors, even for matrices without an explicit block structure of nonzero elements. The DynB matrix format uses 2D rectangular blocks of variable size, allowing fill-ins per block of explicit zero values up to a user-controllable threshold. We give a simple and fast heuristic to detect such 2D blocks in a sparse matrix. The performance of the Sparse Matrix Vector Multiplication for a selection of different block formats and matrices with different sparsity structures is compared. Results show that the benefit of blocking formats depends, as expected, on the structure of the matrix and that variable-sized block formats like DynB can have advantages over fixed-size formats and deliver good performance results even for general sparse matrices.
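To make the idea of variable-sized dense blocks with explicit zeros concrete, the following sketch multiplies a list of rectangular blocks with a vector. It is not the DynB data layout, which the abstract does not specify; the `dense_block` struct and `spmv_blocks` function are hypothetical and only show why storing fill-in zeros yields contiguous, vectorization-friendly inner loops.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block descriptor: an h x w rectangle whose top-left corner
 * sits at (row, col). val holds h*w values in row-major order; fill-in
 * positions are stored as explicit 0.0. */
typedef struct {
    size_t row, col;
    size_t h, w;
    const double *val;
} dense_block;

/* y += A * x, where A is given as a list of dense blocks.
 * The caller is expected to initialize y (e.g. to zero) beforehand. */
void spmv_blocks(size_t n_blocks, const dense_block *blocks,
                 const double *x, double *y)
{
    assert(blocks != NULL || n_blocks == 0);
    for (size_t b = 0; b < n_blocks; ++b) {
        const dense_block *blk = &blocks[b];
        for (size_t r = 0; r < blk->h; ++r) {
            double sum = 0.0;
            /* Contiguous accesses to both val and x: no column index
             * lookup per element, so a compiler can vectorize this loop. */
            for (size_t c = 0; c < blk->w; ++c)
                sum += blk->val[r * blk->w + c] * x[blk->col + c];
            y[blk->row + r] += sum;
        }
    }
}
```

The trade-off in this sketch is that every explicit zero costs a multiplication and memory traffic, which is presumably why the amount of fill-in per block is bounded by a threshold.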

SpMV Runtime Improvements with Program Optimization Techniques on Different Abstraction Levels
(2016)

The multiplication of a sparse matrix with a dense vector is a performance-critical computational kernel in many applications, especially in the natural and engineering sciences. To speed up this operation, many optimization techniques have been developed in the past, mainly focusing on the data layout for the sparse matrix. Strongly related to the data layout is the program code for the multiplication. But even for a fixed data layout with a matching kernel, there are several alternatives for program optimizations. This paper discusses a spectrum of program optimization techniques on different abstraction layers for six different sparse matrix data formats and kernels. At one end of the spectrum, compiler options can be used that hide from the programmer all optimizations done internally by the compiler. At the other end of the spectrum, a multiplication kernel can be programmed that uses highly sophisticated intrinsics on an assembler level, which requires a programmer with a deep understanding of processor architectures. These special instructions can be used to efficiently utilize hardware features in processors, such as vector units, that have the potential to speed up sparse matrix computations. The paper compares the programming effort and required knowledge level for certain program optimizations in relation to the gained runtime improvements.
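A point in the middle of this spectrum could look like the sketch below: a CSR kernel (an assumed layout) with a manually four-way unrolled inner loop and an optional OpenMP directive for row-level parallelism. The name `spmv_csr_unrolled` is illustrative, and the code is not one of the paper's kernels. It already shows the cost mentioned in the abstract, since the source becomes longer and harder to read than a plain double loop.

```c
#include <assert.h>
#include <stddef.h>

/* y = A * x for a CSR matrix, with a 4-way unrolled inner loop.
 * Signed indices are used so the row loop also works with older OpenMP
 * versions. The directive is only active when compiled with OpenMP support
 * (e.g. -fopenmp); otherwise the loop runs serially. */
void spmv_csr_unrolled(long n_rows, const long *row_ptr, const long *col_idx,
                       const double *val, const double *x, double *y)
{
    assert(n_rows >= 0);
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (long i = 0; i < n_rows; ++i) {
        long k = row_ptr[i];
        const long end = row_ptr[i + 1];
        /* Four independent accumulators break the dependency chain of a
         * single running sum. */
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (; k + 3 < end; k += 4) {
            s0 += val[k]     * x[col_idx[k]];
            s1 += val[k + 1] * x[col_idx[k + 1]];
            s2 += val[k + 2] * x[col_idx[k + 2]];
            s3 += val[k + 3] * x[col_idx[k + 3]];
        }
        double sum = (s0 + s1) + (s2 + s3);
        /* Remainder loop for rows whose length is not a multiple of 4. */
        for (; k < end; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```

Note that the unrolled version sums the products in a different order than a straightforward loop, so floating-point results may differ in the last bits. A kernel written with vector intrinsics would go one step further and tie the code to a specific instruction set.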