### Refine

#### Document Type

- Conference Object (5)
- Article (1)
- Part of a Book (1)

#### Keywords

- SpMV (3)
- GPU (2)
- Single Instruction Multiple Data (SIMD) (2)
- Sparse Matrix Vector multiply (SpMV) (2)
- intrinsics (2)
- CPU (1)
- CSR5BC (1)
- CUDA (1)
- Gradient-boosting (1)
- HCSS (1)

There exist various different high- and low-level approaches for GPU programming. These include the newer directive based OpenACC programming model, Nvidia’s programming platform CUDA and existing libraries like cuSPARSE with a fixed functionality. This work compares the attained performance and development effort of different approaches based on the example of implementing the SpMV operation, which is an important and performance critical building block in many application fields. We show that the main differences in development effort using CUDA and OpenACC are related to the memory management and the thread mapping.

The SpMV operation -- the multiplication of a sparse matrix with a dense vector -- is used in many simulations in natural and engineering sciences as a computational kernel. This kernel is quite performance critical as it is used, e.g.,~in a linear solver many times in a simulation run. Such performance critical kernels of a program may be optimized on certain levels, ranging from using a rather coarse grained and comfortable single compiler optimization switch down to utilizing architecural features by explicitly using special instructions on an assembler level. This paper discusses a selection of such program optimization techniques in this spectrum applied to the SpMV operation. The achievable performance gain as well as the additional programming effort are discussed. It is shown that low effort optimizations can improve the performance of the SpMV operation compared to a basic implementation. But further than that, more complex low level optimizations have a higher impact on the performance, although changing the original program and the readability / maintainability of a program significantly.

In this paper, a set of micro-benchmarks is proposed to determine basic performance parameters of single-node mainstream hardware architectures for High Performance Computing. Performance parameters of recent processors, including those of accelerators, are determined. The investigated systems are Intel server processor architectures as well as the two accelerator lines Intel Xeon Phi and Nvidia graphic processors. Results show similarities for some parameters between all architectures, but significant differences for others.

Predicting the runtime of a sparse matrix-vector multiplication (SpMV) for different sparse matrix formats and thread mappings allows the dynamic selection of the most appropriate matrix format and thread mapping for a given matrix. This paper introduces two new generally applicable performance models for SpMV – for linear and non-linear relationships – based on machine learning techniques. This approach supersedes the common manual development of an explicit performance model for a new architecture or for a new format based on empirical data. The two new models are compared to an existing explicit performance model on different GPUs. Results show that the quality of performance prediction results, the ranking of the alternatives, and the adaptability to other formats/architectures of the two machine learning techniques is better than that of the explicit performance model.

SpMV Runtime Improvements with Program Optimization Techniques on Different Abstraction Levels
(2016)

The multiplication of a sparse matrix with a dense vector is a performance critical computational kernel in many applications, especially in natural and engineering sciences. To speed up this operation, many optimization techniques have been developed in the past, mainly focusing on the data layout for the sparse matrix. Strongly related to the data layout is the program code for the multiplication. But even for a fixed data layout with an accommodated kernel, there are several alternatives for program optimizations. This paper discusses a spectrum of program optimization techniques on different abstraction layers for six different sparse matrix data format and kernels. At the one end of the spectrum, compiler options can be used that hide from the programmer all optimizations done by the compiler internally. On the other end of the spectrum, a multiplication kernel can be programmed that use highly sophisticated intrinsics on an assembler level that ask for a programmer with a deep understanding of processor architectures. These special instructions can be used to efficiently utilize hardware features in processors like vector units that have the potential to speed up sparse matrix computations. The paper compares the programming effort and required knowledge level for certain program optimizations in relation to the gained runtime improvements.

The Sparse Matrix-Vector Multiplication (SpMV) is an important building block in High Performance Computing. Performance improvements for the SpMV are often gained by the development of new optimized sparse matrix formats either by utilizing special sparsity patterns of a matrix or by taking bottlenecks of a hardware architecture into account. In this work a requirements analysis is done for sparse matrix formats with an emphasis on the parallel SpMV for large general sparse matrices. Based on these requirements, three new sparse matrix formats were developed, each combining several optimization techniques and addressing different optimization goals/hardware architectures. The CSR5 Bit Compressed (CSR5BC) format is an extension to the existing CSR5 format and optimized for GPUs. The other two formats, Hybrid Compressed Slice Storage (HCSS) and Local Group Compressed Sparse Row (LGCSR), are new formats optimized for multi-core/-processor architectures including the Xeon Phi Knights Landing. Results show that all three storage formats deliver good parallel SpMV performance on their target architectures over a large set of test matrices compared to other well performing formats in vendor and research libraries.

Improving the Performance of Parallel SpMV Operations on NUMA Systems with Adaptive Load Balancing
(2018)

For a parallel Sparse Matrix Vector Multiply (SpMV) on a multiprocessor, rather simple and efficient work distributions often produce good results. In cases where this is not true, adaptive load balancing can improve the balance and performance. This paper introduces a low overhead framework for adaptive load balancing of parallel SpMV operations. It uses statistical filters to gather relevant runtime performance data and detects an imbalance situation. Three different algorithms were compared that adaptively balance the load with high quality and low overhead. Results show that for sparse matrices, where the adaptive load balancing was enabled, an average speedup of 1.15 (regarding the total execution time) could be achieved with our best algorithm over 4 different matrix formats and two different NUMA systems.