### Refine

#### Document Type

- Conference Object (12)
- Article (5)
- Part of a Book (1)

#### Keywords

- SpMV (4)
- parallel breadth-first search (3)
- BFS (2)
- Blocking (2)
- GPU (2)
- Intel Xeon Phi (2)
- NUMA (2)
- Nvidia graphic processors (2)
- Single Instruction Multiple Data (SIMD) (2)
- Sparse Matrix Vector Multiplication (2)

There exist various different high- and low-level approaches for GPU programming. These include the newer directive based OpenACC programming model, Nvidia’s programming platform CUDA and existing libraries like cuSPARSE with a fixed functionality. This work compares the attained performance and development effort of different approaches based on the example of implementing the SpMV operation, which is an important and performance critical building block in many application fields. We show that the main differences in development effort using CUDA and OpenACC are related to the memory management and the thread mapping.

The SpMV operation -- the multiplication of a sparse matrix with a dense vector -- is used in many simulations in natural and engineering sciences as a computational kernel. This kernel is quite performance critical as it is used, e.g.,~in a linear solver many times in a simulation run. Such performance critical kernels of a program may be optimized on certain levels, ranging from using a rather coarse grained and comfortable single compiler optimization switch down to utilizing architecural features by explicitly using special instructions on an assembler level. This paper discusses a selection of such program optimization techniques in this spectrum applied to the SpMV operation. The achievable performance gain as well as the additional programming effort are discussed. It is shown that low effort optimizations can improve the performance of the SpMV operation compared to a basic implementation. But further than that, more complex low level optimizations have a higher impact on the performance, although changing the original program and the readability / maintainability of a program significantly.

Updating a shared data structure in a parallel program is usually done with some sort of high-level synchronization operation to ensure correctness and consistency. The realization of such high-level synchronization operations is done with appropriate low-level atomic synchronization instructions that the target processor architecture provides. These instructions are costly and often limited in their scalability on larger multi-core / multi-processor systems. In this paper, a technique is discussed that replaces atomic updates of a shared data structure with ordinary and cheaper read/write operations. The necessary conditions are specified that must be fulfilled to ensure overall correctness of the program despite missing synchronization. The advantage of this technique is the reduction of access costs as well as more scalability due to elided atomic operations. But on the other side, possibly more work has to be done caused by missing synchronization. Therefore, additional work is traded against costly atomic operations. A practical application is shown with level-synchronous parallel Breadth-First Search on an undirected graph where two vertex frontiers are accessed in parallel. This application scenario is also used for an evaluation of the technique. Tests were done on four different large parallel systems with up to 64-way parallelism. It will be shown that for the graph application examined the amount of additional work caused by missing synchronization is neglectible and the performance is almost always better than the approach with atomic operations.

Breadth-First Search is a graph traversal technique used in many applications as a building block, e.g., to systematically explore a search space or to determine single source shortest paths in unweighted graphs. For modern multicore processors and as application graphs get larger, well-performing parallel algorithms are favorable. In this paper, we systematically evaluate an important class of parallel algorithms for this problem and discuss programming optimization techniques for their implementation on parallel systems with shared memory. We concentrate our discussion on level-synchronous algorithms for larger multicore and multiprocessor systems. In our results, we show that for small core counts many of these algorithms show rather similar performance behavior. But, for large core counts and large graphs, there are considerable differences in performance and scalability influenced by several factors, including graph topology. This paper gives advice, which algorithm should be used under which circumstances.

In this paper, a set of micro-benchmarks is proposed to determine basic performance parameters of single-node mainstream hardware architectures for High Performance Computing. Performance parameters of recent processors, including those of accelerators, are determined. The investigated systems are Intel server processor architectures as well as the two accelerator lines Intel Xeon Phi and Nvidia graphic processors. Results show similarities for some parameters between all architectures, but significant differences for others.

Predicting the runtime of a sparse matrix-vector multiplication (SpMV) for different sparse matrix formats and thread mappings allows the dynamic selection of the most appropriate matrix format and thread mapping for a given matrix. This paper introduces two new generally applicable performance models for SpMV – for linear and non-linear relationships – based on machine learning techniques. This approach supersedes the common manual development of an explicit performance model for a new architecture or for a new format based on empirical data. The two new models are compared to an existing explicit performance model on different GPUs. Results show that the quality of performance prediction results, the ranking of the alternatives, and the adaptability to other formats/architectures of the two machine learning techniques is better than that of the explicit performance model.

In this paper, several blocking techniques are applied to matrices that do not have a strong blocked structure. The aim is to efficiently use vectorization with current CPUs, even for matrices without an explicit block structure on nonzero elements. Different approaches are known to find fixed or variable sized blocks of nonzero elements in a matrix. We present a new matrix format for 2D rectangular blocks of variable size, allowing fill-ins per block of explicit zero values up to a user definable threshold. We give a heuristic to detect such 2D blocks in a sparse matrix. The performance of a Sparse Matrix Vector Multiplication for chosen block formats is measured and compared. Results show that the benefit of blocking formats depend – as to be expected – on the structure of the matrix and that variable sized block formats can have advantages over fixed size formats.

In this paper, a set of micro-benchmarks is proposed to determine basic performance parameters of single-node mainstream hardware architectures for High Performance Computing. Performance parameters of recent processors, including those of accelerators, are determined. The investigated systems are Intel server processor architectures and the two accelerator lines Intel Xeon Phi and Nvidia graphic processors. Additionally, the performance impact of thread mapping on multiprocessors and Intel Xeon Phi is shown. The results show similarities for some parameters between all architectures, but significant differences for others.

SpMV Runtime Improvements with Program Optimization Techniques on Different Abstraction Levels
(2016)

The multiplication of a sparse matrix with a dense vector is a performance critical computational kernel in many applications, especially in natural and engineering sciences. To speed up this operation, many optimization techniques have been developed in the past, mainly focusing on the data layout for the sparse matrix. Strongly related to the data layout is the program code for the multiplication. But even for a fixed data layout with an accommodated kernel, there are several alternatives for program optimizations. This paper discusses a spectrum of program optimization techniques on different abstraction layers for six different sparse matrix data format and kernels. At the one end of the spectrum, compiler options can be used that hide from the programmer all optimizations done by the compiler internally. On the other end of the spectrum, a multiplication kernel can be programmed that use highly sophisticated intrinsics on an assembler level that ask for a programmer with a deep understanding of processor architectures. These special instructions can be used to efficiently utilize hardware features in processors like vector units that have the potential to speed up sparse matrix computations. The paper compares the programming effort and required knowledge level for certain program optimizations in relation to the gained runtime improvements.

The Sparse Matrix-Vector Multiplication (SpMV) is an important building block in High Performance Computing. Performance improvements for the SpMV are often gained by the development of new optimized sparse matrix formats either by utilizing special sparsity patterns of a matrix or by taking bottlenecks of a hardware architecture into account. In this work a requirements analysis is done for sparse matrix formats with an emphasis on the parallel SpMV for large general sparse matrices. Based on these requirements, three new sparse matrix formats were developed, each combining several optimization techniques and addressing different optimization goals/hardware architectures. The CSR5 Bit Compressed (CSR5BC) format is an extension to the existing CSR5 format and optimized for GPUs. The other two formats, Hybrid Compressed Slice Storage (HCSS) and Local Group Compressed Sparse Row (LGCSR), are new formats optimized for multi-core/-processor architectures including the Xeon Phi Knights Landing. Results show that all three storage formats deliver good parallel SpMV performance on their target architectures over a large set of test matrices compared to other well performing formats in vendor and research libraries.

The Sparse Matrix Vector Multiplication is an important operation on sparse matrices. This operation is the most time consuming operation in iterative solvers and therefore an efficient execution of that operation is of great importance for many applications. Numerous different storage formats that store sparse matrices efficiently have already been established. Often, these storage formats utilize the sparsity pattern of a matrix in an appropiate manner. For one class of sparse matrices the nonzero values occur in small dense blocks and appropriate block storage formats are well suited for such patterns. But on the other side, these formats perform often poor on general matrices without an explicit / regular block structure. In this paper, the newly developed sparse matrix format DynB is introduced. The aim is to efficiently use several optimization approaches and vectorization with current processors, even for matrices without an explicit block structure of nonzero elements. The DynB matrix format uses 2D rectangular blocks of variable size, allowing fill-ins per block of explicit zero values up to a user controllable threshold. We give a simple and fast heuristic to detect such 2D blocks in a sparse matrix. The performance of the Sparse Matrix Vector Multiplication for a selection of different block formats and matrices with different sparsity structures is compared. Results show that the benefit of blocking formats depend – as to be expected – on the structure of the matrix and that variable sized block formats like DynB can have advantages over fixed size formats and deliver good performance results even for general sparse matrices.

Improving the Performance of Parallel SpMV Operations on NUMA Systems with Adaptive Load Balancing
(2018)

For a parallel Sparse Matrix Vector Multiply (SpMV) on a multiprocessor, rather simple and efficient work distributions often produce good results. In cases where this is not true, adaptive load balancing can improve the balance and performance. This paper introduces a low overhead framework for adaptive load balancing of parallel SpMV operations. It uses statistical filters to gather relevant runtime performance data and detects an imbalance situation. Three different algorithms were compared that adaptively balance the load with high quality and low overhead. Results show that for sparse matrices, where the adaptive load balancing was enabled, an average speedup of 1.15 (regarding the total execution time) could be achieved with our best algorithm over 4 different matrix formats and two different NUMA systems.

Although processor speed and memory bandwidth as well as capacities of persistent storage devices evolved rapidly in the last years, the bandwidth between memory and persistent storage devices could not match that pace. As current scientific applications tend to process an enormous amount of data at runtime, the access to a local disk might become the main performance bottleneck.
The communication infrastructure for local area networks has also evolved rapidly, thus modern filesystems for supercomputing are using storage area network solutions to distribute the load of application i/o to several special purpose i/o nodes. These san solutions, however, are often bound to a specific organizational structure, such as different locations of the same company or several institutes of a university. This often implies a common user base and accounting information at each site. In a highly variant grid environment these demands might be hard to meet.
This paper describes the definition of two adio devices for romio, a publicly available mpii/o implementation, to provide transparent access to remote parallel i/o and additionally access to remote i/o on files in the memory of a remote server. The architecture of these devices allows for a definition of remote servers on a per job basis, and can be configured by the user before runtime.

With growing computational power of current supercomputers, scientific computing applications can work on larger problems. The corresponding increase in dataset size is often correlated to an increase in needed storage for the results. Current storage area networks (sans) balance i/o load on multiple disks using high speed networks, but are integrated on the operating system level, demanding administrative intervention if the usage topology changes. While this is practical for single sites or fairly static grid environments, it is hard to extend to a user defined per-job basis. Reconfigurable grid environments, where computing and storage resources are coupled on a per-job basis, need a more flexible approach for parallel i/o on remote locations.
This paper gives a detailed overview of the abilities of the transparent remote access provided by tunnelfs, a part of the viola parallel i/o project. We show how tunnelfs manages flexible and transparent access to remote i/o resources in a reconfigurable grid environment, supporting the definition of the amount and location of persistent storage services on a per-job basis.

The enormous advance in computational power of supercomputers enables scientific applications to process problems of increasing size. This is often correlated with an increasing amount of data stored in (parallel) filesystems. As the increase in bandwith of common disk based I/O devices can not keep up with the evolution of computational power, the access to this data becomes the bottleneck in many applications. memfs takes the approach to distribute I/O data among multiple dedicated remote servers on a user-level basis. It stores files in the accumulated main memory of these I/O nodes and is able to deliver this data with high bandwidth.
We describe how memfs manages a memory based distributed filesystem, how it stores data among the participating I/O servers and how it assigns servers to application clients. Results are given for a usage in a grid project with high-bandwidth wan connections.

Level-Synchronous Parallel Breadth-First Search Algorithms For Multicore and Multiprocessor Systems
(2014)

Breadth-First Search (BFS) is a graph traversal technique used in many applications as a building block, e.g.,~to systematically explore a search space. For modern multicore processors and as application graphs get larger, well-performing parallel algorithms are favourable. In this paper, we systematically evaluate an important class of parallel BFS algorithms and discuss programming optimization techniques for their implementation. We concentrate our discussion on level-synchronous algorithms for larger multicore and multiprocessor systems. In our results, we show that for small core counts many of these algorithms show rather similar behaviour. But, for large core counts and large graphs, there are considerable differences in performance and scalability influenced by several factors. This paper gives advice, which algorithm should be used under which circumstances.

Updating a shared data structure in a parallel program is usually done with some sort of high-level synchronization operation to ensure correctness and consistency. However, underlying synchronization instructions in a processor architecture are costly and rather limited in their scalability on larger multi-core/multi-processors systems. In this paper, we examine work queue operations where such costly atomic update operations are replaced with non-atomic modifiers (simple read+write). In this approach, we trade the exact amount of work with atomic operations against doing more and redundant work but without atomic operations and without violating the correctness of the algorithm. We show results for the application of this idea to the concrete scenario of parallel Breadth First Search (BFS) algorithms for undirected graphs on two large NUMA shared memory system with up to 64 cores.

Parallel systems leverage parallel file systems to efficiently perform I/O to shared files. These parallel file systems utilize different client-server communication and file data distribution strategies to optimize the access to data stored in the file system. In many parallel file systems, clients access data that is striped across multiple I/O devices or servers. Striping, however, results in poor access performance if the application generates a different stride pattern. This work analyzes optimization approaches of different parallel file systems and proposes new strategies for the mapping of clients to servers and the distribution of file data with special respect to strided data access. We evaluate the results of a specific approach in a parallel file system for main memory.