Refine
H-BRS Bibliography
- yes (20)
Departments, institutes and facilities
Document Type
- Conference Object (12)
- Article (5)
- Part of a Book (1)
- Report (1)
- Working Paper (1)
Year of publication
Keywords
- SpMV (4)
- parallel breadth-first search (3)
- BFS (2)
- Blocking (2)
- GPU (2)
- Intel Xeon Phi (2)
- NUMA (2)
- Nvidia graphic processors (2)
- Single Instruction Multiple Data (SIMD) (2)
- Sparse Matrix Vector Multiplication (2)
Updating a shared data structure in a parallel program is usually done with some sort of high-level synchronization operation to ensure correctness and consistency. However, underlying synchronization instructions in a processor architecture are costly and rather limited in their scalability on larger multi-core/multi-processors systems. In this paper, we examine work queue operations where such costly atomic update operations are replaced with non-atomic modifiers (simple read+write). In this approach, we trade the exact amount of work with atomic operations against doing more and redundant work but without atomic operations and without violating the correctness of the algorithm. We show results for the application of this idea to the concrete scenario of parallel Breadth First Search (BFS) algorithms for undirected graphs on two large NUMA shared memory system with up to 64 cores.
Parallel systems leverage parallel file systems to efficiently perform I/O to shared files. These parallel file systems utilize different client-server communication and file data distribution strategies to optimize the access to data stored in the file system. In many parallel file systems, clients access data that is striped across multiple I/O devices or servers. Striping, however, results in poor access performance if the application generates a different stride pattern. This work analyzes optimization approaches of different parallel file systems and proposes new strategies for the mapping of clients to servers and the distribution of file data with special respect to strided data access. We evaluate the results of a specific approach in a parallel file system for main memory.
Level-Synchronous Parallel Breadth-First Search Algorithms For Multicore and Multiprocessor Systems
(2014)
Breadth-First Search (BFS) is a graph traversal technique used in many applications as a building block, e.g.,~to systematically explore a search space. For modern multicore processors and as application graphs get larger, well-performing parallel algorithms are favourable. In this paper, we systematically evaluate an important class of parallel BFS algorithms and discuss programming optimization techniques for their implementation. We concentrate our discussion on level-synchronous algorithms for larger multicore and multiprocessor systems. In our results, we show that for small core counts many of these algorithms show rather similar behaviour. But, for large core counts and large graphs, there are considerable differences in performance and scalability influenced by several factors. This paper gives advice, which algorithm should be used under which circumstances.
Abschlussbericht zum BMBF-Fördervorhaben Enabling Infrastructure for HPC-Applications (EI-HPC)
(2020)
The SpMV operation -- the multiplication of a sparse matrix with a dense vector -- is used in many simulations in natural and engineering sciences as a computational kernel. This kernel is quite performance critical as it is used, e.g.,~in a linear solver many times in a simulation run. Such performance critical kernels of a program may be optimized on certain levels, ranging from using a rather coarse grained and comfortable single compiler optimization switch down to utilizing architecural features by explicitly using special instructions on an assembler level. This paper discusses a selection of such program optimization techniques in this spectrum applied to the SpMV operation. The achievable performance gain as well as the additional programming effort are discussed. It is shown that low effort optimizations can improve the performance of the SpMV operation compared to a basic implementation. But further than that, more complex low level optimizations have a higher impact on the performance, although changing the original program and the readability / maintainability of a program significantly.
Breadth-First Search is a graph traversal technique used in many applications as a building block, e.g., to systematically explore a search space or to determine single source shortest paths in unweighted graphs. For modern multicore processors and as application graphs get larger, well-performing parallel algorithms are favorable. In this paper, we systematically evaluate an important class of parallel algorithms for this problem and discuss programming optimization techniques for their implementation on parallel systems with shared memory. We concentrate our discussion on level-synchronous algorithms for larger multicore and multiprocessor systems. In our results, we show that for small core counts many of these algorithms show rather similar performance behavior. But, for large core counts and large graphs, there are considerable differences in performance and scalability influenced by several factors, including graph topology. This paper gives advice, which algorithm should be used under which circumstances.
Improving the Performance of Parallel SpMV Operations on NUMA Systems with Adaptive Load Balancing
(2018)
For a parallel Sparse Matrix Vector Multiply (SpMV) on a multiprocessor, rather simple and efficient work distributions often produce good results. In cases where this is not true, adaptive load balancing can improve the balance and performance. This paper introduces a low overhead framework for adaptive load balancing of parallel SpMV operations. It uses statistical filters to gather relevant runtime performance data and detects an imbalance situation. Three different algorithms were compared that adaptively balance the load with high quality and low overhead. Results show that for sparse matrices, where the adaptive load balancing was enabled, an average speedup of 1.15 (regarding the total execution time) could be achieved with our best algorithm over 4 different matrix formats and two different NUMA systems.