Libraries


Authors: Maciej Cytowski, Maciej Szpindler
Interdisciplinary Centre for Mathematical and Computational Modeling, University of Warsaw, Poland

Abstract: Modern high performance computing architectures are based on multi-core and multi-threaded computing nodes. Mixed MPI and OpenMP programming is currently the reference model for obtaining high scalability on large computing systems. In such a model, MPI processes contain many OpenMP parallel regions. The scalability and performance of those parallel regions may differ between computing systems and between individual runs of the code. Users of HPC systems can very often control the number of threads used by different OpenMP regions only through a single environment variable, OMP_NUM_THREADS. In this work we present a tool called SOMPARlib, which is based on the OpenMP Monitoring Interface (POMP) and is capable of controlling the execution of the various OpenMP parallel regions in a computational code at run time. The tool is particularly useful on architectures that introduce multithreading mechanisms such as Simultaneous Multithreading (SMT) or Hyper-Threading (HT).
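As a minimal sketch of the idea — a separate thread count for each named parallel region instead of one global OMP_NUM_THREADS — consider the following Python analogue. The region names and the policy table are hypothetical illustrations; this is not the SOMPARlib or POMP API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-region policy: each named parallel region gets its own
# thread count, instead of one global setting such as OMP_NUM_THREADS.
REGION_THREADS = {"assembly": 4, "solver": 2}  # assumed region names

def run_region(name, work_items, worker):
    """Execute one 'parallel region' with the thread count its policy assigns."""
    n_threads = REGION_THREADS.get(name, 1)  # unknown regions run serially
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(worker, work_items))

# The 'solver' region runs with 2 worker threads, 'assembly' with 4.
results = run_region("solver", [1, 2, 3, 4], lambda x: x * x)
```

In the real tool, the mapping from region to thread count would be tuned per system and per run, which is exactly the flexibility a single environment variable cannot provide.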

Download paper: PDF


Authors: Miroslaw Kupczyk, Damian Kaliszan, Vincent Bergeaud
Poznan Supercomputing and Networking Center, Dabrowskiego 79a, 60-529 Poznan
CEA Saclay, DM2S, STMF, LGLS, 91191 Gif-sur-Yvette Cedex, France

Abstract: The work undertaken in the PRACE project consists of making the software evolve so that it is suitable for exploiting designs of experiments involving thousands of cores in multiple contexts: serial codes, parallel codes, and coupled simulations. The main work, focused on the serial-code context, is to optimize the application to process a larger number of cores and the corresponding amount of input data more efficiently. We rely on a strategy that uses the fork mechanism provided by the Linux kernel to deploy different computations and check the finalization of each of the forked processes. Due to URANIE's internal architecture, which makes a single computation indivisible, it is difficult to properly examine the scalability of the system in terms of the strong and weak scalability definitions. However, promising results were obtained after tens of test runs. The work carried out in this project allowed URANIE codes to run on thousands of cores on a Tier-0 architecture. The scenario of processing a huge amount of data on a very limited number of cores was the starting point and the reason to tackle the optimization. After enhancements to a certain part of the code, it was tested on up to 4096 cores, the maximum core count we had been granted access to. The tests showed that the URANIE code is ready to be run on a Tier-0 machine.
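The fork-and-wait deployment strategy described in the abstract can be sketched in Python on a POSIX system. The `deploy` helper and the placeholder computations are illustrative assumptions, not URANIE code:

```python
import os

# Sketch of fork-based deployment: each computation runs in its own child
# process created with fork(); the parent then checks the finalization
# (exit status) of every forked process.
def deploy(computations):
    pids = {}
    for idx, compute in enumerate(computations):
        pid = os.fork()
        if pid == 0:            # child: run one computation, then exit
            compute()
            os._exit(0)         # exit code 0 signals successful finalization
        pids[pid] = idx         # parent: remember which job this child runs
    exit_codes = {}
    for _ in pids:
        pid, status = os.wait()                  # wait for any child to finish
        exit_codes[pids[pid]] = os.WEXITSTATUS(status)
    return exit_codes                            # job index -> exit code

# Two placeholder computations standing in for independent serial runs.
codes = deploy([lambda: None, lambda: None])
```

Because each computation lives in its own process, a crash in one run cannot corrupt the others, and the parent can account for every job by inspecting its exit code.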


Authors: Peicho Petkov, Elena Lilkova, Stoyan Markov, Damyan Grancharov, Nevena Ilieva, Leandar Litov, National Centre for Supercomputing Applications, Bulgaria

Abstract: A library implementing the AGBNP2 [1,2] implicit solvation model was developed, which calculates the nonbonded interactions of a solute with the solvent — implicit water. The library is to be used in Molecular Dynamics (MD) packages for estimating solvation free energies and studying hydration effects. The library was based on a previously developed Fortran code [3]. The code presented in this paper was rewritten in C and parallelized with OpenMP. The main objective was to parallelize the code in a way that allows it to run efficiently on Intel Xeon Phi in native mode. However, no efficient utilization of the MIC was observed, as the performance obtained on the coprocessor was an order of magnitude lower than on the CPU.

Download PDF



Authors: Valentin Pavlov, Nikola Andonov, Georgi Kremenliev, National Center for Supercomputing Applications, Bulgaria

Abstract: ExaFMM is a highly scalable implementation of the Fast Multipole Method (FMM) – an O(N) algorithm for solving N-body interactions with applications in gravitational and electrostatic simulations. The authors report scaling on large systems with O(100k) cores, with support for MPI, OpenMP and SIMD vectorization. The library also includes GPU kernels capable of running on a multi-GPU system. The objective of the project is to enable the use of the ExaFMM solver on the MIC architecture by performing porting, verification and scalability testing, and by providing configuration suggestions to its potential users.

Download PDF


Authors: M. Serdar Celebi (a,c), Ahmet Duran (b,a,*), Mehmet Tuncel (a,c), Bora Akaydın (a,c), Figen Oztoprak (a,c)
(a) Istanbul Technical University, National Center for High Performance Computing of Turkey (UHeM), Istanbul 34469, Turkey
(b) Istanbul Technical University, Department of Mathematics, Istanbul 34469, Turkey
(c) Istanbul Technical University, Informatics Institute, Istanbul 34469, Turkey
July 11, 2013

Abstract: SuperLU_DIST is a distributed-memory parallel solver for sparse linear systems. The solver makes several calls to BLAS library routines in its numerical factorization phase. The performance of the BLAS library can significantly affect the overall performance of the solver, as the required BLAS operations are typically computationally dense. In this regard, we examine how the overall performance of the SuperLU_DIST solver can be improved by employing optimized BLAS libraries. In particular, we try the Intel Math Kernel Library (MKL) and the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) library. Using MKL provides a performance improvement of approximately 50%, and using PLASMA improves performance by around 10% for the best tile size. Based on our findings, we have improved the SuperLU_MCDT solver.

Download PDF


Authors: Andrew Sunderland (a), Stephen Pickles (a), Milos Nikolic (b), Aleksandar Jovic (b), Josip Jakic (b), Vladimir Slavnic (b), Ivan Girotto (c), Peter Nash (c), Michael Lysaght (c)
(a) STFC Daresbury Laboratory, Warrington, UK
(b) Institute of Physics, Belgrade, Serbia
(c) Irish Centre for High-End Computing, Dublin, Ireland

Abstract: The Fast Fourier Transform (FFT) is one of the most widely used algorithms in engineering and scientific applications, and its analysis and performance are therefore of importance to a range of research fields. On PRACE Tier-0 systems, a parallel environment with a great deal of processing power (a large number of CPU cores, or accelerators such as GPUs) is at researchers' disposal. The FFTs investigated are both in-code and from various numerical libraries, where the algorithm is implemented in both serial and parallel form. The FFT implementations investigated range from pure MPI and OpenMP versions for multicore to hybrid (OpenMP/MPI) and GPU-based ones. The objective of this project is to assess the suitability, performance and scalability of various implementations of FFT for the PRACE large-scale scientific applications Quantum ESPRESSO and DL_POLY.

Download PDF


Authors: Kadir Akbudak, Enver Kayaaslan, Cevdet Aykanat
Computer Engineering Department, Bilkent University, Ankara, Turkey

Abstract: Sparse matrix-vector multiplication (SpMxV) is a kernel operation widely used in iterative linear solvers, in which the same sparse matrix is multiplied by a dense vector repeatedly. Matrices with irregular sparsity patterns make it difficult to utilize cache locality effectively in SpMxV computations. In this work, we investigate single- and multiple-SpMxV frameworks for exploiting cache locality in SpMxV computations. For the single-SpMxV framework, we propose two cache-size-aware top-down row/column-reordering methods based on 1D and 2D sparse matrix partitioning, utilizing the column-net and enhancing the row-column-net hypergraph models of sparse matrices. The multiple-SpMxV framework depends on splitting a given matrix into a sum of multiple nonzero-disjoint matrices, so that the SpMxV operation is performed as a sequence of multiple input- and output-dependent SpMxV operations. For the effective matrix splitting required in this framework, we propose a cache-size-aware top-down approach based on 2D sparse matrix partitioning, utilizing the row-column-net hypergraph model. The primary objective in all three methods is to maximize the exploitation of temporal locality. We evaluate the validity of our models and methods on a wide range of sparse matrices through actual runs using OSKI (the Optimized Sparse Kernel Interface library) [17]. Experimental results show that the proposed methods and models outperform state-of-the-art schemes.
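The multiple-SpMxV idea — splitting A into nonzero-disjoint pieces and applying them in sequence — can be illustrated with a toy Python sketch. The split below is arbitrary; the paper chooses it with a cache-size-aware hypergraph method, which is not reproduced here.

```python
# Toy illustration: y = A*x computed as a sequence of SpMxV operations over
# nonzero-disjoint pieces with A = A1 + A2. Matrices are in coordinate form:
# lists of (row, col, value) triples.
def spmxv(triples, x, y):
    for i, j, v in triples:      # accumulate into y, so pieces can be chained
        y[i] += v * x[j]
    return y

A = [(0, 0, 2.0), (0, 2, 1.0), (1, 1, 3.0), (2, 0, 4.0), (2, 2, 5.0)]
# An arbitrary nonzero-disjoint split (the paper instead picks pieces whose
# touched x- and y-entries fit in cache).
A1, A2 = A[:3], A[3:]

x = [1.0, 2.0, 3.0]
y_direct = spmxv(A, x, [0.0] * 3)
y_split  = spmxv(A2, x, spmxv(A1, x, [0.0] * 3))   # same result, two passes
```

Because the pieces are nonzero-disjoint and the partial products accumulate into the same output vector, the sequence of smaller SpMxV operations yields exactly the result of the single large one.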

Download PDF


Authors: M. Serdar Celebi (a,c), Ahmet Duran (b,a,*), Mehmet Tuncel (a,c), Bora Akaydın (a,c)
(a) Istanbul Technical University, National Center for High Performance Computing of Turkey (UHeM), Istanbul 34469, Turkey
(b) Istanbul Technical University, Department of Mathematics, Istanbul 34469, Turkey
(c) Istanbul Technical University, Informatics Institute, Istanbul 34469, Turkey
July 13, 2012

Abstract: We consider a fast, robust and scalable solver using graphics processing units (GPUs) as accelerators for a sparse linear system AX = B. In this project, a new parallel hybrid direct solver is designed and implemented on GPUs for large sparse linear systems. In particular, we work on GPU programming using directive-based OpenACC in order to obtain a scalable and improved SuperLU on CPU+GPU heterogeneous systems.

Download PDF


Authors: Ahmet Duran (a,b,*), M. Serdar Celebi (a,c), Mehmet Tuncel (a,c), Bora Akaydın (a,c)
(a) Istanbul Technical University, National Center for High Performance Computing of Turkey (UHeM), Istanbul 34469, Turkey
(b) Istanbul Technical University, Department of Mathematics, Istanbul 34469, Turkey
(c) Istanbul Technical University, Informatics Institute, Istanbul 34469, Turkey

Abstract: One-dimensional (1D) partitioning of sparse matrices results in lower quality partitioning than two-dimensional (2D) partitionings in the context of parallel sparse matrix vector multiply (SpMxV) operations. However, 2D partitioning schemes incur two communication phases. We propose a novel sparse matrix partitioning scheme which achieves a single communication phase as in 1D schemes and partitions the nonzeros with a flexibility close to that in 2D schemes. In the proposed partitioning scheme, a nonzero is assigned to either the receiver or the sender processor associated with the related input- or output-vector entries. We observe that a fine-grain partitioning should satisfy this constraint for most of the nonzeros. Based on this observation, we propose a simple yet effective heuristic to obtain such a partition in two steps. In the first step, we obtain partitions on the rows, columns, and nonzeros of the given matrix using some known methods. In the second step, we refine the nonzero partition such that the aforementioned constraint is met while keeping the row and column partitions obtained in the first step intact. We demonstrate that the proposed partitioning scheme improves the performance of parallel SpMxV operations both in theory and practice with respect to 1D and 2D partitionings.

Download PDF


Authors: Enver Kayaaslan (a), Bora Ucar (b), Ozcan Ozturk (a), Cevdet Aykanat (a)
(a) Bilkent University, Computer Engineering Department, 06800 Ankara, Turkey
(b) CNRS, Lyon, France

Abstract: One-dimensional (1D) partitioning of sparse matrices results in lower quality partitioning than two-dimensional (2D) partitionings in the context of parallel sparse matrix vector multiply (SpMxV) operations. However, 2D partitioning schemes incur two communication phases. We propose a novel sparse matrix partitioning scheme which achieves a single communication phase as in 1D schemes and partitions the nonzeros with a flexibility close to that in 2D schemes. In the proposed partitioning scheme, a nonzero is assigned to either the receiver or the sender processor associated with the related input- or output-vector entries. We observe that a fine-grain partitioning should satisfy this constraint for most of the nonzeros. Based on this observation, we propose a simple yet effective heuristic to obtain such a partition in two steps. In the first step, we obtain partitions on the rows, columns, and nonzeros of the given matrix using some known methods. In the second step, we refine the nonzero partition such that the aforementioned constraint is met while keeping the row and column partitions obtained in the first step intact. We demonstrate that the proposed partitioning scheme improves the performance of parallel SpMxV operations both in theory and practice with respect to 1D and 2D partitionings.

Download PDF


Authors: Seher Acer, Enver Kayaaslan, Tugrul Dayar, Cevdet Aykanat
Bilkent University, Computer Engineering Department, 06800 Ankara, Turkey

Abstract: In this whitepaper, we describe the problem of permuting sparse square matrices into block diagonal form with overlap (BDO) and propose a graph partitioning algorithm for solving this problem. A block diagonal matrix with overlap is a block diagonal matrix whose consecutive diagonal blocks may overlap. The objective in this permutation problem is to minimize the total overlap size, whereas the permutation constraint is to maintain balance on the number of nonzeros in the diagonal blocks. This permutation problem arises in the parallelization of an explicit formulation of the multiplicative Schwarz preconditioner. We define the ordered Graph Partitioning by Vertex Separator (oGPVS) problem as equivalent to this permutation problem. The oGPVS problem is a restricted version of the Graph Partitioning by Vertex Separator (GPVS) problem; the aim is to find a partition of the vertices into K ordered vertex parts and K-1 ordered separators such that two consecutive parts can be connected only through a separator, a separator can connect only two consecutive parts, and two consecutive separators can be adjacent. The objective in the oGPVS problem is to minimize the total number of vertices in the separators, whereas the partitioning constraint is to maintain balance on the part weights, where a part weight is defined as the sum of the weights of the vertices in that part. To solve the oGPVS problem, our proposed algorithm utilizes the recursive bipartitioning paradigm and fixed vertices. We tested the performance of our algorithm on a wide range of matrices in comparison to another graph partitioning algorithm that solves the same problem. The results show that the oGPVS algorithm performs better than the other algorithm in terms of overlap size.

Download PDF


Authors: Mustafa Gundogan (a), Cevdet Aykanat (a), Murat Manguoglu (b)
(a) Bilkent University, Computer Engineering Department, 06800 Ankara, Turkey
(b) Middle East Technical University, Computer Engineering Department, 06800 Ankara, Turkey

Abstract: In this whitepaper, we review the state-of-the-art hybrid solver, which uses generalized-form DS factorization, for solving systems of equations of the form Ax = f, and this solver's relations to graphs and hypergraphs. We investigate two different reordering strategies for the DS factorization preconditioning scheme: reordering via graph partitioning (GP) and reordering via hypergraph partitioning (HP). In the GP scheme, the partitioning objective of minimizing the edge cutsize corresponds to minimizing the total number of nonzeros in the off-diagonal blocks of the reordered matrix. In the HP scheme, the partitioning objective of minimizing the cutsize, according to the cut-net metric, corresponds to minimizing the total number of nonzero columns in the off-diagonal blocks of the reordered matrix. In both schemes, the partitioning constraint of maintaining balance on the part weights corresponds to maintaining balance on the nonzero counts of the diagonal blocks of the reordered matrix. The partitioning objective of GP relates to minimizing the number of nonzeros in the reduced system, whereas the partitioning objective of HP exactly models minimizing the size of the reduced system. We tested the performance of the two partitioning schemes on a wide range of matrices for 4-, 8-, 16-, 32-, and 64-way permutations. The results show that the HP scheme performs better than the GP scheme in terms of solution times.

Download PDF


Authors: Dimitri Lecas (a), Richard Chevalier (a), Pascal Joly (b)
(a) IDRIS, Bat 506 BP 167, 91403 ORSAY CEDEX, FRANCE
(b) Laboratoire Jacques-Louis Lions, 4 place Jussieu, 75252 PARIS CEDEX 5, FRANCE

Abstract: The goal of this work is to evaluate how to parallelize block cyclic reduction using MPI and OpenMP. This algorithm can solve elliptic problems much faster than traditional iterative methods. We explain the parallelism we exploit and show the performance of the MPI and OpenMP versions that we have developed.
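For readers unfamiliar with the algorithm, here is a serial Python sketch of cyclic reduction for a scalar tridiagonal system; the block variant parallelized in the paper follows the same pattern with matrix blocks in place of scalars. The system size and test data are illustrative assumptions.

```python
# Serial cyclic reduction for a tridiagonal system Ax = d with sub-, main-
# and super-diagonals a, b, c (a[0] and c[-1] are unused and taken as zero).
# n must equal 2**k - 1. At each level, every second remaining unknown is
# eliminated; all eliminations within a level are independent, which is the
# parallelism exploited in MPI/OpenMP implementations.
def cyclic_reduction(a, b, c, d):
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)   # work on copies
    k = (n + 1).bit_length() - 1                      # n == 2**k - 1
    for lvl in range(k - 1):                          # forward reduction
        off = 2 ** lvl
        for i in range(2 * off - 1, n, 2 * off):
            al = a[i] / b[i - off]
            ga = c[i] / b[i + off]
            b[i] -= al * c[i - off] + ga * a[i + off]
            d[i] -= al * d[i - off] + ga * d[i + off]
            a[i] = -al * a[i - off]
            c[i] = -ga * c[i + off]
    x = [0.0] * n
    for lvl in range(k - 1, -1, -1):                  # back substitution
        off = 2 ** lvl
        for i in range(off - 1, n, 2 * off):
            xl = x[i - off] if i - off >= 0 else 0.0
            xr = x[i + off] if i + off < n else 0.0
            x[i] = (d[i] - a[i] * xl - c[i] * xr) / b[i]
    return x
```

With n unknowns, the reduction finishes in O(log n) levels, each level touching half the equations of the previous one — the source of its speed on elliptic problems compared with traditional iterative methods.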

Download PDF


Author: Chandan Basu
National Supercomputer Center, Linkoping University, Linkoping 581 83, Sweden

Abstract: A new MPI_Alltoallv algorithm is tested on the EuroBen kernel routine mod2f. The new MPI_Alltoallv routine utilizes the high bandwidth within a node more effectively and puts less stress on inter-node bandwidth. The all-to-all exchange is achieved first by inter-node communication, followed by intra-node communication. A comparison of the modified mod2f with the original mod2f is made for different data sizes and run sizes on the CURIE system. The results show a substantial improvement using the modified MPI_Alltoallv on the CURIE system.
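The two-stage scheme can be illustrated with a plain-Python simulation, with no MPI involved. The grouping of ranks into nodes and the routing rule (each rank exchanges node-aggregated bundles with the rank holding the same local index on every other node) are assumptions for illustration, not the paper's exact implementation.

```python
# Simulate an all-to-all personalized exchange done in two stages, in the
# spirit of the modified MPI_Alltoallv: stage 1 moves node-aggregated bundles
# between nodes (one peer per node, same local rank); stage 2 redistributes
# the bundles within each node over the fast intra-node links.
def hierarchical_alltoall(send, ranks_per_node):
    P, R = len(send), ranks_per_node
    nodes = P // R
    inbox = [[] for _ in range(P)]           # stage 1: inter-node bundles
    for src in range(P):
        local = src % R                      # local rank on src's node
        for m in range(nodes):
            peer = m * R + local             # same local rank on node m
            for dst in range(m * R, (m + 1) * R):
                inbox[peer].append((src, dst, send[src][dst]))
    recv = [[None] * P for _ in range(P)]    # stage 2: intra-node shuffle
    for r in range(P):
        for src, dst, msg in inbox[r]:       # dst is always on r's own node
            recv[dst][src] = msg
    return recv

def direct_alltoall(send):                   # reference: one-phase exchange
    P = len(send)
    return [[send[src][dst] for src in range(P)] for dst in range(P)]
```

Both routes deliver the same data; the hierarchical one simply ensures that each inter-node link carries one aggregated bundle per node pair instead of many small per-rank messages.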

Download PDF


Author: Andrew Sunderland
STFC Daresbury Laboratory, Warrington, UK

Abstract: Parallel eigensolver operations are at the computational core of many large-scale scientific and engineering application codes. This project analyses the parallel performance of established and newly developed parallel dense symmetric eigensolver numerical library routines on PRACE Tier-0 systems, using real datasets from large-scale application codes. This whitepaper builds upon the research report ‘Dense Linear Algebra Performance Analysis on the Fujitsu BX900 OPL Machine’ (by the same author), which can be found at: http://www.openpetascale.org/docume…. This version of the report is updated with results from ScaLAPACK and from the recently released ELPA library routines on the PRACE Tier-0 machines CURIE and JUGENE.

Download PDF


Disclaimer

These whitepapers have been prepared by the PRACE Implementation Phase Projects and in accordance with the Consortium Agreements and Grant Agreements n° RI-261557, n° RI-283493, or n° RI-312763.

They solely reflect the opinion of the parties to such agreements on a collective basis in the context of the PRACE Implementation Phase Projects and to the extent foreseen in such agreements. Please note that even though all participants to the PRACE IP Projects are members of PRACE AISBL, these whitepapers have not been approved by the Council of PRACE AISBL and therefore do not emanate from it nor should be considered to reflect PRACE AISBL’s individual opinion.

Copyright notices

© 2014 PRACE Consortium Partners. All rights reserved. This document is a project document of a PRACE Implementation Phase project. All contents are reserved by default and may not be disclosed to third parties without the written consent of the PRACE partners, except as mandated by the European Commission contracts RI-261557, RI-283493, or RI-312763 for reviewing and dissemination purposes.

All trademarks and other rights on third party products mentioned in the document are acknowledged as own by the respective holders.
