Application scalability


Quantum MD applications


Authors: Soon-Heum Koa, Simen Reine, Thomas Kjærgaard
National Supercomputing Centre, Linkoping University, 581 83 Linkoping, Sweden
Centre for Theoretical and Computational Chemistry, Department of Chemistry, Oslo University, Postbox 1033, Blindern, 0315, Oslo, Norway
qLEAP – Center for Theoretical Chemistry, Department of Chemistry, Aarhus University, Langelandsgade 140, Aarhus C, 8000, Denmark

Abstract: In this paper, we present the performance of LSDALTON’s DFT method in large molecular simulations of biological interest. We primarily focus on evaluating the performance gain obtained by applying the density fitting (DF) scheme and the auxiliary density matrix method (ADMM). The enabling effort went towards finding the right build environment (the combination of compiler, MPI library and supporting libraries) that generates a fully 64-bit integer-based binary. Using three biological molecules of varying size, we verify that the DF and ADMM schemes provide a substantial performance gain in the DFT code, at the cost of large memory consumption to store extra matrices and a slight change in scalability characteristics for the ADMM calculation. In the insulin simulation, the parallel region of the code accelerates by 30 percent with the DF calculation and by 56 percent with the DF-ADMM calculation.

Download PDF


Authors: Mariusz Uchronski, Agnieszka Kwiecien, Marcin Gebarowski
WCSS, Wroclaw University of Technology, Wyb. Wyspianskiego 27, 50-370 Wroclaw, Poland

Abstract: CP2K is an application for atomistic and molecular simulation and, with its excellent scalability, is particularly important with regard to use on future exascale systems. The code is well parallelized using MPI and hybrid MPI/OpenMP, typically scaling well to 1 core per atom in the system. The research on CP2K done within PRACE-1IP showed that, due to the heavy use of sparse matrix multiplication for large systems, there is room for performance improvement. The main goal of this work, undertaken within PRACE-3IP, was to investigate the most time-consuming routines and port them to accelerators, particularly GPGPUs. The relevant areas of the code that can be effectively accelerated are the matrix multiplications (DBCSR library). A significant amount of work has already been done on the DBCSR library using CUDA. We focused on enabling the library on a potentially wider range of computing resources using the OpenCL and OpenACC technologies, to bring the overall application closer to exascale. We introduce the ports and promising performance results. The work done has led to the identification of a number of issues with using OpenACC in CP2K, which need to be further investigated and resolved to make the application and the technology work better together.

Download PDF


Authors: J.A. Astrom
CSC – IT Center for Science, Espoo, Finland

Abstract: NUMFRAC is a generic particle-based code for the simulation of non-linear mechanics in disordered solids. The generic theory of the code is outlined and examples are given from glacier calving and fretting. This text is to a large degree part of the publication: J. A. Åström, T. I. Riikilä, T. Tallinen, T. Zwinger, D. Benn, J. C. Moore, and J. Timonen, A particle based simulation model for glacier dynamics, The Cryosphere Discuss., 7, 921-941, 2013.

Download PDF


Authors: A. Calzolaria, C. Cavazzonib
a Istituto Nanoscienze CNR-NANO-S3, I-41125 Modena Italy
b CINECA – Via Magnanelli 6/3, 40033 Casalecchio di Reno (Bologna)

Abstract: This work regards the enabling of the Time-Dependent Density Functional Theory kernel (TurboTDDFT) of the Quantum-ESPRESSO package on petascale systems. TurboTDDFT is a fundamental tool to investigate nanostructured materials and nanoclusters, whose optical properties are determined by their electronic excited states. Enabling TurboTDDFT on petascale systems will open up the possibility to compute optical properties for large systems relevant for technological applications. Plasmonic excitations in particular are important for a large range of applications, from biological sensing and energy conversion to subwavelength waveguides. The goal of the present project was the implementation of novel strategies for reducing the memory requirements and improving the weak scalability of the TurboTDDFT code, aiming at an important improvement of the code capabilities and at the ability to study the plasmonic properties of metal nanoparticles (Ag, Au) and their dependence on the size of the system under test.

Download PDF


Authors: Massimiliano Guarrasia, Sandro Frigiob, Andrew Emersona and Giovanni Erbaccia
a CINECA, Italy
b University of Camerino, Italy

Abstract: In this paper we present part of the work carried out by CINECA in the framework of the PRACE-2IP project, aimed at studying the effect on performance of implementing a 2D domain decomposition algorithm in DFT codes that use a standard 1D (or slab) parallel domain decomposition. The performance of this new algorithm is tested on two example applications: Quantum ESPRESSO, a popular code used in materials science, and the CFD code BlowupNS.

In the first part of this paper we present the codes that we used. In the last part we show the performance increase obtained using this new algorithm.
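To illustrate the difference between the two schemes, the following minimal sketch (with illustrative grid sizes and a hypothetical process grid, not code from either application) contrasts the work assigned to one MPI rank under a 1D slab and a 2D pencil decomposition of the same grid.

```c
/* Minimal sketch (not code from Quantum ESPRESSO or BlowupNS): local work
 * assigned to one MPI rank under a 1D slab versus a 2D pencil decomposition
 * of an nx x ny x nz real-space grid. */
#include <stdio.h>

int main(void)
{
    const int nx = 128, ny = 128, nz = 128;
    const int nprocs = 1024;          /* total MPI ranks (assumed)          */
    const int pr = 32, pc = 32;       /* 2D process grid, pr * pc == nprocs */

    /* 1D slab decomposition: whole xy-planes per rank; cannot use more
     * ranks than there are planes along z. */
    int usable_1d = (nprocs < nz) ? nprocs : nz;
    int planes_per_rank = (nz + usable_1d - 1) / usable_1d;

    /* 2D pencil decomposition: split y among pr ranks and z among pc ranks,
     * so up to ny * nz ranks can be kept busy. */
    int ny_loc = (ny + pr - 1) / pr;
    int nz_loc = (nz + pc - 1) / pc;

    printf("slab  : %d usable ranks, %d plane(s) (%d pts) each\n",
           usable_1d, planes_per_rank, planes_per_rank * nx * ny);
    printf("pencil: %d usable ranks, %d x %d pencils (%d pts) each\n",
           pr * pc, ny_loc, nz_loc, nx * ny_loc * nz_loc);
    return 0;
}
```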

Download PDF



Authors: Al. Charalampidoua,b, P. Korosogloua,b, F. Ortmannc, S. Rochec
a Greek Research and Technology Network, Athens, Greece
b Scientific Computing Center, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
c Catalan Institute of Nanotechnology, Spain

Abstract: This study has focused on an application for Quantum Hall transport simulations and, more specifically, on how to overcome an initially identified potential performance bottleneck related to the I/O of wave functions. These operations are required in order to enable and facilitate continuation runs of the code. After following several implementations for performing these I/O operations in parallel (using the MPI I/O library), we show that a performance gain in the range of 1.5 – 2 can be achieved when switching from the initial POSIX-only approach to the parallel MPI I/O approach on both the CURIE and HERMIT PRACE Tier-0 systems. Moreover, we show that, because I/O throughput scales with an increasing number of cores, the overall performance of the code remains efficient up to at least 8192 processes.
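The collective pattern behind the MPI I/O approach can be sketched as follows; the file name, data layout and sizes are illustrative assumptions and not the application’s actual wave-function format.

```c
/* Hedged sketch: each rank writes its contiguous block of wave-function
 * coefficients to a shared file with collective MPI I/O. Layout and file
 * name are assumptions for illustration only. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nlocal = 1 << 20;                     /* coefficients per rank */
    double *psi = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal; ++i) psi[i] = 0.0;  /* placeholder data      */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "wavefunctions.chk",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Every rank writes at its own offset; the collective call lets the MPI
     * library aggregate requests instead of issuing one POSIX write per
     * process. */
    MPI_Offset offset = (MPI_Offset)rank * nlocal * sizeof(double);
    MPI_File_write_at_all(fh, offset, psi, nlocal, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(psi);
    MPI_Finalize();
    return 0;
}
```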

Download PDF


Authors: Martti Louhivuoria, Jussi Enkovaaraa,b
a CSC – IT Center for Science Ltd., PO Box 405, 02101 Espoo, Finland
b Aalto University, Department of Applied Physics, PO Box 11100, 00076 Aalto, Finland

Abstract: In recent years, graphical processing units (GPUs) have generated a lot of excitement in computational sciences by promising a significant increase in computational power compared to conventional processors. While this is true in many cases for small-scale computational problems that can be solved using the processing power of a single computing unit, the efficient usage of multiple GPUs in parallel over multiple interconnected computing units has been problematic. Increasingly, the real-life problems tackled by computational scientists require large-scale parallel computing and thus it is crucial that GPU-enabled software reaches good parallel scalability to reap the benefits of GPU acceleration. This is exactly what has been achieved for GPAW, a popular quantum chemistry program, by Hakala et al. in their recent work [1].

Download PDF


Authors: Simen Reinea, Thomas Kjærgaarda, Trygve Helgakera, Ole Widar Saastadb, Andrew Sunderlandc
a Centre for Theoretical and Computational Chemistry (CTCC), Department of Chemistry, University of Oslo, Oslo, Norway,
b University Center for Information Technology, University of Oslo, Oslo, Norway
c STFC Daresbury Laboratory, Warrington, United Kingdom

Abstract: Linear Scaling DALTON (LSDALTON) is a powerful molecular electronic structure program that is the focus of software optimization projects in PRACE 1IP-WP7.2 and PRACE 1IP-WP7.5. This part of the project focuses on the introduction of parallel diagonalization routines from the ScaLAPACK library into the latest MPI version of LSDALTON. The parallelization work has involved three main tasks: i) Redistribution of the matrices assembled for the SCF cycle from a serial / distributed state to the two dimensional block-cyclic data distribution used for PBLAS and ScaLAPACK; ii) Interfacing of LSDALTON data structures to parallel diagonalization routines in ScaLAPACK; iii) Performance testing to determine the favoured ScaLAPACK eigensolver methodology
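The redistribution in task (i) relies on the standard 2D block-cyclic mapping required by PBLAS and ScaLAPACK; the sketch below is a generic illustration of that mapping (not LSDALTON code), showing which process owns a given global matrix element and where it is stored locally.

```c
/* Illustrative sketch of the 2D block-cyclic mapping used by PBLAS and
 * ScaLAPACK (not LSDALTON code): for a global index i, block size nb and
 * p processes along that dimension, find the owning process coordinate and
 * the local index. */
#include <stdio.h>

static void block_cyclic(int i, int nb, int p, int *owner, int *local)
{
    int block = i / nb;           /* global block index                 */
    *owner = block % p;           /* process coordinate owning it       */
    *local = (block / p) * nb     /* local offset of that block ...     */
           + i % nb;              /* ... plus position inside the block */
}

int main(void)
{
    const int nb = 64, prow = 4, pcol = 4;   /* assumed block size and grid */
    int i = 1000, j = 777;
    int pr, pc, li, lj;

    block_cyclic(i, nb, prow, &pr, &li);
    block_cyclic(j, nb, pcol, &pc, &lj);
    printf("A(%d,%d) -> process (%d,%d), local (%d,%d)\n",
           i, j, pr, pc, li, lj);
    return 0;
}
```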

Download PDF


Authors: Fabio Affinitoa, Emanuele Cocciab, Sandro Sorellac, Leonardo Guidonib
a CINECA, Casalecchio di Reno, Italy
b Universita dell’Aquila, L’Aquila, Italy
c SISSA, Trieste, Italy

Abstract: Quantum Monte Carlo (QMC) methods are a promising technique for the study of the electronic structure of correlated molecular systems. The technical goal of the present project is to demonstrate the scalability of the TurboRVB code for a series of systems with different properties in terms of number of electrons, number of variational parameters and size of the basis set.

Download PDF


Authors: Iain Bethunea, Adam Cartera, Xu Guoa, Paschalis Korosogloub,c
a EPCC, The University of Edinburgh, James Clerk Maxwell Building, The King’s Buildings, Edinburgh EH9 3JZ, United Kingdom
b AUTH, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece,
c GRNET, Greek Research & Technology Network, L. Mesogeion 56, Athens 11527, Greece

Abstract: CP2K is a powerful materials science and computational chemistry code and is widely used by research groups across Europe and beyond. The recent addition of a linear scaling KS-DFT method within the code has made it possible to simulate systems of an unprecedented size – 1,000,000 atoms or more – making full use of petascale computing resources. Here we report on work undertaken within PRACE 1-IP WP 7.1 to port and test CP2K on Jugene, the PRACE Tier-0 BlueGene/P system. In addition, development work was performed to reduce the memory usage of a key data structure within the code, to make it more suitable for the limited-memory environment of the BlueGene/P. Finally, we present a set of benchmark results and analysis of a large test system.

Download PDF


Authors: Luigi Genovesea,b, Brice Videaua, Thierry Deutscha, Huan Tranc, Stefan Goedeckerc
a Laboratoire de Simulation Atomistique, SP2M/INAC/CEA, 17 Av. des Martyrs, 38054 Grenoble, France
b European Synchrotron Radiation Facility, 6 rue Horowitz, BP 220, 38043 Grenoble, France
c Institut fur Physik, Universitat Basel, Klingelbergstr.82, 4056 Basel, Switzerland

Abstract: Electronic structure calculations (DFT codes) are certainly among the disciplines for which an increase in computational power corresponds to an advancement in scientific results. In this report, we present the ongoing advancements of a DFT code that can run on massively parallel, hybrid and heterogeneous CPU-GPU clusters. This DFT code, named BigDFT, is delivered under the GNU-GPL license either in a stand-alone version or integrated in the ABINIT software package. Hybrid BigDFT routines were initially ported with NVidia’s CUDA language, and recently more functionalities have been added with new routines written within Khronos’ OpenCL standard. The formalism of this code is based on Daubechies wavelets, which is a systematic real-space basis set. The properties of this basis set are well suited for an extension to a GPU-accelerated environment. In addition to focusing on the performance of the MPI and OpenMP parallelisation of the BigDFT code, this report also discusses the usage of GPU resources in a complex code with different kinds of operations. A discussion of the interest of present and expected performance of hybrid-architecture computation in the framework of electronic structure calculations is also addressed.

Download PDF


Authors: Ruyman Reyesa, Iain Bethunea
a EPCC, The University of Edinburgh, James Clerk Maxwell Building, Mayfield Road, Edinburgh, EH9 3JZ,UK

Abstract: This report describes the results of a PRACE Preparatory Access Type C project to optimise the implementation of Moller-Plesset second-order perturbation theory (MP2) in CP2K, to allow it to be used efficiently on the PRACE Research Infrastructure. The work consisted of three stages: firstly, serial optimisation of several key computational kernels; secondly, an OpenMP implementation of the parallel 3D Fourier transform to support mixed-mode MPI/OpenMP use of CP2K; and thirdly, benchmarking the performance gains achieved by the new code on HERMIT for a test case representative of proposed production simulations. Consistent speedups of 8% were achieved in the integration kernel routines as a result of the serial optimisation. When using 8 OpenMP threads per MPI process, speedups of up to 10x for the 3D FFT were achieved, and for some combinations of MPI processes and OpenMP threads, overall speedups of 66% for the whole code were measured. As a result of this work, a proposal for full PRACE Project Access has been submitted.

Download PDF


Authors: Peicho Petkova, Petko Petkovb,*, Georgi Vayssilovb, Stoyan Markovc
a Faculty of Physics, University of Sofia, 1164 Sofia, Bulgaria
b Faculty of Chemistry, University of Sofia, 1164 Sofia, Bulgaria
c National Centre for Supercomputing Applications, Sofia, Bulgaria

Abstract: The reported work aims at the implementation of a method allowing realistic simulation of large or extra-large biochemical systems (of 10^6 to 10^7 atoms) with first-principles quantum chemical methods. The current methods treat the whole system simultaneously. In this way the compute time increases rapidly with the size of the system and does not allow efficient parallelization of the calculations, due to the mutual interactions between the electron density in all parts of the system. In order to avoid these problems we implemented a version of the Fragment Orbital Method (FMO) in which the whole system is divided into fragments calculated separately. This approach assures nearly linear scaling of the compute time with the size of the system and provides efficient parallelization of the job. The work includes the development of pre- and post-processing components for automatic division of the system into monomers and reconstruction of the total energy and electron density of the whole system.
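The fragment-level parallelism described above is essentially embarrassingly parallel; the sketch below illustrates the idea under simplified assumptions (round-robin fragment distribution, a placeholder fragment_energy routine, no pair corrections) and is not the actual implementation.

```c
/* Hedged sketch of fragment-level parallelism (not the actual FMO code):
 * fragments are dealt out round-robin to MPI ranks, each rank computes its
 * fragment energies independently, and the total is recovered by a
 * reduction. fragment_energy() is a placeholder for a full QM calculation. */
#include <mpi.h>
#include <stdio.h>

static double fragment_energy(int fragment_id)
{
    /* Placeholder: a real implementation would run an SCF calculation on
     * this fragment (plus correction terms for fragment pairs). */
    return -1.0 * (fragment_id + 1);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nfragments = 10000;          /* assumed number of monomers */
    double e_local = 0.0, e_total = 0.0;

    for (int f = rank; f < nfragments; f += size)
        e_local += fragment_energy(f);

    MPI_Allreduce(&e_local, &e_total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total fragment energy: %f\n", e_total);

    MPI_Finalize();
    return 0;
}
```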

Download PDF


Authors: Iain Bethunea, Adam Cartera, Kevin Stratforda, Paschalis Korosogloub,c
a EPCC, The University of Edinburgh, James Clerk Maxwell Building, The King’s Buildings, Edinburgh, EH9 3JZ, United Kingdom
b AUTH, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
c GRNET, Greek Research & Technology Network, L. Mesogeion 56, Athens 11527, Greece

Abstract: This report describes the work undertaken under PRACE-1IP to support the European scientific communities who make use of CP2K in their research. This was done in two ways: firstly, by improving the performance of the code for a wide range of usage scenarios; and secondly, by testing and installing the updated code on the PRACE CURIE supercomputer. We believe this approach both supports existing user communities by delivering better application performance, and demonstrates to potential users the benefits of using optimized and scalable software like CP2K on the PRACE infrastructure.

Download PDF


Authors: Jussi Enkovaaraa,*, Martti Louhivuoria, Petar Jovanovicb, Vladimir Slavnicb, Mikael Rännarc
a CSC – IT Center for Science, P.O. Box 405, FI-02101 Espoo, Finland
b Scientific Computing Laboratory, Institute of Physics Belgrade, Pregrevica 118, 11080 Belgrade, Serbia
c Department of Computing Science, Umea University, SE-901 87 Umea, Sweden

Abstract: GPAW is a versatile software package for first-principles simulations of nanostructures utilizing density-functional theory and time-dependent density-functional theory. Even though GPAW is already used for massively parallel calculations on several supercomputer systems, some performance bottlenecks still exist. First, the implementation based on the Python programming language introduces an I/O bottleneck during initialization which becomes serious when using thousands of CPU cores. Second, the current linear-response time-dependent density-functional theory implementation contains a large matrix which is replicated on all CPUs. When reaching for larger and larger systems, memory runs out due to the replication. In this report, we discuss the work done on resolving these bottlenecks. In addition, we have also worked on optimization aspects that are directed more towards future usage. As the number of cores in multicore CPUs is still increasing, a hybrid parallelization combining shared-memory and distributed-memory parallelization is becoming appealing. We have experimented with hybrid OpenMP/MPI and report here the initial results. GPAW also performs large dense matrix diagonalizations with the ScaLAPACK library. Due to limitations in ScaLAPACK these diagonalizations are expected to become a bottleneck in the future, which has led us to investigate alternatives to ScaLAPACK.

Download PDF


Abstract: The work aims at evaluating the performance of DALTON on different platforms and implementing new strategies to enable the code for petascaling. The activities have been organized into four tasks within the PRACE project: (i) analysis of the current status of the DALTON quantum mechanics (QM) code and identification of bottlenecks, implementation of several performance improvements of DALTON QM and a first attempt at hybrid parallelization; (ii) implementation of MPI integral components into LSDALTON, improvements of optimization and scalability, and interfacing of matrix operations to PBLAS and ScaLAPACK numerical library routines; (iii) interfacing the DALTON and LSDALTON QM codes to the ChemShell quantum mechanics/molecular mechanics (QM/MM) package and benchmarking of QM/MM calculations using this approach; (iv) analysis of the impact of DALTON QM system components with Dimemas. Part of the results reported here has been achieved through the collaboration with the ScalaLife project.

Download PDF


Authors: Simen Reine(a), Thomas Kjærgaard(a), Trygve Helgaker(a), Olav Vahtras(b,d), Zilvinas Rinkevicius(b,g), Bogdan Frecus(b), Thomas W. Keal(c), Andrew Sunderland(c), Paul Sherwood(c),
Michael Schliephake(d), Xavier Aguilar(d), Lilit Axner(d), Maria Francesca Iozzi(e), Ole Widar Saastad(e), Judit Gimenez(f)
a Centre for Theoretical and Computational Chemistry (CTCC), Department of Chemistry, University of Oslo, P.O.Box 1033 Blindern, N-0315 Oslo, Norway
b KTH Royal Institute of Technology, School of Biotechnology, Division of Theoretical Chemistry
& Biology, S-106 91 Stockholm, Sweden
c Computational Science & Engineering Department, STFC Daresbury Laboratory, Daresbury Science and Innovation Campus, Warrington, Cheshire, WA4 4AD, UK
d PDC Center for High Performance Computing at Royal Institute of Technology (KTH), Teknikringen 14, 100 44 Stockholm, Sweden
e University center for Information technology, University of Oslo, P.O.Box 1059 Blindern, N-0316 Oslo, Norway
f Computer Sciences – Performance Tools, Barcelona Supercomputing Center, Campus Nord UP C6, C/ Jordi Girona, 1-3, Barcelona, 08034
g KTH Royal Institute of Technology, Swedish e-Science Center (SeRC), S-100 44, Stockholm, Sweden

Abstract: In this paper we present development work carried out on the Quantum ESPRESSO [1] software package within PRACE-1IP. We describe the different activities performed to enable the Quantum ESPRESSO user community to challenge the frontiers of science by running extreme computing simulations on European Tier-0 systems of the current and next generation. Three main sections are described: 1) the improvement of parallelization efficiency in two DFT-based applications: Nuclear Magnetic Resonance (NMR) and EXact-eXchange (EXX) calculations; 2) the introduction of innovative van der Waals interactions at the ab initio level; 3) the porting of the PWscf code to hybrid systems equipped with NVIDIA GPU technology.

Download PDF


Authors: Dusan Stankovic*a, Aleksandar Jovica, Petar Jovanovica, Dusan Vudragovica, Vladimir Slavnica
a Institute of Physics Belgrade, Serbia

Abstract: In this whitepaper we report on work done to enable support for the FFTE Fast Fourier Transform library in Quantum ESPRESSO, to enable threading for the FFTW3 library already supported in Quantum ESPRESSO (previously only in a serial version), and to benchmark and compare their performance with the existing FFT implementations in Quantum ESPRESSO.
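For reference, enabling FFTW3’s threaded transforms follows the standard pattern sketched below (grid size and data are illustrative; this is not the Quantum ESPRESSO FFT driver); the executable must be linked against the FFTW threads library.

```c
/* Minimal sketch of a threaded FFTW3 3D transform (not the Quantum ESPRESSO
 * FFT driver). Compile with OpenMP enabled and link -lfftw3_threads -lfftw3. */
#include <fftw3.h>
#include <omp.h>

int main(void)
{
    const int n = 128;                         /* assumed grid dimension */

    fftw_init_threads();                       /* enable threading once  */
    fftw_plan_with_nthreads(omp_get_max_threads());

    fftw_complex *data = fftw_malloc(sizeof(fftw_complex) * n * n * n);
    for (int i = 0; i < n * n * n; ++i) {      /* placeholder input      */
        data[i][0] = 1.0;
        data[i][1] = 0.0;
    }

    /* The plan below will execute using the requested number of threads. */
    fftw_plan plan = fftw_plan_dft_3d(n, n, n, data, data,
                                      FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(plan);

    fftw_destroy_plan(plan);
    fftw_free(data);
    fftw_cleanup_threads();
    return 0;
}
```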

Download PDF


Classical MD applications


Authors: Mariusz Uchronskia, Agnieszka Kwieciena,*, Marcin Gebarowskia, Justyna Kozlowskaa
a WCSS, Wroclaw University of Technology, Wyb. Wyspianskiego 27, 50-370 Wroclaw, Poland

Abstract: The prototypes evaluated within the PRACE-2IP project provide a range of different computing hardware, including general-purpose Graphics Processing Units (GPUs) and accelerators like the Intel Xeon Phi. In this work we evaluated the performance and energy consumption of two prototypes when used for a real-case simulation. Due to the heterogeneity of the prototypes, we decided to use the DL_POLY molecular simulation package and its OpenCL port for the tests. The DL_POLY OpenCL port implements one of the methods – the Constraints SHAKE (CS) component. SHAKE is a two-stage algorithm based on the leapfrog Verlet integration scheme. We used four test cases for the evaluation: one from the DL_POLY application test suite – H2O – and three real cases provided by a user. We show the performance results and discuss the experience of using the prototypes in the context of ease of use, porting effort required, and energy consumption.
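For orientation, the position-correction loop at the heart of a textbook SHAKE iteration for distance constraints is sketched below; it is a generic illustration with assumed data structures, not the DL_POLY OpenCL kernel.

```c
/* Generic textbook SHAKE iteration for distance constraints (illustration
 * only, not the DL_POLY CS kernel). x_new holds positions after the
 * unconstrained leapfrog step, x_old the positions before it. */
#include <math.h>

typedef struct { int i, j; double d2; } constraint_t;   /* target length^2 */

void shake_iterate(double (*x_new)[3], double (*x_old)[3],
                   const double *invmass, const constraint_t *con, int ncon,
                   double tol, int max_iter)
{
    for (int iter = 0; iter < max_iter; ++iter) {
        int converged = 1;
        for (int c = 0; c < ncon; ++c) {
            int i = con[c].i, j = con[c].j;
            double rn[3], ro[3], rn2 = 0.0, dot = 0.0;
            for (int k = 0; k < 3; ++k) {
                rn[k] = x_new[i][k] - x_new[j][k];   /* current separation */
                ro[k] = x_old[i][k] - x_old[j][k];   /* reference vector   */
                rn2 += rn[k] * rn[k];
                dot += rn[k] * ro[k];
            }
            double diff = rn2 - con[c].d2;
            if (fabs(diff) > tol * con[c].d2) {
                converged = 0;
                /* Lagrange-multiplier style correction along the old bond. */
                double g = diff / (2.0 * dot * (invmass[i] + invmass[j]));
                for (int k = 0; k < 3; ++k) {
                    x_new[i][k] -= g * invmass[i] * ro[k];
                    x_new[j][k] += g * invmass[j] * ro[k];
                }
            }
        }
        if (converged) break;
    }
}
```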

Download paper: PDF

Authors: D. Grancharov, N. Ilieva, E. Lilkova, L. Litov, S. Markov, P. Petkov, I. Todorov
NCSA, Akad. G. Bonchev 25A, Sofia 1311, Bulgaria
STFC Daresbury Laboratory, Daresbury, Warrington WA4 4AD, UK

Abstract: A library implementing the AGBNP2 [1, 2] implicit solvent model, developed within PRACE-2IP [3], is integrated into the DL_POLY_4 [4] molecular dynamics package in order to speed up the time to solution for protein solvation processes. Generally, implicit solvent models lighten the computational load by reducing the degrees of freedom of the model, removing those of the solvent and thus concentrating only on the protein dynamics, which is facilitated by the absence of friction with solvent molecules. Furthermore, periodic boundary conditions are no longer formally required, since long-range electrostatic calculations cannot be applied to systems with variable dielectric permittivity. The AGBNP2 implicit solvation model improves the conformational sampling of the protein dynamics by including the influence of the solvent-accessible surface and water-protein hydrogen-bonding effects as interactive force corrections on the atoms of the protein surface. This requires the development of suitable bookkeeping data structures, in accordance with the domain decomposition framework of DL_POLY, with dynamically adjustable inter-connectivity to describe the protein surface. The work also requires the use of advanced B-tree search libraries as part of the AGBNP library, in order to reduce the memory and compute requirements, and the automatic derivation of the van der Waals radii of atoms from the self-interaction potentials.

Download: PDF


Authors: P. Petkov, I. Todorov, D. Grancharov, N. Ilieva, E. Lilkova, L. Litov, S. Markov
NCSA, Akad. G. Bonchev 25A, Sofia 1311, Bulgaria
STFC Daresbury Laboratory, Daresbury, Warrington WA4 4AD, UK

Abstract: Electrostatic interactions in molecular simulations are usually evaluated by employing the Ewald summation method, which splits the summation into a short-range part, treated in real space, and a long-range part, treated in reciprocal space. For performance purposes, in molecular dynamics software the latter is usually handled by the SPME or P3M grid-based methods, both relying on the 3D fast Fourier transform (FFT) as their central operation. However, the Ewald summation method is derived for model systems that are subject to 3D periodic boundary conditions (PBC), while there are many models of scientific as well as commercial interest where the geometry implies a 1D or 2D structure. Thus, for systems such as membranes, interfaces, linear protein complexes, thin layers and nanotubes, employing Ewald-summation-based techniques is either computationally very disadvantageous or impossible altogether. Another approach to evaluating the electrostatic interactions is to solve the Poisson equation of the model-system charge distribution on a 3D spatial grid. The formulation of the method allows an elegant way to switch the dependence on periodic boundary conditions on and off in a simple manner. Furthermore, 3D FFT kernels are known to scale poorly at large scale due to excessive memory and communication overheads, which makes Poisson solvers a viable alternative for DL_POLY on the road to exascale. This paper describes the work undertaken to integrate a Poisson solver library, developed in PRACE-2IP [1], within the DL_POLY_4 domain decomposition framework. The library relies on a unique combination of the bi-conjugate gradient (BiCG) and conjugate gradient (CG) methods to warrant both independence from initial conditions, with rapid convergence of the solution, on the one hand, and stabilization of possible fluctuations of the iterative solution on the other. The implementation involves the development of procedures for generating charge density and electrostatic potential grids in real space over all domains in a distributed manner, as well as halo exchange routines and functions to calculate the gradient of the potential in order to recover the electrostatic forces on point charges.
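The conjugate gradient component of such a solver has the familiar structure sketched below; this is a generic CG on a user-supplied matrix-vector product (for instance a discrete Laplacian applied on the charge-density grid), not the PRACE-2IP library itself.

```c
/* Generic conjugate gradient sketch (not the PRACE-2IP Poisson library):
 * solves A x = b for a symmetric positive-definite operator supplied as a
 * matrix-vector product. */
#include <math.h>
#include <string.h>

typedef void (*matvec_fn)(const double *x, double *ax, int n);

static double dot(const double *a, const double *b, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;   /* in a distributed code this would be an MPI_Allreduce */
}

int cg_solve(matvec_fn A, const double *b, double *x, int n,
             double *r, double *p, double *ap,      /* work arrays, size n */
             double tol, int max_iter)
{
    A(x, ap, n);
    for (int i = 0; i < n; ++i) r[i] = b[i] - ap[i];
    memcpy(p, r, n * sizeof(double));
    double rr = dot(r, r, n);

    for (int it = 0; it < max_iter; ++it) {
        if (sqrt(rr) < tol) return it;              /* converged */
        A(p, ap, n);
        double alpha = rr / dot(p, ap, n);
        for (int i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * ap[i];
        }
        double rr_new = dot(r, r, n);
        double beta = rr_new / rr;
        for (int i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return -1;                                      /* not converged */
}
```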

Download PDF


Authors: Buket Benek Gursoya, Henrik Nagelb
a Irish Centre for High End Computing, Ireland
b Norwegian University of Science and Technology, Norway

Abstract: This whitepaper investigates the potential benefit of using the OpenACC directive-based programming tool for enabling DL_POLY_4 on GPUs. DL_POLY is a well-known general-purpose molecular dynamics simulation package, which has already been parallelised using MPI-2. DL_POLY_3 was accelerated using the CUDA framework by the Irish Centre for High-End Computing (ICHEC) in collaboration with Daresbury Laboratory. In this work, we have been inspired by the existing CUDA port to evaluate the effectiveness of OpenACC in further enabling DL_POLY_4 on the road to Exascale. We have been particularly concerned with investigating the benefits of OpenACC in terms of maintainability, programmability and portability issues that are becoming increasingly challenging as we advance to the Exascale era. The impact of the OpenACC port has been assessed in the context of a change in the reciprocal vector dimension for the calculation of SPME forces. Moreover, the interoperability of OpenACC with the existing CUDA port has been analysed.
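For readers unfamiliar with the directive-based style, the general OpenACC pattern is sketched below on a generic pairwise-energy loop; the routine and names are illustrative and do not come from DL_POLY_4.

```c
/* Generic OpenACC pattern (illustrative only, not DL_POLY_4 source): offload
 * a simple pairwise Lennard-Jones-style energy loop with compiler directives
 * rather than hand-written CUDA. Build with an OpenACC-capable compiler. */
double pair_energy(int n, const double *x, const double *y, const double *z)
{
    double energy = 0.0;

    /* Copy coordinates to the device, run the nested loops there and reduce
     * the energy back on the host. */
    #pragma acc data copyin(x[0:n], y[0:n], z[0:n])
    {
        #pragma acc parallel loop reduction(+:energy)
        for (int i = 0; i < n; ++i) {
            double e_i = 0.0;
            #pragma acc loop seq
            for (int j = i + 1; j < n; ++j) {
                double dx = x[i] - x[j];
                double dy = y[i] - y[j];
                double dz = z[i] - z[j];
                double r2 = dx * dx + dy * dy + dz * dz;
                double inv6 = 1.0 / (r2 * r2 * r2);
                e_i += 4.0 * (inv6 * inv6 - inv6);   /* reduced units */
            }
            energy += e_i;
        }
    }
    return energy;
}
```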

Download PDF


Authors: Mariusz Uchronskia, Marcin Gebarowskia, Agnieszka Kwieciena,*
a Wroclaw Centre for Networking and Supercomputing (WCSS), Wyb. Wyspianskiego 27, 50-370 Wroclaw, Poland

Abstract: The SHAKE and RATTLE algorithms are widely used in molecular dynamics simulations and for this reason are relevant for a broad range of scientific applications. In this work, existing CPU+GPU implementations of the SHAKE and RATTLE algorithms from the DL_POLY application are investigated. DL_POLY is a general-purpose parallel molecular dynamics simulation package developed at Daresbury Laboratory by W. Smith and I.T. Todorov. The OpenCL code of the SHAKE algorithm for the DL_POLY application is analyzed for further optimization possibilities. Our work on the RATTLE algorithm focused on porting the algorithm from Fortran to OpenCL and adjusting it to the GPGPU architecture.

Download PDF


Authors: Michael Lysaghta, Mariusz Uchronskib, Agnieszka Kwiecienb, Marcin Gebarowskib, Peter Nasha, Ivan Girottoa and Ilian T.Todorovc
a Irish Centre for High End Computing, Tower Building, Trinity Technology and Enterprise Campus, Grand Canal Quay, Dublin 2, Ireland
b Wroclaw Centre for Network and Supercomputing, Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland
c STFC Daresbury Laboratory, Daresbury, Warrington WA4 4AD, United Kingdom

Abstract: We describe recent development work carried out on the GPU-enabled classical molecular dynamics software package, DL_POLY. We describe how we have updated the original GPU port of DL_POLY 3 in order to align the ‘CUDA+OpenMP’-based code with the recently released MPI-based DL_POLY 4 package. In the process of updating the code we have also fixed several bugs which allows us to benchmark the GPU-enabled code on many more GPU-nodes than was previously possible. We also describe how we have recently initiated the development of an OpenCL-based implementation of DL_POLY and present a performance analysis of the set of DL_POLY modules that have so far been ported to GPUs using the OpenCL framework.

Download PDF



Authors: Sadaf Alam, Ugo Varetto
Swiss National Supercomputing Centre, Lugano, Switzerland

Abstract: This report introduces the hybrid implementation of the GROMACS application, and provides instructions on building and executing it on PRACE prototype platforms with Graphics Processing Unit (GPU) and Many Integrated Core (MIC) accelerator technologies. GROMACS currently employs message-passing MPI parallelism and multi-threading using OpenMP, and contains kernels for non-bonded interactions that are accelerated using the CUDA programming language. As a result, the execution model is multi-faceted, and end users can tune the application execution according to the underlying platform. We present results that have been collected on the PRACE prototype systems as well as on other GPU- and MIC-accelerated platforms with similar configurations. We also report on the preliminary porting effort that involves a fully portable implementation of GROMACS using the OpenCL programming language instead of CUDA, which is only available on NVIDIA GPU devices.

Download PDF


Authors: Fabio Affinitoa, Andrew Emersona, Leandar Litovb, Peicho Petkovb, Rossen Apostolovc,d, Lilit Axnerc, Berk Hessd and Erik Lindahld, Maria Francesca Iozzie
a CINECA Supercomputing, Applications and Innovation Department, via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
b National Center for Supercomputing Applications, Sofia, Bulgaria
c PDC Center for High Performance Computing at Royal Institute of Technology (KTH),
Teknikringen 14, 100 44 Stockholm, Sweden
d Department of Theoretical Physics, Royal Institute of Technology (KTH), Stockholm, Sweden
e Research Computing Services Group, University of Oslo, Postboks 1059 Blindern, 0316 Oslo, Norway

Abstract: The work aims at evaluating the performance of GROMACS on different platforms and determining the optimal set of conditions on given architectures for petascaling molecular dynamics simulations. The activities have been organized into three tasks within the PRACE project: (i) optimization of GROMACS performance on Blue Gene systems; (ii) parallel scaling of the OpenMP implementation; (iii) development of a multiple step-size symplectic integrator adapted to large biomolecular systems. Part of the results reported here has been achieved through the collaboration with the ScalaLife project.

Download PDF


Computational Fluid Dynamics (CFD) applications


Authors: Ahmet Durana,b,*, M. Serdar Celebia,c, Senol Piskina,c and Mehmet Tuncela,c
a Istanbul Technical University, National Center for High Performance Computing of Turkey (UHeM), Istanbul 34469, Turkey
b Istanbul Technical University, Department of Mathematics, Istanbul 34469, Turkey
c Istanbul Technical University, Informatics Institute, Istanbul 34469, Turkey

Abstract: We study a bio-medical fluid flow simulation using the incompressible, laminar OpenFOAM solver icoFoam and other direct solvers (kernel class), such as SuperLU_DIST 3.3 and SuperLU_MCDT (Many-Core Distributed), for the large penta-diagonal and hepta-diagonal matrices coming from the simulation of blood flow in arteries with a structured mesh domain. A realistic simulation of the sloshing of blood in the heart or vessels in the whole body is a complex problem and may take a very long time, thousands of hours, for the main tasks such as pre-processing (meshing), decomposition and solving the large linear systems. We generated the structured mesh by using blockMesh as a mesh generation tool. To decompose the generated mesh, we used the decomposePar tool. After the decomposition, we used icoFoam as a flow simulator/solver. For example, the total run time of a simple case is about 1500 hours without preconditioning on one core for one period of the cardiac cycle, measured on the Linux Nehalem Cluster (see [28]) available at the National Center for High Performance Computing of Turkey (UHeM) (see [5]). Therefore, this important problem deserves careful consideration for usage on multi-petascale or exascale systems. Our aim is to test the potential scaling capability of the fluid solver for multi-petascale systems. We started from relatively small instances for the whole simulation and solved large linear systems. We measured the wall clock time of single time steps of the simulation. This version gives important clues for a larger version of the problem. Later, in our general strategy, we gradually increase the problem size and the number of time steps to obtain a better picture. We test the performance of the solver icoFoam on TGCC Curie (a Tier-0 system) at CEA, France (see [21]). We consider three large sparse matrices of sizes 8 million x 8 million, 32 million x 32 million, and 64 million x 64 million. We achieved scaled speed-up for the largest matrices of 64 million x 64 million, running on up to 16384 cores. In other words, we find that the scalability improves as the problem size increases for this application. This shows that there is no structural problem in the software up to this scale. This is an important and encouraging result for the problem. Moreover, we embedded other direct solvers (kernel class), such as SuperLU_DIST 3.3 and SuperLU_MCDT, in addition to the solvers provided by OpenFOAM. Since future exascale systems are expected to have heterogeneous and many-core distributed nodes, we believe that our SuperLU_MCDT software is a good candidate for future systems. SuperLU_MCDT worked on up to 16384 cores for the large penta-diagonal matrices for 2D problems and hepta-diagonal matrices for 3D problems, coming from the incompressible blood flow simulation, without any problem.

Download paper: PDF


Authors: Sebastian Szkodaa,c, Zbigniew Kozaa, Mateusz Tykierkob,c
a Faculty of Physics and Astronomy, University of Wroclaw, Poland (sebastian.szkoda@ift.uni.wroc.pl)
b Institute of Computer Engineering, Control and Robotics, Wroclaw University of Technology, Poland
c Wroclaw Centre for Networking and Supercomputing, Wroclaw University of Technology, Poland

Abstract: The aim of this research is to examine the possibility of parallelizing the Frisch-Hasslacher-Pomeau (FHP) model, a cellular automaton algorithm for modeling fluid flow, on clusters of modern graphics processing units (GPUs). To this end an Open Computing Language (OpenCL) implementation for GPUs was written and compared with a previous, semi-automatic one based on the OpenACC compiler pragmas (S. Szkoda, Z. Koza, and M. Tykierko, Multi-GPGPU Cellular Automata Simulations using OpenACC, http://www.prace-project.eu/IMG/pdf/wp154.pdf). Both implementations were tested on up to 16 Fermi-class GPUs using the MPICH3 library for the inter-process communication. We found that for both of the multi-GPU implementations the weak scaling is practically linear for up to 16 devices, which suggests that the FHP model can be successfully run even on much larger clusters. Secondly, while the pragma-based OpenACC implementation is much easier to develop and maintain, it gives performance as good as the manually written OpenCL code.

Download paper: PDF


Authors: Charles Moulinec, David R. Emerson
STFC Daresbury Laboratory, Warrington, WA4 4AD, UK

Abstract: Understanding the influence of wave distribution, hydrodynamics and sediment transport is crucial for the placement of off-shore energy generating platforms. The TELEMAC suite is used for this purpose, and the performance of the triple coupling between TOMAWAC for wave propagation, TELEMAC-3D for hydrodynamics and SISYPHE for sediment transport is investigated for several mesh sizes, the largest grid having over 10 million elements. The coupling has been tested on up to 3,072 processors and good performance is generally observed.

Download PDF


Authors: E. Casonia, J. Aguadoa, M. Riveroa, M. Vazquez, G. Houzeaux
Department of Computer Applications for Scientific Engineering. BSC, Nexus I, Gran Capita 2-4, 08034 Barcelona, Spain

Abstract: This paper describes the work done on the Alya multiphysics code, an open-source software package developed at the Barcelona Supercomputing Center (BSC-CNS). The main activities of this socio-economic application project concern the development of a coupled fluid-electro-mechanical model to simulate the computational mechanics of the heart. Several aspects involved in the simulation process, the methodology and the performance of the code are carefully described.

Download PDF


Authors: J. Donners, M. Guarrasi, A. Emerson, M. Genseberger
SURFsara, Amsterdam, The Netherlands
CINECA, Bologna, Italy
Deltares, Delft, The Netherlands

Abstract: The applications Delft3D-FLOW and SWAN are used to simulate water flow and water waves, respectively. These two applications have been coupled with Delft3D-WAVE and the combination of these three executables has been optimized on the Bull cluster “Cartesius”. The runtime could be decreased by a factor of 4 with hardly any additional hardware. Over 80% of the total runtime consisted of unnecessary I/O operations for the coupling, of which 70% could be removed. Both I/O optimizations and replacement with MPI were used. The Delft3D-FLOW application has also been ported to and benchmarked on the IBM Blue Gene/Q system “Fermi”.

Download PDF


Authors: Thomas Ponweiser, Peter Stadelmeyer, Tomas Karasek
Johannes Kepler University Linz, RISC Software GmbH, Austria
VSB-Technical University of Ostrava, IT4Innovations, Czech Republic

Abstract: Multi-physics, high-fidelity simulations are becoming an increasingly important part of industrial design processes. Simulations of fluid-structure interactions (FSI) are of great practical significance – especially within the aeronautics industry – and because of their complexity they require huge computational resources. On the basis of OpenFOAM, a partitioned, strongly coupled solver for transient FSI simulations with independent meshes for the fluid and solid domains has been implemented. Using two different kinds of model sets, a geometrically simple 3D beam with quadratic cross-section and a geometrically complex aircraft configuration, runtime and scalability characteristics are investigated. By modifying the implementation of OpenFOAM’s inter-processor communication, the scalability limit could be increased by one order of magnitude (from below 512 to above 4096 processes) for a model with 61 million cells.

Download PDF


Authors: Seren Soner, Can Ozturan
Computer Engineering Department, Bogazici University, Istanbul, Turkey

Abstract: OpenFOAM is an open-source computational fluid dynamics (CFD) package with a large user base from many areas of engineering and science. This whitepaper documents an enablement tool called PMSH that was developed to generate multi-billion element unstructured tetrahedral meshes for OpenFOAM. PMSH is developed as a wrapper code around the popular open-source sequential Netgen mesh generator. Parallelization of the mesh generation process is carried out in five main stages: (i) generation of a coarse volume mesh; (ii) partitioning of the coarse mesh to get sub-meshes, each of which is processed by a processor; (iii) extraction and refinement of coarse surface sub-meshes to produce fine surface sub-meshes; (iv) re-meshing of each fine surface sub-mesh to get the final fine volume mesh; (v) matching of partition boundary vertices followed by global vertex numbering. An integer-based barycentric coordinate method is developed for matching distributed partition boundary vertices. This method does not have the precision-related problems of floating-point coordinate-based vertex matching. Test results obtained on an SGI Altix ICE X system with 8192 cores and 14 TB of total memory confirm that our approach does indeed enable us to generate multi-billion element meshes in a scalable way. The PMSH tool is available at https://code.google.com/p/pmsh/.
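The idea behind the integer-based matching in stage (v) can be illustrated as follows; the struct fields and hash below are hypothetical and only sketch the general technique of exact integer keys, not the actual PMSH data structures.

```c
/* Hedged illustration of exact vertex matching with integer barycentric
 * coordinates (hypothetical field names; not the actual PMSH code). After
 * uniform refinement of a coarse surface triangle, every new vertex sits at
 * barycentric coordinates (b0,b1,b2)/denom with small non-negative integers
 * satisfying b0+b1+b2 == denom, so two partitions can agree on a vertex by
 * comparing integers instead of floating-point coordinates. */
#include <stdint.h>

typedef struct {
    int64_t coarse_face;          /* global id of the coarse triangle      */
    int32_t b0, b1, b2;           /* integer barycentric numerators        */
    int32_t denom;                /* common denominator (refinement level) */
} vertex_key_t;

/* Keys are equal exactly when the two vertices are the same point. */
static int key_equal(const vertex_key_t *a, const vertex_key_t *b)
{
    return a->coarse_face == b->coarse_face &&
           a->denom == b->denom &&
           a->b0 == b->b0 && a->b1 == b->b1 && a->b2 == b->b2;
}

/* Simple FNV-1a style hash so keys can be placed in a table and matched
 * across partition boundaries before global vertex numbering. */
static uint64_t key_hash(const vertex_key_t *k)
{
    uint64_t h = 1469598103934665603ULL;
    uint64_t parts[5] = { (uint64_t)k->coarse_face, (uint64_t)k->b0,
                          (uint64_t)k->b1, (uint64_t)k->b2,
                          (uint64_t)k->denom };
    for (int i = 0; i < 5; ++i) {
        h ^= parts[i];
        h *= 1099511628211ULL;
    }
    return h;
}
```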

Download PDF


Authors: Tomas Karasek, David Horak, Vaclav Hapla, Alexandros Markopoulos, Lubomir Riha, Vit Vondrak, Tomas Brzobohaty
IT4Innovations, VSB-Technical University of Ostrava (VSB)

Abstract: The solution of multiscale and/or multiphysics problems is one of the domains that can benefit most from the use of supercomputers. Such problems are often very complex, and their accurate description and numerical solution requires the use of several different solvers. For example, fluid-structure interaction (FSI) problems are usually solved using two different discretization schemes: finite volumes for the computational fluid dynamics (CFD) part and finite elements for the structural part of the problem. This paper summarizes different libraries and solvers used by the PRACE community that are able to deal with multiscale and/or multiphysics problems, such as Elmer, Code_Saturne, Code_Aster and OpenFOAM. The main bottlenecks in performance and scalability on the side of the Computational Structural Mechanics (CSM) codes are identified and their possible extensions to fulfil the needs of future exascale problems are shown. Numerical results for the strong and weak scalability of the CSM solver implemented in our FLLOP library are presented.

Download PDF


Authors : Sebastian Szkoda, Zbigniew Koza, Mateusz Tykierko
Faculty of Physics and Astronomy, University of Wroclaw, Poland
Institute of Computer Engineering, Control and Robotics, Wroclaw University of Technology, Poland
Wroclaw Centre for Networking and Supercomputing, Wroclaw University of Technology, Poland

Abstract: The Frisch-Hasslacher-Pomeau (FHP) model is a lattice gas cellular automaton designed to simulate fluid flows using exact, purely Boolean arithmetic, without any round-off error. Here we investigate the problem of its efficient porting to clusters of Fermi-class graphics processing units. To this end, two multi-GPU implementations were developed and examined: one using the NVIDIA CUDA and GPU Direct technologies explicitly, and the other using CUDA implicitly through the OpenACC compiler directives and the MPICH2 MPI interface for communication. For a single Tesla C2090 GPU device both implementations yield up to a 7-fold acceleration over an algorithmically comparable, highly optimized multi-threaded implementation running on a server-class CPU. The weak scaling for the explicit multi-GPU CUDA implementation is almost linear for up to 8 devices (the maximum number of devices used in the tests), which suggests that the FHP model can be successfully run on much larger clusters and is a prospective candidate for exascale computational fluid dynamics. The scaling for the OpenACC approach turns out to be less favorable due to compiler-related technical issues. We found that the multi-GPU approach can bring considerable benefits for this class of problems, and that GPU programming can be significantly simplified through the use of the OpenACC standard, without a significant loss of performance, provided that the compilers supporting OpenACC improve their handling of the communication between GPUs.

Download PDF


Authors:

Abstract: Code_Saturne is a popular open-source computational fluid dynamics package. We have carried out a study of applying MPI 2.0 / MPI 3.0 one-sided communication routines to Code_Saturne and of their impact on improving the scalability of the code for future peta/exa-scaling. We have developed modified versions of the halo exchange routine in Code_Saturne. Our modifications showed that MPI 2.0 one-sided calls give some speed improvement and lower memory overhead compared to the original version. The MPI 3.0 version, on the other hand, was unstable and could not be run.

Application Code: Code_Saturne
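The flavour of the one-sided approach is sketched below with a generic 1D halo exchange using fence synchronization; array sizes and names are illustrative and this is not the modified Code_Saturne routine.

```c
/* Generic one-sided halo exchange sketch (not the Code_Saturne routine):
 * each rank exposes its array, including two ghost cells, in an MPI window
 * and its neighbours deposit boundary values directly with MPI_Put. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1024;                          /* interior cells per rank */
    double *u = calloc(n + 2, sizeof(double));   /* u[0], u[n+1] = ghosts   */
    for (int i = 1; i <= n; ++i) u[i] = rank;    /* placeholder field       */

    int left  = (rank - 1 + size) % size;        /* periodic neighbours     */
    int right = (rank + 1) % size;

    MPI_Win win;
    MPI_Win_create(u, (n + 2) * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    /* Put my first interior cell into the right ghost slot of my left
     * neighbour, and my last interior cell into the left ghost slot of my
     * right neighbour. */
    MPI_Put(&u[1], 1, MPI_DOUBLE, left,  (MPI_Aint)(n + 1), 1, MPI_DOUBLE, win);
    MPI_Put(&u[n], 1, MPI_DOUBLE, right, (MPI_Aint)0,       1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    free(u);
    MPI_Finalize();
    return 0;
}
```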

Download PDF


Authors: Jan Christian Meyer
High Performance Computing Section, IT Dept., Norwegian University of Science and Technology

Abstract: The LULESH proxy application models the behavior of the ALE3D multi-physics code with an explicit shock hydrodynamics problem, and was created to evaluate interactions between programming models and architectures using a representative code significantly less complex than the application it models. As identified in the PRACE deliverable D7.2.1 [1], the OmpSs programming model specifically targets programming at the exascale, and this whitepaper investigates the effectiveness of its support for development on hybrid architectures.

Download PDF


Authors : Maciej Cytowskia, Matteo Bernardinib
a Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw
b Universita di Roma La Sapienza, Dipartimento di Ingegneria Meccanica e Aerospaziale

Abstract: The project aimed at extending the capabilities of an existing flow solver for Direct Numerical Simulation of turbulent flows. Starting from a scalability analysis of the MPI baseline code, the main goal of the project was to devise an MPI/OpenMP hybridization capable of exploiting the full potential of the current architectures provided in the PRACE framework. The project was very successful: the new hybrid version of the code outperformed the pure MPI version on the IBM Blue Gene/Q architecture (FERMI).
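A minimal sketch of the hybridization pattern is given below (illustrative only, not the DNS solver itself): MPI ranks own sub-domains while OpenMP threads share the loop work within each rank.

```c
/* Minimal hybrid MPI/OpenMP sketch (not the actual DNS solver): one MPI rank
 * per sub-domain, OpenMP threads sharing the loops within it. Compile with
 * an MPI wrapper and OpenMP enabled. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* MPI is called only outside parallel regions, so FUNNELED suffices. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nlocal = 1 << 22;               /* points in this rank's slab */
    double *f = malloc(nlocal * sizeof(double));

    /* Threads split the local slab; communication stays rank-to-rank. */
    #pragma omp parallel for
    for (int i = 0; i < nlocal; ++i)
        f[i] = (double)i / nlocal;

    double local_sum = 0.0, global_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < nlocal; ++i)
        local_sum += f[i];

    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("threads per rank: %d, checksum: %f\n",
               omp_get_max_threads(), global_sum);

    free(f);
    MPI_Finalize();
    return 0;
}
```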

Download PDF


Authors: B. Scotta, V. Weinbergb†, O. Hoenena, A. Karmakarb, L. Fazendeiroc
a Max-Planck-Institut fur Plasmaphysik IPP, 85748 Garching b. Munchen, Germany
b Leibniz Rechenzentrum der Bayerischen Akademie der Wissenschaften, 85748 Garching b. Munchen, Germany
c Chalmers University of Technology, 412 96 Gothenburg, Sweden

Abstract: We discuss a detailed weak scaling analysis of GEM, a 3D MPI-parallelised gyrofluid code used in theoretical plasma physics at the Max Planck Institute of Plasma Physics, IPP at Garching near Munich, Germany. Within a PRACE Preparatory Access Project, various versions of the code have been analysed on the HPC systems SuperMUC at LRZ and JUQUEEN at Julich Supercomputing Centre (JSC) to improve the parallel scalability of the application. The diagnostic tool Scalasca has been used to filter out suboptimal routines. The code uses the electromagnetic gyrofluid model, which is a superset of magnetohydrodynamic and drift-Alfven microturbulence and also includes several relevant kinetic processes. GEM can be used with different geometries depending on the targeted use case, and has been proven to show good scalability when the computational domain is distributed amongst two dimensions. Such a distribution allows grids with sufficient size to describe conventional tokamak devices. In order to enable the simulation of very large tokamaks (such as the next-generation nuclear fusion device ITER in Cadarache, France), the third dimension has been parallelised and weak scaling has been achieved for significantly larger grids.

Download PDF


Authors: Siwei Donga, Vegard Eideb, Jeroen Engelbertsc
a Universidad Politecnica Madrid, Spain
b Norwegian University of Science and Technology, Norway
c SURFsara, Amsterdam, Netherlands

Abstract: The SHEAR code is developed at the School of Aeronautics, Universidad Politecnica de Madrid, for the simulation of turbulent structures of shear flows. The code has been well tested on smaller clusters. This white paper describes the work done to scale and optimise SHEAR for large systems like the Blue Gene/Q system JUQUEEN in Julich.

Download PDF


Authors: Riccardo Brogliaa, Stefano Zaghia, Roberto Muscaria, Francesco Salvadoreb, Soon-Heum Koc
a CNR-INSEAN – National Marine Technology Research Institute, Via di Vallerano 139, Rome 00128, Italy
b CINECA, Via dei Tizii 6, Rome 00185, Italy
c NSC – National Supercomputing Centre, Linkoping University, 58183 Linkoping, Sweden

Abstract: In this paper, the work that has been performed to extend the capabilities of the Xnavis software, a well tested and validated parallel flow solver developed by the research group of CNR-INSEAN, is reported. The solver is based on the finite volume discretization of the unsteady incompressible Navier-Stokes equations; its main features include a level-set approach to handle free-surface flows and a dynamical overlapping grids approach, which allows bodies in relative motion to be handled. The baseline code features a hybrid MPI/OpenMP parallelization, proven to scale when running on the order of hundreds of cores (i.e. Tier-1 platforms). However, some issues arise when trying to use this code on the current massively parallel HPC facilities provided in the Tier-0 PRACE context. First of all, it is mandatory to assess an efficient speed-up on up to thousands of processors. Other important aspects are related to the pre- and post-processing phases, which need to be optimized and, possibly, parallelized. The last one concerns the implementation of MPI-I/O procedures in order to try to accelerate data access and to reduce the number of generated files.

Download PDF


Authors: Stoyan Markov, Peicho Petkov, Damyan Grancharov and Georgi Georgiev
National Centre for Supercomputing Application, Akad. G. Bonchev Str., 25A, 1113 Sofia, Bulgaria

Abstract: We investigated a possible way of treating electrostatic interactions by numerically solving Poisson’s equation using the conjugate gradient method and the stabilized bi-conjugate gradient method. The aim of the research was to test the execution time of prototype programs running on a BlueGene/P and a CPU/GPU system. The results show that the tested methods are applicable to the treatment of electrostatics in molecular-dynamics simulations.

Download PDF


Authors: A.Charalampidoua,b, P.Daogloua,b, J.Hertzerc, E.V.Votyakovd
a Greek Research and Technology Network, Athens, Greece
b Scientific Computing Center, Aristotle University of Thessaloniki Thessaloniki 54124, Greece
c HLRS, Nobelstr. 19, D-70569 Stuttgart, Germany
d The Cyprus Institute, 20 Konstantinou Kavafi Street 2121 Aglantzia, Nicosia, Cyprus

Abstract: The project objective has been to develop and justify an OpenFOAM model for the simulation of a thermal energy storage (TES) tank. In the course of the project we have obtained scalability results, which are presented in this paper. Scalability tests have been performed on the HLRS Hermit HPC system using various combinations of decomposition methods, cell capacities and numbers of physical CPU cores.

Download PDF


Authors: O. Akinci1, M. Sahin2, B.O. Kanat3
1 National High Performance Computing Center of Turkey (UHeM), Istanbul Technical University (ITU), Ayazaga Campus, UHeM Office, Istanbul 34469, Turkey
2 ITU, Ayazaga Campus, Faculty of Aeronautics and Astronautics, Istanbul 34469, Turkey
3 ITU, Ayazaga Campus, Computer Engineering Department, Istanbul 34469, Turkey

Abstract: ViscoSolve is a stable unstructured finite volume method for parallel large-scale viscoelastic fluid flow calculations. The code incorporates the open-source libraries PETSc and MPI for parallel computation. In this whitepaper we report work that was done to investigate the scaling performance of the ViscoSolve code.

Download PDF


Author: Evghenii Gaburov
SURFsara, Science Park 140, 1098XG Amsterdam, the Netherlands

Abstract: This white paper reports on an enabling effort that involved porting a legacy 2D fluid dynamics Fortran code to NVIDIA GPUs. Given the complexity of both the code and the underlying (custom) numerical method, the natural choice was to use NVIDIA CUDA C to achieve the best possible performance. We achieved over a 4.5x speed-up on a single K20 compared to the original code executed on a dual-socket E5-2687W.

Download PDF


Authors: Paride Dagnaa, Joerg Hertzerb
a CINECA-SCAI Department, Via R. Sanzio 4, Segrate (MI) 20090, Italy
b HLRS, Nobelstr. 19, D-70569 Stuttgart, Germany

Abstract: The performance results from the hybridization of the OpenFOAM linear system solver, tested on the CINECA Fermi and the HLRS Hermit supercomputers, are presented in this paper. A comparison between the original and the hybrid OpenFOAM versions on four physical problems, based on four different solvers, is shown, and a detailed analysis is given of the behavior of the main computing and communication phases, which determine scalability during the linear system solution.

Download PDF


Authors: Stephane Glocknerb, N. Audiffrena, H. Ouvrardb,a
a CINES, Centre Informatique National de l’Enseignement Superieur, Montpellier, France
b I2M (Institut de Mecanique et d’Ingenierie de Bordeaux), France
* Corresponding author: audiffren@cines.fr

Abstract: It has already been shown that the numerical tool Thetis, based on the resolution of the Navier-Stokes equations for multiphase flows, gives accurate results for coastal applications, e.g. wave breaking, tidal bore propagation, tsunami generation, swash flows, etc. [1,2,3,4,5,6]. In this study our goal is to improve the time and memory consumption in the set-up phase of the simulation (partitioning and building the computational mesh), to examine the potential benefits of a hybrid approach with the Hypre library, and to fine-tune the implementation of the code on the Curie Tier-0 system. We also implement parallel POSIX VTK and HDF5 I/O. Thetis is now able to run efficiently on up to 1 billion mesh nodes with 16384 cores on CURIE in a production context.
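The parallel HDF5 output follows the standard collective pattern sketched below; the dataset name and sizes are illustrative assumptions and do not reflect the Thetis output format.

```c
/* Standard collective parallel-HDF5 write pattern (illustrative names and
 * sizes, not the Thetis output format). Each rank writes its contiguous
 * slice of a 1D dataset to one shared file. */
#include <hdf5.h>
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const hsize_t nlocal = 1 << 20;
    double *field = malloc(nlocal * sizeof(double));
    for (hsize_t i = 0; i < nlocal; ++i) field[i] = rank;

    /* Open one file collectively through the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("fields.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hsize_t nglobal = nlocal * size, start = nlocal * rank;
    hid_t filespace = H5Screate_simple(1, &nglobal, NULL);
    hid_t dset = H5Dcreate(file, "pressure", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Select this rank's slice and write it collectively. */
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &nlocal, NULL);
    hid_t memspace = H5Screate_simple(1, &nlocal, NULL);
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, field);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    free(field);
    MPI_Finalize();
    return 0;
}
```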

Download PDF


Authors: Thibaut Delozea, Yannick Hoaraub, Marianna Brazaa
a IMFT, 2 allee du Professeur Camille Soula, Toulouse 31400, France
b IMFS, 2 rue Boussingault, Strasbourg 67000, France

Abstract: The present work focuses on the development of an efficient and robust CFD solver for aeronautics: the NSMB code, Navier-Stokes MultiBlock. A specific advanced version of the code, containing turbulence modelling approaches developed by IMFT, is the object of the present optimization. The project aims at improving the performance of the MPI version of the code in order to use the fat-node part of Curie (TGCC, France) efficiently. Different load balancing strategies have been studied in order to achieve an optimal distribution of work using up to 4096 processes.

Download PDF


Authors: A. Schnurpfeila, A. Schillera, F. Janetzkoa, St. Meiera, G. Sutmanna
a Forschungszentrum Juelich - JSC, Wilhelm-Johnen-Strasse, 52425 Juelich, Germany

Abstract: MP2C is a molecular dynamics code that focuses on mesoscopic particle dynamics simulations on massively parallel computers. The program is a development of JSC together with the Institute for Theoretical Biophysics and Soft Matter (IFF-2) of the Institute for Solid State Research (IFF) at Juelich. Within the framework of the PRACE internal call, further optimization of the code as well as the removal of possible bottlenecks were addressed. The project was mainly performed on JUGENE, a BG/P based architecture located at the Forschungszentrum Juelich, Germany. Besides that, some scaling tests were performed on JUROPA, an Intel-Nehalem based general purpose supercomputer, also located at the Forschungszentrum Juelich. In this report the efforts made in working on the program package are presented.

Download PDF


Author: Sami Saarinen
CSC – IT Center for Science Ltd, P.O. Box 405, FI-02101 Espoo, Finland

Abstract: The Sun exhibits magnetic activity at various spatial and temporal scales. The best known example is the 11-year sunspot cycle which is related to the 22-year periodicity of the Sun’s magnetic field. The sunspots, and thus solar magnetic activity, have some robust systematic features: in the beginning of the cycle sunspots appear at latitudes around 40 degrees. As the cycle progresses these belts of activity move towards the equator. The sign of the magnetic field changes from one cycle to the next and the large-scale field remains approximately anti-symmetric with respect to the equator. This cycle has been studied using direct observations for four centuries. Furthermore, proxy data from tree rings and Greenland ice cores has revealed that the cycle has persisted through millennia. The period and amplitude of activity change from cycle to cycle and there are even periods of several decades in the modern era when the activity has been very low. Since it is unlikely that the primordial field of the hydrogen gas that formed the Sun billions of years ago could have survived to the present day, the solar magnetic field is considered to be continuously replenished by some dynamo mechanism.

Download PDF


Authors: Kevin Stratforda, Ignacio Pagonabarragab
aEPCC, The King’s Buildings, The University of Edinburgh, EH9 3JZ, United Kingdom
bDepartament de Fisica Fonamental, Universitat de Barcelona, Carrer Marti i Franques, 08028 Barcelona, Spain

Abstract: This project looked at the performance of simulations of bacterial swimmers using a lattice Boltzmann code for complex fluids.

Download PDF


Authors: A. Turka,*, C. Moulinecb, A.G. Sunderlandb, C. Aykanata
aBilkent University, Comp. Eng. Dept., Ankara, Turkey
bSTFC Daresbury Laboratory, Warrington WA4 4AD, UK

Abstract: Code Saturne is an open-source, multi-purpose Computational Fluid Dynamics (CFD) software package developed by Electricite de France Recherche et Developpement (EDF R&D). Code Saturne has been selected as an application of interest for the CFD community in the Partnership for Advanced Computing in Europe First Implementation Phase Project (PRACE-1IP), and various efforts towards improving the scalability of Code Saturne have been conducted. In this whitepaper, the efforts towards understanding and improving the preprocessing subsystem of Code Saturne are described, and to this end, the performance of different mesh partitioning software packages that can be used is investigated.

Download PDF


Authors: C. Moulineca, A.G. Sunderlanda, P. Kabelikovab, A. Ronovskyb, V. Vondrakb, A. Turkc, C. Aykanatc, C. Theodosioud
aComputational Science and Engineering Department, STFC Daresbury Laboratory, United Kingdom
bDepartment of Applied Mathematics, VSB-Technical University of Ostrava, 17. listopadu 15, 708 33 Ostrava, Czech Republic
cDepartment of Computer Engineering, Bilkent University, 06800 Bilkent Ankara, Turkey
dScientific Computational Center, Aristotle University, 54 124 Thessaloniki, Greece

Abstract: Some of the optimisations required to prepare Code_Saturne for petascale simulations are presented in this white paper, along with the performance of the code. A mesh multiplication package based on parallel global refinement of hexahedral meshes has been developed for Code_Saturne to handle meshes containing billions of cells and to efficiently exploit PRACE Tier-0 system capabilities. Several parallel partitioning tools have been tested and Code_Saturne performance has been assessed up to a 3.2 billion cell mesh. The parallel code is highly scalable and demonstrates good parallel speed-up at very high core counts, e.g. from 32,768 to 65,536 cores.

Download PDF


Author: Massimiliano Culpo
CINECA, Via Magnanelli 6/3, Casalecchio di Reno (BO) I-40033, Italy

Abstract: The scaling behavior of different OpenFOAM versions is analyzed on two benchmark problems. Results show that the applications scale reasonably well up to a thousand tasks. An in-depth profiling identifies the calls to the MPI_Allreduce function in the linear algebra core libraries as the main communication bottleneck. A sub-optimal performance on-core is due to the sparse matrices storage format that does not employ any cache-blocking mechanism at present. Possible strategies to overcome these limitations are proposed and analyzed, and preliminary results on prototype implementations are presented.
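
For readers unfamiliar with why a linear algebra core generates so many MPI_Allreduce calls, the generic fragment below shows the usual culprit: every global dot product or norm in a Krylov-type solver ends in one reduction, repeated several times per iteration. This is an illustrative sketch, not OpenFOAM source code.

/* Generic illustration of the pattern profiled above: every global dot
 * product inside a Krylov-type linear solver ends in an MPI_Allreduce,
 * so the reduction is issued several times per solver iteration. */
#include <mpi.h>

double parallel_dot(const double *x, const double *y, int n_local)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n_local; i++)
        local += x[i] * y[i];
    /* The latency of this call, repeated every iteration on thousands
     * of ranks, is what dominates the communication profile. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}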

Download PDF


Authors: Michael Moylesa, Peter Nash, Ivan Girotto
Irish Centre for High End Computing, Grand Canal Quay, Dublin 2

Abstract: The following report outlines work undertaken for PRACE-2IP. The report will outline the computational methods used to examine petascaling of OpenFOAM on the French Tier-0 system CURIE. The case study used has been provided by the National University of Ireland, Galway (NUIG). The profiling techniques utilised to uncover bottlenecks, specifically in communication and file I/O within the code, will provide an insight into the behaviour of OpenFOAM and highlight practices that will be of benefit to the user community.

Download PDF


Author: Murat Manguoglu
Middle East Technical University, Department of Computer Engineering, 06800 Ankara, Turkey

Abstract: The solution of large sparse linear systems is frequently the most time-consuming operation in computational fluid dynamics simulations. Improving the scalability of this operation is likely to have a significant impact on the overall scalability of the application. In this white paper we show scalability results up to a thousand cores for a new algorithm devised to solve large sparse linear systems. We also compare pure MPI and hybrid MPI-OpenMP implementations of the same algorithm.

Download PDF




Earth Science applications


Authors: Mads R. B. Kristensena and Roman Nutermana
a Niels Bohr Institute, University of Copenhagen, Denmark

Abstract: In this paper, we explore the challenges of running the current version (v1.2.2) of the Community Earth System Model (CESM) on Juqueen. We present a set of workarounds for the Juqueen supercomputer that enable massively parallel executions and demonstrate scalability up to 3024 CPU cores.

Download PDF

Download WP200_Appendix_topology patch


Authors: P. Nolan, A. McKinstry
Irish Centre for High-End Computing (ICHEC), Ireland

Abstract: Climate change due to increasing anthropogenic greenhouse gases and land surface change is currently one of the most relevant environmental concerns. It threatens ecosystems and human societies. However, its impact on the economy and our living standards depends largely on our ability to anticipate its effects and take appropriate action. Earth System Models (ESMs), such as EC-Earth, can be used to provide society with information on the future climate. EC-Earth3 generates reliable predictions and projections of global climate change, which are a prerequisite to support the development of national adaptation and mitigation strategies. This project investigates methods to enhance the parallel capabilities of EC-Earth3 by offloading bottleneck routines to GPUs and Intel Xeon Phi coprocessors. To gain a full understanding of climate change at a regional scale will require EC-Earth3 to be run at a much higher spatial resolution (T3999 5km) than is currently feasible. It is envisaged that the work outlined in this project will provide climate scientists with valuable data for simulations planned for future exascale systems.

Download PDF


Author: Paride Dagna
CINECA-SCAI Department, Via R. Sanzio 4, Segrate (MI) 20090, Italy

Abstract: SPEED (Spectral Element in Elastodynamics with Discontinuous Galerkin) is an open source code, jointly developed by the Departments of Structural Engineering and of Mathematics at Politecnico di Milano, for seismic hazard analyses.
In this paper, performance results from the optimization and hybridization work done on SPEED, tested on the CINECA Fermi BG/Q supercomputer, will be shown. A comparison between the pure MPI and the hybrid SPEED versions on three earthquake scenarios of increasing complexity will be presented, and a detailed analysis of the advantages that come from hybridization and optimization of the computing and I/O phases will be given.

Download PDF


Authors: Jean-Marc Molinesa, Nicole Audiffrenb, Albanne Lecointrea
a CNRS, LEGI, Grenoble, France
bCINES, Centre Informatique National de l’Enseignement Superieur, 34000 Montpellier FRANCE

Abstract: This project aims at preparing the high-resolution ocean/sea-ice realistic modeling environment implemented by the European DRAKKAR consortium for use on PRACE Tier-0 computers. DRAKKAR participating Teams jointly develop and share this modeling environment to address a wide range of scientific questions investigating multiple-scale interactions in the world ocean. Each team relies on the achievements of DRAKKAR to have available for its research the most efficient and up-to-date ocean models and related modeling tools. Two original realistic model configurations, based on the NEMO modeling framework, are considered in this project. They are designed to make possible the study of the role of multiple-scale interactions in the ocean variability, in the ocean carbon cycle and in marine ecosystems changes.

Download PDF


Authors: Charles Moulinec, Yoann Audouin, Andrew Sunderland
STFC Daresbury Laboratory, UK

Abstract: This report details optimization undertaken on the Computational Fluid Dynamics (CFD) software suite TELEMAC, a modelling system for free surface waters with over 200 installations worldwide. The main focus of the work has involved eliminating memory bottlenecks occurring at the pre-processing stage that have historically limited the size of simulations processed. This has been achieved by localizing global arrays in the pre-processing tool, known as PARTEL. Parallelism in the partitioning stage has also been improved by replacing the serial partitioning tool with a new parallel implementation. These optimizations have enabled massively parallel runs of TELEMAC-2D, a Shallow Water Equations based code, involving over 200 million elements to be undertaken on Tier-0 systems. These runs simulate extreme flooding events on very fine meshes (locally less than one meter). Simulations at this scale are crucial for predicting and understanding flooding events occurring, e.g., in the region of the Rhine river.

Download PDF


Authors: Thomas Zwinger, Mika Malinen, Juha Ruokolainen, Peter Raback
CSC – IT Center for Science, P.O. Box 405, FI-02101 Espoo, Finland

Abstract: By gaining and losing mass, glaciers and ice-sheets play a key role in sea level evolution. This is obvious when considering the past 20000 years, during which the collapse of the large northern hemisphere ice-sheets after the Last Glacial Maximum contributed to a 120m rise in sea level. This is particularly worrying when the future is considered. Indeed, recent observations clearly indicate that important changes in the velocity structure of both the Antarctic and Greenland ice-sheets are occurring, suggesting that large and irreversible changes may already have been initiated. This was clearly emphasised in the last report published by the Intergovernmental Panel on Climate Change (IPCC) [7]. The IPCC also asserted that current knowledge of key processes causing the observed accelerations was poor, and concluded that reliable projections obtained with process-based models for sea-level rise (SLR) are currently unavailable. Most of these uncertain key processes have in common that their physical/numerical characteristics are not adequately reflected, or are completely missing, in the established simplified models, such as the shallow ice approximation (SIA), that have been in use for decades. Whereas those simplified models run on common PC systems, the new approaches require higher resolution and larger computational models, which demand High Performance Computing (HPC) methods. In other words, numerical glaciology, like climatology and oceanography decades ago, needs to be updated for HPC with scalable codes, in order to deliver the prognostic simulations demanded by the IPCC. The DECI project ElmerIce, and the enabling work associated with it, improved simulations of key processes that lead to continental ice loss. The project also developed new data assimilation methods. This was intended to decrease the degree of uncertainty affecting future SLR scenarios and consequently to contribute to on-going international debates surrounding coastal adaptation and sea-defence planning. These results directly feed into existing projects, such as the European FP7 project ice2sea [9], which has the objective of improving projections of the contribution of continental ice to future sea-level rise, and the French ANR ADAGe project [10], coordinated by O. Gagliardini, which has the objective of developing data assimilation methods dedicated to ice flow studies. Results from these projects will directly impact the upcoming IPCC assessment report (AR5).

Download PDF


Authors: Sebastian von Alfthana, Dusan Stankovicb, Vladimir Slavnicb
aFinnish Meteorological Institute, Helsinki, Finland
bInstitute of Physics Belgrade, Serbia

Abstract: In this whitepaper we report work that was done to investigate and improve the performance of a hybrid-Vlasov code for simulating the Earth’s magnetosphere. We improved the performance of the code by introducing a hybrid OpenMP-MPI mode.
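
A minimal sketch of such a hybrid OpenMP-MPI mode is given below: MPI is initialised with an explicit threading level, the local compute loop is threaded with OpenMP, and MPI calls stay outside the parallel region. The loop body and data are placeholders, not code from the hybrid-Vlasov application.

/* Minimal hybrid MPI+OpenMP pattern (illustrative only): a declared
 * threading level at init, a threaded local update, and MPI calls kept
 * on the master thread between parallel regions. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        MPI_Abort(MPI_COMM_WORLD, 1);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n_local = 100000;          /* cells owned by this rank       */
    static double f[100000];             /* toy stand-in for local data    */

    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < n_local; i++) {  /* threaded update of local cells */
        f[i] = f[i] * 0.5 + 1.0;
        local_sum += f[i];
    }

    double global_sum;                   /* MPI call outside the threads   */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    if (rank == 0) printf("global sum = %g\n", global_sum);

    MPI_Finalize();
    return 0;
}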

Download PDF


Authors: Dalibor Lukas, Jan Zapletal
Department of Applied Mathematics, IT4Innovations, VSB-Technical University of Ostrava, Czech Republic

Abstract: In this paper, a new parallel acoustic simulation package has been created, using the boundary element method (BEM). The package is built on top of SPECFEM3D, which is parallel software for doing seismic simulations, e.g. earthquake simulations of the globe. The acoustical simulation relies on a Fourier transform of the seismic elastodynamic data, resulting from SPECFEM3D_GLOBE, which are then postprocessed by a sequence of solutions to Helmholtz equations, in the exterior of the globe. For the acoustic simulations BEM has been employed, which reduces computation to the sphere; however, its naive implementation suffers from quadratic time and memory complexity, with respect to the number of unknowns. To overcome the latter, the method was accelerated by using hierarchical matrices and adaptive cross approximation techniques, which is referred to as Fast BEM. First, a hierarchical clustering of the globe surface triangulation is performed. The arising cluster pairs decompose the fully populated BEM matrices into a hierarchy of blocks, which are classified as far-field or near-field. While the near-field blocks are kept as full matrices, the far-field blocks are approximated by low-rank matrices. This reduces the quadratic complexity of the serial code to almost linear complexity, i.e. O(n*log(n)), where n denotes the number of triangles. Furthermore, a parallel implementation was done, so that the blocks are assigned to concurrent MPI processes with an optimal load balance. The processes share the triangulation data. The parallel code reduces the computational complexity to O(n*log(n)/N), where N denotes the number of processes. This is a novel implementation of BEM that overcomes computational times of traditional volume discretization methods, e.g. finite elements, by an order of magnitude.
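
The complexity reduction described above rests on applying far-field blocks in low-rank form: a block approximated as U * V^T with rank k much smaller than its dimensions costs O(k(m+n)) per matrix-vector product instead of O(mn). The sketch below illustrates this for a single block; it is generic code, not part of the package built on SPECFEM3D.

/* Generic sketch of why low-rank far-field blocks cut the cost of a BEM
 * matrix-vector product: a block approximated as A ~ U * V^T with rank
 * k << m,n is applied in O(k*(m+n)) operations instead of O(m*n). */

typedef struct {
    int m, n, k;        /* block size m x n, approximation rank k     */
    const double *U;    /* m x k, row-major                           */
    const double *V;    /* n x k, row-major: block ~ U * V^T          */
} lowrank_block;

/* y += (U * V^T) * x  for one far-field block */
void lowrank_block_matvec(const lowrank_block *b, const double *x, double *y)
{
    double t[64];                       /* assumes k <= 64 for brevity */
    for (int r = 0; r < b->k; r++) {    /* t = V^T * x                 */
        t[r] = 0.0;
        for (int j = 0; j < b->n; j++)
            t[r] += b->V[j * b->k + r] * x[j];
    }
    for (int i = 0; i < b->m; i++)      /* y += U * t                  */
        for (int r = 0; r < b->k; r++)
            y[i] += b->U[i * b->k + r] * t[r];
}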

Download PDF


Authors: John Donnersa, Chandan Basub, Alastair McKinstryc, Muhammad Asifd, Andrew Portere, Eric Maisonnavef, Sophie Valckef, Uwe Fladrichg
aSARA, Amsterdam, The Netherlands
bNSC, Linkoping, Sweden
cICHEC, Galway, Ireland
dIC3, Barcelona, Spain
eSTFC, Daresbury, United Kingdom
fCERFACS, Toulouse, France
gSMHI, Norrkoping, Sweden

Abstract: The EC-EARTH model is a global, coupled climate model that consists of the separate components IFS for the atmosphere and NEMO for the ocean, coupled using the OASIS coupler. EC-EARTH was ported and run on the Curie system. Different configurations, using resolutions from T159 (approx. 128 km) to T799 (approx. 25 km), were available for benchmarking. Scalasca was used to analyze the performance of the model in detail. Although it was expected that either the I/O or the coupling would be a bottleneck for scaling of the highest resolution model, that is clearly not yet the case. The IFS model uses two MPI_Alltoallv calls per timestep that dominate the loss of scaling at 1024 cores. Using the OpenMP functionality in IFS could potentially increase scalability considerably, but this does not yet work on Curie. Work is ongoing to make MPI_Alltoallv more efficient on Curie. It is expected that I/O and/or coupling will become a bottleneck when IFS can be scaled beyond 2000 cores. Therefore, the OASIS team increased the scalability of OASIS dramatically with the implementation of a radically different approach, showing less than 1% overhead at 2000 cores. The scalability of NEMO was improved during an earlier PRACE project. The I/O subsystem in IFS is described and is probably not easily accelerated unless it is rewritten to use a different file format.
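
For context, the communication pattern singled out above has the shape sketched below: a personalised all-to-all in which every rank sends a different amount of data to every other rank, as used for data transpositions in spectral codes. The fragment is illustrative only and is not IFS source code.

/* Shape of the personalised all-to-all exchange flagged above: every rank
 * sends and receives a different amount of data from every other rank.
 * IFS issues two such transpositions per time step, so the cost of this
 * call grows quickly with the rank count. */
#include <mpi.h>

void transpose_exchange(const double *sendbuf, const int *sendcounts,
                        const int *sdispls, double *recvbuf,
                        const int *recvcounts, const int *rdispls,
                        MPI_Comm comm)
{
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                  recvbuf, recvcounts, rdispls, MPI_DOUBLE, comm);
}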

Download PDF


Authors: Dalibor Lukas, Petr Kovar, Tereza Kovarova, Jan Zapletal
Department of Applied Mathematics, IT4Innovations, VSB-Technical University of Ostrava, Czech Republic

Abstract: In this paper, a new parallel acoustic simulation package has been created, using the boundary element method (BEM). The package is built on top of SPECFEM3D, which is parallel software for doing seismic simulations, e.g. earthquake simulations of the globe. The acoustical simulation relies on a Fourier transform of the seismic elastodynamic data, resulting from SPECFEM3D_GLOBE, which are then postprocessed by a sequence of solutions to Helmholtz equations, in the exterior of the globe. For the acoustic simulations BEM has been employed, which reduces computation to the sphere; however, its naive implementation suffers from quadratic time and memory complexity, with respect to the number of unknowns. To overcome the latter, the method was accelerated by using hierarchical matrices and adaptive cross approximation techniques, which is referred to as Fast BEM. First, a hierarchical clustering of the globe surface triangulation is performed. The arising cluster pairs decompose the fully populated BEM matrices into a hierarchy of blocks, which are classified as far-field or near-field. While the near-field blocks are kept as full matrices, the far-field blocks are approximated by low-rank matrices. This reduces the quadratic complexity of the serial code to almost linear complexity, i.e. O(n log(n)), where n denotes the number of triangles. Furthermore, a parallel implementation was done, so that the blocks are assigned to concurrent MPI processes with an optimal load balance. The novelty of our approach is based on a nontrivial and theoretically supported memory distribution of the hierarchical matrices and right-hand side vectors so that the overall memory consumption leads to O(n log(n)/N+n/sqrt(N)), which is the theoretical limit at the same time.

Download PDF


Authors: Marcin Zielinskia, John Donnersa
aSARA B.V., Science Park 140, 1098XG Amsterdam, The Netherlands

Abstract: The project focused on an evaluation of the code for a possible introduction of OpenMP, on its actual implementation, and on extensive tests. The major time-consuming parts of the code were detected and thoroughly analyzed. The most time-consuming part was successfully parallelized using OpenMP. Very extensive test simulations using the hybrid code allowed for many further improvements and for validation of its results. Possible improvements have also been discussed with the developers, to be implemented in the near future.

Download PDF


Author: Chandan Basu
National Supercomputer Center, Linkoping University, Linkoping 581 83, Sweden

Abstract: The high-resolution version of EC-EARTH has been ported to Curie. The scalability of the code is tested up to 3500 CPU cores. An example EC-EARTH run is profiled using the TAU tool.

Download PDF


Astrophysics applications


Authors: T. Ponweiser, M.E. Innocenti, G. Lapenta, A. Beck, S. Markidis
Research Institute for Symbolic Computation (RISC), Johannes Kepler University, Altenberger Straße 69, 4040 Linz, Austria
Center for mathematical Plasma Astrophysics, Department of Mathematics, K.U. Leuven, Celestijnenlaan 200B, B-3001 Leuven, Belgium
Laboratoire Leprince-Ringuet, Ecole Polytechnique, CNRS-IN2P3, France
KTH Royal Institute of Technology, Stockholm, Sweden

Abstract: Parsek2D-MLMD is a semi-implicit Multi Level Multi Domain Particle-in-Cell (PIC) code for the simulation of astrophysical and space plasmas. In this whitepaper, we report on improvements to Parsek2D-MLMD carried out in the course of the PRACE preparatory access project 2010PA1802. Through algorithmic enhancements – in particular the implementation of smoothing and temporal sub-stepping – as well as through performance tuning using HPCToolkit, the efficiency of the code has been improved significantly. For representative benchmark cases, we consistently achieved a total speedup of a factor of 10 or higher.
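
As background to the smoothing enhancement mentioned above, PIC codes commonly damp grid-scale noise with a binomial (1-2-1) filter applied to grid quantities; a generic one-dimensional sketch is shown below. The actual filter and boundary treatment in Parsek2D-MLMD may differ.

/* Generic sketch of the kind of smoothing used in PIC codes: a binomial
 * (1-2-1) filter applied to a grid quantity to damp grid-scale noise.
 * End points are simply copied here; real codes apply proper boundary
 * conditions. */
void smooth_121(const double *in, double *out, int n)
{
    out[0] = in[0];
    out[n - 1] = in[n - 1];
    for (int i = 1; i < n - 1; i++)
        out[i] = 0.25 * in[i - 1] + 0.5 * in[i] + 0.25 * in[i + 1];
}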

Download PDF


Authors: J. Donners, J. Bedorf
SURFsara, Amsterdam, The Netherlands
Leiden University, Leiden, The Netherlands

Abstract: This white paper describes a project to modify the I/O of the Bonsai astrophysics code to scale up to more than 10,000 nodes on the Titan system. A remaining bottleneck is the I/O: the creation of separate files for each MPI task overloads the Lustre metadata server. The use of the SIONlib library on the Lustre filesystems of different PRACE systems is investigated. Several issues had to be resolved, both with the SIONlib library and with the Liblustre API, before a satisfactory I/O performance could be achieved. For a few thousand MPI tasks, SIONlib reaches about half the performance of the naive approach where each MPI task writes a separate file. However, when more MPI tasks are used, the SIONlib library shows the same performance as the naive approach. The SIONlib library exhibits both the performance and the scalability that are needed to be successful at exascale.
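
The "naive approach" used as the baseline in the comparison above is sketched below: every MPI task opens and writes its own file, so each snapshot generates one metadata operation per task, which is what overloads the Lustre metadata server at large task counts. The file naming and data layout are assumptions for illustration; this is not Bonsai or SIONlib code.

/* Baseline file-per-task output: one open (and thus one metadata server
 * operation) per MPI task per snapshot.  A SIONlib-style scheme replaces
 * this with a small number of shared container files. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

void write_snapshot_per_task(const double *particles, size_t n, int step)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char fname[256];
    snprintf(fname, sizeof fname, "snapshot_%05d.rank%06d.bin", step, rank);

    FILE *fp = fopen(fname, "wb");      /* one metadata operation per task */
    if (!fp) MPI_Abort(MPI_COMM_WORLD, 1);
    fwrite(particles, sizeof(double), n, fp);
    fclose(fp);
}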

Download PDF


Authors: Kacper Kowalik, Artur Gawryszczak, Marcin Lawenda, Michal Hanasza, Norbert Meyer
Nicolaus Copernicus University, Jurija Gagarina 11, 87-100 Torun, Poland
Copernicus Astronomical Center, Polish Academy of Sciences, Bartycka 18, 00-716 Warszawa, Poland
Poznan Supercomputing and Networking Centre, Dabrowskiego 79a, 60-529 Poznan, Poland

Abstract: PIERNIK is an MHD code created at the Centre for Astronomy, Nicolaus Copernicus University in Torun, Poland. The current version of the code uses a simple, conservative numerical scheme, known as the Relaxing TVD (RTVD) scheme. The aim of this project was to increase the performance of the PIERNIK code in cases where the computational domain is decomposed into a large number of smaller grids and each concurrent process is assigned a significant number of those grids. This optimization enables PIERNIK to run efficiently on Tier-0 machines. In chapter 1 we introduce the PIERNIK software in more detail. Next we focus on scientific aspects (chapter 2) and discuss the algorithms used (chapter 3), including potential optimization issues. Subsequently we present a performance analysis (chapter 4) carried out with the Scalasca and Vampir tools. In the final chapter 5 we present the optimization results. In the appendix we provide technical information about the installation and test environment.

Download PDF


Authors: Joachim Heina, Anders Johansonb
aCentre of Mathematical Sciences & Lunarc, Lund University, Box 118, 221 00 Lund, Sweden
bDepartment of Astronomy and Theoretical Physics, Lund University, Box 43, 221 00 Lund, Sweden

Abstract: The simulation of weakly compressible turbulent gas flows with embedded particles is one of the main objectives of the Pencil Code. While the code mostly deploys high-order finite difference schemes, portions of the code require the use of Fourier space methods. This report describes an optimisation project to improve the performance of the parallel Fourier transformation in the code. Certain optimisations which significantly improve the performance of the parallel FFT were observed to have a negative impact on other parts of the code, such that the overall performance decreases. Despite this challenge the project managed to improve the performance of the parallel FFT within the Pencil Code by a factor of 2.4 and the overall performance of the application by 8% for a project-relevant benchmark.

Download PDF


Authors: Petri Nikunena, Frank Scheinerb
aCSC – IT Center for Science, P.O. Box 405, FI-02101 Espoo, Finland
bHigh Performance Computing Center Stuttgart (HLRS),University of Stuttgart, D-70550 Stuttgart, Germany

Abstract: Planck is a mission of the European Space Agency (ESA) to map the anisotropies of the cosmic microwave background with the highest accuracy ever achieved. Planck is supported by several computing centres, including CSC (Finland) and NERSC (USA). Computational resources were provided by CSC through the DECI project Planck-LFI, and by NERSC as a regular production project. This whitepaper describes how PRACE-2IP staff helped Planck-LFI with two types of support tasks: (1) porting their applications to the execution machine and seeking ways to improve applications’ performance; and (2) improving performance and facilities to transfer data between the execution site and the different data centres where data is stored.

Download PDF


Authors: V. Antonuccio-Delogua, U. Becciania, M. Cytowski*b, J. Heinc, J. Hertzerd
aINAF – Osservatorio Astrofisico di Catania, Italy
bInterdisciplinary Centre for Mathematical and Computational Modeling, University of Warsaw, Poland
cLunarc Lund University, Sweden
dHLRS, University of Stuttgart, Germany

Abstract: In this whitepaper we report work that was done to investigate and improve the performance of a mixed MPI and OpenMP implementation of the FLY code for cosmological simulations on a PRACE Tier-0 system Hermit (Cray XE6).

Download PDF


Authors: Claudio Ghellera, Graziella Ferinia, Maciej Cytowskib, Franco Vazzac
aCINECA, Via Magnanelli 6/3, Casalecchio di Reno, 40033, Italy
bICM, University of Warsaw, ul. Pawinskiego 5a, 02-106 Warsaw, Poland
cJacobs University, Campus Ring 1, 28759 Bremen, Germany

Abstract: In this paper we present the work performed in order to build and optimize the cosmological simulation code ENZO on Jugene, the Blue Gene/P system available at the Forschungszentrum Juelich in Germany. The work allowed us to define the optimal setup to perform high-resolution simulations aimed at the description of non-thermal phenomena (e.g. the acceleration of relativistic particles at shock waves) active in massive galaxy clusters during their cosmological evolution. These simulations will be the subject of a proposal in a future call for projects of the EU-funded PRACE project (http://www.prace-ri.eu/).

Download PDF


Authors: Ata Turka, Cevdet Aykanata, G. Vehbi Demircia, Sebastian von Alfthanb, Ilja Honkonenb
aBilkent University, Computer Engineering Department, 06800 Ankara, Turkey
bFinnish Meteorological Institute, PO Box 503, Helsinki, FI-00101, Finland

Abstract: This whitepaper describes the load-balancing performance issues that were observed and tackled during the petascaling of a space plasma simulation code developed at the Finnish Meteorological Institute (FMI). The code models the communication pattern as a hypergraph and partitions the computational grid using the parallel hypergraph partitioning scheme (PHG) of the Zoltan partitioning framework. The result of the partitioning determines the distribution of grid cells to processors. It is observed that the partitioning phase takes a substantial percentage of the overall computation time. Alternative (graph-partitioning-based) schemes that perform almost as well as the hypergraph partitioning scheme, require less preprocessing overhead, and provide better balance are proposed and investigated. A comparison of Zoltan’s PHG, ParMeTiS, and PT-SCOTCH in terms of effect on running time, preprocessing overhead and load-balancing quality is presented. Test results on the Juelich BlueGene/P cluster are presented.

Download PDF


Finite Element applications


Authors: Mikko Byckling, Mika Malinen, Juha Ruokolainen, Peter Raback
CSC – IT Center for Science, Keilaranta 14, 02101 Espoo, Finland

Abstract: Recent developments of the Elmer finite element solver are presented. The applicability of the code to industrial problems has been improved by introducing features for handling rotational boundary conditions with mortar finite elements. The scalability of the code has been improved by making the code thread-safe and by multithreading some critical sections of the code. The developments are described and some scalability results are presented.

Download PDF


Authors: X. Saez, E. Casoni, G. Houzeaux, M. Vazquez
Dept. of Computer Applications in Science and Engineering, Barcelona Supercomputing Center (BSC-CNS), 08034 Barcelona, Spain

Abstract: While solid mechanics codes are now proven tools in both the industry and research sectors, the increasingly exigent requirements of both sectors are fuelling the need for more computational power and more advanced algorithms. While commercial codes are widely used in industry during the design process, they often lag behind academic codes in terms of computational efficiency. In fact, commercial codes are usually general purpose and include millions of lines of code. Massively parallel computers appeared only recently, and the adaptation of these codes is proceeding slowly. In the meantime, academia has adapted very quickly to the new computer architectures and now offers an attractive alternative: not so much more modeling, but more accuracy.

Alya is a computational mechanics code developed at Barcelona Supercomputing Center (BSC-CNS) that solves Partial Differential Equations (PDEs) on non-structured meshes. To address the lack of an efficient parallel solid mechanics code, and motivated by the demand coming from industrial partners, Alya-Solidz, the specific Alya module for solving computational solid mechanics problems, has been enhanced to treat large complex problems involving solid deformations and fracture. Some of these developments have been carried out in the framework of the PRACE-2IP European project.

In this article a solid mechanics simulation strategy for parallel supercomputers based on a hybrid approach is presented. A hybrid parallelization approach combining MPI tasks with OpenMP threads is proposed in order to exploit the different levels of parallelism of current multicore architectures. This paper describes the strategy programmed in Alya and shows nearly optimal scalability results for some solid mechanics problems.

Download PDF


Authors: T. Kozubek, M. Jarosov, M. Mensik, A. Markopoulos
CE IT4Innovations, VSB-TU of Ostrava, 17. listopadu 15, 70833 Ostrava, Czech Republic

Abstract: We describe a hybrid FETI (Finite Element Tearing and Interconnecting) method based on our variant of the FETI-type domain decomposition method called Total FETI. In our approach a small number of neighboring subdomains is aggregated into clusters, which results in a smaller coarse problem. To solve the original problem the Total FETI method is applied twice: to the clusters (macro-subdomains) and then to the subdomains in each cluster. This approach simplifies the implementation of hybrid FETI methods and makes it possible to extend the parallelization of the original problem up to tens of thousands of cores, owing to the coarse space reduction and thus lower memory requirements. The performance is demonstrated on a linear elasticity benchmark.

Download PDF


Authors: T. Kozubek, D. Horak, V. Hapla
CE IT4Innovations, VSB-TU of Ostrava, 17. listopadu 15, 70833 Ostrava, Czech Republic

Abstract: Most of the computations (subdomain problems) appearing in FETI-type methods are purely local and therefore parallelizable without any data transfers. However, if we want to accelerate the dual actions as well, some communication is needed due to the primal-dual transition. The distribution of the primal matrices is quite straightforward: each core works with the local part associated with its subdomains. A natural approach on massively parallel computers is to maximize the number of subdomains, so that the sizes of the subdomain stiffness matrices are reduced, which accelerates their factorization and the subsequent pseudoinverse application, which are among the most time-consuming actions. On the other hand, a negative effect of this is an increase in the null space dimension and in the number of Lagrange multipliers on the subdomain interfaces, i.e. the dual dimension, so that the bottleneck of the TFETI method becomes the application of the projector onto the natural coarse space, especially the part called the coarse problem solution. In this paper, we suggest and test different parallelization strategies for the coarse problem solution with regard to improving the massively parallel TFETI implementation. We also discuss some details of our FLLOP (Feti Light Layer on Petsc) implementation and demonstrate its performance on an engineering elastostatic benchmark of a car engine block with up to almost 100 million DOFs. The best parallelization strategy, based on MUMPS, was implemented into the multi-physical finite element based open-source code ELMER developed by CSC, Finland.

Download PDF


Authors: T. Kozubeka, V. Vondraka, P. Rabackb, J. Ruokolainenb
a Department of Applied Mathematics, VSB-TU of Ostrava, 17. listopadu 15, 70833 Ostrava, Czech Republic
b CSC – IT Center for Science, Keilaranta 14 a, 20101 Espoo, Finland

Abstract: The bottlenecks related to the numerical solution of many engineering problems depend strongly on the techniques used to solve the systems of linear equations that result from their linearization and finite element discretization. The large linearized problems can be solved efficiently using so-called scalable algorithms based on multigrid or domain decomposition methods. In cooperation with the Elmer team, two variants of the domain decomposition method have been implemented into Elmer: (i) FETI-1 (Finite Element Tearing and Interconnecting), introduced by Farhat and Roux, and (ii) Total FETI, introduced by Dostal, Horak, and Kucera. In the latter, the Dirichlet boundary conditions are torn off so that all subdomains are floating, which makes the method very flexible. In this paper, we review the results related to the efficient solution of the symmetric positive semidefinite systems arising in FETI methods when they are applied to elliptic boundary value problems. More specifically, we show three different strategies to find the so-called fixing nodes (or DOFs – degrees of freedom), which enable an effective regularization of the corresponding subdomain system matrices that eliminates the work with singular matrices. The performance is illustrated on an elasticity benchmark computed using ELMER on the French Tier-0 system CURIE.

Download PDF


Authors: J. Ruokolainena, P. Rabacka,*, M. Lylya, T. Kozubekb, V. Vondrakb, V. Karakasisc, G. Goumasc
a CSC – IT Center for Science, Keilaranta 14 a, 20101 Espoo, Finland
b Department of Applied Mathematics, VSB – Technical University of Ostrava, 17. listopadu 15, 70833 Ostrava Poruba, Czech Republic
c ICCS-NTUA, 9, Iroon. Polytechniou Str., GR-157 73 Zografou, Greece

Abstract: Elmer is a finite element software package for the solution of multiphysical problems. In the present work some performance bottlenecks in the workflow are eliminated. In preprocessing, the mesh splitting scheme is improved to allow the conservation of mesh grading for simple problems. For the solution of linear systems a preliminary FETI domain decomposition method is implemented. It utilizes a direct factorization of the local problems and an iterative method for joining the results from the subproblems. The weak scaling of FETI is shown to be nearly ideal, with the number of iterations staying almost fixed. For postprocessing, binary output formats and an XDMF+HDF5 I/O routine are implemented. Both may be used in conjunction with parallel visualization software.

Download PDF



Authors: Vasileios Karakasis1, Georgios Goumas1, Konstantinos Nikas2,*, Nectarios Koziris1, Juha Ruokolainen3, and Peter Raback3
1Institute of Communication and Computer Systems (ICCS), Greece
2Greek Research & Technology Network (GRNET), Greece
3CSC -IT Center for Science Ltd., Finland

Abstract: Multiphysics simulations are at the core of modern Computer Aided Engineering (CAE), allowing the analysis of multiple, simultaneously acting physical phenomena. These simulations often rely on Finite Element Methods (FEM) and the solution of large linear systems which, in turn, end up in multiple calls of the costly Sparse Matrix-Vector Multiplication (SpMV) kernel. We have recently proposed the Compressed Sparse eXtended (CSX) format, which applies aggressive compression to the column indexing structure of the CSR format and is able to provide an average performance improvement of more than 40% over multithreaded CSR implementations. This work integrates CSX into the Elmer multiphysics simulation software and evaluates its impact on the total execution time of the solver. Despite its preprocessing cost, CSX is able to improve the performance of Elmer’s SpMV component (which uses multithreaded CSR) by almost 40% and provides an up to 15% performance gain in the overall solver time after 1000 linear system iterations. To our knowledge, this is one of the first attempts to evaluate the real impact of an innovative sparse-matrix storage format within a ‘production’ multiphysics software package.
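
For reference, the multithreaded CSR kernel that CSX is measured against has the familiar form sketched below; CSX itself replaces the column-index array with a compressed encoding of index patterns, which is not reproduced here. The sketch is generic, not Elmer source code.

/* Baseline multithreaded CSR SpMV: y = A * x, with A stored in the
 * Compressed Sparse Row format (rowptr of length nrows+1). */
void csr_spmv(int nrows, const int *rowptr, const int *colidx,
              const double *values, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
            sum += values[j] * x[colidx[j]];   /* irregular access to x */
        y[i] = sum;
    }
}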

Download PDF


Authors: K. Georgiev, N. Kosturski, I. Lirkov, S. Margenov, Y. Vutov
National Center for Supercomputing Applications, Acad. G. Bonchev str, Bl. 25-A, 1113 Sofia, Bulgaria

Abstract: This white paper focuses on: a) the construction and analysis of novel scalable algorithms to enhance scientific applications based on mesh methods (mainly on finite element method (FEM) technology); and b) the optimization of a new class of algorithms on many-core systems. On the one hand, the commonly accepted benchmark problem in computational fluid dynamics (CFD) – the time-dependent system of incompressible Navier-Stokes equations – is considered. The activities were motivated by advanced large-scale simulations of turbulent flows in the atmosphere and in the ocean, the simulation of multiphase flows in order to extract average statistics, solving subgrid problems as part of homogenization procedures, etc. The computer model is based on the implementation of a new class of parallel numerical methods and algorithms for time-dependent problems. It only requires the solution of tridiagonal linear systems and is therefore computationally very efficient, with a computational complexity of the same order as that of an explicit scheme, and yet unconditionally stable. The scheme is particularly convenient for parallel implementation. Among the most important novel ideas is to avoid the transposition which is usually used in alternating-directions time stepping algorithms. The final goal is to provide portable tools for integration in commonly accepted codes like Elmer and OpenFOAM. The newly developed software is organized as a computer library for use by researchers dealing with the solution of the incompressible Navier-Stokes equations. On the other hand, we implement and develop new scalable algorithms and software for FEM simulations with typically O(10^9) degrees of freedom in space for an IBM Blue Gene/P computer. We have considered voxel and unstructured meshes, stationary and time-dependent problems, and linear and nonlinear models. The performed work was focused on the development of scalable mesh methods and the tuning of the related software tools, mainly for the IBM Blue Gene/P architecture, but other massively parallel computers and MPI clusters were taken into account too. Efficient algorithms for time stepping, mesh refinement and parallel mappings were implemented. The aim here is again to provide software tools for integration in Elmer and OpenFOAM. The computational models address discrete problems in the range of O(10^9) degrees of freedom in space. The related time stepping techniques and iterative solvers are targeted to meet the Tier-1 and (further) Tier-0 requirements. Scalability on 512 IBM Blue Gene/P nodes and on several other high performance computing clusters is currently documented for the tested software modules, and some results are presented in this paper. Comparison results of running the Elmer code on an Intel cluster (16 cores, Intel Xeon X5560) and on the IBM Blue Gene/P computer can be found. Variants of 1D, 2D and 3D domain partitioning for the 3D test problems were systematically analysed, showing the advantages of the 3D partitioning for the Blue Gene/P communication system.
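
The remark that the scheme only requires the solution of tridiagonal linear systems refers to the kind of serial kernel sketched below, the Thomas algorithm; the parallel alternating-directions organisation and the avoided transposition are not shown. This is a generic sketch, not code from the library described above.

/* Thomas algorithm: solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]
 * for i = 0..n-1 (a[0] and c[n-1] are unused).  b and d are overwritten;
 * the solution is returned in d. */
void thomas_solve(int n, const double *a, double *b, const double *c, double *d)
{
    for (int i = 1; i < n; i++) {           /* forward elimination */
        double w = a[i] / b[i - 1];
        b[i] -= w * c[i - 1];
        d[i] -= w * d[i - 1];
    }
    d[n - 1] /= b[n - 1];                   /* back substitution   */
    for (int i = n - 2; i >= 0; i--)
        d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
}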

Download PDF


Fusion applications


Authors: Xavier Saeza, Taner Akguna, Edilberto Sanchezb
a Barcelona Supercomputing Center – Centro Nacional de Supercomputacion, C/ Gran Capita 2-4, Barcelona, 08034, Spain
b Laboratorio Nacional de Fusion. Avda Complutense 22, 28040 Madrid, Spain

Abstract: In this paper we report the work done in Task 7.2 of the PRACE-1IP project on the code EUTERPE. We report on the progress made on the hybridization of the code with MPI and OpenMP, the status of the porting to GPUs, the outline of the analysis of parameters, and the study of the possibility of incorporating I/O forwarding to improve performance. Our initial findings indicate that particle-in-cell algorithms such as EUTERPE are suitable candidates for the new computing paradigms involving heterogeneous architectures.

Download PDF


Life Science applications


Authors: Andrew Sunderland, Martin Plummer
STFC Daresbury Laboratory, Warrington, United Kingdom

Abstract: DNA oxidative damage has long been associated with the development of a variety of cancers including colon, breast and prostate, whilst RNA damage has been implicated in a variety of neurological diseases, such as Alzheimer’s disease and Parkinson’s disease. Radiation damage arises when energy is deposited in cells by ionizing radiation, which in turn leads to strand breaks in DNA. The strand breaks are associated with electrons trapped in quasi-bound ‘resonances’ on the basic components of the DNA. HPC usage will enable the study of this resonance formation in much more detail than in current initial calculations. The associated application is UKRmol [1], a widely used, general-purpose electron-molecule collision package and the enabling aim is to replace a serial propagator (coupled PDE solver) with a parallel equivalent module.

Download PDF


Authors: R. Oguz Selvitopi, Gunduz Vehbi Demirci, Ata Turk, Cevdet Aykanat
Bilkent University, Computer Engineering Department, 06800 Ankara, TURKEY

Abstract: This whitepaper addresses the applicability of the Map/Reduce paradigm for scalable and easy parallelization of fundamental data mining approaches, with the aim of exploring and enabling the processing of terabytes of data on PRACE Tier-0 supercomputing systems. To this end, we first test the usage of the MR-MPI library, a lightweight Map/Reduce implementation that uses the MPI library for inter-process communication, on PRACE HPC systems; we then propose MR-MPI-based implementations of a number of machine learning algorithms and constructs; and finally we provide an experimental analysis measuring the scaling performance of the proposed implementations. We test our machine learning algorithms with different datasets. The obtained results show that utilization of the Map/Reduce paradigm can be a strong enhancer on the road to petascale.

Download PDF


Authors: Thomas Roblitz, Ole W. Saastad, Hans A. Eide, Katerina Michalickova, Alexander Johan Nederbragt, Bastiaan Star
Department for Research Computing, University Center for Information Technology (USIT), University of Oslo, P.O. Box 1059, Blindern, 0316 Oslo, Norway
Center for Ecological and Evolutionary Synthesis, Department of Biosciences (CEES), University of Oslo, P.O. Box 1066, Blindern, 0316 Oslo, Norway

Abstract: Sequencing projects, like the Aqua Genome project, generate vast amounts of data which are processed through different workflows composed of several steps linked together. Currently, such workflows are often run manually on large servers. With the increasing amount of raw data that approach is no longer feasible. The successful implementation of the project's goals requires a 2-3 orders of magnitude scaling of computing, while at the same time achieving high reliability on, and supporting ease-of-use of, supercomputing resources. We describe two example use cases, the implementation challenges and constraints, and the actual application enabling, and report our findings.

Download PDF


Authors: A. Charalampidoua,b, P. Daogloua,b, D. Foliasa,b, P. Borovskac,d, V. Ganchevac,e
a Greek Research and Technology Network, Athens, Greece
b Scientific Computing Center, Aristotle University of Thessaloniki, Greece
c National Centre for Supercomputing Applications, Bulgaria
d Department of Computer Systems, Technical University of Sofia, Bulgaria
e Department of Programming and Computer Technologies, Technical University of Sofia, Bulgaria

Abstract: The project focuses on the performance investigation and improvement of the multiple biological sequence alignment software MSA_BG on the BlueGene/Q supercomputer JUQUEEN. For this purpose, scientific experiments in the area of bioinformatics have been carried out, using influenza virus sequences as a case study. The objectives of the project are code optimization, porting, scaling, profiling and performance evaluation of the MSA_BG software. To this end we have developed a hybrid MPI/OpenMP parallelization on top of the MPI-only code, and we showcase the advantages of this approach through the results of benchmark tests performed on JUQUEEN. The experimental results show that the hybrid parallel implementation provides considerably better performance than the original code.

Download PDF


Authors: Plamenka Borovska, Veska Gancheva
National Centre for Supercomputing Applications, Bulgaria

Abstract: In silico biological sequence processing is a key task in molecular biology. This scientific area requires powerful computing resources for exploring large sets of biological data. Parallel in silico simulations based on methods and algorithms for the analysis of biological data using high-performance distributed computing are essential for accelerating the research and reducing the investment. Multiple sequence alignment is a widely used method for biological sequence processing; its goal is the alignment of DNA and protein sequences. This paper presents an innovative parallel algorithm, MSA_BG, for the multiple alignment of biological sequences that is highly scalable and locality aware. The MSA_BG algorithm we describe is iterative and is based on the concept of Artificial Bee Colony (ABC) metaheuristics and the concept of algorithmic and architectural space correlation. The metaphor of the ABC metaheuristics has been constructed and the functionalities of the agents have been defined. The conceptual parallel model of computation has been designed and the algorithmic framework of the parallel algorithm has been constructed. Experimental simulations based on a parallel implementation of the MSA_BG algorithm for multiple sequence alignment on a heterogeneous compact computer cluster and the supercomputer BlueGene/P have been carried out for the case study of influenza virus variability investigation. The performance estimation and profiling analyses have shown that the parallel system is well balanced both with respect to the workload and the machine size.

Download PDF


Authors: Soon-Heum Koa, Plamenka Borovskab‡, Veska Ganchevac†
a National Supercomputing Center, Linkoping University, 58183 Linkoping, Sweden
b Department of Computer Systems, Technical University of Sofia, Sofia, Bulgaria
c Department of Programming and Computer Technologies, Technical University of Sofia, Sofia, Bulgaria

Abstract: This activity within the PRACE-2IP project is aimed at investigating and improving the performance of the multiple sequence alignment software ClustalW on the BlueGene/Q supercomputer JUQUEEN, for the case study of influenza virus sequences. Porting, tuning, profiling, and scaling of this code have been accomplished in this respect. A parallel I/O interface has been designed for efficient sequence dataset input, in which the local masters of sub-groups take care of the read operation and broadcast the dataset to their slaves. The optimal group size has been investigated and the effect of the read buffer size on read performance has been examined experimentally. The application to the ClustalW software shows that the current implementation with parallel I/O provides considerably better performance than the original code in the I/O segment, leading to a speed-up of up to 6.8 times for reading the input dataset when using 8192 JUQUEEN cores.
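
The two-level input scheme described above maps naturally onto MPI communicator splitting: each sub-group's local master reads the dataset and broadcasts it to the other ranks of its group. The sketch below illustrates this; the group size, single-buffer read and error handling are simplifications, and the code is not taken from ClustalW.

/* Sketch of a two-level input scheme: ranks are split into groups of
 * group_size consecutive ranks, the local master (group rank 0) reads
 * the file, and the dataset is broadcast within each group. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

char *read_and_share(const char *path, long *nbytes, int group_size)
{
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    MPI_Comm group;                          /* sub-group communicator     */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank / group_size, world_rank, &group);
    int grank;
    MPI_Comm_rank(group, &grank);

    char *buf = NULL;
    if (grank == 0) {                        /* local master reads the file */
        FILE *fp = fopen(path, "rb");
        if (!fp) MPI_Abort(MPI_COMM_WORLD, 1);
        fseek(fp, 0, SEEK_END);
        *nbytes = ftell(fp);
        fseek(fp, 0, SEEK_SET);
        buf = malloc(*nbytes);
        fread(buf, 1, *nbytes, fp);
        fclose(fp);
    }
    MPI_Bcast(nbytes, 1, MPI_LONG, 0, group);        /* size first         */
    if (grank != 0) buf = malloc(*nbytes);
    /* assumes the dataset fits in an int-sized message count */
    MPI_Bcast(buf, (int)*nbytes, MPI_CHAR, 0, group); /* then the dataset   */

    MPI_Comm_free(&group);
    return buf;
}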

Download PDF


Authors: D. Grancharov, E. Lilkova, N. Ilieva, P. Petkov, S. Markov and L. Litov
National Centre for Supercomputing Applications, Acad. G. Bonchev Str, Bl. 25-A, 1113 Sofia, Bulgaria

Abstract: Based on the analysis of the performance, scalability, work-load increase and distribution of the MD simulation packages GROMACS and NAMD for very large systems and core numbers, we evaluate the possibilities for overcoming the deterioration of the scalability and performance of the existing MD packages by implementation of symplectic integration algorithms with multiple step sizes.
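
As an illustration of multiple-step-size integration of the kind evaluated above, the sketch below shows a generic RESPA-style leapfrog scheme in which cheap, rapidly varying forces are integrated with a short step and expensive, slowly varying forces only once per long step. It is a minimal sketch, not GROMACS or NAMD source code, and the splitting into fast and slow forces is assumed to be given.

/* Generic multiple-time-step (RESPA-style) integrator: fast forces use
 * the small step dt, slow forces act once per n_inner inner steps. */

typedef void (*force_fn)(const double *x, double *f, int n);

void mts_leapfrog(double *x, double *v, int n, double mass,
                  force_fn fast_force, force_fn slow_force,
                  double dt, int n_inner, int n_outer,
                  double *f_fast, double *f_slow)
{
    slow_force(x, f_slow, n);
    for (int step = 0; step < n_outer; step++) {
        /* half-kick from slow forces over the long step n_inner*dt */
        for (int i = 0; i < n; i++)
            v[i] += 0.5 * n_inner * dt * f_slow[i] / mass;

        for (int k = 0; k < n_inner; k++) {      /* inner loop: fast forces */
            fast_force(x, f_fast, n);
            for (int i = 0; i < n; i++)
                v[i] += 0.5 * dt * f_fast[i] / mass;
            for (int i = 0; i < n; i++)
                x[i] += dt * v[i];
            fast_force(x, f_fast, n);
            for (int i = 0; i < n; i++)
                v[i] += 0.5 * dt * f_fast[i] / mass;
        }

        slow_force(x, f_slow, n);                /* closing slow half-kick  */
        for (int i = 0; i < n; i++)
            v[i] += 0.5 * n_inner * dt * f_slow[i] / mass;
    }
}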

Download PDF


Particle Physics applications


Authors: Jacques David, Vincent Bergeaud
CEA/DEN/SA2P, CEA-Saclay, 91191 Gif sur Yvette, France
CEA/DEN/DM2S/LGLS, CEA-Saclay, 91191 Gif sur Yvette, France

Abstract: Mathematical models designed to simulate complex physics are used in scientific and engineering studies. In the case of nuclear applications, where safety parameters such as fuel temperature must be assessed, numerical simulation is paramount, and confidence is gained through comparison with experiment. The URANIE tool uses propagation methods to assess uncertainties in simulation output parameters in order to better evaluate confidence intervals (e.g., of temperature, pressure, etc.). This is part of the Verification, Validation and Uncertainty Quantification (VVUQ) process used for safety analysis. While URANIE is well suited for launching many instances of serial codes, it suffers from a lack of scalability and portability when used for coupled simulations and/or parallel codes. The aim of the project is therefore to enhance this launching mechanism to support a wider variety of applications, leveraging HPC capabilities to reach a new level of statistical assessment for models.

Download PDF


Authors: Alexei Strelchenko, Marcus Petschlies and Giannis Koutsou
CaSToRC, Nicosia 2121, Cyprus

Abstract: We extend the QUDA library, an open source library for performing calculations in lattice QCD on Graphics Processing Units (GPUs) using NVIDIA’s CUDA platform, to include kernels for non-degenerate twisted mass and multi-GPU Domain Wall fermion operators. Performance analysis is provided for both cases.

Download PDF


Mathematics applications


Authors: Krzysztof T. Zwierzynskia
a Poznan Supercomputing and Networking Center, ul. Z. Noskowskiego 12/14, 61-704 Poznan, Poland

Abstract: In this paper we consider the problem of designing a self-improving meta-model of a job workflow that is sensitive to changes in the computational environment. As examples of the combinatorial objects searched for, permutations and some classes of integral graphs are used. We propose a number of dedicated methods, based on decision trees and the replication of some actors in the workflow, that can improve the execution time of the workflow.

Download PDF


Authors: Jerome Richarda,+, Vincent Lanoreb,+ and Christian Perezc,+
a University of Orleans, France
b Ecole Normale Superieure de Lyon, France
c Inria
+ Avalon Research-Team, LIP, ENS Lyon, France

Abstract: The Fast Fourier Transform (FFT) is a widely-used building block for many high-performance scientific applications. Efficient computation of the FFT is paramount for the performance of these applications. This has led to many efforts to implement machine- and computation-specific optimizations. However, no existing FFT library is capable of easily integrating and automating the selection of new and/or unique optimizations. To ease FFT specialization, this paper evaluates the use of component-based software engineering, a programming paradigm which consists in building applications by assembling small software units. Component models are known to have many software engineering benefits but usually have insufficient performance for high-performance scientific applications. This paper uses the L2C model, a general purpose high-performance component model, and studies its performance and adaptation capabilities on 3D FFTs. Experiments show that L2C, and components in general, enable easy handling of 3D FFT specializations while obtaining performance comparable to that of well-known libraries. However, a higher-level component model is needed to automatically generate an adequate L2C assembly.

Download paper: PDF


Authors: R. Oguz Selvitopia, Cevdet Aykanata,*
a Bilkent University, Computer Engineering Department, 06800 Ankara, TURKEY

Abstract: Parallel iterative solvers are widely used in solving large sparse linear systems of equations on large-scale parallel architectures. These solvers generally contain two different types of communication operations: point-to-point (P2P) and global collective communications. In this work, we present a computational reorganization method to exploit a property that is commonly found in Krylov subspace methods. This reorganization allows P2P and collective communications to be performed simultaneously. We exploit this opportunity by embedding the content of the P2P messages into the messages exchanged in the collective communications, in order to reduce the latency overhead of the solver. Experiments on two different supercomputers with up to 2048 processors show that the proposed latency-avoiding method exhibits superior scalability, especially with an increasing number of processors.

Download paper: PDF

Authors: Gunduz Vehbi Demirci, Ata Turk, R. Oguz Selvitopi, Kadir Akbudak, Cevdet Aykanat
Bilkent University, Computer Engineering Department, 06800 Ankara, TURKEY

Abstract: This whitepaper addresses the applicability of the MapReduce paradigm for scientific computing by realizing it for the widely used sparse matrix-vector multiplication (SpMV) operation with a recent library developed for this purpose. Scaling SpMV operations proves vital as it is a kernel that finds applications in many scientific problems from different domains. Generally, the scalability of these operations is negatively affected by the high communication requirements of the multiplication, especially at large processor counts in the case of strong scaling. We propose two partitioning-based methods to reduce these requirements and allow SpMV operations to be performed more efficiently. We demonstrate how to parallelize SpMV operations using MR-MPI, an efficient and portable library that aims at enabling the usage of the MapReduce paradigm in scientific computing. We test our methods extensively with different matrices. The obtained results show that the utilization of communication-efficient methods and constructs is required on the road to exascale.
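
The MapReduce formulation of SpMV can be summarised as: the map step emits one (row, a_ij * x_j) pair per nonzero, and the reduce step sums the values sharing a row key. The plain-C sketch below shows only this decomposition; it does not use the MR-MPI library API, the partitioning methods proposed in the whitepaper, or any inter-process communication.

/* MapReduce-style decomposition of SpMV for a COO-format matrix:
 * "map" emits a (row, a_ij * x_j) pair per nonzero, "reduce" sums the
 * values that share a row key into the result vector. */

typedef struct { int row; double val; } kv_pair;

/* map: one key-value pair per nonzero */
int spmv_map(int nnz, const int *rows, const int *cols, const double *a,
             const double *x, kv_pair *out)
{
    for (int k = 0; k < nnz; k++) {
        out[k].row = rows[k];
        out[k].val = a[k] * x[cols[k]];
    }
    return nnz;
}

/* reduce: sum values per row key into y (length nrows) */
void spmv_reduce(int npairs, const kv_pair *pairs, int nrows, double *y)
{
    for (int i = 0; i < nrows; i++) y[i] = 0.0;
    for (int k = 0; k < npairs; k++)
        y[pairs[k].row] += pairs[k].val;
}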

Download paper: PDF


Authors: Petri Nikunena, Frank Scheinerb
a CSC – IT Center for Science, P.O. Box 405, FI-02101 Espoo, Finland
b High Performance Computing Center Stuttgart (HLRS),University of Stuttgart, D-70550 Stuttgart, Germany

Abstract: Planck is a mission of the European Space Agency (ESA) to map the anisotropies of the cosmic microwave background with the highest accuracy ever achieved. Planck is supported by several computing centres, including CSC (Finland) and NERSC (USA). Computational resources were provided by CSC through the DECI project Planck-LFI, and by NERSC as a regular production project. This whitepaper describes how PRACE-2IP staff helped Planck-LFI with two types of support tasks: (1) porting their applications to the execution machine and seeking ways to improve the applications’ performance; and (2) improving the performance of, and the facilities for, transferring data between the execution site and the different data centres where the data is stored.

Download paper: PDF


Author: Krzysztof T. Zwierzynski
Poznan Supercomputing and Networking Center, ul. Z. Noskowskiego 12/14, 61-704 Poznan, Poland

Abstract: In this white paper we report the work done on the problem of generating combinatorial structures with certain rare invariant properties. These combinatorial structures are connected integral graphs; all 588 such graphs of order 1 ≤ n ≤ 12 are known. The main goal of this work was to reduce the generation time by distributing graph generators over hosts in the PRACE-RI, and to reduce the time needed to sieve integral graphs by performing the eigenvalue calculation on a GPGPU device using OpenCL. This work is also a study of how to minimize the overhead associated with using OpenCL kernels.
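
As context for the sieving step: a graph is integral when every eigenvalue of its adjacency matrix is an integer. The sketch below checks this property on the CPU with LAPACK's dsyev; the paper offloads the eigenvalue computation to a GPGPU through OpenCL instead, and the dense-matrix construction, tolerance and function names here are illustrative assumptions.

```cpp
// A minimal CPU sketch of the integrality sieve (the paper uses OpenCL on a GPGPU).
#include <cmath>
#include <utility>
#include <vector>
#include <lapacke.h>

// edges: pairs (u, v) with 0 <= u, v < n; undirected, no self-loops.
bool is_integral_graph(int n, const std::vector<std::pair<int, int>> &edges,
                       double tol = 1e-9)
{
    std::vector<double> a(static_cast<size_t>(n) * n, 0.0);   // dense adjacency matrix
    for (const auto &e : edges) {
        a[static_cast<size_t>(e.first) * n + e.second] = 1.0;
        a[static_cast<size_t>(e.second) * n + e.first] = 1.0;
    }

    std::vector<double> w(n);                                  // eigenvalues
    if (LAPACKE_dsyev(LAPACK_ROW_MAJOR, 'N', 'U', n, a.data(), n, w.data()) != 0)
        return false;                                          // eigensolver failed

    for (double ev : w)
        if (std::fabs(ev - std::round(ev)) > tol)
            return false;                                      // non-integer eigenvalue
    return true;
}

// Example: the complete graph K4 has spectrum {3, -1, -1, -1} and is integral, so
// is_integral_graph(4, {{0,1},{0,2},{0,3},{1,2},{1,3},{2,3}}) returns true.
```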

Download paper: PDF



Authors: Dimitris Siakavaras, Konstantinos Nikas, Nikos Anastopoulos, and Georgios Goumas
Greek Research & Technology Network (GRNET), Greece

Abstract: This whitepaper studies the various aspects and challenges of performance scaling on large-scale shared-memory systems. Our experiments are performed on a large ccNUMA machine that consists of 72 IBM 3755 nodes connected with NumaConnect, providing shared memory over a total of 1728 cores, a number far beyond conventional server platforms. As benchmarks, three data-intensive and memory-bound applications with different communication patterns are selected, namely Jacobi, CSR SpM-V and Floyd-Warshall. Our results illustrate the need for NUMA-aware design and implementation of shared-memory parallel algorithms in order to scale to high core counts. At the same time, we observed that, depending on its communication pattern, an application can benefit more from explicit communication using message passing.
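
A common ingredient of the NUMA-aware designs referred to above is first-touch page placement: initialising data with the same static schedule that the compute loop later uses keeps each page on the node of the thread that works on it. The OpenMP sketch below illustrates this on a simple Jacobi-style sweep; the problem size, iteration count and loop structure are illustrative assumptions rather than the benchmarks used in the paper.

```cpp
// Minimal first-touch sketch for a ccNUMA system (illustrative, not the paper's code).
#include <omp.h>
#include <algorithm>
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t n = 1 << 26;
    // Plain new[] leaves the pages untouched; a zero-initialising container would
    // touch them on the allocating thread and pin everything to one NUMA node.
    double *a = new double[n];
    double *b = new double[n];

    // NUMA-aware first touch: each thread initialises the chunk it will own.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = 1.0;
        b[i] = 0.0;
    }

    // Jacobi-style sweeps with the same static schedule keep accesses node-local.
    for (int iter = 0; iter < 10; ++iter) {
        #pragma omp parallel for schedule(static)
        for (std::size_t i = 1; i + 1 < n; ++i)
            b[i] = 0.5 * (a[i - 1] + a[i + 1]);
        std::swap(a, b);
    }

    std::printf("a[n/2] = %f\n", a[n / 2]);
    delete[] a;
    delete[] b;
    return 0;
}
```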

Download paper: PDF


Disclaimer

These whitepapers have been prepared by the PRACE Implementation Phase Projects and in accordance with the Consortium Agreements and Grant Agreements n° RI-261557, n° RI-283493, or n° RI-312763.

They solely reflect the opinion of the parties to such agreements on a collective basis in the context of the PRACE Implementation Phase Projects and to the extent foreseen in such agreements. Please note that even though all participants to the PRACE IP Projects are members of PRACE AISBL, these whitepapers have not been approved by the Council of PRACE AISBL and therefore do not emanate from it nor should be considered to reflect PRACE AISBL’s individual opinion.

Copyright notices

© 2014 PRACE Consortium Partners. All rights reserved. This document is a project document of a PRACE Implementation Phase project. All contents are reserved by default and may not be disclosed to third parties without the written consent of the PRACE partners, except as mandated by the European Commission contracts RI-261557, RI-283493, or RI-312763 for reviewing and dissemination purposes.

All trademarks and other rights on third party products mentioned in the document are acknowledged as own by the respective holders.