Application scalability


Quantum MD applications


Authors: Thomas Ponweisera,*, Malgorzata Wierzbowskab,†

a Research Institute for Symbolic Computation (RISC), Johannes Kepler University, Altenberger Straße 69, 4040 Linz, Austria

b Institute of Physics, Polish Academy of Sciences, Al. Lotnikow 32/46, 02-668 Warsaw, Poland

Abstract:
WANNIER90 is a quantum-mechanics code for the computation of maximally localized Wannier functions, ballistic transport,
thermoelectrics and Berry-phase derived properties – such as optical conductivity, orbital magnetization and anomalous Hall
conductivity. In this whitepaper, we report on optimizations for WANNIER90 carried out in the course of the PRACE
preparatory access project PA2231. Through performance tuning based on the integrated tool suite HPCToolkit and further
parallelisation of runtime-critical calculations, significant improvements in performance and scalability have been achieved.
Previously unfeasible computations with more than 64 atoms are now possible and the code exhibits almost perfect strong
scaling behaviour up to 2048 processes for sufficiently large problem settings.

Download PDF


Authors: J. Alberdi-Rodrigueza,b, A. Rubioa,c, M. Oliveirad, A. Charalampidoue,f, D. Foliase,f
aNano-Bio Spectroscopy Group and European Theoretical Spectroscopy Facility (ETSF)
University of the Basque Country UPV/EHU, Donostia, Spain

b Department of Computer Architecture and Technology University of the Basque Country UPV/EHU, Donostia, Spain
c Max Planck Institute for the Structure and Dynamics of Matter, Hamburg, Germany; Departamento de Fisica de Materiales, Centro de Fisica de Materiales CSIC-UPV/EHU-MPC and DIPC, University of the Basque Country UPV/EHU, Donostia, Spain; Fritz-Haber-Institut der Max-Planck-Gesellschaft, Berlin, Germany
d Departement de Physique, Universite de Liege
e Greek Research and Technology Network, Athens, Greece
f Scientific Computing Center, Aristotle University of Thessaloniki, Greece

Abstract:
Octopus is a software package for density-functional theory (DFT) and its time-dependent (TDDFT) variant. A Linear Combination of Atomic Orbitals (LCAO) calculation is performed prior to the actual DFT run. LCAO is used to obtain an initial guess of the densities and, therefore, to start the Self-Consistent Field (SCF) cycle of the Ground State (GS). The system initialization and LCAO steps consume a large amount of memory and do not show good performance. In this study, extensive profiling has been performed in order to identify large matrices and the scaling behaviour of the initialization and LCAO steps. Alternative implementations of LCAO in Octopus have been investigated in order to optimize the memory usage and performance of the LCAO approach. Use of the ScaLAPACK library led to significant improvements in memory allocation and performance. Benchmark tests have been performed on the MareNostrum III HPC system using various combinations of atomic system sizes and numbers of CPU cores.

Download PDF


Authors:
Soon-Heum Koa, Simen Reine, Thomas Kjærgaard

National Supercomputing Centre, Linkoping University, 581 83 Linkoping, Sweden

Centre for Theoretical and Computational Chemistry, Department of Chemistry, Oslo University, Postbox 1033, Blindern, 0315, Oslo, Norway

qLEAP – Center for Theoretical Chemistry, Department of Chemistry, Aarhus University, Langelandsgade 140, Aarhus C, 8000, Denmark

Abstract:
In this paper, we present the performance of LSDALTON’s DFT method in large molecular simulations of biological interest. We primarily focus on evaluating the performance gain obtained by applying the density fitting (DF) scheme and the auxiliary density matrix method (ADMM). The enabling effort was put towards finding the right build environment (the combination of compiler, MPI and extra libraries) that generates a fully 64-bit integer-based binary. Using three biological molecules varying in size, we verify that the DF and ADMM schemes provide a substantial gain in the performance of the DFT code, at the cost of large memory consumption for storing extra matrices and a small change in scalability characteristics with the ADMM calculation. In the insulin simulation, the parallel region of the code accelerates by 30 percent with the DF calculation and by 56 percent in the case of the DF-ADMM calculations.

Download PDF


Authors:
Mariusz Uchronski, Agnieszka Kwiecien, Marcin Gebarowski

WCSS, Wroclaw University of Technology, Wyb. Wyspianskiego 27, 50-370 Wroclaw, Poland

Abstract:
CP2K is an application for atomistic and molecular simulation and, with its excellent scalability, is particularly important with regard to use on future exascale systems. The code is well parallelized using MPI and hybrid MPI/OpenMP, typically scaling well up to 1 core per atom in the system. Research on CP2K done within PRACE-1IP showed that, due to the heavy usage of sparse matrix multiplication for large systems, there is room for performance improvement. The main goal of this work, undertaken within PRACE-3IP, was to investigate the most time-consuming routines and port them to accelerators, particularly GPGPUs. The relevant areas of the code that can be effectively accelerated are the matrix multiplications (the DBCSR library). A significant amount of work has already been done on the DBCSR library using CUDA. We focused on enabling the library on a potentially wider range of computing resources using the OpenCL and OpenACC technologies, to bring the overall application closer to exascale. We present the ports and promising performance results. The work done has led to the identification of a number of issues with using OpenACC in CP2K, which need to be further investigated and resolved to make the application and technology work better together.

Download PDF


Authors: J. A. Åström

CSC – IT Center for Science, Espoo, Finland

Abstract: NUMFRAC is a generic particle-based code for the simulation of non-linear mechanics in disordered solids. The generic theory of the code is outlined and examples are given for glacier calving and fretting. This text is to a large degree part of the publication: J. A. Åström, T. I. Riikilä, T. Tallinen, T. Zwinger, D. Benn, J. C. Moore, and J. Timonen, A particle based simulation model for glacier dynamics, The Cryosphere Discuss., 7, 921-941, 2013.

Download PDF


Authors: A. Calzolaria, C. Cavazzonib

a Istituto Nanoscienze CNR-NANO-S3, I-41125 Modena Italy

b CINECA – Via Magnanelli 6/3, 40033 Casalecchio di Reno (Bologna)

Abstract:
This work concerns the enabling of the Time-Dependent Density Functional Theory kernel (TurboTDDFT) of the Quantum ESPRESSO package on petascale systems. TurboTDDFT is a fundamental tool to investigate nanostructured materials and nanoclusters, whose optical properties are determined by their electronic excited states. Enabling TurboTDDFT on petascale systems opens up the possibility to compute optical properties for large systems relevant for technological applications. Plasmonic excitations in particular are important for a large range of applications, from biological sensing and energy conversion to subwavelength waveguides. The goal of the present project was the implementation of novel strategies for reducing the memory requirements and improving the weak scalability of the TurboTDDFT code, aiming at an important improvement of the code capabilities and at the ability to study the plasmonic properties of metal nanoparticles (Ag, Au) and their dependence on the size of the system under test.

Download PDF


Authors: Massimiliano Guarrasia, Sandro Frigiob, Andrew Emersona and Giovanni Erbaccia

a CINECA, Italy

b University of Camerino, Italy

Abstract:
In this paper we present part of the work carried out by CINECA in the framework of the PRACE-2IP project, aimed at studying the effect on performance of the implementation of a 2D Domain Decomposition algorithm in DFT codes that use a standard 1D (or slab) parallel domain decomposition. The performance of this new algorithm is tested on two example applications: Quantum ESPRESSO, a popular code used in materials science, and the CFD code BlowupNS.

In the first part of this paper we present the codes that we use. In the last part we show the increase in performance obtained using this new algorithm.

Download PDF



Authors: Al. Charalampidoua,b, P. Korosogloua,b, F. Ortmannc, S. Rochec
a Greek Research and Technology Network, Athens, Greece
b Scientific Computing Center, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
c Catalan Institute of Nanotechnology, Spain

Abstract:
This study has focused on an application for Quantum Hall Transport simulations and, more specifically, on how to overcome an initially identified potential performance bottleneck related to the I/O of wave functions. These operations are required in order to enable and facilitate continuation runs of the code. After implementing several variants of these I/O operations in parallel (using the MPI I/O library), we show that a performance gain in the range of 1.5–2 can be achieved when switching from the initial POSIX-only approach to the parallel MPI I/O approach on both the CURIE and HERMIT PRACE Tier-0 systems. Moreover, we show that, because I/O throughput scales with an increasing number of cores, the overall performance of the code remains efficient up to at least 8192 processes.
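As a hedged illustration of the approach described (not the project's actual routines), the sketch below shows a collective MPI I/O write of a distributed wave-function array, which is the kind of operation that replaces per-rank POSIX output; the file name, data layout and slice size are assumptions.

```c
/* Minimal sketch (assumed data layout): each rank owns `nloc` wave-function
 * coefficients and writes them collectively to a single checkpoint file
 * at its global offset. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nloc = 1 << 20;                    /* local slice (assumption) */
    double *psi = malloc(nloc * sizeof(double)); /* local wave-function data */
    for (int i = 0; i < nloc; ++i) psi[i] = (double)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "wavefunction.chk",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at its own offset; the collective call lets the
     * MPI library aggregate requests into large, contiguous file accesses. */
    MPI_Offset offset = (MPI_Offset)rank * nloc * sizeof(double);
    MPI_File_write_at_all(fh, offset, psi, nloc, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(psi);
    MPI_Finalize();
    return 0;
}
```

With a uniform slice per rank, a single shared file is produced and the MPI library is free to merge the per-rank requests into large contiguous accesses, which is the usual source of gains over POSIX-only output.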

Download PDF


Authors:
Martti Louhivuoria, Jussi Enkovaaraa,b
a CSC – IT Center for Science Ltd., PO Box 405, 02101 Espoo, Finland
b Aalto University, Department of Applied Physics, PO Box 11100, 00076 Aalto, Finland

Abstract:
In recent years, graphical processing units (GPUs) have generated a lot of excitement in the computational sciences by promising a significant increase in computational power compared to conventional processors. While this is true in many cases for small-scale computational problems that can be solved using the processing power of a single computing unit, the efficient usage of multiple GPUs in parallel over multiple inter-connected computing units has been problematic. Increasingly, the real-life problems tackled by computational scientists require large-scale parallel computing, and thus it is crucial that GPU-enabled software reaches good parallel scalability to reap the benefits of GPU acceleration. This is exactly what has been achieved for GPAW, a popular quantum chemistry program, by Hakala et al. in their recent work [1].

Download PDF


Authors: Simen Reinea, Thomas Kjærgaarda, Trygve Helgakera, Ole Widar Saastadb, Andrew Sunderlandc
a Centre for Theoretical and Computational Chemistry (CTCC), Department of Chemistry, University of Oslo, Oslo, Norway
b University Center for Information Technology, University of Oslo, Oslo, Norway
c STFC Daresbury Laboratory, Warrington, United Kingdom

Abstract:
Linear Scaling DALTON (LSDALTON) is a powerful molecular electronic structure program that is the focus of software optimization
projects in PRACE 1IP-WP7.2 and PRACE 1IP-WP7.5. This part of the project focuses on the introduction of parallel diagonalization routines
from the ScaLAPACK library into the latest MPI version of LSDALTON. The parallelization work has involved three main tasks: i)
Redistribution of the matrices assembled for the SCF cycle from a serial / distributed state to the two dimensional block-cyclic data distribution
used for PBLAS and ScaLAPACK; ii) Interfacing of LSDALTON data structures to parallel diagonalization routines in ScaLAPACK; iii)
Performance testing to determine the favoured ScaLAPACK eigensolver methodology.
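For orientation, the sketch below illustrates the 2D block-cyclic mapping that such a redistribution must produce: for a global matrix index it computes the owning process coordinate and the local index. The block size and process grid shape are arbitrary assumptions, and this is not LSDALTON's redistribution code.

```c
/* Sketch of the 2D block-cyclic mapping assumed by PBLAS/ScaLAPACK.
 * For a 0-based global index g, block size nb and p processes along one
 * grid dimension, compute the owning process and the local index. */
#include <stdio.h>

static void map_index(long g, long nb, long p, long *proc, long *loc)
{
    long block = g / nb;               /* which block the index falls in   */
    *proc = block % p;                 /* blocks are dealt out cyclically  */
    *loc  = (block / p) * nb + g % nb; /* position inside the local array  */
}

int main(void)
{
    const long nb = 64, prow = 4, pcol = 4;   /* assumptions */
    long g_i = 1000, g_j = 2500;              /* some global element (i,j)  */
    long pr, pc, li, lj;
    map_index(g_i, nb, prow, &pr, &li);
    map_index(g_j, nb, pcol, &pc, &lj);
    printf("global (%ld,%ld) -> process (%ld,%ld), local (%ld,%ld)\n",
           g_i, g_j, pr, pc, li, lj);
    return 0;
}
```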

Download PDF


Authors: Fabio Affinitoa, Emanuele Cocciab, Sandro Sorellac, Leonardo Guidonib
a CINECA, Casalecchio di Reno, Italy
b Universita dell’Aquila, L’Aquila, Italy
c SISSA, Trieste, Italy

Abstract:
Quantum Monte Carlo (QMC) methods are a promising technique for the study of the electronic structure of correlated molecular systems. The technical goal of the present project is to demonstrate the scalability of the TurboRVB code for a series of systems having different properties in terms of the number of electrons, the number of variational parameters and the size of the basis set.

Download PDF


Authors: Iain Bethunea, Adam Cartera, Xu Guoa, Paschalis Korosogloub,c
a EPCC, The University of Edinburgh, James Clerk Maxwell Building, The King’s Buildings, Edinburgh EH9 3JZ, United Kingdom
b AUTH, Aristotle University of Thessaloniki, Thessaloniki 52124, Greece
c GRNET, Greek Research & Technology Network, L. Mesogeion 56, Athens 11527, Greece

Abstract:
CP2K is a powerful materials science and computational chemistry code and is widely used by research groups across Europe and beyond. The recent addition of a linear scaling KS-DFT method within the code has made it possible to simulate systems of an unprecedented size – 1,000,000 atoms or more – making full use of petascale computing resources. Here we report on work undertaken within PRACE-1IP WP 7.1 to port and test CP2K on Jugene, the PRACE Tier-0 BlueGene/P system. In addition, development work was performed to reduce the memory usage of a key data structure within the code, to make it more suitable for the limited memory environment of the BlueGene/P. Finally we present a set of benchmark results and an analysis of a large test system.

Download PDF


Authors: Luigi Genovesea,b, Brice Videaua, Thierry Deutscha, Huan Tranc, Stefan Goedeckerc
a Laboratoire de Simulation Atomistique, SP2M/INAC/CEA, 17 Av. des Martyrs, 38054 Grenoble, France
b European Synchrotron Radiation Facility, 6 rue Horowitz, BP 220, 38043 Grenoble, France
c Institut für Physik, Universität Basel, Klingelbergstr. 82, 4056 Basel, Switzerland

Abstract:
Electronic structure calculations (DFT codes) are certainly among the disciplines for which an increase in computational power corresponds to an advancement in scientific results. In this report, we present the ongoing advancements of a DFT code that can run on massively parallel, hybrid and heterogeneous CPU-GPU clusters. This DFT code, named BigDFT, is delivered under the GNU-GPL license either in a stand-alone version or integrated in the ABINIT software package. Hybrid BigDFT routines were initially ported with NVidia’s CUDA language, and recently more functionalities have been added with new routines written within Khronos’ OpenCL standard. The formalism of this code is based on Daubechies wavelets, which form a systematic real-space basis set. The properties of this basis set are well suited for an extension to a GPU-accelerated environment. In addition to focusing on the performance of the MPI and OpenMP parallelisation of the BigDFT code, this presentation also addresses the usage of GPU resources in a complex code with different kinds of operations. A discussion of the present and expected performance of hybrid-architecture computation in the framework of electronic structure calculations is also included.

Download PDF


Authors: Ruyman Reyesa, Iain Bethunea
a EPCC, The University of Edinburgh, James Clerk Maxwell Building, Mayfield Road, Edinburgh, EH9 3JZ, UK

Abstract:
This report describes the results of a PRACE Preparatory Access Type C project to optimise the implementation of Møller-Plesset second-order perturbation theory (MP2) in CP2K, to allow it to be used efficiently on the PRACE Research Infrastructure. The work consisted of three stages: firstly, serial optimisation of several key computational kernels; secondly, an OpenMP implementation of the parallel 3D Fourier Transform to support mixed-mode MPI/OpenMP use of CP2K; and thirdly, benchmarking of the performance gains achieved by the new code on HERMIT for a test case representative of proposed production simulations. Consistent speedups of 8% were achieved in the integration kernel routines as a result of the serial optimisation. When using 8 OpenMP threads per MPI process, speedups of up to 10x for the 3D FFT were achieved, and for some combinations of MPI processes and OpenMP threads, overall speedups of 66% for the whole code were measured. As a result of this work, a proposal for full PRACE Project Access has been submitted.

Download PDF


Authors: Peicho Petkova, Petko Petkovb,*, Georgi Vayssilovb, Stoyan Markovc

a Faculty of Physics, University of Sofia, 1164 Sofia, Bulgaria

b Faculty of Chemistry, University of Sofia, 1164 Sofia, Bulgaria

c National Centre for Supercomputing Applications, Sofia, Bulgaria

Abstract:
The reported work aims at the implementation of a method allowing realistic simulation of large or extra-large biochemical systems (of 10^6 to 10^7 atoms) with first-principle quantum chemical methods. The current methods treat the whole system simultaneously. In this way the compute time increases rapidly with the size of the system and does not allow efficient parallelization of the calculations due to the mutual interactions between the electron density in all parts of the system. In order to avoid these problems we implemented a version of the Fragment Molecular Orbital method (FMO), in which the whole system is divided into fragments calculated separately. This approach assures nearly linear scaling of the compute time with the size of the system and provides efficient parallelization of the job. The work includes the development of pre- and post-processing components for automatic division of the system into monomers and reconstruction of the total energy and electron density of the whole system.

Download PDF


Authors: Iain Bethunea, Adam Cartera, Kevin Stratforda, Paschalis Korosogloub,c
a EPCC, The University of Edinburgh, James Clerk Maxwell Building, The King’s Buildings, Edinburgh, EH9 3JZ, United Kingdom

Abstract:
This report describes the work undertaken under PRACE-1IP to support the European scientific communities who make use
of CP2K in their research. This was done in two ways – firstly, by improving the performance of the code for a wide range of
usage scenarios. The updated code was then tested and installed on the PRACE CURIE supercomputer. We believe this
approach both supports existing user communities by delivering better application performance, and demonstrates to
potential users the benefits of using optimized and scalable software like CP2K on the PRACE infrastructure.

Download PDF


Authors: Jussi Enkovaaraa,*, Martti Louhivuoria, Petar Jovanovicb, Vladimir Slavnicb, Mikael Rännarc
a CSC – IT Center for Science, P.O. Box 405, FI-02101 Espoo, Finland
b Scientific Computing Laboratory, Institute of Physics Belgrade, Pregrevica 118, 11080 Belgrade, Serbia
c Department of Computing Science, Umeå University, SE-901 87 Umeå, Sweden

Abstract:
GPAW is a versatile software package for first-principles simulations of nanostructures utilizing density-functional theory and time-dependent density-functional theory. Even though GPAW is already used for massively parallel calculations on several supercomputer systems, some performance bottlenecks still exist. First, the implementation based on the Python programming language introduces an I/O bottleneck during initialization which becomes serious when using thousands of CPU cores. Second, the current linear response time-dependent density-functional theory implementation contains a large matrix which is replicated on all CPUs. When reaching for larger and larger systems, memory runs out due to the replication. In this report, we discuss the work done on resolving these bottlenecks. In addition, we have also worked on optimization aspects that are directed more towards future usage. As the number of cores in multicore CPUs is still increasing, a hybrid parallelization combining shared-memory and distributed-memory parallelization is becoming appealing. We have experimented with hybrid OpenMP/MPI and report here the initial results. GPAW also performs large dense matrix diagonalizations with the ScaLAPACK library. Due to limitations in ScaLAPACK these diagonalizations are expected to become a bottleneck in the future, which has led us to investigate alternatives to ScaLAPACK.

Download PDF


Abstract:
The work aims at evaluating the performance of DALTON on different platforms and implementing new strategies to enable the code for petascaling. The activities have been organized into four tasks within the PRACE project: (i) analysis of the current status of the DALTON quantum mechanics (QM) code and identification of bottlenecks, implementation of several performance improvements of DALTON QM and a first attempt at hybrid parallelization; (ii) implementation of MPI integral components into LSDALTON, improvements of optimization and scalability, interfacing of matrix operations to PBLAS and ScaLAPACK numerical library routines; (iii) interfacing the DALTON and LSDALTON QM codes to the ChemShell quantum mechanics/molecular mechanics (QM/MM) package and benchmarking of QM/MM calculations using this approach; (iv) analysis of the impact of DALTON QM system components with Dimemas. Part of the results reported here has been achieved through collaboration with the ScalaLife project.

Download PDF


Authors: Simen Reine(a), Thomas Kjærgaard(a), Trygve Helgaker(a), Olav Vahtras(b,d), Zilvinas Rinkevicius(b,g), Bogdan Frecus(b), Thomas W. Keal(c), Andrew Sunderland(c), Paul Sherwood(c), Michael Schliephake(d), Xavier Aguilar(d), Lilit Axner(d), Maria Francesca Iozzi(e), Ole Widar Saastad(e), Judit Gimenez(f)

a Centre for Theoretical and Computational Chemistry (CTCC), Department of Chemistry, University of Oslo, P.O. Box 1033 Blindern, N-0315 Oslo, Norway

b KTH Royal Institute of Technology, School of Biotechnology, Division of Theoretical Chemistry & Biology, S-106 91 Stockholm, Sweden

c Computational Science & Engineering Department, STFC Daresbury Laboratory, Daresbury Science and Innovation Campus, Warrington, Cheshire, WA4 4AD, UK

d PDC Center for High Performance Computing at Royal Institute of Technology (KTH), Teknikringen 14, 100 44 Stockholm, Sweden

e University Center for Information Technology, University of Oslo, P.O. Box 1059 Blindern, N-0316 Oslo, Norway

f Computer Sciences – Performance Tools, Barcelona Supercomputing Center, Campus Nord UPC C6, C/ Jordi Girona 1-3, Barcelona, 08034

g KTH Royal Institute of Technology, Swedish e-Science Center (SeRC), S-100 44 Stockholm, Sweden

Abstract:
In this paper we present development work carried out on the Quantum ESPRESSO [1] software package within PRACE-1IP. We describe the different activities performed to enable the Quantum ESPRESSO user community to challenge the frontiers of science by running extreme computing simulations on European Tier-0 systems of the current and next generation. Three main sections are described: 1) the improvement of parallelization efficiency of two DFT-based applications: Nuclear Magnetic Resonance (NMR) and EXact-eXchange (EXX) calculations; 2) the introduction of innovative van der Waals interactions at the ab-initio level; 3) the porting of the PWscf code to hybrid systems equipped with NVIDIA GPU technology.

Download PDF


Authors: Dusan Stankovic*a, Aleksandar Jovica, Petar Jovanovica, Dusan Vudragovica, Vladimir Slavnica
a Institute of Physics Belgrade, Serbia

Abstract:
In this whitepaper we report on work done to enable support for the FFTE Fast Fourier Transform library in Quantum ESPRESSO, to enable threading for the FFTW3 library already supported in Quantum ESPRESSO (previously only in a serial version), and to benchmark and compare their performance against the existing FFT implementations in Quantum ESPRESSO.
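As a generic illustration of the threading interface involved (not the Quantum ESPRESSO FFT driver itself), the sketch below creates and executes a threaded FFTW3 plan; the grid size and thread count are assumptions.

```c
/* Minimal sketch of threaded FFTW3 usage (grid size and thread count are
 * illustrative assumptions, not Quantum ESPRESSO settings).
 * Compile with e.g.: cc fft.c -lfftw3_threads -lfftw3 -lm -fopenmp */
#include <fftw3.h>
#include <omp.h>

int main(void)
{
    const int n = 128;                       /* 128^3 complex grid */

    fftw_init_threads();                     /* set up the threaded backend */
    fftw_plan_with_nthreads(omp_get_max_threads());

    fftw_complex *data = fftw_alloc_complex((size_t)n * n * n);
    fftw_plan plan = fftw_plan_dft_3d(n, n, n, data, data,
                                      FFTW_FORWARD, FFTW_MEASURE);

    /* fill the grid after planning (FFTW_MEASURE overwrites the buffer) */
    for (size_t i = 0; i < (size_t)n * n * n; ++i) {
        data[i][0] = 1.0;
        data[i][1] = 0.0;
    }

    fftw_execute(plan);                      /* transform runs on all threads */

    fftw_destroy_plan(plan);
    fftw_free(data);
    fftw_cleanup_threads();
    return 0;
}
```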

Download PDF


Classical MD applications


Authors: Mariusz Uchronskia, Agnieszka Kwieciena,*, Marcin Gebarowskia, Justyna Kozlowskaa
a WCSS, Wroclaw University of Technology, Wyb. Wyspianskiego 27, 50-370 Wroclaw, Poland

Abstract: The prototypes evaluated within the PRACE-2IP project provide a number of different computing hardware options, including general-purpose Graphics Processing Units (GPUs) and accelerators like the Intel Xeon Phi. In this work we evaluated the performance and energy consumption of two prototypes when used for a real-case simulation. Due to the heterogeneity of the prototypes we decided to use the DL_POLY molecular simulation package and its OpenCL port for the tests. The DL_POLY OpenCL port implements one of the methods – the Constraints Shake (CS) component. SHAKE is a two-stage algorithm based on the leapfrog Verlet integration scheme. We used four test cases for the evaluation: one from the DL_POLY application test-suite – H2O – and three real cases provided by a user. We show the performance results and discuss our experience with the prototypes in the context of ease of use, porting effort required, and energy consumption.
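For readers unfamiliar with the algorithm, the fragment below sketches the core of a single SHAKE iteration over distance constraints in a textbook C form; it is not the DL_POLY OpenCL kernel, and the data layout, helper type and function name are assumptions.

```c
/* Schematic single SHAKE iteration over bond constraints (textbook form,
 * not DL_POLY's implementation). r holds the positions being constrained,
 * r0 the positions before the unconstrained move, d2 the squared bond
 * lengths, and invm the inverse masses. */
typedef struct { double x, y, z; } vec3;

void shake_iteration(int nbonds, const int (*pair)[2],
                     const double *d2, const double *invm,
                     const vec3 *r0, vec3 *r)
{
    for (int k = 0; k < nbonds; ++k) {
        int i = pair[k][0], j = pair[k][1];
        vec3 rij  = { r[i].x - r[j].x,   r[i].y - r[j].y,   r[i].z - r[j].z };
        vec3 rij0 = { r0[i].x - r0[j].x, r0[i].y - r0[j].y, r0[i].z - r0[j].z };

        double diff = d2[k] - (rij.x*rij.x + rij.y*rij.y + rij.z*rij.z);
        /* Lagrange-multiplier style correction along the old bond vector */
        double g = diff / (2.0 * (invm[i] + invm[j]) *
                           (rij.x*rij0.x + rij.y*rij0.y + rij.z*rij0.z));

        r[i].x += g * invm[i] * rij0.x;
        r[i].y += g * invm[i] * rij0.y;
        r[i].z += g * invm[i] * rij0.z;
        r[j].x -= g * invm[j] * rij0.x;
        r[j].y -= g * invm[j] * rij0.y;
        r[j].z -= g * invm[j] * rij0.z;
    }
}
```

In the two-stage scheme mentioned above, an iteration of this kind is repeated after the unconstrained leapfrog move until all bond lengths are satisfied within a tolerance.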

Download paper: PDF


Authors:
D. Grancharov, N. Ilieva, E. Lilkova, L. Litov, S. Markov, P. Petkov, I. Todorov
NCSA, Akad. G. Bonchev 25A, Sofia 1311, Bulgaria
STFC Daresbury Laboratory, Daresbury, Warrington WA4 4AD, UK

Abstract:
A library, implementing the AGBNP2 [1, 2] implicit solvent model that was developed within PRACE-2IP [3] is integrated into the DL_POLY_4 [4] molecular dynamics package in order to speed up the time to solution for protein solvation processes. Generally, implicit solvent models lighten the computational loads by reducing the degrees of freedom of the model, removing those of the solvent and thus only concentrating on the protein dynamics that is facilitated by the absence of friction with solvent molecules. Furthermore, periodic boundary conditions are no longer formally required, since long-range electrostatic calculations cannot be applied to systems with variable dielectric permittivity. The AGBNP2 implicit solvation model improves the conformational sampling of the protein dynamics by including the influence of solvent accessible surface and water-protein hydrogen bonding effects as interactive force corrections to the atoms of protein surface. This requires the development of suitable bookkeeping data structures, in accordance with the domain decomposition framework of DL_POLY, with dynamically adjustable inter-connectivity to describe the protein surface. The work also requires the use of advanced b-tree search libraries as part of the AGBNP library, in order to reduce the memory and compute requirements, and the automatic derivation of the van der Waals radii of atoms from the self-interaction potentials.

Download: PDF


Authors:
P. Petkov, I. Todorov, D. Grancharov, N. Ilieva, E. Lilkova, L. Litov, S. Markov

NCSA, Akad. G. Bonchev 25A, Sofia 1311, Bulgaria

STFC Daresbury Laboratory, Daresbury, Warrington WA4 4AD, UK

Abstract:
Electrostatic interactions in molecular simulations are usually evaluated by employing the Ewald summation method, which splits the summation into a short-range part, treated in real space, and a long-range part, treated in reciprocal space. For performance purposes, in molecular dynamics software the latter is usually handled by the SPME or P3M grid-based methods, both relying on the 3D fast Fourier transform (FFT) as their central operation. However, the Ewald summation method is derived for model systems that are subject to 3D periodic boundary conditions (PBC), while there are many models of scientific as well as commercial interest where the geometry implies a 1D or 2D structure. Thus, for systems such as membranes, interfaces, linear protein complexes, thin layers, nanotubes, etc., employing Ewald-summation-based techniques is either computationally very disadvantageous or impossible altogether. Another approach to evaluating the electrostatic interactions is to solve the Poisson equation of the model-system charge distribution on a 3D spatial grid. The formulation of the method allows an elegant way to switch the dependence on periodic boundary conditions on and off in a simple manner. Furthermore, 3D FFT kernels are known to scale poorly at large scale due to excessive memory and communication overheads, which makes Poisson solvers a viable alternative for DL_POLY on the road to exascale. This paper describes the work undertaken to integrate a Poisson solver library, developed in PRACE-2IP [1], within the DL_POLY_4 domain decomposition framework. The library relies on a unique combination of bi-conjugate gradient (BiCG) and conjugate gradient (CG) methods to warrant both independence of the initial conditions, with rapid convergence of the solution, on the one hand, and stabilization of possible fluctuations of the iterative solution on the other. The implementation involves the development of procedures for generating charge density and electrostatic potential grids in real space over all domains in a distributed manner, as well as halo exchange routines and functions to calculate the gradient of the potential in order to recover the electrostatic forces on point charges.

Download PDF


Authors:
Buket Benek Gursoy, Henrik Nagel

Irish Centre for High-End Computing (ICHEC), Ireland; Norwegian University of Science and Technology, Norway

Abstract:
This whitepaper investigates the potential benefit of using the OpenACC directive-based programming tool for
enabling DL_POLY_4 on GPUs. DL_POLY is a well-known general-purpose molecular dynamics simulation
package, which has already been parallelised using MPI-2. DL_POLY_3 was accelerated using the CUDA
framework by the Irish Centre for High-End Computing (ICHEC) in collaboration with Daresbury Laboratory.
In this work, we have been inspired by the existing CUDA port to evaluate the effectiveness of OpenACC in
further enabling DL_POLY_4 on the road to Exascale. We have been particularly concerned with investigating
the benefits of OpenACC in terms of maintainability, programmability and portability issues that are becoming
increasingly challenging as we advance to the Exascale era. The impact of the OpenACC port has been assessed
in the context of a change in the reciprocal vector dimension for the calculation of SPME forces. Moreover, the
interoperability of OpenACC with the existing CUDA port has been analysed.
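As a generic illustration of the directive-based approach evaluated here (not the DL_POLY_4 SPME kernel), an OpenACC-annotated loop might look like the following sketch; the array names and the operation are assumptions.

```c
/* Minimal OpenACC sketch: offload a simple per-particle force update.
 * The arrays and the force expression are illustrative assumptions,
 * not DL_POLY_4's SPME kernel. */
void scale_forces(int n, double *fx, double *fy, double *fz, double s)
{
    /* copy the arrays to the device, run the loop there in parallel,
       and copy the results back */
    #pragma acc parallel loop copy(fx[0:n], fy[0:n], fz[0:n])
    for (int i = 0; i < n; ++i) {
        fx[i] *= s;
        fy[i] *= s;
        fz[i] *= s;
    }
}
```

The maintainability argument assessed in the whitepaper is visible even in this toy case: the loop body stays ordinary host code, and only the directive carries the offload information.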

Download PDF


Authors: Mariusz Uchronskia, Marcin Gebarowskia, Agnieszka Kwieciena,*
a Wroclaw Centre for Networking and Supercomputing (WCSS), Wyb. Wyspianskiego 27, 50-370 Wroclaw, Poland

Abstract:
SHAKE and RATTLE algorithms are widely used in molecular dynamics simulations and for this reason are relevant for a broad range of scientific applications. In this work, existing CPU+GPU implementations of the SHAKE and RATTLE algorithms from the DL_POLY application are investigated. DL_POLY is a general-purpose parallel molecular dynamics simulation package developed at Daresbury Laboratory by W. Smith and I.T. Todorov. The OpenCL code of the SHAKE algorithm for the DL_POLY application is analyzed for further optimization possibilities. Our work on the RATTLE algorithm is focused on porting the algorithm from Fortran to OpenCL and adjusting it to the GPGPU architecture.

Download PDF


Authors: Michael Lysaghta, Mariusz Uchronskib, Agnieszka Kwiecienb, Marcin Gebarowskib, Peter Nasha, Ivan Girottoa and Ilian T. Todorovc
a Irish Centre for High End Computing, Tower Building, Trinity Technology and Enterprise Campus, Grand Canal Quay, Dublin 2, Ireland
b Wroclaw Centre for Network and Supercomputing, Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland
c STFC Daresbury Laboratory, Daresbury, Warrington WA4 4AD, United Kingdom

Abstract:
We describe recent development work carried out on the GPU-enabled classical molecular dynamics software package DL_POLY. We describe how we have updated the original GPU port of DL_POLY 3 in order to align the ‘CUDA+OpenMP’-based code with the recently released MPI-based DL_POLY 4 package. In the process of updating the code we have also fixed several bugs, which allows us to benchmark the GPU-enabled code on many more GPU nodes than was previously possible. We also describe how we have recently initiated the development of an OpenCL-based implementation of DL_POLY and present a performance analysis of the set of DL_POLY modules that have so far been ported to GPUs using the OpenCL framework.

Download PDF


Authors: Sadaf Alam, Ugo Varetto

Swiss National Supercomputing Centre, Lugano, Switzerland

Abstract: This report introduces the hybrid implementation of the GROMACS application and provides instructions on building and executing it on PRACE prototype platforms with Graphical Processing Unit (GPU) and Many Integrated Core (MIC) accelerator technologies. GROMACS currently employs message-passing MPI parallelism and multi-threading using OpenMP, and contains kernels for non-bonded interactions that are accelerated using the CUDA programming language. As a result, the execution model is multi-faceted and end users can tune the application execution according to the underlying platforms. We present results that have been collected on the PRACE prototype systems as well as on other GPU- and MIC-accelerated platforms with similar configurations. We also report on the preliminary porting effort that involves a fully portable implementation of GROMACS using the OpenCL programming language instead of CUDA, which is only available on NVIDIA GPU devices.

Download PDF


Authors: Fabio Affinitoa, Andrew Emersona, Leandar Litovb, Peicho Petkovb, Rossen Apostolovc,d, Lilit Axnerc, Berk Hessd and Erik Lindahld, Maria Francesca Iozzie

a CINECA Supercomputing, Applications and Innovation Department, via Magnanelli 6/3, Casalecchio di Reno (BO), Italy

b National Center for Supercomputing Applications, Sofia, Bulgaria

c PDC Center for High Performance Computing at Royal Institute of Technology (KTH), Teknikringen 14, 100 44 Stockholm, Sweden

d Department of Theoretical Physics, Royal Institute of Technology (KTH), Stockholm, Sweden

e Research Computing Services Group, University of Oslo, Postboks 1059 Blindern, 0316 Oslo, Norway

Abstract:
The work aims at evaluating the performance of GROMACS on different platforms and determining the optimal set of conditions on given architectures for petascale molecular dynamics simulations. The activities have been organized into three tasks within the PRACE project: (i) optimization of GROMACS performance on Blue Gene systems; (ii) parallel scaling of the OpenMP implementation; (iii) development of a multiple step-size symplectic integrator adapted to large biomolecular systems. Part of the results reported here has been achieved through collaboration with the ScalaLife project.

Download PDF


Computational Fluid Dynamics (CFD) applications


Authors: A. Cassagnea,*, J-F. Boussugeb, G. Puigtb
a Centre Informatique National de l’Enseignement Superieur, Montpellier, France
b Centre Europeen de Recherche et de Formation Avancee en Calcul Scientifique, Toulouse, France

Abstract: We enabled hybrid OpenMP/MPI computations for a new-generation CFD code based on a new high-order method (the Spectral Difference method) dedicated to Large Eddy Simulation (LES). The code is written in Fortran 90 and parallelised with the MPI library and OpenMP directives. This white paper focuses on achieving good performance with the OpenMP shared-memory model in a standard environment (bi-socket nodes with multi-core x86 processors). The goal was to reduce the number of MPI communications by using MPI communication between nodes and the OpenMP approach for all cores within a node. Three different approaches are compared: full MPI, full OpenMP and hybrid OpenMP/MPI. We observed that hybrid and full MPI computations took nearly the same time for a small number of cores.
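To make the hybrid pattern concrete, the sketch below shows a minimal MPI+OpenMP setup of the kind described, with MPI ranks communicating between nodes and OpenMP threads covering the cores of each node; it is a generic illustration, not the solver's code, and the work loop is a placeholder.

```c
/* Minimal hybrid MPI/OpenMP sketch: MPI between nodes, OpenMP threads
 * inside each node. The work partitioning is illustrative only. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    /* FUNNELED: only the master thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int ncells = 1000000;           /* cells handled by this rank */
    double local_sum = 0.0, global_sum;

    /* All cores of the node work on this rank's share of the domain */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < ncells; ++i)
        local_sum += 1.0 / (1.0 + i);     /* stand-in for a cell update */

    /* Only inter-node communication goes through MPI */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("ranks=%d threads/rank=%d sum=%f\n",
               nranks, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}
```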

Download paper: PDF


Authors: Ahmet Durana,b,*, M. Serdar Celebia,c, Senol Piskina,c and Mehmet Tuncela,c
a Istanbul Technical University, National Center for High Performance Computing of Turkey (UHeM), Istanbul 34469, Turkey
b Istanbul Technical University,Department of Mathematics, Istanbul 34469, Turkey
c Istanbul Technical University, Informatics Institute, Istanbul 34469, Turkey

Abstract: We study a bio-medical fluid flow simulation using the incompressible, laminar OpenFOAM solver icoFoam and other direct solvers (kernel class), such as SuperLU_DIST 3.3 and SuperLU_MCDT (Many-Core Distributed), for the large penta-diagonal and hepta-diagonal matrices coming from the simulation of blood flow in arteries with a structured mesh domain. A realistic simulation of the sloshing of blood in the heart or vessels of the whole body is a complex problem and may take a very long time, thousands of hours, for the main tasks such as pre-processing (meshing), decomposition and solving the large linear systems. We generated the structured mesh by using blockMesh as a mesh generator tool. To decompose the generated mesh, we used the decomposePar tool. After the decomposition, we used icoFoam as a flow simulator/solver. For example, the total run time of a simple case is about 1500 hours without preconditioning on one core for one period of the cardiac cycle, measured on the Linux Nehalem Cluster (see [28]) available at the National Center for High Performance Computing (UHeM) (see [5]). Therefore, this important problem deserves careful consideration for usage on multi-petascale or exascale systems. Our aim is to test the potential scaling capability of the fluid solver for multi-petascale systems. We started from relatively small instances of the whole simulation and solved large linear systems. We measured the wall clock time of single time steps of the simulation. This version gives important clues for a larger version of the problem. Later, in our general strategy, we increase the problem size and the number of time steps gradually to obtain a better picture. We test the performance of the solver icoFoam on TGCC Curie (a Tier-0 system) at CEA, France (see [21]). We consider three large sparse matrices of sizes 8 million x 8 million, 32 million x 32 million, and 64 million x 64 million. We achieved scaled speed-up for the largest matrices of 64 million x 64 million running on up to 16384 cores. In other words, we find that the scalability improves as the problem size increases for this application. This shows that there is no structural problem in the software up to this scale. This is an important and encouraging result for the problem.
Moreover, we embedded other direct solvers (kernel class), such as SuperLU_DIST 3.3 and SuperLU_MCDT, in addition to the solvers provided by OpenFOAM. Since future exascale systems are expected to have heterogeneous and many-core distributed nodes, we believe that our SuperLU_MCDT software is a good candidate for future systems. SuperLU_MCDT worked up to 16384 cores for the large penta-diagonal matrices for 2D problems and hepta-diagonal matrices for 3D problems, coming from the incompressible blood flow simulation, without any problem.

Download paper: PDF


Authors: Sebastian Szkodaa,c, Zbigniew Kozaa, Mateusz Tykierkob,c
a sebastian.szkoda@ift.uni.wroc.pl Faculty of Physics and Astronomy, University of Wroclaw, Poland
b Institute of Computer Engineering, Control and Robotics, Wroclaw University of Technology, Poland
c Wroclaw Centre for Networking and Supercomputing, Wroclaw University of Technology, Poland

Abstract: The aim of this research is to examine the possibility of parallelizing the Frisch-Hasslacher-Pomeau (FHP) model, a cellular automata algorithm for modeling fluid flow, on clusters of modern graphics processing units (GPUs). To this end an Open Computing Language (OpenCL) implementation for GPUs was written and compared with a previous, semi-automatic one based on OpenACC compiler pragmas (S. Szkoda, Z. Koza, and M. Tykierko, Multi-GPGPU Cellular Automata Simulations using OpenACC, http://www.prace-project.eu/IMG/pdf/wp154.pdf). Both implementations were tested on up to 16 Fermi-class GPUs using the MPICH3 library for the inter-process communication. We found that for both of the multi-GPU implementations the weak scaling is practically linear for up to 16 devices, which suggests that the FHP model can be successfully run even on much larger clusters. Secondly, while the pragma-based OpenACC implementation is much easier to develop and maintain, it gives as good performance as the manually written OpenCL code.

Download paper: PDF


Authors:
Charles Moulinec, David R. Emerson

STFC Daresbury Laboratory, Warrington, WA4 4AD, UK

Abstract:
Understanding the influence of wave distribution, hydrodynamics and sediment transport is crucial for the placement of off-shore energy generating platforms. The TELEMAC suite is used for this purpose, and the performance of the triple coupling between TOMAWAC for wave propagation, TELEMAC-3D for hydrodynamics and SISYPHE for sediment transport is investigated for several mesh sizes, the largest grid having over 10 million elements. The coupling has been tested up to 3,072 processors and good performance is in general observed.

Download PDF


Authors:
E. Casonia, J. Aguadoa, M. Riveroa, M. Vazquez, G. Houzeaux

Department of Computer Applications for Scientific Engineering, BSC, Nexus I, Gran Capita 2-4, 08034 Barcelona, Spain

Abstract:
This paper describes the work done on the Alya multiphysics code, which is an open-source software developed at the Barcelona Supercomputing Center (BSC-CNS). The main activities of this socio-economic application project concern the development of a coupled fluid-electro-mechanical model to simulate the computational mechanics of the heart. Several aspects involved in the simulation process, methodology and performance of the code are shown in detail.
Download PDF


Authors: J. Donners, M. Guarrasi, A. Emerson, M. Genseberger

SURFsara, Amsterdam, The Netherlands

CINECA, Bologna, Italy

Deltares, Delft, The Netherlands

Abstract:
The applications Delft3D-FLOW and SWAN are used to simulate water flow and water waves, respectively. These two applications have been coupled with Delft3D-WAVE and the combination of these three executables has been optimized on the Bull cluster “Cartesius”. The runtime could be decreased by a factor of 4 with hardly any additional hardware. Over 80% of the total runtime consisted of unnecessary I/O operations for the coupling, of which 70% could be removed. Both I/O optimizations and replacement with MPI were used. The Delft3D-FLOW application has also been ported to and benchmarked on the IBM Blue Gene/Q system “Fermi”.

Download PDF


Authors:
Thomas Ponweiser, Peter Stadelmeyer, Tomas Karasek

Johannes Kepler University Linz, RISC Software GmbH, Austria

VSB-Technical University of Ostrava, IT4Innovations, Czech Republic

Abstract:
Multi-physics, high-fidelity simulations are becoming an increasingly important part of industrial design processes. Simulations of fluid-structure interactions (FSI) are of great practical significance – especially within the aeronautics industry – and because of their complexity they require huge computational resources. On the basis of OpenFOAM, a partitioned, strongly coupled solver for transient FSI simulations with independent meshes for the fluid and solid domains has been implemented. Using two different kinds of model sets, a geometrically simple 3D beam with quadratic cross-section and a geometrically complex aircraft configuration, runtime and scalability characteristics are investigated. By modifying the implementation of OpenFOAM’s inter-processor communication, the scalability limit could be increased by one order of magnitude (from below 512 to above 4096 processes) for a model with 61 million cells.

Download PDF


Authors:
Seren Soner, Can Ozturan

Computer Engineering Department, Bogazici University, Istanbul, Turkey

Abstract:
OpenFOAM is an open-source computational fluid dynamics (CFD) package with a large user base from many areas of engineering and science. This whitepaper documents an enablement tool called PMSH that was developed to generate multi-billion element unstructured tetrahedral meshes for OpenFOAM. PMSH is developed as a wrapper code around the popular open-source sequential Netgen mesh generator. Parallelization of the mesh generation process is carried out in five main stages: (i) generation of a coarse volume mesh; (ii) partitioning of the coarse mesh to get sub-meshes, each of which is processed by a processor; (iii) extraction and refinement of coarse surface sub-meshes to produce fine surface sub-meshes; (iv) re-meshing of each fine surface sub-mesh to get the final fine volume mesh; (v) matching of partition boundary vertices followed by global vertex numbering. An integer-based barycentric coordinate method is developed for matching distributed partition boundary vertices. This method does not have the precision-related problems of floating-point coordinate based vertex matching. Test results obtained on an SGI Altix ICE X system with 8192 cores and 14 TB of total memory confirm that our approach does indeed enable us to generate multi-billion element meshes in a scalable way. The PMSH tool is available at https://code.google.com/p/pmsh/.

Download PDF


Authors:
Tomas Karasek, David Horak, Vaclav Hapla, Alexandros Markopoulos, Lubomir Riha, Vit Vondrak, Tomas Brzobohaty

IT4Innovations, VSB-Technical University of Ostrava (VSB)

Abstract:
Solution of multiscale and/or multiphysics problems is one of the domains which can most benefit from use of
supercomputers. Those problems are often very complex and their accurate description and numerical solution requires use of
several different solvers. For example problems of Fluid Structure Interaction (FSI) are usually solved using two different
discretization schemes, Finite volumes to solve Computational Fluid Dynamics (CFD) part and Finite elements to solve the
structural part of the problem. This paper summarizes different libraries and solvers used by the PRACE community that are
able to deal with multiscale and/or multiphysic problems such as Elmer, Code_Saturne and Code_Aster, and OpenFOAM.
The main bottlenecks in performance and scalability on the side of Computational Structure Mechanics (CSM) codes are
identified and their possible extension to fulfill needs of future exascale problems are shown. The numerical results of the
strong and weak scalabilities of CSM solver implemented in our FLLOP library are presented.

Download PDF


Authors:
Sebastian Szkoda, Zbigniew Koza, Mateusz Tykierko

Faculty of Physics and Astronomy, University of Wroclaw, Poland

Institute of Computer Engineering, Control and Robotics, Wroclaw University of Technology, Poland

Wroclaw Centre for Networking and Supercomputing, Wroclaw University of Technology, Poland

Abstract:
The Frisch-Hasslacher-Pomeau (FHP) model is a lattice gas cellular automaton designed to simulate fluid flows using exact, purely Boolean arithmetic, without any round-off error. Here we investigate the problem of its efficient porting to clusters of Fermi-class graphics processing units. To this end two multi-GPU implementations were developed and examined: one using the NVIDIA CUDA and GPU Direct technologies explicitly and the other one using CUDA implicitly through the OpenACC compiler directives and the MPICH2 MPI interface for communication. For a single Tesla C2090 GPU device both implementations yield up to a 7-fold acceleration over an algorithmically comparable, highly optimized multi-threaded implementation running on a server-class CPU. The weak scaling for the explicit multi-GPU CUDA implementation is almost linear for up to 8 devices (the maximum number of devices used in the tests), which suggests that the FHP model can be successfully run on much larger clusters and is a prospective candidate for exascale computational fluid dynamics. The scaling for the OpenACC approach turns out to be less favourable due to compiler-related technical issues. We found that the multi-GPU approach can bring considerable benefits for this class of problems, and that GPU programming can be significantly simplified through the use of the OpenACC standard, without a significant loss of performance, provided that the compilers supporting OpenACC improve their handling of the communication between GPUs.

Download PDF


Authors:

Abstract:
Code_Saturne is a popular open-source computational fluid dynamics package. We have carried out a study of applying MPI 2.0 / MPI 3.0 one-sided communication routines to Code_Saturne and of their impact on improving the scalability of the code for future peta/exa-scaling. We have developed modified versions of the halo exchange routine in Code_Saturne. Our modifications showed that MPI 2.0 one-sided calls give some speed improvement and lower memory overhead compared to the original version. The MPI 3.0 version, on the other hand, is unstable and could not run.
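For illustration only (not the modified Code_Saturne halo exchange), a one-sided exchange with MPI-2 RMA can be sketched as follows; the ring topology and buffer size are assumptions.

```c
/* Sketch of a one-sided (MPI-2 RMA) halo exchange between ring neighbours.
 * Buffer sizes and the ring topology are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>

#define HALO 128

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size;

    double send[HALO], recv[HALO];
    for (int i = 0; i < HALO; ++i) send[i] = rank;

    /* Expose the receive buffer as an RMA window */
    MPI_Win win;
    MPI_Win_create(recv, (MPI_Aint)(HALO * sizeof(double)), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                       /* open access epoch  */
    /* Put our halo directly into the right neighbour's window */
    MPI_Put(send, HALO, MPI_DOUBLE, right, 0, HALO, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                       /* complete all puts  */

    if (rank == 0)
        printf("rank 0 received halo from rank %d\n", (int)recv[0]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

The pattern replaces matched send/receive pairs with a window exposure and remote puts, which is one place the reduced memory overhead mentioned above can come from.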

Application Code: Code_Saturne

Download PDF


Authors:
Jan Christian Meyer

High Performance Computing Section, IT Dept., Norwegian University of Science and Technology

Abstract:
The LULESH proxy application models the behavior of the ALE3D multi-physics code with an explicit shock
hydrodynamics problem, and is made in order to evaluate interactions between programming models and
architectures, using a representative code significantly less complex than the application it models. As identified
in the PRACE deliverable D7.2.1 [1], the OmpSs programming model specifically targets programming at the
exascale, and this whitepaper investigates the effectiveness of its support for development on hybrid
architectures.

Download PDF


Authors:
Maciej Cytowskia, Matteo Bernardinib
a Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw
b Universita di Roma La Sapienza, Dipartimento di Ingegneria Meccanica e Aerospaziale

Abstract:
The project aimed at extending the capabilities of an existing flow solver for Direct Numerical Simulation of turbulent flows. Starting from the scalability analysis of the MPI baseline code, the main goal of the project was to devise an MPI/OpenMP hybridization capable of exploiting the full potential of the current architectures provided in the PRACE framework. The project was very successful: the new hybrid version of the code outperformed the pure MPI version on the IBM Blue Gene/Q architecture (FERMI).

Download PDF


Authors: B. Scotta, V. Weinbergb†, O. Hoenena, A. Karmakarb, L. Fazendeiroc

a Max-Planck-Institut für Plasmaphysik IPP, 85748 Garching b. München, Germany

b Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften, 85748 Garching b. München, Germany

c Chalmers University of Technology, 412 96 Gothenburg, Sweden

Abstract:
We discuss a detailed weak scaling analysis of GEM, a 3D MPI-parallelised gyrofluid code used in theoretical plasma physics at the Max Planck Institute of Plasma Physics (IPP) at Garching near Munich, Germany. Within a PRACE Preparatory Access Project various versions of the code have been analysed on the HPC systems SuperMUC at LRZ and JUQUEEN at the Jülich Supercomputing Centre (JSC) to improve the parallel scalability of the application. The diagnostic tool Scalasca has been used to filter out suboptimal routines. The code uses the electromagnetic gyrofluid model, which is a superset of magnetohydrodynamic and drift-Alfvén microturbulence and also includes several relevant kinetic processes. GEM can be used with different geometries depending on the targeted use case, and has been proven to show good scalability when the computational domain is distributed amongst two dimensions. Such a distribution allows grids of sufficient size to describe conventional tokamak devices. In order to enable simulation of very large tokamaks (such as the next-generation nuclear fusion device ITER in Cadarache, France) the third dimension has been parallelised and weak scaling has been achieved for significantly larger grids.

Download PDF


Authors:
Siwei Donga, Vegard Eideb, Jeroen Engelbertsc
a Universidad Politecnica de Madrid, Spain
b Norwegian University of Science and Technology, Norway
c SURFsara, Amsterdam, Netherlands

Abstract:
The SHEAR code is developed at the School of Aeronautics, Universidad Politecnica de Madrid, for the simulation of turbulent structures of shear flows. The code has been well tested on smaller clusters. This white paper describes the work done to scale and optimise SHEAR for large systems like the Blue Gene/Q system JUQUEEN in Jülich.
Download PDF


Authors:
Riccardo Brogliaa, Stefano Zaghia, Roberto Muscaria, Francesco Salvadoreb, Soon-Heum Koc
a CNR-INSEAN - National Marine Technology Research Institute, Via di Vallerano 139, Rome 00128, Italy
b CINECA, Via dei Tizii 6, Rome 00185, Italy
c NSC - National Supercomputing Centre, Linkoping University, 58183 Linkoping, Sweden

Abstract:
In this paper, the work that has been performed to extend the capabilities of the -Xnavis software, a well tested and
validated parallel
flow solver developed by the research group of CNR-INSEAN, is reported. The solver is based on the
fi-nite volume discretization of the unsteady incompressible Navier-Stokes equations; main features include a level set
approach to handle free surface
flows and a dynamical overlapping grids approach, which allows to deal with bodies in
relative motion. The baseline code features a hybrid MPI/OpenMP parallelization, proven to scale when running on
order of hundreds of cores (i.e. Tier-1 platforms). However, some issues arise when trying to use this code with the
current massively parallel HPC facilities provided in the Tier-0 PRACE context. First of all, it is mandatory to assess
an effi-cient speed-up up to thousands of processors. Other important aspects are related to the pre- and post- processing
phases which need to be optimized and, possibly, parallelized. The last one concerns the implementation of MPI-I/O
procedures in order to try to accelerate data access and to reduce the number of generated -files.

Download PDF


Authors:
Stoyan Markov, Peicho Petkov, Damyan Grancharov and Georgi Georgiev

National Centre for Supercomputing Applications, Akad. G. Bonchev Str., 25A, 1113 Sofia, Bulgaria

Abstract:
We investigated a possible way of treating electrostatic interactions by numerically solving Poisson's equation using the Conjugate Gradient method and the Stabilized BiConjugate Gradient method. The aim of the research was to test the execution time of prototype programs running on a BlueGene/P and a CPU/GPU system. The results show that the tested methods are applicable for electrostatics treatment in molecular dynamics simulations.
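As a hedged reference for the first of the two methods, the sketch below applies a plain Conjugate Gradient iteration to a 1D discrete Poisson problem; it is a textbook serial version, not the BlueGene/P or CPU/GPU prototype code.

```c
/* Textbook Conjugate Gradient for a 1D discrete Poisson problem
 * (-u'' = f with zero boundary values); a reference sketch, not the
 * prototype programs benchmarked in the whitepaper. */
#include <stdio.h>
#include <math.h>

#define N 256

/* y = A*x for the 1D Laplacian stencil [-1 2 -1] */
static void apply_A(const double *x, double *y)
{
    for (int i = 0; i < N; ++i) {
        double left  = (i > 0)     ? x[i - 1] : 0.0;
        double right = (i < N - 1) ? x[i + 1] : 0.0;
        y[i] = 2.0 * x[i] - left - right;
    }
}

static double dot(const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < N; ++i) s += a[i] * b[i];
    return s;
}

int main(void)
{
    double b[N], x[N] = {0}, r[N], p[N], Ap[N];
    for (int i = 0; i < N; ++i) b[i] = 1.0;         /* unit charge density */
    for (int i = 0; i < N; ++i) r[i] = p[i] = b[i]; /* r = b - A*0 = b     */
    double rs_old = dot(r, r);

    for (int it = 0; it < N; ++it) {
        apply_A(p, Ap);
        double alpha = rs_old / dot(p, Ap);
        for (int i = 0; i < N; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rs_new = dot(r, r);
        if (sqrt(rs_new) < 1e-10) { printf("converged in %d iterations\n", it + 1); break; }
        for (int i = 0; i < N; ++i) p[i] = r[i] + (rs_new / rs_old) * p[i];
        rs_old = rs_new;
    }
    return 0;
}
```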

Download PDF


Authors:
A. Charalampidoua,b, P. Daogloua,b, J. Hertzerc, E.V. Votyakovd

a Greek Research and Technology Network, Athens, Greece

b Scientific Computing Center, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece

c HLRS, Nobelstr. 19, D-70569 Stuttgart, Germany

d The Cyprus Institute, 20 Konstantinou Kavafi Street, 2121 Aglantzia, Nicosia, Cyprus

Abstract:
The project objective has been to develop and justify an OpenFOAM model for the simulation of a thermal energy storage (TES) tank. In the course of the project we have obtained scalability results, which are presented in this paper. Scalability tests have been performed on the HLRS Hermit HPC system using various combinations of decomposition methods, cell counts and numbers of physical CPU cores.

Download PDF


Authors:
O. Akinci1, M. Sahin2, B.O. Kanat3

1 National High Performance Computing Center of Turkey (UHeM), Istanbul Technical University (ITU), Ayazaga Campus, UHeM Office, Istanbul 34469, Turkey
2 ITU, Ayazaga Campus, Faculty of Aeronautics and Astronautics, Istanbul 34469, Turkey
3 ITU, Ayazaga Campus, Computer Engineering Department, Istanbul 34469, Turkey

Abstract:
ViscoSolve is a stable unstructured finite volume method for parallel large-scale viscoelastic fluid flow
calculations. The code incorporates the open-source libraries PETSc and MPI for parallel computation. In this
whitepaper we report on the work done to investigate the scaling performance of the ViscoSolve code.

Download PDF


Author:
Evghenii Gaburov
SURFsara, Science Park 140, 1098XG Amsterdam, the Netherlands

Abstract:
This white-paper reports on an enabling effort that involves porting a legacy 2D fluid dynamics Fortran code to NVIDIA
GPUs. Given the complexity of both the code and the underlying (custom) numerical method, the natural choice was to use
NVIDIA CUDA C to achieve the best possible performance. We achieved over 4.5x speed-up on a single K20 compared
to the original code executed on a dual-socket E5-2687W.

Download PDF


Authors:
Paride Dagnaa, Joerg Hertzerb

a CINECA-SCAI Department, Via R. Sanzio 4, Segrate (MI) 20090, Italy
b HLRS, Nobelstr. 19, D-70569 Stuttgart, Germany

Abstract:
The performance results from the hybridization of the OpenFOAM linear system solver, tested on the CINECA Fermi and the
HLRS Hermit supercomputers are presented in this paper. A comparison between the original and the hybrid OpenFOAM
versions on four physical problems, based on four different solvers, will be shown and a detailed analysis of the behavior of the
main computing and communication phases, which are responsible for scalability during the linear system solution, will be given.

Download PDF


Authors:
Stephane Glocknerb, N. Audiffrena, H. Ouvrardb,a

a CINES, Centre Informatique National de l'Enseignement Superieur, Montpellier, France
b I2M (Institut de Mecanique et d'Ingenierie de Bordeaux)

*Corresponding author: audiffren@cines.fr

Abstract:
It has already been shown that the numerical tool Thetis, based on the resolution of the Navier-Stokes
equations for multiphase flows, gives accurate results for coastal applications, e.g. wave breaking, tidal
bore propagation, tsunami generation, swash flows, etc. [1,2,3,4,5,6]. In this study our goal is to
improve the time and memory consumption of the set-up phase of the simulation (partitioning and
building the computational mesh), to examine the potential benefits of a hybrid approach to the Hypre
library, and to fine-tune the implementation of the code on the Curie Tier-0 system. We also implement
parallel POSIX VTK and HDF5 I/O. Thetis is now able to run efficiently with up to 1 billion mesh nodes on
16384 cores of CURIE in a production context.

Download PDF


Authors:
Thibaut Delozea, Yannick Hoaraub, Marianna Brazaa

a IMFT, 2 allee du Professeur Camille Soula, Toulouse 31400, France
b IMFS, 2 rue Boussingault, Strasbourg 67000, France

Abstract:
The present work focuses on the development of an efficient and robust CFD solver for aeronautics: the NSMB
code, Navier-Stokes MultiBlock. A specific advanced version of the code containing the turbulence modeling
approaches developed by IMFT is the object of the present optimization. The project aims at improving the
performance of the MPI version of the code in order to use efficiently the fat-node part of Curie (TGCC, France).
Different load balancing strategies have been studied in order to achieve an optimal distribution of work using
up to 4096 processes.

Download PDF


Authors:
A. Schnurpfeila, A. Schillera, F. Janetzkoa, St. Meiera, G. Sutmanna

a Forschungszentrum Juelich – JSC, Wilhelm-Johnen-Strasse, 52425 Juelich, Germany

Abstract:
MP2C is a molecular dynamics code that focuses on mesoscopic particle dynamics simulations on massively parallel
computers. The program is a development of JSC together with the Institute for Theoretical Biophysics and Soft
Matter (IFF-2) of the Institute for Solid State Research (IFF) at Juelich. Within the framework of the PRACE Internal
Call, further optimization of the code as well as the removal of possible bottlenecks were addressed. The project was mainly
performed on JUGENE, a BG/P based architecture located at the Forschungszentrum Juelich, Germany. Besides that,
some scaling tests were performed on JUROPA, an Intel-Nehalem based general purpose supercomputer, also located at
the Forschungszentrum Juelich. In this report the efforts made in working on the program package are presented.

Download PDF


Author:
Sami Saarinen
CSC –IT Center for Science Ltd,
P.O.Box 405, FI-02101 Espoo, Finland

Abstract:
The Sun exhibits magnetic activity at various spatial and temporal scales. The best known example is the 11-year
sunspot cycle which is related to the 22-year periodicity of the Sun’s magnetic field. The sunspots, and thus solar
magnetic activity, have some robust systematic features: in the beginning of the cycle sunspots appear at latitudes
around 40 degrees. As the cycle progresses these belts of activity move towards the equator. The sign of the
magnetic field changes from one cycle to the next and the large-scale field remains approximately anti-symmetric
with respect to the equator. This cycle has been studied using direct observations for four centuries. Furthermore,
proxy data from tree rings and Greenland ice cores has revealed that the cycle has persisted through millennia. The
period and amplitude of activity change from cycle to cycle and there are even periods of several decades in the
modern era when the activity has been very low. Since it is unlikely that the primordial field of the hydrogen gas that
formed the Sun billions of years ago could have survived to the present day, the solar magnetic field is considered to
be continuously replenished by some dynamo mechanism.

Download PDF


Authors:
Kevin Stratforda, Ignacio Pagonabarragab

aEPCC, The King's Buildings, The University of Edinburgh, EH9 3JZ, United Kingdom
bDepartament de Fisica Fonamental, Universitat de Barcelona, Carrer Marti i Franques, 08028 Barcelona, Spain

Abstract:
This project looked at the performance of simulations of bacterial swimmers using a lattice Boltzmann code for complex fluids.

Download PDF


Authors: A. Turka,*, C. Moulinecb, A.G. Sunderlandb, C. Aykanata

aBilkent University, Comp. Eng. Dept., Ankara, Turkey
bSTFC Daresbury Laboratory, Warrington WA4 4AD, UK

Abstract:
Code Saturne is an open-source, multi-purpose Computational Fluid Dynamics (CFD) software package, which has been developed by Electricite de France Recherche et Development (EDF-R&D). Code Saturne has been selected as an application
of interest for the CFD community in the Partnership for Advanced Computing in Europe First Implementation Phase
Project (PRACE-1IP) and various efforts towards improving the scalability of Code Saturne have been conducted. In this
whitepaper, the efforts towards understanding and improving the preprocessing subsystem of Code Saturne are described,
and to this end, the performance of different mesh partitioning software packages that can be used is investigated.

Download PDF


Authors: C. Moulineca, A.G. Sunderlanda, P. Kabelikovab, A. Ronovskyb, V. Vondrakb, A. Turkc, C. Aykanatc, C. Theodosioud

aComputational Science and Engineering Department, STFC Daresbury Laboratory, United Kingdom
bDepartment of Applied Mathematics, VSB-Technical University of Ostrava, 17. listopadu 15, 708 33 Ostrava, Czech Republic
cDepartment of Computer Engineering, Bilkent University, 06800 Bilkent Ankara, Turkey
dScientific Computational Center, Aristotle University, 54 124 Thessaloniki, Greece

Abstract:
Some of the optimisations required to prepare Code_Saturne for petascale simulations are presented in this white paper, along
with the performance of the code. A mesh multiplication package based on parallel global refinement of hexahedral meshes has
been developed for Code_Saturne to handle meshes containing billions of cells and to efficiently exploit PRACE Tier-0 system
capabilities. Several parallel partitioning tools have been tested and Code_Saturne performance has been assessed on up to a 3.2
billion cell mesh. The parallel code is highly scalable and demonstrates good parallel speed-up at very high core counts, e.g. from
32,768 to 65,536 cores.

Download PDF


Author: Massimiliano Culpo

CINECA, Via Magnanelli 6/3, Casalecchio di Reno (BO) I-40033, Italy

Abstract:
The scaling behavior of different OpenFOAM versions is analyzed on two benchmark
problems. Results show that the applications scale reasonably well up to a thousand tasks.
An in-depth profiling identifies the calls to the MPI_Allreduce function in the linear algebra
core libraries as the main communication bottleneck. Sub-optimal on-core performance is
due to the sparse matrix storage format, which does not employ any cache-blocking
mechanism at present. Possible strategies to overcome these limitations are proposed and
analyzed, and preliminary results on prototype implementations are presented.
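Reducing the number of global reductions per solver iteration is a common remedy for the kind of MPI_Allreduce bottleneck identified above; rearranged CG variants, for instance, compute their two dot products at the same point so they can share a single allreduce. The fragment below is a generic mpi4py illustration of that communication pattern, not OpenFOAM code; the vector sizes and names are placeholders.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def fused_dot_products(r, z, p, Ap):
    """Combine two global dot products into one MPI_Allreduce call.

    A rearranged CG-type iteration that needs (r.z) and (p.Ap) at the same
    point can pack them into a single reduction, halving the number of
    latency-bound global collectives per iteration.
    """
    local = np.array([np.dot(r, z), np.dot(p, Ap)])
    glob = np.empty_like(local)
    comm.Allreduce(local, glob, op=MPI.SUM)
    return glob[0], glob[1]

# Toy usage with a random local vector on each rank.
rng = np.random.default_rng(comm.Get_rank())
r = rng.standard_normal(1000)
rz, pAp = fused_dot_products(r, r, r, r)
if comm.Get_rank() == 0:
    print("global r.z =", rz, "global p.Ap =", pAp)
```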

Download PDF


Authors: Michael Moylesa, Peter Nash, Ivan Girotto

Irish Centre for High End Computing, Grand Canal Quay, Dublin 2

Abstract:
The following report outlines work undertaken for PRACE-2IP. The report will outline the computational methods used to
examine petascaling of OpenFOAM on the French Tier-0 system CURIE. The case study used has been provided by the
National University of Ireland, Galway (NUIG). The profiling techniques utilised to uncover bottlenecks, specifically in
communication and file I/O within the code, will provide an insight into the behaviour of OpenFOAM and highlight practices
that will be of benefit to the user community.

Download PDF


Author:
Murat Manguoglu

Middle East Technical University, Department of Computer Engineering, 06800 Ankara, Turkey

Abstract:
Solution of large sparse linear systems is frequently the most time-consuming operation in computational fluid dynamics simulations.
Improving the scalability of this operation is likely to have a significant impact on the overall scalability of the application. In this white paper we
show scalability results up to a thousand cores for a new algorithm devised to solve large sparse linear systems. We have also compared pure
MPI and hybrid MPI-OpenMP implementations of the same algorithm.

Download PDF




Earth Science applications


Authors: Mads R. B. Kristensena and Roman Nutermana

a Niels Bohr Institute, University of Copenhagen, Denmark

Abstract:
In this paper, we explore the challenges of running the current version (v1.2.2) of
the Community Earth System Model (CESM) on Juqueen. We present a set of
workarounds for the Juqueen supercomputer that enables massively parallel executions and
demonstrate scalability up to 3024 CPU cores.

Download PDF

Download WP200_Appendix_topology patch


Authors:
P. Nolan, A. McKinstry

Irish Centre for High-End Computing (ICHEC), Ireland

Abstract:
Climate change due to increasing anthropogenic greenhouse gases and land surface change is currently one of
the most relevant environmental concerns. It threatens ecosystems and human societies. However, its impact on
the economy and our living standards depends largely on our ability to anticipate its effects and take appropriate
action. Earth System Models (ESMs), such as EC-Earth, can be used to provide society with information on the
future climate. EC-Earth3 generates reliable predictions and projections of global climate change, which are a
prerequisite to support the development of national adaptation and mitigation strategies.
This project investigates methods to enhance the parallel capabilities of EC-Earth3 by offloading bottleneck
routines to GPUs and Intel Xeon Phi coprocessors. To gain a full understanding of climate change at a regional
scale will require EC-Earth3 to be run at a much higher spatial resolution (T3999 5km) than is currently
feasible. It is envisaged that the work outlined in this project will provide climate scientists with valuable data
for simulations planned for future exascale systems.

Download PDF


Author:
Paride Dagna

CINECA-SCAI Department, Via R. Sanzio 4, Segrate (MI) 20090, Italy

Abstract:
SPEED (Spectral Elements in Elastodynamics with Discontinuous Galerkin) is an open source code, jointly developed by the
Departments of Structural Engineering and of Mathematics at Politecnico di Milano, for seismic hazard analyses.

In this paper, the performance results obtained from the optimization and hybridization work done on SPEED, tested on the
CINECA Fermi BG/Q supercomputer, will be shown. A comparison between the pure MPI and the hybrid SPEED versions on
three earthquake scenarios of increasing complexity will be presented, and a detailed analysis of the advantages that come from
hybridization and optimization of the computing and I/O phases will be given.

Download PDF


Authors: Jean-Marc Molinesa, Nicole Audiffrenb, Albanne Lecointrea

a CNRS, LEGI, Grenoble, France
b CINES, Centre Informatique National de l'Enseignement Superieur, 34000 Montpellier, France

Abstract:
This project aims at preparing the high-resolution ocean/sea-ice realistic modeling environment implemented by the European DRAKKAR
consortium for use on PRACE Tier-0 computers. DRAKKAR participating Teams jointly develop and share this modeling environment to
address a wide range of scientific questions investigating multiple-scale interactions in the world ocean. Each team relies on the achievements
of DRAKKAR to have available for its research the most efficient and up-to-date ocean models and related modeling tools. Two original
realistic model configurations, based on the NEMO modeling framework, are considered in this project. They are designed to make possible
the study of the role of multiple-scale interactions in the ocean variability, in the ocean carbon cycle and in marine ecosystems changes.

Download PDF


Authors:
Charles Moulinec, Yoann Audouin, Andrew Sunderland

STFC Daresbury Laboratory, UK

Abstract:
This report details optimization undertaken on the Computational Fluid Dynamic (CFD) software suite TELEMAC, a
modelling system for free surface waters with over 200 installations worldwide. The main focus of the work has involved
eliminating memory bottlenecks occurring at the pre-processing stage that have historically limited the size of simulations
processed. This has been achieved by localizing global arrays in the pre-processing tool, known as PARTEL. Parallelism in
the partitioning stage has also been improved by replacing the serial partitioning tool with a new parallel implementation.
These optimizations have enabled massively parallel runs of TELEMAC-2D, a Shallow Water Equations based code,
involving over 200 million elements to be undertaken on Tier-0 systems. These runs simulate extreme flooding events
on very fine meshes (locally less than one meter). Simulations at this scale are crucial for predicting and understanding flooding events occurring, e.g. in the region of the Rhine river.

Download PDF


Authors: Thomas Zwinger, Mika Malinen, Juha Ruokolainen, Peter Raback

CSC – IT Center for Science, P.O. Box 405, FI-02101 Espoo, Finland

Abstract:
By gaining and losing mass, glaciers and ice-sheets play a key role in sea level evolution. This is obvious when
considering the past 20000 years, during which the collapse of the large northern hemisphere ice-sheets after the
Last Glacial Maximum contributed to a 120m rise in sea level. This is particularly worrying when the future is
considered. Indeed, recent observations clearly indicate that important changes in the velocity structure of both
the Antarctic and Greenland ice-sheets are occurring, suggesting that large and irreversible changes may already
have been initiated. This was clearly emphasised in the last report published by the Intergovernmental Panel on
Climate Change (IPCC) [7]. The IPCC also asserted that current knowledge of key processes causing the
observed accelerations was poor, and concluded that reliable projections obtained with process-based models for
sea-level rise (SLR) are currently unavailable. Most of these uncertain key processes have in common that their
physical/numerical characteristics are not adequately reflected, or are even completely missing, in the established
simplified models, such as the shallow ice approximation (SIA), that have been in use for decades. Whereas those
simplified models run on common PC systems, the new approaches require higher resolution and larger
computational models, which demand High Performance Computing (HPC) methods to be applied. In other
words, numerical glaciology, like climatology and oceanography decades ago, needs to be updated for HPC with
scalable codes, in order to deliver the prognostic simulations demanded by the IPCC. The DECI project
ElmerIce, and enabling work associated with it, improved simulations of key processes that lead to continental
ice loss. The project also developed new data assimilation methods. This was intended to decrease the degree of
uncertainty affecting future SLR scenarios and consequently contribute to on-going international debates
surrounding coastal adaptation and sea-defence planning. These results directly feed into existing projects, such
as the European FP7 project ice2sea [9], which has the objective of improving projections of the contribution of
continental ice to future sea-level rise and the French ANR ADAGe project [10], coordinated by O. Gagliardini,
which has the objective to develop data assimilation methods dedicated to ice flow studies. Results from these
projects will directly impact the upcoming IPCC assessment report (AR5).

Download PDF


Authors: Sebastian von Alfthana, Dusan Stankovicb, Vladimir Slavnicb

aFinnish Meteorological Institute, Helsinki, Finland
bInstitute of Physics Belgrade, Serbia

Abstract:
In this whitepaper we report the work that was done to investigate and improve the performance of a hybrid-Vlasov
code for simulating the Earth's magnetosphere. We improved the performance of the code through a hybrid OpenMP-MPI mode.

Download PDF


Authors:
Dalibor Lukas, Jan Zapletal

Department of Applied Mathematics, IT4Innovations, VSB-Technical University of Ostrava, Czech Republic

Abstract:
In this paper, a new parallel acoustic simulation package has been created, using the boundary element method (BEM). The
package is built on top of SPECFEM3D, which is parallel software for doing seismic simulations, e.g. earthquake simulations
of the globe. The acoustical simulation relies on a Fourier transform of the seismic elastodynamic data, resulting from
SPECFEM3D_GLOBE, which are then postprocessed by a sequence of solutions to Helmholtz equations, in the exterior of
the globe. For the acoustic simulations BEM has been employed, which reduces computation to the sphere; however, its
naive implementation suffers from quadratic time and memory complexity, with respect to the number of unknowns. To
overcome the latter, the method was accelerated by using hierarchical matrices and adaptive cross approximation techniques,
which is referred to as Fast BEM. First, a hierarchical clustering of the globe surface triangulation is performed. The arising
cluster pairs decompose the fully populated BEM matrices into a hierarchy of blocks, which are classified as far-field or near-field. While the near-field blocks are kept as full matrices, the far-field blocks are approximated by low-rank matrices. This
reduces the quadratic complexity of the serial code to almost linear complexity, i.e. O(n*log(n)), where n denotes the number
of triangles. Furthermore, a parallel implementation was done, so that the blocks are assigned to concurrent MPI processes
with an optimal load balance. The processes share the triangulation data. The parallel code reduces the computational
complexity to O(n*log(n)/N), where N denotes the number of processes. This novel implementation of BEM improves on the
computational times of traditional volume discretization methods, e.g. finite elements, by an order of magnitude.
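To make the low-rank approximation step concrete, here is a small, self-contained sketch (not code from the package) of a fully pivoted adaptive cross approximation applied to a dense far-field block; the block itself is a synthetic smooth kernel matrix chosen only for illustration.

```python
import numpy as np

def aca(block, tol=1e-6, max_rank=50):
    """Adaptive cross approximation of a dense block.

    Returns U, V with block ~ U @ V, built from rank-1 cross updates.
    (The residual is formed explicitly here for clarity; production codes
    evaluate only the rows and columns they actually need.)
    """
    m, n = block.shape
    R = block.copy()
    U = np.zeros((m, 0))
    V = np.zeros((0, n))
    for _ in range(max_rank):
        i, j = np.unravel_index(np.argmax(np.abs(R)), R.shape)  # pivot entry
        if abs(R[i, j]) < tol:
            break
        u = R[:, j] / R[i, j]
        v = R[i, :].copy()
        U = np.column_stack([U, u])
        V = np.vstack([V, v])
        R -= np.outer(u, v)   # rank-1 update of the residual
    return U, V

# Synthetic far-field-like block: 1/|x - y| kernel between two separated clusters.
x = np.linspace(0.0, 1.0, 200)
y = np.linspace(5.0, 6.0, 200)
block = 1.0 / np.abs(x[:, None] - y[None, :])
U, V = aca(block)
print("rank:", U.shape[1],
      "relative error:", np.linalg.norm(block - U @ V) / np.linalg.norm(block))
```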

Download PDF


Authors: John Donnersa, Chandan Basub, Alastair McKinstryc, Muhammad Asifd, Andrew Portere, Eric Maisonnavef, Sophie Valckef, Uwe Fladrichg

aSARA, Amsterdam, The
Netherlands

bNSC, Linkoping, Sweden

cICHEC, Galway, Ireland

dIC3, Barcelona, Spain

eSTFC, Daresbury, United Kingdom

fCERFACS, Toulouse, France

gSMHI, Norrkoping, Sweden

Abstract:
The EC-EARTH model is a global, coupled climate model that consists of the separate components IFS for the
atmosphere and NEMO for the ocean that are coupled using the OASIS coupler. EC-EARTH was ported and run on
the Curie system. Different configurations, using resolutions from T159 (approx. 128km) to T799 (approx 25km),
were available for benchmarking. Scalasca was used to analyze the performance of the model in detail. Although it
was expected that either the I/O or the coupling would be a bottleneck for scaling of the highest resolution model,
that is clearly not yet the case. The IFS model uses two MPI_Alltoallv calls per timestep that dominate the loss of
scaling at 1024 cores. Using the OpenMP functionality in IFS could potentially increase scalability considerably, but
this does not yet work on Curie. Work is ongoing to make MPI_Alltoallv more efficient on Curie. It is expected that
I/O and/or coupling will become a bottleneck once IFS can be scaled beyond 2000 cores. Therefore, the
OASIS team increased the scalability of OASIS dramatically with the implementation of a radically different
approach, showing less than 1% overhead at 2000 cores. The scalability of NEMO was improved during an earlier
PRACE project. The I/O subsystem in IFS is described and is probably not easily accelerated unless it is
rewritten to use a different file format.

Download PDF


Authors:
Dalibor Lukas, Petr Kovar, Tereza Kovarova, Jan Zapletal

Department of Applied Mathematics, IT4Innovations, VSB-Technical University of Ostrava, Czech Republic

Abstract:
In this paper, a new parallel acoustic simulation package has been created, using the boundary element method (BEM). The
package is built on top of SPECFEM3D, which is parallel software for doing seismic simulations, e.g. earthquake simulations of
the globe. The acoustical simulation relies on a Fourier transform of the seismic elastodynamic data, resulting from
SPECFEM3D_GLOBE, which are then postprocessed by a sequence of solutions to Helmholtz equations, in the exterior of the
globe. For the acoustic simulations BEM has been employed, which reduces computation to the sphere; however, its naive
implementation suffers from quadratic time and memory complexity, with respect to the number of unknowns. To overcome the
latter, the method was accelerated by using hierarchical matrices and adaptive cross approximation techniques, which is referred
to as Fast BEM. First, a hierarchical clustering of the globe surface triangulation is performed. The arising cluster pairs
decompose the fully populated BEM matrices into a hierarchy of blocks, which are classified as far-field or near-field. While the
near-field blocks are kept as full matrices, the far-field blocks are approximated by low-rank matrices. This reduces the quadratic
complexity of the serial code to almost linear complexity, i.e. O(n log(n)), where n denotes the number of triangles. Furthermore,
a parallel implementation was done, so that the blocks are assigned to concurrent MPI processes with an optimal load balance.
The novelty of our approach is based on a nontrivial and theoretically supported memory distribution of the hierarchical matrices
and right-hand side vectors so that the overall memory consumption leads to O(n log(n)/N+n/sqrt(N)), which is the theoretical
limit at the same time.

Download PDF


Authors: Marcin Zielinskia, John Donnersa

aSARA B.V., Science Park 140, 1098XG Amsterdam, The Netherlands

Abstract:
The project focused on an evaluation of the code for a possible introduction of OpenMP, its actual implementation and
extensive tests. The major time-consuming parts of the code were detected and thoroughly analyzed. The most time-consuming part
was successfully parallelized using OpenMP. Very extensive test simulations using the hybrid code allowed for many further
improvements and validations of its results. Possible improvements have also been discussed with the developers, to be
implemented in the near future.

Download PDF


Author:
Chandan Basu

National Supercomputer Center, Linkoping University, Linkoping 581 83, Sweden

Abstract:
The high resolution version of EC-EARTH has been ported to Curie. The scalability of the code has been tested up to 3500 CPU cores. An
example EC-EARTH run is profiled using the TAU tool.

Download PDF


Astrophysics applications


Authors: T. Ponweiser, M.E. Innocenti, G. Lapenta, A. Beck, S. Markidis

Research Institute for Symbolic Computation (RISC), Johannes Kepler University, Altenberger Straße 69, 4040 Linz, Austria

Center for mathematical Plasma Astrophysics, Department of Mathematics, K.U. Leuven, Celestijnenlaan 200B, B-3001 Leuven, Belgium

Laboratoire Leprince-Ringuet, Ecole Polytechnique, CNRS-IN2P3, France

KTH Royal Institute of Technology, Stockholm, Sweden

Abstract:
Parsek2D-MLMD is a semi-implicit Multi Level Multi Domain Particle-in-Cell (PIC) code for the simulation of astrophysical and space plasmas. In this Whitepaper, we report on improvements on Parsek2D-MLMD carried out in the course of the PRACE preparatory access project 2010PA1802. Through algorithmic enhancements – in particular the implementation of smoothing and temporal sub-stepping – as well as through performance tuning using HPCToolkit, the efficiency of the code has been improved significantly. For representative benchmark cases, we consistently achieved a total speedup of a factor 10 and higher.

Download PDF


Authors:
J. Donners, J. Bedorf

SURFsara, Amsterdam, The Netherlands

Leiden University, Leiden, The Netherlands

Abstract:
This white paper describes a project to modify the I/O of the Bonsai astrophysics code to scale up to more than 10,000 nodes on the Titan
system. A remaining bottleneck is the I/O: the creation of separate files for each MPI task overloads the Lustre metadata server. The use of
the SIONlib library on the Lustre filesystem of different PRACE systems is investigated. Several issues had to be resolved, both with the
SIONlib library and the Liblustre API, before a satisfactory I/O performance could be achieved. For a few thousand MPI tasks,
SIONlib reaches about half the performance of a naive approach where each MPI task writes a separate file. However, when more
MPI tasks are used, the SIONlib library shows the same performance as the naive approach. The SIONlib library exhibits both the
performance and the scalability that are needed to be successful at exascale.

Download PDF


Authors:
Kacper Kowalik, Artur Gawryszczak, Marcin Lawenda, Michal Hanasza, Norbert Meyer

Nicolaus Copernicus University, Jurija Gagarina 11, 87-100 Torun, Poland

Copernicus Astronomical Center, Polish Academy of Sciences, Bartycka 18, 00-716 Warszawa, Poland

Poznan Supercomputing and Networking Centre, Dabrowskiego 79a, 60-529 Poznan, Poland

Abstract:
PIERNIK is an MHD code created at the Centre for Astronomy, Nicolaus Copernicus University in Torun, Poland. The
current version of the code uses a simple, conservative numerical scheme known as the Relaxing TVD
scheme (RTVD). The aim of this project was to increase the performance of the PIERNIK code in cases where
the computational domain is decomposed into a large number of smaller grids and each concurrent process is
assigned a significant number of those grids. This optimization enables PIERNIK to run efficiently on Tier-0
machines. In chapter 1 we introduce the PIERNIK software in more detail. Next we focus on scientific aspects
(chapter 2) and discuss the algorithms used (chapter 3), including potential optimization issues. Subsequently we
present a performance analysis (chapter 4) carried out with the Scalasca and Vampir tools. In the final chapter 5 we
present the optimization results. In the appendix we provide technical information about the installation and test
environment.

Download PDF


Authors:
Joachim Heina, Anders Johansonb

aCentre of Mathematical Sciences & Lunarc, Lund University, Box 118, 221 00 Lund, Sweden
bDepartment of Astronomy and Theoretical Physics, Lund University, Box 43, 221 00 Lund, Sweden

Abstract:
The simulation of weakly compressible turbulent gas flows with embedded particles is one of the main objectives of the
Pencil Code. While the code mostly deploys high-order finite difference schemes, portions of the code require the use of
Fourier space methods. This report describes an optimisation project to improve the performance of the parallel Fourier
transformation in the code. Certain optimisations which significantly improve the performance of the parallel FFT were
observed to have a negative impact on other parts of the code, such that the overall performance decreases. Despite this
challenge the project managed to improve the performance of the parallel FFT within the Pencil Code by a factor of
2.4 and the overall performance of the application by 8% for a project-relevant benchmark.

Download PDF


Authors: Petri Nikunena, Frank Scheinerb

aCSC – IT Center for Science, P.O. Box 405, FI-02101 Espoo, Finland

bHigh Performance Computing Center Stuttgart (HLRS), University of Stuttgart, D-70550 Stuttgart, Germany

Abstract:
Planck is a mission of the European Space Agency (ESA) to map the anisotropies of the cosmic microwave
background with the highest accuracy ever achieved. Planck is supported by several computing centres,
including CSC (Finland) and NERSC (USA). Computational resources were provided by CSC through the DECI
project Planck-LFI, and by NERSC as a regular production project. This whitepaper describes how PRACE-2IP
staff helped Planck-LFI with two types of support tasks: (1) porting their applications to the execution machine
and seeking ways to improve applications’ performance; and (2) improving performance and facilities to transfer
data between the execution site and the different data centres where data is stored.

Download PDF


Authors:
V. Antonuccio-Delogua, U. Becciania, M. Cytowski*b, J. Heinc, J. Hertzerd

aINAF – Osservatorio Astrofisico di Catania, Italy
bInterdisciplinary Centre for Mathematical and Computational Modeling, University of Warsaw, Poland
cLunarc, Lund University, Sweden
dHLRS, University of Stuttgart, Germany

Abstract:
In this whitepaper we report work that was done to investigate and improve the performance of a mixed MPI and OpenMP
implementation of the FLY code for cosmological simulations on a PRACE Tier-0 system Hermit (Cray XE6).

Download PDF


Authors: Claudio Ghellera, Graziella Ferinia, Maciej Cytowskib, Franco Vazzac

aCINECA, Via Magnanelli 6/3, Casalecchio di Reno, 40033, Italy
bICM, University of Warsaw, ul. Pawinskiego 5a, 02-106 Warsaw, Poland
cJacobs University, Campus Ring 1, 28759 Bremen, Germany

Abstract:
In this paper we present the work performed in order to build and optimize the cosmological simulation code ENZO
on Jugene, the Blue Gene/P system available at the Forschungszentrum Juelich in Germany. The work allowed us to
define the optimal setup to perform high resolution simulations aimed at the description of non-thermal phenomena
(e.g. the acceleration of relativistic particles at shock waves) active in massive galaxy clusters during their cosmological
evolution. These simulations will be the subject of a proposal in a future call for projects of the PRACE EU funded
project (http://www.prace-ri.eu/).

Download PDF


Authors: Ata Turka, Cevdet Aykanata, G. Vehbi Demircia, Sebastian von Alfthanb, Ilja Honkonenb

aBilkent University, Computer Engineering Department, 06800 Ankara, Turkey
bFinnish Meteorological Institute, PO Box 503, Helsinki, FI-00101, Finland

Abstract:
This whitepaper describes the load-balancing performance issues that are observed and tackled during the petascaling of a space
plasma simulation code developed at the Finnish Meteorological Institute (FMI). It models the communication pattern as a
hypergraph, and partitions the computational grid using the parallel hypergraph partitioning scheme (PHG) of the Zoltan
partitioning framework. The result of partitioning determines the distribution of grid cells to processors. It is observed that the
partitioning phase takes a substantial percentage of the overall computation time. Alternative (graph-partitioning-based) schemes
that perform almost as well as the hypergraph partitioning scheme, while requiring less preprocessing overhead and providing
better balance, are proposed and investigated. A comparison of Zoltan's PHG, ParMeTiS, and PT-SCOTCH in terms of their effect
on running time, preprocessing overhead and load-balancing quality is presented, together with test results on the Juelich
BlueGene/P system.
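As a small illustration of how partitions produced by tools such as Zoltan, ParMeTiS or PT-SCOTCH can be compared (generic helper code, not part of the whitepaper), the sketch below computes two of the quality measures discussed above, edge cut and load imbalance, for a given assignment of grid cells to processes; the graph and partition vector are invented for the example.

```python
import numpy as np

def edge_cut(edges, part):
    """Number of graph edges whose endpoints land in different parts (communication proxy)."""
    return sum(1 for u, v in edges if part[u] != part[v])

def load_imbalance(weights, part, nparts):
    """Max part weight divided by average part weight (1.0 means perfect balance)."""
    loads = np.zeros(nparts)
    for cell, p in enumerate(part):
        loads[p] += weights[cell]
    return loads.max() / loads.mean()

# Toy 4x4 grid of cells, unit weights, partitioned into 4 parts by quadrant.
n = 4
cells = [(i, j) for i in range(n) for j in range(n)]
index = {c: k for k, c in enumerate(cells)}
edges = [(index[(i, j)], index[(i + di, j + dj)])
         for (i, j) in cells for di, dj in ((1, 0), (0, 1))
         if i + di < n and j + dj < n]
part = [2 * (i >= n // 2) + (j >= n // 2) for (i, j) in cells]
weights = np.ones(len(cells))

print("edge cut:", edge_cut(edges, part))
print("imbalance:", load_imbalance(weights, part, 4))
```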

Download PDF


Finite Element applications


Authors:
A. Abbà a*, A. Emerson b, M. Nini a, M. Restelli c, M. Tugnoli a

a* Dipartimento di Scienze e Tecnologie Aerospaziali, Politecnico di Milano, Via La Masa, 34, 20156 Milano, Italy

b Cineca, via Magnanelli 6/3, 40033 Casalecchio di Reno, Bologna, Italy

c Max-Planck-Institut für Plasmaphysik, Boltzmannstraße 2, D-85748 Garching, Germany

Abstract:
We present a performance analysis of the numerical code DG-comp which is based on a Local Discontinuous Galerkin method and designed for the simulation of compressible turbulent flows in complex geometries. In the code several subgrid-scale models for Large Eddy Simulation and a hybrid RANS-LES model are implemented. Within a PRACE Preparatory Access Project, the attention was mainly focused on the following aspects:

1. Testing of the parallel scalability on three different Tier-0 architectures available in PRACE (Fermi, MareNostrum3 and Hornet);

2. Parallel profiling of the application with the Scalasca tool;

3. Optimization of the I/O strategy with the HDF5 library.

Without any code optimizations it was found that the application demonstrated strong parallel scaling up to more than 1000 cores on Hornet and MareNostrum, and at least up to 16384 cores on Fermi. The performance characteristics giving rise to this behaviour were confirmed with Scalasca. As regards the I/O strategy, a modification to the code was made to allow the use of HDF5-formatted files for output. This enhancement resulted in an increase in performance for most input datasets and a significant decrease in the storage space required. Other data were collected on the influence of the optimal compiler options to employ on the different computer systems and on the influence of numerical libraries for the linear algebra computations in the code.
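A typical way to realize the HDF5-based output strategy mentioned above is collective parallel writing through h5py's MPI driver. The following fragment is a generic, hedged sketch that assumes h5py is built with parallel HDF5 support; the file name, dataset name and per-rank array are invented for the example and are not taken from DG-comp.

```python
import numpy as np
from mpi4py import MPI
import h5py  # must be built with parallel HDF5 for driver='mpio'

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local_n = 1000                                   # values owned by this rank (example)
local_data = np.full(local_n, rank, dtype=np.float64)

# Every rank opens the same file collectively; each then writes its own slab.
with h5py.File("solution.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("field", shape=(size * local_n,), dtype=np.float64)
    dset[rank * local_n:(rank + 1) * local_n] = local_data
```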

Download PDF


Authors: Mikko Byckling, Mika Malinen, Juha Ruokolainen, Peter Raback

CSC – IT Center for Science, Keilaranta 14, 02101 Espoo, Finland

Abstract:
Recent developments of Elmer finite element solver are presented. The applicability of the code to industrial problems
has been improved by introducing features for handling rotational boundary conditions with mortar finite elements. The
scalability of the code has been improved by making the code thread-safe and by multithreading some critical sections
of the code. The developments are described and some scalability results are presented.

Download PDF


Authors: X. Saez, E. Casoni, G. Houzeaux, M. Vazquez

Dept. of Computer Applications in Science and Engineering, Barcelona Supercomputing Center (BSC-CNS), 08034 Barcelona, Spain

Abstract:
While solid mechanics codes are now proven tools in both industry and research, the increasingly demanding requirements of both
sectors are fuelling the need for more computational power and more advanced algorithms. While commercial codes are widely used in
industry during the design process, they often lag behind academic codes in terms of computational efficiency. In fact, the commercial codes
are usually general purpose and include millions of lines of code. Massively parallel computers appeared only recently, and the adaptation of
these codes is going slowly. In the meantime, academia has adapted very quickly to the new computer architectures and now offers an attractive
alternative: not so much modeling breadth, but more accuracy.

Alya is a computational mechanics code developed at the Barcelona Supercomputing Center (BSC-CNS) that solves Partial Differential Equations
(PDEs) on non-structured meshes. To address the lack of an efficient parallel solid mechanics code, and motivated by the demand coming from
industrial partners, Alya-Solidz, the specific Alya module for solving computational solid mechanics problems, has been enhanced to treat large
complex problems involving solid deformations and fracture. Some of these developments have been carried out in the framework of the PRACE-2IP European project.

In this article a solid mechanics simulation strategy for parallel supercomputers based on a hybrid approach is presented. A hybrid
parallelization approach combining MPI tasks with OpenMP threads is proposed in order to exploit the different levels of parallelism of current
multicore architectures. This paper describes the strategy programmed in Alya and shows nearly optimal scalability results for some solid
mechanics problems.

Download PDF


Authors:
T. Kozubek, M. Jarosov, M. Mensik, A. Markopoulos

CE IT4Innovations, VSB-TU of Ostrava, 17. listopadu 15, 70833 Ostrava, Czech Republic

Abstract:
We describe a hybrid FETI (Finite Element Tearing and Interconnecting) method based on our variant of the FETI-type
domain decomposition method called Total FETI. In our approach a small number of neighboring subdomains is
aggregated into clusters, which results in a smaller coarse problem. To solve the original problem the Total FETI
method is applied twice: to the clusters (macro-subdomains) and then to the subdomains in each cluster. This approach
simplifies the implementation of hybrid FETI methods and makes it possible to extend the parallelization of the original
problem up to tens of thousands of cores, owing to the coarse space reduction and thus lower memory requirements.
The performance is demonstrated on a linear elasticity benchmark.

Download PDF


Authors:
T. Kozubek, D. Horak, V. Hapla

CE IT4Innovations, VSB-TU of Ostrava, 17. listopadu 15, 70833 Ostrava, Czech Republic

Abstract:
Most of the computations (subdomain problems) appearing in FETI-type methods are purely local and therefore parallelizable
without any data transfers. However, if we also want to accelerate the dual actions, some communication is needed due
to the primal-dual transition. The distribution of the primal matrices is quite straightforward: each core works with the local
part associated with its subdomains. A natural approach on massively parallel computers is to maximize the number of
subdomains so that the sizes of the subdomain stiffness matrices are reduced, which accelerates their factorization and the
subsequent pseudoinverse application, which are among the most time-consuming operations. On the other hand, a negative
effect of this is an increase of the null space dimension and of the number of Lagrange multipliers on the subdomain
interfaces, i.e. the dual dimension, so that the bottleneck of the TFETI method becomes the application of the projector onto
the natural coarse space, especially the part called the coarse problem solution. In this paper, we suggest and test different
parallelization strategies for the coarse problem solution with regard to improvements of the massively parallel TFETI
implementation. Simultaneously we discuss some details of our FLLOP (FETI Light Layer on PETSc) implementation and
demonstrate its performance on an engineering elastostatic benchmark of a car engine block with up to almost 100 million
DOFs. The best parallelization strategy, based on MUMPS, was implemented into the multi-physical finite element based
open-source code ELMER developed by CSC, Finland.
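For readers unfamiliar with the terminology, the following standard TFETI relations (reproduced here as background, not taken from the whitepaper itself) show where the coarse problem appears: the dual problem is solved by projected iterations, and every application of the projector requires a solve with the coarse matrix G G^T, which is exactly the step whose parallelization is discussed above.

```latex
% TFETI dual problem and coarse-space projector (standard notation:
% K .. block-diagonal stiffness matrix, K^{+} .. its pseudoinverse,
% B .. constraint matrix, R .. basis of ker K, f .. load vector)
\min_{\lambda}\ \tfrac{1}{2}\,\lambda^{T} F \lambda - \lambda^{T} d
\quad\text{s.t.}\quad G\lambda = e,
\qquad
F = B K^{+} B^{T},\;\; G = R^{T} B^{T},\;\; d = B K^{+} f,\;\; e = R^{T} f.

% Projector onto the natural coarse space; each application needs a
% "coarse problem" solve with G G^{T}:
P = I - G^{T}\,(G G^{T})^{-1} G.
```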

Download PDF


Authors: T. Kozubeka, V. Vondraka, P. Rabackb, J. Ruokolainenb

a Department of Applied Mathematics, VSB-TU of Ostrava, 17. listopadu 15, 70833 Ostrava, Czech Republic
b CSC – IT Center for Science, Keilaranta 14 a, 20101 Espoo, Finland

Abstract:
The bottlenecks related to the numerical solution of many engineering problems are very dependent on the techniques
used to solve the systems of linear equations that result from their linearizations and finite element discretizations. The
large linearized problems can be solved efficiently using so-called scalable algorithms based on multigrid or domain
decomposition methods. In cooperation with the Elmer team two variants of the domain decomposition method have
been implemented into Elmer: (i) FETI-1 (Finite Element Tearing and Interconnecting) introduced by Farhat and Roux
and (ii) Total FETI introduced by Dostal, Horak, and Kucera. In the latter, the Dirichlet boundary conditions are torn
off to have all subdomains floating, which makes the method very flexible. In this paper, we review the results related
to the efficient solution of symmetric positive semidefinite systems arising in FETI methods when they are applied to
elliptic boundary value problems. More specifically, we show three different strategies to find the so-called fixing nodes
(or DOFs – degrees of freedom), which enable an effective regularization of the corresponding subdomain system matrices
that eliminates the work with singular matrices. The performance is illustrated on an elasticity benchmark computed
using ELMER on the French Tier-0 system CURIE.

Download PDF


Authors: J. Ruokolainena, P. Rabacka,*, M. Lylya, T. Kozubekb, V. Vondrakb, V. Karakasisc, G. Goumasc

a CSC – IT Center for Science, Keilaranta 14 a, 20101 Espoo, Finland
b Department of Applied Mathematics, VSB – Technical University of Ostrava, 17. listopadu 15, 70833 Ostrava Poruba, Czech Republic
c ICCS-NTUA, 9, Iroon. Polytechniou Str., GR-157 73 Zografou, Greece

Abstract:
Elmer is a finite element software package for the solution of multiphysical problems. In the present work some performance
bottlenecks in the workflow are eliminated. In preprocessing, the mesh splitting scheme is improved to allow the conservation
of mesh grading for simple problems. For the solution of linear systems a preliminary FETI domain decomposition
method is implemented. It utilizes a direct factorization of the local problem and an iterative method for joining the
results from the subproblems. The weak scaling of FETI is shown to be nearly ideal, with the number of iterations staying
almost fixed. For postprocessing, binary output formats and an XDMF+HDF5 I/O routine are implemented. Both may
be used in conjunction with parallel visualization software.

Download PDF



Authors: Vasileios Karakasis1, Georgios Goumas1, Konstantinos Nikas2,*, Nectarios Koziris1, Juha Ruokolainen3, and Peter Raback3

1Institute of Communication and Computer Systems (ICCS), Greece
2Greek Research & Technology Network (GRNET), Greece
3CSC - IT Center for Science Ltd., Finland

Abstract:
Multiphysics simulations are at the core of modern Computer Aided Engineering (CAE) allowing the analysis of multiple,
simultaneously acting physical phenomena. These simulations often rely on Finite Element Methods (FEM) and the
solution of large linear systems which, in turn, end up in multiple calls of the costly Sparse Matrix-Vector Multiplication
(SpMV) kernel. We have recently proposed the Compressed Sparse eXtended (CSX) format, which applies aggressive
compression to the column indexing structure of the CSR format and is able to provide an average performance improvement of more than 40% over multithreaded CSR implementations. This work integrates CSX into the Elmer multiphysics
simulation software and evaluates its impact on the total execution time of the solver. Despite its preprocessing cost,
CSX is able to improve the performance of Elmer's SpMV component by almost 40% (compared to multithreaded CSR)
and provides an up to 15% performance gain in the overall solver time after 1000 linear system iterations. To our
knowledge, this is one of the first attempts to evaluate the real impact of an innovative sparse-matrix storage format
within a 'production' multiphysics software package.

Download PDF


Authors:
K. Georgiev, N. Kosturski, I. Lirkov, S. Margenov, Y. Vutov

National Center for Supercomputing Applications, Acad. G. Bonchev str, Bl. 25-A, 1113 Sofia, Bulgaria

Abstract:
The White Paper content is focused on: a) construction and analysis of novel scalable algorithms to enhance scientific
applications based on mesh methods (mainly on finite element method (FEM) technology); b) optimization of a new class of
algorithms on many-core systems.

On the one hand, the commonly accepted benchmark problem in computational fluid dynamics (CFD) – the time dependent system
of incompressible Navier-Stokes equations – is considered. The activities were motivated by advanced large-scale simulations of
turbulent flows in the atmosphere and in the ocean, simulation of multiphase flows in order to extract average statistics, solving
subgrid problems as part of homogenization procedures, etc. The computer model is based on the implementation of a new class of
parallel numerical methods and algorithms for time dependent problems. It only requires the solution of tridiagonal linear systems
and is therefore computationally very efficient, with a computational complexity of the same order as that of an explicit
scheme, and yet unconditionally stable. The scheme is particularly convenient for parallel implementation. Among the most
important novel ideas is to avoid the transposition which is usually used in alternating-directions time stepping algorithms. The
final goal is to provide portable tools for integration in commonly accepted codes like Elmer and OpenFOAM. The newly
developed software is organized as a computer library for use by researchers dealing with the solution of the incompressible
Navier-Stokes equations.

On the other hand, we implement and develop new scalable algorithms and software for FEM simulations with typically O(10^9)
degrees of freedom in space for an IBM Blue Gene/P computer. We have considered voxel and unstructured meshes; stationary
and time dependent problems; linear and nonlinear models. The performed work was focused on the development of scalable
mesh methods, and tuning of the related software tools mainly to the IBM Blue Gene/P architecture, but other massively parallel
computers and MPI clusters were taken into account too. Efficient algorithms for time stepping, mesh refinement and parallel
mappings were implemented. The aim here is again to provide software tools for integration in Elmer and OpenFOAM. The
computational models address discrete problems in the range of O(10^9) degrees of freedom in space. The related time stepping
techniques and iterative solvers are targeted to meet the Tier-1 and (further) Tier-0 requirements.

Scalability on 512 IBM Blue Gene/P nodes and on several other high performance computing clusters is currently documented for
the tested software modules, and some of the results are presented in this paper. Comparison results of running the Elmer code on
an Intel cluster (16 cores, Intel Xeon X5560) and on the IBM Blue Gene/P computer can be found. Variants of 1D, 2D and 3D domain
partitioning for the 3D test problems were systematically analysed, showing the advantages of the 3D partitioning for the Blue
Gene/P communication system.
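Since the scheme above reduces each time step to tridiagonal solves, a worked illustration of such a solve may be helpful. The following is a generic Thomas-algorithm sketch, not code from the library described above, applied to an arbitrary diagonally dominant example system.

```python
import numpy as np

def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system: sub-diagonal a, diagonal b, super-diagonal c, rhs d.

    a[0] and c[-1] are ignored. The solve is O(n), which is why schemes that
    only need tridiagonal solves cost about as much as an explicit time step.
    """
    n = len(d)
    cp = np.empty(n)
    dp = np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Diagonally dominant test system.
n = 6
a = np.full(n, -1.0)
b = np.full(n, 4.0)
c = np.full(n, -1.0)
d = np.arange(1.0, n + 1.0)
x = thomas_solve(a, b, c, d)
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
print("max residual:", np.max(np.abs(A @ x - d)))
```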

Download PDF


Fusion applications


Authors: Xavier Saeza, Taner Akguna, Edilberto Sanchezb

a Barcelona Supercomputing Center – Centro Nacional de Supercomputacion, C/ Gran Capita 2-4, Barcelona, 08034, Spain
b Laboratorio Nacional de Fusion, Avda Complutense 22, 28040 Madrid, Spain

Abstract:
In this paper we report the work done in Task 7.2 of the PRACE-1IP project on the code EUTERPE. We report on the progress
made on the hybridization of the code with MPI and OpenMP; the status of the porting to GPUs; the outline of the analysis of
parameters; and the study of the possibility of incorporating I/O forwarding to improve performance. Our initial findings
indicate that particle-in-cell algorithms such as the one in EUTERPE are suitable candidates for the new computing paradigms
involving heterogeneous architectures.

Download PDF


Life Science applications


Authors:
Andrew Sunderland, Martin Plummer

STFC Daresbury Laboratory, Warrington, United Kingdom

Abstract:
DNA oxidative damage has long been associated with the development of a variety of cancers including colon, breast and
prostate, whilst RNA damage has been implicated in a variety of neurological diseases, such as Alzheimer’s disease and
Parkinson’s disease. Radiation damage arises when energy is deposited in cells by ionizing radiation, which in turn leads to
strand breaks in DNA. The strand breaks are associated with electrons trapped in quasi-bound ‘resonances’ on the basic
components of the DNA. HPC usage will enable the study of this resonance formation in much more detail than in current initial
calculations. The associated application is UKRmol [1], a widely used, general-purpose electron-molecule collision package and
the enabling aim is to replace a serial propagator (coupled PDE solver) with a parallel equivalent module.

Download PDF


Authors:
R. Oguz Selvitopi, Gunduz Vehbi Demirci, Ata Turk, Cevdet Aykanat

Bilkent University, Computer Engineering Department, 06800 Ankara, TURKEY

Abstract:
This whitepaper addresses applicability of the Map/Reduce paradigm for scalable and easy parallelization of fundamental data mining approaches with the aim of exploring/enabling processing of terabytes of data on PRACE Tier-0 supercomputing systems. To this end, we first test the usage of MR-MPI library, a lightweight Map/Reduce implementation that uses the MPI library for inter-process communication, on PRACE HPC systems; then propose MR-MPI-based implementations of a number of machine learning algorithms and constructs; and finally provide experimental analysis measuring the scaling performance of the proposed implementations. We test our multiple machine learning algorithms with different datasets. The obtained results show that utilization of the Map/Reduce paradigm can be a strong enhancer on the road to petascale.
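To illustrate the map/collate/reduce pattern the whitepaper builds on (the sketch below uses plain mpi4py rather than the MR-MPI library, and the counting task and function names are invented for the example), each rank maps its local records to key-value pairs, keys are shuffled to an owner rank, and a reduce step aggregates the values per key.

```python
from collections import defaultdict
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Map: each rank turns its local records into (key, value) pairs.
local_records = [f"sample_{rank}_{i}" for i in range(5)]
pairs = [(rec.split("_")[0], 1) for rec in local_records]   # key = record type

# Shuffle (collate): send each pair to the rank that owns its key.
buckets = [[] for _ in range(size)]
for key, value in pairs:
    buckets[hash(key) % size].append((key, value))
received = comm.alltoall(buckets)   # list-of-lists exchange of Python objects

# Reduce: aggregate values per key on the owning rank.
counts = defaultdict(int)
for bucket in received:
    for key, value in bucket:
        counts[key] += value
print(f"rank {rank}: {dict(counts)}")
```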

Download PDF


Authors:
Thomas Roblitz, Ole W. Saastad, Hans A. Eide, Katerina Michalickova, Alexander Johan Nederbragt, Bastiaan Star

Department for Research Computing, University Center for Information Technology (USIT), University of Oslo, P.O. Box 1059, Blindern, 0316 Oslo, Norway

Center for Ecological and Evolutionary Synthesis, Department of Biosciences (CEES), University of Oslo, P.O. Box 1066, Blindern, 0316 Oslo, Norway

Abstract:
Sequencing projects, like the Aqua Genome project, generate vast amounts of data which are processed through
different workflows composed of several steps linked together. Currently, such workflows are often run manually on
large servers. With the increasing amount of raw data that approach is no longer feasible. The successful
implementation of the project's goals requires scaling the computing by 2-3 orders of magnitude, while at the same
time achieving high reliability and supporting ease-of-use of supercomputing resources. We describe two example
use cases, the implementation challenges and constraints, and the actual application enabling, and report our findings.

Download PDF


Authors: A. Charalampidoua,b, P. Daogloua,b, D. Foliasa,b, P. Borovskac,d, V. Ganchevac,e

a Greek Research and Technology Network, Athens, Greece
b Scientific Computing Center, Aristotle University of Thessaloniki, Greece
c National Centre for Supercomputing Applications, Bulgaria
d Department of Computer Systems, Technical University of Sofia, Bulgaria
e Department of Programming and Computer Technologies, Technical University of Sofia, Bulgaria

Abstract:
The project focuses on performance investigation and improvement of multiple biological sequence alignment software
MSA_BG on the BlueGene/Q supercomputer JUQUEEN. For this purpose, scientific experiments in the area of
bioinformatics have been carried out, using as case study influenza virus sequences. The objectives of the project are code
optimization, porting, scaling, profiling and performance evaluation of the MSA_BG software. To this end we have developed
a hybrid MPI/OpenMP parallelization on top of the MPI-only code and we showcase the advantages of this approach
through the results of benchmark tests that were performed on JUQUEEN. The experimental results show that the hybrid
parallel implementation provides considerably better performance than the original code.

Download PDF


Authors: Plamenka Borovska, Veska Gancheva

National Centre for Supercomputing Applications, Bulgaria

Abstract:
In silico biological sequence processing is a key task in molecular biology. This scientific area requires powerful computing
resources for exploring large sets of biological data. Parallel in silico simulations based on methods and algorithms for the analysis of
biological data using high-performance distributed computing are essential for accelerating the research and reducing the investment.
Multiple sequence alignment is a widely used method for biological sequence processing; its goal is the alignment of DNA and protein
sequences. This paper presents an innovative parallel algorithm, MSA_BG, for the multiple alignment of biological sequences
that is highly scalable and locality aware. The MSA_BG algorithm we describe is iterative and is based on the concept of Artificial
Bee Colony (ABC) metaheuristics and the concept of correlation between algorithmic and architectural spaces. The metaphor of the ABC
metaheuristics has been constructed and the functionalities of the agents have been defined. The conceptual parallel model of
computation has been designed and the algorithmic framework of the designed parallel algorithm constructed. Experimental
simulations based on the parallel implementation of the MSA_BG algorithm for multiple sequence alignment on a heterogeneous
compact computer cluster and on the BlueGene/P supercomputer have been carried out for the case study of influenza virus variability
investigation. The performance estimation and profiling analyses have shown that the parallel system is well balanced with
respect to both workload and machine size.

Download PDF


Authors: Soon-Heum Koa, Plamenka Borovskab‡, Veska Ganchevac†

a National Supercomputing Center, Linkoping University, 58183 Linkoping, Sweden
b Department of Computer Systems, Technical University of Sofia, Sofia, Bulgaria
c Department of Programming and Computer Technologies, Technical University of Sofia, Sofia, Bulgaria

Abstract:
This activity within the PRACE-2IP project aimed to investigate and improve the performance of the multiple sequence alignment
software ClustalW on the BlueGene/Q supercomputer JUQUEEN, for the case study of influenza virus sequences.
Porting, tuning, profiling, and scaling of this code have been accomplished in this respect. A parallel I/O interface has been designed
for efficient sequence dataset input, in which the local masters of sub-groups take care of the read operation and broadcast the dataset
to their slaves. The optimal group size has been investigated and the effect of the read buffer size on read performance has been
examined experimentally. The application to the ClustalW software shows that the current implementation with parallel I/O provides
considerably better performance than the original code in the I/O segment, leading to a speed-up of up to 6.8 times for reading the
input dataset when using 8192 JUQUEEN cores.
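The sub-group input scheme described above can be illustrated with a small mpi4py sketch (an illustration only, not the ClustalW implementation; the group size, file name and data are placeholders): the communicator is split into groups, each group's local master reads the shared input once, and it then broadcasts the data to the other ranks of its group.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

GROUP_SIZE = 4                                  # example sub-group size
group_comm = comm.Split(color=rank // GROUP_SIZE, key=rank)

if group_comm.Get_rank() == 0:
    # Only the local master of each sub-group touches the file system.
    # Placeholder data instead of e.g. open("sequences.fasta").read().
    dataset = f"<sequence data read once by global rank {rank}>"
else:
    dataset = None

# Local master broadcasts the dataset to the rest of its sub-group.
dataset = group_comm.bcast(dataset, root=0)
print(f"rank {rank} (group {rank // GROUP_SIZE}) has {len(dataset)} characters")
```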

Download PDF


Authors: D. Grancharov, E. Lilkova, N. Ilieva, P. Petkov, S. Markov and L. Litov

National Centre for Supercomputing Applications, Acad. G. Bonchev Str, Bl. 25-A, 1113 Sofia, Bulgaria

Abstract:
Based on an analysis of the performance, scalability, workload increase and distribution of the MD simulation packages GROMACS and
NAMD for very large systems and core numbers, we evaluate the possibilities for overcoming the deterioration of the scalability and
performance of the existing MD packages by implementing symplectic integration algorithms with multiple step sizes.
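A common form of symplectic integration with multiple step sizes is the r-RESPA splitting, where cheap, fast forces are sub-cycled inside each outer step for the expensive, slow forces. The sketch below is a generic one-particle illustration of that idea; the force definitions and step sizes are arbitrary choices, not taken from GROMACS, NAMD or the whitepaper.

```python
def respa_step(x, v, fast_force, slow_force, dt, n_inner):
    """One outer step of a velocity-Verlet-style r-RESPA integrator (unit mass).

    The slow force gets half-kicks at the ends of the outer step of size dt;
    the fast force is integrated with n_inner sub-steps of size dt / n_inner.
    """
    v += 0.5 * dt * slow_force(x)          # slow half-kick
    h = dt / n_inner
    for _ in range(n_inner):               # sub-cycle the fast force
        v += 0.5 * h * fast_force(x)
        x += h * v
        v += 0.5 * h * fast_force(x)
    v += 0.5 * dt * slow_force(x)          # slow half-kick
    return x, v

# Toy example: stiff spring (fast) plus weak spring (slow).
fast = lambda x: -100.0 * x
slow = lambda x: -1.0 * x
x, v = 1.0, 0.0
for _ in range(1000):
    x, v = respa_step(x, v, fast, slow, dt=0.01, n_inner=10)
energy = 0.5 * v**2 + 0.5 * 100.0 * x**2 + 0.5 * 1.0 * x**2
print("energy after 1000 steps:", energy)   # should stay near the initial 50.5
```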

Download PDF


Particle Physics applications


Authors:
Jacques David, Vincent Bergeaud
CEA/DEN/SA2P, CEA-Saclay, 91191 Gif-sur-Yvette, France
CEA/DEN/DM2S/LGLS, CEA-Saclay, 91191 Gif-sur-Yvette, France

Abstract:
Mathematical models designed to simulate complex physics are used in scientific and engineering studies. In the case of nuclear applications, numerical simulation is paramount for assessing safety parameters such as fuel temperature and for gaining confidence through comparison with experiment. The URANIE tool uses propagation methods to assess uncertainties in simulation output parameters in order to better evaluate confidence intervals (e.g., of temperature, pressure, etc.). This feeds the Verification, Validation and Uncertainty Quantification (VVUQ) process used for safety analysis.
While URANIE is well suited for launching many instances of serial codes, it suffers from a lack of scalability and portability when used for coupled simulations and/or parallel codes. The aim of the project is therefore to enhance this launching mechanism to support a wider variety of applications, leveraging HPC capabilities to reach a new level of statistical assessment for models.
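
To make the idea of uncertainty propagation concrete, the following C sketch samples an uncertain input parameter, evaluates a placeholder model for each sample and reports an approximate 95% interval of the output. The model, input distribution and sample size are purely illustrative; URANIE's propagation methods and its launching mechanism for real simulation codes are considerably richer.

```c
/* Toy illustration of uncertainty propagation by Monte Carlo sampling:
 * an uncertain input parameter is drawn from a normal distribution, a
 * placeholder model is evaluated for each sample, and an approximate
 * 95% interval of the output is reported. The model, distribution and
 * sample size are purely illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static const double PI = 3.14159265358979323846;

static double randn(double mu, double sigma)      /* Box-Muller sampling */
{
    double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
    return mu + sigma * sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
}

static double model(double k)                     /* placeholder "simulation" */
{
    return 600.0 + 25.0 * k + 3.0 * k * k;        /* e.g. a temperature-like output */
}

int main(void)
{
    const int nsamples = 10000;
    double sum = 0.0, sumsq = 0.0;

    for (int i = 0; i < nsamples; ++i) {
        double k = randn(1.0, 0.1);               /* uncertain input parameter */
        double t = model(k);
        sum   += t;
        sumsq += t * t;
    }

    double mean = sum / nsamples;
    double std  = sqrt(sumsq / nsamples - mean * mean);

    /* normal-approximation 95% interval for the output quantity */
    printf("mean = %.2f, std = %.2f, 95%% interval = [%.2f, %.2f]\n",
           mean, std, mean - 1.96 * std, mean + 1.96 * std);
    return 0;
}
```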

Download PDF


Authors:
Alexei Strelchenko, Marcus Petschlies and Giannis Koutsou

CaSToRC, Nicosia 2121, Cyprus

Abstract:
We extend the QUDA library, an open source library for performing lattice QCD calculations on Graphics Processing Units (GPUs) using NVIDIA's CUDA platform, to include kernels for non-degenerate twisted mass and multi-GPU Domain Wall fermion operators. Performance analysis is provided for both cases.

Download PDF


Mathematics applications


Authors: A. Artiguesa, G. Houzeauxa
aBarcelona Supercomputing Center

Abstract:
The Alya System is the BSC simulation code for multi-physics problems [1]. It is based on a Variational Multiscale Finite Element Method for unstructured meshes.
Work distribution is achieved by partitioning the original mesh into subdomains (submeshes). Until now, this pre-partitioning step has been done in serial by a single process using the metis library [2]. This is a major bottleneck when large meshes with millions of elements have to be partitioned: either the data does not fit in the memory of a single compute node, or, in the cases where it does fit, Alya takes too long in the partitioning step.
In this document we explain the work done to design, implement and test a new parallel partitioning algorithm for Alya, in which a subset of the workers is in charge of partitioning the mesh in parallel using the parmetis library [3].
The partitioning workers load consecutive parts of the main mesh using a parallel space-partitioning bin structure [4] that is capable of obtaining the adjacent boundary elements of their respective submeshes. With this local mesh, each partitioning worker can create its local element adjacency graph and partition the mesh.
We have validated the new algorithm using a Navier-Stokes problem on a small cube mesh of 1000 elements. We then performed a scalability test on a 30M-element mesh to check whether the partitioning time decreases proportionally with the number of partitioning workers.
We have also compared metis and parmetis with respect to the balance of the element distribution among the domains, to test how using many partitioning workers affects the scalability of Alya. These tests indicate that it is better to use fewer partitioning workers to partition the mesh.
Finally, two sections explain the results and the future work needed to finalise and improve the parallel partitioning algorithm.

Download PDF


Authors:
Krzysztof T. Zwierzynski*a
*aPoznan Supercomputing and Networking Center, ul. Z. Noskowskiego 12/14, 61-704 Poznan, Poland

Abstract:
In this paper we consider the problem of designing a self-improving meta-model of a job workflow that is sensitive to changes in the computational environment. Permutations and some classes of integral graphs are used as examples of the combinatorial objects searched for. We propose a number of dedicated methods, based on decision trees and on replicating some actors in the workflow, that can improve the execution time of the workflow.

Download PDF


Authors: Jerome Richarda,+, Vincent Lanoreb,+ and Christian Perezc,+
a University of Orleans, France
b Ecole Normale Superieure de Lyon, France
c Inria
+ Avalon Research-Team, LIP, ENS Lyon, France

Abstract: The Fast Fourier Transform (FFT) is a widely-used building block for many high-performance scientific applications. Efficient computation of the FFT is paramount for the performance of these applications. This has led to many efforts to implement machine- and computation-specific optimizations. However, no existing FFT library is capable of easily integrating and automating the selection of new and/or unique optimizations.
To ease FFT specialization, this paper evaluates the use of component-based software engineering, a programming paradigm which consists of building applications by assembling small software units. Component models are known to have many software engineering benefits but usually offer insufficient performance for high-performance scientific applications.
This paper uses the L2C model, a general-purpose high-performance component model, and studies its performance and adaptation capabilities on 3D FFTs. Experiments show that L2C, and components in general, enable easy handling of 3D FFT specializations while obtaining performance comparable to that of well-known libraries. However, a higher-level component model is needed to automatically generate an adequate L2C assembly.

Download paper: PDF


Authors: R. Oguz Selvitopia, Cevdet Aykanata,*
a Bilkent University, Computer Engineering Department, 06800 Ankara, TURKEY

Abstract: Parallel iterative solvers are widely used for solving large sparse linear systems of equations on large-scale parallel architectures. These solvers generally contain two different types of communication operations: point-to-point (P2P) and global collective communications. In this work, we present a computational reorganization method that exploits a property commonly found in Krylov subspace methods. This reorganization allows P2P and collective communications to be performed simultaneously. We use this opportunity to embed the content of the P2P messages into the messages exchanged in the collective communications, in order to reduce the latency overhead of the solver. Experiments on two different supercomputers with up to 2048 processors show that the proposed latency-avoiding method exhibits superior scalability, especially with an increasing number of processors.
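
A much simplified illustration of the piggybacking idea is sketched below in C/MPI: each rank packs its partial dot product together with the small halo segment it would otherwise send point-to-point, and a single collective carries both. This is only a conceptual sketch under assumed message sizes; the whitepaper's reorganization of the Krylov solver is more selective about which messages carry which content.

```c
/* Much simplified illustration of piggybacking P2P content on a
 * collective: each rank packs its partial dot product together with the
 * small halo segment it would otherwise send to a neighbour, and a
 * single MPI_Allgather carries both. Message sizes are assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define HALO 4   /* assumed (small) halo size per rank */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local_dot = (double)rank;             /* placeholder partial dot product */
    double sendbuf[1 + HALO];                    /* pack: [partial dot | halo data] */
    sendbuf[0] = local_dot;
    for (int i = 0; i < HALO; ++i)
        sendbuf[1 + i] = rank + 0.1 * i;         /* boundary values for neighbours */

    double *recvbuf = malloc(sizeof(double) * (1 + HALO) * size);
    MPI_Allgather(sendbuf, 1 + HALO, MPI_DOUBLE,
                  recvbuf, 1 + HALO, MPI_DOUBLE, MPI_COMM_WORLD);

    /* global dot product: sum the first slot of every rank's message */
    double global_dot = 0.0;
    for (int p = 0; p < size; ++p)
        global_dot += recvbuf[p * (1 + HALO)];

    /* halo data from the left neighbour, extracted from the same message */
    int left = (rank + size - 1) % size;
    const double *halo_in = &recvbuf[left * (1 + HALO) + 1];

    if (rank == 0)
        printf("global dot = %g, first halo value from rank %d = %g\n",
               global_dot, left, halo_in[0]);

    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Note that this naive allgather delivers every halo segment to every rank, which wastes bandwidth; it is intended only to show how P2P content can travel inside a collective message, not to reproduce the communication scheme evaluated in the whitepaper.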

Download paper: PDF

Authors:
Gunduz Vehbi Demirci, Ata Turk, R. Oguz Selvitopi, Kadir Akbudak, Cevdet Aykanat

Bilkent University, Computer Engineering Department, 06800 Ankara, TURKEY

Abstract:
This whitepaper addresses the applicability of the MapReduce paradigm to scientific computing by realizing it for the widely used sparse matrix-vector multiplication (SpMV) operation with a recent library developed for this purpose. Scaling SpMV operations is vital, as this kernel finds application in many scientific problems from different domains. Generally, the scalability of these operations is negatively affected by the high communication requirements of the multiplication, especially at large processor counts in the case of strong scaling. We propose two partitioning-based methods to reduce these requirements and allow SpMV operations to be performed more efficiently. We demonstrate how to parallelize SpMV operations using MR-MPI, an efficient and portable library that aims at enabling the use of the MapReduce paradigm in scientific computing. We test our methods extensively with different matrices. The obtained results show that the use of communication-efficient methods and constructs is required on the road to Exascale.
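
For reference, the following C sketch shows a generic 1D row-partitioned parallel SpMV in which the input vector is assembled with a collective before the local multiply; the cost of that communication step is what partitioning-based methods aim to reduce. This is a simplified, assumed baseline, not the MR-MPI-based implementation or the partitioning methods proposed in the whitepaper.

```c
/* Simplified 1D row-partitioned parallel SpMV (y = A*x): each rank owns
 * a block of rows in CSR format plus the matching block of x, and the
 * full input vector is assembled with a collective before the local
 * multiply. Data layout and distribution are assumptions for the sketch. */
#include <mpi.h>
#include <stdlib.h>

typedef struct {
    int nrows_local;    /* rows owned by this rank            */
    int ncols_global;   /* global number of columns           */
    int *rowptr;        /* CSR row pointers (nrows_local + 1) */
    int *colind;        /* global column indices              */
    double *val;        /* nonzero values                     */
} CsrBlock;

void spmv(const CsrBlock *A, double *x_local, int n_local,
          int *counts, int *displs, double *y_local, MPI_Comm comm)
{
    /* gather the full input vector (the communication-heavy step) */
    double *x_full = malloc(sizeof(double) * A->ncols_global);
    MPI_Allgatherv(x_local, n_local, MPI_DOUBLE,
                   x_full, counts, displs, MPI_DOUBLE, comm);

    /* local CSR multiply over the owned rows */
    for (int i = 0; i < A->nrows_local; ++i) {
        double sum = 0.0;
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; ++k)
            sum += A->val[k] * x_full[A->colind[k]];
        y_local[i] = sum;
    }
    free(x_full);
}
```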

Download paper: PDF


Authors: Petri Nikunena, Frank Scheinerb

a CSC – IT Center for Science, P.O. Box 405, FI-02101 Espoo, Finland

b High Performance Computing Center Stuttgart (HLRS),University of Stuttgart, D-70550 Stuttgart, Germany

Abstract:
Planck is a mission of the European Space Agency (ESA) to map the anisotropies of the cosmic microwave
background with the highest accuracy ever achieved. Planck is supported by several computing centres,
including CSC (Finland) and NERSC (USA). Computational resources were provided by CSC through the DECI
project Planck-LFI, and by NERSC as a regular production project. This whitepaper describes how PRACE-2IP
staff helped Planck-LFI with two types of support tasks: (1) porting their applications to the execution machine
and seeking ways to improve the applications' performance; and (2) improving the performance of, and facilities for, transferring data between the execution site and the different data centres where the data is stored.

Download paper: PDF


Author:
Krzysztof T. Zwierzynski

Poznan Supercomputing and Networking Center, ul. Z. Noskowskiego 12/14, 61-704 Poznan, Poland

Abstract:
In this white paper we report on work done on the problem of generating combinatorial structures with some rare invariant properties. These combinatorial structures are connected integral graphs. All (588) such graphs of order 1 ≤ n ≤ 12 are known. The main goal of this work was to reduce the generation time by distributing graph generators over hosts in the PRACE-RI, and to reduce the time spent sieving integral graphs by performing the eigenvalue calculation on a GPGPU device using OpenCL. This work is also a study of how to minimize the overhead associated with using OpenCL kernels.
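
The sieving criterion itself is simple: a graph is integral exactly when every eigenvalue of its adjacency matrix is an integer. The following CPU-side C sketch checks this with LAPACK; the GPGPU/OpenCL offload of the eigenvalue computation described in the whitepaper is not shown, and the numerical tolerance is an assumption.

```c
/* Minimal CPU-side sketch of the integrality sieve: a graph is integral
 * exactly when every eigenvalue of its adjacency matrix is an integer.
 * The spectrum is computed here with LAPACK (dsyev); the whitepaper
 * offloads this step to a GPGPU with OpenCL, which is not shown, and
 * the numerical tolerance is an assumption. */
#include <lapacke.h>
#include <math.h>
#include <stdlib.h>

/* adj: n*n symmetric 0/1 adjacency matrix in row-major order.
 * Returns 1 if all eigenvalues are integers within tol, 0 if not,
 * and -1 if the eigensolver fails. */
int is_integral_graph(const double *adj, int n, double tol)
{
    double *a = malloc(sizeof(double) * n * n);  /* dsyev overwrites its input */
    double *w = malloc(sizeof(double) * n);      /* eigenvalues                */
    for (int i = 0; i < n * n; ++i)
        a[i] = adj[i];

    lapack_int info = LAPACKE_dsyev(LAPACK_ROW_MAJOR, 'N', 'U', n, a, n, w);

    int integral = (info == 0) ? 1 : -1;
    if (integral == 1)
        for (int i = 0; i < n; ++i)
            if (fabs(w[i] - round(w[i])) > tol)
                integral = 0;

    free(a);
    free(w);
    return integral;
}
```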

Download paper: PDF


Authors: Dimitris Siakavaras, Konstantinos Nikas, Nikos Anastopoulos, and Georgios Goumas

Greek Research & Technology Network (GRNET), Greece

Abstract: This whitepaper studies various aspects and challenges of performance scaling on large-scale shared-memory systems.
Our experiments are performed on a large ccNUMA machine that consists of 72 IBM 3755 nodes connected with NumaConnect, providing shared memory over a total of 1728 cores, a number far beyond that of conventional server platforms. As benchmarks, three data-intensive and memory-bound applications with different communication patterns are selected, namely Jacobi, CSR SpM-V and Floyd-Warshall. Our results illustrate the need for NUMA-aware design and implementation of shared-memory parallel algorithms in order to achieve scaling to high core counts. At the same time, we observed that, depending on its communication pattern, an application can benefit more from explicit communication using message passing.
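
A central ingredient of NUMA-aware shared-memory programming is first-touch page placement, sketched below for a Jacobi-style kernel in C/OpenMP: the data is initialized with the same parallel loop structure and schedule as the compute phase, so each thread's pages end up on its local NUMA node. The problem size and kernel are illustrative assumptions, not the benchmarks from the whitepaper.

```c
/* Minimal sketch of NUMA-aware "first-touch" placement for a Jacobi-style
 * kernel: on Linux, memory pages are placed on the NUMA node of the thread
 * that first writes them, so initialization must use the same parallel
 * loop structure and schedule as the compute phase. Problem size and
 * kernel are illustrative assumptions. */
#include <stdlib.h>

#define N 4096

int main(void)
{
    double *u     = malloc(sizeof(double) * N * N);
    double *u_new = malloc(sizeof(double) * N * N);

    /* first touch in parallel: each thread initializes the portion it will
     * later update, so those pages land on its local NUMA node */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)N * N; ++i) {
        u[i] = 0.0;
        u_new[i] = 0.0;
    }

    /* Jacobi sweep with the same static schedule, so each thread mostly
     * touches memory that is local to its NUMA node */
    #pragma omp parallel for schedule(static)
    for (long i = 1; i < N - 1; ++i)
        for (long j = 1; j < N - 1; ++j)
            u_new[i * N + j] = 0.25 * (u[(i - 1) * N + j] + u[(i + 1) * N + j]
                                     + u[i * N + j - 1]   + u[i * N + j + 1]);

    free(u);
    free(u_new);
    return 0;
}
```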

Download paper: PDF


Disclaimer
These whitepapers have been prepared by the PRACE Implementation Phase Projects and in accordance with the Consortium Agreements and Grant Agreements n° RI-261557, n° RI-283493, or n° RI-312763.

They solely reflect the opinion of the parties to such agreements on a collective basis in the context of the PRACE Implementation Phase Projects and to the extent foreseen in such agreements. Please note that even though all participants to the PRACE IP Projects are members of PRACE AISBL, these whitepapers have not been approved by the Council of PRACE AISBL and therefore do not emanate from it nor should be considered to reflect PRACE AISBL's individual opinion.

Copyright notices
© 2014 PRACE Consortium Partners. All rights reserved. This document is a project document of a PRACE Implementation Phase project. All contents are reserved by default and may not be disclosed to third parties without the written consent of the PRACE partners, except as mandated by the European Commission contracts RI-261557, RI-283493, or RI-312763 for reviewing and dissemination purposes.

All trademarks and other rights on third party products mentioned in the document are
acknowledged as own by the respective holders.
