Quantum MD applications
Authors: Soon-Heum Ko a, Simen Reine b, Thomas Kjærgaard c
a National Supercomputing Centre, Linköping University, 581 83 Linköping, Sweden
b Centre for Theoretical and Computational Chemistry, Department of Chemistry, University of Oslo, P.O. Box 1033 Blindern, 0315 Oslo, Norway
c qLEAP – Center for Theoretical Chemistry, Department of Chemistry, Aarhus University, Langelandsgade 140, 8000 Aarhus C, Denmark
Abstract: In this paper, we present the performance of LSDALTON’s DFT method in large molecular simulations of biological interest. We focus primarily on evaluating the performance gain from applying the density-fitting (DF) scheme and the auxiliary density matrix method (ADMM). The enabling effort went towards finding the right build environment (the combination of compiler, MPI implementation and supporting libraries) that generates a fully 64-bit-integer-based binary. Using three biological molecules of varying size, we verify that the DF and ADMM schemes provide a considerable performance gain in the DFT code, at the cost of large memory consumption for storing extra matrices and a slight change in scalability characteristics for the ADMM calculation. In the insulin simulation, the parallel region of the code accelerates by 30 percent with the DF calculation and by 56 percent with the combined DF-ADMM calculation.
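As a reminder of what the DF approximation does (standard resolution-of-the-identity form; the notation is generic, not LSDALTON-specific), the four-center electron-repulsion integrals are assembled from three- and two-center integrals over an auxiliary basis:

(ab|cd) \approx \sum_{\alpha\beta} (ab|\alpha) \left[ (\alpha|\beta)^{-1} \right]_{\alpha\beta} (\beta|cd)

This trades integral-evaluation cost for the storage of the fitted intermediates, which is consistent with the extra memory consumption reported above.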
Authors: Mariusz Uchronski, Agnieszka Kwiecien, Marcin Gebarowski
WCSS, Wroclaw University of Technology, Wyb. Wyspianskiego 27, 50-370 Wroclaw, Poland
Abstract: CP2K is an application for atomistic and molecular simulation and, with its excellent scalability, is particularly important with regard to use on future exascale systems. The code is well parallelized using MPI and hybrid MPI/OpenMP, typically scaling well to 1 core per atom in the system. The research on CP2K done within PRACE-1IP showed that, due to the heavy use of sparse matrix multiplication for large systems, there is room for performance improvement. The main goal of this work, undertaken within PRACE-3IP, was to investigate the most time-consuming routines and port them to accelerators, particularly GPGPUs. The relevant area of the code that can be effectively accelerated is the matrix multiplication (the DBCSR library). A significant amount of work has already been done on the DBCSR library using CUDA. We focused on enabling the library on a potentially wider range of computing resources using the OpenCL and OpenACC technologies, to bring the overall application closer to exascale. We present the ports and promising performance results. The work done has led to the identification of a number of issues with using OpenACC in CP2K, which need to be further investigated and resolved to make the application and technology work better together.
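As an illustration of the elementary operation behind a DBCSR-style block-sparse product, here is a minimal C sketch with OpenACC directives; it is not the actual DBCSR or CP2K source, and the block size and batch count are invented for the example.

#include <stddef.h>

#define N  13      /* small dense block dimension (illustrative) */
#define NB 4096    /* number of block products in one batch (illustrative) */

/* Multiply a batch of small dense blocks: C_k += A_k * B_k for k = 0..NB-1.
 * A, B, C each hold NB contiguous N x N row-major blocks. */
void batched_block_multiply(const double *A, const double *B, double *C)
{
    #pragma acc parallel loop copyin(A[0:NB*N*N], B[0:NB*N*N]) copy(C[0:NB*N*N])
    for (int k = 0; k < NB; ++k) {
        const double *a = A + (size_t)k * N * N;
        const double *b = B + (size_t)k * N * N;
        double       *c = C + (size_t)k * N * N;
        #pragma acc loop collapse(2)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) {
                double s = c[i * N + j];
                for (int l = 0; l < N; ++l)
                    s += a[i * N + l] * b[l * N + j];
                c[i * N + j] = s;
            }
    }
}

An OpenCL port expresses the same batch as an NDRange kernel; the point in either case is that each block product is independent, so the batch maps naturally onto accelerator threads.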
Authors: J. A. Åström
CSC – IT Center for Science, Espoo, Finland
Abstract: NUMFRAC is a generic particle-based code for the simulation of non-linear mechanics in disordered solids. The generic theory behind the code is outlined and examples are given from glacier calving and fretting. This text is to a large degree a part of the publication: J. A. Åström, T. I. Riikilä, T. Tallinen, T. Zwinger, D. Benn, J. C. Moore, and J. Timonen, A particle based simulation model for glacier dynamics, The Cryosphere Discuss., 7, 921-941, 2013.
Authors: A. Calzolari a, C. Cavazzoni b
a Istituto Nanoscienze CNR-NANO-S3, I-41125 Modena, Italy
b CINECA – Via Magnanelli 6/3, 40033 Casalecchio di Reno (Bologna), Italy
Abstract: This work concerns the enabling of the Time-Dependent Density Functional Theory kernel (TurboTDDFT) of the Quantum ESPRESSO package on petascale systems. TurboTDDFT is a fundamental tool for investigating nanostructured materials and nanoclusters, whose optical properties are determined by their electronic excited states. Enabling TurboTDDFT on petascale systems will open up the possibility to compute optical properties for large systems relevant to technological applications. Plasmonic excitations in particular are important for a wide range of applications, from biological sensing and energy conversion to subwavelength waveguides. The goal of the present project was the implementation of novel strategies for reducing the memory requirements and improving the weak scalability of the TurboTDDFT code, aiming at a substantial improvement of the code’s capabilities and at the ability to study the plasmonic properties of metal nanoparticles (Ag, Au) and their dependence on the size of the system under test.
Authors: Massimiliano Guarrasi a, Sandro Frigio b, Andrew Emerson a and Giovanni Erbacci a
a CINECA, Italy
b University of Camerino, Italy
Abstract: In this paper we present part of the work carried out by CINECA in the framework of the PRACE-2IP project, aimed at studying the performance effect of implementing a 2D domain decomposition algorithm in DFT codes that use a standard 1D (or slab) parallel domain decomposition. The performance of this new algorithm is tested on two example applications: Quantum ESPRESSO, a popular code used in materials science, and the CFD code BlowupNS.
In the first part of this paper we present the codes that we use. In the last part we show the increase in performance obtained using this new algorithm.
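A minimal MPI sketch in C of the two decompositions (a generic illustration, not Quantum ESPRESSO or BlowupNS source; grid sizes are invented):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int NX = 128, NY = 128;  /* illustrative grid dimensions */
    int rank, P;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    /* 1D (slab): contiguous x-planes per task, so at most NX tasks. */
    int slab_lo = rank * NX / P, slab_hi = (rank + 1) * NX / P;

    /* 2D (pencil): factor P into Px x Py; each task owns a pencil
     * spanning the full z extent, lifting the NX-task limit. */
    int dims[2] = {0, 0};
    MPI_Dims_create(P, 2, dims);
    int px = rank % dims[0], py = rank / dims[0];
    int x_lo = px * NX / dims[0], x_hi = (px + 1) * NX / dims[0];
    int y_lo = py * NY / dims[1], y_hi = (py + 1) * NY / dims[1];

    printf("rank %d: slab x=[%d,%d)  pencil x=[%d,%d) y=[%d,%d)\n",
           rank, slab_lo, slab_hi, x_lo, x_hi, y_lo, y_hi);
    MPI_Finalize();
    return 0;
}

With slabs the parallelism saturates at NX tasks; with pencils the limit becomes NX x NY, which is what enables the performance increase shown in the last part of the paper.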
Authors: Al. Charalampidou a,b, P. Korosoglou a,b, F. Ortmann c, S. Roche c
a Greek Research and Technology Network, Athens, Greece
b Scientific Computing Center, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
c Catalan Institute of Nanotechnology, Spain
Abstract: This study has focused on an application for Quantum Hall transport simulations and, more specifically, on how to overcome an initially identified performance bottleneck related to the I/O of wave functions. These operations are required in order to enable and facilitate continuation runs of the code. After following several implementations for performing these I/O operations in parallel (using the MPI I/O library), we show that a performance gain of 1.5–2x can be achieved when switching from the initial POSIX-only approach to the parallel MPI I/O approach on both the CURIE and HERMIT PRACE Tier-0 systems. Moreover, we show that, because I/O throughput scales with an increasing number of cores, the overall performance of the code remains efficient up to at least 8192 processes.
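A minimal C sketch of the POSIX-to-MPI-I/O switch described above may be useful; it assumes equally sized contiguous chunks per rank, and the file name and layout are illustrative rather than taken from the code.

#include <mpi.h>

/* Collectively write each rank's chunk of the wave functions to one file. */
void write_wavefunctions(MPI_Comm comm, const double *psi, long nlocal)
{
    int rank;
    MPI_File fh;

    MPI_Comm_rank(comm, &rank);
    MPI_File_open(comm, "wavefunctions.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Offsets assume every rank owns exactly nlocal doubles. */
    MPI_Offset offset = (MPI_Offset)rank * nlocal * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, offset, psi, (int)nlocal, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}

The collective call lets the MPI library aggregate requests across ranks, which is where the 1.5–2x gain over rank-by-rank POSIX writes typically comes from.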
Authors: Martti Louhivuori a, Jussi Enkovaara a,b
a CSC – IT Center for Science Ltd., PO Box 405, 02101 Espoo, Finland
b Aalto University, Department of Applied Physics, PO Box 11100, 00076 Aalto, Finland
Abstract: In recent years, graphics processing units (GPUs) have generated a lot of excitement in the computational sciences by promising a significant increase in computational power compared to conventional processors. While this holds in many cases for small-scale computational problems that can be solved using the processing power of a single computing unit, the efficient usage of multiple GPUs in parallel over multiple interconnected computing units has been problematic. Increasingly, the real-life problems tackled by computational scientists require large-scale parallel computing, and it is thus crucial that GPU-enabled software reach good parallel scalability to reap the benefits of GPU acceleration. This is exactly what has been achieved for GPAW, a popular quantum chemistry program, by Hakala et al. in their recent work.
Authors: Simen Reine a, Thomas Kjærgaard a, Trygve Helgaker a, Ole Widar Saastad b, Andrew Sunderland c
a Centre for Theoretical and Computational Chemistry (CTCC), Department of Chemistry, University of Oslo, Oslo, Norway
b University Center for Information Technology, University of Oslo, Oslo, Norway
c STFC Daresbury Laboratory, Warrington, United Kingdom
Abstract: Linear Scaling DALTON (LSDALTON) is a powerful molecular electronic structure program that is the focus of software optimization projects in PRACE 1IP-WP7.2 and PRACE 1IP-WP7.5. This part of the project focuses on the introduction of parallel diagonalization routines from the ScaLAPACK library into the latest MPI version of LSDALTON. The parallelization work has involved three main tasks: i) redistribution of the matrices assembled for the SCF cycle from a serial/distributed state to the two-dimensional block-cyclic data distribution used by PBLAS and ScaLAPACK; ii) interfacing of LSDALTON data structures to the parallel diagonalization routines in ScaLAPACK; iii) performance testing to determine the favoured ScaLAPACK eigensolver methodology.
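Task i) amounts to applying the standard block-cyclic index mapping that PBLAS and ScaLAPACK expect. A one-dimensional C sketch (zero-based indices, first block on process 0; the same mapping applied to rows and columns gives the full 2D layout):

/* For global index g, block size nb and pr processes in this dimension,
 * return the owning process and the index in its local array. */
static inline void block_cyclic_map(int g, int nb, int pr,
                                    int *owner, int *local)
{
    int block = g / nb;                  /* which block g falls into        */
    *owner = block % pr;                 /* blocks are dealt out cyclically */
    *local = (block / pr) * nb + g % nb; /* position in the local storage   */
}

Redistribution then consists of copying every matrix element to the process this mapping selects before the ScaLAPACK eigensolver is called.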
Authors: Fabio Affinito a, Emanuele Coccia b, Sandro Sorella c, Leonardo Guidoni b
a CINECA, Casalecchio di Reno, Italy
b Università dell’Aquila, L’Aquila, Italy
c SISSA, Trieste, Italy
Abstract: Quantum Monte Carlo (QMC) methods are a promising technique for the study of the electronic structure of correlated molecular systems. The technical goal of the present project is to demonstrate the scalability of the TurboRVB code for a series of systems with different properties in terms of number of electrons, number of variational parameters and size of the basis set.
Authors: Iain Bethune a, Adam Carter a, Xu Guo a, Paschalis Korosoglou b,c
a EPCC, The University of Edinburgh, James Clerk Maxwell Building, The King’s Buildings, Edinburgh EH9 3JZ, United Kingdom
b AUTH, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
c GRNET, Greek Research & Technology Network, L. Mesogeion 56, Athens 11527, Greece
Abstract: CP2K is a powerful materials science and computational chemistry code and is widely used by research groups across Europe and beyond. The recent addition of a linear scaling KS-DFT method within the code has made it possible to simulate systems of unprecedented size – 1,000,000 atoms or more – making full use of Petascale computing resources. Here we report on work undertaken within PRACE-1IP WP7.1 to port and test CP2K on Jugene, the PRACE Tier-0 BlueGene/P system. In addition, development work was performed to reduce the memory usage of a key data structure within the code, to make it more suitable for the limited-memory environment of the BlueGene/P. Finally, we present a set of benchmark results and an analysis of a large test system.
Authors: Luigi Genovese a,b, Brice Videau a, Thierry Deutsch a, Huan Tran c, Stefan Goedecker c
a Laboratoire de Simulation Atomistique, SP2M/INAC/CEA, 17 Av. des Martyrs, 38054 Grenoble, France
b European Synchrotron Radiation Facility, 6 rue Horowitz, BP 220, 38043 Grenoble, France
c Institut für Physik, Universität Basel, Klingelbergstr. 82, 4056 Basel, Switzerland
Abstract: Electronic structure calculations (DFT codes) are certainly among the disciplines for which an increase in computational power corresponds to an advancement in scientific results. In this report, we present the ongoing advancement of a DFT code that can run on massively parallel, hybrid and heterogeneous CPU-GPU clusters. This DFT code, named BigDFT, is delivered under the GNU GPL license either in a stand-alone version or integrated into the ABINIT software package. Hybrid BigDFT routines were initially ported with NVIDIA’s CUDA language, and recently more functionalities have been added with new routines written in the Khronos OpenCL standard. The formalism of this code is based on Daubechies wavelets, which form a systematic real-space basis set. The properties of this basis set are well suited for an extension to a GPU-accelerated environment. In addition to focusing on the performance of the MPI and OpenMP parallelisation of the BigDFT code, this presentation also covers the usage of GPU resources in a complex code with different kinds of operations. A discussion of the present and expected performance of hybrid-architecture computation in the framework of electronic structure calculations is also addressed.
Authors: Ruyman Reyes a, Iain Bethune a
a EPCC, The University of Edinburgh, James Clerk Maxwell Building, Mayfield Road, Edinburgh, EH9 3JZ,UK
Abstract: This report describes the results of a PRACE Preparatory Access Type C project to optimise the implementation of Møller-Plesset second-order perturbation theory (MP2) in CP2K, to allow it to be used efficiently on the PRACE Research Infrastructure. The work consisted of three stages: firstly, serial optimisation of several key computational kernels; secondly, an OpenMP implementation of the parallel 3D Fourier transform to support mixed-mode MPI/OpenMP use of CP2K; and thirdly, benchmarking of the performance gains achieved by the new code on HERMIT for a test case representative of proposed production simulations. Consistent speedups of 8% were achieved in the integration kernel routines as a result of the serial optimisation. When using 8 OpenMP threads per MPI process, speedups of up to 10x for the 3D FFT were achieved, and for some combinations of MPI processes and OpenMP threads, overall speedups of 66% for the whole code were measured. As a result of this work, a proposal for full PRACE Project Access has been submitted.
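CP2K carries its own FFT layer, so the following is only a generic C illustration of the threading strategy: a 3D transform decomposed into independent 1D transforms, with OpenMP distributing them over threads through FFTW’s thread-safe new-array execute interface.

#include <stddef.h>
#include <complex.h>
#include <fftw3.h>

/* One pass of a 3D FFT: transform along z for every (x,y) column.
 * Layout is data[x][y][z] with z contiguous; nx, ny, nz are the grid. */
void fft_z_pass(fftw_complex *data, int nx, int ny, int nz)
{
    /* Plan once for a single contiguous 1D transform. */
    fftw_plan plan = fftw_plan_dft_1d(nz, data, data,
                                      FFTW_FORWARD, FFTW_ESTIMATE);

    /* Every column is independent: thread over the (x,y) plane. */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < nx; ++i)
        for (int j = 0; j < ny; ++j) {
            fftw_complex *col = data + ((size_t)i * ny + j) * nz;
            fftw_execute_dft(plan, col, col); /* thread-safe re-execution */
        }

    fftw_destroy_plan(plan);
}

Analogous passes along x and y (after transposes) complete the 3D transform; threading them is what removes the FFT from the critical path when OpenMP threads are added to each MPI process.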
Authors: Peicho Petkov a, Petko Petkov b,*, Georgi Vayssilov b, Stoyan Markov c
a Faculty of Physics, University of Sofia, 1164 Sofia, Bulgaria
b Faculty of Chemistry, University of Sofia, 1164 Sofia, Bulgaria
c National Centre for Supercomputing Applications, Sofia, Bulgaria
Abstract: The reported work aims at the implementation of a method allowing realistic simulation of large or extra-large biochemical systems (of 10^6 to 10^7 atoms) with first-principles quantum chemical methods. Current methods treat the whole system simultaneously. In this way the computational time increases rapidly with the size of the system and does not allow efficient parallelization of the calculations, due to the mutual interactions between the electron density in all parts of the system. In order to avoid these problems we implemented a version of the Fragment Molecular Orbital (FMO) method, in which the whole system is divided into fragments that are calculated separately. This approach assures nearly linear scaling of the compute time with the size of the system and provides efficient parallelization of the job. The work includes the development of pre- and post-processing components for automatic division of the system into monomers and for reconstruction of the total energy and electron density of the whole system.
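The energy reconstruction follows the usual two-body FMO expansion (textbook form; the abstract does not state the n-body level, so this is the simplest variant):

E \approx \sum_{I} E_I + \sum_{I>J} \left( E_{IJ} - E_I - E_J \right)

The monomer energies E_I are independent calculations, which is the source of both the near-linear scaling and the efficient parallelization; the dimer corrections E_{IJ} restore the inter-fragment interactions.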
Authors: Iain Bethune a, Adam Carter a, Kevin Stratford a, Paschalis Korosoglou b,c
a EPCC, The University of Edinburgh, James Clerk Maxwell Building, The King’s Buildings, Edinburgh, EH9 3JZ, United Kingdom
Abstract: This report describes the work undertaken under PRACE-1IP to support the European scientific communities who make use of CP2K in their research. This was done in two ways: firstly, by improving the performance of the code for a wide range of usage scenarios; and secondly, by testing and installing the updated code on the PRACE CURIE supercomputer. We believe this approach both supports existing user communities by delivering better application performance, and demonstrates to potential users the benefits of using optimized and scalable software like CP2K on the PRACE infrastructure.
Authors: Jussi Enkovaara a,*, Martti Louhivuori a, Petar Jovanovic b, Vladimir Slavnic b, Mikael Rännar c
a CSC – IT Center for Science, P.O. Box 405, FI-02101 Espoo, Finland
b Scientific Computing Laboratory, Institute of Physics Belgrade, Pregrevica 118, 11080 Belgrade, Serbia
c Department of Computing Science, Umeå University, SE-901 87 Umeå, Sweden
Abstract: GPAW is a versatile software package for first-principles simulations of nanostructures utilizing density-functional theory and time-dependent density-functional theory. Even though GPAW is already used for massively parallel calculations on several supercomputer systems, some performance bottlenecks still exist. First, the implementation based on the Python programming language introduces an I/O bottleneck during initialization which becomes serious when using thousands of CPU cores. Second, the current linear-response time-dependent density-functional theory implementation contains a large matrix which is replicated on all CPUs. When reaching for larger and larger systems, memory runs out due to this replication. In this report, we discuss the work done on resolving these bottlenecks. In addition, we have also worked on optimization aspects that are directed more towards future usage. As the number of cores in multicore CPUs is still increasing, a hybrid parallelization combining shared-memory and distributed-memory parallelization is becoming appealing. We have experimented with hybrid OpenMP/MPI and report here the initial results. GPAW also performs large dense matrix diagonalizations with the ScaLAPACK library. Due to limitations in ScaLAPACK these diagonalizations are expected to become a bottleneck in the future, which has led us to investigate alternatives to ScaLAPACK.
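As a generic C illustration of the hybrid OpenMP/MPI mode discussed above (GPAW itself is Python plus C extensions, so this is only the shape of the approach):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* FUNNELED: only the master thread will make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* Compute-heavy kernels (stencils, dense algebra) run threaded. */
        #pragma omp master
        printf("rank %d using %d threads\n", rank, omp_get_num_threads());
    }

    /* Communication stays on the master thread, outside parallel regions,
     * so fewer and larger MPI messages are exchanged per node. */
    MPI_Finalize();
    return 0;
}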
Authors: Simen Reine a, Thomas Kjærgaard a, Trygve Helgaker a, Olav Vahtras b,d, Zilvinas Rinkevicius b,g, Bogdan Frecus b, Thomas W. Keal c, Andrew Sunderland c, Paul Sherwood c,
Michael Schliephake d, Xavier Aguilar d, Lilit Axner d, Maria Francesca Iozzi e, Ole Widar Saastad e, Judit Gimenez f
a Centre for Theoretical and Computational Chemistry (CTCC), Department of Chemistry, University of Oslo, P.O. Box 1033 Blindern, N-0315 Oslo, Norway
b KTH Royal Institute of Technology, School of Biotechnology, Division of Theoretical Chemistry & Biology, S-106 91 Stockholm, Sweden
c Computational Science & Engineering Department, STFC Daresbury Laboratory, Daresbury Science and Innovation Campus, Warrington, Cheshire, WA4 4AD, UK
d PDC Center for High Performance Computing at Royal Institute of Technology (KTH), Teknikringen 14, 100 44 Stockholm, Sweden
e University Center for Information Technology, University of Oslo, P.O. Box 1059 Blindern, N-0316 Oslo, Norway
f Computer Sciences – Performance Tools, Barcelona Supercomputing Center, Campus Nord UPC C6, C/ Jordi Girona 1-3, 08034 Barcelona, Spain
g KTH Royal Institute of Technology, Swedish e-Science Center (SeRC), S-100 44 Stockholm, Sweden
Abstract: The work aims at evaluating the performance of DALTON on different platforms and at implementing new strategies to enable the code for petascaling. The activities have been organized into four tasks within the PRACE project: (i) analysis of the current status of the DALTON quantum mechanics (QM) code and identification of bottlenecks, implementation of several performance improvements of DALTON QM, and a first attempt at hybrid parallelization; (ii) implementation of MPI integral components into LSDALTON, improvements of optimization and scalability, and interfacing of matrix operations to PBLAS and ScaLAPACK numerical library routines; (iii) interfacing of the DALTON and LSDALTON QM codes to the ChemShell quantum mechanics/molecular mechanics (QM/MM) package and benchmarking of QM/MM calculations using this approach; (iv) analysis of the impact of DALTON QM system components with Dimemas. Part of the results reported here has been achieved through collaboration with the ScalaLife project.
Abstract: In this paper we present development work carried out on the Quantum ESPRESSO software package within PRACE-1IP. We describe the different activities performed to enable the Quantum ESPRESSO user community to challenge the frontiers of science by running extreme computing simulations on European Tier-0 systems of the current and next generation. Three main activities are described: 1) the improvement of parallelization efficiency in two DFT-based applications, Nuclear Magnetic Resonance (NMR) and EXact-eXchange (EXX) calculations; 2) the introduction of innovative van der Waals interactions at the ab initio level; 3) the porting of the PWscf code to hybrid systems equipped with NVIDIA GPU technology.
Authors: Dusan Stankovic a,*, Aleksandar Jovic a, Petar Jovanovic a, Dusan Vudragovic a, Vladimir Slavnic a
a Institute of Physics Belgrade, Serbia
Abstract: In this whitepaper we report on work done on enabling support for the FFTE Fast Fourier Transform library in Quantum ESPRESSO, on enabling threading for the FFTW3 library already supported in Quantum ESPRESSO (previously only in a serial version), and on benchmarking and comparing their performance with the existing implementations of FFT in Quantum ESPRESSO.
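For concreteness, enabling threading for FFTW3 means initialising FFTW’s own thread support before planning, along the following lines (grid size and thread count are illustrative; link with -lfftw3_threads -lfftw3):

#include <stddef.h>
#include <fftw3.h>

int main(void)
{
    const int n = 128;                  /* illustrative 3D grid edge */

    fftw_init_threads();                /* once, before any plan     */
    fftw_plan_with_nthreads(8);         /* threads for later plans   */

    fftw_complex *a = fftw_alloc_complex((size_t)n * n * n);
    fftw_plan p = fftw_plan_dft_3d(n, n, n, a, a,
                                   FFTW_FORWARD, FFTW_MEASURE);
    fftw_execute(p);                    /* runs multi-threaded */

    fftw_destroy_plan(p);
    fftw_free(a);
    fftw_cleanup_threads();
    return 0;
}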
Classical MD applications
Authors: Mariusz Uchronski a, Agnieszka Kwiecien a,*, Marcin Gebarowski a, Justyna Kozlowska a
a WCSS, Wroclaw University of Technology, Wyb. Wyspianskiego 27, 50-370 Wroclaw, Poland
Abstract: The prototypes evaluated within the PRACE-2IP project provide a range of different computing hardware, including general-purpose graphics processing units (GPUs) and accelerators such as the Intel Xeon Phi. In this work we evaluated the performance and energy consumption of two prototypes when used for a real-case simulation. Due to the heterogeneity of the prototypes we decided to use the DL_POLY molecular simulation package and its OpenCL port for the tests. The DL_POLY OpenCL port implements one of the methods – the Constraints SHAKE (CS) component. SHAKE is a two-stage algorithm based on the leapfrog Verlet integration scheme. We used four test cases for the evaluation: one from the DL_POLY application test-suite – H2O – and three real cases provided by a user. We show the performance results and discuss the usage experience with the prototypes in the context of ease of use, porting effort required, and energy consumption.
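For readers unfamiliar with the algorithm, here is a serial C sketch of the constraint stage that the OpenCL port offloads; this is the textbook SHAKE iteration, not DL_POLY source.

#include <math.h>

typedef struct { int i, j; double d2; } Constraint;  /* target length^2 */

/* Iteratively move each constrained atom pair back onto its bond length.
 * r: post-step positions, r_old: pre-step positions, invm: inverse masses. */
void shake_iterate(double (*r)[3], double (*r_old)[3], const double *invm,
                   const Constraint *con, int ncon, double tol, int maxit)
{
    for (int it = 0; it < maxit; ++it) {
        double worst = 0.0;
        for (int k = 0; k < ncon; ++k) {
            int i = con[k].i, j = con[k].j;
            double dnew[3], dold[3], dot = 0.0, r2 = 0.0;
            for (int a = 0; a < 3; ++a) {
                dnew[a] = r[i][a] - r[j][a];
                dold[a] = r_old[i][a] - r_old[j][a];
                r2  += dnew[a] * dnew[a];
                dot += dnew[a] * dold[a];
            }
            double diff = r2 - con[k].d2;
            if (fabs(diff) > worst) worst = fabs(diff);
            /* First-order Lagrange-multiplier correction along the old
             * bond vector (assumes the bond has not rotated drastically). */
            double g = diff / (2.0 * (invm[i] + invm[j]) * dot);
            for (int a = 0; a < 3; ++a) {
                r[i][a] -= g * invm[i] * dold[a];
                r[j][a] += g * invm[j] * dold[a];
            }
        }
        if (worst < tol) break;  /* all constraints satisfied */
    }
}

The two stages the abstract mentions are the unconstrained leapfrog update followed by this iteration; on a GPU the loop over constraints is what gets mapped onto work-items.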
Authors: D. Grancharov, N. Ilieva, E. Lilkova, L. Litov, S. Markov, P. Petkov, I. Todorov
NCSA, Akad. G. Bonchev 25A, Sofia 1311, Bulgaria
STFC Daresbury Laboratory, Daresbury, Warrington WA4 4AD, UK
Abstract: A library implementing the AGBNP2 [1, 2] implicit solvent model, developed within PRACE-2IP, is integrated into the DL_POLY_4 molecular dynamics package in order to speed up the time to solution for protein solvation processes. Generally, implicit solvent models lighten the computational load by reducing the degrees of freedom of the model: removing those of the solvent concentrates the effort on the protein dynamics, which is further facilitated by the absence of friction with solvent molecules. Furthermore, periodic boundary conditions are no longer formally required, since long-range electrostatic calculations cannot be applied to systems with variable dielectric permittivity. The AGBNP2 implicit solvation model improves the conformational sampling of the protein dynamics by including the influence of the solvent-accessible surface and of water-protein hydrogen-bonding effects as interactive force corrections on the atoms of the protein surface. This requires the development of suitable bookkeeping data structures, in accordance with the domain decomposition framework of DL_POLY, with dynamically adjustable inter-connectivity to describe the protein surface. The work also requires the use of advanced B-tree search libraries as part of the AGBNP library, in order to reduce the memory and compute requirements, and the automatic derivation of the van der Waals radii of atoms from the self-interaction potentials.
Authors: P. Petkov, I. Todorov, D. Grancharov, N. Ilieva, E. Lilkova, L. Litov, S. Markov
NCSA, Akad. G. Bonchev 25A, Sofia 1311, Bulgaria
STFC Daresbury Laboratory, Daresbury, Warrington WA4 4AD, UK
Abstract: Electrostatic interactions in molecular simulations are usually evaluated by employing the Ewald summation method, which splits the summation into a short-range part, treated in real space, and a long-range part, treated in reciprocal space. For performance reasons, in molecular dynamics software the latter is usually handled by the SPME or P3M grid-based methods, both relying on the 3D fast Fourier transform (FFT) as their central operation. However, the Ewald summation method is derived for model systems that are subject to 3D periodic boundary conditions (PBC), while there are many models of scientific as well as commercial interest whose geometry implies a 1D or 2D structure. Thus for systems such as membranes, interfaces, linear protein complexes, thin layers and nanotubes, employing Ewald-summation-based techniques is either computationally very disadvantageous or outright impossible. Another approach to evaluating the electrostatic interactions is to solve the Poisson equation for the model-system charge distribution on a 3D spatial grid. The formulation of the method allows an elegant way to switch the dependence on periodic boundary conditions on and off in a simple manner. Furthermore, 3D FFT kernels are known to scale poorly at large scale due to excessive memory and communication overheads, which makes Poisson solvers a viable alternative for DL_POLY on the road to exascale. This paper describes the work undertaken to integrate a Poisson solver library, developed in PRACE-2IP, within the DL_POLY_4 domain decomposition framework. The library relies on a unique combination of bi-conjugate gradient (BiCG) and conjugate gradient (CG) methods to guarantee independence from initial conditions with rapid convergence of the solution on the one hand, and stabilization of possible fluctuations of the iterative solution on the other. The implementation involves the development of procedures for generating charge density and electrostatic potential grids in real space over all domains in a distributed manner, as well as halo exchange routines and functions to calculate the gradient of the potential in order to recover the electrostatic forces on point charges.
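For reference, the splitting the abstract refers to has the standard form (alpha is the Ewald splitting parameter):

\frac{1}{r} = \frac{\mathrm{erfc}(\alpha r)}{r} + \frac{\mathrm{erf}(\alpha r)}{r}

with the first (short-range) term summed in real space and the second (long-range) term in reciprocal space, which is where the 3D-periodicity assumption enters. The grid alternative instead discretizes the Poisson equation

\nabla^2 \phi(\mathbf{r}) = -\rho(\mathbf{r}) / \varepsilon_0

in real space, where boundary conditions appear only as values imposed on the grid edges; this is why periodicity can be switched on and off so simply.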
Authors: Buket Benek Gursoy a, Henrik Nagel b
a Irish Centre for High-End Computing (ICHEC), Ireland
b Norwegian University of Science and Technology, Norway
Abstract: This whitepaper investigates the potential benefit of using the OpenACC directive-based programming tool for enabling DL_POLY_4 on GPUs. DL_POLY is a well-known general-purpose molecular dynamics simulation package, which has already been parallelised using MPI-2. DL_POLY_3 was accelerated using the CUDA framework by the Irish Centre for High-End Computing (ICHEC) in collaboration with Daresbury Laboratory. In this work, we have been inspired by the existing CUDA port to evaluate the effectiveness of OpenACC in further enabling DL_POLY_4 on the road to Exascale. We have been particularly concerned with investigating the benefits of OpenACC in terms of maintainability, programmability and portability issues that are becoming increasingly challenging as we advance to the Exascale era. The impact of the OpenACC port has been assessed in the context of a change in the reciprocal vector dimension for the calculation of SPME forces. Moreover, the interoperability of OpenACC with the existing CUDA port has been analysed.
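On the interoperability question, the pattern under analysis boils down to OpenACC-managed data being handed to an existing CUDA routine through host_data. A minimal C sketch (cuda_spme_forces is a hypothetical stand-in for a routine from the existing CUDA port):

/* Implemented on the CUDA side of the build (hypothetical name). */
void cuda_spme_forces(double *d_charges, double *d_forces, int n);

void spme_step(double *charges, double *forces, int n)
{
    #pragma acc data copyin(charges[0:n]) copyout(forces[0:n])
    {
        /* Hand the device pointers to the CUDA kernel launcher. */
        #pragma acc host_data use_device(charges, forces)
        cuda_spme_forces(charges, forces, n);
    }
}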
Authors: Mariusz Uchronski a, Marcin Gebarowski a, Agnieszka Kwiecien a,*
a Wroclaw Centre for Networking and Supercomputing (WCSS), Wyb. Wyspianskiego 27, 50-370 Wroclaw, Poland
Abstract: The SHAKE and RATTLE algorithms are widely used in molecular dynamics simulations and are therefore relevant to a broad range of scientific applications. In this work, existing CPU+GPU implementations of the SHAKE and RATTLE algorithms from the DL_POLY application are investigated. DL_POLY is a general-purpose parallel molecular dynamics simulation package developed at Daresbury Laboratory by W. Smith and I. T. Todorov. The OpenCL code of the SHAKE algorithm in DL_POLY is analyzed for further optimization possibilities. Our work on the RATTLE algorithm focuses on porting it from Fortran to OpenCL and adjusting it to the GPGPU architecture.
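RATTLE adds a velocity stage on top of the position constraints; a serial C sketch of that stage (textbook form, not the DL_POLY kernel) shows what the OpenCL port has to reproduce:

typedef struct { int i, j; double d2; } Constraint;  /* target length^2 */

/* Remove the relative velocity component along each constrained bond,
 * given positions r that already satisfy the constraints. */
void rattle_velocities(const double (*r)[3], double (*v)[3],
                       const double *invm, const Constraint *con, int ncon)
{
    for (int k = 0; k < ncon; ++k) {
        int i = con[k].i, j = con[k].j;
        double rij[3], rv = 0.0;
        for (int a = 0; a < 3; ++a) {
            rij[a] = r[i][a] - r[j][a];
            rv += rij[a] * (v[i][a] - v[j][a]);
        }
        /* Multiplier that zeroes the bond-direction relative velocity. */
        double g = rv / (con[k].d2 * (invm[i] + invm[j]));
        for (int a = 0; a < 3; ++a) {
            v[i][a] -= g * invm[i] * rij[a];
            v[j][a] += g * invm[j] * rij[a];
        }
    }
}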
Authors: Michael Lysaght a, Mariusz Uchronski b, Agnieszka Kwiecien b, Marcin Gebarowski b, Peter Nash a, Ivan Girotto a and Ilian T. Todorov c
a Irish Centre for High End Computing, Tower Building, Trinity Technology and Enterprise Campus, Grand Canal Quay, Dublin 2, Ireland
b Wroclaw Centre for Networking and Supercomputing, Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland
c STFC Daresbury Laboratory, Daresbury, Warrington WA4 4AD, United Kingdom
Abstract: We describe recent development work carried out on the GPU-enabled classical molecular dynamics software package DL_POLY. We describe how we have updated the original GPU port of DL_POLY 3 in order to align the ‘CUDA+OpenMP’-based code with the recently released MPI-based DL_POLY 4 package. In the process of updating the code we have also fixed several bugs, which allows us to benchmark the GPU-enabled code on many more GPU nodes than was previously possible. We also describe how we have recently initiated the development of an OpenCL-based implementation of DL_POLY and present a performance analysis of the set of DL_POLY modules that have so far been ported to GPUs using the OpenCL framework.
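Independent of the kernels themselves, an OpenCL implementation needs the host-side scaffolding that CUDA largely hides; a generic C skeleton (illustrative, not DL_POLY source; error checking omitted for brevity):

#include <stddef.h>
#include <CL/cl.h>

static const char *src =
    "#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n"
    "__kernel void scale(__global double *x, double a) {\n"
    "    size_t i = get_global_id(0);\n"
    "    x[i] *= a;\n"
    "}\n";

/* Scale an array on the first available GPU: x[i] *= a. */
void run_scale(double *x, size_t n, double a)
{
    cl_platform_id plat;
    cl_device_id dev;

    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "scale", NULL);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(double), x, NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(k, 1, sizeof(double), &a);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(double), x,
                        0, NULL, NULL);

    clReleaseMemObject(buf);
    clReleaseKernel(k);
    clReleaseProgram(prog);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
}

Unlike CUDA, this path compiles the kernel at run time, which is what makes the same code usable on non-NVIDIA devices.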
Authors: Sadaf Alam, Ugo Varetto
Swiss National Supercomputing Centre, Lugano, Switzerland
Abstract: This report introduces the hybrid implementation of the GROMACS application and provides instructions for building and executing it on PRACE prototype platforms with Graphics Processing Unit (GPU) and Many Integrated Core (MIC) accelerator technologies. GROMACS currently employs message-passing MPI parallelism and multi-threading using OpenMP, and contains kernels for non-bonded interactions that are accelerated using the CUDA programming language. As a result, the execution model is multi-faceted and end users can tune the application execution according to the underlying platform. We present results that have been collected on the PRACE prototype systems as well as on other GPU- and MIC-accelerated platforms with similar configurations. We also report on the preliminary porting effort that involves a fully portable implementation of GROMACS using the OpenCL programming language instead of CUDA, which is available only on NVIDIA GPU devices.
Authors: Fabio Affinito a, Andrew Emerson a, Leandar Litov b, Peicho Petkov b, Rossen Apostolov c,d, Lilit Axner c, Berk Hess d and Erik Lindahl d, Maria Francesca Iozzi e
a CINECA Supercomputing, Applications and Innovation Department, via Magnanelli 6/3, Casalecchio di Reno (BO), Italy
b National Center for Supercomputing Applications, Sofia, Bulgaria
c PDC Center for High Performance Computing at Royal Institute of Technology (KTH), Teknikringen 14, 100 44 Stockholm, Sweden
d Department of Theoretical Physics, Royal Institute of Technology (KTH), Stockholm, Sweden
e Research Computing Services Group, University of Oslo, P.O. Box 1059 Blindern, 0316 Oslo, Norway
Abstract: The work aims at evaluating the performance of GROMACS on different platforms and at determining the optimal set of conditions on given architectures for petascale molecular dynamics simulations. The activities have been organized into three tasks within the PRACE project: (i) optimization of GROMACS performance on Blue Gene systems; (ii) parallel scaling of the OpenMP implementation; (iii) development of a multiple step-size symplectic integrator adapted to large biomolecular systems. Part of the results reported here has been achieved through collaboration with the ScalaLife project.
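The integrator developed in task (iii) belongs to the multiple-time-step symplectic family; as a generic sketch (the usual r-RESPA Trotter factorization, not necessarily the exact splitting implemented), the propagator over a long step \Delta t is

e^{L \Delta t} \approx e^{L_{\mathrm{slow}} \Delta t / 2} \left( e^{L_{\mathrm{fast}} \delta t} \right)^{n} e^{L_{\mathrm{slow}} \Delta t / 2}, \qquad \Delta t = n \, \delta t

so that expensive, slowly varying forces (e.g. long-range electrostatics) are evaluated once per \Delta t, while cheap, fast bonded forces are integrated with the small step \delta t.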
Computational Fluid Dynamics (CFD) applications
Authors: Ahmet Duran a,b,*, M. Serdar Celebi a,c, Senol Piskin a,c and Mehmet Tuncel a,c
a Istanbul Technical University, National Center for High Performance Computing of Turkey (UHeM), Istanbul 34469, Turkey
b Istanbul Technical University, Department of Mathematics, Istanbul 34469, Turkey
c Istanbul Technical University, Informatics Institute, Istanbul 34469, Turkey
Abstract: We study a bio-medical fluid flow simulation using the incompressible, laminar OpenFOAM solver icoFoam and other direct solvers (kernel class), such as SuperLU_DIST 3.3 and SuperLU_MCDT (Many-Core Distributed), for the large penta-diagonal and hepta-diagonal matrices coming from the simulation of blood flow in arteries with a structured mesh domain. A realistic simulation of the sloshing of blood in the heart or in the vessels of the whole body is a complex problem and may take a very long time, thousands of hours, for the main tasks such as pre-processing (meshing), decomposition and solving the large linear systems. We generated the structured mesh using the blockMesh mesh generator tool, decomposed it with the decomposePar tool, and then used icoFoam as the flow simulator/solver. For example, the total run time of a simple case is about 1500 hours without preconditioning on one core for one period of the cardiac cycle, measured on the Linux Nehalem Cluster available at the National Center for High Performance Computing of Turkey (UHeM). This important problem therefore deserves careful consideration for usage on multi-petascale or exascale systems. Our aim is to test the potential scaling capability of the fluid solver for multi-petascale systems. We started from relatively small instances of the whole simulation and solved large linear systems, measuring the wall-clock time of single time steps; this version gives important clues for larger versions of the problem. In our general strategy, we then gradually increase the problem size and the number of time steps to obtain a fuller picture. We tested the performance of the icoFoam solver on TGCC Curie (a Tier-0 system) at CEA, France. We consider three large sparse matrices of sizes 8 million x 8 million, 32 million x 32 million and 64 million x 64 million. We achieved scaled speed-up for the largest matrix, of 64 million x 64 million, on runs of up to 16384 cores. In other words, we find that the scalability improves as the problem size increases for this application, showing that there is no structural problem in the software up to this scale. This is an important and encouraging result for the problem. Moreover, we embedded other direct solvers (kernel class), such as SuperLU_DIST 3.3 and SuperLU_MCDT, in addition to the solvers provided by OpenFOAM. Since future exascale systems are expected to have heterogeneous, many-core distributed nodes, we believe that our SuperLU_MCDT software is a good candidate for future systems. SuperLU_MCDT worked up to 16384 cores for the large penta-diagonal matrices from 2D problems and the hepta-diagonal matrices from 3D problems coming from the incompressible blood flow simulation, without any problem.
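For context on where the matrix structure comes from (a generic illustration, not the specific icoFoam discretization): a finite-volume pressure equation on a structured 2D mesh couples each cell to its four face neighbours,

a_P \phi_{i,j} + a_W \phi_{i-1,j} + a_E \phi_{i+1,j} + a_S \phi_{i,j-1} + a_N \phi_{i,j+1} = b_{i,j}

which under lexicographic ordering of the unknowns yields a penta-diagonal matrix; the two additional neighbours of a 3D cell give the hepta-diagonal case.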