Authors: C. Basu a,*, A. De Nicola b, G. Milano b
a National Supercomputer Centre, Linköping University, Sweden
b University of Salerno, Department of Chemistry and Biology, Via Giovanni Paolo II, 132, 84084, Fisciano, Italy
Abstract: A new parallel I/O scheme is implemented in the hybrid particle-field MD simulation code called OCCAM. In the new implementation the numbers of input and output files are greatly reduced. Furthermore, the sizes of the input and output files are reduced as the new files are in binary format compared to ASCII files in the original code. The I/O performance is improved due to bulk data transfer instead of small and frequent data transfer in the original code. The results of tests on two different systems show 6-18% performance improvements.
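The size difference between ASCII and binary output, and the idea of one bulk transfer instead of many small records, can be illustrated with a minimal sketch. It is not OCCAM's actual file format: the function names, the three-doubles-per-particle layout and the text field width are assumptions chosen for illustration only. The sketch compares file sizes and does not measure timing.

```python
import os
import struct
import tempfile

def write_ascii(path, coords):
    """Write one formatted text record per particle (hypothetical layout)."""
    with open(path, "w") as f:
        for x, y, z in coords:
            f.write(f"{x:20.12e} {y:20.12e} {z:20.12e}\n")

def write_binary(path, coords):
    """Pack all coordinates into one buffer and hand it over in a single write."""
    flat = [v for xyz in coords for v in xyz]
    payload = struct.pack(f"<{len(flat)}d", *flat)  # little-endian doubles
    with open(path, "wb") as f:
        f.write(payload)

def read_binary(path):
    """Read the whole binary file back into a list of (x, y, z) tuples."""
    with open(path, "rb") as f:
        raw = f.read()
    flat = struct.unpack(f"<{len(raw) // 8}d", raw)
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]

def compare_sizes(n_particles=1000):
    """Return (ascii_bytes, binary_bytes, roundtrip_ok) for synthetic data."""
    coords = [(i * 0.1, i * 0.2, i * 0.3) for i in range(n_particles)]
    with tempfile.TemporaryDirectory() as tmp:
        ascii_path = os.path.join(tmp, "coords.txt")
        binary_path = os.path.join(tmp, "coords.bin")
        write_ascii(ascii_path, coords)
        write_binary(binary_path, coords)
        roundtrip_ok = read_binary(binary_path) == coords
        return (os.path.getsize(ascii_path),
                os.path.getsize(binary_path),
                roundtrip_ok)
```

With this layout the binary file holds exactly 24 bytes per particle, while each text record needs more than 60 bytes for comparable precision.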
Authors: F. Sottile a, C. Roedl a, V. Slavnic b, P. Jovanovic b, D. Stankovic b, P. Kestener c, F. Houssen c
a Laboratoire des Solides Irradies, Ecole Polytechnique, CNRS, CEA, UMR 7642, 91128 Palaiseau cedex, France
b Scientific Computing Laboratory, Institute of Physics Belgrade, University of Belgrade, Pregrevica 118, 11080 Belgrade, Serbia
c Maison de la Simulation, USR 3441, bat. 565, CEA Saclay, 91191 Gif-sur-Yvette cedex, France
Abstract: The main goal of this PRACE project was to evaluate how GPUs could speed up the DP code, a linear response TDDFT code. Profiling analysis of the code has been done to identify computational bottlenecks to be delegated to the GPU. In order to speed up this code using GPUs, two different strategies have been developed: a local one and a global one. Both strategies have been implemented with cuBLAS and/or CUDA C. Results showed that one can reasonably expect about 10 times speedup on the total execution time, depending on the structure of the input and the size of datasets used, and speedups up to 16 have been observed for some cases.
Authors: Ernest Artiaga a, Toni Cortes a,b
a Barcelona Supercomputing Center, Barcelona, Spain
b Universitat Politècnica de Catalunya, Barcelona, Spain
Abstract: The purpose of this white paper is to document the measurements obtained in PRACE-2IP regarding file system metadata performance, and to assess mechanisms to further improve such performance. The final goal of the task is to identify the open issues related to file systems for multi-petascale and exascale facilities, and propose novel solutions that can be applied to Lustre, enabling it to manage a huge number of files on a system with many heterogeneous devices while efficiently delivering huge data bandwidth and low latency, minimizing the response time.
The reported tasks comprised: the observation, measurement and study of a large-scale system currently in production, in order to identify the key metadata-related issues; the development of a prototype aimed at improving the metadata behaviour of such a system and at providing a framework to easily deploy novel metadata management techniques on top of other systems; the measurement and study of specially deployed Lustre and GPFS prototypes to validate the presence of the metadata issues observed in current production systems; and finally the porting of the framework prototype to test novel metadata management techniques on both a production system using GPFS and the PRACE Lustre prototype facility at CEA.
Authors: A. Mignone a,*, G. Muscianisi b, M. Rivi b, G. Bodo c
a Dipartimento di Fisica Generale, Università di Torino, via Pietro Giuria 1, 10125 Torino, Italy
b Consorzio Interuniversitario CINECA, via Magnanelli, 6/3, 40033 Casalecchio di Reno (Bologna), Italy
c INAF, Osservatorio Astronomico di Torino, Strada Osservatorio 20, Pino Torinese, Italy
Abstract: PLUTO is a modular and multi-purpose numerical code for astrophysical fluid dynamics targeting highly supersonic and magnetized flows. As astrophysical applications are becoming increasingly demanding in terms of grid resolution and I/O, efforts have been spent to overcome the main bottlenecks of the code, mainly related to an obsolete and no longer maintained library providing parallel functionality. Successful achievements have been pursued in The Partnership for Advanced Computing in Europe First Implementation Phase Project (PRACE-1IP) and are described in the present white-paper.
Authors: Valentin Pavlov, Peicho Petkov
NCSA, Acad. G. Bonchev str., bl. 25A, Sofia 1000, Bulgaria
Abstract: In MPI multiprocessing environments, data I/O in the GROMACS molecular dynamics package is handled by the master node. All input data is read by the master node, then scattered to the computing nodes, and on each step gathered back in full and possibly written out. This method is fine for most of the architectures that use shared memory or where the amount of RAM on the master node can be extended (as in clusters and Cray machines), but introduces a bottleneck for distributed memory systems with hard memory limits like the IBM Blue Gene/P. The effect is that even though a Tier-0 Blue Gene/P machine has enough overall computing power and RAM, it cannot process molecular systems with more than 5,000,000 atoms because the master node simply does not have enough RAM to hold all necessary input data. In this paper we describe an approach that eliminates this bottleneck and allows such large systems to be processed on Tier-0 Blue Gene/Ps. The approach is based on using global memory structures which are distributed among all computing nodes. We utilize the Global Arrays Toolkit by PNNL to achieve this goal. We analyze which structures need to be changed, design an interface for virtual arrays and rewrite all routines which deal with data I/O of the corresponding structures. Our results indicate that the approach works and we present the simulation of a large bio-molecular system (lignocellulose) on an IBM Blue Gene/P machine.
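The idea of replacing a master-held array with a structure distributed among all computing nodes can be sketched as a simple block distribution. This is only an illustration of the indexing involved: it does not use the Global Arrays Toolkit API, and the function names and the choice of a contiguous block layout are assumptions, not the scheme described in the paper.

```python
def block_range(n_items, n_nodes, node):
    """Return the half-open [start, stop) range of global indices held by node.

    Items are split into contiguous blocks; the first (n_items % n_nodes)
    nodes hold one extra item so that block sizes differ by at most one.
    """
    base, extra = divmod(n_items, n_nodes)
    start = node * base + min(node, extra)
    stop = start + base + (1 if node < extra else 0)
    return start, stop

def owner_of(n_items, n_nodes, index):
    """Map a global index to (owning node, local index on that node)."""
    base, extra = divmod(n_items, n_nodes)
    boundary = extra * (base + 1)  # first global index held by a smaller block
    if index < boundary:
        node = index // (base + 1)
        return node, index - node * (base + 1)
    node = extra + (index - boundary) // base
    return node, (index - boundary) % base

def max_items_per_node(n_items, n_nodes):
    """Largest number of items any single node has to hold."""
    return -(-n_items // n_nodes)  # ceiling division
```

Under this layout no node ever holds more than `max_items_per_node(n_items, n_nodes)` entries, in contrast to a master node that must hold all `n_items`. For example, 5,000,000 atoms spread over a hypothetical 4,096 nodes gives at most 1,221 atoms per node.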
Authors: Jan Christian Meyer a, Jørn Amundsen a, Xavier Saez b
a Norwegian University of Science and Technology (NTNU), Trondheim, NO-7491, Norway
b Barcelona Supercomputing Center, c/ Jordi Girona, 29, 08034 Barcelona, Spain
Abstract: Rewriting application I/O for performance and scalability on petaflops machines easily becomes a formidable task in terms of man-hour effort. Furthermore, on HPC systems the gap of compute to I/O capability in Tflop/s vs GByte/s has increased by a factor of 10 in recent years. It makes the insertion of I/O forwarding software layers between the application and file system layer increasingly feasible from a performance point of view. This whitepaper describes the work on evaluating the IOFSL I/O Forwarding and Scalability Layer on PRACE applications. The results show that the approach is relevant, but is presently made infeasible by the associated overhead and issues with the software stack.
Authors: Raúl de la Cruz, Hadrien Calmet, Guillaume Houzeaux
Barcelona Supercomputing Center, Edificio NEXUS I, c/ Gran Capitán 2-4, 08034 Barcelona, Spain
Abstract: Alya is a Computational Mechanics (CM) code developed at Barcelona Supercomputing Center, which solves Partial Differential Equations (PDEs) in non-structured meshes, using Finite Element (FE) methods. Being a large-scale scientific code, Alya demands substantial I/O processing, which may consume considerable time and can therefore potentially reduce speed-up at petascale. Consequently, I/O turns out to be a critical point to consider in achieving desirable performance levels. The current Alya I/O model is based on a master-slave approach, which limits scaling and I/O parallelization. However, efficient parallel I/O can be achieved using freely available middleware libraries that provide parallel access to disks. The HDF5 parallel I/O implementation shows a relatively low complexity of use and a wide number of features compared to other implementations, such as MPI-IO and netCDF. Furthermore, HDF5 offers some interesting aspects such as a shorter development cycle and a hierarchical data format with metadata support, and it is becoming a de facto standard as well. Moreover, in order to achieve an open-standard format in Alya, the XDMF approach (eXtensible Data Model Format) has been used as metadata container (light data) in cooperation with HDF5 (heavy data). To overcome the I/O barrier at petascale, XDMF & HDF5 have been introduced in Alya and compared to the original master-slave strategy. Both versions are deployed, tested and measured on the Curie and Jugene Tier-0 supercomputers. Our preliminary results on the testbed platforms show a clear improvement of the new parallel I/O compared with the original implementation.
Author: Bjørn Lindi
Norwegian University of Science and Technology (NTNU), Trondheim, NO-7491, Norway
Abstract: Darshan is a set of libraries that can characterize MPI-IO and POSIX file access within typical HPC applications in a non-intrusive way. It can be used to investigate the I/O behavior of an MPI program. An application’s I/O behavior can easily be an obstacle to achieving petascale performance. Hence, being able to characterize the I/O of an HPC application is an important step on the path to developing scaling properties. Darshan has been used on selected applications from task 7.1 and 7.2. This whitepaper describes the work carried out and the results achieved.
Authors: Philippe Wautelet a,*, Pierre Kestener b
a IDRIS-CNRS, Campus universitaire d’Orsay, rue John Von Neumann, Bâtiment 506, F-91403 Orsay, France
b CEA Saclay, DSM / Maison de la Simulation, centre de Saclay, F-91191 Gif-sur-Yvette, France
Abstract: The results of two kinds of parallel IO performance measurements on the CURIE supercomputer are presented in this report. In a first series of tests, we use the open source IOR benchmark to make a comparative study of the parallel reading and writing performances on the CURIE Lustre filesystem using different IO paradigms (POSIX, MPI-IO, HDF5 and Parallel-netCDF). The impact of the parallel mode (collective or independent) and of the MPI-IO hints on the performance is also studied. In a second series of tests, we use a well known scientific code in the HPC astrophysics community: RAMSES, which is a grid-based hydrodynamics solver with adaptive mesh refinement (AMR). IDRIS added support for the 3 following parallel IO approaches: MPI-IO, HDF5 and Parallel-netCDF. They are compared to the traditional one file per MPI process approach. Results from the two series of tests (synthetic with IOR and more realistic with RAMSES) are compared. This study could serve as a good starting point for helping other application developers in improving parallel IO performance.
Author: Huub Stoffers
SARA, Science Park 140, 1098 XG Amsterdam, The Netherlands
Abstract: The I/O subsystems of high performance computing installations tend to be very system specific. There is only a loose coupling with compute architectures and consequently considerable freedom of choice in design and ample room for tradeoffs at various implementation levels. The PRACE Tier-0 systems are no exception in this respect. The I/O subsystem of JUGENE at the Forschungszentrum Jülich is not necessarily very similar to the I/O subsystems of IBM Blue Gene/P installations elsewhere. However, most applications that need efficient handling of petascale data cannot afford to ignore the I/O subsystem. To some extent system specific arrangements of resources need to be known to avoid system- or site-specific bottlenecks.
The first section of the paper gives a fairly detailed description of the I/O subsystem of the PRACE Tier-0 system JUGENE at Jülich and points out the maximum I/O rates that can be achieved, following from the specifications of the components that have been used and the way they are interconnected.
In the second section, the description is complemented with some practical guidelines on how to do efficient I/O on JUGENE and a description of some tools for source level adaptation of applications for improving I/O performance. Using standard I/O and MPI communication, rather than adopting a particular library, variations on the hierarchical organization of I/O within an application are explored in more depth and compared for performance. Splitting a parallel program of multiple tasks into a number of equally sized I/O groups, in which a few tasks do I/O on behalf of the group, improves performance only moderately when group membership is determined rather arbitrarily. BlueGene specific MPI extensions can be used to bring topological information, about which tasks are being served by the same I/O node, into the program. Example programs are given on how these calls can be used to create a division into groups that are not only balanced in the I/O volume they produce, but also in the underlying resources they have at their disposal to handle th
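The difference between arbitrary and topology-aware I/O groups can be sketched in a few lines. The sketch below is not taken from the whitepaper's example programs and uses no MPI calls: `io_node_of_task` is a hypothetical list standing in for whatever task-to-I/O-node mapping the Blue Gene MPI extensions report, and all function names are assumptions.

```python
from collections import defaultdict

def arbitrary_groups(n_tasks, n_groups):
    """Split task ranks into equally sized groups by rank order alone.

    Assumes n_tasks is divisible by n_groups.
    """
    size = n_tasks // n_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(n_groups)]

def topology_groups(io_node_of_task):
    """Group task ranks by the I/O node that serves them.

    io_node_of_task[rank] is the (hypothetical) id of the I/O node
    serving that rank.
    """
    groups = defaultdict(list)
    for rank, io_node in enumerate(io_node_of_task):
        groups[io_node].append(rank)
    return [groups[node] for node in sorted(groups)]

def io_nodes_per_group(groups, io_node_of_task):
    """Count how many distinct I/O nodes each group's members are served by."""
    return [len({io_node_of_task[rank] for rank in group}) for group in groups]
```

For an interleaved mapping such as `[0, 1, 0, 1, 0, 1, 0, 1]`, the rank-order split produces two groups whose members are each spread across both I/O nodes, whereas the topology-aware split gives every group exactly one I/O node of its own.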
Disclaimer: These whitepapers have been prepared by the PRACE Implementation Phase Projects and in accordance with the Consortium Agreements and Grant Agreements n° RI-261557, n° RI-283493, or n° RI-312763.
They solely reflect the opinion of the parties to such agreements on a collective basis in the context of the PRACE Implementation Phase Projects and to the extent foreseen in such agreements. Please note that even though all participants to the PRACE IP Projects are members of PRACE AISBL, these whitepapers have not been approved by the Council of PRACE AISBL and therefore do not emanate from it nor should be considered to reflect PRACE AISBL’s individual opinion.
Copyright notices: © 2014 PRACE Consortium Partners. All rights reserved. This document is a project document of a PRACE Implementation Phase project. All contents are reserved by default and may not be disclosed to third parties without the written consent of the PRACE partners, except as mandated by the European Commission contracts RI-261557, RI-283493, or RI-312763 for reviewing and dissemination purposes.
All trademarks and other rights on third party products mentioned in the document are acknowledged as owned by their respective holders.