Authors: Jeremie Gaidamour (a), Dimitri Lecas (a), Pierre-Francois Lavallee (a)
(a) IDRIS/CNRS, Campus universitaire d’Orsay, rue John Von Neumann, Batiment 506, F-91403 Orsay, France
Abstract: The HYDRO mini-application has been successfully used as a research vehicle in previous PRACE projects. In this paper, we evaluate the benefits of the tasking model introduced in recent OpenMP standards. We have developed a new version of HYDRO based on OpenMP tasks and compare this implementation with existing, optimized OpenMP versions of HYDRO.
Authors: Jorge Rodriguez (a)
(a) BSC-CNS: Barcelona Supercomputing Center, Torre Girona, C/Jordi Girona, 31, 08034 Barcelona, Spain
Abstract: Alya is a computational mechanics code capable of solving different physics. It has been used extensively on MareNostrum III (BSC’s Tier-0 machine) and has also served as a benchmarking code in the PRACE Unified European Applications Benchmark Suite. In this document, Extrae is used to collect and analyze performance data during an Alya simulation in a petaflop environment. The performance analysis with Extrae revealed several potential improvements in Alya which, if implemented, could allow the code to scale to exascale systems.
Application Code: Alya
Authors: B. Lindi (a)*, T. Ponweiser (b), P. Jovanovic (c), T. Arslan (a)
(a) Norwegian University of Science and Technology
(b) RISC Software GmbH, A company of Johannes Kepler University Linz
(c) Institute of Physics Belgrade
Abstract: This study has profiled the application Code_Saturne, which is part of the PRACE benchmark suite. The profiling has been carried out with the tools HPCToolkit and Tuning and Analysis Utilities (TAU), with the aim of finding compute kernels suitable for autotuning. Autotuning is regarded as a necessary step in achieving sustainable performance at the Exascale level, as Exascale systems will most likely have a heterogeneous runtime environment. A heterogeneous runtime environment imposes a parameter space for the application's run-time behavior which cannot be explored by a traditional compiler. Nor can the run-time behavior be explored manually by the developer or code owner, as this would be too time-consuming. The tool Orio has been used to autotune the identified compute kernels. Orio has been used on traditional Intel processors, Intel Xeon Phi and NVIDIA GPUs. The compute kernels make a small contribution to the overall execution time of Code_Saturne. By autotuning with Orio, these kernels have been improved by 3-5%.
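The idea of exploring a parameter space empirically can be illustrated with a minimal sketch. It does not use Orio or its annotation language; the kernel `blocked_sum_of_squares`, the `autotune` helper and the candidate block sizes are all hypothetical and chosen only to show the principle of timing each variant and keeping the fastest one.

```python
import time


def blocked_sum_of_squares(data, block):
    """Hypothetical kernel: sum of squares, processed in chunks of `block` items."""
    total = 0.0
    for start in range(0, len(data), block):
        chunk = data[start:start + block]
        total += sum(x * x for x in chunk)
    return total


def autotune(kernel, data, candidates, repeats=3):
    """Time the kernel for every candidate value and return the fastest one.

    The best of `repeats` runs is kept per candidate to reduce timing noise.
    """
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(data, block)
            best = min(best, time.perf_counter() - t0)
        timings[block] = best
    fastest = min(timings, key=timings.get)
    return fastest, timings


# Assumed input size and search space, for illustration only.
data = [float(i) for i in range(20000)]
candidates = [16, 64, 256, 1024]
best_block, timings = autotune(blocked_sum_of_squares, data, candidates)
print("fastest block size on this machine:", best_block)
```

The selected value depends on the machine on which the search runs, which is the reason such a search has to be repeated for each target architecture.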
Authors: Sadaf Alam, Ugo Varetto (a)
(a) Swiss National Supercomputing Centre, Lugano, Switzerland
Abstract: Recently, MPI implementations have been extended to support accelerator devices such as Intel Many Integrated Core (MIC) and NVIDIA GPUs. This has been accomplished by changes at different levels of the software stack and of the MPI implementations. In order to evaluate the performance and scalability of accelerator-aware MPI libraries, we developed portable micro-benchmarks to identify factors that influence the efficiency of primitive MPI point-to-point and collective operations. These benchmarks have been implemented in OpenACC, CUDA and OpenCL. On the Intel MIC platform, existing MPI benchmarks can be executed with an appropriate mapping onto the MIC and CPU cores. Our results demonstrate that the MPI operations are highly sensitive to the memory and I/O bus configurations on the node. The current implementation of the MIC on-node communication interface exhibits additional limitations on the placement of the card and on data transfers over the memory bus.
Authors: Mikael Rannar (a), Maciej Szpindler (b)
(a) HPC2N & Department of Computing Science, Umea University
(b) Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw
Abstract: The scaling of the HBM (HIROMB-BOOS Model) ocean circulation model on selected PRACE Tier-0 systems is described. The model has been ported to the BlueGene/Q architecture, and its OpenMP and mixed OpenMP/MPI parallel performance and scaling have been tested with a given test case scenario. A micro-benchmarking module for selected computational kernels and model procedures is proposed for further integration with the model code. Details of the micro-benchmark proposal and the results of the scaling tests are described.
Authors: J. Donners (a)*, A. Mourits (b), M. Genseberger (b), B. Jagers (b)
(a) SURFsara, Amsterdam, The Netherlands
(b) Deltares, Delft, The Netherlands
Abstract: The Delft3D modelling suite has been ported to the PRACE Tier-0 and Tier-1 infrastructure. The portability of Delft3D was improved by removing platform-dependent options from the build system and replacing non-standard constructs in the source. Three benchmarks were used to investigate the scaling of Delft3D: (1) a large, regular domain; (2) a realistic, irregular domain with a low fill-factor; (3) a regular domain with a sediment transport module. The first benchmark clearly shows good scalability up to a thousand cores for a suitable problem. The other benchmarks show reasonable scalability up to about 100 cores. For test case (2) the main bottleneck is the serialized I/O. An attempt was made to implement a separate I/O server by dedicating the last MPI process to I/O, but this work is not yet finished. The imbalance due to the irregular domain can be reduced somewhat by using a cyclic placement of MPI tasks. Test case (3) benefits from inlining of frequently called routines.
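The effect of a cyclic placement can be sketched with a small model. This is an illustration only, not code from Delft3D: the per-rank workloads are invented numbers standing in for an irregular domain in which consecutive ranks hold similar amounts of active cells, and the helper names are hypothetical.

```python
def block_placement(n_ranks, n_nodes):
    """Consecutive ranks fill one node before moving on to the next."""
    per_node = -(-n_ranks // n_nodes)  # ceiling division
    return [rank // per_node for rank in range(n_ranks)]


def cyclic_placement(n_ranks, n_nodes):
    """Ranks are dealt out to the nodes round-robin."""
    return [rank % n_nodes for rank in range(n_ranks)]


def node_loads(work, placement, n_nodes):
    """Sum the work of all ranks that were placed on each node."""
    loads = [0] * n_nodes
    for rank_work, node in zip(work, placement):
        loads[node] += rank_work
    return loads


def imbalance(loads):
    """Ratio of the busiest node to the average node (1.0 is perfect)."""
    return max(loads) / (sum(loads) / len(loads))


# Assumed workloads: the first ranks own most of the active cells.
work = [90, 80, 70, 60, 10, 10, 5, 5]
n_nodes = 2

block = node_loads(work, block_placement(len(work), n_nodes), n_nodes)
cyclic = node_loads(work, cyclic_placement(len(work), n_nodes), n_nodes)

print("block :", block, round(imbalance(block), 2))
print("cyclic:", cyclic, round(imbalance(cyclic), 2))
```

In this toy case the block placement puts all heavy subdomains on the same node, while the cyclic placement spreads them across both nodes. How much a real run gains depends on the actual distribution of active cells.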
Authors: Maciej Cytowski, Maciej Filocha, Jakub Katarzynski, Maciej Szpindler
Interdisciplinary Centre for Mathematical and Computational Modeling (ICM), University of Warsaw, Poland
Abstract: In this whitepaper we describe our effort to measure the performance of applications and synthetic benchmarks using different simultaneous multithreading (SMT) modes. This processor architecture feature is currently available in many petascale HPC systems worldwide. Both the IBM Power7 processors available in Power775 (IH) and the IBM Power A2 processors available in Blue Gene/Q are built upon 4-way simultaneous multithreaded cores. It should also be mentioned that multithreading is predicted to be one of the leading features of future exascale systems available by the end of the next decade.
Authors: J. Mark Bull (a)*, Andrew Emerson (b)
(a) EPCC, University of Edinburgh, King’s Buildings, Mayfield Road, Edinburgh EH9 3JZ, UK.
(b) CINECA, via Magnanelli 6/3, 40033 Casalecchio di Reno, Bologna, Italy
Abstract: This White Paper reports on the selection of a set of application codes taken from the existing PRACE and DEISA application benchmark suites to form a single Unified European Application Benchmark Suite (UEABS).
The selected codes are: QCD, NAMD, GROMACS, Quantum Espresso, CP2K, GPAW, Code_Saturne, ALYA, NEMO, SPECFEM3D, GENE, and GADGET.
These whitepapers have been prepared by the PRACE Implementation Phase Projects and in accordance with the Consortium Agreements and Grant Agreements n° RI-261557, n° RI-283493, or n° RI-312763.
They solely reflect the opinion of the parties to such agreements on a collective basis in the context of the PRACE Implementation Phase Projects and to the extent foreseen in such agreements. Please note that even though all participants to the PRACE IP Projects are members of PRACE AISBL, these whitepapers have not been approved by the Council of PRACE AISBL and therefore do not emanate from it nor should be considered to reflect PRACE AISBL’s individual opinion.
© 2014 PRACE Consortium Partners. All rights reserved. This document is a project document of a PRACE Implementation Phase project. All contents are reserved by default and may not be disclosed to third parties without the written consent of the PRACE partners, except as mandated by the European Commission contracts RI-261557, RI-283493, or RI-312763 for reviewing and dissemination purposes.
All trademarks and other rights on third party products mentioned in the document are acknowledged as owned by the respective holders.