2020
Rockenbach, Dinei A High-Level Programming Abstractions for Stream Parallelism on GPUs Masters Thesis School of Technology - PPGCC - PUCRS, 2020. @mastersthesis{ROCKENBACH:DM:20, title = {High-Level Programming Abstractions for Stream Parallelism on GPUs}, author = {Dinei A. Rockenbach}, url = {https://tede2.pucrs.br/tede2/handle/tede/9592}, year = {2020}, date = {2020-11-27}, address = {Porto Alegre, Brazil}, school = {School of Technology - PPGCC - PUCRS}, abstract = {The growth and spread of parallel architectures have driven the pursuit of greater computing power with massively parallel hardware such as the Graphics Processing Units (GPUs). This new heterogeneous computer architecture composed of multi-core Central Processing Units (CPUs) and many-core GPUs became usual, enabling novel software applications such as self-driving cars, real-time ray tracing, deep learning, and Virtual Reality (VR), which are characterized as stream processing applications. However, this heterogeneous environment poses an additional challenge to software development, which is still in the process of adapting to the parallel processing paradigm on multi-core systems, where programmers are supported by several Application Programming Interfaces (APIs) that offer different abstraction levels. The parallelism exploitation in GPU is done using both CUDA and OpenCL for academia and industry, whose developers have to deal with low-level architecture concepts to efficiently exploit GPU parallelism in their applications. There is still a lack of parallel programming abstractions when: 1) parallelizing code on GPUs, and 2) needing higher-level programming abstractions that deal with both CPU and GPU parallelism. Unfortunately, developers still have to be expert programmers on system and architecture to enable efficient hardware parallelism exploitation in this architectural environment. 
To contribute to the first problem, we created GSPARLIB, a novel structured parallel programming library for exploiting GPU parallelism that provides a unified programming API and driver-agnostic runtime. It offers Map and Reduce parallel patterns on top of CUDA and OpenCL drivers. We evaluate its performance comparing with state-of-the-art APIs, where the experiments revealed a comparable and efficient performance. For contributing to the second problem, we extended the SPar Domain-Specific Language (DSL), which has been proved to be high-level and productive for expressing stream parallelism with C++ annotations in multi-core CPUs. In this work, we propose and implement new annotations that increase expressiveness to combine the current stream parallelism on CPUs and data parallelism on GPUs. We also provide new pattern-based transformation rules that were implemented in the compiler targeting automatic source-to-source code transformations using GSPARLIB for GPU parallelism exploitation. Our experiments demonstrate that SPar compiler is able to generate stream and data parallel patterns without significant performance penalty compared to handwritten code. Thanks to these advances in SPar, our work is the first on providing high-level C++11 annotations as an API that does not require significant code refactoring in sequential programs while enabling multi-core CPU and many-core GPU parallelism exploitation for stream processing applications.}, keywords = {}, pubstate = {published}, tppubtype = {mastersthesis} }
Hoffmann, Renato B; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo Stream Parallelism Annotations for Multi-Core Frameworks Inproceedings doi XXIV Brazilian Symposium on Programming Languages (SBLP), pp. 48-55, ACM, Natal, Brazil, 2020. @inproceedings{HOFFMANN:SBLP:20, title = {Stream Parallelism Annotations for Multi-Core Frameworks}, author = {Renato B. Hoffmann and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1145/3427081.3427088}, doi = {10.1145/3427081.3427088}, year = {2020}, date = {2020-10-01}, booktitle = {XXIV Brazilian Symposium on Programming Languages (SBLP)}, pages = {48-55}, publisher = {ACM}, address = {Natal, Brazil}, series = {SBLP'20}, abstract = {Data generation, collection, and processing is an important workload of modern computer architectures. Stream or high-intensity data flow applications are commonly employed in extracting and interpreting the information contained in this data. Due to the computational complexity of these applications, high-performance ought to be achieved using parallel computing. Indeed, the efficient exploitation of available parallel resources from the architecture remains a challenging task for the programmers. Techniques and methodologies are required to help shift the efforts from the complexity of parallelism exploitation to specific algorithmic solutions. To tackle this problem, we propose a methodology that provides the developer with a suitable abstraction layer between a clean and effective parallel programming interface targeting different multi-core parallel programming frameworks. We used standard C++ code annotations that may be inserted in the source code by the programmer. Then, a compiler parses C++ code with the annotations and generates calls to the desired parallel runtime API. 
Our experiments demonstrate the feasibility of our methodology and the performance of the abstraction layer, where the difference is negligible in four applications with respect to the state-of-the-art C++ parallel programming frameworks. Additionally, our methodology allows improving the application performance since the developers can choose the runtime that best performs in their system.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
Garcia, Adriano Marques; Serpa, Matheus; Griebler, Dalvan; Schepke, Claudio; Fernandes, Luiz Gustavo; Navaux, Philippe O A The Impact of CPU Frequency Scaling on Power Consumption of Computing Infrastructures Inproceedings doi International Conference on Computational Science and its Applications (ICCSA), pp. 142-157, Springer, Cagliari, Italy, 2020. @inproceedings{GARCIA:ICCSA:20, title = {The Impact of CPU Frequency Scaling on Power Consumption of Computing Infrastructures}, author = {Adriano Marques Garcia and Matheus Serpa and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes and Philippe O A Navaux}, url = {https://doi.org/10.1007/978-3-030-58817-5_12}, doi = {10.1007/978-3-030-58817-5_12}, year = {2020}, date = {2020-07-01}, booktitle = {International Conference on Computational Science and its Applications (ICCSA)}, volume = {12254}, pages = {142-157}, publisher = {Springer}, address = {Cagliari, Italy}, series = {ICCSA'20}, abstract = {Since the demand for computing power increases, new architectures emerged to obtain better performance. Reducing the power and energy consumption of these architectures is one of the main challenges to achieving high-performance computing. Current research trends aim at developing new software and hardware techniques to achieve the best performance and energy trade-offs. In this work, we investigate the impact of different CPU frequency scaling techniques such as ondemand, performance, and powersave on the power and energy consumption of multi-core based computer infrastructure. We apply these techniques in PAMPAR, a parallel benchmark suite implemented in PThreads, OpenMP, MPI-1, and MPI-2 (spawn). We measure the energy and execution time of 10 benchmarks, varying the number of threads. Our results show that although powersave consumes up to 43.1% less power than performance and ondemand governors, it consumes the triple of energy due to the high execution time. 
Our experiments also show that the performance governor consumes up to 9.8% more energy than ondemand for CPU-bound benchmarks. Finally, our results show that PThreads has the lowest power consumption, consuming less than the sequential version for memory-bound benchmarks. Regarding performance, the performance governor achieved 3% of performance over the ondemand.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
Stein, Charles Michael; Rockenbach, Dinei A; Griebler, Dalvan; Torquati, Massimo; Mencagli, Gabriele; Danelutto, Marco; Fernandes, Luiz Gustavo Latency‐aware adaptive micro‐batching techniques for streamed data compression on graphics processing units Journal Article doi Concurrency and Computation: Practice and Experience, na (na), pp. e5786, 2020. @article{STEIN:CCPE:20, title = {Latency‐aware adaptive micro‐batching techniques for streamed data compression on graphics processing units}, author = {Charles Michael Stein and Dinei A. Rockenbach and Dalvan Griebler and Massimo Torquati and Gabriele Mencagli and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1002/cpe.5786}, doi = {10.1002/cpe.5786}, year = {2020}, date = {2020-05-01}, journal = {Concurrency and Computation: Practice and Experience}, volume = {na}, number = {na}, pages = {e5786}, publisher = {Wiley Online Library}, abstract = {Stream processing is a parallel paradigm used in many application domains. With the advance of graphics processing units (GPUs), their usage in stream processing applications has increased as well. The efficient utilization of GPU accelerators in streaming scenarios requires to batch input elements in microbatches, whose computation is offloaded on the GPU leveraging data parallelism within the same batch of data. Since data elements are continuously received based on the input speed, the bigger the microbatch size the higher the latency to completely buffer it and to start the processing on the device. Unfortunately, stream processing applications often have strict latency requirements that need to find the best size of the microbatches and to adapt it dynamically based on the workload conditions as well as according to the characteristics of the underlying device and network. In this work, we aim at implementing latency‐aware adaptive microbatching techniques and algorithms for streaming compression applications targeting GPUs. 
The evaluation is conducted using the Lempel‐Ziv‐Storer‐Szymanski compression application considering different input workloads. As a general result of our work, we noticed that algorithms with elastic adaptation factors respond better for stable workloads, while algorithms with narrower targets respond better for highly unbalanced workloads.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Löff, Junior; Griebler, Dalvan; Fernandes, Luiz Gustavo Implementação Paralela do LU no NPB C++ Utilizando um Pipeline Implícito Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 37-40, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. @inproceedings{LOFF:ERAD:20, title = {Implementação Paralela do LU no NPB C++ Utilizando um Pipeline Implícito}, author = {Junior Löff and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/eradrs.2020.10750}, doi = {10.5753/eradrs.2020.10750}, year = {2020}, date = {2020-04-01}, booktitle = {XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)}, pages = {37-40}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Santa Maria, BR}, abstract = {In this work, an implicit pipeline with the map pattern was implemented in the LU application of the NAS Parallel Benchmarks in C++. LU has a data dependency over time, which makes parallelism exploitation difficult. It was converted from Fortran to C++ in order to be parallelized with different multi-core libraries. Using this strategy with the libraries yielded performance gains of up to 10.6% over the original version.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
Parallel Applications Modelling Group
GMAP is a research group at the Pontifical Catholic University of Rio Grande do Sul (PUCRS). Historically, the group has conducted several lines of research on modeling and adapting robust, real-world applications from different domains (physics, mathematics, geology, image processing, biology, aerospace, and many others) to run efficiently on High-Performance Computing (HPC) architectures, such as clusters.
In the last decade, new parallelism abstractions have been created through domain-specific languages (DSLs), libraries, and frameworks for the next generation of computer algorithms and architectures, such as embedded hardware and servers with accelerators like Graphics Processing Units (GPUs) or Field-Programmable Gate Arrays (FPGAs). These abstractions have been applied to stream processing and data-science-oriented applications. Concomitantly, since 2018, the group has conducted research using artificial intelligence to optimize applications in Medicine, Ecology, Industry, Agriculture, Education, Smart Cities, and other areas.
Research Lines
Applied Data Science
Parallelism Abstractions
The research line HSPA (High-level and Structured Parallelism Abstractions) aims to create programming interfaces for users/programmers who are not versed in the parallel programming paradigm. The idea is to offer a higher level of abstraction without compromising application performance. The interfaces developed in this research line target specific domains and can later be extended to other areas. The scope of the work is broad with respect to the technologies used to develop the interfaces and to exploit parallelism.
Parallel Application Modeling
Team
Prof. Dr. Luiz Gustavo Leão Fernandes
General Coordinator
Prof. Dr. Dalvan Griebler
Research Coordinator