2021
@article{VOGEL:SPE:21,
  title = {Providing High-Level Self-Adaptive Abstractions for Stream Parallelism on Multicores},
  author = {Adriano Vogel and Dalvan Griebler and Luiz G Fernandes},
  url = {https://doi.org/10.1002/spe.2948},
  doi = {10.1002/spe.2948},
  year = {2021},
  date = {2021-01-01},
  journal = {Software: Practice and Experience},
  volume = {na},
  number = {na},
  pages = {na},
  publisher = {Wiley Online Library},
  abstract = {Stream processing applications are common computing workloads that demand parallelism to increase their performance. As in the past, parallel programming remains a difficult task for application programmers. The complexity increases when application programmers must set non-intuitive parallelism parameters, i.e. the degree of parallelism. The main problem is that state-of-the-art libraries use a static degree of parallelism and are not sufficiently abstracted for developing stream processing applications. In this paper, we propose a self-adaptive regulation of the degree of parallelism to provide higher-level abstractions. Flexibility is provided to programmers with two new self-adaptive strategies, one is for performance experts, and the other abstracts the need to set a performance goal. We evaluated our solution using compiler transformation rules to generate parallel code with the SPar domain-specific language. The experimental results with real-world applications highlighted higher abstraction levels without significant performance degradation in comparison to static executions. The strategy for performance experts achieved slightly higher performance than the one that works without user-defined performance goals.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
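The self-adaptive regulation of the degree of parallelism described in the abstract above can be pictured as a feedback loop. The sketch below is a hypothetical illustration, not SPar's actual runtime: `adapt_degree`, its thresholds, and the synthetic throughput model are all assumptions made for the example.

```python
def adapt_degree(workers, throughput, target, min_w=1, max_w=16):
    """One step of a hypothetical feedback controller: grow the
    worker pool while measured throughput is below the target,
    shrink it when there is clear slack."""
    if throughput < 0.9 * target and workers < max_w:
        return workers + 1
    if throughput > 1.25 * target and workers > min_w:
        return workers - 1
    return workers

# Simulated monitoring loop: throughput grows with workers (capped),
# standing in for the runtime measurements a real strategy would take.
workers, target = 1, 400.0
for _ in range(10):
    throughput = min(120.0 * workers, 500.0)   # synthetic measurement
    workers = adapt_degree(workers, throughput, target)
print(workers)  # → 3, the smallest degree within the target band
```

The hysteresis band (grow below 90% of the target, shrink only above 125%) is one simple way to avoid oscillating between two degrees of parallelism.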
@inproceedings{GARCIA:PDP:21,
  title = {Introducing a Stream Processing Framework for Assessing Parallel Programming Interfaces},
  author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes},
  year = {2021},
  date = {2021-03-01},
  booktitle = {29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},
  publisher = {IEEE},
  address = {Valladolid, Spain},
  series = {PDP'21},
  abstract = {Stream Processing applications are spread across different sectors of industry and people's daily lives. The increasing data we produce, such as audio, video, image, and text are demanding quickly and efficiently computation. It can be done through Stream Parallelism, which is still a challenging task and most reserved for experts. We introduce a Stream Processing framework for assessing Parallel Programming Interfaces (PPIs). Our framework targets multi-core architectures and C++ stream processing applications, providing an API that abstracts the details of the stream operators of these applications. Therefore, users can easily identify all the basic operators and implement parallelism through different PPIs. In this paper, we present the proposed framework, implement three applications using its API, and show how it works, by using it to parallelize and evaluate the applications with the PPIs Intel TBB, FastFlow, and SPar. The performance results were consistent with the literature.},
  keywords = {},
  pubstate = {forthcoming},
  tppubtype = {inproceedings}
}
@inproceedings{VOGEL:PDP:21,
  title = {Towards On-the-fly Self-Adaptation of Stream Parallel Patterns},
  author = {Adriano Vogel and Gabriele Mencagli and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes},
  year = {2021},
  date = {2021-03-01},
  booktitle = {29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},
  publisher = {IEEE},
  address = {Valladolid, Spain},
  series = {PDP'21},
  abstract = {Stream processing applications compute streams of data and provide insightful results in a timely manner, where parallel computing is necessary for accelerating the application executions. Considering that these applications are becoming increasingly dynamic and long-running, a potential solution is to apply dynamic runtime changes. However, it is challenging for humans to continuously monitor and manually self-optimize the executions. In this paper, we propose self-adaptiveness of the parallel patterns used, enabling flexible on-the-fly adaptations. The proposed solution is evaluated with an existing programming framework and running experiments with a synthetic and a real-world application. The results show that the proposed solution is able to dynamically self-adapt to the most suitable parallel pattern configuration and achieve performance competitive with the best static cases. The feasibility of the proposed solution encourages future optimizations and other applicabilities.},
  keywords = {},
  pubstate = {forthcoming},
  tppubtype = {inproceedings}
}
2020
@article{BORDIN:IEEEAccess:20,
  title = {DSPBench: a Suite of Benchmark Applications for Distributed Data Stream Processing Systems},
  author = {Maycon Viana Bordin and Dalvan Griebler and Gabriele Mencagli and Claudio F R Geyer and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.1109/ACCESS.2020.3043948},
  doi = {10.1109/ACCESS.2020.3043948},
  year = {2020},
  date = {2020-12-01},
  journal = {IEEE Access},
  volume = {8},
  number = {na},
  pages = {222900-222917},
  publisher = {IEEE},
  abstract = {Systems enabling the continuous processing of large data streams have recently attracted the attention of the scientific community and industrial stakeholders. Data Stream Processing Systems (DSPSs) are complex and powerful frameworks able to ease the development of streaming applications in distributed computing environments like clusters and clouds. Several systems of this kind have been released and currently maintained as open source projects, like Apache Storm and Spark Streaming. Some benchmark applications have often been used by the scientific community to test and evaluate new techniques to improve the performance and usability of DSPSs. However, the existing benchmark suites lack of representative workloads coming from the wide set of application domains that can leverage the benefits offered by the stream processing paradigm in terms of near real-time performance. The goal of this paper is to present a new benchmark suite composed of 15 applications coming from areas like Finance, Telecommunications, Sensor Networks, Social Networks and others. This paper describes in detail the nature of these applications, their full workload characterization in terms of selectivity, processing cost, input size and overall memory occupation. In addition, it exemplifies the usefulness of our benchmark suite to compare real DSPSs by selecting Apache Storm and Spark Streaming for this analysis.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
@article{STEIN:CCPE:20,
  title = {Latency-aware adaptive micro-batching techniques for streamed data compression on graphics processing units},
  author = {Charles Michael Stein and Dinei A. Rockenbach and Dalvan Griebler and Massimo Torquati and Gabriele Mencagli and Marco Danelutto and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.1002/cpe.5786},
  doi = {10.1002/cpe.5786},
  year = {2020},
  date = {2020-05-01},
  journal = {Concurrency and Computation: Practice and Experience},
  volume = {na},
  number = {na},
  pages = {e5786},
  publisher = {Wiley Online Library},
  abstract = {Stream processing is a parallel paradigm used in many application domains. With the advance of graphics processing units (GPUs), their usage in stream processing applications has increased as well. The efficient utilization of GPU accelerators in streaming scenarios requires to batch input elements in microbatches, whose computation is offloaded on the GPU leveraging data parallelism within the same batch of data. Since data elements are continuously received based on the input speed, the bigger the microbatch size the higher the latency to completely buffer it and to start the processing on the device. Unfortunately, stream processing applications often have strict latency requirements that need to find the best size of the microbatches and to adapt it dynamically based on the workload conditions as well as according to the characteristics of the underlying device and network. In this work, we aim at implementing latency-aware adaptive microbatching techniques and algorithms for streaming compression applications targeting GPUs. The evaluation is conducted using the Lempel-Ziv-Storer-Szymanski compression application considering different input workloads. As a general result of our work, we noticed that algorithms with elastic adaptation factors respond better for stable workloads, while algorithms with narrower targets respond better for highly unbalanced workloads.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
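The latency/throughput tension the abstract above describes (bigger micro-batches improve GPU utilization but raise buffering latency) can be sketched as a tiny adaptation rule. This is a hypothetical illustration, not the paper's algorithm; `next_batch_size` and its thresholds are assumptions made for the example.

```python
def next_batch_size(batch, observed_latency_ms, target_ms,
                    min_batch=1, max_batch=4096):
    """Hypothetical adaptation step: grow the micro-batch while there
    is latency headroom (better device utilization), halve it as soon
    as the latency target is violated."""
    if observed_latency_ms > target_ms:
        return max(min_batch, batch // 2)      # back off quickly
    if observed_latency_ms < 0.8 * target_ms:
        return min(max_batch, batch * 2)       # probe for more throughput
    return batch

print(next_batch_size(256, 12.0, 10.0))  # → 128, over target: halve
print(next_batch_size(256, 5.0, 10.0))   # → 512, headroom: double
print(next_batch_size(256, 9.0, 10.0))   # → 256, near target: hold
```

Multiplicative decrease keeps latency violations short-lived, while the 80% guard band prevents the batch size from flapping around the target.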
@inproceedings{ANDRADE:ERAD:20,
  title = {Avaliação da Usabilidade de Interfaces de Programação Paralela para Sistemas Multi-Core em Aplicação de Vídeo},
  author = {Gabriella Andrade and Dalvan Griebler and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.5753/eradrs.2020.10781},
  doi = {10.5753/eradrs.2020.10781},
  year = {2020},
  date = {2020-04-01},
  booktitle = {XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)},
  pages = {149-150},
  publisher = {Sociedade Brasileira de Computação (SBC)},
  address = {Santa Maria, BR},
  abstract = {Com a ampla variedade de interfaces para a programação paralela em ambientes multi-core é difícil determinar quais destas oferecem a melhor usabilidade. Esse trabalho realiza um experimento comparando a paralelização de uma aplicação de vídeo com as ferramentas FastFlow, SPar e TBB. Os resultados revelaram que a SPar requer menos esforço na paralelização de uma aplicação de vídeo do que as demais interfaces de programação paralela.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
@inproceedings{GARCIA:ICCSA:20,
  title = {The Impact of CPU Frequency Scaling on Power Consumption of Computing Infrastructures},
  author = {Adriano Marques Garcia and Matheus Serpa and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes and Philippe O A Navaux},
  url = {https://doi.org/10.1007/978-3-030-58817-5_12},
  doi = {10.1007/978-3-030-58817-5_12},
  year = {2020},
  date = {2020-07-01},
  booktitle = {International Conference on Computational Science and its Applications (ICCSA)},
  volume = {12254},
  pages = {142-157},
  publisher = {Springer},
  address = {Cagliari, Italy},
  series = {ICCSA'20},
  abstract = {Since the demand for computing power increases, new architectures emerged to obtain better performance. Reducing the power and energy consumption of these architectures is one of the main challenges to achieving high-performance computing. Current research trends aim at developing new software and hardware techniques to achieve the best performance and energy trade-offs. In this work, we investigate the impact of different CPU frequency scaling techniques such as ondemand, performance, and powersave on the power and energy consumption of multi-core based computer infrastructure. We apply these techniques in PAMPAR, a parallel benchmark suite implemented in PThreads, OpenMP, MPI-1, and MPI-2 (spawn). We measure the energy and execution time of 10 benchmarks, varying the number of threads. Our results show that although powersave consumes up to 43.1% less power than performance and ondemand governors, it consumes the triple of energy due to the high execution time. Our experiments also show that the performance governor consumes up to 9.8% more energy than ondemand for CPU-bound benchmarks. Finally, our results show that PThreads has the lowest power consumption, consuming less than the sequential version for memory-bound benchmarks. Regarding performance, the performance governor achieved 3% of performance over the ondemand.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
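The power-versus-energy finding in the abstract above (powersave draws up to 43.1% less power yet consumes triple the energy) follows directly from E = P · t. The wattage and runtime figures below are illustrative assumptions, chosen only to show the arithmetic; they are not measurements from the paper.

```python
# E = P * t: a governor that draws less power but multiplies runtime
# can still lose on total energy. 100 W and 60 s are made-up numbers;
# the 43.1% power reduction and 3x runtime mirror the paper's report.
def energy_joules(power_watts, runtime_s):
    return power_watts * runtime_s

performance = energy_joules(100.0, 60.0)             # high power, short run
powersave = energy_joules(100.0 * (1 - 0.431), 3 * 60.0)
print(round(performance), round(powersave))  # → 6000 10242
```

Despite the lower instantaneous power draw, the longer runtime makes the powersave case roughly 1.7x more energy-hungry in this example.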
@inproceedings{HOFFMANN:SBLP:20,
  title = {Stream Parallelism Annotations for Multi-Core Frameworks},
  author = {Renato B. Hoffmann and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.1145/3427081.3427088},
  doi = {10.1145/3427081.3427088},
  year = {2020},
  date = {2020-10-01},
  booktitle = {XXIV Brazilian Symposium on Programming Languages (SBLP)},
  pages = {48-55},
  publisher = {ACM},
  address = {Natal, Brazil},
  series = {SBLP'20},
  abstract = {Data generation, collection, and processing is an important workload of modern computer architectures. Stream or high-intensity data flow applications are commonly employed in extracting and interpreting the information contained in this data. Due to the computational complexity of these applications, high-performance ought to be achieved using parallel computing. Indeed, the efficient exploitation of available parallel resources from the architecture remains a challenging task for the programmers. Techniques and methodologies are required to help shift the efforts from the complexity of parallelism exploitation to specific algorithmic solutions. To tackle this problem, we propose a methodology that provides the developer with a suitable abstraction layer between a clean and effective parallel programming interface targeting different multi-core parallel programming frameworks. We used standard C++ code annotations that may be inserted in the source code by the programmer. Then, a compiler parses C++ code with the annotations and generates calls to the desired parallel runtime API. Our experiments demonstrate the feasibility of our methodology and the performance of the abstraction layer, where the difference is negligible in four applications with respect to the state-of-the-art C++ parallel programming frameworks. Additionally, our methodology allows improving the application performance since the developers can choose the runtime that best performs in their system.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
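The annotation methodology above marks stages in sequential code and lets a compiler emit calls to a parallel runtime (TBB, FastFlow, etc.). The sketch below is a loose Python analogue of the generated structure, not SPar's C++ output: three pipeline stages (source, worker, sink) connected by queues, with all names invented for the example.

```python
# Hypothetical analogue of the pipeline an annotation compiler would
# generate: source -> middle -> sink stages connected by queues, each
# running on its own thread. Real SPar targets C++ runtimes instead.
from queue import Queue
from threading import Thread

def run_pipeline(items, worker_fn):
    q_in, q_out, results = Queue(), Queue(), []

    def source():                           # annotated input stage
        for x in items:
            q_in.put(x)
        q_in.put(None)                      # end-of-stream marker

    def middle():                           # annotated compute stage
        while (x := q_in.get()) is not None:
            q_out.put(worker_fn(x))
        q_out.put(None)

    def sink():                             # annotated output stage
        while (x := q_out.get()) is not None:
            results.append(x)

    threads = [Thread(target=t) for t in (source, middle, sink)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(run_pipeline(range(5), lambda x: x * x))  # → [0, 1, 4, 9, 16]
```

Because the stage bodies are ordinary functions, swapping the underlying runtime only changes the plumbing between them, which is the portability argument the paper makes.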
@inproceedings{LOFF:ERAD:20,
  title = {Implementação Paralela do LU no NPB C++ Utilizando um Pipeline Implícito},
  author = {Junior Löff and Dalvan Griebler and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.5753/eradrs.2020.10750},
  doi = {10.5753/eradrs.2020.10750},
  year = {2020},
  date = {2020-04-01},
  booktitle = {XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)},
  pages = {37-40},
  publisher = {Sociedade Brasileira de Computação (SBC)},
  address = {Santa Maria, BR},
  abstract = {Neste trabalho, um pipeline implícito com o padrão map foi implementado na aplicação LU do NAS Parallel Benchmarks em C++. O LU possui dependência de dados no tempo, o que dificulta a exploração do paralelismo. Ele foi convertido de Fortran para C++, a fim de ser paralelizado com diferentes bibliotecas de sistemas multi-core. O uso desta estratégia com as bibliotecas permitiu ganhos de desempenho de até 10.6% em relação a versão original.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
@inproceedings{ARAUJO:ERAD:20,
  title = {Implementação CUDA dos Kernels NPB},
  author = {Gabriell Alves de Araújo and Dalvan Griebler and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.5753/eradrs.2020.10762},
  doi = {10.5753/eradrs.2020.10762},
  year = {2020},
  date = {2020-04-01},
  booktitle = {XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)},
  pages = {85-88},
  publisher = {Sociedade Brasileira de Computação (SBC)},
  address = {Santa Maria, BR},
  abstract = {NAS Parallel Benchmarks (NPB) é um conjunto de benchmarks utilizado para avaliar hardware e software, que ao longo dos anos foi portado para diferentes frameworks. Concernente a GPUs, atualmente existem apenas versões OpenCL e OpenACC. Este trabalho contribui com a literatura provendo a primeira implementação CUDA completa dos kernels do NPB, realizando experimentos com carga de trabalho inédita e revelando novos fatos sobre o NPB.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
@inproceedings{HOFFMANN:ERAD:20,
  title = {Geração Automática de Código TBB na SPar},
  author = {Renato Barreto Hoffmann and Dalvan Griebler and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.5753/eradrs.2020.10765},
  doi = {10.5753/eradrs.2020.10765},
  year = {2020},
  date = {2020-04-01},
  booktitle = {XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)},
  pages = {97-100},
  publisher = {Sociedade Brasileira de Computação (SBC)},
  address = {Santa Maria, BR},
  abstract = {Técnicas de programação paralela são necessárias para extrair todo o potencial dos processadores de múltiplos núcleos. Para isso, foi criada a SPar, uma linguagem para abstração do paralelismo de stream. Esse trabalho descreve a implementação da geração de código automática para a biblioteca TBB na SPar, uma vez que gerava-se código para FastFlow. Os testes com aplicações resultaram em tempos de execução até 12,76 vezes mais rápidos.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
@inproceedings{JUSTO:ERAD:20,
  title = {Acelerando uma Aplicação de Detecção de Pistas com MPI},
  author = {Gabriel Justo and Renato Barreto Hoffmann and Adriano Vogel and Dalvan Griebler and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.5753/eradrs.2020.10770},
  doi = {10.5753/eradrs.2020.10770},
  year = {2020},
  date = {2020-04-01},
  booktitle = {XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)},
  pages = {117-120},
  publisher = {Sociedade Brasileira de Computação (SBC)},
  address = {Santa Maria, BR},
  abstract = {Aplicações de stream de vídeo demandam processamento de alto desempenho para atender requisitos de tempo real. Nesse cenário, a programação paralela distribuída é uma alternativa para acelerar e escalar o desempenho. Neste trabalho, o objetivo é paralelizar uma aplicação de detecção de pistas com a biblioteca MPI usando o padrão Farm e implementando duas estratégias de distribuição de tarefas. Os resultados evidenciam os ganhos de desempenho.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
Garcia, Adriano Marques; Griebler, Dalvan; Fernandes, Luiz Gustavo Proposta de uma Suíte de Benchmarks para Processamento de Stream em Sistemas Multi-Core Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 167-168, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. @inproceedings{GARCIA:ERAD:20, title = {Proposta de uma Suíte de Benchmarks para Processamento de Stream em Sistemas Multi-Core}, author = {Adriano Marques Garcia and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/eradrs.2020.10790}, doi = {10.5753/eradrs.2020.10790}, year = {2020}, date = {2020-04-01}, booktitle = {XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)}, pages = {167-168}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Santa Maria, BR}, abstract = {The growing volume of data generated by computing systems and the need to process this data quickly have been driving the stream processing field forward. However, there is still no benchmark suite to assist developers and researchers. This work proposes a benchmark suite for stream processing on multi-core architectures and discusses the characteristics required for its development.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Vogel, Adriano; Rista, Cassiano; Justo, Gabriel; Ewald, Endrius; Griebler, Dalvan; Mencagli, Gabriele; Fernandes, Luiz Gustavo Parallel Stream Processing with MPI for Video Analytics and Data Visualization Inproceedings doi High Performance Computing Systems, pp. 102-116, Springer, Cham, 2020. @inproceedings{VOGEL:CCIS:20, title = {Parallel Stream Processing with MPI for Video Analytics and Data Visualization}, author = {Adriano Vogel and Cassiano Rista and Gabriel Justo and Endrius Ewald and Dalvan Griebler and Gabriele Mencagli and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/978-3-030-41050-6_7}, doi = {10.1007/978-3-030-41050-6_7}, year = {2020}, date = {2020-02-01}, booktitle = {High Performance Computing Systems}, volume = {1171}, pages = {102-116}, publisher = {Springer}, address = {Cham}, series = {Communications in Computer and Information Science (CCIS)}, abstract = {The amount of data generated is increasing exponentially. However, processing data and producing fast results is a technological challenge. Parallel stream processing can be implemented for handling high-frequency and big data flows. The MPI parallel programming model offers low-level and flexible mechanisms for dealing with distributed architectures such as clusters. This paper aims to use it to accelerate video analytics and data visualization applications so that insights can be obtained as soon as the data arrives. Experiments were conducted with a Domain-Specific Language for Geospatial Data Visualization and a Person Recognizer video application. We applied the same stream parallelism strategy and two task distribution strategies. The dynamic task distribution achieved better performance than the static distribution in the HPC cluster. The data visualization achieved lower throughput than the video analytics due to its I/O-intensive operations. The MPI programming model also shows promising performance outcomes for stream processing applications.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
de Araujo, Gabriell Alves; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo Efficient NAS Parallel Benchmark Kernels with CUDA Inproceedings doi 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 9-16, IEEE, Västerås, Sweden, 2020. @inproceedings{ARAUJO:PDP:20, title = {Efficient NAS Parallel Benchmark Kernels with CUDA}, author = {Gabriell Alves de Araujo and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/PDP50117.2020.00009}, doi = {10.1109/PDP50117.2020.00009}, year = {2020}, date = {2020-03-01}, booktitle = {28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages = {9-16}, publisher = {IEEE}, address = {Västerås, Sweden}, series = {PDP'20}, abstract = {The NAS Parallel Benchmarks (NPB) are one of the standard benchmark suites used to evaluate parallel hardware and software. Many research efforts have tried to provide parallel versions apart from the original OpenMP and MPI ones. Concerning GPU accelerators, only OpenCL and OpenACC versions are available as consolidated implementations. Our goal is to provide an efficient parallel implementation of the five NPB kernels with CUDA. Our contribution covers different aspects. First, best parallel programming practices were followed to implement the NPB kernels using CUDA. Second, support for larger workloads (classes B and C) allows the memory of robust GPUs to be stressed and investigated. Third, we show that it is possible to make the NPB efficient and suitable for GPUs, although the benchmarks were originally designed for CPUs. We succeeded in achieving double the performance of the state-of-the-art in some cases, as well as implementing efficient memory usage. Fourth, we discuss new experiments comparing performance and memory usage against the state-of-the-art OpenACC and OpenCL versions on a relatively new GPU architecture. The experimental results also revealed that our version is the best one for all the NPB kernels compared to OpenACC and OpenCL. The greatest differences were observed for the FT and EP kernels.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
2019 |
Griebler, Dalvan; Vogel, Adriano; Sensi, Daniele De; Danelutto, Marco; Fernandes, Luiz Gustavo Simplifying and implementing service level objectives for stream parallelism Journal Article doi Journal of Supercomputing, pp. 1-26, 2019, ISSN: 0920-8542. @article{GRIEBLER:JS:19, title = {Simplifying and implementing service level objectives for stream parallelism}, author = {Dalvan Griebler and Adriano Vogel and Daniele De Sensi and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/s11227-019-02914-6}, doi = {10.1007/s11227-019-02914-6}, issn = {0920-8542}, year = {2019}, date = {2019-06-01}, journal = {Journal of Supercomputing}, pages = {1-26}, publisher = {Springer}, abstract = {Increasing attention has been given to providing service level objectives (SLOs) in stream processing applications due to performance and energy requirements, and because of the need to impose limits on resource usage while improving system utilization. Since current and next-generation computing systems intrinsically offer parallel architectures, software has to naturally exploit this parallelism. Implementing and meeting SLOs in existing applications is not a trivial task for application programmers, since the software development process, besides exploiting parallelism, requires the implementation of autonomic algorithms or strategies. This is a system-oriented programming approach and requires the management of multiple knobs and sensors (e.g., the number of threads to use, the clock frequency of the cores, etc.) so that the system can self-adapt at runtime. In this work, we introduce a new and simpler way to define SLOs in the application's source code, abstracting from the programmer all the details of the self-adaptive system implementation. The application programmer specifies which parts of the code to parallelize and the related SLOs that should be enforced. To reach this goal, source-to-source code transformation rules are implemented in our compiler, which automatically generates self-adaptive strategies to enforce, at runtime, the user-expressed objectives. The experiments highlighted promising results with simpler, effective, and efficient SLO implementations for real-world applications.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
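The SLO enforcement described in this entry boils down to a runtime feedback loop: measure the application, compare against the user-expressed objective, and adjust a knob such as the number of replicas. The sketch below is only an illustration of that control-loop idea, not the code SPar's compiler actually generates; the linear cost model and all numbers are assumptions.

```rust
// Toy self-adaptive controller: grow the number of replicas until a
// user-defined throughput SLO is met or no headroom remains.
// The linear cost model is an assumption for illustration only.

fn measured_throughput(replicas: u32, per_replica: f64, arrival_rate: f64) -> f64 {
    // Assumed model: each replica contributes `per_replica` items/s,
    // but the pipeline can never exceed the source's arrival rate.
    (replicas as f64 * per_replica).min(arrival_rate)
}

fn adapt(target: f64, per_replica: f64, arrival_rate: f64, max_replicas: u32) -> u32 {
    let mut replicas = 1;
    loop {
        let tp = measured_throughput(replicas, per_replica, arrival_rate);
        if tp >= target || replicas == max_replicas {
            return replicas; // SLO met, or no headroom left
        }
        replicas += 1; // additive increase, as in simple autonomic managers
    }
}

fn main() {
    // Hypothetical numbers: each replica handles 100 items/s, source emits 1000 items/s.
    let replicas = adapt(450.0, 100.0, 1000.0, 16);
    println!("{}", replicas); // prints 5
}
```

Real strategies must also shrink the degree of parallelism and damp oscillations; this sketch only grows it additively.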
Mencagli, Gabriele; Torquati, Massimo; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo Raising the Parallel Abstraction Level for Streaming Analytics Applications Journal Article doi IEEE Access, 7 , pp. 131944-131961, 2019. @article{MENCAGLI:IEEEAccess:19, title = {Raising the Parallel Abstraction Level for Streaming Analytics Applications}, author = {Gabriele Mencagli and Massimo Torquati and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/ACCESS.2019.2941183}, doi = {10.1109/ACCESS.2019.2941183}, year = {2019}, date = {2019-09-01}, journal = {IEEE Access}, volume = {7}, pages = {131944-131961}, publisher = {IEEE}, abstract = {In the stream processing domain, applications are represented by graphs of arbitrarily connected operators filled with their business logic code. The APIs of existing Stream Processing Systems (SPSs) ease the development of transformations that recur in streaming practice (e.g., filtering, aggregation, and joins). In contrast, their parallelism abstractions are quite limited, since they support only stateless operators or operators whose state is organized as a set of key-value pairs. This paper presents how the parallel patterns methodology can be revisited for sliding-window streaming analytics. Our vision fosters a design process where the application is a composition and nesting of ready-to-use patterns provided through a C++17 fluent interface. Our prototype implements the run-time system of the patterns in the FastFlow parallel library, expressing thread-based parallelism. The experimental analysis shows interesting outcomes. First, our pattern-based approach allows easy prototyping of different versions of the application, and the programmer can leverage nesting of patterns to increase performance (up to 37% in one of the two considered test-bed cases). Second, our FastFlow implementation outperforms (three times faster) the handmade porting of our patterns to popular JVM-based SPSs. Finally, in the concluding part of this paper, we explore the use of a task-based run-time system, deriving interesting insights into how to make our patterns library suitable for multiple backends.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Vogel, Adriano; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo Minimizing Self-Adaptation Overhead in Parallel Stream Processing for Multi-Cores Inproceedings doi Euro-Par 2019: Parallel Processing Workshops, pp. 12, Springer, Göttingen, Germany, 2019. @inproceedings{VOGEL:adaptive-overhead:AutoDaSP:19, title = {Minimizing Self-Adaptation Overhead in Parallel Stream Processing for Multi-Cores}, author = {Adriano Vogel and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/978-3-030-48340-1_3}, doi = {10.1007/978-3-030-48340-1_3}, year = {2019}, date = {2019-08-01}, booktitle = {Euro-Par 2019: Parallel Processing Workshops}, volume = {11997}, pages = {12}, publisher = {Springer}, address = {Göttingen, Germany}, series = {Lecture Notes in Computer Science}, abstract = {The stream processing paradigm is present in several applications that apply computations over continuous data flowing in the form of streams (e.g., video feeds, image processing, and data analytics). Employing self-adaptivity in stream processing applications can provide higher-level programming abstractions and autonomic resource management. However, there are cases where the performance is suboptimal. In this paper, the goal is to optimize parallelism adaptations in terms of stability and accuracy, which can improve the performance of parallel stream processing applications. Therefore, we present a new optimized self-adaptive strategy that is experimentally evaluated. The proposed solution provided high-level programming abstractions, reduced the adaptation overhead, and achieved performance competitive with the best static executions.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Rockenbach, Dinei A; Stein, Charles Michael; Griebler, Dalvan; Mencagli, Gabriele; Torquati, Massimo; Danelutto, Marco; Fernandes, Luiz Gustavo Stream Processing on Multi-cores with GPUs: Parallel Programming Models' Challenges Inproceedings doi International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 834-841, IEEE, Rio de Janeiro, Brazil, 2019. @inproceedings{ROCKENBACH:stream-multigpus:IPDPSW:19, title = {Stream Processing on Multi-cores with GPUs: Parallel Programming Models' Challenges}, author = {Dinei A. Rockenbach and Charles Michael Stein and Dalvan Griebler and Gabriele Mencagli and Massimo Torquati and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/IPDPSW.2019.00137}, doi = {10.1109/IPDPSW.2019.00137}, year = {2019}, date = {2019-05-01}, booktitle = {International Parallel and Distributed Processing Symposium Workshops (IPDPSW)}, pages = {834-841}, publisher = {IEEE}, address = {Rio de Janeiro, Brazil}, series = {IPDPSW'19}, abstract = {The stream processing paradigm is used in several scientific and enterprise applications to continuously compute results from data items coming from sources such as sensors. Fully exploiting the potential parallelism offered by current heterogeneous multi-cores equipped with one or more GPUs is still a challenge in the context of stream processing applications. In this work, our main goal is to present the parallel programming challenges that programmers face when exploiting CPU and GPU parallelism at the same time using traditional programming models. We highlight the parallelization methodology in two use cases (the Mandelbrot Streaming benchmark and PARSEC's Dedup application) to demonstrate the issues and benefits of using heterogeneous parallel hardware. The experiments conducted demonstrate how a high-level parallel programming model targeting stream processing, like the one offered by SPar, can be used to reduce the programming effort while still offering a good level of performance compared with state-of-the-art programming models.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Stein, Charles Michael; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo Stream Parallelism on the LZSS Data Compression Application for Multi-Cores with GPUs Inproceedings doi 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 247-251, IEEE, Pavia, Italy, 2019. @inproceedings{STEIN:LZSS-multigpu:PDP:19, title = {Stream Parallelism on the LZSS Data Compression Application for Multi-Cores with GPUs}, author = {Charles Michael Stein and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/EMPDP.2019.8671624}, doi = {10.1109/EMPDP.2019.8671624}, year = {2019}, date = {2019-02-01}, booktitle = {27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages = {247-251}, publisher = {IEEE}, address = {Pavia, Italy}, series = {PDP'19}, abstract = {GPUs have been used to accelerate different data-parallel applications. The challenge lies in using GPUs to accelerate stream processing applications. Our goal is to investigate and evaluate whether stream parallel applications may benefit from parallel execution on both CPU and GPU cores. In this paper, we introduce new parallel algorithms for the Lempel-Ziv-Storer-Szymanski (LZSS) data compression application. We implemented the algorithms targeting both CPUs and GPUs. GPUs were used with CUDA and OpenCL to exploit inner-algorithm data parallelism. Outer stream parallelism was exploited using CPU cores through SPar. The parallel implementation of LZSS achieved a 135-fold speedup using a multi-core CPU and two GPUs. We also observed speedups, using the same combined data-stream parallel exploitation techniques, in applications where we were not expecting them.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Maron, Carlos A F; Vogel, Adriano; Griebler, Dalvan; Fernandes, Luiz Gustavo Should PARSEC Benchmarks be More Parametric? A Case Study with Dedup Inproceedings doi 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 217-221, IEEE, Pavia, Italy, 2019. @inproceedings{MARON:parametric-parsec:PDP:19, title = {Should PARSEC Benchmarks be More Parametric? A Case Study with Dedup}, author = {Carlos A. F. Maron and Adriano Vogel and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/EMPDP.2019.8671592}, doi = {10.1109/EMPDP.2019.8671592}, year = {2019}, date = {2019-02-01}, booktitle = {27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages = {217-221}, publisher = {IEEE}, address = {Pavia, Italy}, series = {PDP'19}, abstract = {Parallel applications of the same domain can present similar patterns of behavior and characteristics. Characterizing common application behaviors can help in understanding performance aspects in real-world scenarios. One way to better understand and evaluate applications' characteristics is to use customizable/parametric benchmarks that enable users to represent important characteristics at run-time. We observed that parameterization techniques should be better exploited in the available benchmarks, especially in the stream processing domain. For instance, although widely used, the stream processing benchmarks available in PARSEC do not support the simulation and evaluation of relevant and modern characteristics. Therefore, our goal is to identify the stream parallelism characteristics present in PARSEC. We also implemented ready-to-use parameterization support and evaluated the application behaviors considering relevant performance metrics for stream parallelism (service time, throughput, latency). We chose Dedup as our case study. The experimental results have shown performance improvements with our parameterization support for Dedup. Moreover, this support, which is simple to use, increased the customization space for benchmark users. In the future, our solution can potentially be explored on different parallel architectures and parallel programming frameworks.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Serpa, Matheus S; Moreira, Francis B; Navaux, Philippe O A; Cruz, Eduardo H M; Diener, Matthias; Griebler, Dalvan; Fernandes, Luiz Gustavo Memory Performance and Bottlenecks in Multicore and GPU Architectures Inproceedings doi 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 233-236, IEEE, Pavia, Italy, 2019. @inproceedings{SERPA:memory-gpu-multicore:PDP:19, title = {Memory Performance and Bottlenecks in Multicore and GPU Architectures}, author = {Matheus S. Serpa and Francis B. Moreira and Philippe O. A. Navaux and Eduardo H. M. Cruz and Matthias Diener and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/EMPDP.2019.8671628}, doi = {10.1109/EMPDP.2019.8671628}, year = {2019}, date = {2019-02-01}, booktitle = {27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages = {233-236}, publisher = {IEEE}, address = {Pavia, Italy}, series = {PDP'19}, abstract = {Nowadays, several different architectures are available not only to industry but also to ordinary consumers. Traditional multicore processors, GPUs, accelerators such as the Sunway SW26010, and even energy-efficiency-driven processors such as the ARM family present very different architectural characteristics. This wide range of characteristics presents a challenge for application developers, who must deal with different instruction sets, memory hierarchies, and even different programming paradigms when programming for these architectures. Therefore, the same application can perform well on one architecture but poorly on another. To optimize an application, it is important to have a deep understanding of how it behaves on different architectures. Related work in this area mostly focuses on a limited analysis encompassing execution time and energy. In this paper, we perform a detailed investigation of the impact of the memory subsystem of different architectures, which is one of the most important aspects to be considered. For this study, we performed experiments on a Broadwell CPU and a Pascal GPU, using applications from the Rodinia benchmark suite. In this way, we were able to understand why an application performs well on one architecture and poorly on others.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Pieper, Ricardo; Griebler, Dalvan; Fernandes, Luiz Gustavo Structured Stream Parallelism for Rust Inproceedings doi XXIII Brazilian Symposium on Programming Languages (SBLP), pp. 54-61, ACM, Salvador, Brazil, 2019. @inproceedings{PIEPER:SBLP:19b, title = {Structured Stream Parallelism for Rust}, author = {Ricardo Pieper and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1145/3355378.3355384}, doi = {10.1145/3355378.3355384}, year = {2019}, date = {2019-10-01}, booktitle = {XXIII Brazilian Symposium on Programming Languages (SBLP)}, pages = {54-61}, publisher = {ACM}, address = {Salvador, Brazil}, series = {SBLP'19}, abstract = {Structured parallel programming has been studied and applied in several programming languages. This approach has proven to be suitable for abstracting low-level and architecture-dependent parallelism implementations. Our goal is to provide a structured and high-level library for the Rust language, targeting parallel stream processing applications for multi-core servers. Rust is an emerging programming language developed by the Mozilla Research group, focusing on performance, memory safety, and thread safety. However, it lacks parallel programming abstractions, especially for stream processing applications. This paper contributes a new API based on the structured parallel programming approach to simplify parallel software development. Our experiments highlight that our solution provides higher-level parallel programming abstractions for stream processing applications in Rust. We also show that the throughput and speedup are comparable to the state-of-the-art for certain workloads.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
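The kind of low-level code such a structured library abstracts away can be pictured with a hand-rolled three-stage pipeline in plain Rust. This is an illustrative sketch using only the standard library (threads and channels); it is not the API proposed in the paper, and `pipeline_sum_of_squares` is a hypothetical name.

```rust
use std::sync::mpsc;
use std::thread;

// A minimal hand-rolled pipeline: source -> worker stage -> sink.
// This sketches the explicit plumbing that a structured stream
// parallelism abstraction would hide from the programmer.
fn pipeline_sum_of_squares(items: Vec<u64>) -> u64 {
    let (tx1, rx1) = mpsc::channel();
    let (tx2, rx2) = mpsc::channel();

    // Stage 1: emit the stream items.
    let producer = thread::spawn(move || {
        for item in items {
            tx1.send(item).unwrap();
        }
    });

    // Stage 2: transform each element as it arrives.
    let worker = thread::spawn(move || {
        for item in rx1 {
            tx2.send(item * item).unwrap();
        }
    });

    // Stage 3: reduce in the sink (runs on the calling thread).
    let total: u64 = rx2.iter().sum();
    producer.join().unwrap();
    worker.join().unwrap();
    total
}

fn main() {
    let result = pipeline_sum_of_squares(vec![1, 2, 3, 4]);
    println!("{}", result); // 1 + 4 + 9 + 16 = 30
}
```

A structured abstraction replaces the explicit channel plumbing above with declarative stage combinators, which is precisely the gap the paper addresses.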
Vogel, Adriano; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo Seamless Parallelism Management for Multi-core Stream Processing Inproceedings doi Advances in Parallel Computing, Proceedings of the International Conference on Parallel Computing (ParCo), pp. 533-542, IOS Press, Prague, Czech Republic, 2019. @inproceedings{VOGEL:PARCO:19, title = {Seamless Parallelism Management for Multi-core Stream Processing}, author = {Adriano Vogel and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.3233/APC200082}, doi = {10.3233/APC200082}, year = {2019}, date = {2019-09-01}, booktitle = {Advances in Parallel Computing, Proceedings of the International Conference on Parallel Computing (ParCo)}, volume = {36}, pages = {533-542}, publisher = {IOS Press}, address = {Prague, Czech Republic}, series = {ParCo'19}, abstract = {Video streaming applications have critical performance requirements for dealing with fluctuating workloads and providing results in real-time. As a consequence, the majority of these applications demand parallelism for delivering quality of service to users. Although high-level and structured parallel programming aims at facilitating parallelism exploitation, there are still several issues to be addressed for improving existing parallel programming abstractions. In this paper, we employ self-adaptivity for stream processing in order to seamlessly manage the application parallelism configurations at run-time: a new strategy relieves application programmers of the need to set time-consuming and error-prone parallelism parameters. The new strategy was implemented and validated on SPar. The results show that the proposed solution increases the level of abstraction while achieving competitive performance.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
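The run-time regulation described above can be reduced to a small decision rule: compare the throughput observed in the last monitoring window against a target and grow or shrink the number of worker replicas. The sketch below illustrates the general technique only; it is not SPar's actual strategy, and `adapt_replicas` and the 10% tolerance band are assumptions made for this example.

```rust
// Hypothetical sketch of a self-adaptive parallelism controller:
// given the throughput observed in the last monitoring window,
// decide the next number of worker replicas. Not SPar's algorithm.
fn adapt_replicas(current: usize, observed: f64, target: f64, max: usize) -> usize {
    // A 10% tolerance band around the target avoids oscillation.
    if observed < target * 0.9 && current < max {
        current + 1 // falling short of the goal: add a replica
    } else if observed > target * 1.1 && current > 1 {
        current - 1 // overshooting the goal: release a replica
    } else {
        current // within the band: keep the current configuration
    }
}

fn main() {
    // Target of 100 items/s: 70 items/s with 4 replicas -> scale up.
    assert_eq!(adapt_replicas(4, 70.0, 100.0, 8), 5);
    // 130 items/s with 4 replicas -> scale down.
    assert_eq!(adapt_replicas(4, 130.0, 100.0, 8), 3);
    // On target -> unchanged.
    assert_eq!(adapt_replicas(4, 100.0, 100.0, 8), 4);
    println!("ok");
}
```

In a real runtime this function would be called periodically by a monitor thread, which is what allows the degree of parallelism to track workload fluctuations without user-defined parameters.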
Rockenbach, Dinei A; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo High-Level Stream Parallelism Abstractions with SPar Targeting GPUs Inproceedings doi Parallel Computing is Everywhere, Proceedings of the International Conference on Parallel Computing (ParCo), pp. 543-552, IOS Press, Prague, Czech Republic, 2019. @inproceedings{ROCKENBACH:PARCO:19, title = {High-Level Stream Parallelism Abstractions with SPar Targeting GPUs}, author = {Dinei A. Rockenbach and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.3233/APC200083}, doi = {10.3233/APC200083}, year = {2019}, date = {2019-09-01}, booktitle = {Parallel Computing is Everywhere, Proceedings of the International Conference on Parallel Computing (ParCo)}, volume = {36}, pages = {543-552}, publisher = {IOS Press}, address = {Prague, Czech Republic}, series = {ParCo'19}, abstract = {The combined exploitation of stream and data parallelism is demonstrating encouraging performance results in the literature for heterogeneous architectures, which are present in every computer system today. However, providing efficient parallel software targeting those architectures requires significant programming effort and expertise. The SPar domain-specific language already represents a solution to this problem, providing proven high-level programming abstractions for multi-core architectures. In this paper, we enrich the SPar language by adding support for GPUs. New transformation rules are designed for generating parallel code using stream and data parallel patterns. Our experiments revealed that these transformation rules are able to improve performance while the high-level programming abstractions are maintained.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Justo, Gabriel B; Vogel, Adriano; Griebler, Dalvan; Fernandes, Luiz G Acelerando o Reconhecimento de Pessoas em Vídeos com MPI Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 4, Sociedade Brasileira de Computação (SBC), Três de Maio, BR, 2019. @inproceedings{JUSTO:ERAD:19, title = {Acelerando o Reconhecimento de Pessoas em Vídeos com MPI}, author = {Gabriel B. Justo and Adriano Vogel and Dalvan Griebler and Luiz G. Fernandes}, url = {https://gmap.pucrs.br/dalvan/papers/2019/CR_ERAD_IC_Justo_2019.pdf}, year = {2019}, date = {2019-04-01}, booktitle = {Escola Regional de Alto Desempenho (ERAD/RS)}, pages = {4}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Três de Maio, BR}, abstract = {Many video processing applications demand parallelism to increase their performance. The goal of this work is to implement and test distributed-processing versions of facial recognition applications for video. The implementations were evaluated with respect to their performance. The results showed that these applications can achieve significant speedup in distributed environments.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
de Araujo, Gabriell A; Griebler, Dalvan; Fernandes, Luiz G Revisando a Programação Paralela com CUDA nos Benchmarks EP e FT Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 4, Sociedade Brasileira de Computação (SBC), Três de Maio, BR, 2019. @inproceedings{ARAUJO:gpu:ERAD:19, title = {Revisando a Programação Paralela com CUDA nos Benchmarks EP e FT}, author = {Gabriell A. de Araujo and Dalvan Griebler and Luiz G. Fernandes}, url = {https://gmap.pucrs.br/dalvan/papers/2019/CR_ERAD_IC_Gabriell_2019.pdf}, year = {2019}, date = {2019-04-01}, booktitle = {Escola Regional de Alto Desempenho (ERAD/RS)}, pages = {4}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Três de Maio, BR}, abstract = {This work aims to extend the studies on the NAS Parallel Benchmarks (NPB), which have relevant gaps in the context of GPUs. The main works in the literature consist of old implementations, leaving room for possible questions. In this direction, new GPU parallelization studies of the EP and FT applications were carried out. The results were similar to or better than the state of the art.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
de Araujo, Gabriell A; Griebler, Dalvan; Fernandes, Luiz G Avaliando o Paralelismo de Stream com Pthreads, OpenMP e SPar em Aplicações de Vídeo Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 4, Sociedade Brasileira de Computação (SBC), Três de Maio, BR, 2019. @inproceedings{ARAUJO:stream:ERAD:19, title = {Avaliando o Paralelismo de Stream com Pthreads, OpenMP e SPar em Aplicações de Vídeo}, author = {Gabriell A. de Araujo and Dalvan Griebler and Luiz G. Fernandes}, url = {https://gmap.pucrs.br/dalvan/papers/2019/CR_ERAD_IC_Araujo_2019.pdf}, year = {2019}, date = {2019-04-01}, booktitle = {Escola Regional de Alto Desempenho (ERAD/RS)}, pages = {4}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Três de Maio, BR}, abstract = {Aiming to extend the evaluation studies of SPar, we carried out a comparative analysis of SPar, Pthreads, and OpenMP in stream applications. The results reveal that the performance of the parallel code generated by SPar matches the robust implementations in the well-established Pthreads and OpenMP libraries. Nevertheless, we also found points of possible improvement in SPar.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Rista, Cassiano; Griebler, Dalvan; Fernandes, Luiz G Proposta de Grau de Paralelismo Autoadaptativo com MPI-2 para a DSL SPar Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 4, Sociedade Brasileira de Computação (SBC), Três de Maio, BR, 2019. @inproceedings{RISTA:ERAD:19, title = {Proposta de Grau de Paralelismo Autoadaptativo com MPI-2 para a DSL SPar}, author = {Cassiano Rista and Dalvan Griebler and Luiz G. Fernandes}, url = {https://gmap.pucrs.br/dalvan/papers/2019/CR_ERAD_PG_Rista_2019.pdf}, year = {2019}, date = {2019-04-01}, booktitle = {Escola Regional de Alto Desempenho (ERAD/RS)}, pages = {4}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Três de Maio, BR}, abstract = {This paper presents the design of a self-adaptive module for controlling the degree of parallelism, to be integrated into the SPar DSL. The module, aimed at distributed parallel stream applications, supports process creation at run-time, scheduling policy selection, load balancing, ordering, and serialization, adapting the degree of parallelism autonomously without requiring the user to define thresholds.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Maron, Carlos A F; Griebler, Dalvan; Fernandes, Luiz G Benchmark Paramétrico para o Domínio do Paralelismo de Stream: Um Estudo de Caso com o Ferret da Suíte PARSEC Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 4, Sociedade Brasileira de Computação (SBC), Três de Maio, BR, 2019. @inproceedings{MARON:ERAD:19, title = {Benchmark Paramétrico para o Domínio do Paralelismo de Stream: Um Estudo de Caso com o Ferret da Suíte PARSEC}, author = {Carlos A. F. Maron and Dalvan Griebler and Luiz G. Fernandes}, url = {https://gmap.pucrs.br/dalvan/papers/2019/CR_ERAD_PG_Maron_2019.pdf}, year = {2019}, date = {2019-04-01}, booktitle = {Escola Regional de Alto Desempenho (ERAD/RS)}, pages = {4}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Três de Maio, BR}, abstract = {Benchmarks are synthetic applications used to evaluate and compare the performance of computing systems. Making them parameterizable can generate distinct execution conditions. However, this technique is little explored in traditional and current benchmarks. Therefore, this work evaluates the impact of parameterizing stream-domain characteristics in Ferret.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Vogel, Adriano; Griebler, Dalvan; Fernandes, Luiz G Adaptando o Paralelismo em Aplicações de Stream Conforme Objetivos de Throughput Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 4, Sociedade Brasileira de Computação (SBC), Três de Maio, BR, 2019. @inproceedings{VOGEL:ERAD:19, title = {Adaptando o Paralelismo em Aplicações de Stream Conforme Objetivos de Throughput}, author = {Adriano Vogel and Dalvan Griebler and Luiz G. Fernandes}, url = {https://gmap.pucrs.br/dalvan/papers/2019/CR_ERAD_PG_Vogel_2019.pdf}, year = {2019}, date = {2019-04-01}, booktitle = {Escola Regional de Alto Desempenho (ERAD/RS)}, pages = {4}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Três de Maio, BR}, abstract = {Stream processing applications are characterized by dynamic executions with variations in load and resource demand. Adapting the degree of parallelism is an alternative for responding to such variations during execution. This work presents a parallelism abstraction for the SPar DSL through a strategy that autonomously adapts the degree of parallelism according to performance goals.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Rockenbach, Dinei A; Griebler, Dalvan; Fernandes, Luiz G Proposta de Suporte ao Paralelismo de GPU na SPar Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 4, Sociedade Brasileira de Computação (SBC), Três de Maio, BR, 2019. @inproceedings{ROCKENBACH:ERAD:19, title = {Proposta de Suporte ao Paralelismo de GPU na SPar}, author = {Dinei A. Rockenbach and Dalvan Griebler and Luiz G. Fernandes}, url = {https://gmap.pucrs.br/dalvan/papers/2019/CR_ERAD_PG_Dinei_2019.pdf}, year = {2019}, date = {2019-04-01}, booktitle = {Escola Regional de Alto Desempenho (ERAD/RS)}, pages = {4}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Três de Maio, BR}, abstract = {GPUs (Graphics Processing Units) have stood out due to their high parallel processing power and their growing presence in computing devices. However, exploiting them still requires considerable knowledge and effort from the developer. This work proposes GPU parallelism support in SPar, which provides a high level of abstraction through a C++ annotation-based language.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
2018 |
Griebler, Dalvan; Hoffmann, Renato B; Danelutto, Marco; Fernandes, Luiz Gustavo High-Level and Productive Stream Parallelism for Dedup, Ferret, and Bzip2 Journal Article doi International Journal of Parallel Programming, 47 (1), pp. 253-271, 2018, ISSN: 1573-7640. @article{GRIEBLER:IJPP:18, title = {High-Level and Productive Stream Parallelism for Dedup, Ferret, and Bzip2}, author = {Dalvan Griebler and Renato B. Hoffmann and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/s10766-018-0558-x}, doi = {10.1007/s10766-018-0558-x}, issn = {1573-7640}, year = {2018}, date = {2018-02-01}, journal = {International Journal of Parallel Programming}, volume = {47}, number = {1}, pages = {253-271}, publisher = {Springer}, abstract = {Parallel programming has been a challenging task for application programmers. Stream processing is an application domain present in several scientific, enterprise, and financial areas that lacks suitable abstractions to exploit parallelism. Our goal is to assess the feasibility of state-of-the-art frameworks/libraries (Pthreads, TBB, and FastFlow) and the SPar domain-specific language for real-world streaming applications (Dedup, Ferret, and Bzip2) targeting multi-core architectures. SPar was specially designed to provide high-level and productive stream parallelism abstractions, supporting programmers with standard C++-11 annotations. For the experiments, we implemented three streaming applications. We discuss SPar’s programmability advantages compared to the frameworks in terms of productivity and structured parallel programming. The results demonstrate that SPar improves productivity and provides the necessary features to achieve performance similar to the state-of-the-art.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Griebler, Dalvan; Hoffmann, Renato B; Danelutto, Marco; Fernandes, Luiz Gustavo Stream Parallelism with Ordered Data Constraints on Multi-Core Systems Journal Article doi Journal of Supercomputing, 75 (8), pp. 4042-4061, 2018, ISSN: 0920-8542. @article{GRIEBLER:JS:18, title = {Stream Parallelism with Ordered Data Constraints on Multi-Core Systems}, author = {Dalvan Griebler and Renato B. Hoffmann and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/s11227-018-2482-7}, doi = {10.1007/s11227-018-2482-7}, issn = {0920-8542}, year = {2018}, date = {2018-07-01}, journal = {Journal of Supercomputing}, volume = {75}, number = {8}, pages = {4042-4061}, publisher = {Springer}, abstract = {It is often a challenge to keep input/output tasks/results in order for parallel computations over data streams, particularly when stateless task operators are replicated to increase parallelism in the presence of irregular tasks. Maintaining input/output order requires additional coding effort and may significantly impact the application's actual throughput. Thus, we propose a new implementation technique designed to be easily integrated with any of the existing C++ parallel programming frameworks that support stream parallelism. In this paper, it is first implemented and studied using SPar, our high-level domain-specific language for stream parallelism. We discuss the results of a set of experiments with real-world applications revealing how significant performance improvements may be achieved when our proposed solution is integrated within SPar, especially for data compression applications.
Also, we show the results of experiments performed after integrating our solution within FastFlow and TBB, revealing no significant overheads.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
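At its core, the ordering technique the paper evaluates amounts to tagging each stream item with a sequence number and having the collector buffer out-of-order results until they can be emitted contiguously. The following is a generic Rust sketch of that idea, not the paper's C++ implementation; `reorder` and its signature are illustrative.

```rust
use std::collections::BTreeMap;

// Sketch of output reordering for replicated stream operators:
// each item carries a sequence number assigned at the stream source;
// workers may finish out of order, so the collector buffers results
// and releases them strictly in sequence. Names are illustrative.
fn reorder(results: Vec<(usize, &str)>) -> Vec<&str> {
    let mut pending = BTreeMap::new(); // results waiting for their turn
    let mut next = 0; // next sequence number to emit
    let mut output = Vec::new();
    for (seq, value) in results {
        pending.insert(seq, value);
        // Flush every result that is now contiguous with the output.
        while let Some(value) = pending.remove(&next) {
            output.push(value);
            next += 1;
        }
    }
    output
}

fn main() {
    // Workers completed items 2, 0, 3, 1 in that (scrambled) order.
    let out = reorder(vec![(2, "c"), (0, "a"), (3, "d"), (1, "b")]);
    println!("{:?}", out); // ["a", "b", "c", "d"]
}
```

The extra buffering is the cost the abstract refers to: out-of-order results occupy memory until their predecessors arrive, which is why a low-overhead implementation of this scheme matters for throughput.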
Maliszewski, Anderson M; Griebler, Dalvan; Schepke, Claudio; Ditter, Alexander; Fey, Dietmar; Fernandes, Luiz Gustavo The NAS Benchmark Kernels for Single and Multi-Tenant Cloud Instances with LXC/KVM Inproceedings doi International Conference on High Performance Computing & Simulation (HPCS), IEEE, Orléans, France, 2018. @inproceedings{NAS_cloud_LXC_KVM:HPCS:2018, title = {The NAS Benchmark Kernels for Single and Multi-Tenant Cloud Instances with LXC/KVM}, author = {Anderson M Maliszewski and Dalvan Griebler and Claudio Schepke and Alexander Ditter and Dietmar Fey and Luiz Gustavo Fernandes}, doi = {10.1109/HPCS.2018.00066}, year = {2018}, date = {2018-07-01}, booktitle = {International Conference on High Performance Computing & Simulation (HPCS)}, publisher = {IEEE}, address = {Orléans, France}, abstract = {Private IaaS clouds are an attractive environment for scientific workloads and applications. They provide advantages such as almost instantaneous availability of high-performance computing in a single node as well as compute clusters, and easy access for researchers and users who do not have access to conventional supercomputers. Furthermore, a cloud infrastructure provides elasticity and scalability to ensure and manage any software dependency on the system with no third-party dependency for researchers. However, one of the biggest challenges is to avoid significant performance degradation when migrating these applications from physical nodes to a cloud environment. Moreover, there is a lack of research investigating multi-tenant cloud instances. In this paper, our goal is to perform a comparative performance evaluation of scientific applications with single- and multi-tenancy cloud instances using KVM and LXC virtualization technologies under private cloud conditions. All analyses and evaluations were carried out based on the NAS Benchmark kernels to simulate different types of workloads. We applied statistical significance tests to highlight the differences.
The results have shown that applications running on LXC-based cloud instances outperform KVM-based cloud instances in 93.75% of the experiments with respect to single-tenant instances. Regarding multi-tenant instances, LXC outperforms KVM in 45% of the results, where the performance differences were not as significant as expected.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Rista, Cassiano; Teixeira, Marcelo; Griebler, Dalvan; Fernandes, Luiz Gustavo Evaluating, Estimating, and Improving Network Performance in Container-based Clouds Inproceedings doi 23rd IEEE Symposium on Computers and Communications (ISCC), IEEE, Natal, Brazil, 2018. @inproceedings{network_performance_container:ISCC:2018, title = {Evaluating, Estimating, and Improving Network Performance in Container-based Clouds}, author = {Cassiano Rista and Marcelo Teixeira and Dalvan Griebler and Luiz Gustavo Fernandes}, doi = {10.1109/ISCC.2018.8538558}, year = {2018}, date = {2018-04-16}, booktitle = {23rd IEEE Symposium on Computers and Communications (ISCC)}, publisher = {IEEE}, address = {Natal, Brazil}, abstract = {Cloud computing has recently attracted a great deal of interest from both industry and academia, emerging as an important paradigm to improve resource utilization, efficiency, flexibility, and pay-per-use. However, cloud platforms inherently include a virtualization layer that imposes performance degradation on network-intensive applications. Thus, it is crucial to anticipate possible performance degradation to resolve system bottlenecks. This paper uses the Petri Nets approach to create different models for evaluating, estimating, and improving network performance in container-based cloud environments. Based on model estimations, we assessed the network bandwidth utilization of the system under different setups. Then, by identifying possible bottlenecks, we show how the system could be modified to hopefully improve performance. We tested how the model would behave through real-world experiments. When the model indicates probable bandwidth saturation, we propose a link aggregation approach to increase bandwidth, using lightweight virtualization to reduce virtualization overhead. Results reveal that our model anticipates the structural and behavioral characteristics of the network in the cloud environment.
Therefore, it systematically improves network efficiency, which saves effort, time, and money.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Griebler, Dalvan; Vogel, Adriano; Maron, Carlos A F; Maliszewski, Anderson M; Schepke, Claudio; Fernandes, Luiz Gustavo Performance of Data Mining, Media, and Financial Applications under Private Cloud Conditions Inproceedings doi 23rd IEEE Symposium on Computers and Communications (ISCC), IEEE, Natal, Brazil, 2018. @inproceedings{parsec_cloudstack_lxc_kvm:ISCC:2018, title = {Performance of Data Mining, Media, and Financial Applications under Private Cloud Conditions}, author = {Dalvan Griebler and Adriano Vogel and Carlos A F Maron and Anderson M Maliszewski and Claudio Schepke and Luiz Gustavo Fernandes}, doi = {10.1109/ISCC.2018.8538759}, year = {2018}, date = {2018-06-01}, booktitle = {23rd IEEE Symposium on Computers and Communications (ISCC)}, publisher = {IEEE}, address = {Natal, Brazil}, abstract = {This paper contributes to a performance analysis of real-world workloads under private cloud conditions. We selected six benchmarks from PARSEC related to three mainstream application domains (financial, data mining, and media processing). Our goal was to evaluate these application domains in different cloud instances and deployment environments, concerning container or kernel-based instances and using dedicated or shared machine resources. Experiments have shown that performance varies according to the application characteristics, virtualization technology, and cloud environment. Results highlighted that financial, data mining, and media processing applications running in the LXC instances tend to outperform KVM when there is a dedicated machine resource environment. However, when two instances are sharing the same machine resources, these applications tend to achieve better performance in the KVM instances. 
Finally, financial applications achieved better performance in the cloud than media and data mining.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Ewald, Endrius; Vogel, Adriano; Griebler, Dalvan; Manssour, Isabel; Fernandes, Luiz Gustavo Suporte ao Processamento Paralelo e Distribuído em uma DSL para Visualização de Dados Geoespaciais Inproceedings XIX Simpósio em Sistemas Computacionais de Alto Desempenho, pp. 1-12, SBC, São Paulo, Brazil, 2018. @inproceedings{EWALD:WSCAD:18b, title = {Suporte ao Processamento Paralelo e Distribuído em uma DSL para Visualização de Dados Geoespaciais}, author = {Endrius Ewald and Adriano Vogel and Dalvan Griebler and Isabel Manssour and Luiz Gustavo Fernandes}, url = {https://gmap.pucrs.br/gmap/files/publications/articles/9f9c9dc7d5d4eaf8bb379c4bef8e00cb.pdf}, year = {2018}, date = {2018-10-01}, booktitle = {XIX Simpósio em Sistemas Computacionais de Alto Desempenho}, pages = {1-12}, publisher = {SBC}, address = {São Paulo, Brazil}, abstract = {The amount of data generated worldwide related to geolocalization has exponentially increased. However, the fast processing of this amount of data is a challenge from the programming perspective, and many available solutions require learning a variety of tools and programming languages. This paper introduces the support for parallel and distributed processing in a DSL for Geospatial Data Visualization to speed up the data pre-processing phase. The results have shown the MPI version with dynamic data distribution performing better under medium and large data set files, while the MPI-I/O version achieved the best performance with small data set files.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Vogel, Adriano; Griebler, Dalvan; Sensi, Daniele De; Danelutto, Marco; Fernandes, Luiz Gustavo Autonomic and Latency-Aware Degree of Parallelism Management in SPar Inproceedings doi Euro-Par 2018: Parallel Processing Workshops, pp. 28-39, Springer, Turin, Italy, 2018. @inproceedings{VOGEL:Adaptive-Latency-SPar:AutoDaSP:18, title = {Autonomic and Latency-Aware Degree of Parallelism Management in SPar}, author = {Adriano Vogel and Dalvan Griebler and Daniele De Sensi and Marco Danelutto and Luiz Gustavo Fernandes}, url = {http://dx.doi.org/10.1007/978-3-030-10549-5_3}, doi = {10.1007/978-3-030-10549-5_3}, year = {2018}, date = {2018-08-01}, booktitle = {Euro-Par 2018: Parallel Processing Workshops}, pages = {28-39}, publisher = {Springer}, address = {Turin, Italy}, series = {Lecture Notes in Computer Science}, abstract = {Stream processing applications became a representative workload in current computing systems. A significant part of these applications demands parallelism to increase performance. However, programmers are often facing a trade-off between coding productivity and performance when introducing parallelism. SPar was created for balancing this trade-off to the application programmers by using the C++11 attributes’ annotation mechanism. In SPar and other programming frameworks for stream processing applications, the manual definition of the number of replicas to be used for the stream operators is a challenge. In addition to that, low latency is required by several stream processing applications. We noted that explicit latency requirements are poorly considered on the state-of-the-art parallel programming frameworks. Since there is a direct relationship between the number of replicas and the latency of the application, in this work we propose an autonomic and adaptive strategy to choose the proper number of replicas in SPar to address latency constraints. 
We experimentally evaluated our implemented strategy and demonstrated its effectiveness on a real-world application, showing that our adaptive strategy can provide higher abstraction levels while automatically managing the latency.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Griebler, Dalvan; Sensi, Daniele De; Vogel, Adriano; Danelutto, Marco; Fernandes, Luiz Gustavo Service Level Objectives via C++11 Attributes Inproceedings doi Euro-Par 2018: Parallel Processing Workshops, pp. 745-756, Springer, Turin, Italy, 2018. @inproceedings{GRIEBLER:SLO-SPar-Nornir:REPARA:18, title = {Service Level Objectives via C++11 Attributes}, author = {Dalvan Griebler and Daniele De Sensi and Adriano Vogel and Marco Danelutto and Luiz Gustavo Fernandes}, url = {http://dx.doi.org/10.1007/978-3-030-10549-5_58}, doi = {10.1007/978-3-030-10549-5_58}, year = {2018}, date = {2018-08-01}, booktitle = {Euro-Par 2018: Parallel Processing Workshops}, pages = {745-756}, publisher = {Springer}, address = {Turin, Italy}, series = {Lecture Notes in Computer Science}, abstract = {In recent years, increasing attention has been given to the possibility of guaranteeing Service Level Objectives (SLOs) to users about their applications, either regarding performance or power consumption. SLO can be implemented for parallel applications since they can provide many control knobs (e.g., the number of threads to use, the clock frequency of the cores, etc.) to tune the performance and power consumption of the application. Different from most of the existing approaches, we target sequential stream processing applications by proposing a solution based on C++ annotations. The user specifies which parts of the code to parallelize and what type of requirements should be enforced on that part of the code. Our solution first automatically parallelizes the annotated code and then applies self-adaptation approaches at run-time to enforce the user-expressed objectives. 
We ran experiments on different real-world applications, showing its simplicity and effectiveness.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Griebler, Dalvan; Loff, Junior; Mencagli, Gabriele; Danelutto, Marco; Fernandes, Luiz Gustavo Efficient NAS Benchmark Kernels with C++ Parallel Programming Inproceedings doi 26th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 733-740, IEEE, Cambridge, UK, 2018. @inproceedings{GRIEBLER:NAS-CPP:PDP:18, title = {Efficient NAS Benchmark Kernels with C++ Parallel Programming}, author = {Dalvan Griebler and Junior Loff and Gabriele Mencagli and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/PDP2018.2018.00120}, doi = {10.1109/PDP2018.2018.00120}, year = {2018}, date = {2018-03-01}, booktitle = {26th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages = {733-740}, publisher = {IEEE}, address = {Cambridge, UK}, series = {PDP'18}, abstract = {Benchmarking is a way to study the performance of new architectures and parallel programming frameworks. Well-established benchmark suites such as the NAS Parallel Benchmarks (NPB) comprise legacy codes that still lack portability to C++ language. As a consequence, a set of high-level and easy-to-use C++ parallel programming frameworks cannot be tested in NPB. Our goal is to describe a C++ porting of the NPB kernels and to analyze the performance achieved by different parallel implementations written using the Intel TBB, OpenMP and FastFlow frameworks for Multi-Cores. The experiments show an efficient code porting from Fortran to C++ and an efficient parallelization on average.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Hoffmann, Renato B; Griebler, Dalvan; Fernandes, Luiz G Paralelização de uma Aplicação de Detecção e Eliminação de Ruídos em Streaming de Vídeo com a DSL SPar Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 2, Sociedade Brasileira de Computação (SBC), Porto Alegre, BR, 2018. @inproceedings{HOFFMANN:ERAD:18, title = {Paralelização de uma Aplicação de Detecção e Eliminação de Ruídos em Streaming de Vídeo com a DSL SPar}, author = {Renato B. Hoffmann and Dalvan Griebler and Luiz G. Fernandes}, url = {https://gmap.pucrs.br/dalvan/papers/2018/CR_ERAD_IC_Hoffmann_2018.pdf}, year = {2018}, date = {2018-04-01}, booktitle = {Escola Regional de Alto Desempenho (ERAD/RS)}, pages = {2}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Porto Alegre, BR}, abstract = {Restauração de imagem é uma importante etapa de qualquer sistema de computação gráfica. Este trabalho tem como objetivo apresentar e avaliar o paralelismo de Denoiser, uma aplicação para detecção e eliminação de ruído em streaming de vídeo. Foram avaliados o speed-up e programabilidade das interfaces SPar, Thread Building Blocks e FastFlow. Os resultados mostram que a SPar obteve bons resultados de programabilidade e desempenho.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Ewald, Endrius; Vogel, Adriano; Rista, Cassiano; Griebler, Dalvan; Manssour, Isabel; Fernandes, Luiz G Parallel and Distributed Processing Support for a Geospatial Data Visualization DSL Inproceedings doi Symposium on High Performance Computing Systems (WSCAD), pp. 221-228, IEEE, São Paulo, Brazil, 2018. @inproceedings{EWALD:WSCAD:18, title = {Parallel and Distributed Processing Support for a Geospatial Data Visualization DSL}, author = {Endrius Ewald and Adriano Vogel and Cassiano Rista and Dalvan Griebler and Isabel Manssour and Luiz G. Fernandes}, url = {https://doi.org/10.1109/WSCAD.2018.00042}, doi = {10.1109/WSCAD.2018.00042}, year = {2018}, date = {2018-10-01}, booktitle = {Symposium on High Performance Computing Systems (WSCAD)}, pages = {221-228}, publisher = {IEEE}, address = {São Paulo, Brazil}, abstract = {The amount of data generated worldwide related to geolocalization has exponentially increased. However, the fast processing of this amount of data is a challenge from the programming perspective, and many available solutions require learning a variety of tools and programming languages. This paper introduces the support for parallel and distributed processing in a DSL for Geospatial Data Visualization to speed up the data pre-processing phase. The results have shown the MPI version with dynamic data distribution performing better under medium and large data set files, while the MPI-I/O version achieved the best performance with small data set files.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Vogel, Adriano; Fernandes, Luiz G Grau de Paralelismo Adaptativo na DSL SPar Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 2, Sociedade Brasileira de Computação (SBC), Porto Alegre, BR, 2018. @inproceedings{VOGEL:ERAD:18, title = {Grau de Paralelismo Adaptativo na DSL SPar}, author = {Adriano Vogel and Luiz G. Fernandes}, url = {https://sol.sbc.org.br/index.php/eradrs/article/view/4698/4615}, year = {2018}, date = {2018-04-01}, booktitle = {Escola Regional de Alto Desempenho (ERAD/RS)}, pages = {2}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Porto Alegre, BR}, abstract = {As aplicações de stream apresentam características que as diferem de outras classes de aplicações, como variação nas entradas/saídas e execuções por períodos indefinidos de tempo. Uma das formas de responder a natureza dinâmica dessas aplicações é adaptando continuamente o grau de paralelismo. Nesse estudo é apresentado o suporte ao grau de paralelismo adaptativo na DSL (Domain-Specific Language) SPar.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Maron, Carlos A F; Fernandes, Luiz G Uma Suíte de Benchmarks Parametrizáveis para o Domínio de Processamento de Stream em Sistemas Multi-Core Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 2, Sociedade Brasileira de Computação (SBC), Porto Alegre, BR, 2018. @inproceedings{MARON:ERAD:18, title = {Uma Suíte de Benchmarks Parametrizáveis para o Domínio de Processamento de Stream em Sistemas Multi-Core}, author = {Carlos A. F. Maron and Luiz G. Fernandes}, url = {https://sol.sbc.org.br/index.php/eradrs/article/view/4723/4640}, year = {2018}, date = {2018-04-01}, booktitle = {Escola Regional de Alto Desempenho (ERAD/RS)}, pages = {2}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Porto Alegre, BR}, abstract = {Avaliar o desempenho é importante para computação. Porém, assim como o hardware, o software também deve ser avaliado quando características podem influenciar no seu comportamento. Nestes casos, a suíte de benchmarks parametrizáveis para o processamento de stream serve como uma ferramenta de apoio ao usuário e até programadores.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Rista, Cassiano; Fernandes, Luiz G Proposta de Provisionamento Elástico de Recursos com MPI-2 para a DSL SPar Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 2, Sociedade Brasileira de Computação (SBC), Porto Alegre, BR, 2018. @inproceedings{RISTA:ERAD:18, title = {Proposta de Provisionamento Elástico de Recursos com MPI-2 para a DSL SPar}, author = {Cassiano Rista and Luiz G. Fernandes}, url = {https://sol.sbc.org.br/index.php/eradrs/article/view/4709/4626}, year = {2018}, date = {2018-04-01}, booktitle = {Escola Regional de Alto Desempenho (ERAD/RS)}, pages = {2}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Porto Alegre, BR}, abstract = {Este artigo apresenta uma proposta para desenvolvimento de um módulo de provisionamento elástico e autônomo a ser integrado em uma linguagem específica de domínio (DSL) voltada para o paralelismo de stream. O módulo deverá explorar a elasticidade com o uso de MPI-2 em um ambiente de cluster de computadores, permitindo a criação de processos em tempo de execução, serialização, ordenamento e balanceamento de carga.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Bairros, Gildomiro; Fernandes, Luiz G Suporte para Computação Autonômica com Elasticidade Vertical para a DSL SPar Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 2, Sociedade Brasileira de Computação (SBC), Porto Alegre, BR, 2018. @inproceedings{BAIRROS:ERAD:18, title = {Suporte para Computação Autonômica com Elasticidade Vertical para a DSL SPar}, author = {Gildomiro Bairros and Luiz G. Fernandes}, url = {https://sol.sbc.org.br/index.php/eradrs/article/view/4716/4633}, year = {2018}, date = {2018-04-01}, booktitle = {Escola Regional de Alto Desempenho (ERAD/RS)}, pages = {2}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Porto Alegre, BR}, abstract = {O objetivo deste trabalho é propor uma solução para o suporte de elasticidade vertical em aplicações de processamento de stream desenvolvidas com a SPar. Trata-se de uma linguagem específica de domínio para expressar paralelismo de stream em alto nível. Nossa solução fornece suporte à elasticidade automática para ambientes com contêineres Linux e rotinas que abstraem detalhes da infraestrutura de nuvem através da VEL (Vertical Elasticity Library).}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Loff, Junior; Griebler, Dalvan; Sandes, Edans; Melo, Alba; Fernandes, Luiz G Suporte ao Paralelismo Multi-Core com FastFlow e TBB em uma Aplicação de Alinhamento de Sequências de DNA Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 2, Sociedade Brasileira de Computação (SBC), Porto Alegre, BR, 2018. @inproceedings{LOFF:ERAD:18, title = {Suporte ao Paralelismo Multi-Core com FastFlow e TBB em uma Aplicação de Alinhamento de Sequências de DNA}, author = {Junior Loff and Dalvan Griebler and Edans Sandes and Alba Melo and Luiz G. Fernandes}, url = {https://gmap.pucrs.br/dalvan/papers/2018/CR_ERAD_IC_Loff_2018.pdf}, year = {2018}, date = {2018-04-01}, booktitle = {Escola Regional de Alto Desempenho (ERAD/RS)}, pages = {2}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Porto Alegre, BR}, abstract = {Quando uma sequência biológica é obtida, é comum alinhá-la com outra já estudada para determinar suas características. O desafio é processar este alinhamento em tempo útil. Neste trabalho exploramos o paralelismo em uma aplicação de alinhamento de sequências de DNA utilizando as bibliotecas FastFlow e Intel TBB. Os experimentos mostram que a versão TBB obteve até 4% melhor tempo de execução em comparação à versão original em OpenMP.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Maron, Carlos A F Parametrização do paralelismo de stream em Benchmarks da suíte PARSEC Masters Thesis School of Technology - PPGCC - PUCRS, 2018. @mastersthesis{MARON:DM:18, title = {Parametrização do paralelismo de stream em Benchmarks da suíte PARSEC}, author = {Carlos A. F. Maron}, url = {http://tede2.pucrs.br/tede2/handle/tede/8556}, year = {2018}, date = {2018-08-01}, address = {Porto Alegre, Brazil}, school = {School of Technology - PPGCC - PUCRS}, abstract = {The parallel software designer aims to deliver efficient and scalable applications. This can be done by understanding the performance impacts of the application’s characteristics. Parallel applications of the same domain tend to present similar patterns of behavior and characteristics. One way of understanding and evaluating the applications’ characteristics is using parametrizable benchmarks, which enable users to play with the important characteristics when running the benchmark. However, the parametrization technique must be better exploited in the available benchmarks, especially in the stream processing application domain. Our challenge is to enable the parametrization of the stream processing applications’ characteristics (also known as stream parallelism) through benchmarks, mainly because this application domain is widely used and the benchmarks available for it usually do not support the evaluation of important characteristics from this domain (e.g., PARSEC). Therefore, the goal is to identify the stream parallelism characteristics present in the PARSEC benchmarks and implement ready-to-use parametrization support. We selected the Dedup and Ferret applications, which represent the stream parallelism domain. In the experimental results, we observed that our implemented parametrization has caused performance impacts in this application domain. In most cases, our parametrization improved the throughput, latency, service time, and execution time. 
Moreover, since we have not evaluated the performance of computer architectures and parallel programming frameworks, the results point to new research directions for understanding other patterns of behavior caused by the parametrization.}, keywords = {}, pubstate = {published}, tppubtype = {mastersthesis} } |
Vogel, Adriano Adaptive Degree of Parallelism for the SPar Runtime Masters Thesis School of Technology - PPGCC - PUCRS, 2018. @mastersthesis{VOGEL:DM:18, title = {Adaptive Degree of Parallelism for the SPar Runtime}, author = {Adriano Vogel}, url = {http://tede2.pucrs.br/tede2/handle/tede/8255}, year = {2018}, date = {2018-03-01}, address = {Porto Alegre, Brazil}, school = {School of Technology - PPGCC - PUCRS}, abstract = {In recent years, stream processing applications have become a traditional workload in computing systems. They are traditionally found in video, audio, graphic and image processing. Many of these applications demand parallelism to increase performance. However, programmers must often face the trade-off between coding productivity and performance that introducing parallelism creates. The SPar Domain-Specific Language (DSL) was created to achieve the optimal balance for programmers, with the C++11 attribute annotation mechanism to ensure that essential properties of stream parallelism could be represented (stage, input, output, and replicate). The compiler recognizes the SPar attributes and generates parallel code automatically. The need to manually define parallelism is the crucial challenge for increasing SPar's abstraction level, because it is time-consuming and error-prone. Also, executing several applications can fail to be efficient when running an unsuitable number of replicas. This occurs when the defined number of replicas in a parallel region is not optimal or when a static number is used, which ignores the dynamic nature of stream processing applications. In order to solve this problem, we introduced the concept of the abstracted and adaptive number of replicas for SPar. Moreover, we described our implemented mechanism as well as transformation rules that enable SPar to generate parallel code with adaptive degree of parallelism support. We experimentally evaluated the implemented adaptive mechanisms regarding their effectiveness. 
Thus, we used real-world applications to demonstrate that our adaptive mechanism implementations can provide higher abstraction levels without significant performance degradation.}, keywords = {}, pubstate = {published}, tppubtype = {mastersthesis} } |
2021 |
Providing High‐Level Self‐Adaptive Abstractions for Stream Parallelism on Multicores Journal Article doi Software: Practice and Experience, na (na), pp. na, 2021. |
Introducing a Stream Processing Framework for Assessing Parallel Programming Interfaces Inproceedings Forthcoming 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), IEEE, Valladolid, Spain, Forthcoming. |
Towards On-the-fly Self-Adaptation of Stream Parallel Patterns Inproceedings Forthcoming 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), IEEE, Valladolid, Spain, Forthcoming. |
2020 |
DSPBench: a Suite of Benchmark Applications for Distributed Data Stream Processing Systems Journal Article doi IEEE Access, 8 (na), pp. 222900-222917, 2020. |
Latency‐aware adaptive micro‐batching techniques for streamed data compression on graphics processing units Journal Article doi Concurrency and Computation: Practice and Experience, na (na), pp. e5786, 2020. |
Avaliação da Usabilidade de Interfaces de Programação Paralela para Sistemas Multi-Core em Aplicação de Vídeo Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 149-150, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. |
The Impact of CPU Frequency Scaling on Power Consumption of Computing Infrastructures Inproceedings doi International Conference on Computational Science and its Applications (ICCSA), pp. 142-157, Springer, Cagliari, Italy, 2020. |
Stream Parallelism Annotations for Multi-Core Frameworks Inproceedings doi XXIV Brazilian Symposium on Programming Languages (SBLP), pp. 48-55, ACM, Natal, Brazil, 2020. |
Implementação Paralela do LU no NPB C++ Utilizando um Pipeline Implícito Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 37-40, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. |
Implementação CUDA dos Kernels NPB Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 85-88, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. |
Geração Automática de Código TBB na SPar Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 97-100, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. |
Acelerando uma Aplicação de Detecção de Pistas com MPI Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 117-120, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. |
Proposta de uma Suíte de Benchmarks para Processamento de Stream em Sistemas Multi-Core Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 167-168, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. |
Parallel Stream Processing with MPI for Video Analytics and Data Visualization Inproceedings doi High Performance Computing Systems, pp. 102-116, Springer, Cham, 2020. |
Efficient NAS Parallel Benchmark Kernels with CUDA Inproceedings doi 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 9-16, IEEE, Västerås, Sweden, 2020. |
2019 |
Simplifying and implementing service level objectives for stream parallelism Journal Article doi Journal of Supercomputing, pp. 1-26, 2019, ISSN: 0920-8542. |
Raising the Parallel Abstraction Level for Streaming Analytics Applications Journal Article doi IEEE Access, 7, pp. 131944-131961, 2019. |
Minimizing Self-Adaptation Overhead in Parallel Stream Processing for Multi-Cores Inproceedings doi Euro-Par 2019: Parallel Processing Workshops, pp. 12, Springer, Göttingen, Germany, 2019. |
Stream Processing on Multi-cores with GPUs: Parallel Programming Models' Challenges Inproceedings doi International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 834-841, IEEE, Rio de Janeiro, Brazil, 2019. |
Stream Parallelism on the LZSS Data Compression Application for Multi-Cores with GPUs Inproceedings doi 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 247-251, IEEE, Pavia, Italy, 2019. |
Should PARSEC Benchmarks be More Parametric? A Case Study with Dedup Inproceedings doi 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 217-221, IEEE, Pavia, Italy, 2019. |
Memory Performance and Bottlenecks in Multicore and GPU Architectures Inproceedings doi 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 233-236, IEEE, Pavia, Italy, 2019. |
Structured Stream Parallelism for Rust Inproceedings doi XXIII Brazilian Symposium on Programming Languages (SBLP), pp. 54-61, ACM, Salvador, Brazil, 2019. |
Seamless Parallelism Management for Multi-core Stream Processing Inproceedings doi Advances in Parallel Computing, Proceedings of the International Conference on Parallel Computing (ParCo), pp. 533-542, IOS Press, Prague, Czech Republic, 2019. |
High-Level Stream Parallelism Abstractions with SPar Targeting GPUs Inproceedings doi Parallel Computing is Everywhere, Proceedings of the International Conference on Parallel Computing (ParCo), pp. 543-552, IOS Press, Prague, Czech Republic, 2019. |
Acelerando o Reconhecimento de Pessoas em Vídeos com MPI Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 4, Sociedade Brasileira de Computação (SBC), Três de Maio, BR, 2019. |
Revisando a Programação Paralela com CUDA nos Benchmarks EP e FT Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 4, Sociedade Brasileira de Computação (SBC), Três de Maio, BR, 2019. |
Avaliando o Paralelismo de Stream com Pthreads, OpenMP e SPar em Aplicações de Vídeo Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 4, Sociedade Brasileira de Computação (SBC), Três de Maio, BR, 2019. |
Proposta de Grau de Paralelismo Autoadaptativo com MPI-2 para a DSL SPar Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 4, Sociedade Brasileira de Computação (SBC), Três de Maio, BR, 2019. |
Benchmark Paramétrico para o Domínio do Paralelismo de Stream: Um Estudo de Caso com o Ferret da Suíte PARSEC Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 4, Sociedade Brasileira de Computação (SBC), Três de Maio, BR, 2019. |
Adaptando o Paralelismo em Aplicações de Stream Conforme Objetivos de Throughput Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 4, Sociedade Brasileira de Computação (SBC), Três de Maio, BR, 2019. |
Proposta de Suporte ao Paralelismo de GPU na SPar Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 4, Sociedade Brasileira de Computação (SBC), Três de Maio, BR, 2019. |
2018 |
High-Level and Productive Stream Parallelism for Dedup, Ferret, and Bzip2 Journal Article doi International Journal of Parallel Programming, 47 (1), pp. 253-271, 2018, ISSN: 1573-7640. |
Stream Parallelism with Ordered Data Constraints on Multi-Core Systems Journal Article doi Journal of Supercomputing, 75 (8), pp. 4042-4061, 2018, ISSN: 0920-8542. |
The NAS Benchmark Kernels for Single and Multi-Tenant Cloud Instances with LXC/KVM Inproceedings doi International Conference on High Performance Computing & Simulation (HPCS), IEEE, Orléans, France, 2018. |
Evaluating, Estimating, and Improving Network Performance in Container-based Clouds Inproceedings doi 23rd IEEE Symposium on Computers and Communications (ISCC), IEEE, Natal, Brazil, 2018. |
Performance of Data Mining, Media, and Financial Applications under Private Cloud Conditions Inproceedings doi 23rd IEEE Symposium on Computers and Communications (ISCC), IEEE, Natal, Brazil, 2018. |
Suporte ao Processamento Paralelo e Distribuído em uma DSL para Visualização de Dados Geoespaciais Inproceedings XIX Simpósio em Sistemas Computacionais de Alto Desempenho, pp. 1-12, SBC, São Paulo, Brazil, 2018. |
Autonomic and Latency-Aware Degree of Parallelism Management in SPar Inproceedings doi Euro-Par 2018: Parallel Processing Workshops, pp. 28-39, Springer, Turin, Italy, 2018. |
Service Level Objectives via C++11 Attributes Inproceedings doi Euro-Par 2018: Parallel Processing Workshops, pp. 745-756, Springer, Turin, Italy, 2018. |
Efficient NAS Benchmark Kernels with C++ Parallel Programming Inproceedings doi 26th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 733-740, IEEE, Cambridge, UK, 2018. |
Paralelização de uma Aplicação de Detecção e Eliminação de Ruídos em Streaming de Vídeo com a DSL SPar Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 2, Sociedade Brasileira de Computação (SBC), Porto Alegre, BR, 2018. |
Parallel and Distributed Processing Support for a Geospatial Data Visualization DSL Inproceedings doi Symposium on High Performance Computing Systems (WSCAD), pp. 221-228, IEEE, São Paulo, Brazil, 2018. |
Grau de Paralelismo Adaptativo na DSL SPar Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 2, Sociedade Brasileira de Computação (SBC), Porto Alegre, BR, 2018. |
Uma Suíte de Benchmarks Parametrizáveis para o Domínio de Processamento de Stream em Sistemas Multi-Core Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 2, Sociedade Brasileira de Computação (SBC), Porto Alegre, BR, 2018. |
Proposta de Provisionamento Elástico de Recursos com MPI-2 para a DSL SPar Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 2, Sociedade Brasileira de Computação (SBC), Porto Alegre, BR, 2018. |
Suporte para Computação Autonômica com Elasticidade Vertical para a DSL SPar Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 2, Sociedade Brasileira de Computação (SBC), Porto Alegre, BR, 2018. |
Suporte ao Paralelismo Multi-Core com FastFlow e TBB em uma Aplicação de Alinhamento de Sequências de DNA Inproceedings Escola Regional de Alto Desempenho (ERAD/RS), pp. 2, Sociedade Brasileira de Computação (SBC), Porto Alegre, BR, 2018. |
Parametrização do paralelismo de stream em Benchmarks da suíte PARSEC Masters Thesis School of Technology - PPGCC - PUCRS, 2018. |
Adaptive Degree of Parallelism for the SPar Runtime Masters Thesis School of Technology - PPGCC - PUCRS, 2018. |