2023 |
Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; Fernandes, Luiz Gustavo Micro-batch and data frequency for stream processing on multi-cores Journal Article doi The Journal of Supercomputing, In press (In press), pp. 1-39, 2023. @article{GARCIA:JS:23, title = {Micro-batch and data frequency for stream processing on multi-cores}, author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/s11227-022-05024-y}, doi = {10.1007/s11227-022-05024-y}, year = {2023}, date = {2023-01-01}, journal = {The Journal of Supercomputing}, volume = {In press}, number = {In press}, pages = {1-39}, publisher = {Springer}, abstract = {Latency or throughput is often critical performance metrics in stream processing. Applications’ performance can fluctuate depending on the input stream. This unpredictability is due to the variety in data arrival frequency and size, complexity, and other factors. Researchers are constantly investigating new ways to mitigate the impact of these variations on performance with self-adaptive techniques involving elasticity or micro-batching. However, there is a lack of benchmarks capable of creating test scenarios to further evaluate these techniques. This work extends and improves the SPBench benchmarking framework to support dynamic micro-batching and data stream frequency management. We also propose a set of algorithms that generates the most commonly used frequency patterns for benchmarking stream processing in related work. It allows the creation of a wide variety of test scenarios. To validate our solution, we use SPBench to create custom benchmarks and evaluate the impact of micro-batching and data stream frequency on the performance of Intel TBB and FastFlow. These are two libraries that leverage stream parallelism for multi-core architectures. Our results demonstrated that our test cases did not benefit from micro-batches on multi-cores. 
For different data stream frequency configurations, TBB ensured the lowest latency, while FastFlow assured higher throughput in shorter pipelines.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
2022 |
Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; Fernandes, Luiz Gustavo SPBench: a framework for creating benchmarks of stream processing applications Journal Article doi Computing, In press (In press), pp. 1-23, 2022. @article{GARCIA:Computing:22, title = {SPBench: a framework for creating benchmarks of stream processing applications}, author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/s00607-021-01025-6}, doi = {10.1007/s00607-021-01025-6}, year = {2022}, date = {2022-01-01}, journal = {Computing}, volume = {In press}, number = {In press}, pages = {1-23}, publisher = {Springer}, abstract = {In a fast-changing data-driven world, real-time data processing systems are becoming ubiquitous in everyday applications. The increasing data we produce, such as audio, video, image, and, text are demanding quickly and efficiently computation. Stream Parallelism allows accelerating this computation for real-time processing. But it is still a challenging task and most reserved for experts. In this paper, we present SPBench, a framework for benchmarking stream processing applications. It aims to support users with a set of real-world stream processing applications, which are made accessible through an Application Programming Interface (API) and executable via Command Line Interface (CLI) to create custom benchmarks. We tested SPBench by implementing parallel benchmarks with Intel Threading Building Blocks (TBB), FastFlow, and SPar. This evaluation provided useful insights and revealed the feasibility of the proposed framework in terms of usage, customization, and performance analysis. 
SPBench demonstrated to be a high-level, reusable, extensible, and easy of use abstraction to build parallel stream processing benchmarks on multi-core architectures.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Hoffmann, Renato Barreto; Löff, Júnior; Griebler, Dalvan; Fernandes, Luiz Gustavo OpenMP as runtime for providing high-level stream parallelism on multi-cores Journal Article doi The Journal of Supercomputing, 1 (1), pp. 7655–7676, 2022. @article{HOFFMANN:Jsuper:2022, title = {OpenMP as runtime for providing high-level stream parallelism on multi-cores}, author = {Renato Barreto Hoffmann and Júnior Löff and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/s11227-021-04182-9}, doi = {10.1007/s11227-021-04182-9}, year = {2022}, date = {2022-01-01}, journal = {The Journal of Supercomputing}, volume = {1}, number = {1}, pages = {7655–7676}, publisher = {Springer}, address = {New York, United States}, abstract = {OpenMP is an industry and academic standard for parallel programming. However, using it for developing parallel stream processing applications is complex and challenging. OpenMP lacks key programming mechanisms and abstractions for this particular domain. To tackle this problem, we used a high-level parallel programming framework (named SPar) for automatically generating parallel OpenMP code. We achieved this by leveraging SPar's language and its domain-specific code annotations for simplifying the complexity and verbosity added by OpenMP in this application domain. Consequently, we implemented a new compiler algorithm in SPar for automatically generating parallel code targeting the OpenMP runtime using source-to-source code transformations. The experiments in four different stream processing applications demonstrated that the execution time of SPar was improved up to 25.42% when using the OpenMP runtime. Additionally, our abstraction over OpenMP introduced at most 1.72% execution time overhead when compared to handwritten parallel codes. 
Furthermore, SPar significantly reduces the total source lines of code required to express parallelism with respect to plain OpenMP parallel codes.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Vogel, Adriano; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo Self-adaptation on Parallel Stream Processing: A Systematic Review Journal Article doi Concurrency and Computation: Practice and Experience, 34 (6), pp. e6759, 2022. @article{VOGEL:Survey:CCPE:2022, title = {Self-adaptation on Parallel Stream Processing: A Systematic Review}, author = {Adriano Vogel and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1002/cpe.6759}, doi = {10.1002/cpe.6759}, year = {2022}, date = {2022-03-01}, journal = {Concurrency and Computation: Practice and Experience}, volume = {34}, number = {6}, pages = {e6759}, publisher = {Wiley}, abstract = {A recurrent challenge in real-world applications is autonomous management of the executions at run-time. In this vein, stream processing is a class of applications that compute data flowing in the form of streams (e.g., video feeds, images, and data analytics), where parallel computing can help accelerate the executions. On the one hand, stream processing applications are becoming more complex, dynamic, and long-running. On the other hand, it is unfeasible for humans to monitor and manually change the executions continuously. Hence, self-adaptation can reduce costs and human efforts by providing a higher-level abstraction with an autonomic/seamless management of executions. In this work, we aim at providing a literature review regarding self-adaptation applied to the parallel stream processing domain. We present a comprehensive revision using a systematic literature review method. Moreover, we propose a taxonomy to categorize and classify the existing self-adaptive approaches. 
Finally, applying the taxonomy made it possible to characterize the state-of-the-art, identify trends, and discuss open research challenges and future opportunities.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; Fernandes, Luiz Gustavo Evaluating Micro-batch and Data Frequency for Stream Processing Applications on Multi-cores Inproceedings doi 30th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 10-17, IEEE, Valladolid, Spain, 2022. @inproceedings{GARCIA:PDP:22, title = {Evaluating Micro-batch and Data Frequency for Stream Processing Applications on Multi-cores}, author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/PDP55904.2022.00011}, doi = {10.1109/PDP55904.2022.00011}, year = {2022}, date = {2022-04-01}, booktitle = {30th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages = {10-17}, publisher = {IEEE}, address = {Valladolid, Spain}, series = {PDP'22}, abstract = {In stream processing, data arrives constantly and is often unpredictable. It can show large fluctuations in arrival frequency, size, complexity, and other factors. These fluctuations can strongly impact application latency and throughput, which are critical factors in this domain. Therefore, there is a significant amount of research on self-adaptive techniques involving elasticity or micro-batching as a way to mitigate this impact. However, there is a lack of benchmarks and tools for helping researchers to investigate micro-batching and data stream frequency implications. In this paper, we extend a benchmarking framework to support dynamic micro-batching and data stream frequency management. We used it to create custom benchmarks and compare latency and throughput aspects from two different parallel libraries. We validate our solution through an extensive analysis of the impact of micro-batching and data stream frequency on stream processing applications using Intel TBB and FastFlow, which are two libraries that leverage stream parallelism on multi-core architectures. 
Our results demonstrated up to 33% throughput gain over latency using micro-batches. Additionally, while TBB ensures lower latency, FastFlow ensures higher throughput in the parallel applications for different data stream frequency configurations.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; Fernandes, Luiz Gustavo Um Framework para Criar Benchmarks de Aplicações Paralelas de Stream Inproceedings doi Anais da XXII Escola Regional de Alto Desempenho da Região Sul, pp. 97–98, Sociedade Brasileira de Computação, Curitiba, Brazil, 2022. @inproceedings{GARCIA:ERAD:22, title = {Um Framework para Criar Benchmarks de Aplicações Paralelas de Stream}, author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/eradrs.2022.19180}, doi = {10.5753/eradrs.2022.19180}, year = {2022}, date = {2022-04-01}, booktitle = {Anais da XXII Escola Regional de Alto Desempenho da Região Sul}, pages = {97--98}, publisher = {Sociedade Brasileira de Computação}, address = {Curitiba, Brazil}, abstract = {This work presents SPBench, a framework for developing stream processing benchmarks in C++. SPBench provides a set of realistic applications through high-level abstractions and allows customization of the input data and performance metrics.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Andrade, Gabriella; Griebler, Dalvan; Fernandes, Luiz Gustavo Avaliação do Esforço de Programação em GPU: Estudo Piloto Inproceedings doi Anais da XXII Escola Regional de Alto Desempenho da Região Sul, pp. 95–96, Sociedade Brasileira de Computação, Curitiba, Brazil, 2022. @inproceedings{ANDRADE:ERAD:22, title = {Avaliação do Esforço de Programação em GPU: Estudo Piloto}, author = {Gabriella Andrade and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/eradrs.2022.19179}, doi = {10.5753/eradrs.2022.19179}, year = {2022}, date = {2022-04-01}, booktitle = {Anais da XXII Escola Regional de Alto Desempenho da Região Sul}, pages = {95--96}, publisher = {Sociedade Brasileira de Computação}, address = {Curitiba, Brazil}, abstract = {Developing applications for GPUs is not an easy task, as it requires greater knowledge of the architecture. In this work, we conducted a pilot study to evaluate the effort of non-expert programmers when developing GPU applications. The results revealed that GSParLib requires less effort than the other parallel programming interfaces. However, further investigation is needed to complement the study.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Müller, Caetano; Löff, Junior; Griebler, Dalvan; Eizirik, Eduardo Avaliação da aplicação de paralelismo em classificadores taxonômicos usando Qiime2 Inproceedings doi Anais da XXII Escola Regional de Alto Desempenho da Região Sul, pp. 25–28, Sociedade Brasileira de Computação (SBC), Porto Alegre, RS, Brasil, 2022. @inproceedings{gmap:MULLER:ERAD-RS:22, title = {Avaliação da aplicação de paralelismo em classificadores taxonômicos usando Qiime2}, author = {Caetano Müller and Junior Löff and Dalvan Griebler and Eduardo Eizirik}, url = {https://sol.sbc.org.br/index.php/eradrs/article/view/19152}, doi = {10.5753/eradrs.2022.19152}, year = {2022}, date = {2022-01-01}, booktitle = {Anais da XXII Escola Regional de Alto Desempenho da Região Sul}, pages = {25--28}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Porto Alegre, RS, Brasil}, abstract = {The classification of DNA sequences using machine learning algorithms still has room to evolve, both in the quality of the results and in the computational efficiency of the algorithms. In this work, we carried out a performance evaluation of two machine learning algorithms from the Qiime2 tool for DNA sequence classification. The results show that performance improved by up to 9.65 times using 9 threads.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Rockenbach, Dinei A; Löff, Júnior; Araujo, Gabriell; Griebler, Dalvan; Fernandes, Luiz G High-Level Stream and Data Parallelism in C++ for GPUs Inproceedings doi XXVI Brazilian Symposium on Programming Languages (SBLP), pp. 41-49, ACM, Uberlândia, Brazil, 2022. @inproceedings{ROCKENBACH:SBLP:22, title = {High-Level Stream and Data Parallelism in C++ for GPUs}, author = {Dinei A. Rockenbach and Júnior Löff and Gabriell Araujo and Dalvan Griebler and Luiz G Fernandes}, url = {https://doi.org/10.1145/3561320.3561327}, doi = {10.1145/3561320.3561327}, year = {2022}, date = {2022-10-01}, booktitle = {XXVI Brazilian Symposium on Programming Languages (SBLP)}, pages = {41-49}, publisher = {ACM}, address = {Uberlândia, Brazil}, series = {SBLP'22}, abstract = {GPUs are massively parallel processors that allow solving problems that are not viable to traditional processors like CPUs. However, implementing applications for GPUs is challenging to programmers as it requires parallel programming to efficiently exploit the GPU resources. In this sense, parallel programming abstractions, notably domain-specific languages, are fundamental for improving programmability. SPar is a high-level Domain-Specific Language (DSL) that allows expressing stream and data parallelism in the serial code through annotations using C++ attributes. This work elaborates on a methodology and tool for GPU code generation by introducing new attributes to SPar language and transformation rules to SPar compiler. These new contributions, besides the gains in simplicity and code reduction compared to CUDA and OpenCL, enabled SPar achieve of higher throughput when exploring combined CPU and GPU parallelism, and when using batching.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Vogel, Adriano Self-adaptive abstractions for efficient high-level parallel computing in multi-cores PhD Thesis School of Technology - PUCRS, 2022. @phdthesis{VOGEL:PHD_PUCRS:22, title = {Self-adaptive abstractions for efficient high-level parallel computing in multi-cores}, author = {Adriano Vogel}, url = {https://tede2.pucrs.br/tede2/handle/tede/10232}, year = {2022}, date = {2022-05-01}, address = {Porto Alegre, Brazil}, school = {School of Technology - PUCRS}, abstract = {Nowadays, a significant part of computing systems and real-world applications demand parallelism to accelerate their executions. Although high-level and structured parallel programming aims to facilitate parallelism exploitation, there are still issues to be addressed to improve existing parallel programming abstractions. Usually, application developers still have to set non-intuitive or complex parallelism configurations. In this context, self-adaptation is a potential alternative to provide a higher-level of autonomic abstractions and runtime responsiveness in parallel executions. However, a recurrent problem is that self-adaptation is still limited in terms of flexibility, efficiency, and abstractions. For instance, there is a lack of mechanisms to apply adaptation actions and efficient decision-making strategies to decide which configurations to be enforced at run-time. In this work, we are interested in abstractions achievable with self-adaptation transparently managing the executions while the parallel programs are running (at run-time). Our main goals are to increase the adaptation space to be more representative of real-world applications and make self-adaptation more efficient with comprehensive evaluation methodologies, which can provide use-cases demonstrating the true potentials of self-adaptation. Therefore, this doctoral dissertation provides the following scientific contributions: I) An Systematic Literature Review (SLR) providing a taxonomy of the state-of-the-art. 
II) A conceptual framework to support designing and abstracting the decision-making process within self-adaptive solutions, such a conceptual framework is then employed in the technical contributions to assist in making the solutions more modular and potentially generalizable. III) Mechanisms and strategies for self-adaptive replicas in applications with single and multiple parallel stages, supporting multiple customizable non-functional requirements. IV) Mechanism, strategy, and optimizations for self-adaptation of Parallel Patterns/applications’ graphs topologies. We apply the proposed solutions to the context of stream processing applications, a representative paradigm present in several real-world applications that compute data flowing in the form of streams (e.g., video feeds, image, and data analytics). A part of the proposed solutions is evaluated with SPar and another part with the FastFlow programming framework. The results demonstrate that self-adaptation can provide efficient parallelism abstractions and autonomous responsiveness at run-time, yet achieve a competitive performance w.r.t. the best static executions. Moreover, when appropriate, we compare state-of-the-art solutions and demonstrate that our highly optimized decision-making strategies achieve significant performance and efficiency gains.}, keywords = {}, pubstate = {published}, tppubtype = {phdthesis} } |
Vogel, Adriano Self-adaptive abstractions for efficient high-level parallel computing in multi-cores PhD Thesis Computer Science Department - University of Pisa, 2022. @phdthesis{VOGEL:PHD_PISA:22, title = {Self-adaptive abstractions for efficient high-level parallel computing in multi-cores}, author = {Adriano Vogel}, url = {https://etd.adm.unipi.it/theses/available/etd-04142022-142258/unrestricted/Vogel_PhD_Dissertation_UNIPI.pdf}, year = {2022}, date = {2022-05-01}, address = {Pisa, Italy}, school = {Computer Science Department - University of Pisa}, abstract = {Nowadays, a significant part of computing systems and real-world applications demand parallelism to accelerate their executions. Although high-level and structured parallel programming aims to facilitate parallelism exploitation, there are still issues to be addressed to improve existing parallel programming abstractions. Usually, application developers still have to set non-intuitive or complex parallelism configurations. In this context, self-adaptation is a potential alternative to provide a higher level of autonomic abstractions and runtime responsiveness in parallel executions. However, a recurrent problem is that self-adaptation is still limited in terms of flexibility, efficiency, and abstractions. For instance, there is a lack of mechanisms to apply adaptation actions and of efficient decision-making strategies to decide which configurations should be enforced at run-time. In this work, we are interested in abstractions achievable with self-adaptation transparently managing the executions while the parallel programs are running (at run-time). Our main goals are to increase the adaptation space to be more representative of real-world applications and to make self-adaptation more efficient with comprehensive evaluation methodologies, which can provide use-cases demonstrating the true potential of self-adaptation. Therefore, this doctoral dissertation provides the following scientific contributions: I) A Systematic Literature Review (SLR) providing a taxonomy of the state-of-the-art. II) A conceptual framework to support designing and abstracting the decision-making process within self-adaptive solutions; this conceptual framework is then employed in the technical contributions to assist in making the solutions more modular and potentially generalizable. III) Mechanisms and strategies for self-adaptive replicas in applications with single and multiple parallel stages, supporting multiple customizable non-functional requirements. IV) Mechanism, strategy, and optimizations for self-adaptation of Parallel Patterns/applications’ graph topologies. We apply the proposed solutions to the context of stream processing applications, a representative paradigm present in several real-world applications that compute data flowing in the form of streams (e.g., video feeds, image, and data analytics). A part of the proposed solutions is evaluated with SPar and another part with the FastFlow programming framework. The results demonstrate that self-adaptation can provide efficient parallelism abstractions and autonomous responsiveness at run-time, yet achieve competitive performance w.r.t. the best static executions. Moreover, when appropriate, we compare state-of-the-art solutions and demonstrate that our highly optimized decision-making strategies achieve significant performance and efficiency gains.}, keywords = {}, pubstate = {published}, tppubtype = {phdthesis} } |
2021 |
Vogel, Adriano; Griebler, Dalvan; Fernandes, Luiz G Providing High‐Level Self‐Adaptive Abstractions for Stream Parallelism on Multicores Journal Article doi Software: Practice and Experience, 51 (6), pp. 1194-1217, 2021. @article{VOGEL:SPE:21, title = {Providing High‐Level Self‐Adaptive Abstractions for Stream Parallelism on Multicores}, author = {Adriano Vogel and Dalvan Griebler and Luiz G Fernandes}, url = {https://doi.org/10.1002/spe.2948}, doi = {10.1002/spe.2948}, year = {2021}, date = {2021-01-01}, journal = {Software: Practice and Experience}, volume = {51}, number = {6}, pages = {1194-1217}, publisher = {Wiley Online Library}, abstract = {Stream processing applications are common computing workloads that demand parallelism to increase their performance. As in the past, parallel programming remains a difficult task for application programmers. The complexity increases when application programmers must set non-intuitive parallelism parameters, i.e., the degree of parallelism. The main problem is that state-of-the-art libraries use a static degree of parallelism and are not sufficiently abstracted for developing stream processing applications. In this paper, we propose a self-adaptive regulation of the degree of parallelism to provide higher-level abstractions. Flexibility is provided to programmers with two new self-adaptive strategies: one is for performance experts, and the other abstracts the need to set a performance goal. We evaluated our solution using compiler transformation rules to generate parallel code with the SPar domain-specific language. The experimental results with real-world applications highlighted higher abstraction levels without significant performance degradation in comparison to static executions. The strategy for performance experts achieved slightly higher performance than the one that works without user-defined performance goals.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Löff, Júnior; Griebler, Dalvan; Fernandes, Luiz Gustavo Melhorando a Geração Automática de Código Paralelo para o Paradigma de Processamento de Stream em Multi-cores Journal Article Revista Eletrônica de Iniciação Científica em Computação, 19 (2), pp. 2083, 2021. @article{LOFF:REIC:21, title = {Melhorando a Geração Automática de Código Paralelo para o Paradigma de Processamento de Stream em Multi-cores}, author = {Júnior Löff and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://sol.sbc.org.br/journals/index.php/reic/article/view/2083}, year = {2021}, date = {2021-06-01}, journal = {Revista Eletrônica de Iniciação Científica em Computação}, volume = {19}, number = {2}, pages = {2083}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Porto Alegre}, abstract = {Parallel programming remains a challenge for developers because it exposes too many low-level and operating-system details. Programmers have to deal with details such as scheduling, load balancing, and synchronization. This work contributes optimizations for a parallel programming abstraction for expressing stream parallelism on multi-cores. It extends SPar by adding two new attributes to its language and improves its compiler to deliver better performance in the automatically generated parallel code. The experiments showed that the new version of SPar can abstract parallelism details while achieving performance similar to that of manually parallelized versions.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Hoffmann, Renato Barreto; Griebler, Dalvan; Fernandes, Luiz Gustavo Geração de Código OpenMP para o Paralelismo de Stream Journal Article Revista Eletrônica de Iniciação Científica em Computação, 19 (2), pp. 2082, 2021. @article{HOFFMANN:REIC:21, title = {Geração de Código OpenMP para o Paralelismo de Stream}, author = {Renato Barreto Hoffmann and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://sol.sbc.org.br/journals/index.php/reic/article/view/2082}, year = {2021}, date = {2021-06-01}, journal = {Revista Eletrônica de Iniciação Científica em Computação}, volume = {19}, number = {2}, pages = {2082}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Porto Alegre}, abstract = {OpenMP is a standard parallel programming interface widely used in industry and academia; however, it becomes complex when used to develop data-flow or stream parallel applications. To address this problem, the work proposes using a high-level parallel programming interface (called SPar) and its compiler to generate lower-level structured code with OpenMP for data-flow applications. The goal is to reduce the complexity and verbosity that OpenMP introduces in stream applications. In experiments with 4 applications, execution time was reduced by up to 25.42%. In addition, fewer lines of source code are required to express the parallelism.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Pieper, Ricardo; Löff, Júnior; Hoffmann, Renato Barreto; Griebler, Dalvan; Fernandes, Luiz Gustavo High-level and Efficient Structured Stream Parallelism for Rust on Multi-cores Journal Article doi Journal of Computer Languages, 65 (na), pp. 101054, 2021, ISSN: 2590-1184. @article{PIEPER:COLA:21, title = {High-level and Efficient Structured Stream Parallelism for Rust on Multi-cores}, author = {Ricardo Pieper and Júnior Löff and Renato Barreto Hoffmann and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://www.sciencedirect.com/science/article/pii/S2590118421000332}, doi = {10.1016/j.cola.2021.101054}, issn = {2590-1184}, year = {2021}, date = {2021-08-01}, journal = {Journal of Computer Languages}, volume = {65}, number = {na}, pages = {101054}, publisher = {Elsevier}, abstract = {This work aims to contribute a structured parallel programming abstraction for Rust in order to provide ready-to-use parallel patterns that abstract low-level and architecture-dependent details from application programmers. We focus on stream processing applications running on shared-memory multi-core architectures (e.g., video processing, compression, and others). Therefore, we provide a new high-level and efficient parallel programming abstraction for expressing stream parallelism, named Rust-SSP. We also created a new stream benchmark suite for Rust that represents real-world scenarios and has different application characteristics and workloads. Our benchmark suite is an initiative to assess existing parallelism abstractions for this domain, as parallel implementations using these abstractions were provided. The results revealed that Rust-SSP achieved up to 41.1% better performance than other solutions. In terms of programmability, the results revealed that Rust-SSP requires the smallest number of extra lines of code to enable stream parallelism.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Löff, Júnior; Griebler, Dalvan; Mencagli, Gabriele; Araujo, Gabriell; Torquati, Massimo; Danelutto, Marco; Fernandes, Luiz Gustavo The NAS parallel benchmarks for evaluating C++ parallel programming frameworks on shared-memory architectures Journal Article doi Future Generation Computer Systems, na (na), pp. na, 2021. @article{LOFF:FGCS:21, title = {The NAS parallel benchmarks for evaluating C++ parallel programming frameworks on shared-memory architectures}, author = {Júnior Löff and Dalvan Griebler and Gabriele Mencagli and Gabriell Araujo and Massimo Torquati and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1016/j.future.2021.07.021}, doi = {10.1016/j.future.2021.07.021}, year = {2021}, date = {2021-07-01}, journal = {Future Generation Computer Systems}, volume = {na}, number = {na}, pages = {na}, publisher = {Elsevier}, abstract = {The NAS Parallel Benchmarks (NPB), originally implemented mostly in Fortran, is a consolidated suite containing several benchmarks extracted from Computational Fluid Dynamics (CFD) models. The benchmark suite has important characteristics such as intensive memory communications, complex data dependencies, different memory access patterns, and hardware components/sub-systems overload. Parallel programming APIs, libraries, and frameworks that are written in C++, as well as new optimizations and parallel processing techniques, can benefit if NPB is made fully available in this programming language. In this paper, we present NPB-CPP, a fully C++ translated version of NPB consisting of all the NPB kernels and pseudo-applications developed using OpenMP, Intel TBB, and FastFlow parallel frameworks for multicores. The design of NPB-CPP leverages the Structured Parallel Programming methodology (essentially based on parallel design patterns). We show the structure of each benchmark application in terms of a composition of a few patterns (notably Map and MapReduce constructs) provided by the selected C++ frameworks. The experimental evaluation shows the accuracy of NPB-CPP with respect to the original NPB source code. Furthermore, we carefully evaluate the parallel performance on three multi-core systems (Intel, IBM Power and AMD) with different C++ compilers (gcc, icc and clang), discussing the performance differences in order to give researchers useful insights for choosing the best parallel programming framework for a given type of problem.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; Fernandes, Luiz Gustavo Introducing a Stream Processing Framework for Assessing Parallel Programming Interfaces Inproceedings doi 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 84-88, IEEE, Valladolid, Spain, 2021. @inproceedings{GARCIA:PDP:21, title = {Introducing a Stream Processing Framework for Assessing Parallel Programming Interfaces}, author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/PDP52278.2021.00021}, doi = {10.1109/PDP52278.2021.00021}, year = {2021}, date = {2021-03-01}, booktitle = {29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages = {84-88}, publisher = {IEEE}, address = {Valladolid, Spain}, series = {PDP'21}, abstract = {Stream Processing applications are spread across different sectors of industry and people's daily lives. The increasing volume of data we produce, such as audio, video, image, and text, demands quick and efficient computation. This can be achieved through Stream Parallelism, which is still a challenging task and mostly reserved for experts. We introduce a Stream Processing framework for assessing Parallel Programming Interfaces (PPIs). Our framework targets multi-core architectures and C++ stream processing applications, providing an API that abstracts the details of the stream operators of these applications. Therefore, users can easily identify all the basic operators and implement parallelism through different PPIs. In this paper, we present the proposed framework, implement three applications using its API, and show how it works by using it to parallelize and evaluate the applications with the PPIs Intel TBB, FastFlow, and SPar. The performance results were consistent with the literature.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Andrade, Gabriella; Griebler, Dalvan; Santos, Rodrigo; Danelutto, Marco; Fernandes, Luiz Gustavo Assessing Coding Metrics for Parallel Programming of Stream Processing Programs on Multi-cores Inproceedings doi 2021 47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pp. 291-295, IEEE, Pavia, Italy, 2021, ISBN: 978-1-6654-2705-0. @inproceedings{ANDRADE:SEAA:21, title = {Assessing Coding Metrics for Parallel Programming of Stream Processing Programs on Multi-cores}, author = {Gabriella Andrade and Dalvan Griebler and Rodrigo Santos and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/SEAA53835.2021.00044}, doi = {10.1109/SEAA53835.2021.00044}, isbn = {978-1-6654-2705-0}, year = {2021}, date = {2021-09-01}, booktitle = {2021 47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA)}, pages = {291-295}, publisher = {IEEE}, address = {Pavia, Italy}, series = {SEAA'21}, abstract = {Since the popularization of multi-core architectures, several parallel APIs have emerged, helping to abstract the programming complexity and increasing productivity in application development. Unfortunately, only a few research efforts in this direction managed to show the usability pay-back of the programming abstraction created, because conducting empirical software engineering is not easy and poses many challenges. We believe that coding metrics commonly used in software engineering code measurements can give useful indicators on the programming effort of parallel applications and APIs. These metrics were designed for general purposes without considering the evaluation of applications from a specific domain. In this study, we aim to evaluate the feasibility of seven coding metrics to be used in the parallel programming domain. To do so, five stream processing applications implemented with different parallel APIs for multi-cores were considered. Our experiments have shown that COCOMO II is a suitable model for evaluating the productivity of different parallel APIs targeting multi-cores on stream processing applications, while other metrics are restricted to the code size.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Scheer, Claudio; Griebler, Dalvan; Fernandes, Luiz Gustavo Proposta de Otimização do Tamanho de Batch em Aplicações de Stream para Multicores usando Aprendizado de Máquina Inproceedings doi 21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 127-128, Sociedade Brasileira de Computação, Joinville, Brazil, 2021. @inproceedings{SCHEER:ERAD:21, title = {Proposta de Otimização do Tamanho de Batch em Aplicações de Stream para Multicores usando Aprendizado de Máquina}, author = {Claudio Scheer and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/eradrs.2021.14802}, doi = {10.5753/eradrs.2021.14802}, year = {2021}, date = {2021-04-01}, booktitle = {21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)}, pages = {127-128}, publisher = {Sociedade Brasileira de Computação}, address = {Joinville, Brazil}, abstract = {This work presents a proposal to study and evaluate features and machine learning algorithms aimed at improving performance by tuning the batch size in parallel stream applications for multicore architectures.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; Fernandes, Luiz Gustavo Proposta de um Framework para Avaliar Interfaces de Programação Paralela em Aplicações de Stream Inproceedings doi 21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 119-120, Sociedade Brasileira de Computação, Joinville, Brazil, 2021. @inproceedings{GARCIA:ERAD:21, title = {Proposta de um Framework para Avaliar Interfaces de Programação Paralela em Aplicações de Stream}, author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/eradrs.2021.14798}, doi = {10.5753/eradrs.2021.14798}, year = {2021}, date = {2021-04-01}, booktitle = {21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)}, pages = {119-120}, publisher = {Sociedade Brasileira de Computação}, address = {Joinville, Brazil}, abstract = {This work proposes a framework that assists in developing benchmarks for evaluating Parallel Programming Interfaces in the domain of stream parallelism in C++.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Rockenbach, Dinei A; Griebler, Dalvan; Fernandes, Luiz Gustavo Provendo Abstrações de Alto Nível para GPUs na SPar Inproceedings doi 21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 109-110, Sociedade Brasileira de Computação, Joinville, Brazil, 2021. @inproceedings{ROCKENBACH:ERAD:21, title = {Provendo Abstrações de Alto Nível para GPUs na SPar}, author = {Dinei A. Rockenbach and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/eradrs.2021.14793}, doi = {10.5753/eradrs.2021.14793}, year = {2021}, date = {2021-04-01}, booktitle = {21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)}, pages = {109-110}, publisher = {Sociedade Brasileira de Computação}, address = {Joinville, Brazil}, abstract = {This work presents an extension to the SPar language that supports combined heterogeneous CPU and GPU parallelism through C++11 annotations in stream processing applications. The tests suggest significant performance improvements with few code modifications.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Araujo, Gabriell; Griebler, Dalvan; Fernandes, Luiz Gustavo Proposta de Suporte à Parametrização no NPB com CUDA Inproceedings doi 21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 103-104, Sociedade Brasileira de Computação, Joinville, Brazil, 2021. @inproceedings{ARAUJO:ERAD:21, title = {Proposta de Suporte à Parametrização no NPB com CUDA}, author = {Gabriell Araujo and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/eradrs.2021.14790}, doi = {10.5753/eradrs.2021.14790}, year = {2021}, date = {2021-04-01}, booktitle = {21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)}, pages = {103-104}, publisher = {Sociedade Brasileira de Computação}, address = {Joinville, Brazil}, abstract = {This work proposes introducing configurable parameters for GPUs in NPB. The initial stage of the study covered the parameterization of the number of threads per block and its impact on GPU performance.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Vogel, Adriano; Griebler, Dalvan; Fernandes, Luiz Gustavo Proposta de Adaptação Dinâmica de Padrões Paralelos Inproceedings doi 21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 101-102, Sociedade Brasileira de Computação, Joinville, Brazil, 2021. @inproceedings{VOGEL:ERAD:21, title = {Proposta de Adaptação Dinâmica de Padrões Paralelos}, author = {Adriano Vogel and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/eradrs.2021.14789}, doi = {10.5753/eradrs.2021.14789}, year = {2021}, date = {2021-04-01}, booktitle = {21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)}, pages = {101-102}, publisher = {Sociedade Brasileira de Computação}, address = {Joinville, Brazil}, abstract = {This work presents a perspective for dynamically adapting parallel patterns at run-time, aiming to abstract from programmers the decision of which parallel pattern to use and to increase flexibility. Preliminary results demonstrate the effectiveness of the proposed solution.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Andrade, Gabriella; Griebler, Dalvan; Santos, Rodrigo; Fernandes, Luiz Gustavo Uso de Métricas de Codificação para Avaliar a Programação Paralela nas Aplicações de Stream em Sistemas Multi-core Inproceedings doi 21st Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 93-94, Sociedade Brasileira de Computação, Joinville, Brazil, 2021. @inproceedings{ANDRADE:ERAD:21, title = {Uso de Métricas de Codificação para Avaliar a Programação Paralela nas Aplicações de Stream em Sistemas Multi-core}, author = {Gabriella Andrade and Dalvan Griebler and Rodrigo Santos and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/eradrs.2021.14785}, doi = {10.5753/eradrs.2021.14785}, year = {2021}, date = {2021-04-01}, booktitle = {21st Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)}, pages = {93-94}, publisher = {Sociedade Brasileira de Computação}, address = {Joinville, Brazil}, abstract = {In this work, seven coding metrics are evaluated considering four real-world applications implemented with FastFlow, Pthreads, SPar, and TBB. Our results show that SPar presents the best indicators according to the metrics used.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Mello, Fernanda; Griebler, Dalvan; Manssour, Isabel; Fernandes, Luiz Gustavo Compressão de Dados em Multicores com Flink ou SPar? Inproceedings doi 21st Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 77-80, Sociedade Brasileira de Computação, Joinville, Brazil, 2021. @inproceedings{MELLO:ERAD:21, title = {Compressão de Dados em Multicores com Flink ou SPar?}, author = {Fernanda Mello and Dalvan Griebler and Isabel Manssour and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/eradrs.2021.14779}, doi = {10.5753/eradrs.2021.14779}, year = {2021}, date = {2021-04-01}, booktitle = {21st Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)}, pages = {77-80}, publisher = {Sociedade Brasileira de Computação}, address = {Joinville, Brazil}, abstract = {In this work, a version of the Bzip2 data compression algorithm was implemented with the Apache Flink stream processing framework in order to evaluate its performance against the existing Bzip2 version in the SPar domain-specific language. The experiments revealed that the SPar version performs far better than the Flink version.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Hoffmann, Renato Barreto; Griebler, Dalvan; Fernandes, Luiz Gustavo Abstraindo o OpenMP no Desenvolvimento de Aplicações de Fluxo de Dados Contínuo Inproceedings doi 21st Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 69-72, Sociedade Brasileira de Computação, Joinville, Brazil, 2021. @inproceedings{HOFFMANN:ERAD:21, title = {Abstraindo o OpenMP no Desenvolvimento de Aplicações de Fluxo de Dados Contínuo}, author = {Renato Barreto Hoffmann and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/eradrs.2021.14777}, doi = {10.5753/eradrs.2021.14777}, year = {2021}, date = {2021-04-01}, booktitle = {21st Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)}, pages = {69-72}, publisher = {Sociedade Brasileira de Computação}, address = {Joinville, Brazil}, abstract = {OpenMP is complex when used to develop data-flow applications. To mitigate this difficulty, an existing methodology called SPar was used to raise the level of abstraction. Higher-level SPar annotations were therefore used to generate lower-level data-flow code with OpenMP. The experiments revealed that SPar performed 0.86% worse in the most extreme case.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Löff, Júnior; Griebler, Dalvan; Fernandes, Luiz Gustavo Melhorando a Geração Automática de Código Paralelo em Arquiteturas Multi-core na SPar Inproceedings doi 21st Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 65-68, Sociedade Brasileira de Computação, Joinville, Brazil, 2021. @inproceedings{LOFF:ERAD:21, title = {Melhorando a Geração Automática de Código Paralelo em Arquiteturas Multi-core na SPar}, author = {Júnior Löff and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/eradrs.2021.14776}, doi = {10.5753/eradrs.2021.14776}, year = {2021}, date = {2021-04-01}, booktitle = {21st Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)}, pages = {65-68}, publisher = {Sociedade Brasileira de Computação}, address = {Joinville, Brazil}, abstract = {In this work, to improve the efficiency of the parallel code generated for multi-core architectures, the SPar language and compiler were extended to allow the automatic generation of parallel patterns belonging to the two main parallelism domains, stream and data. Experiments show that the new version of SPar obtained results similar to, or even better than, the manually implemented versions.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Löff, Júnior; Hoffmann, Renato Barreto; Griebler, Dalvan; Fernandes, Luiz G High-Level Stream and Data Parallelism in C++ for Multi-Cores Inproceedings doi XXV Brazilian Symposium on Programming Languages (SBLP), pp. 41-48, ACM, Joinville, Brazil, 2021. @inproceedings{LOFF:SBLP:21, title = {High-Level Stream and Data Parallelism in C++ for Multi-Cores}, author = {Júnior Löff and Renato Barreto Hoffmann and Dalvan Griebler and Luiz G Fernandes}, url = {https://dl.acm.org/doi/10.1145/3475061.3475078}, doi = {10.1145/3475061.3475078}, year = {2021}, date = {2021-09-01}, booktitle = {XXV Brazilian Symposium on Programming Languages (SBLP)}, pages = {41-48}, publisher = {ACM}, address = {Joinville, Brazil}, series = {SBLP'21}, abstract = {Stream processing applications have seen an increasing demand with the increased availability of sensors, IoT devices, and user data. Modern systems can generate millions of data items per day that require to be processed timely. To deal with this demand, application programmers must consider parallelism to exploit the maximum performance of the underlying hardware resources. However, parallel programming is often difficult and error-prone, because programmers must deal with low-level system and architecture details. In this work, we introduce a new strategy for automatic data-parallel code generation in C++ targeting multi-core architectures. This strategy was integrated with an annotation-based parallel programming abstraction named SPar. We have increased SPar’s expressiveness for supporting stream and data parallelism, and their arbitrary composition. Therefore, we added two new attributes to its language and improved the compiler parallel code generation. We conducted a set of experiments on different stream and data-parallel applications to assess the efficiency of our solution. The results showed that the new SPar version obtained similar performance with respect to handwritten parallelizations. 
Moreover, the new SPar version is able to achieve up to 74.9x better performance with respect to the original ones due to this work.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Vogel, Adriano; Mencagli, Gabriele; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo Towards On-the-fly Self-Adaptation of Stream Parallel Patterns Inproceedings doi 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 889-93, IEEE, Valladolid, Spain, 2021. @inproceedings{VOGEL:PDP:21, title = {Towards On-the-fly Self-Adaptation of Stream Parallel Patterns}, author = {Adriano Vogel and Gabriele Mencagli and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/PDP52278.2021.00022}, doi = {10.1109/PDP52278.2021.00022}, year = {2021}, date = {2021-03-01}, booktitle = {29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages = {889-93}, publisher = {IEEE}, address = {Valladolid, Spain}, series = {PDP'21}, abstract = {Stream processing applications compute streams of data and provide insightful results in a timely manner, where parallel computing is necessary for accelerating the application executions. Considering that these applications are becoming increasingly dynamic and long-running, a potential solution is to apply dynamic runtime changes. However, it is challenging for humans to continuously monitor and manually self-optimize the executions. In this paper, we propose self-adaptiveness of the parallel patterns used, enabling flexible on-the-fly adaptations. The proposed solution is evaluated with an existing programming framework and running experiments with a synthetic and a real-world application. The results show that the proposed solution is able to dynamically self-adapt to the most suitable parallel pattern configuration and achieve performance competitive with the best static cases. 
The feasibility of the proposed solution encourages future optimizations and other applicabilities.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
2020 |
Bordin, Maycon Viana; Griebler, Dalvan; Mencagli, Gabriele; Geyer, Claudio F R; Fernandes, Luiz Gustavo DSPBench: a Suite of Benchmark Applications for Distributed Data Stream Processing Systems Journal Article doi IEEE Access, 8 (na), pp. 222900-222917, 2020. @article{BORDIN:IEEEAccess:20, title = {DSPBench: a Suite of Benchmark Applications for Distributed Data Stream Processing Systems}, author = {Maycon Viana Bordin and Dalvan Griebler and Gabriele Mencagli and Claudio F R Geyer and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/ACCESS.2020.3043948}, doi = {10.1109/ACCESS.2020.3043948}, year = {2020}, date = {2020-12-01}, journal = {IEEE Access}, volume = {8}, number = {na}, pages = {222900-222917}, publisher = {IEEE}, abstract = {Systems enabling the continuous processing of large data streams have recently attracted the attention of the scientific community and industrial stakeholders. Data Stream Processing Systems (DSPSs) are complex and powerful frameworks able to ease the development of streaming applications in distributed computing environments like clusters and clouds. Several systems of this kind have been released and currently maintained as open source projects, like Apache Storm and Spark Streaming. Some benchmark applications have often been used by the scientific community to test and evaluate new techniques to improve the performance and usability of DSPSs. However, the existing benchmark suites lack of representative workloads coming from the wide set of application domains that can leverage the benefits offered by the stream processing paradigm in terms of near real-time performance. The goal of this paper is to present a new benchmark suite composed of 15 applications coming from areas like Finance, Telecommunications, Sensor Networks, Social Networks and others. 
This paper describes in detail the nature of these applications, their full workload characterization in terms of selectivity, processing cost, input size and overall memory occupation. In addition, it exemplifies the usefulness of our benchmark suite to compare real DSPSs by selecting Apache Storm and Spark Streaming for this analysis.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Stein, Charles Michael; Rockenbach, Dinei A; Griebler, Dalvan; Torquati, Massimo; Mencagli, Gabriele; Danelutto, Marco; Fernandes, Luiz Gustavo Latency‐aware adaptive micro‐batching techniques for streamed data compression on graphics processing units Journal Article doi Concurrency and Computation: Practice and Experience, na (na), pp. e5786, 2020. @article{STEIN:CCPE:20, title = {Latency‐aware adaptive micro‐batching techniques for streamed data compression on graphics processing units}, author = {Charles Michael Stein and Dinei A. Rockenbach and Dalvan Griebler and Massimo Torquati and Gabriele Mencagli and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1002/cpe.5786}, doi = {10.1002/cpe.5786}, year = {2020}, date = {2020-05-01}, journal = {Concurrency and Computation: Practice and Experience}, volume = {na}, number = {na}, pages = {e5786}, publisher = {Wiley Online Library}, abstract = {Stream processing is a parallel paradigm used in many application domains. With the advance of graphics processing units (GPUs), their usage in stream processing applications has increased as well. The efficient utilization of GPU accelerators in streaming scenarios requires to batch input elements in microbatches, whose computation is offloaded on the GPU leveraging data parallelism within the same batch of data. Since data elements are continuously received based on the input speed, the bigger the microbatch size the higher the latency to completely buffer it and to start the processing on the device. Unfortunately, stream processing applications often have strict latency requirements that need to find the best size of the microbatches and to adapt it dynamically based on the workload conditions as well as according to the characteristics of the underlying device and network. In this work, we aim at implementing latency‐aware adaptive microbatching techniques and algorithms for streaming compression applications targeting GPUs. 
The evaluation is conducted using the Lempel‐Ziv‐Storer‐Szymanski compression application considering different input workloads. As a general result of our work, we noticed that algorithms with elastic adaptation factors respond better for stable workloads, while algorithms with narrower targets respond better for highly unbalanced workloads.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Andrade, Gabriella; Griebler, Dalvan; Fernandes, Luiz Gustavo Avaliação da Usabilidade de Interfaces de Programação Paralela para Sistemas Multi-Core em Aplicação de Vídeo Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 149-150, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. @inproceedings{ANDRADE:ERAD:20, title = {Avaliação da Usabilidade de Interfaces de Programação Paralela para Sistemas Multi-Core em Aplicação de Vídeo}, author = {Gabriella Andrade and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/eradrs.2020.10781}, doi = {10.5753/eradrs.2020.10781}, year = {2020}, date = {2020-04-01}, booktitle = {XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)}, pages = {149-150}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Santa Maria, BR}, abstract = {Given the wide variety of parallel programming interfaces for multi-core environments, it is difficult to determine which of them offer the best usability. This work carries out an experiment comparing the parallelization of a video application with the FastFlow, SPar, and TBB tools. The results revealed that SPar requires less effort to parallelize a video application than the other parallel programming interfaces.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Justo, Gabriel; Hoffmann, Renato Barreto; Vogel, Adriano; Griebler, Dalvan; Fernandes, Luiz Gustavo Acelerando uma Aplicação de Detecção de Pistas com MPI Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 117-120, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. @inproceedings{JUSTO:ERAD:20, title = {Acelerando uma Aplicação de Detecção de Pistas com MPI}, author = {Gabriel Justo and Renato Barreto Hoffmann and Adriano Vogel and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/eradrs.2020.10770}, doi = {10.5753/eradrs.2020.10770}, year = {2020}, date = {2020-04-01}, booktitle = {XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)}, pages = {117-120}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Santa Maria, BR}, abstract = {Video stream applications demand high-performance processing to meet real-time requirements. In this scenario, distributed parallel programming is an alternative for accelerating and scaling performance. In this work, the goal is to parallelize a lane detection application with the MPI library using the Farm pattern and implementing two task distribution strategies. The results show the performance gains.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Hoffmann, Renato Barreto; Griebler, Dalvan; Fernandes, Luiz Gustavo Geração Automática de Código TBB na SPar Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 97-100, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. @inproceedings{HOFFMANN:ERAD:20, title = {Geração Automática de Código TBB na SPar}, author = {Renato Barreto Hoffmann and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/eradrs.2020.10765}, doi = {10.5753/eradrs.2020.10765}, year = {2020}, date = {2020-04-01}, booktitle = {XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)}, pages = {97-100}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Santa Maria, BR}, abstract = {Parallel programming techniques are needed to extract the full potential of multi-core processors. To that end, SPar, a language for abstracting stream parallelism, was created. This work describes the implementation of automatic code generation for the TBB library in SPar, since SPar previously generated code for FastFlow. Tests with applications resulted in execution times up to 12.76 times faster.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
de Araújo, Gabriell Alves; Griebler, Dalvan; Fernandes, Luiz Gustavo Implementação CUDA dos Kernels NPB Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 85-88, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. @inproceedings{ARAUJO:ERAD:20, title = {Implementação CUDA dos Kernels NPB}, author = {Gabriell Alves de Araújo and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/eradrs.2020.10762}, doi = {10.5753/eradrs.2020.10762}, year = {2020}, date = {2020-04-01}, booktitle = {XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)}, pages = {85-88}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Santa Maria, BR}, abstract = {NAS Parallel Benchmarks (NPB) is a set of benchmarks used to evaluate hardware and software that has been ported to different frameworks over the years. Regarding GPUs, only OpenCL and OpenACC versions currently exist. This work contributes to the literature by providing the first complete CUDA implementation of the NPB kernels, running experiments with a previously unused workload and revealing new facts about the NPB.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Löff, Junior; Griebler, Dalvan; Fernandes, Luiz Gustavo Implementação Paralela do LU no NPB C++ Utilizando um Pipeline Implícito Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 37-40, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. @inproceedings{LOFF:ERAD:20, title = {Implementação Paralela do LU no NPB C++ Utilizando um Pipeline Implícito}, author = {Junior Löff and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/eradrs.2020.10750}, doi = {10.5753/eradrs.2020.10750}, year = {2020}, date = {2020-04-01}, booktitle = {XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)}, pages = {37-40}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Santa Maria, BR}, abstract = {In this work, an implicit pipeline with the map pattern was implemented in the LU application of the NAS Parallel Benchmarks in C++. LU has data dependencies over time, which makes exploiting parallelism difficult. It was converted from Fortran to C++ so that it could be parallelized with different libraries for multi-core systems. Using this strategy with the libraries enabled performance gains of up to 10.6% over the original version.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Hoffmann, Renato B; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo Stream Parallelism Annotations for Multi-Core Frameworks Inproceedings doi XXIV Brazilian Symposium on Programming Languages (SBLP), pp. 48-55, ACM, Natal, Brazil, 2020. @inproceedings{HOFFMANN:SBLP:20, title = {Stream Parallelism Annotations for Multi-Core Frameworks}, author = {Renato B. Hoffmann and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1145/3427081.3427088}, doi = {10.1145/3427081.3427088}, year = {2020}, date = {2020-10-01}, booktitle = {XXIV Brazilian Symposium on Programming Languages (SBLP)}, pages = {48-55}, publisher = {ACM}, address = {Natal, Brazil}, series = {SBLP'20}, abstract = {Data generation, collection, and processing is an important workload of modern computer architectures. Stream or high-intensity data flow applications are commonly employed in extracting and interpreting the information contained in this data. Due to the computational complexity of these applications, high-performance ought to be achieved using parallel computing. Indeed, the efficient exploitation of available parallel resources from the architecture remains a challenging task for the programmers. Techniques and methodologies are required to help shift the efforts from the complexity of parallelism exploitation to specific algorithmic solutions. To tackle this problem, we propose a methodology that provides the developer with a suitable abstraction layer between a clean and effective parallel programming interface targeting different multi-core parallel programming frameworks. We used standard C++ code annotations that may be inserted in the source code by the programmer. Then, a compiler parses C++ code with the annotations and generates calls to the desired parallel runtime API. 
Our experiments demonstrate the feasibility of our methodology and the performance of the abstraction layer, where the difference is negligible in four applications with respect to the state-of-the-art C++ parallel programming frameworks. Additionally, our methodology allows improving the application performance since the developers can choose the runtime that best performs in their system.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Garcia, Adriano Marques; Serpa, Matheus; Griebler, Dalvan; Schepke, Claudio; Fernandes, Luiz Gustavo; Navaux, Philippe O A The Impact of CPU Frequency Scaling on Power Consumption of Computing Infrastructures Inproceedings doi International Conference on Computational Science and its Applications (ICCSA), pp. 142-157, Springer, Cagliari, Italy, 2020. @inproceedings{GARCIA:ICCSA:20, title = {The Impact of CPU Frequency Scaling on Power Consumption of Computing Infrastructures}, author = {Adriano Marques Garcia and Matheus Serpa and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes and Philippe O A Navaux}, url = {https://doi.org/10.1007/978-3-030-58817-5_12}, doi = {10.1007/978-3-030-58817-5_12}, year = {2020}, date = {2020-07-01}, booktitle = {International Conference on Computational Science and its Applications (ICCSA)}, volume = {12254}, pages = {142-157}, publisher = {Springer}, address = {Cagliari, Italy}, series = {ICCSA'20}, abstract = {Since the demand for computing power increases, new architectures emerged to obtain better performance. Reducing the power and energy consumption of these architectures is one of the main challenges to achieving high-performance computing. Current research trends aim at developing new software and hardware techniques to achieve the best performance and energy trade-offs. In this work, we investigate the impact of different CPU frequency scaling techniques such as ondemand, performance, and powersave on the power and energy consumption of multi-core based computer infrastructure. We apply these techniques in PAMPAR, a parallel benchmark suite implemented in PThreads, OpenMP, MPI-1, and MPI-2 (spawn). We measure the energy and execution time of 10 benchmarks, varying the number of threads. Our results show that although powersave consumes up to 43.1% less power than performance and ondemand governors, it consumes the triple of energy due to the high execution time. 
Our experiments also show that the performance governor consumes up to 9.8% more energy than ondemand for CPU-bound benchmarks. Finally, our results show that PThreads has the lowest power consumption, consuming less than the sequential version for memory-bound benchmarks. Regarding performance, the performance governor achieved 3% of performance over the ondemand.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Garcia, Adriano Marques; Griebler, Dalvan; Fernandes, Luiz Gustavo Proposta de uma Suíte de Benchmarks para Processamento de Stream em Sistemas Multi-Core Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 167-168, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. @inproceedings{GARCIA:ERAD:20, title = {Proposta de uma Suíte de Benchmarks para Processamento de Stream em Sistemas Multi-Core}, author = {Adriano Marques Garcia and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/eradrs.2020.10790}, doi = {10.5753/eradrs.2020.10790}, year = {2020}, date = {2020-04-01}, booktitle = {XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)}, pages = {167-168}, publisher = {Sociedade Brasileira de Computação (SBC)}, address = {Santa Maria, BR}, abstract = {The growth in the volume of data generated by computing systems and the need to process this data quickly have been driving the field of stream processing. However, there is still no benchmark to assist developers and researchers. This work proposes a benchmark suite for stream processing on multi-core architectures and discusses the characteristics required in developing this suite.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Vogel, Adriano; Rista, Cassiano; Justo, Gabriel; Ewald, Endrius; Griebler, Dalvan; Mencagli, Gabriele; Fernandes, Luiz Gustavo Parallel Stream Processing with MPI for Video Analytics and Data Visualization Inproceedings doi High Performance Computing Systems, pp. 102-116, Springer, Cham, 2020. @inproceedings{VOGEL:CCIS:20, title = {Parallel Stream Processing with MPI for Video Analytics and Data Visualization}, author = {Adriano Vogel and Cassiano Rista and Gabriel Justo and Endrius Ewald and Dalvan Griebler and Gabriele Mencagli and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/978-3-030-41050-6_7}, doi = {10.1007/978-3-030-41050-6_7}, year = {2020}, date = {2020-02-01}, booktitle = {High Performance Computing Systems}, volume = {1171}, pages = {102-116}, publisher = {Springer}, address = {Cham}, series = {Communications in Computer and Information Science (CCIS)}, abstract = {The amount of data generated is increasing exponentially. However, processing data and producing fast results is a technological challenge. Parallel stream processing can be implemented for handling high frequency and big data flows. The MPI parallel programming model offers low-level and flexible mechanisms for dealing with distributed architectures such as clusters. This paper aims to use it to accelerate video analytics and data visualization applications so that insight can be obtained as soon as the data arrives. Experiments were conducted with a Domain-Specific Language for Geospatial Data Visualization and a Person Recognizer video application. We applied the same stream parallelism strategy and two task distribution strategies. The dynamic task distribution achieved better performance than the static distribution in the HPC cluster. The data visualization achieved lower throughput with respect to the video analytics due to the I/O intensive operations. 
Also, the MPI programming model shows promising performance outcomes for stream processing applications.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
de Araujo, Gabriell Alves; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo Efficient NAS Parallel Benchmark Kernels with CUDA Inproceedings doi 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 9-16, IEEE, Västerås, Sweden, 2020. @inproceedings{ARAUJO:PDP:20, title = {Efficient NAS Parallel Benchmark Kernels with CUDA}, author = {Gabriell Alves de Araujo and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/PDP50117.2020.00009}, doi = {10.1109/PDP50117.2020.00009}, year = {2020}, date = {2020-03-01}, booktitle = {28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages = {9-16}, publisher = {IEEE}, address = {Västerås, Sweden}, series = {PDP'20}, abstract = {NAS Parallel Benchmarks (NPB) are one of the standard benchmark suites used to evaluate parallel hardware and software. There are many research efforts trying to provide different parallel versions apart from the original OpenMP and MPI. Concerning GPU accelerators, there are only the OpenCL and OpenACC available as consolidated versions. Our goal is to provide an efficient parallel implementation of the five NPB kernels with CUDA. Our contribution covers different aspects. First, best parallel programming practices were followed to implement NPB kernels using CUDA. Second, the support of larger workloads (class B and C) allow to stress and investigate the memory of robust GPUs. Third, we show that it is possible to make NPB efficient and suitable for GPUs although the benchmarks were designed for CPUs in the past. We succeed in achieving double performance with respect to the state-of-the-art in some cases as well as implementing efficient memory usage. Fourth, we discuss new experiments comparing performance and memory usage against OpenACC and OpenCL state-of-the-art versions using a relative new GPU architecture. 
The experimental results also revealed that our version is the best one for all the NPB kernels compared to OpenACC and OpenCL. The greatest differences were observed for the FT and EP kernels.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Rockenbach, Dinei A High-Level Programming Abstractions for Stream Parallelism on GPUs Masters Thesis School of Technology - PPGCC - PUCRS, 2020. @mastersthesis{ROCKENBACH:DM:20, title = {High-Level Programming Abstractions for Stream Parallelism on GPUs}, author = {Dinei A. Rockenbach}, url = {https://tede2.pucrs.br/tede2/handle/tede/9592}, year = {2020}, date = {2020-11-27}, address = {Porto Alegre, Brazil}, school = {School of Technology - PPGCC - PUCRS}, abstract = {The growth and spread of parallel architectures have driven the pursuit of greater computing power with massively parallel hardware such as the Graphics Processing Units (GPUs). This new heterogeneous computer architecture composed of multi-core Central Processing Units (CPUs) and many-core GPUs became usual, enabling novel software applications such as self-driving cars, real-time ray tracing, deep learning, and Virtual Reality (VR), which are characterized as stream processing applications. However, this heterogeneous environment poses an additional challenge to software development, which is still in the process of adapting to the parallel processing paradigm on multi-core systems, where programmers are supported by several Application Programming Interfaces (APIs) that offer different abstraction levels. The parallelism exploitation in GPU is done using both CUDA and OpenCL for academia and industry, whose developers have to deal with low-level architecture concepts to efficiently exploit GPU parallelism in their applications. There is still a lack of parallel programming abstractions when: 1) parallelizing code on GPUs, and 2) needing higher-level programming abstractions that deal with both CPU and GPU parallelism. Unfortunately, developers still have to be expert programmers on system and architecture to enable efficient hardware parallelism exploitation in this architectural environment. 
To contribute to the first problem, we created GSPARLIB, a novel structured parallel programming library for exploiting GPU parallelism that provides a unified programming API and driver-agnostic runtime. It offers Map and Reduce parallel patterns on top of CUDA and OpenCL drivers. We evaluate its performance comparing with state-of-the-art APIs, where the experiments revealed a comparable and efficient performance. For contributing to the second problem, we extended the SPar Domain-Specific Language (DSL), which has been proved to be high-level and productive for expressing stream parallelism with C++ annotations in multi-core CPUs. In this work, we propose and implement new annotations that increase expressiveness to combine the current stream parallelism on CPUs and data parallelism on GPUs. We also provide new pattern-based transformation rules that were implemented in the compiler targeting automatic source-to-source code transformations using GSPARLIB for GPU parallelism exploitation. Our experiments demonstrate that SPar compiler is able to generate stream and data parallel patterns without significant performance penalty compared to handwritten code. Thanks to these advances in SPar, our work is the first on providing high-level C++11 annotations as an API that does not require significant code refactoring in sequential programs while enabling multi-core CPU and many-core GPU parallelism exploitation for stream processing applications.}, keywords = {}, pubstate = {published}, tppubtype = {mastersthesis} } |
2019 |
Mencagli, Gabriele; Torquati, Massimo; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo Raising the Parallel Abstraction Level for Streaming Analytics Applications Journal Article doi IEEE Access, 7 , pp. 131944 - 131961, 2019. @article{MENCAGLI:IEEEAccess:19, title = {Raising the Parallel Abstraction Level for Streaming Analytics Applications}, author = {Gabriele Mencagli and Massimo Torquati and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/ACCESS.2019.2941183}, doi = {10.1109/ACCESS.2019.2941183}, year = {2019}, date = {2019-09-01}, journal = {IEEE Access}, volume = {7}, pages = {131944 - 131961}, publisher = {IEEE}, abstract = {In the stream processing domain, applications are represented by graphs of operators arbitrarily connected and filled with their business logic code. The APIs of existing Stream Processing Systems (SPSs) ease the development of transformations that recur in the streaming practice (e.g., filtering, aggregation and joins). In contrast, their parallelism abstractions are quite limited since they provide support to stateless operators only, or when the state is organized in a set of key-value pairs. This paper presents how the parallel patterns methodology can be revisited for sliding-window streaming analytics. Our vision fosters a design process of the application as composition and nesting of ready-to-use patterns provided through a C++17 fluent interface. Our prototype implements the run-time system of the patterns in the FastFlow parallel library expressing thread-based parallelism. The experimental analysis shows interesting outcomes. First, our pattern-based approach allows easy prototyping of different versions of the application, and the programmer can leverage nesting of patterns to increase performance (up to 37% in one of the two considered test-bed cases). 
Second, our FastFlow implementation outperforms (three times faster) the handmade porting of our patterns in popular JVM-based SPSs. Finally, in the concluding part of this paper, we explore the use of a task-based run-time system, by deriving interesting insights into how to make our patterns library suitable for multi backends.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Griebler, Dalvan; Vogel, Adriano; Sensi, Daniele De; Danelutto, Marco; Fernandes, Luiz Gustavo Simplifying and implementing service level objectives for stream parallelism Journal Article doi Journal of Supercomputing, pp. 1-26, 2019, ISSN: 0920-8542. @article{GRIEBLER:JS:19, title = {Simplifying and implementing service level objectives for stream parallelism}, author = {Dalvan Griebler and Adriano Vogel and Daniele De Sensi and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/s11227-019-02914-6}, doi = {10.1007/s11227-019-02914-6}, issn = {0920-8542}, year = {2019}, date = {2019-06-01}, journal = {Journal of Supercomputing}, pages = {1-26}, publisher = {Springer}, abstract = {An increasing attention has been given to provide service level objectives (SLOs) in stream processing applications due to the performance and energy requirements, and because of the need to impose limits in terms of resource usage while improving the system utilization. Since the current and next-generation computing systems are intrinsically offering parallel architectures, the software has to naturally exploit the architecture parallelism. Implement and meet SLOs on existing applications is not a trivial task for application programmers, since the software development process, besides the parallelism exploitation, requires the implementation of autonomic algorithms or strategies. This is a system-oriented programming approach and requires the management of multiple knobs and sensors (e.g., the number of threads to use, the clock frequency of the cores, etc.) so that the system can self-adapt at runtime. In this work, we introduce a new and simpler way to define SLO in the application’s source code, by abstracting from the programmer all the details relative to self-adaptive system implementation. The application programmer specifies which parts of the code to parallelize and the related SLOs that should be enforced. 
To reach this goal, source-to-source code transformation rules are implemented in our compiler, which automatically generates self-adaptive strategies to enforce, at runtime, the user-expressed objectives. The experiments highlighted promising results with simpler, effective, and efficient SLO implementations for real-world applications.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Vogel, Adriano; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo Minimizing Self-Adaptation Overhead in Parallel Stream Processing for Multi-Cores Inproceedings doi Euro-Par 2019: Parallel Processing Workshops, pp. 12, Springer, Göttingen, Germany, 2019. @inproceedings{VOGEL:adaptive-overhead:AutoDaSP:19, title = {Minimizing Self-Adaptation Overhead in Parallel Stream Processing for Multi-Cores}, author = {Adriano Vogel and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/978-3-030-48340-1_3}, doi = {10.1007/978-3-030-48340-1_3}, year = {2019}, date = {2019-08-01}, booktitle = {Euro-Par 2019: Parallel Processing Workshops}, volume = {11997}, pages = {12}, publisher = {Springer}, address = {Göttingen, Germany}, series = {Lecture Notes in Computer Science}, abstract = {Stream processing paradigm is present in several applications that apply computations over continuous data flowing in the form of streams (e.g., video feeds, image, and data analytics). Employing self-adaptivity to stream processing applications can provide higher-level programming abstractions and autonomic resource management. However, there are cases where the performance is suboptimal. In this paper, the goal is to optimize parallelism adaptations in terms of stability and accuracy, which can improve the performance of parallel stream processing applications. Therefore, we present a new optimized self-adaptive strategy that is experimentally evaluated. The proposed solution provided high-level programming abstractions, reduced the adaptation overhead, and achieved a competitive performance with the best static executions.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Rockenbach, Dinei A; Stein, Charles Michael; Griebler, Dalvan; Mencagli, Gabriele; Torquati, Massimo; Danelutto, Marco; Fernandes, Luiz Gustavo Stream Processing on Multi-cores with GPUs: Parallel Programming Models' Challenges Inproceedings doi International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 834-841, IEEE, Rio de Janeiro, Brazil, 2019. @inproceedings{ROCKENBACH:stream-multigpus:IPDPSW:19, title = {Stream Processing on Multi-cores with GPUs: Parallel Programming Models' Challenges}, author = {Dinei A. Rockenbach and Charles Michael Stein and Dalvan Griebler and Gabriele Mencagli and Massimo Torquati and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/IPDPSW.2019.00137}, doi = {10.1109/IPDPSW.2019.00137}, year = {2019}, date = {2019-05-01}, booktitle = {International Parallel and Distributed Processing Symposium Workshops (IPDPSW)}, pages = {834-841}, publisher = {IEEE}, address = {Rio de Janeiro, Brazil}, series = {IPDPSW'19}, abstract = {The stream processing paradigm is used in several scientific and enterprise applications in order to continuously compute results out of data items coming from data sources such as sensors. The full exploitation of the potential parallelism offered by current heterogeneous multi-cores equipped with one or more GPUs is still a challenge in the context of stream processing applications. In this work, our main goal is to present the parallel programming challenges that the programmer has to face when exploiting CPUs and GPUs' parallelism at the same time using traditional programming models. We highlight the parallelization methodology in two use-cases (the Mandelbrot Streaming benchmark and the PARSEC's Dedup application) to demonstrate the issues and benefits of using heterogeneous parallel hardware. 
The experiments conducted demonstrate how a high-level parallel programming model targeting stream processing like the one offered by SPar can be used to reduce the programming effort still offering a good level of performance if compared with state-of-the-art programming models.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Stein, Charles Michael; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo Stream Parallelism on the LZSS Data Compression Application for Multi-Cores with GPUs Inproceedings doi 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 247-251, IEEE, Pavia, Italy, 2019. @inproceedings{STEIN:LZSS-multigpu:PDP:19, title = {Stream Parallelism on the LZSS Data Compression Application for Multi-Cores with GPUs}, author = {Charles Michael Stein and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/EMPDP.2019.8671624}, doi = {10.1109/EMPDP.2019.8671624}, year = {2019}, date = {2019-02-01}, booktitle = {27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages = {247-251}, publisher = {IEEE}, address = {Pavia, Italy}, series = {PDP'19}, abstract = {GPUs have been used to accelerate different data parallel applications. The challenge consists in using GPUs to accelerate stream processing applications. Our goal is to investigate and evaluate whether stream parallel applications may benefit from parallel execution on both CPU and GPU cores. In this paper, we introduce new parallel algorithms for the Lempel-Ziv-Storer-Szymanski (LZSS) data compression application. We implemented the algorithms targeting both CPUs and GPUs. GPUs have been used with CUDA and OpenCL to exploit inner algorithm data parallelism. Outer stream parallelism has been exploited using CPU cores through SPar. The parallel implementation of LZSS achieved 135 fold speedup using a multi-core CPU and two GPUs. We also observed speedups in applications where we were not expecting to get it using the same combine data-stream parallel exploitation techniques.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Maron, Carlos A F; Vogel, Adriano; Griebler, Dalvan; Fernandes, Luiz Gustavo Should PARSEC Benchmarks be More Parametric? A Case Study with Dedup Inproceedings doi 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 217-221, IEEE, Pavia, Italy, 2019. @inproceedings{MARON:parametric-parsec:PDP:19, title = {Should PARSEC Benchmarks be More Parametric? A Case Study with Dedup}, author = {Carlos A. F. Maron and Adriano Vogel and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/EMPDP.2019.8671592}, doi = {10.1109/EMPDP.2019.8671592}, year = {2019}, date = {2019-02-01}, booktitle = {27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages = {217-221}, publisher = {IEEE}, address = {Pavia, Italy}, series = {PDP'19}, abstract = {Parallel applications of the same domain can present similar patterns of behavior and characteristics. Characterizing common application behaviors can help for understanding performance aspects in the real-world scenario. One way to better understand and evaluate applications' characteristics is by using customizable/parametric benchmarks that enable users to represent important characteristics at run-time. We observed that parameterization techniques should be better exploited in the available benchmarks, especially on stream processing domain. For instance, although widely used, the stream processing benchmarks available in PARSEC do not support the simulation and evaluation of relevant and modern characteristics. Therefore, our goal is to identify the stream parallelism characteristics present in PARSEC. We also implemented a ready to use parameterization support and evaluated the application behaviors considering relevant performance metrics for stream parallelism (service time, throughput, latency). We choose Dedup to be our case study. 
The experimental results have shown performance improvements in our parameterization support for Dedup. Moreover, this support increased the customization space for benchmark users, which is simple to use. In the future, our solution can be potentially explored on different parallel architectures and parallel programming frameworks.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Serpa, Matheus S; Moreira, Francis B; Navaux, Philippe O A; Cruz, Eduardo H M; Diener, Matthias; Griebler, Dalvan; Fernandes, Luiz Gustavo Memory Performance and Bottlenecks in Multicore and GPU Architectures Inproceedings doi 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 233-236, IEEE, Pavia, Italy, 2019. @inproceedings{SERPA:memory-gpu-multicore:PDP:19, title = {Memory Performance and Bottlenecks in Multicore and GPU Architectures}, author = {Matheus S. Serpa and Francis B. Moreira and Philippe O. A. Navaux and Eduardo H. M. Cruz and Matthias Diener and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/EMPDP.2019.8671628}, doi = {10.1109/EMPDP.2019.8671628}, year = {2019}, date = {2019-02-01}, booktitle = {27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages = {233-236}, publisher = {IEEE}, address = {Pavia, Italy}, series = {PDP'19}, abstract = {Nowadays, there are several different architectures available not only for the industry, but also for normal consumers. Traditional multicore processors, GPUs, accelerators such as the Sunway SW26010, or even energy efficiency-driven processors such as the ARM family, present very different architectural characteristics. This wide range of characteristics presents a challenge for the developers of applications. Developers must deal with different instruction sets, memory hierarchies, or even different programming paradigms when programming for these architectures. Therefore, the same application can perform well when executing on one architecture, but poorly on another architecture. To optimize an application, it is important to have a deep understanding of how it behaves on different architectures. The related work in this area mostly focuses on a limited analysis encompassing execution time and energy. 
In this paper, we perform a detailed investigation on the impact of the memory subsystem of different architectures, which is one of the most important aspects to be considered. For this study, we performed experiments in the Broadwell CPU and Pascal GPU, using applications from the Rodinia benchmark suite. In this way, we were able to understand why an application performs well on one architecture and poorly on others.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Pieper, Ricardo; Griebler, Dalvan; Fernandes, Luiz Gustavo Structured Stream Parallelism for Rust Inproceedings doi XXIII Brazilian Symposium on Programming Languages (SBLP), pp. 54-61, ACM, Salvador, Brazil, 2019. @inproceedings{PIEPER:SBLP:19b, title = {Structured Stream Parallelism for Rust}, author = {Ricardo Pieper and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1145/3355378.3355384}, doi = {10.1145/3355378.3355384}, year = {2019}, date = {2019-10-01}, booktitle = {XXIII Brazilian Symposium on Programming Languages (SBLP)}, pages = {54-61}, publisher = {ACM}, address = {Salvador, Brazil}, series = {SBLP'19}, abstract = {Structured parallel programming has been studied and applied in several programming languages. This approach has proven to be suitable for abstracting low-level and architecture-dependent parallelism implementations. Our goal is to provide a structured and high-level library for the Rust language, targeting parallel stream processing applications for multi-core servers. Rust is an emerging programming language that has been developed by Mozilla Research group, focusing on performance, memory safety, and thread-safety. However, it lacks parallel programming abstractions, especially for stream processing applications. This paper contributes to a new API based on the structured parallel programming approach to simplify parallel software developing. Our experiments highlight that our solution provides higher-level parallel programming abstractions for stream processing applications in Rust. We also show that the throughput and speedup are comparable to the state-of-the-art for certain workloads.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
2023 |
Micro-batch and data frequency for stream processing on multi-cores Journal Article doi The Journal of Supercomputing, In press (In press), pp. 1-39, 2023. |
2022 |
SPBench: a framework for creating benchmarks of stream processing applications Journal Article doi Computing, In press (In press), pp. 1-23, 2022. |
OpenMP as runtime for providing high-level stream parallelism on multi-cores Journal Article doi The Journal of Supercomputing, 1 (1), pp. 7655–7676, 2022. |
Self-adaptation on Parallel Stream Processing: A Systematic Review Journal Article doi Concurrency and Computation: Practice and Experience, 34 (6), pp. e6759, 2022. |
Evaluating Micro-batch and Data Frequency for Stream Processing Applications on Multi-cores Inproceedings doi 30th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 10-17, IEEE, Valladolid, Spain, 2022. |
Um Framework para Criar Benchmarks de Aplicações Paralelas de Stream Inproceedings doi Anais da XXII Escola Regional de Alto Desempenho da Região Sul, pp. 97–98, Sociedade Brasileira de Computação, Curitiba, Brazil, 2022. |
Avaliação do Esforço de Programação em GPU: Estudo Piloto Inproceedings doi Anais da XXII Escola Regional de Alto Desempenho da Região Sul, pp. 95–96, Sociedade Brasileira de Computação, Curitiba, Brazil, 2022. |
Avaliação da aplicação de paralelismo em classificadores taxonômicos usando Qiime2 Inproceedings doi Anais da XXII Escola Regional de Alto Desempenho da Região Sul, pp. 25–28, Sociedade Brasileira de Computação (SBC), Porto Alegre, RS, Brazil, 2022. |
High-Level Stream and Data Parallelism in C++ for GPUs Inproceedings doi XXVI Brazilian Symposium on Programming Languages (SBLP), pp. 41-49, ACM, Uberlândia, Brazil, 2022. |
Self-adaptive abstractions for efficient high-level parallel computing in multi-cores PhD Thesis School of Technology - PUCRS, 2022. |
Self-adaptive abstractions for efficient high-level parallel computing in multi-cores PhD Thesis Computer Science Department - University of Pisa, 2022. |
2021 |
Providing High‐Level Self‐Adaptive Abstractions for Stream Parallelism on Multicores Journal Article doi Software: Practice and Experience, 51 (6), pp. 1194-1217, 2021. |
Melhorando a Geração Automática de Código Paralelo para o Paradigma de Processamento de Stream em Multi-cores Journal Article Revista Eletrônica de Iniciação Científica em Computação, 19 (2), pp. 2083, 2021. |
Geração de Código OpenMP para o Paralelismo de Stream Journal Article Revista Eletrônica de Iniciação Científica em Computação, 19 (2), pp. 2082, 2021. |
High-level and Efficient Structured Stream Parallelism for Rust on Multi-cores Journal Article doi Journal of Computer Languages, 65 (na), pp. 101054, 2021, ISSN: 2590-1184. |
The NAS parallel benchmarks for evaluating C++ parallel programming frameworks on shared-memory architectures Journal Article doi Future Generation Computer Systems, na (na), pp. na, 2021. |
Introducing a Stream Processing Framework for Assessing Parallel Programming Interfaces Inproceedings doi 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 84-88, IEEE, Valladolid, Spain, 2021. |
Assessing Coding Metrics for Parallel Programming of Stream Processing Programs on Multi-cores Inproceedings doi 2021 47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pp. 291-295, IEEE, Pavia, Italy, 2021, ISBN: 978-1-6654-2705-0. |
Proposta de Otimização do Tamanho de Batch em Aplicações de Stream para Multicores usando Aprendizado de Máquina Inproceedings doi 21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 127-128, Sociedade Brasileira de Computação, Joinville, Brazil, 2021. |
Proposta de um Framework para Avaliar Interfaces de Programação Paralela em Aplicações de Stream Inproceedings doi 21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 119-120, Sociedade Brasileira de Computação, Joinville, Brazil, 2021. |
Provendo Abstrações de Alto Nível para GPUs na SPar Inproceedings doi 21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 109-110, Sociedade Brasileira de Computação, Joinville, Brazil, 2021. |
Proposta de Suporte à Parametrização no NPB com CUDA Inproceedings doi 21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 103-104, Sociedade Brasileira de Computação, Joinville, Brazil, 2021. |
Proposta de Adaptação Dinâmica de Padrões Paralelos Inproceedings doi 21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 101-102, Sociedade Brasileira de Computação, Joinville, Brazil, 2021. |
Uso de Métricas de Codificação para Avaliar a Programação Paralela nas Aplicações de Stream em Sistemas Multi-core Inproceedings doi 21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 93-94, Sociedade Brasileira de Computação, Joinville, Brazil, 2021. |
Compressão de Dados em Multicores com Flink ou SPar? Inproceedings doi 21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 77-80, Sociedade Brasileira de Computação, Joinville, Brazil, 2021. |
Abstraindo o OpenMP no Desenvolvimento de Aplicações de Fluxo de Dados Contínuo Inproceedings doi 21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 69-72, Sociedade Brasileira de Computação, Joinville, Brazil, 2021. |
Melhorando a Geração Automática de Código Paralelo em Arquiteturas Multi-core na SPar Inproceedings doi 21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 65-68, Sociedade Brasileira de Computação, Joinville, Brazil, 2021. |
High-Level Stream and Data Parallelism in C++ for Multi-Cores Inproceedings doi XXV Brazilian Symposium on Programming Languages (SBLP), pp. 41-48, ACM, Joinville, Brazil, 2021. |
Towards On-the-fly Self-Adaptation of Stream Parallel Patterns Inproceedings doi 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 889-93, IEEE, Valladolid, Spain, 2021. |
2020 |
DSPBench: a Suite of Benchmark Applications for Distributed Data Stream Processing Systems Journal Article doi IEEE Access, 8 (na), pp. 222900-222917, 2020. |
Latency‐aware adaptive micro‐batching techniques for streamed data compression on graphics processing units Journal Article doi Concurrency and Computation: Practice and Experience, na (na), pp. e5786, 2020. |
Avaliação da Usabilidade de Interfaces de Programação Paralela para Sistemas Multi-Core em Aplicação de Vídeo Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 149-150, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. |
Acelerando uma Aplicação de Detecção de Pistas com MPI Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 117-120, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. |
Geração Automática de Código TBB na SPar Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 97-100, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. |
Implementação CUDA dos Kernels NPB Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 85-88, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. |
Implementação Paralela do LU no NPB C++ Utilizando um Pipeline Implícito Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 37-40, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. |
Stream Parallelism Annotations for Multi-Core Frameworks Inproceedings doi XXIV Brazilian Symposium on Programming Languages (SBLP), pp. 48-55, ACM, Natal, Brazil, 2020. |
The Impact of CPU Frequency Scaling on Power Consumption of Computing Infrastructures Inproceedings doi International Conference on Computational Science and its Applications (ICCSA), pp. 142-157, Springer, Cagliari, Italy, 2020. |
Proposta de uma Suíte de Benchmarks para Processamento de Stream em Sistemas Multi-Core Inproceedings doi XX Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 167-168, Sociedade Brasileira de Computação (SBC), Santa Maria, BR, 2020. |
Parallel Stream Processing with MPI for Video Analytics and Data Visualization Inproceedings doi High Performance Computing Systems, pp. 102-116, Springer, Cham, 2020. |
Efficient NAS Parallel Benchmark Kernels with CUDA Inproceedings doi 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 9-16, IEEE, Västerås, Sweden, 2020. |
High-Level Programming Abstractions for Stream Parallelism on GPUs Masters Thesis School of Technology - PPGCC - PUCRS, 2020. |
2019 |
Raising the Parallel Abstraction Level for Streaming Analytics Applications Journal Article doi IEEE Access, 7, pp. 131944-131961, 2019. |
Simplifying and implementing service level objectives for stream parallelism Journal Article doi Journal of Supercomputing, pp. 1-26, 2019, ISSN: 0920-8542. |
Minimizing Self-Adaptation Overhead in Parallel Stream Processing for Multi-Cores Inproceedings doi Euro-Par 2019: Parallel Processing Workshops, pp. 12, Springer, Göttingen, Germany, 2019. |
Stream Processing on Multi-cores with GPUs: Parallel Programming Models' Challenges Inproceedings doi International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 834-841, IEEE, Rio de Janeiro, Brazil, 2019. |
Stream Parallelism on the LZSS Data Compression Application for Multi-Cores with GPUs Inproceedings doi 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 247-251, IEEE, Pavia, Italy, 2019. |
Should PARSEC Benchmarks be More Parametric? A Case Study with Dedup Inproceedings doi 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 217-221, IEEE, Pavia, Italy, 2019. |
Memory Performance and Bottlenecks in Multicore and GPU Architectures Inproceedings doi 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 233-236, IEEE, Pavia, Italy, 2019. |
Structured Stream Parallelism for Rust Inproceedings doi XXIII Brazilian Symposium on Programming Languages (SBLP), pp. 54-61, ACM, Salvador, Brazil, 2019. |