2025
Faé, Leonardo; Griebler, Dalvan. Towards GPU Parallelism Abstractions in Rust: A Case Study with Linear Pipelines. In: Anais do XXIX Simpósio Brasileiro de Linguagens de Programação (SBLP'25), pp. 75-83, SBC, Recife/PE, 2025. doi: 10.5753/sblp.2025.13152

Abstract: Programming Graphics Processing Units (GPUs) for general-purpose computation remains a daunting task, often requiring specialized knowledge of low-level APIs like CUDA or OpenCL. While Rust has emerged as a modern, safe, and performant systems programming language, its adoption in the GPU computing domain is still nascent. Existing approaches often involve intricate compiler modifications or complex static analysis to adapt CPU-centric Rust code for GPU execution. This paper presents a novel high-level abstraction in Rust, leveraging procedural macros to automatically generate GPU-executable code from constrained Rust functions. Our approach simplifies the code generation process by imposing specific limitations on how these functions can be written, thereby avoiding the need for complex static analysis. We demonstrate the feasibility and effectiveness of our abstraction through a case study involving linear pipeline parallel patterns, a common structure in data-parallel applications. By transforming Rust functions annotated as source, stage, or sink in a pipeline, we enable straightforward execution on the GPU. We evaluate our abstraction's performance and programmability using two benchmark applications: sobel (image filtering) and latbol (fluid simulation), comparing it against manual OpenCL implementations. Our results indicate that, while incurring a small performance overhead in some cases, our approach significantly reduces development effort and, in certain scenarios, achieves comparable or even superior throughput to CPU-based parallelism.
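Note: the paper's abstraction is written in Rust and its macro API is not reproduced here. As a language-neutral illustration of the linear pipeline shape it targets (a source emitting items, a map-like stage, and a sink consuming results), the following is a minimal CPU-side C++ sketch using threads and a blocking channel; all names are ours, purely for illustration.

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

// Blocking channel connecting two pipeline stages; std::nullopt marks
// end-of-stream.
template <typename T>
class Channel {
    std::queue<std::optional<T>> q;
    std::mutex m;
    std::condition_variable cv;
public:
    void push(std::optional<T> v) {
        { std::lock_guard<std::mutex> l(m); q.push(std::move(v)); }
        cv.notify_one();
    }
    std::optional<T> pop() {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [this] { return !q.empty(); });
        auto v = std::move(q.front());
        q.pop();
        return v;
    }
};

int main() {
    Channel<int> a, b;
    std::thread source([&] {                 // "source": emits the stream
        for (int i = 0; i < 10; ++i) a.push(i);
        a.push(std::nullopt);
    });
    std::thread stage([&] {                  // "stage": element-wise map
        while (auto v = a.pop()) b.push(*v * *v);
        b.push(std::nullopt);
    });
    std::thread sink([&] {                   // "sink": consumes results
        while (auto v = b.pop()) std::cout << *v << '\n';
    });
    source.join(); stage.join(); sink.join();
}
```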
Löff, Júnior; Hoffmann, Renato B.; Bianchessi, Arthur S.; Mallmann, Leonardo; Griebler, Dalvan; Binder, Walter. NPB-PSTL: C++ STL Algorithms with Parallel Execution Policies in NAS Parallel Benchmarks. In: 33rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP'25), pp. 162-169, IEEE, Torino, Italy, 2025. doi: 10.1109/PDP66500.2025.00030

Abstract: The C++ language continually evolves through formal specifications established by its standards committee, proposing new features to keep C++ relevant as a programming language while improving usability, performance, and portability across platforms. With the addition of parallel Standard Template Library (STL) algorithms in C++17, programmers can now leverage parallel processing capabilities via vendor-neutral parallel execution policies. This study presents an adaptation of the NAS Parallel Benchmarks (NPB), a well-established suite of applications for evaluating parallel architectures, by porting its sequential C-style code to use C++ STL abstractions and performance-portable parallelism features. Our goals are to (1) assess the suitability of the C++ STL for scientific applications like those in the NPB and (2) compare the performance and portability of the STL algorithms' parallel execution policies across different multicore architectures (x86 and AArch64). Results indicate that the performance of parallel STL algorithms is often close to that of optimized handwritten versions (OpenMP, Intel TBB, and FastFlow) on different architectures, with some notable shortfalls. Across all NPB benchmarks, the geometric mean of the STL versions' sequential execution times is between 3.76% and 6.9% higher, while parallel executions may reach a geometric mean of up to 21.21% higher execution time.
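Note: the execution policies evaluated in the paper are part of standard C++17. Below is a minimal sketch of the two patterns behind many NPB kernels (element-wise maps and reductions); the NPB kernels themselves are more involved, and compilers typically require a backend such as Intel TBB to actually parallelize these policies.

```cpp
#include <algorithm>
#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> x(1 << 20, 1.0), y(1 << 20, 2.0);

    // Element-wise update, parallelized and vectorized where possible.
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), x.begin(),
                   [](double a, double b) { return a + 0.5 * b; });

    // Parallel reduction, the pattern behind several NPB kernels.
    double s = std::reduce(std::execution::par, x.cbegin(), x.cend(), 0.0);
    std::cout << s << '\n';
}
```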
Hoffmann, Renato B.; Faé, Leonardo G.; Griebler, Dalvan; Li, Xinliang David; Pereira, Fernando Magno Quintão. Automatic Synthesis of Specialized Hash Functions. In: Proceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization (CGO '25), pp. 317-330, Association for Computing Machinery, Las Vegas, NV, USA, 2025. ISBN: 9798400712753. doi: 10.1145/3696443.3708940

Abstract: This paper introduces a technique for synthesizing hash functions specialized to particular byte formats. This code generation method leverages three prevalent patterns: (i) fixed-length keys, (ii) keys with common subsequences, and (iii) keys ranging over predetermined sequences of bytes. Code generation involves two algorithms: one identifies relevant regular expressions within key examples, and the other generates specialized hash functions based on these expressions. Comparative analysis demonstrates that the synthetic functions outperform the general-purpose hashes in the C++ Standard Template Library and the Google Abseil Library when keys follow ascending, normal, or uniform distributions. In applications where low-mixing hashes are acceptable, the synthetic functions achieve speedups ranging from 2% to 11% on full benchmarks, and speedups of almost 50x when only hashing speed is considered.
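Note: as a hedged illustration of pattern (i) only (this is our sketch, not the paper's generated code), a hash specialized for keys known to be exactly eight bytes can replace the byte loop of a general-purpose hash with a single word load followed by a cheap mix; the mixing constants here are splitmix64-style.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Illustrative only: assumes the synthesizer has established that
// every key is exactly 8 bytes long.
uint64_t hash_fixed8(const std::string& key) {
    uint64_t w;
    std::memcpy(&w, key.data(), sizeof(w));   // one load instead of a loop
    w ^= w >> 30; w *= 0xbf58476d1ce4e5b9ULL; // splitmix64-style mixing
    w ^= w >> 27; w *= 0x94d049bb133111ebULL;
    return w ^ (w >> 31);
}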
Mencagli, Gabriele; Rymarchuk, Yuriy; Griebler, Dalvan. PPOIJ: Shared-Nothing Parallel Patterns for Efficient Online Interval Joins over Data Streams. In: Proceedings of the 19th ACM International Conference on Distributed and Event-Based Systems (DEBS'25), pp. 51-61, Association for Computing Machinery, New York, NY, USA, 2025. doi: 10.1145/3701717.3730542

Abstract: Joining data streams is a fundamental stateful operator in stream processing. It involves evaluating join pairs of tuples from two streams that meet specific user-defined criteria. This operator is typically time-consuming and often represents the major bottleneck in several real-world continuous queries. This paper focuses on a specific class of join operator, the online interval join, where we seek join pairs of tuples that occur within a certain time frame of each other. Our contribution is to propose different parallel patterns for implementing this join operator efficiently in the presence of watermarked data streams and skewed key distributions. The proposed patterns comply with the shared-nothing parallelization paradigm, a popular paradigm adopted by most existing Stream Processing Engines. Among the proposed patterns, we introduce one based on hybrid parallelism, which is particularly effective in handling various scenarios in terms of key distribution, number of keys, batching, and parallelism, as demonstrated in our experimental analysis.
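Note: to make the operator's semantics concrete, the following is a minimal single-threaded C++ sketch of an interval join (our illustration, not one of the paper's patterns): tuples join when they share a key and their timestamps lie within a fixed interval of each other. The paper's shared-nothing patterns would partition this keyed state across replicas so that no state is shared between them.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

struct Tuple { uint64_t key; int64_t ts; };

class IntervalJoin {
    int64_t interval;  // maximum timestamp distance for a match
    std::unordered_map<uint64_t, std::vector<Tuple>> state_a, state_b;

    static void probe(const Tuple& t, const std::vector<Tuple>& other,
                      int64_t interval) {
        for (const auto& u : other) {
            int64_t d = t.ts > u.ts ? t.ts - u.ts : u.ts - t.ts;
            if (d <= interval)
                std::cout << "match key=" << t.key
                          << " ts=(" << t.ts << ',' << u.ts << ")\n";
        }
    }

public:
    explicit IntervalJoin(int64_t i) : interval(i) {}
    // Insert a tuple from one stream, probing stored tuples of the other.
    // A real operator also purges tuples older than the current watermark.
    void on_a(const Tuple& t) { probe(t, state_b[t.key], interval); state_a[t.key].push_back(t); }
    void on_b(const Tuple& t) { probe(t, state_a[t.key], interval); state_b[t.key].push_back(t); }
};

int main() {
    IntervalJoin j(10);
    j.on_a({42, 100});
    j.on_b({42, 105});  // within 10 time units of the stored A tuple
    j.on_b({42, 200});  // too far away, no match
}
```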
Leonarczyk, Ricardo; Mencagli, Gabriele; Griebler, Dalvan. Self-Adaptive Micro-Batching for Low-Latency GPU-Accelerated Stream Processing. International Journal of Parallel Programming, 53 (2), pp. 14, 2025. ISSN: 0885-7458. doi: 10.1007/s10766-025-00793-4

Abstract: Stream processing is a computing paradigm enabling the continuous processing of unbounded data streams. Some classes of stream processing applications can greatly benefit from the parallel processing power and affordability offered by GPUs. However, efficient GPU utilization with stream processing applications often requires micro-batching techniques, i.e., the continuous processing of data batches to expose data parallelism opportunities and amortize host-device data transfer overheads. Micro-batching further introduces the challenge of finding suitable micro-batch sizes to maintain low-latency processing under highly dynamic workloads. The research field of self-adaptive software provides different techniques to address such a challenge. Our goal is to assess the performance of six self-adaptive algorithms in meeting latency requirements through micro-batch size adaptation. The algorithms are applied to a GPU-accelerated stream processing benchmark with a highly dynamic workload. Four of the six algorithms have already been evaluated using a smaller workload with the same application; we propose two new algorithms to address the shortcomings detected in those four. The results demonstrate that a highly dynamic workload is challenging for the evaluated algorithms, as they could not meet the strictest latency requirements for more than 38.5% of the stream data items. Overall, all algorithms performed similarly in meeting the latency requirements. However, one of our proposed algorithms met the requirements for 4% more data items than the best of the previously studied algorithms, demonstrating greater effectiveness in highly variable workloads. This effectiveness is particularly evident in segments of the workload with abrupt transitions between low- and high-latency regions, where our proposed algorithms met the requirements for 79% of the data items in those segments, compared to 33% for the best of the earlier algorithms.
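Note: as a hedged sketch of the general control loop such algorithms instantiate (this particular back-off/probe policy is our illustration, not one of the paper's six algorithms), a controller observes each batch's measured latency and adapts the next micro-batch size toward the latency requirement.

```cpp
#include <algorithm>
#include <cstddef>

// Adapts the micro-batch size: halve when the latency target is
// exceeded, otherwise probe upward by a small additive step.
class BatchController {
    std::size_t batch;   // current micro-batch size (items)
    double target_ms;    // latency requirement
public:
    BatchController(std::size_t initial, double target)
        : batch(initial), target_ms(target) {}
    std::size_t size() const { return batch; }
    // Called after each batch with its measured end-to-end latency.
    void observe(double latency_ms) {
        if (latency_ms > target_ms)
            batch = std::max<std::size_t>(1, batch / 2);  // back off fast
        else
            batch += 16;                                  // probe upward
    }
};

int main() {
    BatchController c(256, 5.0);  // start at 256 items, 5 ms target
    c.observe(7.2);               // too slow: size drops to 128
    c.observe(3.1);               // under target: size grows to 144
}
```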
Parallel Applications Modelling Group
Research Lines
Compilers and Abstractions
GPU Parallelism
Distributed Computing
Benchmarks and Parallel Applications
Team

Prof. Dr. Luiz Gustavo Leão Fernandes
General Coordinator

Prof. Dr. Dalvan Griebler
Research Coordinator