HKS: Efficient Data Partitioning for Stateful Streaming

Adeel Aslam, Giovanni Simonini, Luca Gagliardelli, Angelo Mozzillo, Sonia Bergamaschi

August, 2023

Abstract

Data partitioning among processing instances of distributed stream processing systems (DSPSs) plays a significant role in the performance of overall stream processing. Several data partitioning schemes, including round-robin and hash-based key-splitting strategies, are employed in this context. However, stateful operations introduce challenges such as data aggregation overhead and load imbalance among processing instances due to the skewed nature of real data. In this paper, we propose a partitioning strategy (HKS) that considers the popularity of the tuples on the fly and partitions them according to their frequency - higher frequent tuples are routed by employing power-of-the-two-choices, whereas low ones by using a single hash function. We perform a comprehensive experimental evaluation on synthetic and real-world data sets on well-known Apache Storm DSPS. Results demonstrate the superior performance of the HKS against state-of-the-art data partitioning schemes in terms of load imbalance and aggregation cost.

Type

Journal article

Publication

Big Data Analytics and Knowledge Discovery

Stream Processing