HKS: Efficient Data Partitioning for Stateful Streaming

Abstract

Data partitioning among processing instances of distributed stream processing systems (DSPSs) plays a significant role in the performance of overall stream processing. Several data partitioning schemes, including round-robin and hash-based key-splitting strategies, are employed in this context. However, stateful operations introduce challenges such as data aggregation overhead and load imbalance among processing instances due to the skewed nature of real data. In this paper, we propose a partitioning strategy (HKS) that considers the popularity of the tuples on the fly and partitions them according to their frequency - higher frequent tuples are routed by employing power-of-the-two-choices, whereas low ones by using a single hash function. We perform a comprehensive experimental evaluation on synthetic and real-world data sets on well-known Apache Storm DSPS. Results demonstrate the superior performance of the HKS against state-of-the-art data partitioning schemes in terms of load imbalance and aggregation cost.

Publication
Big Data Analytics and Knowledge Discovery
Angelo Mozzillo
Angelo Mozzillo
Ph.D. Student in Information and Communication Technologies

My research interests include big data management, data preparation, and machine learning for databases.