Maximizing Efficiency in Existing Data Preparation Pipelines

Abstract

Data preparation involves transforming, cleaning, and converting raw data into a usable format for further analysis, a process that can be time-consuming and resource-intensive. Typically, the data to be analyzed is stored in two-dimensional tabular structures called DataFrames, the de facto standard in data science and machine learning for storing and processing large amounts of structured data. Pandas, thanks to its comprehensive functionality, is the most widely used Python API for manipulating DataFrames. However, Pandas is single-core and non-distributed, which limits its efficiency and performance on large datasets. Several libraries have therefore been developed to extend Pandas with multi-core and distributed computing capabilities. As a result, the choice of library can significantly affect the efficiency and performance of a data preparation pipeline, and the specific requirements of the project must be considered when selecting one. In this paper, I discuss my primary contributions during the first months of my PhD program, which focused on creating an open-source framework to evaluate the performance of seven Python libraries on five datasets of different sizes. The primary objective is to identify the best combination of these libraries for building efficient data preparation pipelines, addressing the needs of users who frequently face several libraries claiming to perform similar tasks. Finally, I describe some directions for future research.
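To make the kind of pipeline under evaluation concrete, the sketch below shows a minimal data preparation step in plain Pandas, covering the cleaning, transformation, and conversion stages mentioned above. The column names and data are hypothetical; this is illustrative only and is not the benchmark framework or its datasets.

```python
import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Toy data preparation pipeline: clean, transform, convert.
    (Illustrative; columns 'category' and 'price' are hypothetical.)"""
    out = df.copy()
    # Cleaning: drop exact duplicate rows and rows with missing prices
    out = out.drop_duplicates().dropna(subset=["price"])
    # Transformation: normalize the category labels
    out["category"] = out["category"].str.strip().str.lower()
    # Conversion: cast prices from strings to a numeric type
    out["price"] = out["price"].astype(float)
    return out

raw = pd.DataFrame({
    "category": [" Books", " Books", "Toys ", "toys"],
    "price": ["10.5", "10.5", None, "3.0"],
})
clean = prepare(raw)
# One duplicate row and one row with a missing price are removed,
# leaving two cleaned rows: ("books", 10.5) and ("toys", 3.0).
```

A single-core library like Pandas executes each of these steps eagerly on one thread; the multi-core and distributed alternatives compared in the paper expose similar operations but parallelize their execution, which is why the same logical pipeline can have very different running times across libraries.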

Publication
Italian Symposium on Advanced Database Systems
Angelo Mozzillo
Ph.D. Student in Information and Communication Technologies

My research interests include big data management, data preparation, and machine learning for databases.