Accelerating Fresh Data Exploration with Fluid ETL Pipelines
2026-03-23 • Databases
AI summary
The authors focus on exploring fresh data that is constantly being collected, which is hard because both the data and what people want to learn from it can change quickly. Traditional methods for preparing data for analysis (ETL pipelines) can be slow and resource-hungry. To fix this, the authors propose "fluid ETL pipelines," a system that lets people start or stop data processing tasks at any time without slowing down the data flow, using spare or cheap computing power. They tested the idea on real data and found it works, but also identified new problems that need solving. The paper discusses these challenges and suggests ways to tackle them.
fresh data exploration, ETL pipeline, data preprocessing, data ingestion, query latency, Amazon Spot Instance, data parsing, data indexing, ad-hoc query processing, resource management
Authors
Maxwell Norfolk, Dong Xie
Abstract
Recently, we have seen an increasing need for fresh data exploration, where data analysts seek to explore the main characteristics of, or detect anomalies in, data being actively collected. In addition to the common challenges in classic data exploration, such as a lack of prior knowledge about the data or the analysis goal, fresh data exploration also demands an ingestion system with sufficient throughput to keep up with rapid data accumulation. However, leveraging traditional Extract-Transform-Load (ETL) pipelines to achieve low query latency can still be extremely resource-intensive, as they must conduct an excessive amount of data preprocessing routines (DPRs) (e.g., parsing and indexing) to cover unpredictable data characteristics and analysis goals. To overcome this challenge, we approach it from a different angle: leveraging occasional idle system capacity or cheap preemptive resources (e.g., Amazon Spot Instance) during ingestion. In particular, we introduce a new type of data ingestion system called fluid ETL pipelines, which allows users to start/stop arbitrary DPRs on demand without blocking data ingestion. With fluid ETL pipelines, users can start potentially useful DPRs to accelerate future exploration queries whenever idle/cheap resources are available. Moreover, users can dynamically change which DPRs to run with limited resources to adapt to their evolving interests. We conducted experiments on a real-world dataset and verified that our vision is viable. The introduction of fluid ETL pipelines also raises new challenges in handling essential tasks, such as ad-hoc query processing, DPR generation, and DPR management. In this paper, we discuss these open research challenges in detail and outline potential directions for addressing them.
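To make the core idea concrete, the sketch below illustrates (under my own assumptions, not the paper's implementation) what "start/stop arbitrary DPRs without blocking data ingestion" could look like: raw records are always appended to a raw store, while optional DPRs such as parsing can be registered or unregistered mid-stream, e.g., when idle capacity appears or a Spot Instance is preempted. The class and method names (`FluidETLPipeline`, `start_dpr`, `stop_dpr`, `ingest`) are hypothetical.

```python
import threading

class FluidETLPipeline:
    """Minimal sketch of a fluid ETL pipeline (illustrative, not the
    paper's system): ingestion always proceeds, while optional data
    preprocessing routines (DPRs) can be started or stopped at any time."""

    def __init__(self):
        self.raw_store = []            # raw records, always appended
        self.dprs = {}                 # DPR name -> callable run per record
        self.dpr_outputs = {}          # DPR name -> derived artifacts
        self._lock = threading.Lock()  # guards the DPR registry

    def start_dpr(self, name, fn):
        """Start a DPR, e.g., when idle/cheap resources become available."""
        with self._lock:
            self.dprs[name] = fn
            self.dpr_outputs.setdefault(name, [])

    def stop_dpr(self, name):
        """Stop a DPR, e.g., when a preemptive instance is reclaimed."""
        with self._lock:
            self.dprs.pop(name, None)

    def ingest(self, record):
        """Ingest a raw record; active DPRs enrich it best-effort."""
        self.raw_store.append(record)  # ingestion never blocks on DPRs
        with self._lock:
            active = list(self.dprs.items())
        for name, fn in active:
            self.dpr_outputs[name].append(fn(record))

# Usage: start a parsing DPR mid-stream, then stop it; ingestion is
# unaffected, and only records seen while the DPR was active are parsed.
pipeline = FluidETLPipeline()
pipeline.ingest("1,a")
pipeline.start_dpr("parse", lambda r: r.split(","))
pipeline.ingest("2,b")
pipeline.stop_dpr("parse")
pipeline.ingest("3,c")
```

In this toy model, stopping a DPR simply leaves its partial outputs behind; the paper's open challenges (ad-hoc query processing, DPR generation, DPR management) concern exactly how such partial, dynamically produced artifacts are tracked and used to answer queries.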