Balancing the Complexity of Data Pipeline
Engineering: A Technological Landscape
Where Human Expertise Meets Large
Language Models
Dr. Rim Moussa, Eng. School of Carthage, University of Carthage
Pr. Tarek Bejaoui, Faculty of Sciences Bizerta, University of Carthage
The 11th International Symposium on Networks, Computers and Communications (ISNCC'24) @ Washington
D.C., USA
22 - 25 October 2024
Outline
1 Context and Motivations
2 Objectives and Solution
3 Data Pipeline Engineering: Use Case Infer Aircraft Flights in Crowdsourced Networks
4 Review of 5 AI assistants
5 Related Work
6 Conclusion and Future Work
Data to Insights pipelines?
● “Data pipelines are sets of processes that move and transform data from various sources
to a destination (data warehouse, data lake, data lakehouse), where new value can be
derived.” James Densmore, 2021
● Data pipelines consist of several tasks or actions that need to be executed to achieve a
desired result
● A data pipeline is represented as a Directed Acyclic Graph (DAG)
● NOT merely EXTRACT data & LOAD it INTO a data store: raw data is refined, i.e., cleaned,
structured, normalized, combined, aggregated, and sometimes anonymized.
● Companies are becoming more data-driven
● Paradigm shift in implementing data pipelines
○ code-to-data with big data frameworks
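The DAG view above can be made concrete with a minimal sketch. The task names below are hypothetical and only loosely follow the use case in this talk; the ordering relies on the Python standard library's `graphlib` module (Python 3.9+), not on any particular pipeline framework.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
pipeline = {
    "clean_logs": {"extract_logs"},
    "clean_airports": {"extract_airports"},
    "spatial_join": {"clean_logs", "clean_airports"},
    "infer_flights": {"spatial_join"},
    "load_flights": {"infer_flights"},
}

def execution_order(dag):
    """Return one valid execution order of the tasks.

    Raises graphlib.CycleError if the graph contains a cycle,
    i.e. if it is not a valid DAG.
    """
    return list(TopologicalSorter(dag).static_order())

order = execution_order(pipeline)
print(order)
```

Any order returned respects the dependencies: a task never runs before the tasks it depends on.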
Data Warehouses (80’s), Data lakes (2000), Lakehouses (2020)
“Catch-all” repositories
Armbrust et al., CIDR’2021
Get valuable insights from big data
Data Prep is time-consuming!
Source: CrowdFlower, 2015
Data to insights pipeline. Data science pipelines are often
complex with several stages, each with many participants.
One team prepares the data, sourced from heterogeneous
data sources in data lakes. Another team builds models on
the data. Finally, end users access the data and models
through interactive dashboards. The database community
needs to develop simple and efficient tools that support
building and maintaining data pipelines. Data scientists
repeatedly say that data cleaning, integration, and
transformation together consume 80%-90% of their
time. These are problems the database community has
experienced in the context of enterprise data for decades.
However, much of our past efforts focused on solving
algorithmic challenges for important “point problems,” such
as schema mapping and entity resolution. Moving forward,
we must adapt our community’s expertise in data cleaning,
integration, and transformation to aid the iterative
end-to-end development of the data-to-insights pipeline.
Source: The Seattle Database Report, 2022 [1]
Objectives
● showcase a complex data pipeline
■ optimized with human expertise
■ implemented using big data frameworks
● Review of 5 Conversational AI assistants
USE CASE: INFERRING AIRCRAFT TRIPS IN CROWDSOURCED NETWORKS
● Multiple groups promote the transformation of aviation into a cleaner,
safer, more efficient and more predictable system, such as
○ The OpenSky Network
○ The European Commission's High Level Group on Aviation Research:
European Aviation Vision 2050
○ The Next Generation Air Transportation System (NextGen) in the USA
● The OpenSky Network
○ over 6,000 sensors
○ open aircraft logs
Data Sources: flight logs
● Positional data: latitude, longitude, geoaltitude, baroaltitude
● Speed data: velocity, vertical-rate
● Dynamics data: heading
● Operational data: alert, spi-rate
● Time data: osn_ts, last-contact, …
● Aircraft data: icao24
Data Sources: airport data
● Positional data: lat_decimal, lon_decimal, altitude, …, city, country
● Airport data: id, name, iata_code, icao_code
Data pipeline
● Data sources (big data)
○ multiple sources (e.g., OpenSky Network, Airports dataset)
○ data at rest (airports) and data in motion (flight logs)
● Data cleansing
○ valid positional data, speed data, ...
● Correlate datasets using complex operations
○ build spatial indexes on batches
○ combine with spatial join
○ combine with outer join
○ prune with filtering
● Inferred flight data
○ either need no further processing,
○ or require a merge with previously inferred flight data from previous batches
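The cleansing, indexing, and join steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline presented in the talk: the grid cell size, the 10 km matching radius, the helper names, and the sample rows are all made up; only the column names (`lat`, `lon`, `icao24`, `lat_decimal`, `lon_decimal`, `id`) and the `-1` convention for an unknown airport come from the slides.

```python
import math
from collections import defaultdict

CELL_DEG = 1.0  # grid cell size in degrees (assumed tuning parameter)

def is_valid(entry):
    """Data cleansing: keep only entries with plausible coordinates."""
    lat, lon = entry.get("lat"), entry.get("lon")
    return (lat is not None and lon is not None
            and -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0)

def cell_of(lat, lon):
    """Map a coordinate to its grid cell."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))

def build_index(airports):
    """Spatial index: bucket airports by grid cell."""
    index = defaultdict(list)
    for a in airports:
        index[cell_of(a["lat_decimal"], a["lon_decimal"])].append(a)
    return index

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lon2 - lon1)
    h = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def nearest_airport(entry, index, max_km=10.0):
    """Spatial join probe: inspect only the entry's cell and its 8 neighbours.

    Returns -1 when no airport lies within max_km, so unmatched log entries
    are kept (outer-join behaviour). Longitude wrap-around at +/-180 degrees
    and polar regions are not handled in this sketch.
    """
    ci, cj = cell_of(entry["lat"], entry["lon"])
    best_id, best_d = -1, max_km
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            for a in index.get((ci + di, cj + dj), ()):
                d = haversine_km(entry["lat"], entry["lon"],
                                 a["lat_decimal"], a["lon_decimal"])
                if d <= best_d:
                    best_id, best_d = a["id"], d
    return best_id

# Made-up sample rows, for illustration only.
airports = [
    {"id": 1, "lat_decimal": 36.851, "lon_decimal": 10.227},
    {"id": 2, "lat_decimal": 48.110, "lon_decimal": 16.570},
]
logs = [
    {"icao24": "abc123", "lat": 36.850, "lon": 10.230},
    {"icao24": "abc123", "lat": 40.000, "lon": 12.000},
    {"icao24": "abc123", "lat": None, "lon": 12.000},  # dropped by cleansing
]
index = build_index(airports)
joined = [(e["icao24"], nearest_airport(e, index)) for e in logs if is_valid(e)]
print(joined)
```

The point of the index is that each log entry is compared with the airports in at most nine grid cells instead of with every airport.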
Flight
● aircraft identifier
● departure airport
● destination airport
● departure timestamp
● arrival timestamp
● trajectory data
● speed data
● operational data
● dynamics data
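One possible shape for such a flight record is sketched below. The field names and types are assumptions chosen to mirror the attributes listed above; the slides do not specify a schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Flight:
    """Hypothetical record for one inferred flight."""
    icao24: str                      # aircraft identifier
    departure_airport: int           # airport id, -1 if unknown
    destination_airport: int         # airport id, -1 if unknown
    departure_ts: int                # first recorded timestamp (seconds)
    arrival_ts: int                  # last recorded timestamp (seconds)
    # (latitude, longitude, geoaltitude) points along the trajectory
    trajectory: List[Tuple[float, float, float]] = field(default_factory=list)
    velocities: List[float] = field(default_factory=list)   # speed data
    headings: List[float] = field(default_factory=list)     # dynamics data
    alerts: List[bool] = field(default_factory=list)        # operational data

    @property
    def duration(self) -> int:
        """Duration as last recorded timestamp minus first recorded timestamp."""
        return self.arrival_ts - self.departure_ts

# Made-up values, for illustration only.
example = Flight("abc123", 1, -1, 1_700_000_000, 1_700_003_600)
print(example.duration)
```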
Conversational AI assistants
● ChatGPT (OpenAI)
● Llama-3 (Meta)
● QWEN2 (Alibaba)
● Gemma2 (Google)
● Mistral-Nemo (Mistral AI and Nvidia)
Prompt #1
Attached two csv files in google drive link [...], in dropbox link [...]
The first file "airports.csv" is a dataset of airports. Each airport is identified by an id, is located in a country, and
each airport is located in a 3d reference system given its decimal longitude ('lon_decimal' column), decimal latitude
('lat_decimal' column), and its altitude ('altitude' column).
The second file "logs.csv" is an extract of logs captured by the open sky network during one day. Each entry denotes the
position of an aircraft, identified by the column 'icao24', in a 3d reference system given latitude ('lat' column),
longitude ('lon' column) and 'geoaltitude' column.
We want to infer the flight(s) details performed by each aircraft, determine the departure airport (takeoff event),
the arrival airport (landing event), the first recorded timestamp, the last recorded timestamp, the duration calculated
as last recorded timestamp minus first recorded timestamp.
Notice that for some flights, the departure airport and/or the arrival airport are unknown, consequently we could only
extract a part of the trajectory. There are four types of inferred flights:
_type 0: a flight such that the departure airport is unknown, and the arrival airport is known
_type 1: a flight such that both departure and arrival airports are known
_type 2: a flight such that the departure airport is known, and the arrival airport is unknown
_type 3: a flight such that both departure and arrival airports are unknown.
-1 denotes an unknown airport either for departure or arrival.
Could you propose a solution using [....], incorporating the inferred flights derived from the shared datasets?
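The flight typing described in the prompt can be sketched as a small function. The function names are hypothetical. Type 3 is treated here as the remaining combination, both airports unknown; that reading is an assumption.

```python
UNKNOWN = -1  # the prompt's marker for an unknown airport

def flight_type(departure_airport, arrival_airport):
    """Classify an inferred flight by which of its endpoints are known."""
    dep_known = departure_airport != UNKNOWN
    arr_known = arrival_airport != UNKNOWN
    if dep_known and arr_known:
        return 1  # both airports known
    if arr_known:
        return 0  # departure unknown, arrival known
    if dep_known:
        return 2  # departure known, arrival unknown
    return 3      # assumption: neither airport known

def duration(first_ts, last_ts):
    """Duration as last recorded timestamp minus first recorded timestamp."""
    return last_ts - first_ts

print(flight_type(UNKNOWN, 42), flight_type(7, 42), flight_type(7, UNKNOWN))
```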
Prompt #2
Attached is a PSV file containing the flight trajectories of an aircraft. Each trajectory is represented in WKT format.
Could you visualize these trajectories in a 3D reference system and on a map using Folium, and then share the
resulting plots online?
Review of our interactions
1 Communication and Data Access
● support access to public cloud storage services like Google Drive, Dropbox, and
GitHub to upload files,
● accept voice prompts and text prompts
2 Clarifications
● ask for clarifications before providing a response,
● or automatically generate a response based on their own assumptions.
3 Results
● the code snippets may be presented in stages or as a single script, with or without explanations
● some assistants can generate and run the code on their cloud resources, providing output plots
or other results.
● If no results are delivered, the engine may explain the need for further refinement :(
● store previous prompts and answers
○ e.g. ChatGPT: today, yesterday, previous 7 days, previous 30 days, September, August,
…, January, 2023, ...
4 Feedback
● propose multiple solutions, and ask the user to test the solutions, and select the
most appropriate one,
● ask to rate a given solution.
5 Recommendations
● refine the resulting code for more accurate and robust flight inference, optimize
performance,
● consider using a more advanced LLM release,
● or caution the user against using the code as-is.
6 Optimizations
● The assistants' responses generally do not implement or recommend optimizations such as:
○ Indexing geospatial data before joining datasets;
○ Filtering log entries based on predicates (e.g., aircraft altitude is close to
airport altitude) rather than computing a cross product with all airports;
○ Handling cases of multiple flights performed by the same aircraft on the same day.
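The predicate filter and the multiple-flights case can be illustrated with a short sketch. The 300 m altitude tolerance and the 30-minute gap threshold are arbitrary assumptions, as are the function names and the premise that both altitudes use the same unit; `geoaltitude` and `osn_ts` are the log columns named earlier.

```python
def near_ground(entry, airport_altitude, tolerance=300.0):
    """Predicate filter: keep a log entry as a takeoff/landing candidate only
    if the aircraft altitude is close to the airport altitude
    (same unit assumed for both)."""
    alt = entry.get("geoaltitude")
    return alt is not None and abs(alt - airport_altitude) <= tolerance

def split_flights(entries, max_gap_s=1800):
    """Split one aircraft's log entries into separate flights.

    Entries are ordered by timestamp; a new flight starts whenever the gap
    to the previous entry exceeds max_gap_s seconds.
    """
    flights = []
    for e in sorted(entries, key=lambda e: e["osn_ts"]):
        if flights and e["osn_ts"] - flights[-1][-1]["osn_ts"] <= max_gap_s:
            flights[-1].append(e)
        else:
            flights.append([e])
    return flights

# Made-up entries for a single aircraft: two bursts separated by a long gap.
entries = [{"osn_ts": t} for t in (0, 60, 120, 10_000, 10_060)]
segments = split_flights(entries)
print([len(s) for s in segments])
```

Applying the altitude predicate first shrinks the set of log entries that need to be matched against airports at all, which is where most of the join cost would otherwise go.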
Related Work
● Description Languages [3]
● Data Quality [4]–[5]
● Frameworks: Apache Airflow, Dagster, ...
● Implementation technologies
○ Apache Hadoop (Pig Latin), Apache Spark, Nvidia RAPIDS NVTabular, …
● AutoETL [10]: generates pre-processing pipelines.
● Auto-Pipeline [11]: synthesizes pipelines using deep reinforcement learning.
● Pipemizer [13]: improves the performance of queries or jobs in pipelines at
Microsoft.
● LLMs: [17] aims to combine human expertise with LLM-driven automation, and [18] aims
to achieve a favorable cost-optimization balance in data pipeline engineering.
● Benchmarking
○ keep the pipeline cost-effective, and manage resources such as storage,
compute power, and network bandwidth.
Conclusion and Future Work
● Design and implementation of a complex data pipeline related
to air traffic
● Review of 5 Conversational AI assistants
● Work perspectives
○ How to train an LLM to address complex data pipelines, considering
broad domain applications and computation and storage optimizations
○ Use the inferred data for analytical purposes and for benchmarking
OLAP/ML models
■ Analysis of aircraft trajectories
■ Fuel savings and reduction of CO2 emissions
References
[1] D. Abadi, A. Ailamaki, D. G. Andersen, P. Bailis, M. Balazinska, P. A. Bernstein, P. A. Boncz, S. Chaudhuri, A. Cheung, A. Doan, L.
Dong, M. J. Franklin, J. Freire, A. Y. Halevy, J. M. Hellerstein, S. Idreos, D. Kossmann, T. Kraska, S. Krishnamurthy, V. Markl, S.
Melnik, T. Milo, C. Mohan, T. Neumann, B. C. Ooi, F. Ozcan, J. M. Patel, A. Pavlo, R. A. Popa, R. Ramakrishnan, C. Ré, M.
Stonebraker, and D. Suciu, “The seattle report on database research,” Commun. ACM, vol. 65, no. 8, pp. 72–79, 2022.
[2] M. Schäfer, V. Lenders, and I. Martinovic, “OpenSky Network: Open Air Traffic Data for Research,”
https://opensky-network.org/, online; accessed 10 August 2024.
[3] C. Nielsen, Z. Su, and G. Indiveri, “Yak: An asynchronous bundled data pipeline description language,” in 28th IEEE International
Symposium on Asynchronous Circuits and Systems, ASYNC 2023, Beijing, China, July 16-19, 2023. IEEE, 2023, pp. 34–41.
[4] H. Foidl, V. Golendukhina, R. Ramler, and M. Felderer, “Data pipeline quality: Influencing factors, root causes of data-related
issues, and processing problem areas for developers,” J. Syst. Softw., vol. 207, p. 111855, 2024.
[5] F. J. de Haro-Olmo, Á. Valencia-Parra, Á. J. Varela-Vaca, J. A. Álvarez-Bermejo, and M. T. Gómez-López, “ELI: an iot-aware big
data pipeline with data curation and data quality,” PeerJ Comput. Sci., vol. 9, p. e1605, 2023. [Online]. Available:
https://doi.org/10.7717/peerj-cs.1605
[6] P. Maymounkov, “Koji: Automating pipelines with mixed-semantics data sources,” CoRR, vol. abs/1901.01908, 2019. [Online].
Available: http://arxiv.org/abs/1901.01908
[7] S. Redyuk, Z. Kaoudi, S. Schelter, and V. Markl, “DORIAN in action: Assisted design of data science pipelines,” Proc. VLDB
Endow., vol. 15, no. 12, pp. 3714–3717, 2022.
[8] G. Vargas-Solar, K. Belhajjame, J. Espinosa-Oviedo, S. Negrete-Yankelevich, and J. Zechinelli-Martini, “MATILDA: inclusive
data science pipelines design through computational creativity,” in Proceedings of the Workshops of the EDBT/ICDT Joint
Conference, vol. 3651, 2024. [Online]. Available: https://ceur-ws.org/Vol-3651/DARLI-AP-11.pdf
[9] Z. Liu, T. Hoang, J. Zhang, M. Zhu, T. Lan, S. Kokane, J. Tan, W. Yao, Z. Liu, Y. Feng, R. Murthy, L. Yang, S. Savarese, J. C. Niebles,
H. Wang, S. Heinecke, and C. Xiong, “Apigen: Automated pipeline for generating verifiable and diverse function-calling datasets,”
CoRR, vol. abs/2406.18518, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2406.18518
[10] J. Giovanelli, B. Bilalli, and A. Abelló, “Data preprocessing pipeline generation for autoetl,” Inf. Syst., vol. 108, p. 101957,
2022.
[11] J. Yang, Y. He, and S. Chaudhuri, “Autopipeline: Synthesize data pipelines by-target using reinforcement learning and
search,” CoRR, vol. abs/2106.13861, 2021. [Online]. Available: https://arxiv.org/abs/2106.13861
[12] Z. Miao, “Simplifying human-in-the-loop data science pipeline: Explanations, debugging, and data preparation,” Ph.D.
dissertation, Duke University, Durham, NC, USA, 2022. [Online]. Available: https://hdl.handle.net/10161/26796
[13] S. Gakhar, J. Cahoon, W. Le, X. Li, K. Ravichandran, H. Patel, M. T. Friedman, B. Haynes, S. Qiao, A. Jindal, and J. Leeka,
“Pipemizer: An optimizer for analytics data pipelines,” Proc. VLDB Endow., vol. 15, no. 12, pp. 3710–3713, 2022.
[14] M. Dareck, C. Edelstenne, T. Enders, E. Fernandez, J.-P. Herteman, M. Kerkloh, I. King, P. Ky, M. Mathieu, G. Orsi, G.
Schotman, C. Smith, and J.-D. Worner, “FlightPath 2050: Europe’s Vision for Aviation - Maintaining Global Leadership and Serving
Society’s Needs,” http://www.sesarju.eu/ , 2010, online; accessed 10 August 2024.
[15] European Union and EuroControl and SESAR, “The DART Project: Data-Driven Aircraft Trajectory Prediction Research,”
http://dart-research.eu/ , online; accessed 10 August 2024.
[16] US NextGen, “Modernization of United States Airspace,” https://www.faa.gov/nextgen/ , 2019, online; accessed 10 August
2024.
[17] A. Remadi, K. E. Hage, Y. Hobeika, and F. Bugiotti, “To prompt or not to prompt: Navigating the use of large language models for
integrating and modeling heterogeneous data,” Data Knowl. Eng., vol. 152, p. 102313, 2024.
[18] S. Arora, B. Yang, S. Eyuboglu, A. Narayan, A. Hojel, I. Trummer, and C. Ré, “Language models enable simple systems for
generating structured views of heterogeneous data lakes,” Proc. VLDB Endow., vol. 17, no. 2, pp. 92–105, 2023.
Thank you for your Attention
Q&A
