Data Orchestration
- Author: Arunabh Bora (@arunabh223)
In traditional data warehouses, data is loaded in batches. This is commonly done during off hours, when normal transactional workloads are low. Data engineers have traditionally relied on cron jobs and shell scripts to schedule these batch loads. However, with the increasing complexity of data lakes, streaming datasets, and other data sources, managing batch load schedules has become challenging. This is where modern data orchestration comes into play.
What Is Data Orchestration?
Data orchestration refers to the automated process of managing data pipelines, workflows, and tasks across multiple systems and applications. It involves scheduling, monitoring, and optimizing the flow of data from sources to destinations with minimal human intervention. The goal is to create a consistent and reliable data pipeline that can handle large volumes of data, diverse formats, and complex dependencies. Data orchestration also helps in identifying bottlenecks and reducing delays.
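To make the idea of managing dependencies concrete, here is a minimal sketch of an orchestrator that runs each task only after its upstream tasks have finished. The task names (`extract`, `transform`, `load`) and the `run_pipeline` helper are hypothetical, chosen for illustration. A real orchestrator would add scheduling, retries, and monitoring on top of this.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, dependencies):
    """Run every task after all of its upstream tasks; return the run order."""
    order = []
    # static_order() yields a task only after all of its dependencies
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        order.append(name)
    return order


log = []
# Hypothetical tasks: each one just records that it ran
tasks = {
    "extract": lambda: log.append("extracted"),
    "transform": lambda: log.append("transformed"),
    "load": lambda: log.append("loaded"),
}
# Maps each task to the set of tasks it depends on
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

order = run_pipeline(tasks, dependencies)
print(order)
```

Declaring dependencies as data, rather than encoding them in the timing of cron entries, is what lets an orchestrator work out a valid run order by itself.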
Incremental data loading
Incremental data loading makes our data more modular and manageable. Fact tables are append-only, and dimension updates only need to scan the newest transactions instead of the entire fact table. With the incremental approach, you switch from scheduled batch loads to event-driven loads. Your updates and inserts are independent, so each batch is autonomous. If you succeed in switching, you get a near real-time analytics solution whose batches you can scale and parallelise.
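One common way to implement incremental loading is a high-water mark: remember the largest key already loaded and append only the rows above it. The sketch below assumes every source row carries a monotonically increasing `id`. The `load_incremental` helper and the in-memory lists are stand-ins for a real source and fact table.

```python
def load_incremental(source_rows, fact_table, watermark):
    """Append only rows newer than the watermark; return the new watermark."""
    new_rows = [row for row in source_rows if row["id"] > watermark]
    fact_table.extend(new_rows)  # append-only: existing rows are never rewritten
    # If nothing new arrived, the watermark stays where it was
    return max((row["id"] for row in new_rows), default=watermark)


source = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
fact = []

watermark = load_incremental(source, fact, watermark=0)  # first load picks up ids 1 and 2
source.append({"id": 3, "amount": 5})                    # a new transaction arrives
watermark = load_incremental(source, fact, watermark)    # second load picks up only id 3
```

Because each run touches only the rows above the watermark, re-running it with no new data is a no-op. That is what makes the batches safe to trigger on events rather than on a fixed schedule.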
Schema evolution
Schema evolution is the process of changing the schema of an existing database or data warehouse without affecting the data already stored. Typical changes are adding new columns, tables, or indexes, or removing old ones. Additive changes are the safest, since dropping a column also discards the data stored in it.
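As a small illustration of an additive schema change, the sketch below uses Python's built-in `sqlite3` module. The `orders` table and its columns are hypothetical. A column is added after data already exists, and the old row stays readable with the column's default value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO orders (id, amount) VALUES (1, 9.5)")

# Evolve the schema: existing rows are kept and report the default value
conn.execute("ALTER TABLE orders ADD COLUMN currency TEXT DEFAULT 'USD'")
conn.execute("INSERT INTO orders (id, amount, currency) VALUES (2, 4.0, 'EUR')")

rows = conn.execute(
    "SELECT id, amount, currency FROM orders ORDER BY id"
).fetchall()
print(rows)
```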
Optimistic concurrency
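Optimistic concurrency lets multiple users work on the same piece of data at the same time without locking it. Each writer checks at commit time whether the data has changed since it was read. If it has, the write is rejected and can be retried, so conflicting updates are detected rather than silently overwritten.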
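A minimal sketch of the idea uses a version number that every successful write increments. The `Record` class and `VersionConflict` exception are hypothetical names. The example is single-threaded for clarity, whereas a real store would have to perform the check and the update as one atomic step.

```python
class VersionConflict(Exception):
    """Raised when the data changed after the writer read it."""


class Record:
    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        return self.value, self.version

    def write(self, new_value, expected_version):
        # Commit only if nobody else has written since our read
        if expected_version != self.version:
            raise VersionConflict(
                f"expected version {expected_version}, found {self.version}"
            )
        self.value = new_value
        self.version += 1


record = Record(100)
_, version_a = record.read()  # writer A reads version 0
_, version_b = record.read()  # writer B reads version 0

record.write(110, version_a)  # A commits first; version becomes 1
try:
    record.write(90, version_b)  # B's read is now stale
    conflict = False
except VersionConflict:
    conflict = True  # B should re-read and retry
```

Neither writer blocks the other while working. The cost is that the losing writer has to re-read the record and retry.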
Data catalogs
Data catalogs are centralised stores where all the metadata about your data lives. This is important because we want to keep an overview of our data and be able to search it.
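To illustrate, a catalog can be thought of as a searchable registry of dataset descriptions. The sketch below is a toy in-memory version. The `DatasetEntry` fields and the dataset names are assumptions made for the example and do not reflect the schema of any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    name: str
    owner: str
    description: str
    tags: set = field(default_factory=set)


class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, term):
        """Return the names of datasets whose name, description, or tags match."""
        term = term.lower()
        return sorted(
            entry.name
            for entry in self._entries.values()
            if term in entry.name.lower()
            or term in entry.description.lower()
            or term in {tag.lower() for tag in entry.tags}
        )


catalog = Catalog()
catalog.register(
    DatasetEntry("orders_fact", "analytics", "Daily order transactions", {"sales"})
)
catalog.register(
    DatasetEntry("customer_dim", "crm-team", "Customer master data", {"crm"})
)

sales_hits = catalog.search("sales")
customer_hits = catalog.search("customer")
```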