Articles in this design series:
- 1 - Introduction
- 2 - High-Level Design
- 3 - Trusted Data Products
- 4 - Gold Data Products
- 5 - Benefits & Drawbacks
An automotive company's ecosystem is much larger than our defined universe, but the principle is the same on a bigger scale. To make the data available on our platform, we have a few methods to access and retrieve it:
- Scheduled: Data transfers that run on a schedule (overnight, daily, weekly, etc.), as either a pull or a push.
- Event-Based: Data ingestion that runs when an event signals that the data is available.
- Streaming: Near-realtime or realtime scenarios in which large amounts of data are constantly received from multiple sources.
In our Trusted Data Products, we will package the data coming from source systems into separate Fabric Workspaces, each including its Bronze and Silver layers. This will allow us to reuse both the Bronze and Silver datasets in multiple Gold product scenarios and ensure the data is protected with proper access controls. Fabric's item-level security currently doesn't cover every scenario, so whilst creating these sources as Lakehouses in a single Workspace is possible, it would get crowded very quickly with Pipelines and Dataflows, and the access management would be a living hell. Until there's a better approach, splitting into multiple Workspaces is the cleaner option.
When ingesting data, we will first get the data into the Landing area and then load it into the Bronze tables for safekeeping. The Landing layer is transient, and the data isn't kept there for long. Bronze tables have the same structure as the data files received, but the data is stored in Delta format behind the scenes.
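The Landing-to-Bronze step can be sketched in a few lines. This is a minimal illustration in plain Python, not Fabric code: in a real Workspace this would typically be a notebook or Pipeline writing Delta tables, and here a dictionary of row lists stands in for the Bronze tables. The function name `ingest_landing_to_bronze`, the `vehicles.csv` file, and its columns are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def ingest_landing_to_bronze(landing_dir: Path, bronze: dict) -> int:
    """Append every landed CSV file to its Bronze table, then clear the file.

    `bronze` maps a table name to a list of rows and stands in for Delta tables.
    """
    loaded = 0
    for path in sorted(landing_dir.glob("*.csv")):
        with path.open(newline="") as handle:
            rows = list(csv.DictReader(handle))
        # Bronze keeps the columns exactly as received: no renames, no casts.
        bronze.setdefault(path.stem, []).extend(rows)
        # Landing is transient, so the file is removed once its rows are stored.
        path.unlink()
        loaded += len(rows)
    return loaded


# Usage with a hypothetical "vehicles" extract dropped into Landing.
with tempfile.TemporaryDirectory() as tmp:
    landing = Path(tmp)
    (landing / "vehicles.csv").write_text("vin,model\nVIN001,Hatchback\nVIN002,Estate\n")
    bronze_tables = {}
    row_count = ingest_landing_to_bronze(landing, bronze_tables)
    landing_left = list(landing.glob("*.csv"))
```

After the run, `bronze_tables["vehicles"]` holds the two rows with their original column names and string values, and the Landing folder is empty again.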
After the data is stored in the Bronze layer, we'll validate it, transform it, and store it in the Silver layer. The purpose is to keep the data modelled and normalised, allowing proper relational storage with easy discovery.
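To make the validate, transform, and store sequence concrete, here is a small sketch in plain Python. It is an assumption-laden illustration rather than the actual Silver logic: the column names (`vin`, `model`, `year`), the validation rules, and the split into `models` and `vehicles` tables are hypothetical, and lists of dictionaries stand in for the Silver Delta tables.

```python
def bronze_to_silver(bronze_rows):
    """Validate raw rows, clean them, and split them into two related tables."""
    models = {}      # model name -> surrogate key
    vehicles = []    # one row per valid vehicle, referencing a model key
    rejected = []    # rows that failed validation, kept for inspection
    for row in bronze_rows:
        vin = (row.get("vin") or "").strip().upper()
        model = (row.get("model") or "").strip()
        year = (row.get("year") or "").strip()
        # Validation (hypothetical rules): a VIN, a model, and a numeric year.
        if not vin or not model or not year.isdigit():
            rejected.append(row)
            continue
        # Normalisation: store each model once and reference it by key.
        model_id = models.setdefault(model, len(models) + 1)
        # Transformation: trimmed, upper-cased VIN and an integer year.
        vehicles.append({"vin": vin, "model_id": model_id, "year": int(year)})
    model_table = [{"model_id": key, "name": name} for name, key in models.items()]
    return {"models": model_table, "vehicles": vehicles}, rejected


# Usage with hypothetical Bronze rows (all values arrive as raw strings).
bronze_rows = [
    {"vin": " vin001 ", "model": "Hatchback", "year": "2021"},
    {"vin": "VIN002", "model": "Hatchback", "year": "2022"},
    {"vin": "", "model": "Estate", "year": "2020"},      # missing VIN
    {"vin": "VIN003", "model": "Estate", "year": "n/a"},  # non-numeric year
]
silver, rejected = bronze_to_silver(bronze_rows)
```

Here the two valid rows end up in `silver["vehicles"]` pointing at a single `Hatchback` entry in `silver["models"]`, while the two invalid rows are returned in `rejected` instead of being silently dropped.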
Let's begin by going through each of these data products.