Fabric brings you many different technologies and concepts to mix and match so that you can find the most optimised solution for your data requirements. One of those concepts is the Lakehouse, and we'll be looking into how we can build a proper one in Fabric using the Medallion Architecture.
As we've covered "What is Lakehouse?" in a previous article and the Medallion Architecture in the "Designing Fabric Workspaces" post, I won't go into too much detail here. But to recap, the Medallion Architecture rests on splitting your data into multiple layers, each with its own responsibility. For further reading, see Databricks' Medallion Architecture article.
This article demonstrates the best practice for implementing the Medallion Architecture in Fabric. Although we'll walk through the steps, we won't build the actual pipelines here. Instead, I'll show you how everything connects to the Lakehouse in four steps:
- Step 1: Designing the Lake
- Step 2: Establishing the Tables
- Step 3: Building the Pipelines
- Step 4: Putting it all together
This article covers the overnight data-pulling scenario with Pipelines and Dataflows; future articles will cover streaming datasets and connecting to other Azure data resources.
Step 1: Designing the Lake
Before going into what we're going to use to process the data, let's define our zones/layers in our OneLake:
- Landing: A layer for incoming data to arrive, ready to be picked up by our Lakehouse ingestion process. Data is kept in the original file format, with a folder structure reflecting arrival metadata. The data is kept here temporarily and deleted after the Lakehouse ingests it.
- Bronze/Raw: A layer where incoming data is kept and archived for later access. The data is stored as it arrives; the only changes are that it is converted to Delta format and organised in a hierarchy that makes it easy to find (most commonly by date of arrival and data type).
- Silver/Trusted: Raw data is translated into a more standardised format. You can split a single raw file into multiple files/tables to create a normalised relationship, or you can combine several raw files into a single table.
- Gold/Curated: For business-level aggregations and analytics.
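To make the layer layout above concrete, here is a minimal sketch of how folder paths for each zone could be derived. The folder names (`landing`, `bronze`, `silver`, `gold`), the `layer_path` helper, and the year/month/day partitioning are all assumptions for illustration; the only things taken from the description above are the four layers and the idea that Landing and Bronze are organised by data type and date of arrival.

```python
from datetime import date
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical folder names for the four zones described above.
LAYERS = ("landing", "bronze", "silver", "gold")

# Layers assumed to be organised by date of arrival.
DATE_PARTITIONED = ("landing", "bronze")


def layer_path(layer: str, data_type: str, arrival: Optional[date] = None) -> PurePosixPath:
    """Build a relative folder path for a data type within a layer.

    Landing and Bronze paths get a year/month/day suffix when an
    arrival date is supplied; Silver and Gold paths ignore the date.
    """
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    path = PurePosixPath(layer) / data_type
    if arrival is not None and layer in DATE_PARTITIONED:
        path = path / f"{arrival:%Y}" / f"{arrival:%m}" / f"{arrival:%d}"
    return path


if __name__ == "__main__":
    arrived = date(2024, 1, 5)
    for layer in LAYERS:
        print(layer_path(layer, "sales", arrived))
```

Keeping the path logic in one small function means the ingestion steps can all agree on where a given data type lives, whichever tool ends up writing the files.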