Deploy Microsoft Fabric Notebooks with PySpark & Fabric API

Deploying notebooks within Microsoft Fabric environments requires careful planning and adherence to established architectural principles. One such approach, the Medallion Architecture, has proven to be a robust framework for organizing data and workflows. It emphasizes clear separation of data into distinct layers—bronze, silver, and gold—each serving a specific purpose in the data processing pipeline. Applying this architecture to the deployment of notebooks ensures consistency, scalability, and efficient collaboration across environments.

While Microsoft Fabric offers a powerful and integrated platform for data analytics and processing, certain functionalities remain limited or in preview, requiring organizations to adapt their workflows or seek alternative solutions for specific use cases. Therefore, it is necessary to create a flexible solution that enables dynamic connections to specific resources (workspaces, lakehouses, or data warehouses) without requiring manual modifications to the notebook’s content.

Understanding the Medallion Architecture in the Context of Microsoft Fabric

The Medallion Architecture is a tiered structure that helps manage data at different levels of refinement.

  1. Bronze Layer: Raw data is ingested and stored with minimal transformation. This layer preserves data fidelity and acts as a staging area for further processing.
  2. Silver Layer: Data in this layer is cleansed, deduplicated, and enriched, making it ready for analytical processing.
  3. Gold Layer: Fully refined, aggregated data resides here, serving business intelligence tools and decision-making processes.

When implementing this structure in Microsoft Fabric, notebooks play a pivotal role in orchestrating data flows and applying transformations. However, their deployment across environments—such as development, testing, and production—requires meticulous planning to maintain architectural integrity and minimize risks.

Key Considerations for Notebook Deployment

Designing a strategy for deploying notebooks involves several critical factors:

  1. Environment Segregation: Each environment should be clearly delineated to ensure that development iterations do not disrupt production operations.
  2. Version Control: Leveraging tools like Git integration in Microsoft Fabric helps maintain an audit trail of changes, facilitating collaboration and rollback capabilities.
  3. Parameterization: Configuring notebooks to accept parameters ensures adaptability to different environments without requiring hardcoded changes (see the sketch after this list).
  4. Automated Deployment Pipelines: Utilizing deployment pipelines allows seamless transitions of notebooks from development to production, minimizing manual intervention.
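
As a minimal illustration of point 3, the environment-specific values can live in a dedicated parameters cell at the top of the notebook. The sketch below is illustrative only: the variable names mirror those used later in this article, and the default values are assumptions. In Fabric, such a cell can additionally be marked as a parameters cell so that a data pipeline or an API call can override the values at run time.

# Minimal sketch of a parameters cell - defaults below are illustrative assumptions
# and are expected to be overridden per environment (e.g. by a pipeline run)

env_code = "DEV"                      # assumption: DEV / TEST / PROD

workspace_label = "WS SAP Payments"   # source workspace label (without the [ENV] suffix)
item_label = "LH_SAP_Payments"        # source lakehouse
item_table_label = "tb_payments"      # source table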

Assumptions and Current Limitations

  • a three-environment infrastructure: DEV / TEST / PROD
  • for each environment, a complete set of data is prepared, maintained and loaded, enabling a continuous development & deployment methodology
  • deployment to TEST & PROD uses the built-in deployment pipelines concept
  • each workspace contains several data stores (files, lakehouses, data warehouses) and notebooks digesting data from various sources
  • development starts in individual developers' workspaces; DEV environment code resources are linked with Git repositories & folders using native Fabric integration – this article defines the Fabric objects currently available for GitHub integration
  • our task is to create a set of notebooks in a dedicated WS_Analytics workspace, where we will perform data analytics on data coming from several sources
  • naming convention applied:

<resourcecode>_<sourcesystem>_<resourcename>_(<envcode>) – envcode for workspaces only

e.g.
WS SAP Payments [DEV] – a DEV workspace containing resources coming from SAP system about payments
LH_Bamboo_HRData – a lakehouse containing HR data from Bamboo system
DWH_Internal_RefData – a data warehouse containing internal referential data
NB_SAP_Payments – a notebook digesting & transforming SAP Payments data
PP_SAP_Payments – a data pipeline transforming SAP Payments data
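
Since every step below relies on this convention, it can help to wrap it in small helpers. The sketch below is illustrative only and assumes the formats shown in the examples above (spaces plus an [ENV] suffix for workspaces, underscores and no suffix for all other resources).

# Illustrative helpers for the naming convention above (assumption: workspaces use
# spaces and an "[ENV]" suffix, all other resources use underscores and no suffix)

def build_workspace_name(source_system: str, resource_name: str, env_code: str) -> str:
    # e.g. build_workspace_name("SAP", "Payments", "DEV") -> "WS SAP Payments [DEV]"
    return f"WS {source_system} {resource_name} [{env_code}]"

def build_item_name(resource_code: str, source_system: str, resource_name: str) -> str:
    # e.g. build_item_name("LH", "Bamboo", "HRData") -> "LH_Bamboo_HRData"
    return f"{resource_code}_{source_system}_{resource_name}"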

Currently, when you create a Fabric notebook manually, you can define the linked lakehouse or data warehouse in the left-hand menu. However, when you deploy the notebook, this selection does not switch between source environments.

To make the notebook work across multiple environments and reference the appropriate resources, we need to build the links dynamically using PySpark, the FabricRestClient and data frames. We will connect to the datasets using abfss:// links – and our notebook remains explicitly unlinked from any resource.
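
For reference, the abfss:// links built in the following steps point at OneLake and use workspace and item GUIDs rather than display names. The pattern below matches the string assembled in step 2; the GUIDs shown are placeholders, not real identifiers.

# Pattern of the OneLake abfss path assembled in step 2 (GUIDs are placeholders)
# abfss://<workspace_id>@onelake.dfs.fabric.microsoft.com/<lakehouse_id>/Tables/<table_name>
example_table_url = (
    "abfss://11111111-2222-3333-4444-555555555555"
    "@onelake.dfs.fabric.microsoft.com/"
    "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee/Tables/tb_payments"
)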

1. Retrieve the list of all workspaces using the FabricRestClient API and read the name of the current workspace (+ environment code)

Using the code below, we can load the available workspaces into a data frame. We can also extract the environment code from the current workspace name (in this case – WS Analytics [DEV]) and use it for further reference.

Based on the assumption of a common naming convention, executing this slice of code returns the DEV value and a dataframe containing all existing workspaces.

Fabric Guru blog – more about using FabricRestClient to access workspaces: https://fabric.guru/using-fabricrestclient-to-make-fabric-rest-api-calls
import sempy.fabric as fabric

# Get Current Workspace name
workspace_name = fabric.resolve_workspace_name()

# Find the position of the opening bracket defining the environment name and move one step forward
start = workspace_name.find("[") + 1
# Find the position of the closing bracket
end = workspace_name.find("]")
# Extract the env_code; when neither bracket is found, this slice equals workspace_name without its last character
env_code = workspace_name[start:end]

# For personal developer workspaces (no [ENV] suffix) - treat as DEV
if env_code == workspace_name[:-1]:
    env_code = 'DEV'

# Call the Fabric REST API and parse the workspaces response as JSON
workspace_raw = fabric.FabricRestClient().get("/v1/workspaces").json()

# Extract the list of all workspaces and save it to the workspaces_df dataframe
workspaces = workspace_raw.get("value", [])
workspaces_df = spark.createDataFrame(workspaces)

# Show content of dataframe
workspaces_df.show()

2. Find the specified workspace, lakehouse & table and build the abfss connection string

Using the snippet below, we can look for the workspace with the specified environment code, find the object of the required type (in this case – a lakehouse) and build a unique URL to a table with the defined name in this lakehouse. We can create as many cells as there are tables we want to digest in our notebook.

In this case we are again calling the FabricRestClient API, this time in the context of a specified workspace ID. As a result we get a list of all objects existing in the specified workspace. Our output is SRC_table_url, used in the next cells and dependent on the environment in which we run this snippet.

In the same way we can define the URL to the target table in a destination lakehouse – a sketch of this is shown right after the snippet below.

workspace_label = "WS SAP Payments"
item_label = "LH_SAP_Payments"
item_table_label = "tb_payments"

workspace_name = workspace_label + " [" + env_code + "]"

# Find the workspace (kept in a separate dataframe so workspaces_df stays complete for further look-ups)
workspace_row_df = workspaces_df.filter(workspaces_df["displayName"] == workspace_name)

# Collect the 'id' column
workspace_id_row = workspace_row_df.select("id").collect()

# Extract the pure id value
workspace_id = workspace_id_row[0]["id"]

# Build API call content
workspace_items_path = "v1/workspaces/"+workspace_id+"/items"

# Extract all items from a workspace
workspace_items_raw = fabric.FabricRestClient().get(workspace_items_path).json()
workspace_items = workspace_items_raw.get("value", [])
workspace_items_df = spark.createDataFrame(workspace_items)

# Find the lakehouse item in the workspace
items_df = workspace_items_df.filter((workspace_items_df["displayName"] == item_label) & (workspace_items_df["type"] == 'Lakehouse'))

# Collect the 'id' column
item_id_row = items_df.select("id").collect()

# Fail early when the lakehouse could not be found in the workspace
if not item_id_row:
    raise ValueError(f"No configuration found for workspace: {workspace_label} - {item_label} - {item_table_label}")

# Extract the pure id value
item_id = item_id_row[0]["id"]

# Build url
SRC_table_strings = ["abfss://", workspace_id, "@onelake.dfs.fabric.microsoft.com/", item_id, "/Tables/", item_table_label]
SRC_table_url = "".join(SRC_table_strings)
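
The destination URL referenced later as DST_finaltable_strings_url can be built in exactly the same way. Below is a minimal sketch: the destination workspace label and target table name are assumptions (only the LH_GOLD_InsightDWH lakehouse name appears in the step 3 comment), so adjust them to your own resources.

# Destination parameters - the workspace label and table name below are hypothetical
dst_workspace_label = "WS GOLD Insight"            # assumption: not defined in this article
dst_item_label = "LH_GOLD_InsightDWH"              # lakehouse name used in the step 3 comment
dst_table_label = "tb_payments_curated"            # assumption: hypothetical target table

dst_workspace_name = dst_workspace_label + " [" + env_code + "]"

# Resolve the destination workspace id from the full workspace list
dst_workspace_id = workspaces_df.filter(
    workspaces_df["displayName"] == dst_workspace_name
).select("id").collect()[0]["id"]

# Resolve the destination lakehouse id
dst_items_raw = fabric.FabricRestClient().get("v1/workspaces/" + dst_workspace_id + "/items").json()
dst_items_df = spark.createDataFrame(dst_items_raw.get("value", []))
dst_item_id = dst_items_df.filter(
    (dst_items_df["displayName"] == dst_item_label) & (dst_items_df["type"] == 'Lakehouse')
).select("id").collect()[0]["id"]

# Build the destination URL consumed in step 3
DST_finaltable_strings_url = "".join([
    "abfss://", dst_workspace_id, "@onelake.dfs.fabric.microsoft.com/",
    dst_item_id, "/Tables/", dst_table_label
])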

3. Load data into a data frame using PySpark and the provided URL, transform it and save it to the destination lakehouse

When we have all the URLs saved, we can start digesting our data. This is realised in the next PySpark cell using the standard data frame approach known from Databricks. We can use both standard PySpark dataframes and pandas dataframes.

Link to Databricks documentation: https://docs.databricks.com/en/getting-started/dataframes.html

 

################# STAGE 1 #################
# Load data
###########################################

# Load data from source URL to df_staging pandas dataframe
df_staging = spark.read.format("delta").load(SRC_table_url).toPandas()

# Check type
# type(df_staging)


# ... numerous transformations ...

################# STAGE x #################
# Save dataframe to Lakehouse
###########################################

# Create spark dataframe from pandas staging dataframe
df_spark = spark.createDataFrame(df_staging)

# Write dataframe to LH_GOLD_InsightDWH lakehouse
delta_path = DST_finaltable_strings_url
df_spark.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save(delta_path)
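
Note that mode("overwrite") combined with the overwriteSchema option replaces both the data and the schema of the target Delta table on every run; for incremental scenarios an append or merge strategy would be more appropriate. Also keep in mind that the pandas round-trip collects all data onto the driver, which ties into the performance point in the list below.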

Areas for improvement

  • multiple tables from the same source (LH/DWH/BLOB storage) loaded at once (by defining only their names) – see the sketch after this list
  • moving the parameterization part to a separate snippet/notebook
  • performance optimisations for large datasets (billions of rows and hundreds of columns)
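
As a starting point for the first item, the per-table URL construction from step 2 could be wrapped in a loop over table names. The sketch below is a rough outline under the assumption that all tables live in the same source lakehouse (workspace_id and item_id are resolved once, as in step 2; the table names here are hypothetical).

# Rough sketch: build URLs for several tables of one source lakehouse at once
source_tables = ["tb_payments", "tb_vendors", "tb_invoices"]   # hypothetical table names

table_urls = {
    table: "".join([
        "abfss://", workspace_id, "@onelake.dfs.fabric.microsoft.com/",
        item_id, "/Tables/", table
    ])
    for table in source_tables
}

# Load every table into its own dataframe, keyed by table name
dataframes = {table: spark.read.format("delta").load(url) for table, url in table_urls.items()}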
