Comparing Lakehouses Using Microsoft Fabric Notebook

In the ever-evolving world of data management, lakehouses have emerged as a hybrid solution combining the best features of data lakes and data warehouses. OneLake, a robust platform for managing lakehouses, offers powerful tools for data comparison and analysis. This article will explore how to use a Fabric notebook to compare two lakehouses in OneLake, focusing on key aspects such as data structure and statistics, specifically for Delta tables.

Discover how to effectively compare two lakehouses in OneLake using a powerful Fabric notebook. This guide walks you through the essential steps and code snippets to analyze and contrast lakehouse data structures and statistics, ensuring you make informed decisions for your data management needs.

Download notebook from my Github.

Setting Up the Environment

Before diving into the comparison, ensure you have the necessary libraries and configurations set up in your Fabric notebook. The primary libraries used in this notebook include pyspark, delta.tables, and sempy.fabric.

Key Parameters & Functionalities

The notebook includes several key parameters that steer its functionalities. Understanding these parameters is crucial for effectively using the notebook to compare lakehouses.

workspace_name_1 and workspace_name_2: These parameters specify the names of the two workspaces containing the lakehouses you want to compare. Ensure you provide the correct names to retrieve the relevant data.
lakehouse_name_1 and lakehouse_name_2: These parameters specify the names of the two lakehouses within the specified workspaces. Accurate lakehouse names are essential for accessing the correct Delta tables.
config_param_comparison_mode: This parameter determines the mode of comparison. It can be set to:
- rowscols: Compares row and column counts.
- structure: Compares table structures.
- full: Compares both row/column counts and table structures.
config_param_shortcuts: This parameter controls whether shortcuts are included in the comparison. It can be set to:
- included: Includes shortcuts in the comparison.
- excluded: Excludes shortcuts from the comparison.
- only: Compares only shortcuts.
config_param_results_match: This parameter specifies whether to return all records or only those that do not match. It can be set to:
- all: Returns all records.
- nomatchonly: Returns only records where there is no match in row counts, column counts, or data types.
config_param_item_filter: This parameter allows you to filter the tables or shortcuts to be scanned. You can provide specific table names, multiple names separated by commas, or use regular expressions.
config_param_progress_log: This parameter controls the display of the progress log. It can be set to:
- enabled: Displays the progress log.
- disabled: Hides the progress log.

Key Functions

The notebook includes several key functions that facilitate the comparison process:

select_array: This function returns arrays based on the specified parameter (included, only, excluded). It helps in determining which tables or shortcuts to include in the comparison.
filter_lakehouse_tables: This function filters lakehouse items based on provided patterns. It allows you to specify which tables to include in the comparison using SQL-like patterns.
get_lh_table_data_stats: This function retrieves and processes data statistics for the specified lakehouse. It extracts information such as the number of rows and columns in each table, as well as the structure of the tables.

Comparing Lakehouses

The notebook calls the get_lh_table_data_stats function twice to gather data for both lakehouses. It then formats the results for comparison, focusing on key metrics such as row counts, column counts, and data types.

The results are processed to create a comprehensive comparison. This involves joining the data from both lakehouses and creating match columns to indicate whether the row counts, column counts, and data types match between the two lakehouses.

Displaying Results

Finally, the results are displayed based on the comparison mode specified. The notebook supports different comparison modes, including rowscols (comparing row and column counts), structure (comparing table structures), and full (comparing both row/column counts and table structures).

Fields for improvement/changes

displaying more than 10’000 rows is not possible by using display; instead of this you can export it to a more robust form
comparing content of tables is not handled by this script
it is possible to parallelize loading for lakehouses containing thousands of tables