A common question about Azure Synapse Analytics is how to handle delta partitioned tables. Delta (or Delta Lake) tables are one of the technologies underlying the Lakehouse architecture (see also Iceberg and Hudi). It is a storage technology that enables the separation of storage and compute and provides transactional consistency. I'm being deliberately simplistic in my definition of the delta format here for brevity; you can read more here.
To set the scene for this blog post, we'll take a look at how Synapse handles delta partitioned tables when using both Lake databases and serverless SQL pools databases. But first... a word of warning.
Warning for delta tables in serverless SQL pools
I get questions about delta tables in serverless SQL pools databases quite often, and unfortunately they are only partially supported (hence the confusion). Although you can create delta tables in serverless SQL pools over both partitioned and non-partitioned folders, only non-partitioned folders are supported. But as we'll see in this blog, you can still create a delta table over a partitioned folder... and get unwanted results. See the documentation here.
I'll show that when you create partitioned external delta tables in Spark, the table definition is synchronized to serverless SQL pools and partition pruning works fine when querying with serverless SQL pools. If you create a partitioned external delta table directly in a serverless SQL pools database, partition pruning will not work, although the data can still be selected. This is dangerous, as all the data will be scanned and charged to your Azure bill.
We have 2 types of SQL database that can be used to overlay and work with data in the data lake: Lake databases and serverless SQL pools databases. I'll be honest, I'm not a fan of multiple database types... why can't it just be a single database type? Well, it comes down to the technology underlying each type of database.
- Serverless SQL pools database – uses the Polaris engine developed by Microsoft; the language is SQL
- Lake database – uses the Spark engine, with language support for Scala, Python, SQL, etc.
I won't go into too much detail here about the metadata sync feature in Synapse (more here), but it is important to address it briefly. It's a one-way process whereby external tables created in a Lake database are made available to the serverless SQL pools service. The result is that you can work with tables in a Lake database using Spark, e.g. data engineering processes that insert/update/delete data in the data lake, change schema information, etc., but then use serverless SQL pools to query the data via the synchronized external tables. This is becoming a popular pattern because the pricing model for serverless SQL pools is based on the amount of data processed and has nothing to do with cluster size or uptime. For example, you can load Power BI data models from serverless SQL pools and only pay for the data read (processed), without worrying about starting up or shutting down Spark clusters.
Delta Partitioned Tables
Let's get into the problem. We can work with delta tables in both Lake databases and serverless SQL pools databases; however, serverless SQL pools are currently missing some delta features, leading to confusion about what works and what doesn't. Although it is technically possible to create delta tables in serverless SQL pools, they don't work as expected. This is actually noted in the Microsoft documentation.
In the simple diagram below, we see a data processing engine targeting a specific folder in the data lake; this is partition pruning.
These are the relevant scenarios when creating external tables over partitioned delta data.

| Scenario | Result |
| --- | --- |
| Partitioned external delta table created in a Lake database | Can be queried, including partition pruning |
| Partitioned external delta table created in a serverless SQL pools database | Can be queried, but the partition columns return NULL, so partition pruning doesn't work |
| Partitioned external delta table created in a Lake database and automatically synchronized to serverless SQL pools | Can be queried from serverless SQL pools, including partition pruning |
Why do we care about partitioning?
In short, we care about partitioning because it can improve read and write performance, and in the case of serverless SQL pools it can reduce costs, since only the relevant partitions (folders) are scanned when retrieving data. There are also costs associated with running Spark clusters, so if you can run smaller clusters for shorter periods of time, all the better.
That said, partitioning is not always a good option: the benefit of splitting data into folders can turn into a performance hit, as the data processing engine has to traverse the folders and scan/read many individual files. As always, plan ahead when it comes to partitioning.
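To make the idea concrete, here is a minimal sketch in plain Python (not Synapse code; the folder names are made up for illustration) of how an engine can use Hive-style partition folder names to skip irrelevant data without reading a single file:

```python
# Hypothetical partition folders in the data lake: with Hive-style
# partitioning, each folder name encodes a partition column value.
folders = [
    "EventYear=2022/EventMonth=01/EventDate=2022-01-15",
    "EventYear=2022/EventMonth=02/EventDate=2022-02-20",
    "EventYear=2022/EventMonth=02/EventDate=2022-02-21",
]

def prune(folders, column, value):
    """Keep only the folders whose path encodes column=value.

    A partition-aware engine does this from folder metadata before
    reading any data files, which is why filtering on a partition
    column is so cheap.
    """
    return [f for f in folders if f"{column}={value}" in f.split("/")]

# A query filtering on EventDate = '2022-02-20' only touches one folder:
print(prune(folders, "EventDate", "2022-02-20"))
# ['EventYear=2022/EventMonth=02/EventDate=2022-02-20']
```

Without pruning, all three folders (and every file inside them) would be scanned to answer the same query.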
Step by Step
Let's review the partitioning scenarios above and do the following:
- Create an external table in a Lake database over the partitioned delta data and query it
- Create an external table in a serverless SQL pools database over the partitioned delta data and query it
- Create an external table in a Lake database over the partitioned delta data and query it with serverless SQL pools
Check partition pruning
I use Log Analytics to verify which folders in the Azure storage account are scanned when the queries are executed. There is a blog here detailing how to configure that.
Create an external table in a Lake database
We will create a delta table with the following schema:
- UserID
- EventType
- ProductID
- EventDateNoPart (date column, not in the partition scheme)
- EventYear (partition column)
- EventMonth (partition column)
- EventDate (partition column)
The following code runs in a Synapse notebook connected to a Spark pool. With a dataframe called dftwo already loaded with the source data, we can run the following code, which saves the dataframe to a data lake folder in delta format. We then run a SQL statement to create a table over the delta data, and finally run a SELECT statement against the delta table.
```python
%%pyspark
dftwo.filter("EventDate IS NOT NULL") \
    .write.format("delta") \
    .partitionBy("EventYear", "EventMonth", "EventDate") \
    .save("abfss://<container>@<storageaccount>.dfs.core.windows.net/cleansed/webtelemetrydelta")
```
```sql
%%sql
CREATE DATABASE SparkDelta;
CREATE TABLE IF NOT EXISTS SparkDelta.webtelemetrydelta
USING DELTA
LOCATION 'abfss://<container>@<storageaccount>.dfs.core.windows.net/cleansed/webtelemetrydelta';
```
```sql
%%sql
SELECT EventType, COUNT(*) AS TotalEventCount
FROM SparkDelta.webtelemetrydelta
WHERE EventDate = '2022-02-20'
GROUP BY EventType
```
In the results (right image) we can see an aggregated total of the events for February 20, 2022. Now we can drill into Log Analytics to see which folders in the data lake were scanned when the Spark pool executed the query.
We expect only the /EventYear=2022/EventMonth=02/EventDate=2022-02-20/ folder to be scanned to get the results.
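As a quick sketch of why that is the expected folder (plain Python, not Spark internals): partitionBy writes each row under a relative path built from its partition column values, one "column=value" segment per partition column, in order.

```python
def partition_path(row, partition_cols):
    # Build the Hive-style relative folder path that Spark's partitionBy
    # produces: one "column=value" segment per partition column, in order.
    return "/".join(f"{col}={row[col]}" for col in partition_cols)

# A hypothetical telemetry row with the partition columns from our schema.
row = {"UserID": 17, "EventYear": 2022, "EventMonth": "02", "EventDate": "2022-02-20"}

print(partition_path(row, ["EventYear", "EventMonth", "EventDate"]))
# EventYear=2022/EventMonth=02/EventDate=2022-02-20
```

A filter on EventDate = '2022-02-20' therefore maps directly onto that one folder.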
Check Log Analytics
We can now check Log Analytics to see which folders were scanned while the above query was running. The following Kusto Query Language (KQL) query selects all Azure storage account read events that target the root folder, which will show all the partition folders that were scanned. It's a simple query; I'm still finding my way around KQL.
```kusto
StorageBlobLogs
| where TimeGenerated between (datetime(2023-01-29 12:00:00) .. datetime(2023-01-30 21:00:00))
    and ObjectKey contains "/dhstordatalakeuk/datalakehouseuk/cleansed/webtelemetrydelta"
    and OperationName == "ReadFile"
| summarize count() by TimeGenerated, replace_string(ObjectKey, '/dhstordatalakeuk/datalakehouseuk/cleansed/webtelemetrydelta', '')
| sort by TimeGenerated desc
```
The results show that only the /EventDate=2022-02-20 folder was scanned, so partition pruning was successful.
Create an external table in serverless SQL pools
We'll create the external table over the same delta location as in the previous scenario. The following code runs against serverless SQL pools: it creates a new database, sets it up to allow connectivity to the data lake, and then creates the external table over the delta folder.
```sql
--create a new serverless SQL pools database
CREATE DATABASE SQLDelta;

--switch to the new database
USE SQLDelta;

--create a schema for our objects
CREATE SCHEMA LDW AUTHORIZATION dbo;

--encryption to allow authentication
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<complex_password>';

--create a credential using Managed Identity
CREATE DATABASE SCOPED CREDENTIAL DataLakeManagedIdentity
WITH IDENTITY = 'Managed Identity';

--create a data source to use in queries
CREATE EXTERNAL DATA SOURCE ExternalDataSourceDataLakeUKMI
WITH (
    LOCATION = 'https://<datalake>.dfs.core.windows.net/datalakehouseuk',
    CREDENTIAL = DataLakeManagedIdentity
);

--create a Delta file format
CREATE EXTERNAL FILE FORMAT DeltaFormat
WITH (FORMAT_TYPE = DELTA);

--create the external table
CREATE EXTERNAL TABLE WebTelemetryDelta
(
    UserID varchar(20),
    EventType varchar(100),
    ProductID varchar(100),
    URL varchar(100),
    Device varchar(50),
    SessionViewSeconds int,
    EventDateNoPart date,
    EventYear int,
    EventMonth int,
    EventDate date
)
WITH (
    LOCATION = 'cleansed/webtelemetrydelta',
    DATA_SOURCE = ExternalDataSourceDataLakeUKMI,
    FILE_FORMAT = DeltaFormat
);
GO
```
Although the above SQL is technically possible and runs without errors, querying the external table with serverless SQL pools does not behave as expected.
```sql
SELECT EventType, COUNT(*) AS TotalEventCount
FROM WebTelemetryDelta
WHERE EventDate = '2022-02-20'
GROUP BY EventType
```
As you can see, we didn't get any results from the above SQL query.
If we run a SELECT that returns all column values, we can see that the partition columns are all returned as NULL. So while we can query the table and it does return the values stored in the data files, it doesn't recognize/parse the partition columns...
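One way to picture what is going on (a simplified model in plain Python, not the actual serverless SQL pools internals): with Hive-style partitioning, the partition column values live in the folder names, not inside the data files themselves. An engine that doesn't derive those values from the file path has nothing to put in the partition columns, so they come back NULL:

```python
# Columns physically stored inside a data file under a partition folder.
# Note the partition columns are NOT stored in the file itself.
file_columns = {"UserID": 17, "EventType": "click", "ProductID": 3}

# Hypothetical full path of that data file in the lake.
path = "cleansed/webtelemetrydelta/EventYear=2022/EventMonth=02/EventDate=2022-02-20/part-0000.parquet"

def read_row(path, file_columns, derive_from_path):
    row = dict(file_columns)
    for segment in path.split("/"):
        if "=" in segment:
            col, value = segment.split("=", 1)
            # A partition-aware engine fills the column from the folder
            # name; an engine that ignores the path leaves it NULL (None).
            row[col] = value if derive_from_path else None
    return row

print(read_row(path, file_columns, derive_from_path=True)["EventDate"])   # 2022-02-20
print(read_row(path, file_columns, derive_from_path=False)["EventDate"])  # None
```

This also explains why the WHERE clause on EventDate returned no rows: the filter is comparing against NULL for every row.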
We can also see in Log Analytics that every folder below the delta root was scanned. This is potentially dangerous, as the entire delta folder is read, and with serverless SQL pools you are charged for all of that scanned data.
Create an external table from the Lake database and query it using serverless SQL pools.
Now let's create another external table (we could reuse the table created in the first scenario, but let's keep things separate). First we load a new folder with the partitioned delta data, then we create an external table over the delta folder. We should then be able to query this external table from serverless SQL pools. It usually takes a few seconds for the metadata to be synced and become visible in serverless SQL pools, although sometimes I've waited a few minutes.
You can verify that an external table created in a Spark pool has been successfully synced by looking at the sys.external_tables system view in the serverless SQL pools database and checking that the table exists.
```python
%%pyspark
dftwo.filter("EventDate IS NOT NULL") \
    .write.format("delta") \
    .partitionBy("EventYear", "EventMonth", "EventDate") \
    .save("abfss://<container>@<storageaccount>.dfs.core.windows.net/cleansed/webtelemetrydeltatwo")
```
```sql
%%sql
CREATE TABLE IF NOT EXISTS SparkDelta.webtelemetrydeltatwo
USING DELTA
LOCATION 'abfss://<container>@<storageaccount>.dfs.core.windows.net/cleansed/webtelemetrydeltatwo';
```
Now we run the following SQL against serverless SQL pools.
```sql
SELECT EventType, COUNT(*) AS TotalEventCount
FROM webtelemetrydeltatwo
WHERE EventDate = '2022-02-20'
GROUP BY EventType
```
Check Log Analytics
We can see that only the /EventYear=2022/EventMonth=02/EventDate=2022-02-20 folder was scanned.
Alternatives to Serverless SQL Pools for Delta
As an alternative, we can create a view in a serverless SQL pools database over a delta folder in the data lake. This respects the partition scheme: the partition columns work correctly and return values.
```sql
CREATE VIEW LDW.vwWebTelemetryDelta
AS
SELECT UserID,
       EventType,
       ProductID,
       [URL] AS ProductURL,
       Device,
       SessionViewSeconds,
       EventYear,
       EventMonth,
       EventDate
FROM OPENROWSET(
    BULK 'cleansed/webtelemetrydeltatwo',
    DATA_SOURCE = 'ExternalDataSourceDataLakeUKMI',
    FORMAT = 'DELTA'
)
WITH (
    UserID INT,
    EventType VARCHAR(20),
    ProductID SMALLINT,
    URL VARCHAR(25),
    Device VARCHAR(10),
    SessionViewSeconds INT,
    EventYear SMALLINT,
    EventMonth TINYINT,
    EventDate DATE
) AS fct
```
We can then run an aggregate query with a filter, and the partitioning scheme will be recognized and honoured.
```sql
SELECT EventType, COUNT(*) AS TotalEventCount
FROM LDW.vwWebTelemetryDelta
WHERE EventDate = '2022-02-20'
GROUP BY EventType
```
In this blog post, we looked at what happens when we create partitioned external delta tables in Spark pools and serverless SQL pools, and the expected and unexpected behaviour of each.
- Query Delta Lake format using serverless SQL pool - Azure Synapse Analytics | Microsoft Learn
- Create and use views in serverless SQL pool - Azure Synapse Analytics | Microsoft Learn
- Create and use external tables in Synapse SQL pool - Azure Synapse Analytics | Microsoft Learn
- Kusto Query Language (KQL) overview - Azure Data Explorer | Microsoft Learn