Serverless SQL pool allows you to query data in your data lake. It provides a T-SQL query interface that supports queries on semi-structured and unstructured data. For queries, the following aspects of T-SQL are supported:
- Complete SELECT surface area, including most SQL functions and operators.
- CREATE EXTERNAL TABLE AS SELECT (CETAS) creates an external table and exports the results of a Transact-SQL SELECT statement to Azure Storage in parallel.
For more information on what is currently supported and what is not, see the Serverless SQL pool overview article or the following articles:
- Develop storage access, where you can learn how to use external tables and the OPENROWSET function to read data from storage.
- Control storage access, where you can learn how to enable Synapse SQL to access storage by using SAS authentication or the workspace managed identity.
Overview
To support a smooth experience for querying data that resides in Azure Storage files, serverless SQL pool uses the OPENROWSET function with additional capabilities:
- Scan multiple files or folders
- Parquet file format
- CSV and delimited text files (field terminator, row terminator, escape character)
- Delta Lake format
- Read a selected subset of columns
- Schema Inference
- Filename function
- Filepath function
- Work with complex types and nested or repeating data structures
Query Parquet files
To query the Parquet source data, use FORMAT = 'PARQUET':
SELECT *
FROM OPENROWSET(
    BULK N'https://myaccount.dfs.core.windows.net/mycontainer/mysubfolder/data.parquet',
    FORMAT = 'PARQUET'
) WITH (C1 int, C2 varchar(20), C3 varchar(max)) AS [rows]
See the Query Parquet files article for usage examples.
Query CSV files
To query CSV source data, use FORMAT = 'CSV'. You can specify the schema of the CSV file as part of the OPENROWSET function when you query CSV files:
SELECT *
FROM OPENROWSET(
    BULK N'https://myaccount.dfs.core.windows.net/mycontainer/mysubfolder/data.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) WITH (C1 int, C2 varchar(20), C3 varchar(max)) AS [rows]
There are some additional options that adapt the parsing rules to a custom CSV format:
- ESCAPE_CHAR = 'char' Specifies the character in the file used to escape itself and any delimiter values in the file. If the escape character is followed by a value other than itself or one of the delimiter values, the escape character is discarded when the value is read. The ESCAPE_CHAR parameter is applied regardless of whether FIELDQUOTE is enabled or not. It is not used to escape the double quote. The quotation mark must be escaped with another quotation mark. The double quote can appear in the column value only if the value is enclosed in double quotes.
- FIELDTERMINATOR = 'field_terminator' Specifies the field terminator to use. The default field terminator is a comma (",").
- ROWTERMINATOR = 'row_terminator' Specifies the row terminator to use. The default row terminator is a newline character (\r\n).
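As a sketch of how these options combine, the following query reads a hypothetical pipe-delimited file with Unix-style line endings (the storage account, container, and file name are placeholders):

SELECT *
FROM OPENROWSET(
    BULK N'https://myaccount.dfs.core.windows.net/mycontainer/mysubfolder/data-pipe.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '0x0A'
) WITH (C1 int, C2 varchar(20), C3 varchar(max)) AS [rows]

Here ROWTERMINATOR = '0x0A' specifies a line-feed-only terminator, which is common for files produced on Unix-like systems.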
Query Delta Lake format
To query the Delta Lake source data, use FORMAT = 'DELTA' and point to the root folder that contains your Delta Lake files.
SELECT *
FROM OPENROWSET(
    BULK N'https://myaccount.dfs.core.windows.net/mycontainer/mysubfolder',
    FORMAT = 'DELTA'
) WITH (C1 int, C2 varchar(20), C3 varchar(max)) AS [rows]
The root folder must contain a subfolder named _delta_log.
See the Query Delta Lake format article for usage examples.
File schema
The SQL language in Synapse SQL lets you define the schema of the file as part of the OPENROWSET function and read all or a subset of columns, or it tries to automatically determine column types from the file by using schema inference.
Read a selected subset of columns
To specify the columns you want to read, you can provide an optional WITH clause within your OPENROWSET statement.
- For CSV data files, provide column names and their data types to read all columns. If you want a subset of columns, use ordinal numbers to pick the columns from the source data files by ordinal. Columns are bound by ordinal.
- For Parquet data files, provide column names that match the column names in the source data files. Columns are bound by name.
SELECT *
FROM OPENROWSET(
    BULK N'https://myaccount.dfs.core.windows.net/mycontainer/mysubfolder/data.parquet',
    FORMAT = 'PARQUET'
) WITH (C1 int, C2 varchar(20), C3 varchar(max)) AS [rows]
You must specify the column name and data type for each column in the WITH clause. For examples, see Read CSV files without specifying all columns.
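As a sketch of reading a subset of columns by ordinal (the storage path and ordinals are hypothetical), the following query reads only the second and third columns of a CSV file:

SELECT *
FROM OPENROWSET(
    BULK N'https://myaccount.dfs.core.windows.net/mycontainer/mysubfolder/data.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) WITH (C2 varchar(20) 2, C3 varchar(max) 3) AS [rows]

The number after each data type is the ordinal position of that column in the source file.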
Schema Inference
By omitting the WITH clause from the OPENROWSET statement, you can instruct the service to automatically detect (infer) the schema of the underlying files.
SELECT *
FROM OPENROWSET(
    BULK N'https://myaccount.dfs.core.windows.net/mycontainer/mysubfolder/data.parquet',
    FORMAT = 'PARQUET'
) AS [rows]
Make sure appropriate inferred data types are used for optimal performance.
Scan multiple files or folders
To run a T-SQL query over a set of files within a folder or set of folders, treating them as a single entity or rowset, provide a path to a folder, or a pattern (using wildcards) over a set of files or folders.
The following rules apply:
- Patterns can appear in part of a directory path or in a filename.
- Multiple patterns can appear in the same directory step or file name.
- If multiple wildcards are present, files in all matching paths will be included in the resulting fileset.
SELECT *
FROM OPENROWSET(
    BULK N'https://myaccount.dfs.core.windows.net/myroot/*/mysubfolder/*.parquet',
    FORMAT = 'PARQUET'
) AS [files]
See the Query folders and multiple files article for usage examples.
Filename function
This function returns the name of the file that the row originates from.
To target specific files, see the Filename section in the Query specific files article.
The return data type is nvarchar(1024). For best performance, always convert the result of the filename function to the correct data type. If you use the character data type, be sure to use the correct length.
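For example, a query like the following (the storage path is a placeholder) counts rows per source file and casts the function result to a shorter character type, as recommended above:

SELECT
    CAST([rows].filename() AS varchar(100)) AS [file_name],
    COUNT_BIG(*) AS [row_count]
FROM OPENROWSET(
    BULK N'https://myaccount.dfs.core.windows.net/mycontainer/csv/taxi/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) WITH (C1 int) AS [rows]
GROUP BY [rows].filename()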
Filepath function
This function returns a full path or part of the path:
- When called without parameters, it returns the full path of the file that the row originates from.
- When called with a parameter, returns a portion of the path that matches the placeholder at the position specified in the parameter. For example, a parameter value of 1 would return a portion of the path that matches the first placeholder.
For more information, see the Filepath section of the Query specific files article.
The return data type is nvarchar(1024). For best performance, always convert the result of the filepath function to the correct data type. If you use the character data type, be sure to use the correct length.
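For example, with files partitioned into year=*/month=* folders (a hypothetical layout), filepath(1) and filepath(2) return the values matched by the first and second wildcards:

SELECT
    [rows].filepath(1) AS [year],
    [rows].filepath(2) AS [month],
    COUNT_BIG(*) AS [row_count]
FROM OPENROWSET(
    BULK N'https://myaccount.dfs.core.windows.net/mycontainer/parquet/taxi/year=*/month=*/*.parquet',
    FORMAT = 'PARQUET'
) AS [rows]
GROUP BY [rows].filepath(1), [rows].filepath(2)

This pattern lets the query prune partitions and aggregate per partition without reading a partition column from the file itself.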
Work with complex types and nested or repeating data structures
To provide a smooth experience with data stored in nested or repeated data types, such as in Parquet files, serverless SQL pool adds the following extensions.
Project nested or repeating data
To project data, run a SELECT statement over the Parquet file that contains columns with nested data types. On output, nested values are serialized into JSON and returned as a varchar(8000) SQL data type.
SELECT * FROM OPENROWSET(BULK 'unstructured_data_path', FORMAT = 'PARQUET') [AS Alias]
For more detailed information, see the Project nested or repeated data section of the Query Parquet nested types article.
Access elements from nested columns
To access nested elements from a nested column, such as a struct, use dot notation to concatenate the field names into the path. Provide the path as the column_name in the WITH clause of the OPENROWSET function.
The example syntax snippet is as follows:
OPENROWSET(
    BULK 'unstructured_data_path',
    FORMAT = 'PARQUET'
)
WITH ({'column_name' 'column_type',}) [AS alias]
'column_name' ::= '[field_name.] field_name'
By default, the OPENROWSET function matches the source field name and path with the column names provided in the WITH clause. Elements contained at different nesting levels within the same source Parquet file can be accessed through the WITH clause.
Return values
- The function returns a scalar value, such as int, decimal, or varchar, from the specified element on the specified path, for all Parquet types that are not in the nested type group.
- If the path points to an element of the nested type, the function returns a JSON fragment starting at the top element in the specified path. The JSON fragment is of type varchar(8000).
- If the property cannot be found in the specified column name, the function returns an error.
- If the property cannot be found at the specified column path, the function returns an error in strict path mode or null in lax path mode, depending on the path mode.
For example queries, see the Access elements from nested columns section of the Query Parquet nested types article.
Access elements from repeated columns
To access elements from a repeated column, such as an element of an array or map, use the JSON_VALUE function for every scalar element you need to project, and provide:
- The nested or repeated column as the first parameter
- A JSON path that specifies the element or property to access as the second parameter
To access non-scalar elements from a repeated column, use the JSON_QUERY function for every non-scalar element you need to project, and provide:
- The nested or repeated column as the first parameter
- A JSON path that specifies the element or property to access as the second parameter
See the following syntax snippet:
SELECT
    { JSON_VALUE(column_name, path_to_sub_element), }
    { JSON_QUERY(column_name [, path_to_sub_element]), }
FROM OPENROWSET(
    BULK 'unstructured_data_path',
    FORMAT = 'PARQUET'
) [AS alias]
For example queries that access elements from repeated columns, see the Query Parquet nested types article.
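As a sketch, assuming a Parquet file with a repeated column named tags that is read as varchar(8000) (the column name and storage path are hypothetical), JSON_VALUE extracts a single scalar element while JSON_QUERY returns the whole array as a JSON fragment:

SELECT
    JSON_VALUE(tags, '$[0]') AS [first_tag],
    JSON_QUERY(tags) AS [all_tags]
FROM OPENROWSET(
    BULK N'https://myaccount.dfs.core.windows.net/mycontainer/parquet/nested/data.parquet',
    FORMAT = 'PARQUET'
) WITH (tags varchar(8000)) AS [rows]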
Query examples
Use the example queries to learn more about querying different types of data.
Tools
You need one of the following tools to issue queries:
- Azure Synapse Studio
- Azure Data Studio
- SQL Server Management Studio
Demo setup
Your first step is to create a database where you will run the queries. Then initialize the objects by executing a setup script on that database.
This setup script creates the data sources, database scope credentials, and external file formats used to read the data in these examples.
Note
The database is used only to view metadata, not actual data. Write down the name of the database you use; you will need it later.
CREATE DATABASE mydbname;
Demo data provided
The demo data contains the following datasets:
- NYC Taxi - Yellow Taxi Trip Records - Part of NYC public data set in CSV and Parquet format
- Population data set in CSV format
- Examples of Parquet files with nested columns
- Books in JSON format
Folder path | Description
---|---
/csv/ | Parent folder for data in CSV format
/csv/population/ /csv/population-unix/ /csv/population-unix-hdr/ /csv/population-unix-hdr-escape /csv/population-unix-hdr-quoted | Folders with population data files in various CSV formats
/csv/taxi/ | Folder with NYC public data files in CSV format
/parquet/ | Parent folder for data in Parquet format
/parquet/taxi | NYC public data files in Parquet format, partitioned by year and month by using the Hive/Hadoop partitioning scheme
/parquet/nested/ | Sample Parquet files with nested columns
/json/ | Parent folder for data in JSON format
/json/books/ | JSON files with book data
Next Steps
For more information about querying different file types and creating and using views, see the following articles:
- Query CSV files
- Query Parquet files
- Query JSON files
- Query nested values
- Querying folders and multiple CSV files
- Use file metadata in queries
- Create and use views