Edit

Share via


Ingest data into your Warehouse using Transact-SQL

Applies to: ✅ Warehouse in Microsoft Fabric

The Transact-SQL language offers options you can use to load data at scale from existing tables in your lakehouse and warehouse into new tables in your warehouse. These options are convenient if you need to create new versions of a table with aggregated data, versions of tables with a subset of the rows, or to create a table as a result of a complex query. Let's explore some examples.

Create a new table with the result of a query

Warehouse in Microsoft Fabric enables you to easily create a new table based on a result of T-SQL query, using the following T-SQL statements:

  • CREATE TABLE AS SELECT (CTAS) statement that allows you to create a new table in your warehouse from the output of a SELECT statement.
  • SELECT INTO query clause that enables you to select results from any table source, and redirect the results into a new table. This is a standard feature in T-SQL language.

These two statements are similar, so the following examples are focused on the CTAS statement.

The CTAS statement runs the ingestion operation into the new table in parallel, making it highly efficient for data transformation and creation of new tables in your workspace.

You can use the following options for the SELECT part of CTAS statement:

  • Reading a warehouse table, such as a staging table.
  • Reading a Lakehouse Delta Lake folder using an autogenerated table in SQL analytics endpoint for Lakehouse.
  • Reading CSV or Parquet files directly from Azure Data Lake or Azure Blob storage using the OPENROWSET function.

Note

The examples in this article use the Bing COVID-19 sample dataset. To load the sample dataset, follow the steps in Ingest data into your Warehouse using the COPY statement to create the sample data into your warehouse.

Create table from Warehouse table

The first example illustrates how to create a new table that is a copy of the existing dbo.bing_covid19_data_2023 table, but filtered to data from the year 2023 only:

CREATE TABLE dbo.bing_covid19_data_2023
AS
SELECT * 
FROM dbo.bing_covid19_data 
WHERE DATEPART(YEAR, updated) = '2023';

You can also create a new table with new year, month, dayofmonth columns, with values obtained from updated column in the source table. This can be useful if you're trying to visualize infection data by year, or to see months when the most COVID-19 cases are observed:

CREATE TABLE dbo.bing_covid19_data_with_year_month_day
AS
SELECT DATEPART(YEAR, updated) AS [year],
       DATEPART(MONTH, updated) AS [month],
       DATEPART(DAY, updated) AS [dayofmonth],
       * 
FROM dbo.bing_covid19_data;

As another example, you can create a new table that summarizes the number of cases observed in each month, regardless of the year, to evaluate how seasonality affects spread in a given country/region. It uses the table created in the previous example with the new month column as a source:

CREATE TABLE dbo.infections_by_month
AS
SELECT country_region, [month],
       SUM(CAST(confirmed as bigint)) AS confirmed_sum
FROM dbo.bing_covid19_data_with_year_month_day
GROUP BY country_region, [month];

Based on this new table, we can see that the United States observed more confirmed cases across all years in the month of January, followed by December and October. April is the month with the lowest number of cases overall:

SELECT * FROM dbo.infections_by_month
WHERE country_region = 'United States'
ORDER BY confirmed_sum DESC;

Create table from Delta Lake folder

The Delta Lake folders that are persisted in OneLake are automatically represented as tables if they're stored in /Tables folder in a lakehouse. The following code creates a new table bing_covid19_data_2023 from Delta Lake folder /Tables/bing_covid19_delta_lake in the MyLakehouse lakehouse:

CREATE TABLE dbo.bing_covid19_data_2023
AS
SELECT * 
FROM MyLakehouse.dbo.bing_covid19_delta_lake 
WHERE DATEPART(YEAR, updated) = '2023';

You can reference Delta Lake folder using the three-part-name notation that references the lakehouse where the files are stored. All examples shown in the previous section are applicable to Delta Lake folders.

Create table from CSV/Parquet file

Instead of reading data from the Warehouse bing_covid19_data table, you can also create a new table directly from an external file using the OPENROWSET function:

CREATE TABLE dbo.bing_covid19_data_2022
AS
SELECT *
FROM OPENROWSET(BULK 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet') AS data;
WHERE DATEPART(YEAR, updated) = '2022'

You can also create a new table by transforming data from an external CSV file:

CREATE TABLE dbo.bing_covid19_data_with_year_month_day
AS
SELECT DATEPART(YEAR, updated) AS [year], 
       DATEPART(MONTH, updated) AS [month],
       DATEPART(DAY, updated) AS [dayofmonth],
       *
FROM OPENROWSET(BULK 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.csv') AS data;

As another example, you can create a new table that summarizes the number of cases observed in each month, regardless of the year, to evaluate how seasonality affects spread in a given country/region. It uses the table created in the previous example with the new month column as a source:

CREATE TABLE dbo.infections_by_month_2022
AS
SELECT country_region,
       DATEPART(MONTH, updated) AS [month],
       SUM(CAST(confirmed as bigint)) AS [confirmed_sum]
FROM OPENROWSET(BULK 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet') AS data
WHERE DATEPART(YEAR, updated) = '2022'
GROUP BY country_region, DATEPART(MONTH, updated);

Based on this new table, we can see that the United States observed more confirmed cases across all years in the month of January, followed by December and October. April is the month with the lowest number of cases overall:

SELECT * FROM dbo.infections_by_month_2022
WHERE country_region = 'United States'
ORDER BY confirmed_sum DESC;

Screenshot of the query results showing the number of infections by month in the United States, ordered by month, in descending order. The month number 1 is shown on top.

For more examples and syntax reference, see CREATE TABLE AS SELECT (Transact-SQL).

Ingest data into existing tables with T-SQL queries

The previous examples create new tables based on the result of a query. To replicate the examples but on existing tables, the INSERT ... SELECT pattern can be used.

Ingest data from Warehouse table

The following code ingests new data from a warehouse table into an existing table:

INSERT INTO dbo.bing_covid19_data_2023
SELECT *
FROM dbo.bing_covid19_data
WHERE DATEPART(YEAR, updated) = '2023';

The query criteria for the SELECT statement can be any valid query, as long as the resulting query column types align with the columns on the destination table. If column names are specified and include only a subset of the columns from the destination table, all other columns are loaded as NULL. For more information, see Using INSERT INTO...SELECT to Bulk Import data with minimal logging and parallelism.

Ingest data from Delta Lake folder

The Delta Lake folders that are persisted in OneLake are automatically represented as tables if they're stored in /Tables folder in a lakehouse. The following code ingests new data from Delta Lake folder /Tables/bing_covid19_delta_lake section in the MyLakehouse lakehouse

INSERT INTO dbo.bing_covid19_data_2023
SELECT *
FROM MyLakehouse.dbo.bing_covid19_delta_lake 
WHERE DATEPART(YEAR, updated) = '2023';

Ingest data from CSV/Parquet file

You can use the OPENROWSET function as a source in order to ingest data from Azure Data Lake or Azure Blob storage:

INSERT INTO dbo.bing_covid19_data_2023
SELECT *
FROM OPENROWSET(BULK 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet') AS data
WHERE DATEPART(YEAR, updated) = '2023';

These example is similar to those used in ingestion with COPY INTO. The COPY INTO command is easier to use, especially for straightforward source-to-destination data loads. However, if you need to transform source data (such as converting values or joining with other tables), using INSERT ... SELECT gives you the flexibility to perform transformations during ingestion.

Ingest data from tables on different warehouses and lakehouses

For both CREATE TABLE AS SELECT and INSERT ... SELECT, the SELECT statement can also reference tables on warehouses that are different from the warehouse where your destination table is stored, by using cross-warehouse queries. This can be achieved by using the three-part naming convention [warehouse_or_lakehouse_name.][schema_name.]table_name. For example, suppose you have the following workspace assets:

  • A lakehouse named cases_lakehouse with the latest case data.
  • A warehouse named reference_warehouse with tables used for reference data.
  • A warehouse named research_warehouse where the destination table is created.

A new table can be created that uses three-part naming to combine data from tables on these workspace assets:

CREATE TABLE research_warehouse.dbo.cases_by_continent
AS
SELECT *
FROM cases_lakehouse.dbo.bing_covid19_data AS cases
INNER JOIN reference_warehouse.dbo.bing_covid19_data AS reference
ON cases.iso3 = reference.countrycode;

To learn more about cross-warehouse queries, see Write a cross-database SQL Query.