Synapse link and security


MrFlinstone 706

Hi All.

I am looking for a white paper, or any similar guidance, on securing access to Dataverse data that is exposed through Synapse Link, the data lake, and the serverless SQL pool in Synapse.

From my investigation, there are two concerns. First, once Synapse Link copies the data to a data lake, users can access it directly from the lake and even export it, and some of that data could be sensitive. Second, when a linked service is created from a Synapse workspace, it gives access to the entire dataset, because access goes through the system principal, which must hold the Storage Blob Data Reader role assignment. This means end users could reach restricted data through queries against the lake database.

Dynamic data masking would have been great here, but it is not possible with serverless SQL pools. What are the options with a serverless SQL pool, and the options in general, to lock down a Dynamics Synapse Link implementation?

Swapnesh Panchal 1,365 Reputation points Microsoft External Staff Moderator

2025-09-29T19:11:38.3866667+00:00

Hi @MrFlinstone,
Welcome to the Microsoft Q&A Platform.

When Synapse Link for Dataverse syncs, the data always lands as files in ADLS Gen2 and is automatically surfaced through the serverless SQL pool. That copy is a new dataset that does not inherit Dataverse security, so it must be secured separately.

The first place to be careful is storage access. Anyone who holds the Storage Blob Data Reader role can read those files directly. Do not assign that role broadly at the account level; scope it to the container that Synapse Link uses, and use ADLS ACLs if you need finer-grained control at the folder level. In addition, lock down the storage account itself with private endpoints and disable public network access. It is also worth enabling Microsoft Defender for Storage so that you pick up anomalous access or bulk downloads.
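To make the container-level scoping concrete, here is a minimal Python sketch that builds the Azure resource ID for one container and flags data-role assignments granted above that level. The subscription ID, resource group, account, container and principal names are placeholders, and the helper functions are hypothetical; the list of assignments is assumed to come from whatever export of role assignments you already have.

```python
# Sketch only: every ID and name here is a placeholder, not a value from this thread.

DATA_ROLES = {
    "Storage Blob Data Reader",
    "Storage Blob Data Contributor",
    "Storage Blob Data Owner",
}


def container_scope(subscription_id, resource_group, account, container):
    """Return the resource ID of a single blob container (the narrowest RBAC scope)."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account}"
        f"/blobServices/default/containers/{container}"
    )


def is_container_scoped(scope):
    """True when a role assignment scope points at a container, not at anything above it."""
    return "/blobservices/default/containers/" in scope.lower()


def overbroad_assignments(assignments):
    """Yield (principal, role, scope) tuples where a blob data role is granted above container level."""
    for principal, role, scope in assignments:
        if role in DATA_ROLES and not is_container_scoped(scope):
            yield principal, role, scope


if __name__ == "__main__":
    good = container_scope("<subscription-id>", "rg-analytics", "stdataverse", "dataverse-env")
    too_wide = good.split("/blobServices")[0]  # the whole storage account
    sample = [
        ("workspace-identity", "Storage Blob Data Reader", too_wide),
        ("workspace-identity", "Storage Blob Data Contributor", good),
    ]
    for hit in overbroad_assignments(sample):
        print("overbroad:", hit)
```

Anything the audit reports is a candidate for re-scoping to the Synapse Link container, with ACLs handling folder-level restrictions below that.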

Then there is the Synapse workspace identity. By default this managed identity is given the right to read the lake, but I have often seen it granted far more than it needs, and that is usually the biggest mistake. Keep its scope as small as possible. Be careful with Synapse RBAC as well: do not hand out admin roles widely, and restrict who can create linked services. If Azure AD passthrough is an option, prefer it, so that queries execute under the user's own identity rather than the workspace identity.

On the SQL side, serverless pools are limited. They have no dynamic data masking and no built-in row-level security, and column-level security is restricted (it can be applied to views but not to external tables). You can work around some of this with views, or by creating filtered copies using CETAS, but that is not real enforcement. The right approach is to treat serverless as a staging layer only. If proper security is needed, the data must be loaded into a dedicated SQL pool or Fabric SQL, where RLS, CLS and DDM are available. You can move the data over with Synapse pipelines, PolyBase or CETAS. Users then work only against the curated dedicated pool tables, while the raw lake and the serverless façade are kept locked down.
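As an illustration of the view workaround mentioned above, this small Python helper renders the T-SQL for a view that exposes only selected columns and an optional row filter. It is a sketch under assumptions: the database, schema, table and column names are hypothetical, the filter text is passed through as trusted input, and the view only restricts anything if users have no access to the underlying table or files.

```python
# Sketch only: object names are hypothetical, and row_filter must be trusted SQL text.

def bracket(name):
    """Quote a T-SQL identifier, escaping any closing bracket inside it."""
    return "[" + name.replace("]", "]]") + "]"


def restricted_view_sql(view_schema, view_name, source, columns, row_filter=None):
    """Render CREATE VIEW text that exposes only the listed columns of `source`.

    `source` is a sequence of name parts, for example (database, schema, table).
    """
    if not columns:
        raise ValueError("list at least one column to expose")
    select_list = ", ".join(bracket(c) for c in columns)
    source_name = ".".join(bracket(part) for part in source)
    lines = [
        f"CREATE VIEW {bracket(view_schema)}.{bracket(view_name)} AS",
        f"SELECT {select_list}",
        f"FROM {source_name}",
    ]
    if row_filter:
        lines.append(f"WHERE {row_filter}")
    return "\n".join(lines) + ";"


if __name__ == "__main__":
    print(restricted_view_sql(
        "curated", "contact_safe",
        ("dataverse_lake", "dbo", "contact"),
        ["contactid", "fullname"],
        row_filter="[statecode] = 0",
    ))
```

Generating the statement from an explicit column list keeps the exposed surface reviewable, but as noted above it remains a convenience layer rather than enforcement.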

Finally, consider governance. Use Purview to classify and tag sensitive data, enforce retention so that raw data is not kept indefinitely, and track access with Azure Monitor, Sentinel and the storage logs. That way you are not relying on permissions alone; you are also monitoring usage. Synapse Link always lands in ADLS and exposes the data through serverless, but you do not need to give users access there. Lock down the raw lake and the workspace identity, and use dedicated SQL where you need strong security controls.

Let us know if you need more details.
MrFlinstone 706 Reputation points

2025-09-30T22:09:21.3866667+00:00

Hi Swapnesh Panchal

Thank you for the comments and thoughts. Can you please point me towards ready-made pipelines or solutions that transfer the data into a dedicated SQL pool or a SQL Server database for querying? Having extra pipelines might add further lag to the data, so it would no longer be near real time.

Secondly, is there a way to avoid leaving the storage account publicly accessible? Once the storage account is set to restrict access, the Spark pool jobs no longer work, and it breaks the data flow from Synapse Link. From a security perspective, this is a concern.
Swapnesh Panchal 1,365 Reputation points Microsoft External Staff Moderator

2025-10-02T18:58:31.3266667+00:00

Hi @MrFlinstone,
There are two points:
First, landing the Synapse Link data in a dedicated SQL surface with very low lag.
Second, locking down the storage so it is private without breaking Spark or Synapse Link.

For near-real-time into dedicated SQL (or SQL Server), treat the lake as a landing zone and micro-batch into curated tables. The most reliable pattern is event-driven: when Synapse Link drops new parquet files, the storage account raises a BlobCreated event; Event Grid triggers a Synapse pipeline; the pipeline runs a script in the dedicated SQL pool that does a COPY INTO to a small staging table and then a MERGE into your final table using Dataverse keys, honoring the delete flag. This usually gives a few minutes end-to-end. If you want something simpler, create external tables over the Link folders and run a scheduled insert or CTAS plus a MERGE; that’s higher latency but fewer moving parts. If your target is plain SQL Server, do the same pattern with ADF Copy or a small Spark job and finish with a MERGE.
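The upsert-and-delete behaviour of that MERGE step can be modelled in plain Python before writing the T-SQL. This is a sketch of the semantics only, not the pipeline itself. The column names `Id`, `versionnumber` and `IsDelete` are assumptions about what the Link export contains, so substitute whatever key, version and delete-flag columns your tables actually carry.

```python
# Sketch only: models MERGE semantics in memory; column names are assumptions.

def apply_changes(curated, staged, key="Id", version="versionnumber", delete_flag="IsDelete"):
    """Return a new curated dict after applying a micro-batch of staged rows.

    curated: dict mapping primary key -> row dict (the final table)
    staged:  list of row dicts loaded from the newly landed files
    """
    # A batch can contain several versions of one row; keep the newest per key.
    latest = {}
    for row in staged:
        k = row[key]
        if k not in latest or row[version] > latest[k][version]:
            latest[k] = row

    result = dict(curated)  # leave the input untouched
    for k, row in latest.items():
        if row.get(delete_flag):
            result.pop(k, None)  # WHEN MATCHED AND deleted THEN DELETE
        elif k not in result or row[version] >= result[k][version]:
            result[k] = row      # insert new rows, update when not stale
    return result


if __name__ == "__main__":
    curated = {
        "a": {"Id": "a", "versionnumber": 1, "IsDelete": False, "name": "old"},
        "b": {"Id": "b", "versionnumber": 4, "IsDelete": False, "name": "gone soon"},
    }
    staged = [
        {"Id": "a", "versionnumber": 2, "IsDelete": False, "name": "new"},
        {"Id": "b", "versionnumber": 5, "IsDelete": True, "name": "gone soon"},
        {"Id": "c", "versionnumber": 1, "IsDelete": False, "name": "first"},
        {"Id": "c", "versionnumber": 3, "IsDelete": False, "name": "latest"},
    ]
    for key, row in sorted(apply_changes(curated, staged).items()):
        print(key, row["name"])
```

Deduplicating the batch to the newest version per key before merging matters in practice, because a T-SQL MERGE fails when several source rows match the same target row.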

For keeping the storage private without breaking anything, run Synapse in a managed VNet and create managed private endpoints from the workspace to your storage account for both dfs and blob; approve them on the storage side. If you use Key Vault or Event Grid, add private endpoints there too. Make sure private DNS zones are present and linked to the managed VNet for privatelink.dfs.core.windows.net and privatelink.blob.core.windows.net. In the storage account, disable public network access. Give the workspace managed identity the Storage Blob Data Contributor role only on the container used by Synapse Link, not at the whole account; add ADLS ACLs on folders if you want tighter scoping. If you haven’t moved Event Grid to private endpoints yet, you can temporarily allow trusted Microsoft services. After you lock things down, open Synapse Link and re-validate the connection.

If Spark or pipelines fail after hardening, check these in order: the dfs and blob private endpoints exist and show Approved; the two private DNS zones above are linked to the workspace VNet; the managed identity has the role at the correct container scope; and all linked services and scripts are using Managed Identity rather than account keys.
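One quick way to test the DNS part of that checklist is to confirm that the storage hostnames resolve to private addresses from inside the network that is supposed to use the private endpoints. The sketch below uses only the Python standard library; the account name is a placeholder, and the result is only meaningful when run from a machine or notebook inside that network.

```python
# Sketch only: run from inside the network that should use the private endpoints.
import ipaddress
import socket


def storage_hostnames(account):
    """Return the dfs and blob endpoints for a storage account name."""
    return [f"{account}.dfs.core.windows.net", f"{account}.blob.core.windows.net"]


def resolved_addresses(hostname, resolver=socket.getaddrinfo):
    """Return the distinct IP addresses a hostname resolves to."""
    infos = resolver(hostname, 443)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def resolves_privately(hostname, resolver=socket.getaddrinfo):
    """True when the hostname resolves and every address is in a private range."""
    addresses = resolved_addresses(hostname, resolver)
    return bool(addresses) and all(ipaddress.ip_address(a).is_private for a in addresses)


if __name__ == "__main__":
    for host in storage_hostnames("<storage-account>"):  # placeholder account name
        try:
            print(host, "private" if resolves_privately(host) else "PUBLIC")
        except (socket.gaierror, UnicodeError) as exc:
            print(host, "did not resolve:", exc)
```

A public address here usually points at a missing or unlinked private DNS zone rather than at the role assignments, which narrows down which item in the list above to fix first.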
MrFlinstone 706 Reputation points

2025-10-06T07:38:54.9566667+00:00

Do you have an example of a solution that will sync a few tables to a dedicated SQL pool or a SQL Server database? And which one is better: a dedicated SQL pool or a SQL Server database?
Swapnesh Panchal 1,365 Reputation points Microsoft External Staff Moderator

2025-10-07T00:13:47.27+00:00

Yes – you can keep a few Dataverse tables in sync using Synapse Link as the landing zone and a small pipeline on top.

Example pattern (Synapse Link → dedicated SQL pool / SQL Server)

1. Use Synapse Link for Dataverse to land the required tables into ADLS Gen2 (near real time).

2. Create a Synapse/ADF pipeline (triggered by schedule or by Event Grid on new files).

3. Pipeline flow for each table:

   a. Read new data from the Link folder (via serverless external table or dataset).

   b. Load into a staging table in the dedicated SQL pool or SQL Server.

   c. Run a MERGE stored procedure to upsert into the final curated table (using the Dataverse primary key + delete flag).

4. Apply RLS / CLS / Dynamic Data Masking on the curated tables and give users access only there.
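For the last step, the statements involved are short enough to generate per table. The Python sketch below renders the T-SQL for masking a column and for granting SELECT on a restricted column list in the curated database. The table, column and role names are hypothetical, and the mask function should be chosen per column; treat the output as a starting point to review, not as a finished security model.

```python
# Sketch only: table, column and role names are hypothetical examples.

def bracket(name):
    """Quote a T-SQL identifier, escaping any closing bracket inside it."""
    return "[" + name.replace("]", "]]") + "]"


def mask_column_sql(schema, table, column, mask_function="default()"):
    """Render the statement that adds a dynamic data mask to one column."""
    target = f"{bracket(schema)}.{bracket(table)}"
    return (
        f"ALTER TABLE {target} ALTER COLUMN {bracket(column)} "
        f"ADD MASKED WITH (FUNCTION = '{mask_function}');"
    )


def grant_columns_sql(schema, table, columns, principal):
    """Render a column-level GRANT so the principal can read only the listed columns."""
    if not columns:
        raise ValueError("list at least one column to grant")
    column_list = ", ".join(bracket(c) for c in columns)
    target = f"{bracket(schema)}.{bracket(table)}"
    return f"GRANT SELECT ON {target} ({column_list}) TO {bracket(principal)};"


if __name__ == "__main__":
    print(mask_column_sql("dbo", "contact", "emailaddress1", "email()"))
    print(grant_columns_sql("dbo", "contact", ["contactid", "fullname"], "report_readers"))
```

Keeping these statements in a script that runs after each schema change helps ensure that a newly added sensitive column does not reach users unmasked.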

Which is better?

Dedicated SQL pool – better if you’re building a cloud analytics solution in Synapse, expect larger volumes and BI/reporting workloads, and want tight integration with Synapse (serverless, Spark, pipelines).

SQL Server – better if you have existing apps/reports that must use SQL Server and data volumes are more operational/moderate.

For a typical Dataverse + Synapse analytics scenario, dedicated SQL pool is usually the more natural and future-proof target.


Answer 1

When you push Dataverse (Dynamics) data out to a lake via Synapse Link, you create a new copy of the data that must be protected independently. Below are the realities, the constraints (what is and is not supported), and the mitigations and architecture patterns you can apply, focused on the serverless SQL pool but covering general options as well.

Synapse Link writes Dataverse data into ADLS Gen2 (the lake) as files. Once files exist in the lake, any identity that can read those files can export or copy them. You must treat the lake as the primary protection boundary.
The Synapse workspace uses a workspace identity / MI (system principal) to read files and run serverless queries, and by default that identity is commonly granted Storage Blob Data Reader on the container. If that role is scoped too broadly, downstream users can reach data through lake database queries, so restrict the identity's scope.
Dynamic data masking (DDM) and some other built-in SQL features are not supported on serverless SQL pools (DDM is supported on dedicated pools and Fabric SQL). Do not rely on serverless to provide DDM.
Row-level security and column-level security have limited support on serverless external tables. Serverless can use views to implement some column restrictions, but native RLS and DDM capabilities are fuller in dedicated SQL pools. Confirm your requirements before choosing serverless.

Note:

If you require truly enforced masking, row-level enforcement and enterprise policy that cannot be bypassed, do not rely on serverless SQL pools alone. Use a hardened curated endpoint (dedicated SQL or Fabric SQL) as the enforcement point. Serverless is great for exploration and low-cost queries, but it is not a drop-in replacement for a secured engine with RLS and DDM.
The single biggest operational mistake I see is granting the workspace identity wide storage rights. Lock that down and your risk drops dramatically.


MrFlinstone 706 Reputation points

2025-09-13T17:58:03.5133333+00:00

Thank you for your proposed answer. How can one use a dedicated SQL pool without using serverless? Serverless appears to be the default option: when you connect to the serverless pool, the data lake database is already loaded. How can one load this data into a dedicated SQL pool only?
Vinodh247 40,031 Reputation points MVP Volunteer Moderator

2025-09-14T12:16:42.7333333+00:00

Synapse Link for Dataverse always lands data in ADLS Gen2 and exposes it by default through the serverless SQL pool, but you are not forced to give users access there. Treat serverless as a staging layer: lock down the raw container, then use Synapse pipelines, PolyBase, or CETAS to copy or transform the landed files into a dedicated SQL pool. In the dedicated pool you can apply row-level security, column-level security, and dynamic data masking, which are missing or limited in serverless. This way, end users only query curated tables in the dedicated pool under strict governance, while raw files and the serverless façade remain restricted.
