Use a service principal with the Spark 3 connector for Azure Cosmos DB for NoSQL
In this article, you learn how to create a Microsoft Entra application and service principal that can be used with role-based access control. You can then use this service principal to connect to an Azure Cosmos DB for NoSQL account from Spark 3.
Prerequisites
- An existing Azure Cosmos DB for NoSQL account.
  - If you have an existing Azure subscription, create a new account.
  - No Azure subscription? You can try Azure Cosmos DB free with no credit card required.
- An existing Azure Databricks workspace.
- A registered Microsoft Entra application and service principal.
  - If you don't have a service principal and application, register an application by using the Azure portal. Alternatively, see the CLI sketch after this list.
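If you prefer the Azure CLI to the portal, the following is a minimal sketch that registers an application and creates its service principal in one step. The display name is a placeholder you choose; the command output includes an application (client) ID, the tenant ID, and a generated client secret.

```azurecli
# Register an application and create its service principal in one step (sketch).
# "<entra-app-display-name>" is a placeholder; pick any display name you like.
az ad sp create-for-rbac --name "<entra-app-display-name>"
```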
Create a secret and record credentials
In this section, you create a client secret and record the value for use later.
Open the Azure portal.
Go to your existing Microsoft Entra application.
Go to the Certificates & secrets page. Then, create a new secret. Save the Client Secret value to use later in this article.
Go to the Overview page. Locate and record the values for Application (client) ID, Object ID, and Directory (tenant) ID. You also use these values later in this article.
Go to your existing Azure Cosmos DB for NoSQL account.
Record the URI value on the Overview page. Also record the Subscription ID and Resource Group values. You use these values later in this article.
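If you'd rather look up some of these values from the Azure CLI instead of the portal, here's a minimal sketch. It assumes you're already signed in with az login and that the placeholders match your resource group and account names.

```azurecli
# Record the account endpoint (URI) of the Azure Cosmos DB for NoSQL account (sketch).
az cosmosdb show \
    --resource-group "<resource-group-name>" \
    --name "<account-name>" \
    --query "documentEndpoint"

# Record the subscription ID and Directory (tenant) ID of the signed-in context.
az account show --query "{subscriptionId: id, tenantId: tenantId}"
```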
Create a definition and an assignment
In this section, you create an Azure Cosmos DB for NoSQL role definition with permissions to read and write items in containers. Then you assign that role to the Microsoft Entra service principal.
Create a role by using the `az cosmosdb sql role definition create` command. Pass in the Azure Cosmos DB for NoSQL account name and resource group, followed by a body of JSON that defines the custom role. The role is also scoped to the account level by using `/`. Ensure that you provide a unique name for your role by using the `RoleName` property of the request body.

```azurecli
az cosmosdb sql role definition create \
    --resource-group "<resource-group-name>" \
    --account-name "<account-name>" \
    --body '{
        "RoleName": "<role-definition-name>",
        "Type": "CustomRole",
        "AssignableScopes": ["/"],
        "Permissions": [{
            "DataActions": [
                "Microsoft.DocumentDB/databaseAccounts/readMetadata",
                "Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/items/*",
                "Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/*"
            ]
        }]
    }'
```
List the role definition you created to fetch its unique identifier in the JSON output. Record the `id` value of the JSON output.

```azurecli
az cosmosdb sql role definition list \
    --resource-group "<resource-group-name>" \
    --account-name "<account-name>"
```

```json
[
  {
    ...,
    "id": "/subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.DocumentDB/databaseAccounts/<account-name>/sqlRoleDefinitions/<role-definition-id>",
    ...
    "permissions": [
      {
        "dataActions": [
          "Microsoft.DocumentDB/databaseAccounts/readMetadata",
          "Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/items/*",
          "Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/*"
        ],
        "notDataActions": []
      }
    ],
    ...
  }
]
```
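If you'd rather capture the identifier directly instead of scanning the JSON by eye, here's a minimal sketch using a JMESPath query. It assumes the list output exposes your role's name in a roleName property and that the placeholder matches the name you chose earlier.

```azurecli
# Fetch only the id of the custom role definition created earlier (sketch).
# Assumes the role name appears as "roleName" in the list output.
az cosmosdb sql role definition list \
    --resource-group "<resource-group-name>" \
    --account-name "<account-name>" \
    --query "[?roleName=='<role-definition-name>'].id" \
    --output tsv
```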
Use `az cosmosdb sql role assignment create` to create a role assignment. Replace `<aad-principal-id>` with the Object ID you recorded earlier in this article. Also, replace `<role-definition-id>` with the `id` value fetched from running the `az cosmosdb sql role definition list` command in a previous step.

```azurecli
az cosmosdb sql role assignment create \
    --resource-group "<resource-group-name>" \
    --account-name "<account-name>" \
    --scope "/" \
    --principal-id "<aad-principal-id>" \
    --role-definition-id "<role-definition-id>"
```
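To confirm that the assignment exists before you move on, you can list the role assignments on the account. A minimal sketch:

```azurecli
# List role assignments on the account to verify the new assignment (sketch).
az cosmosdb sql role assignment list \
    --resource-group "<resource-group-name>" \
    --account-name "<account-name>"
```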
Use a service principal
Now that you've created a Microsoft Entra application and service principal, created a custom role, and assigned that role to the service principal on your Azure Cosmos DB for NoSQL account, you should be able to run a notebook.
Open your Azure Databricks workspace.
In the workspace interface, create a new cluster. Configure the cluster with these settings, at a minimum:
| Setting | Value |
| --- | --- |
| Runtime version | 13.3 LTS (Scala 2.12, Spark 3.4.1) |
Use the workspace interface to search for Maven packages from Maven Central with a Group ID of `com.azure.cosmos.spark`. Install the package specifically for Spark 3.4, with an Artifact ID prefixed with `azure-cosmos-spark_3-4` (for example, `azure-cosmos-spark_3-4_2-12`), to the cluster. Finally, create a new notebook.
Tip
By default, the notebook is attached to the recently created cluster.
Within the notebook, set Azure Cosmos DB Spark connector configuration settings for the NoSQL account endpoint, database name, and container name. Use the Subscription ID, Resource Group, Application (client) ID, Directory (tenant) ID, and Client Secret values recorded earlier in this article.
```python
# Set configuration settings
config = {
    "spark.cosmos.accountEndpoint": "<nosql-account-endpoint>",
    "spark.cosmos.auth.type": "ServicePrincipal",
    "spark.cosmos.account.subscriptionId": "<subscription-id>",
    "spark.cosmos.account.resourceGroupName": "<resource-group-name>",
    "spark.cosmos.account.tenantId": "<entra-tenant-id>",
    "spark.cosmos.auth.aad.clientId": "<entra-app-client-id>",
    "spark.cosmos.auth.aad.clientSecret": "<entra-app-client-secret>",
    "spark.cosmos.database": "<database-name>",
    "spark.cosmos.container": "<container-name>"
}
```
```scala
// Set configuration settings
val config = Map(
    "spark.cosmos.accountEndpoint" -> "<nosql-account-endpoint>",
    "spark.cosmos.auth.type" -> "ServicePrincipal",
    "spark.cosmos.account.subscriptionId" -> "<subscription-id>",
    "spark.cosmos.account.resourceGroupName" -> "<resource-group-name>",
    "spark.cosmos.account.tenantId" -> "<entra-tenant-id>",
    "spark.cosmos.auth.aad.clientId" -> "<entra-app-client-id>",
    "spark.cosmos.auth.aad.clientSecret" -> "<entra-app-client-secret>",
    "spark.cosmos.database" -> "<database-name>",
    "spark.cosmos.container" -> "<container-name>"
)
```
Configure the Catalog API to manage API for NoSQL resources by using Spark.
```python
# Configure Catalog Api
spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", "<nosql-account-endpoint>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.type", "ServicePrincipal")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.subscriptionId", "<subscription-id>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.resourceGroupName", "<resource-group-name>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.tenantId", "<entra-tenant-id>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.aad.clientId", "<entra-app-client-id>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.aad.clientSecret", "<entra-app-client-secret>")
```
```scala
// Configure Catalog Api
spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", "<nosql-account-endpoint>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.type", "ServicePrincipal")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.subscriptionId", "<subscription-id>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.resourceGroupName", "<resource-group-name>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.tenantId", "<entra-tenant-id>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.aad.clientId", "<entra-app-client-id>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.aad.clientSecret", "<entra-app-client-secret>")
```
Create a new database by using `CREATE DATABASE IF NOT EXISTS`. Ensure that you provide your database name.

```python
# Create a database using the Catalog API
spark.sql("CREATE DATABASE IF NOT EXISTS cosmosCatalog.{};".format("<database-name>"))
```

```scala
// Create a database using the Catalog API
spark.sql("CREATE DATABASE IF NOT EXISTS cosmosCatalog.<database-name>;")
```
Create a new container by using the database name, container name, partition key path, and throughput values that you specify.
```python
# Create a products container using the Catalog API
spark.sql("CREATE TABLE IF NOT EXISTS cosmosCatalog.{}.{} USING cosmos.oltp TBLPROPERTIES(partitionKeyPath = '{}', manualThroughput = '{}')".format("<database-name>", "<container-name>", "<partition-key-path>", "<throughput>"))
```
```scala
// Create a products container using the Catalog API
spark.sql("CREATE TABLE IF NOT EXISTS cosmosCatalog.<database-name>.<container-name> USING cosmos.oltp TBLPROPERTIES(partitionKeyPath = '<partition-key-path>', manualThroughput = '<throughput>')")
```
Create a sample dataset.
```python
# Create sample data
products = (
    ("68719518391", "gear-surf-surfboards", "Yamba Surfboard", 12, 850.00, False),
    ("68719518371", "gear-surf-surfboards", "Kiama Classic Surfboard", 25, 790.00, True)
)
```
```scala
// Create sample data
val products = Seq(
    ("68719518391", "gear-surf-surfboards", "Yamba Surfboard", 12, 850.00, false),
    ("68719518371", "gear-surf-surfboards", "Kiama Classic Surfboard", 25, 790.00, true)
)
```
Use `spark.createDataFrame` and the previously saved online transaction processing (OLTP) configuration to add sample data to the target container.

```python
# Ingest sample data
spark.createDataFrame(products) \
    .toDF("id", "category", "name", "quantity", "price", "clearance") \
    .write \
    .format("cosmos.oltp") \
    .options(**config) \
    .mode("APPEND") \
    .save()
```
```scala
// Ingest sample data
spark.createDataFrame(products)
    .toDF("id", "category", "name", "quantity", "price", "clearance")
    .write
    .format("cosmos.oltp")
    .options(config)
    .mode("APPEND")
    .save()
```
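To check that the ingest succeeded, you can read the container back into a DataFrame with the same configuration. The following is a minimal Python sketch; it assumes the config dictionary defined earlier in this article and enables schema inference for the read.

```python
# Read the ingested items back to verify the write (sketch).
# Assumes the `config` dictionary defined earlier in this notebook.
df = (spark.read.format("cosmos.oltp")
    .options(**config)
    .option("spark.cosmos.read.inferSchema.enabled", "true")
    .load())
df.show()
```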
Tip
In this quickstart example, credentials are assigned to variables in clear text. For security, we recommend that you use secrets. For more information on how to configure secrets, see Add secrets to your Spark configuration.
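For example, in an Azure Databricks notebook you could read the client secret from a secret scope instead of hard-coding it. This is a minimal Python sketch; the secret scope and key names are placeholders for a scope and secret you've already created.

```python
# Read the client secret from a Databricks secret scope instead of clear text (sketch).
# "<secret-scope-name>" and "<client-secret-key-name>" are placeholders.
client_secret = dbutils.secrets.get(scope="<secret-scope-name>", key="<client-secret-key-name>")

# Use the retrieved value in the connector configuration.
config["spark.cosmos.auth.aad.clientSecret"] = client_secret
```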