Compute configuration for Databricks Connect
Note
This article covers Databricks Connect for Databricks Runtime 13.3 LTS and above.
In this article, you configure properties to establish a connection between Databricks Connect and your Azure Databricks cluster or serverless compute. This information applies to the Python and Scala versions of Databricks Connect unless stated otherwise.
Databricks Connect enables you to connect popular IDEs such as Visual Studio Code, PyCharm, RStudio Desktop, IntelliJ IDEA, notebook servers, and other custom applications to Azure Databricks clusters. See What is Databricks Connect?.
Requirements
To configure a connection to Databricks compute, you must have:
- Databricks Connect installed. For installation requirements and steps for specific language versions of Databricks Connect, see:
- An Azure Databricks account and workspace that have Unity Catalog enabled. See Set up and manage Unity Catalog and Enable a workspace for Unity Catalog.
- An Azure Databricks cluster with Databricks Runtime 13.3 LTS or above.
- The Databricks Runtime version of your cluster must be equal to, or above, the Databricks Connect package version. Databricks recommends that you use the most recent package of Databricks Connect that matches the Databricks Runtime version. To use features that are available in later versions of the Databricks Runtime, you must upgrade the Databricks Connect package. See the Databricks Connect release notes for a list of available Databricks Connect releases. For Databricks Runtime version release notes, see Databricks Runtime release notes versions and compatibility.
- The cluster must use a cluster access mode of Assigned or Shared. See Access modes.
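The version rule above (cluster runtime equal to or above the Databricks Connect package version) can be pictured with a small comparison helper. This is a minimal sketch, not part of Databricks Connect: the function names are hypothetical, and it assumes version strings begin with a major.minor pair such as 13.3 or 14.3.1.

```python
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '13.3', '14.3.1', or '13.3 LTS'."""
    numeric = version.strip().split()[0]  # drop a trailing label such as 'LTS'
    parts = numeric.split(".")
    return int(parts[0]), int(parts[1])


def is_compatible(runtime_version: str, connect_version: str) -> bool:
    """True when the cluster runtime is equal to or above the Connect package version."""
    return parse_major_minor(runtime_version) >= parse_major_minor(connect_version)


print(is_compatible("14.3 LTS", "13.3.0"))  # runtime is newer than the package
print(is_compatible("13.3 LTS", "14.3.1"))  # package is newer than the runtime
```

Only the major and minor components are compared here, on the assumption that patch releases of the package do not affect compatibility.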
Setup
Before you begin, you need the following:
- If you are connecting to a cluster, the ID of your cluster. You can retrieve the cluster ID from the URL. See Cluster URL and ID.
- The Azure Databricks workspace instance name. This is the Server Hostname value for your compute. See Get connection details for an Azure Databricks compute resource.
- Any other properties that are necessary for the Databricks authentication type that you want to use.
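As an illustration of the first two items, the sketch below pulls a workspace instance name and a cluster ID out of a cluster page URL. It assumes the cluster ID is the path component that follows /clusters/ in the URL; the sample URL and the helper names are hypothetical, so check Cluster URL and ID for the layout your workspace actually uses.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Assumption: the cluster ID is the component right after "/clusters/".
_CLUSTER_ID_PATTERN = re.compile(r"/clusters/([^/?#]+)")


def cluster_id_from_url(url: str) -> Optional[str]:
    """Return the component after '/clusters/', or None if it is absent."""
    match = _CLUSTER_ID_PATTERN.search(url)
    return match.group(1) if match else None


def workspace_instance_name_from_url(url: str) -> str:
    """Return the host part of the URL, without the scheme."""
    return urlparse(url).netloc


# Hypothetical URL, used only to exercise the helpers.
example = (
    "https://adb-1234567890123456.7.azuredatabricks.net/"
    "?o=1234567890123456#/setting/clusters/0123-456789-abcdefgh/configuration"
)
print(workspace_instance_name_from_url(example))
print(cluster_id_from_url(example))
```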
Note
OAuth user-to-machine (U2M) authentication is supported on Databricks SDK for Python 0.19.0 and above. Update your code project’s installed version of the Databricks SDK for Python to 0.19.0 or above to use OAuth U2M authentication. See Get started with the Databricks SDK for Python.
For OAuth U2M authentication, you must use the Databricks CLI to authenticate before you run your Python code. See the Tutorial.
OAuth machine-to-machine (M2M) authentication is supported on Databricks SDK for Python 0.18.0 and above. Update your code project’s installed version of the Databricks SDK for Python to 0.18.0 or above to use OAuth M2M authentication. See Get started with the Databricks SDK for Python.
The Databricks SDK for Python has not yet implemented Azure managed identities authentication.
Configure a connection to a cluster
There are multiple ways to configure the connection to your cluster. Databricks Connect searches for configuration properties in the following order, and uses the first configuration it finds. For advanced configuration information, see Advanced usage of Databricks Connect for Python.
- The DatabricksSession class’s remote() method.
- A Databricks configuration profile
- The DATABRICKS_CONFIG_PROFILE environment variable
- An environment variable for each configuration property
- A Databricks configuration profile named DEFAULT
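The search order above amounts to "first match wins". The sketch below models that idea with plain dictionaries. It is not how Databricks Connect is implemented: the function, its parameters, and the rule that DATABRICKS_HOST must be set for the environment-variable step to count as "found" are all simplifying assumptions.

```python
from typing import Dict, Mapping, Optional


def resolve_connection(
    remote_args: Optional[Mapping[str, str]],
    profile_in_code: Optional[str],
    env: Mapping[str, str],
    profiles: Mapping[str, Mapping[str, str]],
) -> Dict[str, str]:
    """Return the first configuration found, tagged with where it came from."""
    # 1. Properties passed to DatabricksSession.builder.remote() in code.
    if remote_args:
        return {"source": "remote()", **remote_args}
    # 2. A configuration profile named explicitly in code.
    if profile_in_code and profile_in_code in profiles:
        return {"source": f"profile:{profile_in_code}", **profiles[profile_in_code]}
    # 3. The profile named by the DATABRICKS_CONFIG_PROFILE environment variable.
    env_profile = env.get("DATABRICKS_CONFIG_PROFILE")
    if env_profile and env_profile in profiles:
        return {"source": f"env-profile:{env_profile}", **profiles[env_profile]}
    # 4. One environment variable per property (simplified: host must be set).
    if env.get("DATABRICKS_HOST"):
        found = {"source": "environment", "host": env["DATABRICKS_HOST"]}
        if env.get("DATABRICKS_CLUSTER_ID"):
            found["cluster_id"] = env["DATABRICKS_CLUSTER_ID"]
        return found
    # 5. The configuration profile named DEFAULT.
    if "DEFAULT" in profiles:
        return {"source": "profile:DEFAULT", **profiles["DEFAULT"]}
    raise LookupError("no Databricks Connect configuration found")


# Placeholder profiles, standing in for the contents of a configuration file.
profiles = {
    "DEFAULT": {"host": "https://<default-host>"},
    "DEV": {"host": "https://<dev-host>", "cluster_id": "<cluster-id>"},
}
resolved = resolve_connection(None, None, {"DATABRICKS_CONFIG_PROFILE": "DEV"}, profiles)
print(resolved["source"])  # prints env-profile:DEV
```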
The DatabricksSession class’s remote() method
For this option, which applies to Azure Databricks personal access token authentication only, specify the workspace instance name, the Azure Databricks personal access token, and the ID of the cluster.
You can initialize the DatabricksSession class in several ways:
- Set the host, token, and cluster_id fields in DatabricksSession.builder.remote().
- Use the Databricks SDK’s Config class.
- Specify a Databricks configuration profile along with the cluster_id field.
Instead of specifying these connection properties in your code, Databricks recommends configuring properties through environment variables or configuration files, as described throughout this section. The following code examples assume that you provide some implementation of the proposed retrieve_* functions to get the necessary properties from the user or from some other configuration store, such as Azure Key Vault.
The code for each of these approaches is as follows:
Python
# Set the host, token, and cluster_id fields in DatabricksSession.builder.remote.
# If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
# cluster's ID, you do not also need to set the cluster_id field here.
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.remote(
    host=f"https://{retrieve_workspace_instance_name()}",
    token=retrieve_token(),
    cluster_id=retrieve_cluster_id(),
).getOrCreate()
Scala
// Set the host, token, and clusterId fields in DatabricksSession.builder.
// If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
// cluster's ID, you do not also need to set the clusterId field here.
import com.databricks.connect.DatabricksSession
val spark = DatabricksSession.builder()
  .host(retrieveWorkspaceInstanceName())
  .token(retrieveToken())
  .clusterId(retrieveClusterId())
  .getOrCreate()
Python
# Use the Databricks SDK's Config class.
# If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
# cluster's ID, you do not also need to set the cluster_id field here.
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config
config = Config(
    host=f"https://{retrieve_workspace_instance_name()}",
    token=retrieve_token(),
    cluster_id=retrieve_cluster_id(),
)
spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()
Scala
// Use the Databricks SDK's Config class.
// If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
// cluster's ID, you do not also need to set the clusterId field here.
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig
val config = new DatabricksConfig()
  .setHost(retrieveWorkspaceInstanceName())
  .setToken(retrieveToken())
val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .clusterId(retrieveClusterId())
  .getOrCreate()
Python
# Specify a Databricks configuration profile along with the `cluster_id` field.
# If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
# cluster's ID, you do not also need to set the cluster_id field here.
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config
config = Config(
    profile="<profile-name>",
    cluster_id=retrieve_cluster_id(),
)
spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()
Scala
// Specify a Databricks configuration profile along with the clusterId field.
// If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
// cluster's ID, you do not also need to set the clusterId field here.
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig
val config = new DatabricksConfig()
  .setProfile("<profile-name>")
val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .clusterId(retrieveClusterId())
  .getOrCreate()
A Databricks configuration profile
For this option, create or identify an Azure Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use.
The required configuration profile fields for each authentication type are as follows:
- For Azure Databricks personal access token authentication: host and token.
- For OAuth machine-to-machine (M2M) authentication (where supported): host, client_id, and client_secret.
- For OAuth user-to-machine (U2M) authentication (where supported): host.
- For Microsoft Entra ID (formerly Azure Active Directory) service principal authentication: host, azure_tenant_id, azure_client_id, azure_client_secret, and possibly azure_workspace_resource_id.
- For Azure CLI authentication: host.
- For Azure managed identities authentication (where supported): host, azure_use_msi, azure_client_id, and possibly azure_workspace_resource_id.
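To make the field list concrete, the sketch below renders a profile for personal access token authentication in the INI-style layout that configuration profiles use. The profile name DEV, the helper name, and the placeholder values are hypothetical; Databricks tools conventionally read profiles from a .databrickscfg file in your home directory, and this sketch only produces the text rather than writing that file.

```python
import configparser
import io
from typing import Mapping


def render_profile(name: str, fields: Mapping[str, str]) -> str:
    """Render one configuration profile as INI text."""
    # Interpolation is disabled so that values containing '%' are kept verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser[name] = dict(fields)
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


# Placeholder values: substitute your own workspace, token, and cluster ID.
profile_text = render_profile(
    "DEV",
    {
        "host": "https://<workspace-instance-name>",
        "token": "<personal-access-token>",
        "cluster_id": "<cluster-id>",
    },
)
print(profile_text)
```

For other authentication types, swap the token field for the fields listed above, for example client_id and client_secret for OAuth M2M.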
Then set the name of this configuration profile through the configuration class.
You can specify cluster_id in a couple of ways:
- Include the cluster_id field in your configuration profile, and then just specify the configuration profile’s name.
- Specify the configuration profile name along with the cluster_id field.
If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.
The code for each of these approaches is as follows:
Python
# Include the cluster_id field in your configuration profile, and then
# just specify the configuration profile's name:
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.profile("<profile-name>").getOrCreate()
Scala
// Include the cluster_id field in your configuration profile, and then
// just specify the configuration profile's name:
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig
val config = new DatabricksConfig()
  .setProfile("<profile-name>")
val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .getOrCreate()
Python
# Specify the configuration profile name along with the cluster_id field.
# In this example, retrieve_cluster_id() assumes some custom implementation that
# you provide to get the cluster ID from the user or from some other
# configuration store:
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config
config = Config(
    profile="<profile-name>",
    cluster_id=retrieve_cluster_id(),
)
spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()
Scala
// Specify a Databricks configuration profile along with the clusterId field.
// If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
// cluster's ID, you do not also need to set the clusterId field here.
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig
val config = new DatabricksConfig()
  .setProfile("<profile-name>")
val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .clusterId(retrieveClusterId())
  .getOrCreate()
The DATABRICKS_CONFIG_PROFILE environment variable
For this option, create or identify an Azure Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use.
If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.
The required configuration profile fields for each authentication type are as follows:
- For Azure Databricks personal access token authentication: host and token.
- For OAuth machine-to-machine (M2M) authentication (where supported): host, client_id, and client_secret.
- For OAuth user-to-machine (U2M) authentication (where supported): host.
- For Microsoft Entra ID (formerly Azure Active Directory) service principal authentication: host, azure_tenant_id, azure_client_id, azure_client_secret, and possibly azure_workspace_resource_id.
- For Azure CLI authentication: host.
- For Azure managed identities authentication (where supported): host, azure_use_msi, azure_client_id, and possibly azure_workspace_resource_id.
Set the DATABRICKS_CONFIG_PROFILE environment variable to the name of this configuration profile. Then initialize the DatabricksSession class:
Python
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
Scala
import com.databricks.connect.DatabricksSession
val spark = DatabricksSession.builder().getOrCreate()
An environment variable for each configuration property
For this option, set the DATABRICKS_CLUSTER_ID environment variable and any other environment variables that are necessary for the Databricks authentication type that you want to use.
The required environment variables for each authentication type are as follows:
- For Azure Databricks personal access token authentication: DATABRICKS_HOST and DATABRICKS_TOKEN.
- For OAuth machine-to-machine (M2M) authentication (where supported): DATABRICKS_HOST, DATABRICKS_CLIENT_ID, and DATABRICKS_CLIENT_SECRET.
- For OAuth user-to-machine (U2M) authentication (where supported): DATABRICKS_HOST.
- For Microsoft Entra ID (formerly Azure Active Directory) service principal authentication: DATABRICKS_HOST, ARM_TENANT_ID, ARM_CLIENT_ID, ARM_CLIENT_SECRET, and possibly DATABRICKS_AZURE_RESOURCE_ID.
- For Azure CLI authentication: DATABRICKS_HOST.
- For Azure managed identities authentication (where supported): DATABRICKS_HOST, ARM_USE_MSI, ARM_CLIENT_ID, and possibly DATABRICKS_AZURE_RESOURCE_ID.
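Before creating a session, it can help to confirm that the variables listed above are actually set. The following is a minimal sketch of such a check: the auth-type labels (pat, oauth-m2m, and so on) and the function name are hypothetical, and the optional DATABRICKS_AZURE_RESOURCE_ID variable is deliberately left out of the required sets.

```python
import os
from typing import Dict, List, Mapping

# Required variables per authentication type, taken from the list above.
# The dictionary keys are illustrative labels, not official identifiers.
REQUIRED_ENV_VARS: Dict[str, List[str]] = {
    "pat": ["DATABRICKS_HOST", "DATABRICKS_TOKEN"],
    "oauth-m2m": ["DATABRICKS_HOST", "DATABRICKS_CLIENT_ID", "DATABRICKS_CLIENT_SECRET"],
    "oauth-u2m": ["DATABRICKS_HOST"],
    "azure-service-principal": [
        "DATABRICKS_HOST", "ARM_TENANT_ID", "ARM_CLIENT_ID", "ARM_CLIENT_SECRET",
    ],
    "azure-cli": ["DATABRICKS_HOST"],
    "azure-msi": ["DATABRICKS_HOST", "ARM_USE_MSI", "ARM_CLIENT_ID"],
}


def missing_env_vars(auth_type: str, env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty for this auth type."""
    required = ["DATABRICKS_CLUSTER_ID"] + REQUIRED_ENV_VARS[auth_type]
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real process environment.
sample_env = {"DATABRICKS_HOST": "https://<workspace-instance-name>"}
print(missing_env_vars("pat", sample_env))
```

Passing the environment as a parameter keeps the check easy to exercise without changing the real process environment.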
Then initialize the DatabricksSession class:
Python
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
Scala
import com.databricks.connect.DatabricksSession
val spark = DatabricksSession.builder().getOrCreate()
A Databricks configuration profile named DEFAULT
For this option, create or identify an Azure Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use.
If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.
The required configuration profile fields for each authentication type are as follows:
- For Azure Databricks personal access token authentication: host and token.
- For OAuth machine-to-machine (M2M) authentication (where supported): host, client_id, and client_secret.
- For OAuth user-to-machine (U2M) authentication (where supported): host.
- For Microsoft Entra ID (formerly Azure Active Directory) service principal authentication: host, azure_tenant_id, azure_client_id, azure_client_secret, and possibly azure_workspace_resource_id.
- For Azure CLI authentication: host.
- For Azure managed identities authentication (where supported): host, azure_use_msi, azure_client_id, and possibly azure_workspace_resource_id.
Name this configuration profile DEFAULT.
Then initialize the DatabricksSession class:
Python
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
Scala
import com.databricks.connect.DatabricksSession
val spark = DatabricksSession.builder().getOrCreate()
Configure a connection to serverless compute
Important
This feature is in Public Preview.
Databricks Connect for Python supports connecting to serverless compute. To use this feature, requirements for connecting to serverless must be met. See Requirements.
Important
This feature has the following limitations:
- This feature is only supported in Databricks Connect for Python.
- All of the Databricks Connect for Python limitations
- All of the serverless compute limitations
- Only Python dependencies that are included as part of serverless compute environment can be used for UDFs. See Serverless client images. Additional dependencies cannot be installed.
- UDFs with custom modules are not supported.
You can configure a connection to serverless compute in one of the following ways:
- Set the local environment variable DATABRICKS_SERVERLESS_COMPUTE_ID to auto. If this environment variable is set, Databricks Connect ignores the cluster_id.
- In a local Databricks configuration profile, set serverless_compute_id = auto, then reference that profile from your code.
[DEFAULT]
host = https://my-workspace.cloud.databricks.com/
serverless_compute_id = auto
token = dapi123...
Or use either of the following options:
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.serverless(True).getOrCreate()
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.remote(serverless=True).getOrCreate()
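The rule stated earlier in this section, that setting DATABRICKS_SERVERLESS_COMPUTE_ID to auto makes Databricks Connect ignore the cluster_id, can be pictured as follows. This is only an illustrative model with a hypothetical function name, not the library's own selection logic.

```python
from typing import Mapping, Optional, Tuple


def select_compute(
    env: Mapping[str, str], cluster_id: Optional[str] = None
) -> Tuple[str, Optional[str]]:
    """Return ('serverless', None) or ('cluster', <id>) for the given settings."""
    # Serverless takes priority: any cluster ID is ignored when it is requested.
    if env.get("DATABRICKS_SERVERLESS_COMPUTE_ID") == "auto":
        return ("serverless", None)
    chosen = cluster_id or env.get("DATABRICKS_CLUSTER_ID")
    if chosen:
        return ("cluster", chosen)
    raise ValueError("neither serverless compute nor a cluster ID is configured")


print(select_compute({"DATABRICKS_SERVERLESS_COMPUTE_ID": "auto",
                      "DATABRICKS_CLUSTER_ID": "<cluster-id>"}))
```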
Note
The serverless compute session times out after 10 minutes of inactivity. After this, a new Spark session should be created using getOrCreate() to connect to serverless compute.
Validate the connection to Databricks
To validate that your environment, default credentials, and connection to compute are correctly set up for Databricks Connect, run the databricks-connect test command. It fails with a non-zero exit code and a corresponding error message when it detects any incompatibility in the setup.
databricks-connect test
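Because the command reports problems through its exit code, it is straightforward to call from a script, for example as a pre-flight check. The wrapper below is a minimal sketch using only the Python standard library; the function name is hypothetical, and the only assumption about the command is the exit-code behavior described above.

```python
import subprocess
from typing import Sequence, Tuple


def run_check(command: Sequence[str] = ("databricks-connect", "test")) -> Tuple[bool, str]:
    """Run a validation command and return (succeeded, combined output)."""
    try:
        result = subprocess.run(list(command), capture_output=True, text=True)
    except FileNotFoundError:
        # The executable is not on PATH, for example because the package is not installed.
        return False, f"command not found: {command[0]}"
    # A zero exit code means the command detected no incompatibility.
    return result.returncode == 0, result.stdout + result.stderr


ok, output = run_check()
print("validation passed" if ok else "validation failed")
print(output)
```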
In Databricks Connect 14.3 and above, you can also validate your environment using validateSession():
DatabricksSession.builder.validateSession(True).getOrCreate()
Disabling Databricks Connect
Databricks Connect (and the underlying Spark Connect) services can be disabled on any given cluster.
To disable the Databricks Connect service, set the following Spark configuration on the cluster.
spark.databricks.service.server.enabled false