databricks_step Module

Contains functionality to create an Azure ML pipeline step to run a Databricks notebook or Python script on DBFS.

Classes

DatabricksStep

Creates an Azure ML Pipeline step to add a Databricks notebook, Python script, or JAR as a node.

For an example of using DatabricksStep, see the notebook https://aka.ms/pl-databricks.

:param python_script_name: [Required] The name of a Python script relative to source_directory. If the script takes inputs and outputs, they are passed to the script as parameters. If python_script_name is specified, source_directory must be specified too.

Specify exactly one of notebook_path, python_script_path, python_script_name, or main_class_name.
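The two constraints above (exactly one source parameter, and source_directory accompanying python_script_name) can be checked before building a step. The following is a minimal sketch only: `check_script_source` is a hypothetical helper written for illustration, not part of the Azure ML SDK, and it does not claim to reproduce the SDK's own validation.

```python
from typing import Optional


def check_script_source(
    notebook_path: Optional[str] = None,
    python_script_path: Optional[str] = None,
    python_script_name: Optional[str] = None,
    main_class_name: Optional[str] = None,
    source_directory: Optional[str] = None,
) -> str:
    """Hypothetical pre-check; returns the name of the one parameter that was set."""
    candidates = {
        "notebook_path": notebook_path,
        "python_script_path": python_script_path,
        "python_script_name": python_script_name,
        "main_class_name": main_class_name,
    }
    chosen = [name for name, value in candidates.items() if value is not None]
    if len(chosen) != 1:
        raise ValueError(
            "Specify exactly one of: " + ", ".join(candidates) + "; got " + str(chosen)
        )
    # python_script_name is resolved relative to source_directory,
    # so the two must be given together.
    if chosen[0] == "python_script_name" and source_directory is None:
        raise ValueError("python_script_name requires source_directory")
    return chosen[0]
```

Calling it with the same keyword arguments you intend to pass to DatabricksStep surfaces a conflicting combination early, before the pipeline is submitted.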

If you specify a DataReference object as input with data_reference_name=input1 and a PipelineData object as output with name=output1, the inputs and outputs are passed to the script as parameters. They look like the following, and you need to parse the arguments in your script to access the path of each input and output: "-input1","wasbs://[email protected]/test","-output1","wasbs://[email protected]/b3e26de1-87a4-494d-a20f-1988d22b81a2/output1"
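One way to read these arguments inside the script is with the standard library's argparse. This is a sketch under assumptions: the option names `-input1` and `-output1` come from the example above, while the function name `parse_io_paths` and the sample paths in the usage note are made up for illustration.

```python
import argparse
from typing import List, Optional, Tuple


def parse_io_paths(argv: Optional[List[str]] = None) -> Tuple[str, str]:
    """Return (input1_path, output1_path) parsed from the script arguments."""
    parser = argparse.ArgumentParser()
    # The step passes each name with a single leading dash, followed by its path.
    parser.add_argument("-input1", required=True)
    parser.add_argument("-output1", required=True)
    # parse_known_args ignores any other parameters passed to the script.
    args, _unknown = parser.parse_known_args(argv)
    return args.input1, args.output1
```

In a real script you would call `parse_io_paths()` with no arguments so that it reads `sys.argv`; passing a list such as `["-input1", "wasbs://in/test", "-output1", "wasbs://out/output1"]` is convenient for local testing.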

In addition, the following parameters will be available within the script:

  • AZUREML_RUN_TOKEN: The AML token for authenticating with Azure Machine Learning.
  • AZUREML_RUN_TOKEN_EXPIRY: The AML token expiry time.
  • AZUREML_RUN_ID: Azure Machine Learning Run ID for this run.
  • AZUREML_ARM_SUBSCRIPTION: Azure subscription for your AML workspace.
  • AZUREML_ARM_RESOURCEGROUP: Azure resource group for your Azure Machine Learning workspace.
  • AZUREML_ARM_WORKSPACE_NAME: Name of your Azure Machine Learning workspace.
  • AZUREML_ARM_PROJECT_NAME: Name of your Azure Machine Learning experiment.
  • AZUREML_SERVICE_ENDPOINT: The endpoint URL for AML services.
  • AZUREML_WORKSPACE_ID: ID of your Azure Machine Learning workspace.
  • AZUREML_EXPERIMENT_ID: ID of your Azure Machine Learning experiment.
  • AZUREML_SCRIPT_DIRECTORY_NAME: Directory path in DBFS where source_directory has been copied.
  (This parameter is only populated when `python_script_name` is used. See more details below.)
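The parameters listed above can be gathered into a dictionary inside the script. This sketch assumes each one arrives on the command line as a `--NAME value` pair; that form is an assumption made here, and `collect_azureml_params` is a hypothetical helper, not an SDK function.

```python
from typing import Dict, List


def collect_azureml_params(argv: List[str]) -> Dict[str, str]:
    """Collect '--AZUREML_* value' pairs from an argument list into a dict."""
    params: Dict[str, str] = {}
    i = 0
    while i < len(argv) - 1:
        token = argv[i]
        if token.startswith("--AZUREML_"):
            # Drop the leading dashes and store the value that follows.
            params[token[2:]] = argv[i + 1]
            i += 2
        else:
            i += 1
    return params
```

A script would typically call `collect_azureml_params(sys.argv[1:])` and then look up individual entries, for example `params.get("AZUREML_RUN_ID")`.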

When you run a Python script from your local machine on Databricks using the DatabricksStep parameters source_directory and python_script_name, your source_directory is copied to DBFS, and the directory path on DBFS is passed as a parameter to your script when it begins execution. This parameter is labelled --AZUREML_SCRIPT_DIRECTORY_NAME. You need to prefix its value with the string "dbfs:/" or "/dbfs/" to access the directory in DBFS.
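The prefixing step can be sketched as follows. It assumes the parameter is passed with two leading dashes, and it strips any leading slash from the value before adding the prefix, since the exact shape of the value is not stated here. `script_directory_paths` is a hypothetical name used for illustration.

```python
import argparse
from typing import List, Optional, Tuple


def script_directory_paths(argv: Optional[List[str]] = None) -> Tuple[str, str]:
    """Return the copied source_directory as ('dbfs:/...', '/dbfs/...') paths."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--AZUREML_SCRIPT_DIRECTORY_NAME", required=True)
    # Other parameters passed to the script are ignored here.
    args, _unknown = parser.parse_known_args(argv)
    # Assumption: normalize away a leading slash so the prefix is not doubled.
    relative = args.AZUREML_SCRIPT_DIRECTORY_NAME.lstrip("/")
    return "dbfs:/" + relative, "/dbfs/" + relative
```

The "dbfs:/" form is the one to hand to Spark and Databricks utilities, while the "/dbfs/" form is the one to use with ordinary local file APIs such as `open()`.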