Connect to storage services on Azure with datastores

APPLIES TO: Python SDK azureml v1

APPLIES TO: Azure CLI ml extension v1

In this article, learn how to connect to data storage services on Azure with Azure Machine Learning datastores and the Azure Machine Learning Python SDK.

Datastores securely connect to your storage service on Azure, and they avoid risk to your authentication credentials or the integrity of your original data store. A datastore stores connection information - for example, your subscription ID or token authorization - in the Key Vault associated with the workspace. With a datastore, you can securely access your storage because you can avoid hard-coding connection information in your scripts. You can create datastores that connect to these Azure storage solutions.

For information that describes how datastores fit into the overall Azure Machine Learning data access workflow, visit the Securely access data article.

To learn how to connect to a data storage resource with a UI, visit Connect to data storage with the studio UI.

Tip

This article assumes that you will connect to your storage service with credential-based authentication - for example, a service principal or a shared access signature (SAS) token. Note that if credentials are registered with datastores, all users with the workspace Reader role can retrieve those credentials. For more information, visit Manage roles in your workspace.

For more information about identity-based data access, visit Identity-based data access to storage services (v1).

Prerequisites

  • An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning.

  • An Azure storage account with a supported storage type

  • The Azure Machine Learning SDK for Python

  • An Azure Machine Learning workspace.

    Create an Azure Machine Learning workspace, or use an existing workspace via the Python SDK

    Import the Workspace and Datastore classes, and load your subscription information from the config.json file with the from_config() function. By default, the function looks for the JSON file in the current directory, but you can also specify a path parameter to point to the file with from_config(path="your/file/path"):

    import azureml.core
    from azureml.core import Workspace, Datastore
    
    ws = Workspace.from_config()
    

    Workspace creation automatically registers an Azure blob container and an Azure file share, as datastores, to the workspace. They're named workspaceblobstore and workspacefilestore, respectively. The workspaceblobstore stores workspace artifacts and your machine learning experiment logs. It serves as the default datastore and can't be deleted from the workspace. The workspacefilestore stores notebooks and R scripts authorized via compute instance.

    Note

    Azure Machine Learning designer automatically creates a datastore named azureml_globaldatasets when you open a sample in the designer homepage. This datastore only contains sample datasets. Please do not use this datastore for any confidential data access.
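As a minimal sketch of the from_config() step described in the prerequisites, the following helper checks that a config.json file contains the three keys the v1 SDK reads from it (subscription_id, resource_group, and workspace_name) before you try to load the workspace. The helper name check_workspace_config is hypothetical and isn't part of the SDK.

```python
import json
from pathlib import Path

# Keys that a workspace config.json file is expected to contain.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def check_workspace_config(path="config.json"):
    """Return the required keys that are missing or empty in the config file.

    Hypothetical helper for illustration; it only reads the JSON file and
    doesn't contact Azure.
    """
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Usage (assumes the azureml-core package is installed and the file is valid):
# if not check_workspace_config("your/file/path/config.json"):
#     from azureml.core import Workspace
#     ws = Workspace.from_config(path="your/file/path/config.json")
```

A check like this can make a missing or incomplete configuration file easier to diagnose than an error raised during workspace loading.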

Supported data storage service types

Datastores currently support storage of connection information to the storage services listed in this matrix:

Tip

For unsupported storage solutions (those not listed in the following table), you might encounter issues as you connect and work with your data. We suggest that you move your data to a supported Azure storage solution. This can also help with additional scenarios - for example, reduction of data egress cost during ML experiments.

Storage type | Authentication type | Azure Machine Learning studio | Azure Machine Learning Python SDK | Azure Machine Learning CLI | Azure Machine Learning REST API | VS Code
Azure Blob Storage | Account key, SAS token
Azure File Share | Account key, SAS token
Azure Data Lake Storage Gen 1 | Service principal
Azure Data Lake Storage Gen 2 | Service principal
Azure SQL Database | SQL authentication, Service principal
Azure PostgreSQL | SQL authentication
Azure Database for MySQL | SQL authentication | ✓* ✓* ✓*
Databricks File System | No authentication | ✓** ✓** ✓**

Storage guidance

We recommend creation of a datastore for an Azure Blob container. Both standard and premium storage are available for blobs. Although premium storage is more expensive, its faster throughput speeds might improve the speed of your training runs, especially if you train against a large dataset. For information about storage account costs, visit the Azure pricing calculator.

Azure Data Lake Storage Gen2 is built on top of Azure Blob storage. It's designed for enterprise big data analytics. As part of Data Lake Storage Gen2, Blob storage features a hierarchical namespace. The hierarchical namespace organizes objects/files into a hierarchy of directories for efficient data access.

Storage access and permissions

To ensure you securely connect to your Azure storage service, Azure Machine Learning requires that you have permission to access the corresponding data storage container. This access depends on the authentication credentials used to register the datastore.

Note

This guidance also applies to datastores created with identity-based data access.

Virtual network

To communicate with a storage account located behind a firewall or within a virtual network, Azure Machine Learning requires extra configuration steps. For a storage account located behind a firewall, you can add your client's IP address to an allowlist with the Azure portal.

Azure Machine Learning can receive requests from clients outside of the virtual network. To ensure that the entity requesting data from the service is safe, and to enable display of data in your workspace, use a private endpoint with your workspace.

For Python SDK users: To access your data on a compute target with your training script, you must locate the compute target inside the same virtual network and subnet of the storage. You can use a compute instance/cluster in the same virtual network.

For Azure Machine Learning studio users: Several features rely on the ability to read data from a dataset - for example, dataset previews, profiles, and automated machine learning. For these features to work with storage behind virtual networks, use a workspace managed identity in the studio to allow Azure Machine Learning to access the storage account from outside the virtual network.

Note

For data stored in an Azure SQL Database behind a virtual network, set Deny public access to No with the Azure portal, to allow Azure Machine Learning to access the storage account.

Access validation

Warning

Cross tenant access to storage accounts is not supported. If your scenario needs cross tenant access, reach out to the Azure Machine Learning Data Support team alias at [email protected] for assistance with a custom code solution.

As part of the initial datastore creation and registration process, Azure Machine Learning automatically validates that the underlying storage service exists and that the user-provided principal (username, service principal, or SAS token) can access the specified storage.

After datastore creation, this validation is only performed for methods that require access to the underlying storage container, not each time datastore objects are retrieved. For example, validation happens if you want to download files from your datastore. However, if you only want to change your default datastore, then validation doesn't happen.

To authenticate your access to the underlying storage service, you can provide your account key, shared access signature (SAS) token, or service principal in the corresponding register_azure_*() method of the datastore type you want to create. The storage type matrix lists the supported authentication types that correspond to each datastore type.

You can find account key, SAS token, and service principal information in the Azure portal.

  • To use an account key or SAS token for authentication, select Storage Accounts on the left pane, and choose the storage account that you want to register

    • The Overview page provides information such as the account name, file share name, and container
      • For account keys, go to Access keys on the Settings pane
      • For SAS tokens, go to Shared access signatures on the Settings pane
  • To use a service principal for authentication, go to your App registrations and select the app you want to use

    • The corresponding Overview page of the selected app contains required information - for example, tenant ID and client ID

Important

To change your access keys for an Azure Storage account (account key or SAS token), sync the new credentials with your workspace and the datastores connected to it. For more information, visit sync your updated credentials.

Permissions

For Azure blob container and Azure Data Lake Gen 2 storage, ensure that your authentication credentials have Storage Blob Data Reader access. For more information, visit Storage Blob Data Reader. An account SAS token defaults to no permissions.

  • For data read access, your authentication credentials must have a minimum of list and read permissions for containers and objects

  • Data write access also requires write and add permissions
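The permission requirements above can be checked before registration with a short sketch like the following. It assumes the SAS token is a query string whose sp field lists the granted permissions as single letters (r for read, l for list, w for write, a for add), which is how Azure Storage encodes them. The function names are hypothetical, and the sketch inspects only the permission letters, not the resource types or the expiry time.

```python
from urllib.parse import parse_qs

# Permission letters needed for each access level described above.
READ_PERMISSIONS = {"r", "l"}             # read and list
WRITE_PERMISSIONS = {"r", "l", "w", "a"}  # read, list, write, and add


def sas_permissions(sas_token):
    """Return the set of permission letters in the token's `sp` field."""
    query = parse_qs(sas_token.lstrip("?"))
    return set(query.get("sp", [""])[0])


def missing_permissions(sas_token, required):
    """Return the required permission letters that the token doesn't grant."""
    return set(required) - sas_permissions(sas_token)


# Example with a placeholder token (the signature isn't real).
token = "?sv=2022-11-02&ss=b&srt=co&sp=rl&se=2030-01-01T00:00:00Z&sig=placeholder"
print(missing_permissions(token, READ_PERMISSIONS))   # empty set: read access is covered
print(missing_permissions(token, WRITE_PERMISSIONS))  # the letters w and a are missing
```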

Create and register datastores

Registration of an Azure storage solution as a datastore automatically creates and registers that datastore to a specific workspace. Review storage access & permissions in this document for guidance about virtual network scenarios, and where to find required authentication credentials.

This section offers examples that describe how to create and register a datastore via the Python SDK for these storage types: Azure blob container, Azure file share, and Azure Data Lake Storage Generation 2. The parameters shown in these examples are the required parameters to create and register a datastore.

To create datastores for other supported storage services, visit the reference documentation for the applicable register_azure_* methods.

To learn how to connect to a data storage resource with a UI, visit Connect to data with Azure Machine Learning studio.

Important

If you unregister and re-register a datastore with the same name, and the re-registration fails, the Azure Key Vault for your workspace may not have soft-delete enabled. By default, soft-delete is enabled for the key vault instance created by your workspace, but it may not be enabled if you used an existing key vault or have a workspace created before October 2020. For information that describes how to enable soft-delete, see Turn on Soft Delete for an existing key vault.

Note

A datastore name should only contain lowercase letters, digits and underscores.
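The naming rule in the note above can be checked locally before you call a register method. This is a minimal sketch that encodes only the rule as stated here (lowercase letters, digits, and underscores); the function name is hypothetical, and the service might enforce additional limits, such as a maximum length, that this check doesn't cover.

```python
import re

# Lowercase letters, digits, and underscores only, per the note above.
_DATASTORE_NAME = re.compile(r"[a-z0-9_]+")


def is_valid_datastore_name(name):
    """Return True if the name uses only the characters the note allows."""
    return _DATASTORE_NAME.fullmatch(name) is not None


print(is_valid_datastore_name("azblobsdk"))      # True
print(is_valid_datastore_name("My-Blob-Store"))  # False
```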

Azure blob container

To register an Azure blob container as a datastore, use the register_azure_blob_container() method.

This code sample creates and registers the blob_datastore_name datastore to the ws workspace. The datastore uses the provided account access key to access the my-container-name blob container on the my-account-name storage account. Review the storage access & permissions section for guidance about virtual network scenarios, and where to find required authentication credentials.

import os

blob_datastore_name='azblobsdk' # Name to give the datastore in the workspace
container_name=os.getenv("BLOB_CONTAINER", "<my-container-name>") # Name of Azure blob container
account_name=os.getenv("BLOB_ACCOUNTNAME", "<my-account-name>") # Storage account name
account_key=os.getenv("BLOB_ACCOUNT_KEY", "<my-account-key>") # Storage account access key

blob_datastore = Datastore.register_azure_blob_container(workspace=ws, 
                                                         datastore_name=blob_datastore_name, 
                                                         container_name=container_name, 
                                                         account_name=account_name,
                                                         account_key=account_key)
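The storage type matrix lists a SAS token as an alternative to the account key for blob containers, and register_azure_blob_container() accepts a sas_token parameter for that case. The following sketch picks one credential from the environment and fails early if none is set, instead of registering with a placeholder value. The helper name and the BLOB_SAS_TOKEN variable are hypothetical, and rejecting the case where both credentials are set is a design choice of this sketch, not an SDK requirement.

```python
import os


def blob_credential_kwargs(env=None):
    """Return the credential keyword argument for the register call.

    Hypothetical helper: reads BLOB_ACCOUNT_KEY (used in the sample above)
    and BLOB_SAS_TOKEN (an assumed variable name) from the environment.
    """
    env = os.environ if env is None else env
    account_key = env.get("BLOB_ACCOUNT_KEY")
    sas_token = env.get("BLOB_SAS_TOKEN")
    if account_key and sas_token:
        raise ValueError("Set only one of BLOB_ACCOUNT_KEY or BLOB_SAS_TOKEN.")
    if account_key:
        return {"account_key": account_key}
    if sas_token:
        return {"sas_token": sas_token}
    raise ValueError("No storage credential found in the environment.")


# Usage (assumes ws, Datastore, container_name, and account_name are defined
# as in the blob container sample):
# blob_datastore = Datastore.register_azure_blob_container(
#     workspace=ws,
#     datastore_name="azblobsdk",
#     container_name=container_name,
#     account_name=account_name,
#     **blob_credential_kwargs())
```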

Azure file share

To register an Azure file share as a datastore, use the register_azure_file_share() method.

This code sample creates and registers the file_datastore_name datastore to the ws workspace. The datastore uses the my-fileshare-name file share on the my-account-name storage account, with the provided account access key. Review the storage access & permissions section for guidance about virtual network scenarios, and where to find required authentication credentials.

import os

file_datastore_name='azfilesharesdk' # Name to give the datastore in the workspace
file_share_name=os.getenv("FILE_SHARE_CONTAINER", "<my-fileshare-name>") # Name of Azure file share container
account_name=os.getenv("FILE_SHARE_ACCOUNTNAME", "<my-account-name>") # Storage account name
account_key=os.getenv("FILE_SHARE_ACCOUNT_KEY", "<my-account-key>") # Storage account access key

file_datastore = Datastore.register_azure_file_share(workspace=ws,
                                                     datastore_name=file_datastore_name, 
                                                     file_share_name=file_share_name, 
                                                     account_name=account_name,
                                                     account_key=account_key)

Azure Data Lake Storage Generation 2

For an Azure Data Lake Storage Generation 2 (ADLS Gen 2) datastore, use the register_azure_data_lake_gen2() method to register a credential datastore connected to an Azure Data Lake Gen 2 storage with service principal permissions.

To use your service principal, you must register your application and grant the service principal data access via either Azure role-based access control (Azure RBAC) or access control lists (ACL). For more information, visit access control set up for ADLS Gen 2.

This code creates and registers the adlsgen2_datastore_name datastore to the ws workspace. This datastore accesses the file system test in the account_name storage account, through use of the provided service principal credentials. Review the storage access & permissions section for guidance on virtual network scenarios, and where to find required authentication credentials.

import os

adlsgen2_datastore_name = 'adlsgen2datastore'

subscription_id=os.getenv("ADL_SUBSCRIPTION", "<my_subscription_id>") # subscription id of ADLS account
resource_group=os.getenv("ADL_RESOURCE_GROUP", "<my_resource_group>") # resource group of ADLS account

account_name=os.getenv("ADLSGEN2_ACCOUNTNAME", "<my_account_name>") # ADLS Gen2 account name
tenant_id=os.getenv("ADLSGEN2_TENANT", "<my_tenant_id>") # tenant id of service principal
client_id=os.getenv("ADLSGEN2_CLIENTID", "<my_client_id>") # client id of service principal
client_secret=os.getenv("ADLSGEN2_CLIENT_SECRET", "<my_client_secret>") # the secret of service principal

adlsgen2_datastore = Datastore.register_azure_data_lake_gen2(workspace=ws,
                                                             datastore_name=adlsgen2_datastore_name,
                                                             account_name=account_name, # ADLS Gen2 account name
                                                             filesystem='test', # ADLS Gen2 filesystem
                                                             tenant_id=tenant_id, # tenant id of service principal
                                                             client_id=client_id, # client id of service principal
                                                             client_secret=client_secret) # the secret of service principal

Create datastores with other Azure tools

In addition to datastore creation with the Python SDK and the studio, you can also create datastores with Azure Resource Manager templates or the Azure Machine Learning VS Code extension.

Azure Resource Manager

You can use several templates at https://github.com/Azure/azure-quickstart-templates/tree/master/quickstarts/microsoft.machinelearningservices to create datastores. For information about these templates, visit Use an Azure Resource Manager template to create a workspace for Azure Machine Learning.

VS Code extension

For more information about creation and management of datastores with the Azure Machine Learning VS Code extension, visit the VS Code resource management how-to guide.

Use data in your datastores

After datastore creation, create an Azure Machine Learning dataset to interact with your data. A dataset packages your data into a lazily evaluated consumable object for machine learning tasks, like training. With datasets, you can download or mount files of any format from Azure storage services for model training on a compute target. Learn more about how to train ML models with datasets.

Get datastores from your workspace

To get a specific datastore registered in the current workspace, use the get() static method on the Datastore class:

# Get a named datastore from the current workspace
datastore = Datastore.get(ws, datastore_name='your datastore name')

To get the list of datastores registered with a given workspace, use the datastores property on a workspace object:

# List all datastores registered in the current workspace
datastores = ws.datastores
for name, datastore in datastores.items():
    print(name, datastore.datastore_type)
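When a workspace has many datastores, it can help to group them by storage type. The following sketch builds on the listing above. The function name group_by_type is hypothetical, and it assumes only that each value in the mapping exposes a datastore_type attribute, as the objects in ws.datastores do.

```python
from collections import defaultdict


def group_by_type(datastores):
    """Map each datastore type to a sorted list of datastore names."""
    grouped = defaultdict(list)
    for name, datastore in datastores.items():
        grouped[datastore.datastore_type].append(name)
    return {kind: sorted(names) for kind, names in grouped.items()}


# Usage (assumes ws is a loaded Workspace object):
# for kind, names in group_by_type(ws.datastores).items():
#     print(kind, names)
```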

This code sample shows how to get the default datastore of the workspace:

datastore = ws.get_default_datastore()

You can also change the default datastore with this code sample. Only the SDK supports this ability:

ws.set_default_datastore(new_default_datastore)

Access data during scoring

Azure Machine Learning provides several ways to use your models for scoring. Some of these methods provide no access to datastores. The following table describes which methods allow access to datastores during scoring:

Method | Datastore access | Description
Batch prediction | ✓ | Make predictions on large quantities of data asynchronously.
Web service | | Deploy models as a web service.

When the SDK doesn't provide access to datastores, you might be able to create custom code with the relevant Azure SDK to access the data. For example, the Azure Storage SDK for Python client library can access data stored in blobs or files.
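As one hedged illustration of that custom-code route, the following sketch reads a single blob with the azure-storage-blob package (version 12 or later is assumed). The function names are hypothetical. The endpoint format assumes the Azure public cloud, and the credential can be an account key or a SAS token string.

```python
def blob_account_url(account_name):
    """Build the default public-cloud Blob endpoint for a storage account."""
    return f"https://{account_name}.blob.core.windows.net"


def read_blob_bytes(account_name, container_name, blob_name, credential):
    """Download the content of one blob and return it as bytes."""
    # Imported here so the module still loads where the package isn't installed.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient(account_url=blob_account_url(account_name),
                                credential=credential)
    blob = service.get_blob_client(container=container_name, blob=blob_name)
    return blob.download_blob().readall()


# Usage inside a scoring script (all values are placeholders):
# data = read_blob_bytes("<my-account-name>", "<my-container-name>",
#                        "path/to/input.csv", "<my-account-key>")
```

In a deployed web service, consider loading the credential from a secret store or an environment variable instead of hard-coding it in the scoring script.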

Move data to supported Azure storage solutions

Azure Machine Learning supports accessing data from:

  • Azure Blob storage
  • Azure Files
  • Azure Data Lake Storage Gen1
  • Azure Data Lake Storage Gen2
  • Azure SQL Database
  • Azure Database for PostgreSQL

If you use unsupported storage, we recommend that you use Azure Data Factory and these steps to move your data to supported Azure storage solutions. Moving data to supported storage can help you save data egress costs during machine learning experiments.

Azure Data Factory provides efficient and resilient data transfer, with more than 80 prebuilt connectors, at no extra cost. These connectors include Azure data services, on-premises data sources, Amazon S3 and Redshift, and Google BigQuery.

Next steps