DatasetConsumptionConfig Class

Represent how to deliver the dataset to the compute target.

Inheritance
builtins.object
DatasetConsumptionConfig

Constructor

DatasetConsumptionConfig(name, dataset, mode='direct', path_on_compute=None)

Parameters

Name Description
name
Required
str

The name of the dataset in the run, which can be different from the registered name. The name will be registered as an environment variable and can be used in the data plane.

dataset
Required

The dataset to be delivered, as a Dataset object, a PipelineParameter that ingests a Dataset, a tuple of (workspace, dataset name), or a tuple of (workspace, dataset name, dataset version). If only a name is provided, the DatasetConsumptionConfig will use the latest version of the dataset.

mode
str

Defines how the dataset should be delivered to the compute target. There are four modes:

  1. 'direct': consume the dataset as a dataset.
  2. 'download': download the dataset and consume it as a downloaded path.
  3. 'mount': mount the dataset and consume it as a mount path.
  4. 'hdfs': consume the dataset from the resolved HDFS path (currently only supported on SynapseSpark compute).

Default value: direct

path_on_compute
str

The target path on the compute to make the data available at. The folder structure of the source data will be kept; however, prefixes might be added to this folder structure to avoid collisions. Use tabular_dataset.to_path to see the output folder structure.

Default value: None
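
The following is a minimal sketch of constructing the class directly, assuming a configured workspace and registered datasets; the dataset and input names ('my_tabular_dataset', 'my_file_dataset', 'training_data', 'raw_files') are illustrative.

   from azureml.core import Dataset, Workspace
   from azureml.data.dataset_consumption_config import DatasetConsumptionConfig

   ws = Workspace.from_config()

   # 'direct' (the default mode) keeps the data as a dataset object, e.g. for tabular data.
   tabular_ds = Dataset.get_by_name(ws, name='my_tabular_dataset')
   direct_input = DatasetConsumptionConfig('training_data', tabular_ds)

   # File datasets are typically downloaded or mounted; path_on_compute is optional.
   file_ds = Dataset.get_by_name(ws, name='my_file_dataset')
   download_input = DatasetConsumptionConfig('raw_files', file_ds, mode='download',
                                             path_on_compute='data/raw')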

Methods

as_download

Set the mode to download. In the submitted run, files in the dataset will be downloaded to a local path on the compute target.

as_hdfs

Set the mode to hdfs. In the submitted Synapse run, files in the dataset will be resolved to HDFS paths on the compute target.

as_mount

Set the mode to mount. In the submitted run, files in the dataset will be mounted to a local path on the compute target.

as_download

Set the mode to download.

In the submitted run, files in the dataset will be downloaded to a local path on the compute target. The download location can be retrieved from argument values and from the input_datasets field of the run context.


   from azureml.core import Dataset, ScriptRunConfig
   from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
   from azureml.pipeline.core import PipelineParameter

   file_dataset = Dataset.File.from_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv')
   file_pipeline_param = PipelineParameter(name="file_ds_param", default_value=file_dataset)
   dataset_input = DatasetConsumptionConfig("input_1", file_pipeline_param).as_download()
   # 'experiment' and 'source_directory' are assumed to be defined earlier.
   experiment.submit(ScriptRunConfig(source_directory, arguments=[dataset_input]))


   # The following sample code runs in the context of the submitted run:

   # The download location can be retrieved from argument values
   import sys
   download_location = sys.argv[1]

   # The download location can also be retrieved from input_datasets of the run context.
   from azureml.core import Run
   download_location = Run.get_context().input_datasets['input_1']
as_download(path_on_compute=None)

Parameters

Name Description
path_on_compute
str

The target path on the compute to make the data available at.

Default value: None

Remarks

When the dataset is created from the path of a single file, the download location will be the path of the single downloaded file. Otherwise, the download location will be the path of the enclosing folder for all the downloaded files.

If path_on_compute starts with a /, it will be treated as an absolute path. If it doesn't start with a /, it will be treated as a path relative to the working directory. If you have specified an absolute path, make sure that the job has permission to write to that directory.
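
As a brief sketch of the relative versus absolute behavior described above, reusing the file_dataset from the earlier example (the target paths are illustrative):

   # Relative path: resolved against the run's working directory on the compute.
   relative_input = DatasetConsumptionConfig("input_1", file_dataset).as_download(
       path_on_compute='data/titanic')

   # Absolute path (starts with '/'): the job must be able to write to this directory.
   absolute_input = DatasetConsumptionConfig("input_1", file_dataset).as_download(
       path_on_compute='/tmp/titanic')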

as_hdfs

Set the mode to hdfs.

In the submitted Synapse run, files in the dataset will be resolved to HDFS paths on the compute target. The HDFS path can be retrieved from argument values and from OS environment variables.


   file_dataset = Dataset.File.from_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv')
   file_pipeline_param = PipelineParameter(name="file_ds_param", default_value=file_dataset)
   dataset_input = DatasetConsumptionConfig("input_1", file_pipeline_param).as_hdfs()
   experiment.submit(ScriptRunConfig(source_directory, arguments=[dataset_input]))


   # The following sample code runs in the context of the submitted run:

   # The hdfs path can be retrieved from argument values
   import sys
   hdfs_path = sys.argv[1]

   # The hdfs path can also be retrieved from the 'input_1' environment variable.
   import os
   hdfs_path = os.environ['input_1']
as_hdfs()

Remarks

When the dataset is created from the path of a single file, the HDFS path will be the path of the single file. Otherwise, the HDFS path will be the path of the enclosing folder for all the files in the dataset.

as_mount

Set the mode to mount.

In the submitted run, files in the dataset will be mounted to a local path on the compute target. The mount point can be retrieved from argument values and from the input_datasets field of the run context.


   file_dataset = Dataset.File.from_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv')
   file_pipeline_param = PipelineParameter(name="file_ds_param", default_value=file_dataset)
   dataset_input = DatasetConsumptionConfig("input_1", file_pipeline_param).as_mount()
   experiment.submit(ScriptRunConfig(source_directory, arguments=[dataset_input]))


   # The following sample code runs in the context of the submitted run:

   # The mount point can be retrieved from argument values
   import sys
   mount_point = sys.argv[1]

   # The mount point can also be retrieved from input_datasets of the run context.
   from azureml.core import Run
   mount_point = Run.get_context().input_datasets['input_1']
as_mount(path_on_compute=None)

Parameters

Name Description
path_on_compute
str

The target path on the compute to make the data available at.

Default value: None

Remarks

When the dataset is created from the path of a single file, the mount point will be the path of the single mounted file. Otherwise, the mount point will be the path of the enclosing folder for all the mounted files.

If path_on_compute starts with a /, it will be treated as an absolute path. If it doesn't start with a /, it will be treated as a path relative to the working directory. If you have specified an absolute path, make sure that the job has permission to write to that directory.
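
As a minimal sketch of consuming the mounted input inside the run, assuming pandas is available in the run environment and the mounted dataset is the single Titanic.csv file from the example above:

   import pandas as pd
   from azureml.core import Run

   # The dataset was created from a single file, so the mount point is that file's path.
   mount_point = Run.get_context().input_datasets['input_1']
   df = pd.read_csv(mount_point)
   print(df.head())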

Attributes

name

Name of the input.

Returns

Type Description
str

Name of the input.