DatasetConsumptionConfig Class
Represent how to deliver the dataset to a compute target.
Inheritance: builtins.object → DatasetConsumptionConfig
Constructor
DatasetConsumptionConfig(name, dataset, mode='direct', path_on_compute=None)
Parameters

Name | Description |
---|---|
name (Required) | The name of the dataset in the run, which can be different from the registered name. The name will be registered as an environment variable and can be used in the data plane. |
dataset (Required) | Dataset or PipelineParameter or tuple(Workspace, str) or tuple(Workspace, str, str) or OutputDatasetConfig. The dataset to be delivered, as a Dataset object, a PipelineParameter that ingests a Dataset, a tuple of (workspace, dataset name), or a tuple of (workspace, dataset name, dataset version). If only a name is provided, the DatasetConsumptionConfig will use the latest version of the dataset. |
mode | Defines how the dataset should be delivered to the compute target. There are three modes: 'direct' (consume the dataset as a dataset), 'download' (download the dataset and consume it as a downloaded path), and 'mount' (mount the dataset and consume it as a mount path). Default value: direct |
path_on_compute | The target path on the compute at which to make the data available. The folder structure of the source data will be kept; however, prefixes may be added to this folder structure to avoid collision. We recommend calling tabular_dataset.to_path to see the output folder structure. Default value: None |
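For instance, the config can be built directly from these parameters rather than via the as_* helpers below; a minimal sketch, in which the workspace config file and the registered dataset name 'titanic' are assumptions for illustration:

from azureml.core import Dataset, Workspace
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig

ws = Workspace.from_config()  # assumes a config.json in the working directory
file_dataset = Dataset.get_by_name(ws, name='titanic')  # hypothetical registered dataset

# Deliver the dataset by download; the relative path is resolved
# against the run's working directory on the compute target.
dataset_input = DatasetConsumptionConfig(
    name="input_1",
    dataset=file_dataset,
    mode="download",
    path_on_compute="data/titanic",
)

Passing this config in the arguments of a ScriptRunConfig, as in the method examples below, makes the resolved path available to the script.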
Methods

Name | Description |
---|---|
as_download | Set the mode to download. In the submitted run, files in the dataset will be downloaded to a local path on the compute target. The download location can be retrieved from the argument values and the input_datasets field of the run context. |
as_hdfs | Set the mode to hdfs. In the submitted Synapse run, files in the dataset will be converted to a local path on the compute target. The HDFS path can be retrieved from the argument values and the OS environment variables. |
as_mount | Set the mode to mount. In the submitted run, files in the dataset will be mounted to a local path on the compute target. The mount point can be retrieved from the argument values and the input_datasets field of the run context. |
as_download

Set the mode to download.

In the submitted run, files in the dataset will be downloaded to a local path on the compute target. The download location can be retrieved from the argument values and the input_datasets field of the run context.
from azureml.core import Dataset, ScriptRunConfig
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
from azureml.pipeline.core import PipelineParameter

# Assumes `experiment` is an existing azureml.core.Experiment and
# `source_directory` points at the folder containing the training script.
file_dataset = Dataset.File.from_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv')
file_pipeline_param = PipelineParameter(name="file_ds_param", default_value=file_dataset)
dataset_input = DatasetConsumptionConfig("input_1", file_pipeline_param).as_download()
experiment.submit(ScriptRunConfig(source_directory, arguments=[dataset_input]))

# The following sample code runs in the context of the submitted run:

# The download location can be retrieved from the argument values
import sys
download_location = sys.argv[1]

# The download location can also be retrieved from the input_datasets field of the run context.
from azureml.core import Run
download_location = Run.get_context().input_datasets['input_1']
as_download(path_on_compute=None)

Parameters

Name | Description |
---|---|
path_on_compute | The target path on the compute at which to make the data available. Default value: None |
Remarks
When the dataset is created from the path of a single file, the download location will be the path of the single downloaded file. Otherwise, the download location will be the path of the enclosing folder for all the downloaded files.

If path_on_compute starts with a /, it will be treated as an absolute path. If it doesn't start with a /, it will be treated as a path relative to the working directory. If you have specified an absolute path, make sure that the job has permission to write to that directory.
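To illustrate the path rules above, a minimal sketch reusing the file_dataset from the sample; both target paths are placeholders:

# Relative path: resolved against the run's working directory on the compute.
input_rel = DatasetConsumptionConfig("input_rel", file_dataset).as_download(
    path_on_compute="data/titanic")

# Absolute path: used as-is; the job must have permission to write under /tmp.
input_abs = DatasetConsumptionConfig("input_abs", file_dataset).as_download(
    path_on_compute="/tmp/titanic")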
as_hdfs

Set the mode to hdfs.

In the submitted Synapse run, files in the dataset will be converted to a local path on the compute target. The HDFS path can be retrieved from the argument values and the OS environment variables.
from azureml.core import Dataset, ScriptRunConfig
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
from azureml.pipeline.core import PipelineParameter

# Assumes `experiment` is an existing azureml.core.Experiment and
# `source_directory` points at the folder containing the training script.
file_dataset = Dataset.File.from_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv')
file_pipeline_param = PipelineParameter(name="file_ds_param", default_value=file_dataset)
dataset_input = DatasetConsumptionConfig("input_1", file_pipeline_param).as_hdfs()
experiment.submit(ScriptRunConfig(source_directory, arguments=[dataset_input]))

# The following sample code runs in the context of the submitted run:

# The HDFS path can be retrieved from the argument values
import sys
hdfs_path = sys.argv[1]

# The HDFS path can also be retrieved from an environment variable named after the input.
import os
hdfs_path = os.environ['input_1']
as_hdfs()
Remarks
When the dataset is created from the path of a single file, the HDFS path will be the path of the single file. Otherwise, the HDFS path will be the path of the enclosing folder for all the mounted files.
as_mount

Set the mode to mount.

In the submitted run, files in the dataset will be mounted to a local path on the compute target. The mount point can be retrieved from the argument values and the input_datasets field of the run context.
from azureml.core import Dataset, ScriptRunConfig
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
from azureml.pipeline.core import PipelineParameter

# Assumes `experiment` is an existing azureml.core.Experiment and
# `source_directory` points at the folder containing the training script.
file_dataset = Dataset.File.from_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv')
file_pipeline_param = PipelineParameter(name="file_ds_param", default_value=file_dataset)
dataset_input = DatasetConsumptionConfig("input_1", file_pipeline_param).as_mount()
experiment.submit(ScriptRunConfig(source_directory, arguments=[dataset_input]))

# The following sample code runs in the context of the submitted run:

# The mount point can be retrieved from the argument values
import sys
mount_point = sys.argv[1]

# The mount point can also be retrieved from the input_datasets field of the run context.
from azureml.core import Run
mount_point = Run.get_context().input_datasets['input_1']
as_mount(path_on_compute=None)

Parameters

Name | Description |
---|---|
path_on_compute | The target path on the compute at which to make the data available. Default value: None |
Remarks
When the dataset is created from the path of a single file, the mount point will be the path of the single mounted file. Otherwise, the mount point will be the path of the enclosing folder for all the mounted files.

If path_on_compute starts with a /, it will be treated as an absolute path. If it doesn't start with a /, it will be treated as a path relative to the working directory. If you have specified an absolute path, make sure that the job has permission to write to that directory.
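Inside the run, the mount point is an ordinary directory, so the mounted files can be enumerated with standard file APIs; a minimal sketch assuming the 'input_1' name from the sample above:

# Runs in the context of the submitted run: walk the mounted dataset.
import os
from azureml.core import Run

mount_point = Run.get_context().input_datasets['input_1']
for root, _, files in os.walk(mount_point):
    for name in files:
        print(os.path.join(root, name))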
Attributes

name

Name of the input.

Returns

Type | Description |
---|---|
str | Name of the input. |
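The name is also the key under which the input appears in the run; a minimal sketch assuming the dataset_input from the method examples above:

print(dataset_input.name)  # "input_1"

# Inside the submitted run, the same name keys into input_datasets.
from azureml.core import Run
data_path = Run.get_context().input_datasets[dataset_input.name]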